[jira] [Commented] (SPARK-3528) Reading data from file:/// should be called NODE_LOCAL not PROCESS_LOCAL

2017-05-02 Thread Ramgopal N (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15992745#comment-15992745
 ] 

Ramgopal N commented on SPARK-3528:
---

I have Spark running on Mesos. 
Mesos agents are running on node1, node2, and node3, and datanodes on node4, 
node5, and node6.
I see 3 executors running, one on each of the 3 Mesos agents, so in my case 
PROCESS_LOCAL and NODE_LOCAL are the same, I believe.
Basically I am trying to check Spark SQL performance when there is no 
data locality.

When I execute Spark SQL, all the tasks show up as PROCESS_LOCAL.
What is the importance of "spark.locality.wait.process" for Spark on Mesos? Is 
this configuration also applicable to standalone Spark?
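
For reference, the locality wait settings are ordinary Spark configuration and 
apply under any cluster manager (Mesos, standalone, or YARN). A minimal sketch 
of tuning them via SparkConf (the values shown are illustrative, not 
recommendations):

{code}
import org.apache.spark.{SparkConf, SparkContext}

// spark.locality.wait is the default wait before the scheduler demotes a
// task's locality level; the per-level keys override it for each step of
// the PROCESS_LOCAL -> NODE_LOCAL -> RACK_LOCAL -> ANY fallback.
val conf = new SparkConf()
  .setAppName("locality-wait-demo")
  .set("spark.locality.wait", "3s")          // default for all levels
  .set("spark.locality.wait.process", "3s")  // before giving up on PROCESS_LOCAL
  .set("spark.locality.wait.node", "3s")     // before giving up on NODE_LOCAL
  .set("spark.locality.wait.rack", "3s")     // before giving up on RACK_LOCAL
val sc = new SparkContext(conf)
{code}

Setting a wait to "0" effectively disables that locality preference, which can 
be useful when, as above, there is no data locality to exploit.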

> Reading data from file:/// should be called NODE_LOCAL not PROCESS_LOCAL
> 
>
> Key: SPARK-3528
> URL: https://issues.apache.org/jira/browse/SPARK-3528
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.1.0
>Reporter: Andrew Ash
>Priority: Critical
>
> Note that reading from {{file:///.../pom.xml}} is called a PROCESS_LOCAL task
> {noformat}
> spark> sc.textFile("pom.xml").count
> ...
> 14/09/15 00:59:13 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, 
> localhost, PROCESS_LOCAL, 1191 bytes)
> 14/09/15 00:59:13 INFO TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, 
> localhost, PROCESS_LOCAL, 1191 bytes)
> 14/09/15 00:59:13 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
> 14/09/15 00:59:13 INFO Executor: Running task 1.0 in stage 0.0 (TID 1)
> 14/09/15 00:59:13 INFO HadoopRDD: Input split: 
> file:/Users/aash/git/spark/pom.xml:20862+20863
> 14/09/15 00:59:13 INFO HadoopRDD: Input split: 
> file:/Users/aash/git/spark/pom.xml:0+20862
> {noformat}
> There is an outstanding TODO in {{HadoopRDD.scala}} that may be related:
> {noformat}
>   override def getPreferredLocations(split: Partition): Seq[String] = {
>     // TODO: Filtering out "localhost" in case of file:// URLs
>     val hadoopSplit = split.asInstanceOf[HadoopPartition]
>     hadoopSplit.inputSplit.value.getLocations.filter(_ != "localhost")
>   }
> {noformat}
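
For what it's worth, a hypothetical sketch of resolving that TODO as the 
comment suggests, filtering "localhost" only for local file:// splits (this is 
not the committed fix, and the scheme check is an assumption):

{code}
import org.apache.hadoop.mapred.FileSplit

override def getPreferredLocations(split: Partition): Seq[String] = {
  val hadoopSplit = split.asInstanceOf[HadoopPartition]
  val inputSplit = hadoopSplit.inputSplit.value
  // Assumption: only file:// splits report the placeholder "localhost",
  // so restrict the filter to those and leave real hostnames untouched.
  val isLocalFile = inputSplit match {
    case fs: FileSplit => fs.getPath.toUri.getScheme == "file"
    case _ => false
  }
  val locations = inputSplit.getLocations
  if (isLocalFile) locations.filter(_ != "localhost") else locations.toSeq
}
{code}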






[jira] [Commented] (SPARK-3528) Reading data from file:/// should be called NODE_LOCAL not PROCESS_LOCAL

2016-12-26 Thread Umesh Chaudhary (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15779436#comment-15779436
 ] 

Umesh Chaudhary commented on SPARK-3528:


[~gip] It looks like when we provide a file:/// URI, each executor should be 
able to find the file on its respective host, so NODE_LOCAL makes sense. I 
found this user-list discussion [1] worth mentioning. IMHO, the executors read 
the data (from the mentioned URI) into their own JVMs, and only after that 
does the data become PROCESS_LOCAL for those executors. 

[1] 
http://apache-spark-user-list.1001560.n3.nabble.com/When-does-Spark-switch-from-PROCESS-LOCAL-to-NODE-LOCAL-or-RACK-LOCAL-td7091.html
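
For reference, the level names come from 
{{org.apache.spark.scheduler.TaskLocality}} (a @DeveloperApi enumeration, so 
subject to change between releases); a quick way to list them, from most to 
least local:

{code}
import org.apache.spark.scheduler.TaskLocality

// Prints PROCESS_LOCAL (data already in the executor's JVM, e.g. a cached
// block), NODE_LOCAL (same host), NO_PREF, RACK_LOCAL, ANY.
TaskLocality.values.foreach(println)
{code}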




[jira] [Commented] (SPARK-3528) Reading data from file:/// should be called NODE_LOCAL not PROCESS_LOCAL

2015-06-26 Thread Perinkulam I Ganesh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14603576#comment-14603576
 ] 

Perinkulam I Ganesh commented on SPARK-3528:


I have a question:

If the driver is on one node and the slave is on another node, then the file 
may be local to the driver node but it won't be local on the slave. So is it 
proper to tag the file as NODE_LOCAL?

thanks

- P. I. 
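
One way to frame the question: Spark's locality tags are attached per 
partition, keyed by host name, rather than to the file as a whole. A small 
sketch using the {{makeRDD}} overload that accepts explicit location 
preferences ("host1" and "host2" are placeholders):

{code}
// Each partition carries its own preferred-host list. A task is NODE_LOCAL
// only if an executor is actually running on one of its partition's
// preferred hosts; otherwise the scheduler falls back to ANY once the
// configured locality wait expires.
val rdd = sc.makeRDD(Seq(
  (1, Seq("host1")),
  (2, Seq("host2"))
))
{code}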




[jira] [Commented] (SPARK-3528) Reading data from file:/// should be called NODE_LOCAL not PROCESS_LOCAL

2014-09-16 Thread Andrew Ash (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14135781#comment-14135781
 ] 

Andrew Ash commented on SPARK-3528:
---

[~nchammas] it does look like S3 also has the issue.  Let's let someone more 
knowledgeable decide if the two are related, and if they're not we can create a 
new ticket to track S3 locality being incorrect.




[jira] [Commented] (SPARK-3528) Reading data from file:/// should be called NODE_LOCAL not PROCESS_LOCAL

2014-09-15 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14133927#comment-14133927
 ] 

Nicholas Chammas commented on SPARK-3528:
-

[~aash] - How about for data read from S3? I see that being marked as 
{{PROCESS_LOCAL}} as well. 

{code}
>>> sc.textFile('s3n://...').count()
14/09/15 10:12:20 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, 
localhost, PROCESS_LOCAL, 1242 bytes)
14/09/15 10:12:20 INFO TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, 
localhost, PROCESS_LOCAL, 1242 bytes)
{code}
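
One way to check where those locality claims come from is to print the 
preferred locations Spark records for each partition; for S3 (and plain 
file://) inputs these are typically empty or the placeholder "localhost", so 
the reported level does not reflect a real data-local host. A sketch (the 
path is a placeholder):

{code}
// Inspect the scheduler's view of each partition's preferred hosts.
val rdd = sc.textFile("s3n://bucket/path")
rdd.partitions.foreach { p =>
  println(s"partition ${p.index}: ${rdd.preferredLocations(p)}")
}
{code}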

