[jira] [Commented] (SPARK-3528) Reading data from file:/// should be called NODE_LOCAL not PROCESS_LOCAL
[ https://issues.apache.org/jira/browse/SPARK-3528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15992745#comment-15992745 ] Ramgopal N commented on SPARK-3528:
---

I have Spark running on Mesos. The Mesos agents run on node1, node2, and node3, and the datanodes on node4, node5, and node6. I see three executors running, one on each of the three Mesos agents, so in my case PROCESS_LOCAL and NODE_LOCAL are the same, I believe. Basically, I am trying to check Spark SQL performance when there is no data locality. When I execute Spark SQL, all the tasks show as PROCESS_LOCAL. What is the importance of "spark.locality.wait.process" for Spark on Mesos? Is this configuration also applicable to standalone Spark?

> Reading data from file:/// should be called NODE_LOCAL not PROCESS_LOCAL
>
> Key: SPARK-3528
> URL: https://issues.apache.org/jira/browse/SPARK-3528
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 1.1.0
> Reporter: Andrew Ash
> Priority: Critical
>
> Note that reading from {{file:///.../pom.xml}} is called a PROCESS_LOCAL task
> {noformat}
> spark> sc.textFile("pom.xml").count
> ...
> 14/09/15 00:59:13 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, localhost, PROCESS_LOCAL, 1191 bytes)
> 14/09/15 00:59:13 INFO TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, localhost, PROCESS_LOCAL, 1191 bytes)
> 14/09/15 00:59:13 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
> 14/09/15 00:59:13 INFO Executor: Running task 1.0 in stage 0.0 (TID 1)
> 14/09/15 00:59:13 INFO HadoopRDD: Input split: file:/Users/aash/git/spark/pom.xml:20862+20863
> 14/09/15 00:59:13 INFO HadoopRDD: Input split: file:/Users/aash/git/spark/pom.xml:0+20862
> {noformat}
> There is an outstanding TODO in {{HadoopRDD.scala}} that may be related:
> {noformat}
> override def getPreferredLocations(split: Partition): Seq[String] = {
>   // TODO: Filtering out "localhost" in case of file:// URLs
>   val hadoopSplit = split.asInstanceOf[HadoopPartition]
>   hadoopSplit.inputSplit.value.getLocations.filter(_ != "localhost")
> }
> {noformat}

--
This message was sent by Atlassian JIRA (v6.3.15#6346)

To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
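On the {{spark.locality.wait.process}} question above: it is a generic scheduler setting, not Mesos-specific, so it also applies to standalone mode. Spark's delay scheduling waits a bounded time at each locality level before relaxing to the next one. The sketch below is a toy model of that fallback, not Spark's actual code; the level names and the 3-second default match the documented {{spark.locality.wait}} family of settings.

```python
# Locality levels in order of preference, as in Spark's TaskLocality.
LEVELS = ["PROCESS_LOCAL", "NODE_LOCAL", "RACK_LOCAL", "ANY"]

def allowed_level(wait_start, now, waits):
    """Return the most local level the scheduler may still insist on.

    `waits` maps a level to how long (seconds) the scheduler waits at
    that level before relaxing to the next one, mimicking the
    spark.locality.wait.{process,node,rack} settings.
    """
    elapsed = now - wait_start
    for level in LEVELS[:-1]:
        wait = waits.get(level, 3.0)  # 3s default, like spark.locality.wait
        if elapsed < wait:
            return level
        elapsed -= wait
    return "ANY"

waits = {"PROCESS_LOCAL": 3.0, "NODE_LOCAL": 3.0, "RACK_LOCAL": 3.0}
print(allowed_level(0.0, 1.0, waits))   # still insisting on PROCESS_LOCAL
print(allowed_level(0.0, 4.0, waits))   # relaxed to NODE_LOCAL
print(allowed_level(0.0, 10.0, waits))  # relaxed past RACK_LOCAL to ANY
```

In this model, setting a level's wait to 0 makes the scheduler skip past it immediately, which is why lowering {{spark.locality.wait.process}} matters when every task is (mis)labeled PROCESS_LOCAL.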
[ https://issues.apache.org/jira/browse/SPARK-3528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15779436#comment-15779436 ] Umesh Chaudhary commented on SPARK-3528:
---

[~gip] It looks like when we provide a file:/// URI, the executors should be able to find the file on their respective hosts, so NODE_LOCAL makes sense. I found this user list discussion [1] worth mentioning. IMHO, the executors read the data into their own JVMs (from the mentioned URI), and after that the data becomes PROCESS_LOCAL for those executors.

[1] http://apache-spark-user-list.1001560.n3.nabble.com/When-does-Spark-switch-from-PROCESS-LOCAL-to-NODE-LOCAL-or-RACK-LOCAL-td7091.html
[ https://issues.apache.org/jira/browse/SPARK-3528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14603576#comment-14603576 ] Perinkulam I Ganesh commented on SPARK-3528:
---

I have a question: if the driver is on one node and the slave is on another, the file may be local to the driver node but it won't be local on the slave. So is it proper to tag the file as NODE_LOCAL?

thanks - P. I.
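The TODO quoted in the issue description suggests why these questions keep coming back to PROCESS_LOCAL: a file:// split reports only "localhost" as its location, the filter removes it, and a task with no remaining preferred locations ends up labeled at the most local level. The following is a simplified sketch of that interaction as I read the ticket, not Spark's actual scheduler code:

```python
def preferred_locations(split_locations):
    """Mimic HadoopRDD.getPreferredLocations: drop the 'localhost'
    placeholder that local-filesystem (file://) splits report."""
    return [h for h in split_locations if h != "localhost"]

def locality_label(preferred, executor_host):
    """Toy version of the scheduler's labeling: a task with no
    preference left is (mis)reported at the most local level."""
    if not preferred:
        return "PROCESS_LOCAL"  # what the UI shows, per this ticket
    return "NODE_LOCAL" if executor_host in preferred else "ANY"

# A file:// split only knows 'localhost', which gets filtered away...
print(locality_label(preferred_locations(["localhost"]), "node1"))
# ...while an HDFS split keeps its real datanode hosts.
print(locality_label(preferred_locations(["node4", "node5"]), "node4"))
```

In this reading, the answer to the question above is that with file:// there is effectively no host preference at all after the filter, so the scheduler never compares the executor's host against anything; NODE_LOCAL versus PROCESS_LOCAL is a labeling problem, not a placement decision.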
[ https://issues.apache.org/jira/browse/SPARK-3528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14135781#comment-14135781 ] Andrew Ash commented on SPARK-3528:
---

[~nchammas] It does look like S3 also has the issue. Let's let someone more knowledgeable decide whether the two are related; if they're not, we can create a new ticket to track S3 locality being incorrect.
[ https://issues.apache.org/jira/browse/SPARK-3528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14133927#comment-14133927 ] Nicholas Chammas commented on SPARK-3528:
---

[~aash] - How about data read from S3? I see that being marked as {{PROCESS_LOCAL}} as well.

{code}
sc.textFile('s3n://...').count()

14/09/15 10:12:20 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, localhost, PROCESS_LOCAL, 1242 bytes)
14/09/15 10:12:20 INFO TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, localhost, PROCESS_LOCAL, 1242 bytes)
{code}
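For anyone hitting this with local files or S3, where the locality label carries no real information, one workaround (a suggestion about job configuration, not something proposed in this ticket) is to disable delay scheduling so tasks launch without waiting on a meaningless preference; {{your_job.py}} below is a placeholder name:

```shell
# Launch with delay scheduling disabled: the scheduler will not hold
# tasks back waiting for PROCESS_LOCAL/NODE_LOCAL slots to free up.
spark-submit --conf spark.locality.wait=0 your_job.py
```

The per-level settings ({{spark.locality.wait.process}}, {{.node}}, {{.rack}}) default to {{spark.locality.wait}}, so zeroing the base value covers all of them.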