Hello,

I am seeing unexpected behaviour with HDFS data locality. I expect tasks to run only on a node that holds the data (unless that node is busy, in which case I understand they can go to another node), but this is not happening. Could anyone clarify what is wrong with my approach, or what I should do instead? Below are the cluster configuration and the experiments I have tried. Any help will be appreciated. If you would like to recreate the scenario, you can use the JavaWordCount.java example that ships with Spark.
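To make the expectation concrete, here is a minimal sketch in plain Java of the behaviour I am assuming. It is only my mental model, not Spark's actual scheduler code: the class name LocalityExpectation, the method chooseHost, and its parameters are all made up for illustration. The assumption is that a task goes to a free host holding a replica, otherwise waits up to the locality wait, and only then falls back to any free host.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Optional;
import java.util.Set;

// Hypothetical model of the expected behaviour; NOT Spark's scheduler code.
public class LocalityExpectation {

    /**
     * Returns the host the task is expected to run on, or empty if the
     * task should keep waiting for a host that holds the data.
     */
    public static Optional<String> chooseHost(Set<String> replicaHosts,
                                              List<String> freeHosts,
                                              long waitedMs,
                                              long localityWaitMs) {
        // 1. Prefer any free host that stores a replica of the block.
        for (String host : freeHosts) {
            if (replicaHosts.contains(host)) {
                return Optional.of(host);
            }
        }
        // 2. No replica host is free: keep waiting until the locality wait expires.
        if (waitedMs < localityWaitMs) {
            return Optional.empty();
        }
        // 3. The wait has expired: fall back to the first free host, if there is one.
        return freeHosts.isEmpty() ? Optional.empty() : Optional.of(freeHosts.get(0));
    }

    public static void main(String[] args) {
        // Replica hosts and the 10s wait mirror the setup described in this post.
        Set<String> replicas = new LinkedHashSet<>(Arrays.asList("node2", "node3", "node4"));
        long localityWaitMs = 10_000;

        // A replica host is free, so it should be chosen.
        System.out.println(chooseHost(replicas, Arrays.asList("node1", "node3", "node6"), 0, localityWaitMs));
        // prints Optional[node3]

        // Only non-replica hosts are free and the wait has not expired: keep waiting.
        System.out.println(chooseHost(replicas, Arrays.asList("node1", "node6"), 2_000, localityWaitMs));
        // prints Optional.empty

        // The wait has expired: fall back to a non-replica host.
        System.out.println(chooseHost(replicas, Arrays.asList("node1", "node6"), 10_000, localityWaitMs));
        // prints Optional[node1]
    }
}
```

What I observe instead corresponds to the third case happening immediately, even though the replica hosts are idle.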
*Cluster configuration:*
1. spark-1.4.0 and hadoop-2.7.1
2. Machines: one master node (master) and 6 worker nodes (node1 to node6)
3. master acts as Spark master, HDFS NameNode and secondary NameNode, and YARN ResourceManager
4. Each of the 6 worker nodes acts as Spark worker, HDFS DataNode, and YARN NodeManager

*Data on HDFS:*
A 20 MB text file is stored in a single block. With a replication factor of 3, the block is stored on nodes 2, 3 and 4.

*Test-1 (Spark standalone mode):*
The application is the standard Java word count example, with the above text file in HDFS as input. On job submission, I see in the Spark web UI that stage 0 (i.e. mapToPair) runs on seemingly random nodes (node1, node2, node6, etc.). By random I mean that stage 0 executes on the very first worker node that registers with the application (this can be seen in the event timeline graph). Instead, I expect stage 0 to run only on one of the three nodes 2, 3 or 4.

*Test-2 (YARN cluster mode):*
Same as above. No data locality seen.

*Additional info:*
No other Spark applications are running, and I have even tried setting /spark.locality.wait/ to 10s, but it made no difference.

Thanks and regards,
Sunil