Hello,

I am seeing some unexpected issues with achieving HDFS data locality. I
expect tasks to be executed only on the nodes that hold the data, but this
is not happening (of course, unless the node is busy, in which case I
understand tasks can go to some other node). Could anyone clarify what is
wrong with the way I am trying this, or what I should do instead? Below are
the cluster configuration and the experiments that I have tried. Any help
would be appreciated. If you would like to recreate the scenario below, you
can use the JavaWordCount.java example shipped with Spark.
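For reference, this is roughly how I submit the job (the jar path and the
HDFS URI below are placeholders for my setup):

    spark-submit \
      --class org.apache.spark.examples.JavaWordCount \
      --master spark://master:7077 \
      /path/to/spark-examples.jar \
      hdfs://master:9000/path/to/input.txt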

*Cluster configuration:*

1. Spark 1.4.0 and Hadoop 2.7.1
2. Machines --> master node (master) and 6 worker nodes (node1 to node6)
3. master acts as --> Spark master, HDFS NameNode & secondary NameNode,
YARN ResourceManager
4. Each of the 6 worker nodes acts as --> Spark worker, HDFS DataNode, YARN
NodeManager

*Data on HDFS:*

A 20 MB text file is stored in a single HDFS block. With a replication
factor of 3, the block is replicated on nodes 2, 3 & 4.
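(If you want to double-check the placement on your own cluster, the block
locations can be listed with fsck; the path below is a placeholder:)

    hdfs fsck /path/to/input.txt -files -blocks -locations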

*Test-1 (Spark standalone mode):*

The application being run is the standard Java word count example, with the
above text file in HDFS as input. On job submission, I see in the Spark web
UI that stage 0 (i.e. mapToPair) is being run on random nodes (node1,
node2, node6, etc.). By random I mean that stage 0 executes on the very
first worker node that gets registered with the application (this can be
seen from the event timeline graph). Instead, I expect stage 0 to run only
on one of the three nodes 2, 3, or 4.
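To rule out Spark not knowing the block hosts at all, a small driver like
the sketch below (hostname and path are placeholders) should print the
preferred locations Spark records for each input split; as I understand it,
these should come back as node2/node3/node4 for my file:

    import org.apache.spark.Partition;
    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    public class LocalityCheck {
      public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("LocalityCheck");
        JavaSparkContext sc = new JavaSparkContext(conf);
        // placeholder URI -- same input file as the word count run
        JavaRDD<String> lines =
            sc.textFile("hdfs://master:9000/path/to/input.txt");
        for (Partition p : lines.rdd().partitions()) {
          // preferredLocations() exposes the hosts that the underlying
          // HadoopRDD obtained from the HDFS block locations for this split
          System.out.println("partition " + p.index() + " -> "
              + lines.rdd().preferredLocations(p));
        }
        sc.stop();
      }
    }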

*Test-2 (YARN cluster mode):*
Same as above; no data locality is seen.

*Additional info:*
No other Spark applications are running, and I have even tried setting
/spark.locality.wait/ to 10s, but it makes no difference.
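Concretely, the setting was passed on submission like this (the other
arguments are elided):

    spark-submit --conf spark.locality.wait=10s ...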

Thanks and regards,
Sunil


