Hi all, I am new to Spark and working on a 3-node test cluster. I am exploring Spark's scope in analytics; my Spark code mostly interacts with HDFS.
I am confused about how Spark chooses which node to distribute its work to, since we assume it can be an alternative to Hadoop MapReduce. In MapReduce, we know that the framework internally distributes the code (or logic) to the nearest TaskTracker, which is co-located with a DataNode, in the same rack, or otherwise closest to the data blocks. My confusion is this: when I give an HDFS path inside a Spark program, how does Spark choose which node is nearest (if it does at all)? And if it does not, how will it cope when I have TBs of data and high network latency is involved? My apologies for asking a basic question; please advise.

TIA

--
Anish Sneh
"Experience is the best teacher."
http://www.anishsneh.com
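For context, my rough mental model (which may well be wrong, so please correct me) is that the scheduler asks HDFS where each block's replicas live and prefers launching a task on an executor co-located with the data, falling back to a remote node only when necessary. A toy sketch of that idea in plain Python follows; the block-to-host mapping and node names are made up, and this is a simplification, not Spark's actual scheduler:

```python
# Toy sketch of locality-aware task placement (a simplified model, NOT
# Spark's real scheduler). The idea: ask the HDFS NameNode for each
# block's replica hosts, then prefer an executor that already holds the
# data (node-local), falling back to any executor otherwise.

# Hypothetical block -> replica-host mapping, as HDFS might report it
BLOCK_LOCATIONS = {
    "block-0": ["node1", "node2"],
    "block-1": ["node2", "node3"],
    "block-2": ["node3", "node1"],
}

def place_task(block, executors):
    """Pick an executor for a block: node-local if possible, else any host."""
    preferred = BLOCK_LOCATIONS.get(block, [])
    for host in executors:
        if host in preferred:
            return host      # node-local: block need not cross the network
    return executors[0]      # fallback: data will travel over the network

if __name__ == "__main__":
    executors = ["node1", "node2", "node3"]
    for block in sorted(BLOCK_LOCATIONS):
        print(block, "->", place_task(block, executors))
```

Is this roughly what Spark's scheduler does when I pass it an HDFS path, or does it work differently?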