Hi All

I am new to Spark, working on a 3-node test cluster. I am trying to explore Spark's
scope in analytics; my Spark code mostly interacts with HDFS.

I am confused about how Spark chooses which node to distribute its work to.

We assume it can be an alternative to Hadoop MapReduce, and in MapReduce we know
that the framework internally distributes the code (or logic) to the nearest
TaskTracker: one co-located with a DataNode holding the data blocks, or in the
same rack, or otherwise nearest to those blocks.
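As far as I understand, that locality information comes from HDFS itself: the
NameNode knows which DataNodes hold each block, and the scheduler consults that
metadata. Here is a minimal sketch of inspecting it via the Hadoop FileSystem
API; the path /data/input.txt is just a placeholder:

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}

    object BlockLocations {
      def main(args: Array[String]): Unit = {
        // Picks up core-site.xml / hdfs-site.xml from the classpath
        val fs = FileSystem.get(new Configuration())

        // Hypothetical path; substitute a real file on the cluster
        val status = fs.getFileStatus(new Path("/data/input.txt"))
        val blocks = fs.getFileBlockLocations(status, 0, status.getLen)

        // Each block reports the DataNode hosts that hold a replica;
        // this is the metadata a locality-aware scheduler would consult
        blocks.foreach { b =>
          println(s"offset=${b.getOffset} length=${b.getLength} hosts=${b.getHosts.mkString(",")}")
        }
      }
    }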

My confusion is: when I give an HDFS path inside a Spark program, how does Spark
choose which node is nearest (if it does so at all)?
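To make the question concrete, this is roughly the kind of program I mean; a
minimal word-count sketch where the HDFS URI, host name, and app name are all
placeholders:

    import org.apache.spark.{SparkConf, SparkContext}

    object LocalityQuestion {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("LocalityQuestion")
        val sc = new SparkContext(conf)

        // Hypothetical HDFS path; Spark creates roughly one partition per
        // HDFS block -- my question is where the tasks for those
        // partitions end up being scheduled
        val lines = sc.textFile("hdfs://namenode:8020/data/input.txt")

        val counts = lines
          .flatMap(_.split("\\s+"))
          .map(word => (word, 1))
          .reduceByKey(_ + _)

        counts.saveAsTextFile("hdfs://namenode:8020/data/wordcounts")
        sc.stop()
      }
    }

Does Spark, like MapReduce, use the HDFS block locations here to prefer the
nodes that already hold the data?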

If it does not, how will it work when I have TBs of data, where high network
latency would be involved?

My apologies for asking a basic question; please advise.

TIA
-- 
Anish Sneh
"Experience is the best teacher."
http://www.anishsneh.com
