Thanks, Rishitesh!

1. I get that the driver doesn't need to be on the master, but there is a
lot of communication between the driver and the cluster, which is why a
co-located gateway machine was recommended. How big is the impact of the
driver not being co-located with the cluster?
4. How does an HDFS split get assigned to a worker node when reading data
from a remote Hadoop cluster? I am more interested in how the MapR NFS
layer is accessed in parallel.

To check that I'm reading your answers right, I have appended a few small
sketches at the bottom of this mail.

- Swapnil

On Thu, Aug 27, 2015 at 2:53 PM, Rishitesh Mishra <rishi80.mis...@gmail.com> wrote:

> Hi Swapnil,
> Let me try to answer some of the questions. Answers inline. Hope it helps.
>
> On Thursday, August 27, 2015, Swapnil Shinde <swapnilushi...@gmail.com> wrote:
>
>> Hello
>> I am new to the Spark world and recently started exploring it in
>> standalone mode. It would be great to get clarification on the doubts
>> below:
>>
>> 1. Driver locality - The documentation mentions that "client" deploy
>> mode is not good if the machine running "spark-submit" is not
>> co-located with the worker machines, and cluster mode is not available
>> with standalone clusters. Therefore, do we have to submit all
>> applications on the master machine? (Assuming we don't have a separate
>> co-located gateway machine.)
>
> No. In standalone mode your master and driver machines can also be
> different. The driver should have access to the master as well as the
> worker machines.
>
>> 2. How does the driver locality above work with a Spark shell running
>> on a local machine?
>
> The Spark shell itself acts as the driver. This means your local machine
> should have access to all the cluster machines.
>
>> 3. I am a little confused about the role of the driver program. Does
>> the driver do any computation in the Spark application life cycle? For
>> instance, in a simple row-count app, the worker nodes calculate local
>> row counts. Does the driver sum up the local row counts? In short,
>> where does the reduce phase run in this case?
>
> The role of the driver is to coordinate with the cluster manager for the
> initial resource allocation. After that, it schedules tasks on the
> different executors assigned to it. It does not do any computation
> itself (unless the application does something on its own). The reduce
> phase is also a bunch of tasks, which get assigned to one or more
> executors.
>
>> 4. In the case of accessing HDFS data over the network, do worker nodes
>> read data in parallel? How does HDFS data over the network get accessed
>> in a Spark application?
>
> Yes. All workers get a split to read, and they read their own splits in
> parallel. This means all worker nodes should have access to the Hadoop
> file system.
>
>> Sorry if these questions were already discussed.
>>
>> Thanks
>> Swapnil
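PS - here are the sketches I mentioned above, written down as plain Scala
against the RDD API. The host names, ports and paths in them are made up
for illustration only, not from any real setup, so please correct me if I
got the picture wrong.

For (1) and (2), my understanding of a driver running off-cluster,
whether as a standalone app or as spark-shell:

import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical master host/port. The driver can run on any machine that
// can reach both the master and the workers; it need not be the master
// machine itself.
val conf = new SparkConf()
  .setAppName("driver-locality-sketch")
  .setMaster("spark://master-host:7077")

val sc = new SparkContext(conf)

// Task results, and anything collect()/take() pulls back, travel over
// the network to this driver process - which I assume is why the
// co-located gateway was recommended.
println(sc.parallelize(1 to 100).sum())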
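For (3), this is how I picture a row count after your answer: both the
per-partition counting and the reduce run as tasks on the executors, and
the driver only merges one small partial result per partition (reusing
the sc above):

// Hypothetical HDFS path. This is count() spelled out by hand:
val rows = sc.textFile("hdfs://namenode:8020/data/rows.txt")

// Each executor counts the rows of its own partitions...
val perPartition = rows.mapPartitions(it => Iterator(it.size.toLong))

// ...and reduce(_ + _) also runs inside the executors, partition by
// partition; the driver only adds up the single Long that comes back
// from each partition, never the rows themselves.
val total = perPartition.reduce(_ + _)
println(s"total rows = $total")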
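And for (4), my understanding is that each HDFS input split becomes one
RDD partition, and each partition is read independently, in parallel, by
the task that owns it - whether the MapR NFS layer behaves the same way
is exactly what I'm unsure about:

// Hypothetical HDFS URL. textFile gives one partition per input split:
val logs = sc.textFile("hdfs://namenode:8020/logs/big-file.txt")
println(s"partitions (one per split): ${logs.partitions.length}")

// Tag the first line of each partition with the partition that read it,
// just to make the independent, parallel reads visible:
logs
  .mapPartitionsWithIndex((i, it) => it.take(1).map(line => (i, line)))
  .collect()
  .foreach { case (i, line) => println(s"partition $i read: $line") }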