Thanks.

On Aug 28, 2015 4:55 AM, "Rishitesh Mishra" <rishi80.mis...@gmail.com> wrote:
> Hi Swapnil,
>
> 1. All the task scheduling and retries happen on the driver, so you are
> right that a lot of communication flows from the driver to the cluster.
> It all depends on how you want to structure your Spark application:
> whether it has direct access to the Spark cluster or is routed through a
> gateway machine. You can decide accordingly.
>
> 2. I am not familiar with the NFS layer's concurrency, but parallel
> reads should be OK, I think. Someone familiar with how NFS works should
> correct me if I am wrong.
>
> On Fri, Aug 28, 2015 at 1:12 AM, Swapnil Shinde <swapnilushi...@gmail.com> wrote:
>
>> Thanks Rishitesh!
>>
>> 1. I get that the driver doesn't need to be on the master, but there is
>> a lot of communication between the driver and the cluster. That's why a
>> co-located gateway was recommended. How much is the impact if the
>> driver is not co-located with the cluster?
>>
>> 4. How does an HDFS split get assigned to a worker node when reading
>> data from a remote Hadoop cluster? I am especially interested in how
>> the MapR NFS layer is accessed in parallel.
>>
>> - Swapnil
>>
>> On Thu, Aug 27, 2015 at 2:53 PM, Rishitesh Mishra <rishi80.mis...@gmail.com> wrote:
>>
>>> Hi Swapnil,
>>> Let me try to answer some of the questions. Answers inline. Hope it
>>> helps.
>>>
>>> On Thursday, August 27, 2015, Swapnil Shinde <swapnilushi...@gmail.com> wrote:
>>>
>>>> Hello
>>>> I am new to the Spark world and recently started exploring it in
>>>> standalone mode. It would be great to get clarification on the
>>>> doubts below.
>>>>
>>>> 1. Driver locality - The documentation mentions that "client"
>>>> deploy-mode is not good if the machine running "spark-submit" is not
>>>> co-located with the worker machines, and cluster mode is not
>>>> available with standalone clusters. Do we therefore have to submit
>>>> all applications on the master machine? (Assuming we don't have a
>>>> separate co-located gateway machine.)
>>>
>>> No. In standalone mode your master and driver machines can also be
>>> different. The driver should have access to the master as well as the
>>> worker machines.
>>>
>>>> 2. How does the driver locality above work with a Spark shell
>>>> running on a local machine?
>>>
>>> The Spark shell itself acts as the driver. This means your local
>>> machine should have access to all the cluster machines.
>>>
>>>> 3. I am a little confused about the role of the driver program. Does
>>>> the driver do any computation during the Spark application's life
>>>> cycle? For instance, in a simple row-count app, the worker nodes
>>>> calculate local row counts. Does the driver sum up the local row
>>>> counts? In short, where does the reduce phase run in this case?
>>>
>>> The role of the driver is to coordinate with the cluster manager for
>>> the initial resource allocation. After that, it schedules tasks on
>>> the executors assigned to it. It does not do any computation (unless
>>> the application itself does something on its own). The reduce phase
>>> is also a bunch of tasks, which get assigned to one or more
>>> executors.
>>>
>>>> 4. In the case of accessing HDFS data over the network, do the
>>>> worker nodes read data in parallel? How does HDFS data over the
>>>> network get accessed in a Spark application?
>>>
>>> Yes. Each worker will get a split to read, and they read their own
>>> splits in parallel. This means all worker nodes should have access to
>>> the Hadoop file system.
>>>
>>>> Sorry if these questions were already discussed.
>>>>
>>>> Thanks
>>>> Swapnil
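The "how much is the impact" question in point 1 above can be sized with a
back-of-envelope model: scheduling traffic is many small driver<->executor
messages, so the per-message round-trip time gets multiplied by the task
count. All numbers here (task count, messages per task, latencies) are
invented for illustration, not measured Spark figures:

```python
# Toy model of driver-placement impact: many small scheduling messages
# mean per-message latency multiplies. Every number below is assumed.
def scheduling_overhead(num_tasks, msgs_per_task, rtt_seconds):
    """Rough extra wall-clock time spent on driver<->executor round trips."""
    return num_tasks * msgs_per_task * rtt_seconds

tasks = 10_000     # a job with many partitions (illustrative)
msgs = 2           # e.g. launch + status per task (illustrative)
lan_rtt = 0.0005   # ~0.5 ms round trip within the cluster (assumed)
wan_rtt = 0.030    # ~30 ms round trip from a remote machine (assumed)

print(scheduling_overhead(tasks, msgs, lan_rtt))  # ~10 s of overhead
print(scheduling_overhead(tasks, msgs, wan_rtt))  # ~600 s of overhead
```

Under these made-up numbers, moving the driver off the cluster network
turns seconds of scheduling overhead into minutes, which is why a
co-located gateway is recommended for chatty jobs.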
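The answer to point 3 (driver schedules tasks but doesn't compute; the
reduce phase is itself just tasks) can be caricatured in plain Python.
This is not Spark internals - the class, the splits, and the two-stage
shape are all invented for illustration:

```python
# Toy sketch of driver-as-scheduler. NOT Spark code; all names invented.
from concurrent.futures import ThreadPoolExecutor

class ToyDriver:
    """Hands tasks to 'executors' and collects results; does no data
    computation itself - the work runs inside the submitted tasks."""
    def __init__(self, executor_slots):
        self.pool = ThreadPoolExecutor(max_workers=executor_slots)

    def run_stage(self, tasks):
        # Submit every task; the actual computation runs in the pool.
        futures = [self.pool.submit(t) for t in tasks]
        return [f.result() for f in futures]

driver = ToyDriver(executor_slots=2)

# "Map" stage: one task per split computes a local row count.
splits = [["a", "b"], ["c"], ["d", "e", "f"]]
local_counts = driver.run_stage([lambda s=s: len(s) for s in splits])

# "Reduce" stage: also just a task, scheduled like any other.
(total,) = driver.run_stage([lambda: sum(local_counts)])
print(local_counts, total)
```

The point of the sketch is that summing the local counts is itself a
(very small) scheduled task, not privileged work done by the driver loop.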
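The parallel-read answer in point 4 can be sketched the same way: each
worker is assigned a split (an offset/length range of the file) and reads
only its own range. Below, a local temp file stands in for the remote
HDFS/NFS file, and the split size is made up:

```python
# Sketch of split-based parallel reads. A local temp file stands in for
# a remote filesystem; split size and structure are illustrative only.
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

def read_split(path, offset, length):
    """One 'worker task': open the file and read only this split's range."""
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read(length)

# Create a small stand-in file (100 bytes).
data = b"0123456789" * 10
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(data)
    path = tmp.name

# Carve the file into fixed-size splits, one per worker task.
split_size = 30
splits = [(off, split_size) for off in range(0, len(data), split_size)]

# Each task reads its own split concurrently, like executors would.
with ThreadPoolExecutor(max_workers=4) as pool:
    parts = list(pool.map(lambda s: read_split(path, *s), splits))

os.unlink(path)
assert b"".join(parts) == data  # the splits reassemble the original data
```

Each task opens the file independently and touches only its own byte
range, which is why every worker node needs its own access to the
underlying filesystem.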