Thanks, Rishitesh!

1. I get that the driver doesn't need to be on the master, but there is a
lot of communication between the driver and the cluster, which is why a
co-located gateway machine was recommended. How big is the impact of the
driver not being co-located with the cluster?
4. How does an HDFS split get assigned to a worker node when reading data
from a remote Hadoop cluster? I am more interested in how the MapR NFS
layer is accessed in parallel.

To check that I'm reading your answers right, I have appended a few small
sketches at the bottom of this mail.

- Swapnil

On Thu, Aug 27, 2015 at 2:53 PM, Rishitesh Mishra <rishi80.mis...@gmail.com> wrote:

> Hi Swapnil,
> Let me try to answer some of the questions. Answers inline. Hope it helps.
>
> On Thursday, August 27, 2015, Swapnil Shinde <swapnilushi...@gmail.com> wrote:
>
>> Hello
>> I am new to the Spark world and recently started exploring it in
>> standalone mode. It would be great to get clarification on the doubts
>> below:
>>
>> 1. Driver locality - The documentation mentions that "client" deploy
>> mode is not good if the machine running "spark-submit" is not
>> co-located with the worker machines, and cluster mode is not available
>> with standalone clusters. Therefore, do we have to submit all
>> applications on the master machine? (Assuming we don't have a separate
>> co-located gateway machine.)
>
> No. In standalone mode your master and driver machines can also be
> different. The driver should have access to the master as well as the
> worker machines.
>
>> 2. How does the driver locality above work with a Spark shell running
>> on a local machine?
>
> The Spark shell itself acts as the driver. This means your local machine
> should have access to all the cluster machines.
>
>> 3. I am a little confused about the role of the driver program. Does
>> the driver do any computation in the Spark application life cycle? For
>> instance, in a simple row-count app, the worker nodes calculate local
>> row counts. Does the driver sum up the local row counts? In short,
>> where does the reduce phase run in this case?
>
> The role of the driver is to coordinate with the cluster manager for the
> initial resource allocation. After that, it schedules tasks on the
> different executors assigned to it. It does not do any computation
> itself (unless the application does something on its own). The reduce
> phase is also a bunch of tasks, which get assigned to one or more
> executors.
>
>> 4. In the case of accessing HDFS data over the network, do worker nodes
>> read data in parallel? How does HDFS data over the network get accessed
>> in a Spark application?
>
> Yes. All workers get a split to read, and they read their own splits in
> parallel. This means all worker nodes should have access to the Hadoop
> file system.
>
>> Sorry if these questions were already discussed.
>>
>> Thanks
>> Swapnil
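PS - here are the sketches I mentioned above, written down as plain Scala
against the RDD API. The host names, ports and paths in them are made up
for illustration only, not from any real setup, so please correct me if I
got the picture wrong.

For (1) and (2), my understanding of a driver running off-cluster,
whether as a standalone app or as spark-shell:

import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical master host/port. The driver can run on any machine that
// can reach both the master and the workers; it need not be the master
// machine itself.
val conf = new SparkConf()
  .setAppName("driver-locality-sketch")
  .setMaster("spark://master-host:7077")

val sc = new SparkContext(conf)

// Task results, and anything collect()/take() pulls back, travel over
// the network to this driver process - which I assume is why the
// co-located gateway was recommended.
println(sc.parallelize(1 to 100).sum())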
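For (3), this is how I picture a row count after your answer: both the
per-partition counting and the reduce run as tasks on the executors, and
the driver only merges one small partial result per partition (reusing
the sc above):

// Hypothetical HDFS path. This is count() spelled out by hand:
val rows = sc.textFile("hdfs://namenode:8020/data/rows.txt")

// Each executor counts the rows of its own partitions...
val perPartition = rows.mapPartitions(it => Iterator(it.size.toLong))

// ...and reduce(_ + _) also runs inside the executors, partition by
// partition; the driver only adds up the single Long that comes back
// from each partition, never the rows themselves.
val total = perPartition.reduce(_ + _)
println(s"total rows = $total")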
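And for (4), my understanding is that each HDFS input split becomes one
RDD partition, and each partition is read independently, in parallel, by
the task that owns it - whether the MapR NFS layer behaves the same way
is exactly what I'm unsure about:

// Hypothetical HDFS URL. textFile gives one partition per input split:
val logs = sc.textFile("hdfs://namenode:8020/logs/big-file.txt")
println(s"partitions (one per split): ${logs.partitions.length}")

// Tag the first line of each partition with the partition that read it,
// just to make the independent, parallel reads visible:
logs
  .mapPartitionsWithIndex((i, it) => it.take(1).map(line => (i, line)))
  .collect()
  .foreach { case (i, line) => println(s"partition $i read: $line") }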