Hi,

So you say " sc.textFile -> flatMap -> Map".

My understanding is like this:
First, the number of partitions is determined, say p of them; you can give a 
hint on this.
Then the nodes which will load those p partitions are chosen, say n nodes 
(where n <= p).
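
For instance, a minimal sketch of giving that hint via the minPartitions 
argument to textFile (the path here is just hypothetical):

  import org.apache.spark.{SparkConf, SparkContext}

  val sc = new SparkContext(new SparkConf().setAppName("WordCount"))

  // Ask for at least 8 partitions; Spark may create more, since the
  // final count also depends on the HDFS block/split boundaries.
  val lines = sc.textFile("hdfs:///data/input.txt", minPartitions = 8)
  println(lines.partitions.length)   // the p that was actually chosen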

More or less at the same time, the n nodes start opening different sections of 
the file - the physical equivalent of the partitions: in HDFS, for instance, I 
guess they would do an open and a seek, just read from the stream there, and 
convert each record to whatever the InputFormat dictates.
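
As far as I know, textFile is roughly shorthand for hadoopFile with a 
TextInputFormat; a sketch in the spark-shell (where sc is predefined, and the 
path is hypothetical):

  import org.apache.hadoop.io.{LongWritable, Text}
  import org.apache.hadoop.mapred.TextInputFormat

  // Roughly what sc.textFile does: one partition per input split, each
  // record converted to a line of text as the InputFormat dictates.
  val lines = sc
    .hadoopFile("hdfs:///data/input.txt",
      classOf[TextInputFormat], classOf[LongWritable], classOf[Text])
    .map { case (_, line) => line.toString }   // drop the byte-offset key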

The shuffle can only be the part where a node opens an HDFS file, for instance, 
but does not have a local replica of the blocks which it needs to read (those 
pertaining to its assigned partitions). So it needs to pick them up from 
remote nodes which do have replicas of that data.

After the blocks are read into memory, flatMap and map are local computations 
generating new RDDs, and in the end the result is sent to the driver (whatever 
the terminal action does on the RDD, like the result of reduce, or the side 
effects of rdd.foreach, etc.).
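
A minimal sketch of that flow for word count (spark-shell style, hypothetical 
path):

  val counts = sc.textFile("hdfs:///data/input.txt")
    .flatMap(_.split("\\s+"))    // narrow: computed locally per partition
    .map(word => (word, 1))      // narrow: computed locally per partition
    .reduceByKey(_ + _)          // wide: intermediate results move between nodes

  // Only the action ships results back to the driver.
  counts.collect().foreach(println)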

Maybe you can share more of your context if it is still unclear; I just made 
some assumptions to shed light on a similar scenario.

Nicu
________________________________
From: Kartik Mathur <kar...@bluedata.com>
Sent: Thursday, October 1, 2015 10:25 PM
To: Nicolae Marasoiu
Cc: user
Subject: Re: Problem understanding spark word count execution

Thanks Nicolae,
So in my case all executors are sending results back to the driver, and 
"shuffle is just sending out the textFile to distribute the partitions" - could 
you please elaborate on this? What exactly is in this file?


On Wed, Sep 30, 2015 at 9:57 PM, Nicolae Marasoiu 
<nicolae.maras...@adswizz.com> wrote:


Hi,

2 - The end results are sent back to the driver; the shuffles are 
transmissions of intermediate results between nodes, i.e. at the "->" steps 
between the intermediate transformations.

More precisely, since flatMap and map are narrow dependencies, meaning they can 
usually happen on the local node, I bet the shuffle is just sending out the 
textFile data to a few nodes to distribute the partitions.
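
One way to see where the shuffle boundary actually falls is toDebugString; a 
sketch in the spark-shell (hypothetical path):

  val counts = sc.textFile("hdfs:///data/input.txt")
    .flatMap(_.split("\\s+"))
    .map(word => (word, 1))
    .reduceByKey(_ + _)

  // Indentation steps in the printed lineage mark shuffle/stage boundaries:
  // a ShuffledRDD on top, the narrow flatMap/map chain below it.
  println(counts.toDebugString)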


________________________________
From: Kartik Mathur <kar...@bluedata.com>
Sent: Thursday, October 1, 2015 12:42 AM
To: user
Subject: Problem understanding spark word count execution

Hi All,

I tried running the Spark word count example and I have a couple of questions -

I am analyzing stage 0, i.e.
 sc.textFile -> flatMap -> Map (Word count example)
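
(For reference, a sketch of the standard example I am running, in the 
spark-shell; the path is just an example:)

  val counts = sc.textFile("hdfs:///data/words.txt")
    .flatMap(_.split(" "))
    .map(word => (word, 1))
    .reduceByKey(_ + _)
  counts.collect()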

1) In the stage logs under the Application UI details, for every task I am 
seeing a shuffle write of 2.7 KB. Question - how can I know where this task 
wrote, i.e. how many bytes went to which executor?

2) In the executor's log, when I look at the same task, it says 2000 bytes of 
result were sent to the driver. My question is: if the results were sent 
directly to the driver, what is this shuffle write?

Thanks,
Kartik
