Hi,

So you say "sc.textFile -> flatMap -> map".
My understanding is like this: as a first step, a number of partitions, p, is determined (you can give a hint for this). Then n nodes (where n <= p) are chosen to load the partitions. More or less concurrently, the n nodes start opening different sections of the file - the physical equivalents of the partitions: in HDFS, for instance, I guess they would do an open and a seek, read from the stream there, and convert the bytes to whatever the InputFormat dictates. The shuffle can only be the part where a node opens an HDFS file but does not have a local replica of the blocks it needs to read (those pertaining to its assigned partitions), so it has to pick them up from remote nodes that do have replicas of that data. After the blocks are read into memory, flatMap and map are local computations generating new RDDs, and in the end the result is sent to the driver (whatever the terminating action does with the RDD - the result of reduce, the side effects of rdd.foreach, etc.).

Maybe you can share more of your context if this is still unclear; I just made assumptions to give clarity on a similar case.

Nicu

________________________________
From: Kartik Mathur <kar...@bluedata.com>
Sent: Thursday, October 1, 2015 10:25 PM
To: Nicolae Marasoiu
Cc: user
Subject: Re: Problem understanding spark word count execution

Thanks Nicolae,

So in my case all executors are sending results back to the driver, and "shuffle is just sending out the textFile to distribute the partitions" - could you please elaborate on this? What exactly is in this file?

On Wed, Sep 30, 2015 at 9:57 PM, Nicolae Marasoiu <nicolae.maras...@adswizz.com> wrote:

Hi,

2 - the end results are sent back to the driver; the shuffles are transmissions of intermediate results between nodes, such as at the -> steps, which are all intermediate transformations.
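The flow described in this thread - split the input into p partitions, run the narrow flatMap and map on each partition locally, then combine the final results at the driver - can be sketched in plain Python, with no Spark involved. The round-robin split, partition count, and function names below are illustrative assumptions, not Spark internals:

```python
# Minimal local sketch of the word-count flow (not real Spark).
# Assumptions: round-robin line assignment stands in for HDFS block splits,
# and Counter stands in for the per-partition (word, 1) map plus local combine.
from collections import Counter
from itertools import chain

def split_into_partitions(lines, p):
    """Assign lines to p partitions (HDFS would split by byte ranges instead)."""
    partitions = [[] for _ in range(p)]
    for i, line in enumerate(lines):
        partitions[i % p].append(line)
    return partitions

def process_partition(partition):
    """Narrow, node-local work: flatMap(line -> words), then count the words."""
    words = chain.from_iterable(line.split() for line in partition)  # flatMap
    return Counter(words)  # map to (word, 1) plus a local combine

lines = ["to be or not to be", "that is the question"]
partials = [process_partition(part) for part in split_into_partitions(lines, p=2)]
total = sum(partials, Counter())  # "driver"-side merge of per-partition results
print(total["to"], total["be"], total["question"])  # prints: 2 2 1
```

The point of the sketch is the locality argument made above: nothing in flatMap or map needs data from another partition, so only the final merge involves moving results off the worker nodes.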
More precisely, since flatMap and map are narrow dependencies, meaning they can usually happen on the local node, I bet the shuffle is just sending out the textFile contents to a few nodes to distribute the partitions.

________________________________
From: Kartik Mathur <kar...@bluedata.com>
Sent: Thursday, October 1, 2015 12:42 AM
To: user
Subject: Problem understanding spark word count execution

Hi All,

I tried running the Spark word count and I have a couple of questions. I am analyzing stage 0, i.e. sc.textFile -> flatMap -> map (the word count example).

1) In the stage logs under the Application UI details, for every task I am seeing a shuffle write of 2.7 KB. Question: how can I know where all this task wrote? That is, how many bytes went to which executor?

2) In the executor's log, when I look at the same task, it says 2000 bytes of result were sent to the driver. My question is: if the results were sent directly to the driver, what is this shuffle write?

Thanks,
Kartik
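On the shuffle-write question above: assuming the job is the classic word count ending in reduceByKey, stage 0 ends with the map side of that shuffle. Each stage-0 task hash-partitions its (word, 1) output into one bucket per reduce task and writes the buckets locally (that is the 2.7 KB shuffle write), while only the small task result and status go to the driver (the 2000 bytes). A rough sketch of that map-side partitioning - the bucket count and names here are illustrative, and real Spark writes the buckets to local disk files rather than keeping lists in memory:

```python
# Rough sketch of a stage-0 task's map-side shuffle write (not real Spark).
# Assumption: the job is the classic word count ending in reduceByKey, so the
# task's (word, 1) pairs are hash-partitioned into one bucket per reduce task.
def shuffle_write(pairs, num_reduce_tasks):
    """Route each (key, value) pair to the bucket of the reducer owning that key."""
    buckets = [[] for _ in range(num_reduce_tasks)]
    for key, value in pairs:
        buckets[hash(key) % num_reduce_tasks].append((key, value))
    return buckets  # Spark writes each bucket locally; reducers fetch them later

pairs = [("to", 1), ("be", 1), ("or", 1), ("not", 1), ("to", 1), ("be", 1)]
buckets = shuffle_write(pairs, num_reduce_tasks=3)
# Every occurrence of the same word lands in the same bucket, so a single
# reduce task can sum the complete count for that word in the next stage.
```

Since all pairs for a given key land in one bucket, "how many bytes went to which executor" amounts to asking for per-bucket sizes; the Application UI aggregates those into the single shuffle-write figure per task.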