Hello all,

I have a question regarding the pipelining and parallelism of transformations 
in Spark. I couldn't find any documentation about it, so I would really 
appreciate your help. I have only just started reading about and using Spark, 
so my description may not be very clear; please tell me if anything is 
confusing.

Let me use a figure of the Spark paper to help me illustrate the problem.


Firstly, when transformations are applied to a single partition of an RDD, is 
there any parallelism within that partition? I guess the answer is no, but I 
would feel more assured if anyone could confirm it.
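To make my mental model concrete: I picture a single task iterating over the partition's elements one by one, roughly like the pure-Python sketch below. This is just my own illustration, not actual Spark code; `run_task`, `partition`, and `f` are made-up names:

```python
# Hypothetical sketch: one task applies a transformation to one
# partition sequentially -- no parallelism inside the partition.
def run_task(partition, f):
    result = []
    for element in partition:   # elements processed one at a time
        result.append(f(element))
    return result

print(run_task([1, 2, 3], lambda x: x * 2))  # -> [2, 4, 6]
```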

Secondly, many transformations in Spark can be pipelined. For example, 
the transformations from C to D and from D to F should be pipelined at 
element granularity, as these partitions are on the same machine. As the Spark 
paper says, 'narrow dependencies allow for pipelined execution on one cluster 
node'. However, does pipelining always work for transformations with narrow 
dependencies? Or do all the involved partitions have to reside on the same 
node? And is there any limit on the length of a pipeline?
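By element-granularity pipelining I mean something like chained lazy iterators, as in this pure-Python sketch (my own illustration of the idea, not Spark internals; `my_map` and `my_filter` are hypothetical names):

```python
# Hypothetical sketch of pipelined narrow transformations: each
# element flows through the whole chain before the next element is
# read, so no intermediate partition is materialized.
def my_map(iterator, f):
    for x in iterator:
        yield f(x)

def my_filter(iterator, pred):
    for x in iterator:
        if pred(x):
            yield x

partition = iter(range(5))
# Analogue of C -> D -> F: two narrow transformations chained lazily.
pipeline = my_filter(my_map(partition, lambda x: x + 1),
                     lambda x: x % 2 == 0)
print(list(pipeline))  # -> [2, 4]
```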

Moreover, consider transformations with wide dependencies, such as from A to B 
and from (F, B) to G; these require shuffling. In the figure, a partition of B 
requires input from all three partitions of A. So does a partition of B only 
start processing after it has received all the data from all partitions of A? 
And does a partition of B only output data after the transformation has 
finished for all of its keys, or can it output individual results key by key 
for pipelining (into the join with G)?
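For contrast, here is how I imagine the shuffle acting as a barrier: a groupByKey-like step cannot emit anything for a key until it has consumed all of its input records. Again, this is only a toy illustration of my understanding, not Spark's actual shuffle code:

```python
from collections import defaultdict

# Hypothetical sketch of a wide dependency: every input record must
# be seen before any grouped output can be produced (a barrier).
def group_by_key(records):
    groups = defaultdict(list)
    for key, value in records:          # must consume ALL input first
        groups[key].append(value)
    for key, values in groups.items():  # only then emit results
        yield key, values

records = [("a", 1), ("b", 2), ("a", 3)]
print(dict(group_by_key(records)))  # -> {'a': [1, 3], 'b': [2]}
```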

Thank you in advance,
Zhongmiao
