Hello all, I have a question about the pipelining and parallelism of transformations in Spark. I couldn't find any documentation about it and would really appreciate your help. I have only just started using and reading about Spark, so my description may not be very clear. Please tell me if anything is unclear.
Let me use a figure from the Spark paper to illustrate the problem.

Firstly, when transformations are applied to a single partition of an RDD, is there any parallelism within that partition? I guess the answer is no, but I would be more assured if someone could confirm it.

Secondly, many transformations in Spark can be pipelined. For example, the transformations from C to D and from D to F should be pipelined at element granularity, as these partitions are on the same machine. As the Spark paper says, "narrow dependencies allow for pipelined execution on one cluster node". However, does pipelining always work for transformations with narrow dependencies, or do all the partitions involved have to reside on the same node? And is there any limit on the length of a pipeline?

Moreover, consider transformations with wide dependencies, such as from A to B and from F and B to G. They require shuffling. In the figure, a partition of B requires input from all three partitions of A. So does a partition of B only start processing after it has received all the data from all partitions of A? And does a partition of B only output data after the transformation has finished for all of its keys, or can it output individual results by key so that they can be pipelined (into the join that produces G)?

Thank you in advance,
Zhongmiao
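
P.S. To make the pipelining question concrete, here is a minimal sketch of element-at-a-time pipelining in plain Python. It is not Spark code and makes no claim about how Spark implements this internally. The function names `c_to_d` and `d_to_f` and the sample data are hypothetical, chosen only to mirror the C, D and F labels above.

```python
# Plain-Python illustration (not Spark): chained generators pass one element
# through every step before the next element is read.
trace = []  # records the order in which the steps actually run


def c_to_d(partition):
    # Hypothetical narrow transformation C -> D (a map that doubles values).
    for x in partition:
        trace.append(f"map {x}")
        yield x * 2


def d_to_f(partition):
    # Hypothetical narrow transformation D -> F (a filter keeping values > 2).
    for x in partition:
        trace.append(f"filter {x}")
        if x > 2:
            yield x


partition_c = [1, 2, 3]  # made-up contents of one partition of C
partition_f = list(d_to_f(c_to_d(iter(partition_c))))

print(partition_f)
print(trace)
```

The trace interleaves the map and filter steps per element, so no intermediate copy of D is ever built. Whether Spark chains narrow transformations in exactly this way, and under which conditions, is the thing I am asking about.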