[Spark Core] details of persisting RDDs

Stefano Pettini Fri, 23 Mar 2018 08:54:24 -0700

Hi,

couple of questions about the internals of the persist mechanism (RDD, but
maybe applicable also to DS/DF).


Data is processed stage by stage. So what actually runs in worker nodes is
the calculation of the partitions of the result of a stage, not the single
RDDs. Operation of all the RDDs that form a stage are run together. That at
least how I interpret the UI and the logs.

Then, what does "persisting an RDD" that is in the middle of a stage
actually mean? Let's say the result of a map, that is located before
another map, located before a reduce. Persisting A that is inside the stage
A -> B -> C.

Also the hint "persist an RDD if it's used more than once and you don't
want it to be calculated twice" is not precise. For example, if inside a
stage we have:

A -> B -> C -> E -> F
     |         |
      --> D -->

So basically a diamond, where B is used twice, as input of C and D, but
then the workflow re-joins in E, all inside the same stage, no shuffling. I
tested and B is not calculated twice. And again the original question: what
does actually happen when B is marked to be persisted?

Regards,
Stefano

[Spark Core] details of persisting RDDs

Reply via email to