Hi, couple of questions about the internals of the persist mechanism (RDD, but maybe applicable also to DS/DF).
Data is processed stage by stage. So what actually runs in worker nodes is the calculation of the partitions of the result of a stage, not the single RDDs. Operation of all the RDDs that form a stage are run together. That at least how I interpret the UI and the logs. Then, what does "persisting an RDD" that is in the middle of a stage actually mean? Let's say the result of a map, that is located before another map, located before a reduce. Persisting A that is inside the stage A -> B -> C. Also the hint "persist an RDD if it's used more than once and you don't want it to be calculated twice" is not precise. For example, if inside a stage we have: A -> B -> C -> E -> F | | --> D --> So basically a diamond, where B is used twice, as input of C and D, but then the workflow re-joins in E, all inside the same stage, no shuffling. I tested and B is not calculated twice. And again the original question: what does actually happen when B is marked to be persisted? Regards, Stefano