Hi,
If I implement a way to keep an up-to-date version of my RDD by ingesting new events, delivered as an RDD_inc (from "increment"), and I provide a "merge" function m(RDD, RDD_inc) that returns RDD_new, it looks like I can evolve the state of my RDD by constructing new RDDs all the time, in a way that hopefully reuses as much data as possible from the previous RDD and lets the rest become garbage collectable. An example merge function would be a join on some ids, creating a merged state for each element (a concrete sketch is in the P.S. below). The result of m(RDD, RDD_inc) has the same type as RDD.

My question is how recomputation works for such an RDD, which is not the direct result of an HDFS load but the result of a long lineage of such transformations. Say that after 2 merge iterations my RDD looks like this:

RDD_new = merge(merge(RDD, RDD_inc1), RDD_inc2)

When recomputing a part of RDD_new, my assumptions are:
- only full partitions are recomputed, nothing more granular
- the corresponding partitions of RDD, RDD_inc1 and RDD_inc2 are recomputed
- the merge functions are then re-applied

This seems too simplistic, though, since in the general case the partitions of all these RDDs do not fully align. The other concern is the potentially redundant loading of data that is in fact no longer required (the data ruled out during the merge).

A more detailed version of this question is at https://www.quora.com/How-does-Spark-RDD-recomputation-avoids-duplicate-loading-or-computation/

Thanks,
Nicu
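
P.S. To make the setup concrete, here is a minimal sketch of the kind of merge I have in mind. Everything here is made up for illustration: the State and Event types, the applyEvent placeholder, and the HashPartitioner setup are assumptions of mine, not anything from a real codebase; the join-based merge and the two iterations match the description above.

import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object MergeSketch extends Serializable {
  // Hypothetical types, just to make the sketch self-contained.
  case class State(id: Long, value: String)
  case class Event(id: Long, delta: String)

  // Placeholder per-element update; stands in for the real merge logic.
  def applyEvent(s: State, e: Event): State = s.copy(value = s.value + e.delta)

  // The merge I described: join on id, produce a merged state per element.
  def merge(state: RDD[(Long, State)], inc: RDD[(Long, Event)]): RDD[(Long, State)] =
    state.leftOuterJoin(inc).mapValues {
      case (s, Some(e)) => applyEvent(s, e) // element had a matching increment
      case (s, None)    => s                // element unchanged, data reused as-is
    }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("merge-sketch").setMaster("local[*]"))
    val part = new HashPartitioner(8)

    // Base state (in my case the result of an HDFS load), keyed by id.
    val base = sc.parallelize(Seq(1L -> State(1L, "a"), 2L -> State(2L, "b"))).partitionBy(part)
    val inc1 = sc.parallelize(Seq(1L -> Event(1L, "+x"))).partitionBy(part)
    val inc2 = sc.parallelize(Seq(2L -> Event(2L, "+y"))).partitionBy(part)

    // Two merge iterations, as in the question:
    val rddNew = merge(merge(base, inc1), inc2)
    rddNew.collect().foreach(println)
    sc.stop()
  }
}

One detail I noticed while writing this: when both sides of the join carry the same partitioner (as with partitionBy(part) above), the join is a narrow dependency, so each partition of RDD_new depends on exactly one partition of each parent. That is the aligned case; my alignment worry is about the general case where the partitioners differ.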