Hi,
If I implement a way to keep an up-to-date version of my RDD by ingesting new events, delivered as an RDD_inc (from "increment"), and I provide a "merge" function m(RDD, RDD_inc) that returns RDD_new, it looks like I can evolve the state of my RDD by constructing new RDDs all the time, in a way that hopefully reuses as much data as possible from the previous RDD and lets the rest become garbage collectable. An example merge function would be a join on some ids, creating a merged state for each element (a concrete sketch is in the P.S. below). The result of m(RDD, RDD_inc) has the same type as RDD.

My question is how recomputation works for such an RDD, which is not the direct result of an HDFS load but the result of a long lineage of such transformations. Say that after 2 merge iterations my RDD looks like this:

RDD_new = merge(merge(RDD, RDD_inc1), RDD_inc2)

When recomputing a part of RDD_new, my assumptions are:
- only full partitions are recomputed, nothing more granular
- the corresponding partitions of RDD, RDD_inc1 and RDD_inc2 are recomputed
- the merge functions are then re-applied

This seems too simplistic, though, since in the general case the partitions of all these RDDs do not fully align. The other concern is the potentially redundant loading of data that is in fact no longer required (the data ruled out during the merge).

A more detailed version of this question is at https://www.quora.com/How-does-Spark-RDD-recomputation-avoids-duplicate-loading-or-computation/

Thanks,
Nicu
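
P.S. To make the setup concrete, here is a minimal sketch of the kind of merge I have in mind. Everything here is made up for illustration: the State and Event types, the applyEvent placeholder, and the HashPartitioner setup are assumptions of mine, not anything from a real codebase; the join-based merge and the two iterations match the description above.

import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object MergeSketch extends Serializable {
  // Hypothetical types, just to make the sketch self-contained.
  case class State(id: Long, value: String)
  case class Event(id: Long, delta: String)

  // Placeholder per-element update; stands in for the real merge logic.
  def applyEvent(s: State, e: Event): State = s.copy(value = s.value + e.delta)

  // The merge I described: join on id, produce a merged state per element.
  def merge(state: RDD[(Long, State)], inc: RDD[(Long, Event)]): RDD[(Long, State)] =
    state.leftOuterJoin(inc).mapValues {
      case (s, Some(e)) => applyEvent(s, e) // element had a matching increment
      case (s, None)    => s                // element unchanged, data reused as-is
    }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("merge-sketch").setMaster("local[*]"))
    val part = new HashPartitioner(8)

    // Base state (in my case the result of an HDFS load), keyed by id.
    val base = sc.parallelize(Seq(1L -> State(1L, "a"), 2L -> State(2L, "b"))).partitionBy(part)
    val inc1 = sc.parallelize(Seq(1L -> Event(1L, "+x"))).partitionBy(part)
    val inc2 = sc.parallelize(Seq(2L -> Event(2L, "+y"))).partitionBy(part)

    // Two merge iterations, as in the question:
    val rddNew = merge(merge(base, inc1), inc2)
    rddNew.collect().foreach(println)
    sc.stop()
  }
}

One detail I noticed while writing this: when both sides of the join carry the same partitioner (as with partitionBy(part) above), the join is a narrow dependency, so each partition of RDD_new depends on exactly one partition of each parent. That is the aligned case; my alignment worry is about the general case where the partitioners differ.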