Hi,

In fact, my RDD will get a new version (a new RDD assigned to the same var) quite frequently, by merging batches of ~1000 events from the last 10 s.
But recomputation would be more efficient not by re-reading the initial RDD partition(s) and reapplying the deltas, but by reading the latest data from HBase and computing on top of that, if anything. Basically I guess I need to write my own RDD and implement its compute method by scanning HBase.

Thanks,
Nicu

________________________________
From: Nicolae Marasoiu <nicolae.maras...@adswizz.com>
Sent: Wednesday, September 30, 2015 3:05 PM
To: user@spark.apache.org
Subject: partition recomputation in big lineage RDDs

Hi,

If I implement a way to keep an up-to-date version of my RDD by ingesting new events (call the increment RDD_inc), and I provide a "merge" function m(RDD, RDD_inc) which returns RDD_new, it looks like I can evolve the state of my RDD by constructing new RDDs all the time, in a manner that hopes to reuse as much data as possible from the past RDD and make the rest garbage-collectable. An example merge function would be a join on some ids, creating a merged state for each element. The result of m(RDD, RDD_inc) has the same type as RDD.

My question is how recomputation works for such an RDD, which is not the direct result of an HDFS load but the result of a long lineage of such functions/transformations. Let's say my RDD is, after 2 merge iterations, like this:

RDD_new = merge(merge(RDD, RDD_inc1), RDD_inc2)

When recomputing a part of RDD_new, my assumptions are:
- only full partitions are recomputed, nothing more granular;
- the corresponding partitions of RDD, RDD_inc1 and RDD_inc2 are recomputed;
- the functions are applied.

This seems too simplistic, since the partitions do not fully align in the general case between all these RDDs. The other aspect is the potentially redundant load of data which is in fact no longer required (the data ruled out in the merge).
A more detailed version of this question is at https://www.quora.com/How-does-Spark-RDD-recomputation-avoids-duplicate-loading-or-computation/

Thanks,
Nicu
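The merge-by-join state evolution described above can be modeled outside Spark. Below is a minimal plain-Python sketch (the per-id state values and the "increment wins" rule are assumptions for illustration); in Spark itself this would roughly correspond to `state.fullOuterJoin(inc)` followed by periodic `rdd.checkpoint()` calls, which truncate the lineage so that recomputing a partition reads the checkpointed data instead of replaying every delta back to the initial load:

```python
# Plain-Python model of the merge semantics: a full outer join on id,
# where the increment's value wins when both sides have one.
# Spark analogue (assumed, not from the thread):
#   state.fullOuterJoin(inc).mapValues(pick_newer)  # merge step
#   state.checkpoint(); state.count()               # truncate lineage periodically

def merge(state: dict, inc: dict) -> dict:
    """Return a new state: keys from both sides, increment overriding."""
    merged = dict(state)  # keep old entries (reuse of past RDD data)
    merged.update(inc)    # increment wins on conflicting ids
    return merged

state = {"a": 1, "b": 2}                      # initial RDD contents (hypothetical)
for inc in [{"b": 20, "c": 3}, {"a": 10}]:    # RDD_inc1, RDD_inc2
    state = merge(state, inc)                 # RDD_new = merge(merge(RDD, inc1), inc2)

print(state)  # {'a': 10, 'b': 20, 'c': 3}
```

Without the checkpoint step, each merge lengthens the lineage, and losing a partition of RDD_new forces Spark to recompute the corresponding partitions of every ancestor in that chain.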