As I understand it, you don't need to merge your historical data RDD with your RDD_inc; what you need is to merge the computation results of your historical RDD with RDD_inc, and so on.
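Something along these lines, purely as a sketch: it assumes the computed state is a keyed RDD of (id, count) pairs and that "merge" just adds the new counts onto the old ones (the names here are made up):

import org.apache.spark.rdd.RDD

// Sketch only: "state" is the previously computed per-id result,
// "inc" holds the newly ingested increments; merging sums the counts.
def merge(state: RDD[(String, Long)], inc: RDD[(String, Long)]): RDD[(String, Long)] =
  state.fullOuterJoin(inc).mapValues {
    case (prev, delta) => prev.getOrElse(0L) + delta.getOrElse(0L)
  }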
IMO, you should consider having an external row store to hold your
computations. I say this because you need to update the rows of the prior
computation based on the new data. Spark cached batches are column-oriented,
and any update to a Spark cached batch is a costly operation.

On Wed, Sep 30, 2015 at 10:59 PM, Nicolae Marasoiu <
nicolae.maras...@adswizz.com> wrote:

> Hi,
>
> An equivalent question would be: can the memory cache be selectively
> evicted from within a component run in the driver? I know it is breaking
> some abstraction/encapsulation, but clearly I need to evict part of the
> cache so that it is reloaded with newer values from the DB.
>
> Because what I basically need is to invalidate some portions of the data
> that have newer values. The "compute" method would stay the same (read
> with TableInputFormat).
>
> Thanks,
> Nicu
> ------------------------------
> *From:* Nicolae Marasoiu <nicolae.maras...@adswizz.com>
> *Sent:* Wednesday, September 30, 2015 4:07 PM
> *To:* user@spark.apache.org
> *Subject:* Re: partition recomputation in big lineage RDDs
>
> Hi,
>
> In fact, my RDD will get a new version (a new RDD assigned to the same
> var) quite frequently, by merging batches of 1000 events from the last 10s.
>
> But recomputation would be more efficient if done not by reading the
> initial RDD partition(s) and reapplying the deltas, but by reading the
> latest data from HBase and just computing on top of that, if anything.
>
> Basically I guess I need to write my own RDD and implement the compute
> method by sliding over HBase.
>
> Thanks,
> Nicu
> ------------------------------
> *From:* Nicolae Marasoiu <nicolae.maras...@adswizz.com>
> *Sent:* Wednesday, September 30, 2015 3:05 PM
> *To:* user@spark.apache.org
> *Subject:* partition recomputation in big lineage RDDs
>
> Hi,
>
> If I implement a way to keep an up-to-date version of my RDD by ingesting
> some new events, called RDD_inc (from "increment"), and I provide a
> "merge" function m(RDD, RDD_inc) which returns RDD_new, it looks like I
> can evolve the state of my RDD by constructing new RDDs all the time,
> doing it in a way that hopes to reuse as much data from the past RDD as
> possible and make the rest garbage-collectable. An example merge function
> would be a join on some ids, creating a merged state for each element. The
> result of m(RDD, RDD_inc) has the same type as RDD.
>
> My question is how recomputation works for such an RDD, which is not the
> direct result of an HDFS load, but the result of a long lineage of such
> functions/transformations:
>
> Let's say my RDD is now, after 2 merge iterations, like this:
>
> RDD_new = merge(merge(RDD, RDD_inc1), RDD_inc2)
>
> When recomputing a part of RDD_new, here are my assumptions:
>
> - only full partitions are recomputed, nothing more granular?
>
> - the corresponding partitions of RDD, RDD_inc1 and RDD_inc2 are recomputed
>
> - the functions are applied
>
> And this seems too simplistic, since in the general case the partitions do
> not fully align between all these RDDs. The other aspect is the
> potentially redundant load of data which is in fact not required anymore
> (the data ruled out in the merge).
>
> A more detailed version of this question is at
> https://www.quora.com/How-does-Spark-RDD-recomputation-avoids-duplicate-loading-or-computation/
>
> Thanks,
>
> Nicu
>
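Regarding the selective-eviction question quoted above: as far as I know,
from the driver you can only unpersist the whole cached RDD and then
rebuild it, e.g. by re-reading the latest rows from HBase with
TableInputFormat. A rough sketch, where sc is the SparkContext, "stale" is
the previously cached RDD, and "events" is a made-up table name:

import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableInputFormat

// Drop all cached blocks of the stale RDD (no per-partition eviction via the public API).
stale.unpersist()

// Re-read the latest rows from HBase; "events" is only a placeholder table name.
val hbaseConf = HBaseConfiguration.create()
hbaseConf.set(TableInputFormat.INPUT_TABLE, "events")

val fresh = sc.newAPIHadoopRDD(
  hbaseConf,
  classOf[TableInputFormat],
  classOf[ImmutableBytesWritable],
  classOf[Result])
fresh.cache()

After that, the prior computation would be re-applied on "fresh" rather
than relying on the old lineage.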