Hi,

In fact, my RDD will get a new version (a new RDD assigned to the same var) quite frequently, by merging batches of ~1000 events from the last 10 s.
But recomputation would be more efficient not by re-reading the initial RDD partition(s) and reapplying the deltas, but by reading the latest data from HBase and computing on top of that, if anything. Basically I guess I need to write my own RDD and implement its compute method by scanning HBase.

Thanks,
Nicu

________________________________
From: Nicolae Marasoiu <nicolae.maras...@adswizz.com>
Sent: Wednesday, September 30, 2015 3:05 PM
To: user@spark.apache.org
Subject: partition recomputation in big lineage RDDs

Hi,

If I implement a way to keep an up-to-date version of my RDD by ingesting new events (call the increment RDD_inc), and I provide a "merge" function m(RDD, RDD_inc) which returns RDD_new, it looks like I can evolve the state of my RDD by constructing new RDDs all the time, in a manner that hopes to reuse as much data as possible from the past RDD and make the rest garbage-collectable. An example merge function would be a join on some ids, creating a merged state for each element. The result of m(RDD, RDD_inc) has the same type as RDD.

My question is how recomputation works for such an RDD, which is not the direct result of an HDFS load but the result of a long lineage of such functions/transformations. Let's say my RDD is, after 2 merge iterations, like this:

RDD_new = merge(merge(RDD, RDD_inc1), RDD_inc2)

When recomputing a part of RDD_new, my assumptions are:
- only full partitions are recomputed, nothing more granular;
- the corresponding partitions of RDD, RDD_inc1 and RDD_inc2 are recomputed;
- the functions are applied.

This seems too simplistic, since the partitions do not fully align in the general case between all these RDDs. The other aspect is the potentially redundant load of data which is in fact no longer required (the data ruled out in the merge).
A more detailed version of this question is at https://www.quora.com/How-does-Spark-RDD-recomputation-avoids-duplicate-loading-or-computation/

Thanks,
Nicu
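The merge-by-join state evolution described above can be modeled outside Spark. Below is a minimal plain-Python sketch (the per-id state values and the "increment wins" rule are assumptions for illustration); in Spark itself this would roughly correspond to `state.fullOuterJoin(inc)` followed by periodic `rdd.checkpoint()` calls, which truncate the lineage so that recomputing a partition reads the checkpointed data instead of replaying every delta back to the initial load:

```python
# Plain-Python model of the merge semantics: a full outer join on id,
# where the increment's value wins when both sides have one.
# Spark analogue (assumed, not from the thread):
#   state.fullOuterJoin(inc).mapValues(pick_newer)  # merge step
#   state.checkpoint(); state.count()               # truncate lineage periodically

def merge(state: dict, inc: dict) -> dict:
    """Return a new state: keys from both sides, increment overriding."""
    merged = dict(state)  # keep old entries (reuse of past RDD data)
    merged.update(inc)    # increment wins on conflicting ids
    return merged

state = {"a": 1, "b": 2}                      # initial RDD contents (hypothetical)
for inc in [{"b": 20, "c": 3}, {"a": 10}]:    # RDD_inc1, RDD_inc2
    state = merge(state, inc)                 # RDD_new = merge(merge(RDD, inc1), inc2)

print(state)  # {'a': 10, 'b': 20, 'c': 3}
```

Without the checkpoint step, each merge lengthens the lineage, and losing a partition of RDD_new forces Spark to recompute the corresponding partitions of every ancestor in that chain.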