Re: [cache eviction] partition recomputation in big lineage RDDs

2015-10-01 Thread Hemant Bhanawat
As I understand it, you don't need a merge of your historical data RDD with
your RDD_inc; what you need is a merge of the computation results of your
historical RDD with RDD_inc, and so on.

IMO, you should consider using an external row store to hold your
computation results. I say this because you need to update the rows of a prior
computation based on the new data. Spark's cached batches are column-oriented,
and any update to a cached batch is a costly operation.
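
Something along these lines is what I have in mind, as a rough, untested
sketch (I'm assuming the HBase 1.x client purely as an example of an external
row store; the table layout, column family and value types are made up):

import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Put}
import org.apache.hadoop.hbase.util.Bytes
import org.apache.spark.rdd.RDD

// Push the per-key results of each increment into the row store as row-level
// upserts, instead of trying to update a cached, column-oriented Spark batch.
def upsertResults(results: RDD[(String, Long)], tableName: String): Unit = {
  results.foreachPartition { rows =>
    // One connection per partition; created on the executor, not the driver.
    val conn = ConnectionFactory.createConnection(HBaseConfiguration.create())
    val table = conn.getTable(TableName.valueOf(tableName))
    try {
      rows.foreach { case (key, value) =>
        val put = new Put(Bytes.toBytes(key))
        put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("result"), Bytes.toBytes(value))
        table.put(put) // cheap row update here vs. rewriting a cached batch
      }
    } finally {
      table.close()
      conn.close()
    }
  }
}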




Re: [cache eviction] partition recomputation in big lineage RDDs

2015-09-30 Thread Nicolae Marasoiu
Hi,


An equivalent question would be: can the memory cache be selectively evicted
from within a component running in the driver? I know this breaks some
abstraction/encapsulation, but I clearly need to evict part of the cache so
that it gets reloaded with newer values from the DB.


What I basically need is to invalidate the portions of the data that now have
newer values in the DB. The "compute" method would stay the same (a read with
TableInputFormat).
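
The closest workaround I can see from the driver side is sketched below
(untested): since, as far as I know, only a whole RDD can be unpersisted
through the public API (there is no per-partition eviction), I would drop the
stale RDD and re-cache a fresh one read from the DB. reloadFromDb is a
placeholder for the actual TableInputFormat read:

import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

// Drop the whole cached RDD and cache a freshly loaded replacement.
// Only whole-RDD eviction is available through the public unpersist() API.
def refresh[T](stale: RDD[T], reloadFromDb: () => RDD[T]): RDD[T] = {
  stale.unpersist(blocking = false)        // evict all cached blocks of the old RDD
  val fresh = reloadFromDb()               // e.g. a new scan over TableInputFormat
  fresh.persist(StorageLevel.MEMORY_ONLY)  // re-cache; materialized on first action
  fresh
}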

Thanks
Nicu



Re: partition recomputation in big lineage RDDs

2015-09-30 Thread Nicolae Marasoiu
Hi,

In fact, my RDD will get a new version (a new RDD assigned to the same var)
quite frequently, by merging in batches of 1000 events from the last 10s.

But recomputation would be more efficient if done not by re-reading the initial
RDD partition(s) and reapplying the deltas, but by reading the latest data from
HBase and just computing on top of that, if anything.

Basically, I guess I need to write my own RDD and implement its compute method
by scanning over HBase.
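
Something like the untested sketch below is what I have in mind; the actual
HBase scan is hidden behind a readRange function that I would still have to
write against the HBase client or TableInputFormat (all names here are
hypothetical):

import scala.reflect.ClassTag
import org.apache.spark.{Partition, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD

// Each partition remembers the HBase row-key range it covers.
case class KeyRangePartition(index: Int, startKey: Array[Byte], stopKey: Array[Byte])
  extends Partition

// An RDD with no parents: compute() always does a fresh scan of its key range,
// so "recomputing" a partition means re-reading the latest rows from HBase
// instead of replaying a long merge lineage.
class HBaseScanRDD[T: ClassTag](
    sc: SparkContext,
    ranges: Seq[(Array[Byte], Array[Byte])],
    readRange: (Array[Byte], Array[Byte]) => Iterator[T]) // the HBase scan goes here
  extends RDD[T](sc, Nil) {

  override protected def getPartitions: Array[Partition] =
    ranges.zipWithIndex.map { case ((start, stop), i) =>
      KeyRangePartition(i, start, stop): Partition
    }.toArray

  override def compute(split: Partition, context: TaskContext): Iterator[T] = {
    val p = split.asInstanceOf[KeyRangePartition]
    readRange(p.startKey, p.stopKey) // fresh read, no parent partitions involved
  }
}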

Thanks,
Nicu



partition recomputation in big lineage RDDs

2015-09-30 Thread Nicolae Marasoiu
Hi,


If I implement a way to keep an up-to-date version of my RDD by ingesting some
new events, called RDD_inc (from "increment"), and I provide a "merge" function
m(RDD, RDD_inc) which returns RDD_new, it looks like I can evolve the state of
my RDD by constructing new RDDs all the time, in a manner that hopes to reuse
as much data as possible from the past RDD and make the rest garbage
collectable. An example merge function would be a join on some ids, creating a
merged state for each element. The result of m(RDD, RDD_inc) has the same type
as RDD.
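
To make it concrete, here is a rough, untested sketch of one possible
m(RDD, RDD_inc), with made-up per-id State and Event types: a full outer join
on the id, folding the new events into the previous state:

import org.apache.spark.rdd.RDD

// Made-up types, just to illustrate the shape of the merge.
case class Event(ts: Long)
case class State(count: Long, lastSeen: Long)

// m(RDD, RDD_inc): join previous per-id state with the aggregated increment.
def merge(prev: RDD[(String, State)], inc: RDD[(String, Event)]): RDD[(String, State)] = {
  // Pre-aggregate the increment per id: (event count, latest timestamp).
  val incAgg = inc
    .map { case (id, e) => (id, (1L, e.ts)) }
    .reduceByKey { case ((c1, t1), (c2, t2)) => (c1 + c2, math.max(t1, t2)) }

  prev.fullOuterJoin(incAgg).mapValues {
    case (Some(s), Some((c, ts))) => State(s.count + c, math.max(s.lastSeen, ts))
    case (Some(s), None)          => s            // id untouched by this increment
    case (None, Some((c, ts)))    => State(c, ts) // id seen for the first time
    case (None, None)             => State(0L, 0L) // cannot happen; keeps the match exhaustive
  }
}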


My question is: how does recomputation work for such an RDD, which is not the
direct result of an HDFS load, but the result of a long lineage of such
functions/transformations?


Let's say that, after 2 merge iterations, my RDD now looks like this:

RDD_new = merge(merge(RDD, RDD_inc1), RDD_inc2)
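
In code I build this up by reassigning the same var on every batch, roughly as
below (untested; it reuses the merge, State and Event definitions from the
sketch above, and sc is an existing SparkContext):

// Two toy increments, just to show how the lineage of `state` deepens.
var state: RDD[(String, State)] = sc.parallelize(Seq("a" -> State(1L, 0L)))

val rddInc1 = sc.parallelize(Seq("a" -> Event(10L), "b" -> Event(11L)))
val rddInc2 = sc.parallelize(Seq("b" -> Event(20L)))

state = merge(state, rddInc1).cache() // RDD_new = merge(RDD, RDD_inc1)
state = merge(state, rddInc2).cache() // RDD_new = merge(merge(RDD, RDD_inc1), RDD_inc2)

// If a cached partition of `state` is lost, Spark recomputes it by walking this
// lineage back through both merges to the original inputs.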


When recomputing a part of RDD_new, here are my assumptions:

- only full partitions are recomputed, nothing more granular?

- the corresponding partitions of RDD, RDD_inc1 and RDD_inc2 are recomputed

- the merge functions are reapplied


And this seems somewhat simplistic, since in the general case the partitions do
not fully align between all these RDDs. The other aspect is the potentially
redundant loading of data that is in fact no longer required (the data ruled
out in the merge).


A more detailed version of this question is at 
https://www.quora.com/How-does-Spark-RDD-recomputation-avoids-duplicate-loading-or-computation/


Thanks,

Nicu