Hi Grega,

You'll need to create a new cached RDD for each batch, then take the union of 
the relevant ones on every window. For example, if you have rdd0, rdd1, and 
rdd2, you might first take the union of rdd0 and rdd1, then of rdd1 and rdd2. 
This lets you work with just the subset of RDDs in the current window instead 
of trying to subtract old data out of one big RDD.

In terms of uncaching: in the master branch, you can call RDD.unpersist() to 
uncache an old RDD. In Spark 0.7 there's no user control over uncaching, so you 
just have to wait for old RDDs to fall out of the LRU cache.

Matei

On Aug 13, 2013, at 4:07 AM, Grega Kešpret <[email protected]> wrote:

> Hi all,
> I would need some tips regarding how to go about doing a "rolling window" 
> with Spark. We would like to make something like this:
> 
> [----------- rolling window data -----------][ new ]
> [ old ][----------- rolling window data -----------][ new ]
> 
> Effectively, the size of the data in rolling window will be much larger than 
> additional data in each iteration, so we want to cache it in memory. 
> 
> val rolling = sc.textFile(...).cache()
> rolling.count()
> // rolling window should be in cache
> 
> val newData = sc.textFile(...) // "new" is a reserved word in Scala
> val data = rolling.union(newData).cache()
> data.count() 
> // rolling window + new should be in cache
> 
> We can add new data (and cache it) with unioning rolling window RDD with new 
> RDD. But how can we forget old data / remove it from cache? 
> If it's of any help, we have the data segmented by small intervals so we know 
> the file names beforehand.
> 
> Thanks,
> Grega
> 
