Hi all,
I need some tips on how to implement a "rolling window"
with Spark. We would like to do something like this:

[----------- rolling window data -----------][ new ]
[ old ][----------- rolling window data -----------][ new ]

Effectively, the data in the rolling window will be much larger than the
data added in each iteration, so we want to keep it cached in memory.

val rolling = sc.textFile(...).cache()
rolling.count()
// rolling window should now be in cache

val newData = sc.textFile(...)  // note: `new` is a reserved word in Scala
val data = rolling.union(newData).cache()
data.count()
// rolling window + new data should now be in cache

We can add new data (and cache it) by unioning the rolling window RDD with
a new RDD. But how can we forget old data / remove it from the cache?
If it's of any help, we have the data segmented by small intervals so we
know the file names beforehand.
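
One idea we had (not sure it is the right approach): since the file names
are known in advance, each interval we could rebuild the window RDD from
the current list of files, materialize it, and then unpersist() the
previous window to evict the old data. A rough sketch (the `windowFiles`
list and paths are just placeholders):

    import org.apache.spark.SparkContext
    import org.apache.spark.rdd.RDD

    var window: RDD[String] = null

    // Called once per interval with the file names currently in the window.
    def advance(sc: SparkContext, windowFiles: Seq[String]): Unit = {
      // textFile accepts a comma-separated list of paths
      val next = sc.textFile(windowFiles.mkString(",")).cache()
      next.count()              // materialize the new window in cache
      if (window != null)
        window.unpersist()      // evict the previous window from cache
      window = next
    }

The downside is that this re-reads and re-caches the whole window each
interval instead of reusing the already-cached blocks. Is there a way to
drop only the oldest segment?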

Thanks,
Grega
