Hi all, I need some tips on how to implement a "rolling window" with Spark. We would like to do something like this:
        [----------- rolling window data -----------][ new ]
[ old ][----------- rolling window data -----------][ new ]

Effectively, the data in the rolling window will be much larger than the data added in each iteration, so we want to keep the window cached in memory:

    val rolling = sc.textFile(...).cache()
    rolling.count() // rolling window should now be in cache

    val newData = sc.textFile(...) // ("new" is a reserved word in Scala)
    val data = rolling.union(newData).cache()
    data.count() // rolling window + new data should now be in cache

We can add new data (and cache it) by unioning the rolling-window RDD with a new RDD. But how can we forget the old data / remove it from the cache? If it's of any help, the data is segmented into small intervals, so we know the file names beforehand.

Thanks,
Grega
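For what it's worth, here is a minimal sketch of one way the pattern above could be structured, assuming RDD.unpersist() is available to evict a cached RDD. It keeps one cached RDD per interval file in a queue and unpersists the oldest segment when it slides out of the window; the object name, windowSize, and file paths are hypothetical, and this is only a sketch, not a tested implementation:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD
import scala.collection.mutable.Queue

object RollingWindowSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("rolling-window"))

    val windowSize = 3                     // number of interval segments kept
    val segments = Queue.empty[RDD[String]]

    // Load one interval file, cache it, and drop the oldest segment if the
    // window is full; returns the union of all segments currently in the window.
    def slideWindow(path: String): RDD[String] = {
      val seg = sc.textFile(path).cache()
      seg.count()                          // force the segment into the cache
      segments.enqueue(seg)
      if (segments.size > windowSize) {
        segments.dequeue().unpersist()     // remove the old segment from cache
      }
      sc.union(segments)                   // the current rolling window
    }

    // Hypothetical per-interval files, known ahead of time as described above.
    val window = slideWindow("hdfs:///data/interval-0001.txt")
  }
}
```

Because each interval is a separate cached RDD rather than one growing union, evicting old data is just an unpersist() on the segment that fell out of the window.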
