Wes, it depends on what you mean by "sliding window" as related to "RDD":
1. Some operation over multiple rows of data within a single, large RDD, where the operations are required to be temporally sequential. This may be the case where you're computing a running average over historical time-based data.

2. Some operation over multiple rows of data within a single, large RDD, where the operations may be run in parallel, even out of order. This may be the case where your RDD represents a two-dimensional geospatial map and you're computing something (e.g., a population average) over a grid.

3. Some operation on data streaming in, over a fixed-size window, where you would like the representation of that windowed data to be an RDD.

For #1 and #2, there is only one "static" RDD and the task is largely bookkeeping: tracking which window you're working on when, and dealing with partition boundaries. mapPartitions or mapPartitionsWithIndex would be a useful interface here, as it lets you see multiple rows at a time and, in the latter case, know which partition you're working with at any given time. (A rough sketch is at the bottom of this message, below the quoted thread.)

For #3, that's what Spark Streaming does, and it does so by introducing a higher-level concept of a DStream, which is a sequence of RDDs, where each RDD holds one batch of the incoming data. Given that it is a collection of RDDs, the windowing management task simply involves maintaining which RDDs are contained in that sequence. (Likewise sketched at the bottom.)

--
Christopher T. Nguyen
Co-founder & CEO, Adatao <http://adatao.com>
linkedin.com/in/ctnguyen


On Tue, Dec 10, 2013 at 12:01 PM, Wes Mitchell <w...@dancingtrout.net> wrote:

> So, does that mean that if I want to do a sliding window, then I have to,
> in some fashion, build a stream from the RDD, push a new value on the
> head, filter out the oldest value, and re-persist as an RDD?
>
>
> On Fri, Dec 6, 2013 at 10:13 PM, Christopher Nguyen <c...@adatao.com> wrote:
>
>> Kyle, the fundamental contract of a Spark RDD is that it is immutable.
>> This follows the paradigm where data is (functionally) transformed into
>> other data, rather than mutated. This allows these systems to make
>> certain assumptions and guarantees that they otherwise wouldn't be able
>> to.
>>
>> Now, we've been able to get mutative behavior with RDDs---for fun,
>> almost---but that's implementation-dependent and may break at any time.
>>
>> It turns out this behavior is quite appropriate for the analytic stack,
>> where you typically apply the same transform/operator to all the data.
>> You're finding that transactional systems are the exact opposite, where
>> you typically apply a different operation to individual pieces of the
>> data. Incidentally, this is also the dichotomy between column- and
>> row-based storage being optimal for each respective pattern.
>>
>> Spark is intended for the analytic stack. Using Spark as the persistence
>> layer of a transactional system is going to be very awkward. I know there
>> are some vendors who position their in-memory databases as good for both
>> OLTP and OLAP use cases, but when you talk to them in depth they will
>> readily admit that each is really optimal for one and not the other.
>>
>> If you want to make a project out of building a special Spark RDD that
>> supports this behavior, it might be interesting. But there will be no
>> simple shortcuts to get there from here.
>>
>> --
>> Christopher T. Nguyen
>> Co-founder & CEO, Adatao <http://adatao.com>
>> linkedin.com/in/ctnguyen
>>
>>
>> On Fri, Dec 6, 2013 at 10:56 PM, Kyle Ellrott <kellr...@soe.ucsc.edu> wrote:
>>
>>> I'm trying to figure out if I can use an RDD to back an interactive
>>> server.
>>> One of the requirements would be to have incremental updates to elements
>>> in the RDD, i.e., transforms that change/add/delete a single element in
>>> the RDD.
>>> It seems pretty drastic to do a full RDD filter to remove a single
>>> element, or to take the union of the RDD with another one of size 1 to
>>> add an element. (Or is it?) Is there an efficient way to do this in
>>> Spark? Are there any examples of this kind of usage?
>>>
>>> Thank you,
>>> Kyle
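
P.S. For #1/#2, here is a rough, untested sketch of a per-partition running
average using mapPartitionsWithIndex. It assumes an RDD[Double] already
sorted and partitioned in time order, and it simply drops windows that
straddle partition boundaries -- handling those is the extra bookkeeping
mentioned above. The SparkContext setup, the 4 partitions, and the window
size w = 5 are placeholders.

    import org.apache.spark.SparkContext

    // Placeholder setup; in a real job you'd reuse an existing SparkContext.
    val sc = new SparkContext("local[2]", "sliding-window-sketch")

    // Time-ordered samples, partitioned in time order (4 partitions here).
    val ticks = sc.parallelize((1 to 100).map(_.toDouble), 4)

    val w = 5  // window size, in number of samples
    val runningAvg = ticks.mapPartitionsWithIndex { (part, values) =>
      // Scan each partition sequentially and emit (partition #, windowed average).
      // Windows spanning a partition boundary are silently dropped here.
      values.sliding(w)
            .withPartial(false)
            .map(win => (part, win.sum / win.size))
    }

    runningAvg.collect().foreach(println)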
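And for #3, a similarly rough Spark Streaming sketch: each 10-second batch
becomes one RDD in the DStream, and window() just regroups those RDDs into a
30-second window that slides every 10 seconds. The socket source on
localhost:9999 and the durations are arbitrary placeholders.

    import org.apache.spark.streaming.{Seconds, StreamingContext}

    // Placeholder: 10-second batches; each batch becomes one RDD in the DStream.
    val ssc = new StreamingContext("local[2]", "streaming-window-sketch", Seconds(10))

    // Arbitrary text source for illustration.
    val lines = ssc.socketTextStream("localhost", 9999)

    // A 30-second window sliding every 10 seconds; under the hood this is
    // just a regrouping of the last three batch RDDs.
    val windowedCounts = lines.window(Seconds(30), Seconds(10)).count()
    windowedCounts.print()

    ssc.start()
    ssc.awaitTermination()  // block until the streaming job is stopped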