Hi,

On Thu, Jan 29, 2015 at 1:54 AM, YaoPau <jonrgr...@gmail.com> wrote:
>
> My thinking is to maintain state in an RDD and update it and persist it with
> each 2-second pass, but this also seems like it could get messy. Any
> thoughts or examples that might help me?
>

I have just implemented some timestamp-based windowing on DStreams (I can't share the code now, but it will be published in a couple of months), although with the assumption that items arrive in the correct order. The main challenge (a rather technical one) was to keep proper state across RDD boundaries and to tell the state "you can mark this partial window from the last interval as 'complete' now" without shuffling too much data around. For example, if there are some empty intervals, you don't know when the next item to go into the partial window will arrive, or whether there will be one at all.

I guess if you want to have out-of-order tolerance, that will become even trickier, as you need to define and think about some timeout for partial windows in your state...

Tobias
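
To make the state-keeping idea a bit more concrete, here is a rough sketch in plain Python. It is not the DStream implementation mentioned above (that code is not shared here) and it does not use Spark at all; it only simulates micro-batches as lists. All names (`WindowState`, `process_batch`, `run`, `window_size`, `timeout`) are made up for illustration. It assumes items arrive in timestamp order, and it assumes the batch clock (`batch_end`) is comparable to the item timestamps, so that a partial window can be closed after some empty intervals even though no later item has shown up.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

Item = Tuple[int, str]  # (event timestamp in seconds, payload)


@dataclass
class WindowState:
    """Partial window carried over from one micro-batch to the next."""
    start: Optional[int] = None              # start of the open window, if any
    items: List[Item] = field(default_factory=list)


def process_batch(state, batch, batch_end, window_size=10, timeout=5):
    """Process one micro-batch; return (completed_windows, new_state).

    Assumption: items in `batch` are already in timestamp order.
    """
    completed = []
    start, items = state.start, list(state.items)

    for ts, payload in batch:
        w = ts - ts % window_size            # window this item belongs to
        if start is not None and w != start:
            # An item for a later window arrived, so the open one is complete.
            completed.append((start, items))
            items = []
        start = w
        items.append((ts, payload))

    # Empty-interval case: no later item arrived to close the window, so fall
    # back to the batch clock plus a timeout (hypothetical policy).
    if start is not None and batch_end >= start + window_size + timeout:
        completed.append((start, items))
        start, items = None, []

    return completed, WindowState(start, items)


def run(batches):
    """Feed (batch_end, items) pairs through process_batch, threading state."""
    state, out = WindowState(), []
    for batch_end, batch in batches:
        completed, state = process_batch(state, batch, batch_end)
        out.extend(completed)
    return out


# Two windows of 10 seconds; the second one is only closed by the timeout,
# because the last two intervals are empty.
demo_batches = [
    (2, [(0, "a"), (1, "b")]),
    (4, [(3, "c")]),
    (12, [(11, "d")]),
    (14, []),
    (30, []),
]

if __name__ == "__main__":
    for window in run(demo_batches):
        print(window)
```

In a real streaming job the state would have to live somewhere that survives across intervals (for example in a persisted RDD, as suggested in the question), and the timeout here only covers empty intervals; tolerating out-of-order items would need a more careful policy.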