Hey, Geoffry, We have started some work in SAMZA-552 to create a window operator API in samza, as part of effort to implement support for a high-level language. I will probably be able to have something to share in a few days and would love to get feedbacks regarding to the window operator.
Thanks! On Mon, Feb 23, 2015 at 4:49 PM, Geoffry Sumter <vit...@gmail.com> wrote: > Hey everyone, > > I've been thinking about reprocessing > <http://samza.apache.org/learn/documentation/0.7.0/jobs/reprocessing.html> > when > my job has windowed state > < > http://samza.apache.org/learn/documentation/0.7.0/container/state-management.html#windowed-aggregation > > > and > I have a few questions. > > Context: I have a series of physical sensors that stream partial scans of > their surroundings over the course of ~5-10 minutes (at the end of 5-10 > minutes its provided a full scan of its surroundings and starts over from > the start). Each packet of data has a timestamp that we're just going to > have to trust is 'close enough.' When processing in real-time it's very > natural to window the data every 5 minutes (wall clock) and merge into a > complete view of their collective surroundings. For our purposes, old data > arriving > 5 minutes late is no longer useful for many applications. > > Now, I'd love to be able to reprocess data, especially by increasing > parallelism and processing quickly, but this seems incompatible with using > wall-clock-based windowed state. I could base my windowing/binning on the > timestamps provided by my input data, but then I have to be careful to > handle cases where some of my data may be arbitrarily delayed and the > possibility that one partition will get significantly ahead of other ones > (less interesting surroundings and faster computations) which makes waiting > for a majority of partitions to get to a certain point in time difficult. > > Does anyone have experience with windowing and reprocessing? Any literature > recommendations? > > Thanks! > Geoffry >