Hi Geoffry,

You might find the Google Millwheel paper and recent talk relevant.  That 
system supports windows based on event time as well as reprocessing.

Sent from my iPhone

> On Feb 23, 2015, at 4:49 PM, Geoffry Sumter <vit...@gmail.com> wrote:
> 
> Hey everyone,
> 
> I've been thinking about reprocessing
> <http://samza.apache.org/learn/documentation/0.7.0/jobs/reprocessing.html> 
> when
> my job has windowed state
> <http://samza.apache.org/learn/documentation/0.7.0/container/state-management.html#windowed-aggregation>
> and
> I have a few questions.
> 
> Context: I have a series of physical sensors that stream partial scans of
> their surroundings over the course of ~5-10 minutes (at the end of 5-10
> minutes its provided a full scan of its surroundings and starts over from
> the start). Each packet of data has a timestamp that we're just going to
> have to trust is 'close enough.' When processing in real-time it's very
> natural to window the data every 5 minutes (wall clock) and merge into a
> complete view of their collective surroundings. For our purposes, old data
> arriving > 5 minutes late is no longer useful for many applications.
> 
> Now, I'd love to be able to reprocess data, especially by increasing
> parallelism and processing quickly, but this seems incompatible with using
> wall-clock-based windowed state. I could base my windowing/binning on the
> timestamps provided by my input data, but then I have to be careful to
> handle cases where some of my data may be arbitrarily delayed and the
> possibility that one partition will get significantly ahead of other ones
> (less interesting surroundings and faster computations) which makes waiting
> for a majority of partitions to get to a certain point in time difficult.
> 
> Does anyone have experience with windowing and reprocessing? Any literature
> recommendations?
> 
> Thanks!
> Geoffry

Reply via email to