For your case, setting the window time to some small value (say a second) doesn't provide small enough granularity to check for completeness and produce as necessary? This would generally be how what you're describing would be handled.
SAMZA-23 would be a great contribution, if you're interested. -jg On Mon, Apr 14, 2014 at 4:36 AM, Nicolas Bär <[email protected]> wrote: > Hi > > I'm using Samza to aggregate data within sliding windows. I'm especially > interested in the fault tolerance aspect of Samza and it's ability to > restore from the latest committed offset. I've looked into the > WindowableTask interface, but this implementation relies on the system > time, rather then the timestamp of the messages received. The main problem > is to make sure all data for one window is available and in case of node > failure the complete window is restored. With the current implementation of > offset committing (committed by time interval, window size or upon > `taskCoordinator.commit()`) this is quite cumbersome. I have two different > options in mind to overcome this problem and would like to hear your advice > on this. > > 1. Using the Key-Value store the data of a the current window is cached and > after a restore one could figure out what data to include from the cache > and from Kafka. Unfortunately, this sounds like reimplementing the offset > management of Kafka. > > 2. Set the time based commit interval to Long.MAX_VALUE and handle all > commit intervals through `taskCoordinator.commit()`. In this case I could > commit the offset after each sliding window and in case of failure Samza > would restore from the beginning of the new sliding window. The only > problem with this approach is `taskCoordinator.commit()` will commit the > offsets of all partitions. But the data may be different between partitions > and therefore sliding windows will not match with other partitions. I've > looked into SAMZA-23 [1] and found a proposal to coordinate commits for > each partition individually. Since this would solve my problem, I'd like to > provide a patch for this issue. Is there any interest in this? > > Best > Nicolas > > > [1]: https://issues.apache.org/jira/browse/SAMZA-23 >
