Hi Albert, Perhaps the common denominator is time-based sliding and tumbling windows based on event-time. Flink, Beam and Storm by now produce watermarks to trigger event time windows consistently. One immediate difference I see between Flink and Storm for example is that the default Flink windows operate on a partitioned data stream (by key). A baseline solution would be to start with task level windows which can also be achieved in Flink by keyBy(partitionId) for example. There are several ways to go around every each of these differences.
Count windows are also supported in the task level and as far as I remember they are used in Samoa by several operators (e.g. the ingestion of the VHT model). There are quite a few unique features in each system (e.g. additional triggers and custom windows) but it is safe to ignore them for now. I am not following Samza much lately, perhaps someone from their community can tell us more. I do remember seeing discussions around a similar window scheme, not sure if something is merged yet [1]. Paris [1] https://issues.apache.org/jira/browse/SAMZA-552?jql=project%20%3D%20SAMZA%20AND%20text%20~%20%22window%22<https://issues.apache.org/jira/browse/SAMZA-552?jql=project%20=%20SAMZA%20AND%20text%20~%20"window"> On 08 Feb 2016, at 09:21, Albert Bifet <[email protected]<mailto:[email protected]>> wrote: Thanks Paris! As Gianmarco said, it could be nice to re-work on windowing in the near future. What are the differences in windowing in Google Data Flow, Flink and Storm right now? Any hint on how this is going to evolve in the future? Cheers, Albert On Sun, Feb 7, 2016 at 3:23 PM, tarush grover <[email protected]<mailto:[email protected]>> wrote: Looking forward to be the part of this roadmap. Regards, Tarush On Sunday 7 February 2016, Gianmarco De Francisci Morales <[email protected]<mailto:[email protected]>> wrote: Thanks for the pointer, Paris. Finding the right abstraction level for distributed streaming ML is definitely a worthy (and non-trivial) task. We are currently working on some improvements for VHT. Once that's done, re-working it on a window-based abstraction with proper support for iterations could be a nice project. We wound need to drop support for S4 (not sure about Samza), but that's on the roadmap anyway. Cheers, -- Gianmarco On Sat, Feb 6, 2016 at 1:42 PM, Márton Balassi <[email protected]<mailto:[email protected]> <javascript:;>> wrote: Great suggestion, Paris. I would love to see Samoa building on these concept once they are stable enough in the supported data processing engines. On Fri, Feb 5, 2016 at 6:15 PM, Paris Carbone <[email protected]<mailto:[email protected]> <javascript:;>> wrote: Hello Samoans, It seems that system semantics in stream processing are converging lately. Apache Storm has now explicit state and windows [1], almost identical to Flink and Beam. Samza is also moving in a similar direction. This is really exciting and it feels natural to start moving the Samoa programming model a level up on top these establishing concepts. For example, there is no more need for custom buffering to implement windowing and ML models etc. can be re-defined and engineered as operator state to be durable. There are quite many cool things to be done and I believe there can be a very attractive roadmap for Samoa in that direction. What do you think? [1] https://community.hortonworks.com/articles/14171/windowing-and-state-checkpointing-in-apache-storm.html Paris
