Chris, Thanks you very much for your detailed. Another system for processing real-time data just came to my attention (thanks to Kafka mailing list, again). It's called Druid (more at: http://druid.io).
While I now understand Samza advantages over Storm for building a CEP, I am wondering how Samza compares to Druid. I guess I may not alone wondering about Samza vs. Druid, so you may want to add a Samza vs. Druid" item in Samza documenation :) Thanks, Alex. On Sun, Aug 25, 2013 at 5:26 PM, Chris Riccomini <[email protected]>wrote: > Hey Alex, > > As I understand it, the CEP pattern you describing is, "look for a series > of events within some bounded time frame, and take an action based on the > combination of events." You use an example of three events arriving within > 10 minutes of each other, consecutively. Wikipedia uses a similar example > (wedding bell event + man in suit event + woman in white dress event + > rice thrown event = wedding) on their CEP page. > > This pattern can be implemented in Samza fairly easily using Samza's > key/value store (or some other StorageEngine, if you choose to implement > it). It's best to use a key/value store for this use case, since the > window might be quite long (10 minutes), and all events in the window > might not fit in memory. If you use Samza's key/value store, you can put > each message (and a timestamp) into the key/value store as the messages > arrive. You can then implement the WindowableTask interface along with the > StreamTask interface, and configure Samza to call window() on your task > every N seconds (say, task.window.ms=60000). The window method could then > do a range query on the key/value store, and check for message chains > (e.g. E1 -> E2 -> E3) that were last updated > 10 minutes ago. If an > expected message was missing, you could then take some action (send an > alert, or whatever). > > In general, when I think CEP, I think Esper (http://esper.codehaus.org/). > You should be able to implement a lot of CEP/SQL type commands (SELECT, > JOIN, COUNT, SUM, DISTINCT, WHERE, GROUP BY, HAVING, WINDOW, ORDER, etc) > using Samza's StreamTask interface, and is state management facilities. > > Beyond state management, most features in Samza enable CEP processing, in > one way or another. From your perspective, you can look at Samza as the > underlying framework with which you might choose to implement a CEP type > system (think MapReduce is to Hive as Samza is to a CEP system). Specific > things that help are its WindowableTask interface, the partitioning model > (which lends itself to distributed joins and aggregation), and Samza's > state management features. > > One thing to be aware of right now is Samza's "at least once" messaging > guarantee when failures occur (inherited from Kafka). You might receive > duplicate messages. This means you can potentially double count, if you're > doing aggregation. In the example you give (E1, E2, E3), this shouldn¹t be > a problem. We have plans to provide exactly once messaging, but we haven't > implemented the feature yet. > > Cheers, > Chris > > On 8/24/13 12:05 PM, "Alex The Rocker" <[email protected]> wrote: > > >Hello, > > > >I just began to read about Samza, and I very excited about it (I was > >warned > >of its existence by Jay Kreps' post in Kafka users list, BTW). > > > >My first reaction is: are you guys using it at LinkedIn for applications > >which lies in the CEP (Complex Event Processing) system domain? > > > >To be more specific, would stateful Samza tasks be used in order to > >compute > >complex states such as "event E1 is followed by E2 then by E3 with less > >than 10 minutes interval between each event" ? > > > >I was looking at Storm for CEP, but as pointed out in Samza Storm page, > >Storm leaves state management to the bolts code, whereas Samza has > >"something". > > > >Beyond state management, what else would make Samza a good building block > >for a CEP? Or a bad one? > > > >Thanks, > >Alex. > >
