Chris,

Thank you very much for your detailed reply.
Another system for processing real-time data just came to my attention
(thanks to the Kafka mailing list, again).
It's called Druid (more at: http://druid.io).

While I now understand Samza's advantages over Storm for building a CEP, I am
wondering how Samza compares to Druid.
I guess I may not be alone in wondering about Samza vs. Druid, so you may want
to add a "Samza vs. Druid" item to the Samza documentation :)

Thanks,
Alex.




On Sun, Aug 25, 2013 at 5:26 PM, Chris Riccomini <[email protected]> wrote:

> Hey Alex,
>
> As I understand it, the CEP pattern you're describing is, "look for a series
> of events within some bounded time frame, and take an action based on the
> combination of events." You use an example of three events arriving within
> 10 minutes of each other, consecutively. Wikipedia uses a similar example
> (wedding bell event + man in suit event + woman in white dress event +
> rice thrown event = wedding) on their CEP page.
>
> This pattern can be implemented in Samza fairly easily using Samza's
> key/value store (or some other StorageEngine, if you choose to implement
> it). It's best to use a key/value store for this use case, since the
> window might be quite long (10 minutes), and all events in the window
> might not fit in memory. If you use Samza's key/value store, you can put
> each message (and a timestamp) into the key/value store as the messages
> arrive. You can then implement the WindowableTask interface along with the
> StreamTask interface, and configure Samza to call window() on your task
> every N seconds (say, task.window.ms=60000). The window method could then
> do a range query on the key/value store, and check for message chains
> (e.g. E1 -> E2 -> E3) that were last updated > 10 minutes ago. If an
> expected message was missing, you could then take some action (send an
> alert, or whatever).
>
> In general, when I think CEP, I think Esper (http://esper.codehaus.org/).
> You should be able to implement a lot of CEP/SQL type commands (SELECT,
> JOIN, COUNT, SUM, DISTINCT, WHERE, GROUP BY, HAVING, WINDOW, ORDER, etc)
> using Samza's StreamTask interface and its state management facilities.
>
> Beyond state management, most features in Samza enable CEP processing, in
> one way or another. From your perspective, you can look at Samza as the
> underlying framework with which you might choose to implement a CEP type
> system (think MapReduce is to Hive as Samza is to a CEP system). Specific
> things that help are its WindowableTask interface, the partitioning model
> (which lends itself to distributed joins and aggregation), and Samza's
> state management features.
>
> One thing to be aware of right now is Samza's "at least once" messaging
> guarantee when failures occur (inherited from Kafka). You might receive
> duplicate messages. This means you can potentially double count, if you're
> doing aggregation. In the example you give (E1, E2, E3), this shouldn't be
> a problem. We have plans to provide exactly once messaging, but we haven't
> implemented the feature yet.
>
> Cheers,
> Chris
>
> On 8/24/13 12:05 PM, "Alex The Rocker" <[email protected]> wrote:
>
> >Hello,
> >
> >I just began to read about Samza, and I am very excited about it (I was
> >made aware of its existence by Jay Kreps' post on the Kafka users list, BTW).
> >
> >My first reaction is: are you guys using it at LinkedIn for applications
> >that lie in the CEP (Complex Event Processing) system domain?
> >
> >To be more specific, would stateful Samza tasks be used in order to
> >compute complex states such as "event E1 is followed by E2 then by E3 with
> >less than 10 minutes between each event"?
> >
> >I was looking at Storm for CEP, but as pointed out on Samza's Storm page,
> >Storm leaves state management to the bolt code, whereas Samza has
> >"something".
> >
> >Beyond state management, what else would make Samza a good building block
> >for a CEP?  Or a bad one?
> >
> >Thanks,
> >Alex.
>
>
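
The approach Chris describes in the quoted reply (store each event with a timestamp, then scan the store periodically from window() for chains that have gone quiet for more than 10 minutes) can be sketched roughly as follows. This is plain Java, not the Samza API: a TreeMap stands in for Samza's key/value store, and the class ChainTracker, its methods, and the per-key chain model are all hypothetical names chosen for illustration.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch: detects E1 -> E2 -> E3 chains per key, with a timeout. */
class ChainTracker {
    /** How far along the chain a key is, and when it last advanced. */
    static final class Progress {
        final int seen;            // number of chain events seen so far
        final long lastUpdatedMs;  // timestamp of the most recent accepted event
        Progress(int seen, long lastUpdatedMs) {
            this.seen = seen;
            this.lastUpdatedMs = lastUpdatedMs;
        }
    }

    private static final String[] CHAIN = {"E1", "E2", "E3"};
    private final long timeoutMs;
    // Stand-in for the key/value store (assumption: one entry per key).
    private final TreeMap<String, Progress> store = new TreeMap<>();
    private final List<String> completed = new ArrayList<>();

    ChainTracker(long timeoutMs) {
        this.timeoutMs = timeoutMs;
    }

    /** Analogue of process(): advance the chain if this is the next expected event. */
    void process(String key, String eventType, long timestampMs) {
        Progress p = store.get(key);
        int seen = (p == null) ? 0 : p.seen;
        if (!CHAIN[seen].equals(eventType)) {
            return; // duplicate or out-of-order event: ignored in this sketch
        }
        if (p != null && timestampMs - p.lastUpdatedMs > timeoutMs) {
            return; // arrived too late: leave the stale entry for window() to report
        }
        if (seen + 1 == CHAIN.length) {
            store.remove(key);   // chain finished in time
            completed.add(key);
        } else {
            store.put(key, new Progress(seen + 1, timestampMs));
        }
    }

    /** Analogue of window(): report and drop chains that stalled for longer than the timeout. */
    List<String> window(long nowMs) {
        List<String> timedOut = new ArrayList<>();
        Iterator<Map.Entry<String, Progress>> it = store.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<String, Progress> e = it.next();
            if (nowMs - e.getValue().lastUpdatedMs > timeoutMs) {
                timedOut.add(e.getKey()); // an expected follow-up event never arrived
                it.remove();
            }
        }
        return timedOut;
    }

    /** Keys whose full chain was observed. */
    List<String> completed() {
        return completed;
    }
}
```

In a real job the scan would run every task.window.ms milliseconds and the timed-out keys would trigger the alert. Because a repeated event is simply ignored here, redelivery under the "at least once" guarantee does not change the result for this pattern.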
