Hi, Thanks all for answers and sub-topics. One sub-topic that puzzles me (still with the goal of building CEP) is how Samza key/value store scales. I have read couple of scary stories about project using Zookeeper reaching some limits, so I think the question is worth. How large Samza key/value store can be? How its size impacts overall performances? Is there some housekeeping to do, for example to clean orphan key/values?
Thanks, Alex. On Mon, Sep 2, 2013 at 5:20 PM, Jay Kreps <[email protected]> wrote: > Hi Tim, > > What we have in the way of documentation on storage is here: > > http://samza.incubator.apache.org/learn/documentation/0.7.0/container/state-management.html > > It's a bit sparse. If you help me understand your specific questions I will > make sure I cover them. > > -Jay > > > On Sat, Aug 31, 2013 at 1:58 PM, Timothy Chen <[email protected]> wrote: > > > Hi, > > > > I wonder if there are details about the new key/value store Samza > > provides? Especially the design and how it handles scale, consistency > > guarantees etc. > > > > Tim > > > > On Aug 31, 2013, at 12:56 PM, Alex The Rocker <[email protected]> > > wrote: > > > > > Chris, > > > > > > Thanks you very much for your detailed. > > > Another system for processing real-time data just came to my attention > > > (thanks to Kafka mailing list, again). > > > It's called Druid (more at: http://druid.io). > > > > > > While I now understand Samza advantages over Storm for building a CEP, > I > > am > > > wondering how Samza compares to Druid. > > > I guess I may not alone wondering about Samza vs. Druid, so you may > want > > to > > > add a Samza vs. Druid" item in Samza documenation :) > > > > > > Thanks, > > > Alex. > > > > > > > > > > > > > > > On Sun, Aug 25, 2013 at 5:26 PM, Chris Riccomini < > > [email protected]>wrote: > > > > > >> Hey Alex, > > >> > > >> As I understand it, the CEP pattern you describing is, "look for a > > series > > >> of events within some bounded time frame, and take an action based on > > the > > >> combination of events." You use an example of three events arriving > > within > > >> 10 minutes of each other, consecutively. Wikipedia uses a similar > > example > > >> (wedding bell event + man in suit event + woman in white dress event + > > >> rice thrown event = wedding) on their CEP page. > > >> > > >> This pattern can be implemented in Samza fairly easily using Samza's > > >> key/value store (or some other StorageEngine, if you choose to > implement > > >> it). It's best to use a key/value store for this use case, since the > > >> window might be quite long (10 minutes), and all events in the window > > >> might not fit in memory. If you use Samza's key/value store, you can > put > > >> each message (and a timestamp) into the key/value store as the > messages > > >> arrive. You can then implement the WindowableTask interface along with > > the > > >> StreamTask interface, and configure Samza to call window() on your > task > > >> every N seconds (say, task.window.ms=60000). The window method could > > then > > >> do a range query on the key/value store, and check for message chains > > >> (e.g. E1 -> E2 -> E3) that were last updated > 10 minutes ago. If an > > >> expected message was missing, you could then take some action (send an > > >> alert, or whatever). > > >> > > >> In general, when I think CEP, I think Esper ( > http://esper.codehaus.org/ > > ). > > >> You should be able to implement a lot of CEP/SQL type commands > (SELECT, > > >> JOIN, COUNT, SUM, DISTINCT, WHERE, GROUP BY, HAVING, WINDOW, ORDER, > etc) > > >> using Samza's StreamTask interface, and is state management > facilities. > > >> > > >> Beyond state management, most features in Samza enable CEP processing, > > in > > >> one way or another. From your perspective, you can look at Samza as > the > > >> underlying framework with which you might choose to implement a CEP > type > > >> system (think MapReduce is to Hive as Samza is to a CEP system). > > Specific > > >> things that help are its WindowableTask interface, the partitioning > > model > > >> (which lends itself to distributed joins and aggregation), and Samza's > > >> state management features. > > >> > > >> One thing to be aware of right now is Samza's "at least once" > messaging > > >> guarantee when failures occur (inherited from Kafka). You might > receive > > >> duplicate messages. This means you can potentially double count, if > > you're > > >> doing aggregation. In the example you give (E1, E2, E3), this > shouldn¹t > > be > > >> a problem. We have plans to provide exactly once messaging, but we > > haven't > > >> implemented the feature yet. > > >> > > >> Cheers, > > >> Chris > > >> > > >> On 8/24/13 12:05 PM, "Alex The Rocker" <[email protected]> wrote: > > >> > > >>> Hello, > > >>> > > >>> I just began to read about Samza, and I very excited about it (I was > > >>> warned > > >>> of its existence by Jay Kreps' post in Kafka users list, BTW). > > >>> > > >>> My first reaction is: are you guys using it at LinkedIn for > > applications > > >>> which lies in the CEP (Complex Event Processing) system domain? > > >>> > > >>> To be more specific, would stateful Samza tasks be used in order to > > >>> compute > > >>> complex states such as "event E1 is followed by E2 then by E3 with > less > > >>> than 10 minutes interval between each event" ? > > >>> > > >>> I was looking at Storm for CEP, but as pointed out in Samza Storm > page, > > >>> Storm leaves state management to the bolts code, whereas Samza has > > >>> "something". > > >>> > > >>> Beyond state management, what else would make Samza a good building > > block > > >>> for a CEP? Or a bad one? > > >>> > > >>> Thanks, > > >>> Alex. > > >> > > >> > > >
