Hi Svend! Thanks for your reply. Bookmarked your blog yesterday, but was hoping I wouldn't have to use Trident. I guess I have no choice.
I have some comments, so feel free to correct me if I'm wrong. I am using fieldsGrouping on my tuples so I think this fixes the problem you mention in your first paragraphs about assuming counting happens within 1 bolt on 1 machine. My understanding is that this grouping sends the same tuple to the same machine. Now I'm not sure if it's done using modulo (like the acks), or in a more consistent way. By consistent I mean, if you use mod then how to make sure that if you add more machines your mod result is not screwed up now. Also I think that even if I were to use shuffleGrouping and emitted the results (from all of my counting bolts) to 1 other bolt which tallied up the results it would be fine. Am I correct? I am assuming 1 thread for the final bolt which would of course bottleneck the system so I wouldn't do this. Will look into Trident now and comment back in a few days. Thanks again! Adrian From: Svend Vanderveken [mailto:[email protected]] Sent: December-12-13 1:13 PM To: [email protected] Subject: Re: persistence of bolts which do periodic aggregation Hi Adrian, Zookeeper does not keep any state we are developing as part of our topologies, that's up to us to do that. Also, keep in mind that holding the counter in memory is not very scalable since it assume that all counting all happen within one Bolt in one machine: what if you get plenty of traffic and decide your topology to scale horizontally (which is one of the main point of using Storm, IMHO)? It would become tricky in that case to coordinate all those in-memory counters happening in parallel in different machines. If you use Trident, Storm lets you process tuples per batch and re-group your counting operation for each word sequencialy always in the same place. Before counting words from a new batch of words, you can first read the counter state from DB, then do the in-memory counting, then go write to DB the new counter value after the batch. Storm actually even adds one supplementary layer of association to optimize even further: it lets us read plenty of previous counters as one shot in from DB, then lets you do your counting thing for each one individually, then lets you write one single write operation that updates plenty of counter values in DB. And yes, all that without any race condition :D I wrote a blog post about that a few months ago, and I love to see traffic to it, so feel free to click like crazy ^__^: http://svendvanderveken.wordpress.com/2013/07/30/scalable-real-time-state-update-with-storm/ Svend On Thu, Dec 12, 2013 at 6:20 PM, Adrian Mocanu <[email protected]<mailto:[email protected]>> wrote: Hi I've seen several simple aggregation examples on the web. Can't find one that answers my question though. I'm wondering if Zookeeper saves the states of bolts so if 1 aggregation bolt crashes, then when it restarts the worker it will start from previous state. I use acks (and might do batch processing as well.) For example, let's say I have to count every minute how many words of the same type I find and store them in a db. My bolt would keep counters for each work and at the end of every minute dump the counters it holds in memory to db. eg: input: The peanut is great. The ocean is great. Bolt state after input is processed: the:2 peanut:1 is:2 great:2 ocean:1 So if the bolt crashes before it commits to db the counters, does Zookeeper save that state? If not, then do you have suggestions/links on what the best way to do this is? Thanks -Adrian
