Hi Svend!
Thanks for your reply. Bookmarked your blog yesterday, but was hoping I 
wouldn't have to use Trident. I guess I have no choice.

I have some comments, so feel free to correct me if I'm wrong. I am using 
fieldsGrouping on my tuples so I think this fixes the problem you mention in 
your first paragraphs about assuming counting happens within 1 bolt on 1 
machine. My understanding is that this grouping sends the same tuple to the 
same machine. Now I'm not sure if it's done using modulo (like the acks), or in 
a more consistent way. By consistent I mean, if you use mod then how to make 
sure that if you add more machines your mod result is not screwed up now.
Also I think that even if I were to use shuffleGrouping and emitted the results 
(from all of my counting bolts) to 1 other bolt which tallied up the results it 
would be fine. Am I correct? I am assuming 1 thread for the final bolt which 
would of course bottleneck the system so I wouldn't do this.

Will look into Trident now and comment back in a few days.

Thanks again!
Adrian

From: Svend Vanderveken [mailto:[email protected]]
Sent: December-12-13 1:13 PM
To: [email protected]
Subject: Re: persistence of bolts which do periodic aggregation

Hi Adrian,

Zookeeper does not keep any state we are developing as part of our topologies, 
that's up to us to do that.

Also, keep in mind that holding the counter in memory is not very scalable 
since it assume that all counting all happen within one Bolt in one machine: 
what if you get plenty of traffic and decide your topology to scale 
horizontally (which is one of the main point of using Storm, IMHO)? It would 
become tricky in that case to coordinate all those in-memory counters happening 
in parallel in different machines.


If you use Trident, Storm lets you process tuples per batch and re-group your 
counting operation for each word sequencialy always in the same place. Before 
counting words from a new batch of words, you can first read the counter state 
from DB, then do the in-memory counting, then go write to DB the new counter 
value after the batch. Storm actually even adds one supplementary layer of 
association to optimize even further: it lets us read plenty of previous 
counters as one shot in from DB, then lets you do your counting thing for each 
one individually, then lets you write one single write operation that updates 
plenty of counter values in DB. And yes, all that without any race condition :D

I wrote a blog post about that a few months ago, and I love to see traffic to 
it, so feel free to click like crazy ^__^:

http://svendvanderveken.wordpress.com/2013/07/30/scalable-real-time-state-update-with-storm/



Svend




On Thu, Dec 12, 2013 at 6:20 PM, Adrian Mocanu 
<[email protected]<mailto:[email protected]>> wrote:
Hi
I've seen several simple aggregation examples on the web. Can't find one that 
answers my question though.
I'm wondering if Zookeeper saves the states of bolts so if 1 aggregation bolt 
crashes, then when it restarts the worker it will start from previous state. I 
use acks (and might do batch processing as well.)

For example, let's say I have to count every minute how many words of the same 
type I find and store them in a db.
My bolt would keep counters for each work and at the end of every minute dump 
the counters it holds in memory to db.

eg:
   input: The peanut is great. The ocean is great.
   Bolt state after input is processed:
     the:2
    peanut:1
    is:2
    great:2
    ocean:1

So if the bolt crashes before it commits to db the counters, does Zookeeper 
save that state?
If not, then do you have suggestions/links on what the best way to do this is?

Thanks
-Adrian


Reply via email to