Hi Adrian,

What's wrong with Trident ? :)

With "vanilla" Storm, indeed fieldGrouping provides functionality similar
to Trident groupBy (I guess there's some kind of modulo somewhere to map
grouped values to a list of machines, not sure how exactly though). And the
equivalent of Trident batch would be simply to hold in memory a bunch of
tuples and updating DB + acking them all together, after what you
completely reset you Bolt. Finally, when you bolt receives a tuple and is
in "brand new" (reset) state, start by reading the current counter from DB.
Also, keep track of the Storm transactionId in DB, so you do not update
counter twice with the same tuples (cf doc on Transactional and Opaque
spouts).

Good luck

S




On Thu, Dec 12, 2013 at 9:28 PM, Adrian Mocanu <[email protected]>wrote:

>  Hi Svend!
>
> Thanks for your reply. Bookmarked your blog yesterday, but was hoping I
> wouldn’t have to use Trident. I guess I have no choice.
>
>
>
> I have some comments, so feel free to correct me if I’m wrong. I am using
> fieldsGrouping on my tuples so I think this fixes the problem you mention
> in your first paragraphs about assuming counting happens within 1 bolt on 1
> machine. My understanding is that this grouping sends the same tuple to the
> same machine. Now I’m not sure if it’s done using modulo (like the acks),
> or in a more consistent way. By consistent I mean, if you use mod then how
> to make sure that if you add more machines your mod result is not screwed
> up now.
>
> Also I think that even if I were to use shuffleGrouping and emitted the
> results (from all of my counting bolts) to 1 other bolt which tallied up
> the results it would be fine. Am I correct? I am assuming 1 thread for the
> final bolt which would of course bottleneck the system so I wouldn’t do
> this.
>
>
>
> Will look into Trident now and comment back in a few days.
>
>
>
> Thanks again!
>
> Adrian
>
>
>
> *From:* Svend Vanderveken [mailto:[email protected]]
> *Sent:* December-12-13 1:13 PM
> *To:* [email protected]
> *Subject:* Re: persistence of bolts which do periodic aggregation
>
>
>
> Hi Adrian,
>
>
>
> Zookeeper does not keep any state we are developing as part of our
> topologies, that's up to us to do that.
>
>
>
> Also, keep in mind that holding the counter in memory is not very scalable
> since it assume that all counting all happen within one Bolt in one
> machine: what if you get plenty of traffic and decide your topology to
> scale horizontally (which is one of the main point of using Storm, IMHO)?
> It would become tricky in that case to coordinate all those in-memory
> counters happening in parallel in different machines.
>
>
>
>
>
> If you use Trident, Storm lets you process tuples per batch and re-group
> your counting operation for each word sequencialy always in the same place.
> Before counting words from a new batch of words, you can first read the
> counter state from DB, then do the in-memory counting, then go write to DB
> the new counter value after the batch. Storm actually even adds one
> supplementary layer of association to optimize even further: it lets us
> read plenty of previous counters as one shot in from DB, then lets you do
> your counting thing for each one individually, then lets you write one
> single write operation that updates plenty of counter values in DB. And
> yes, all that without any race condition :D
>
>
>
> I wrote a blog post about that a few months ago, and I love to see traffic
> to it, so feel free to click like crazy ^__^:
>
>
>
>
> http://svendvanderveken.wordpress.com/2013/07/30/scalable-real-time-state-update-with-storm/
>
>
>
>
>
>
>
> Svend
>
>
>
>
>
>
>
>
>
> On Thu, Dec 12, 2013 at 6:20 PM, Adrian Mocanu <[email protected]>
> wrote:
>
>  Hi
>
> I’ve seen several simple aggregation examples on the web. Can’t find one
> that answers my question though.
>
> I’m wondering if Zookeeper saves the states of bolts so if 1 aggregation
> bolt crashes, then when it restarts the worker it will start from previous
> state. I use acks (and might do batch processing as well.)
>
>
>
> For example, let’s say I have to count every minute how many words of the
> same type I find and store them in a db.
>
> My bolt would keep counters for each work and at the end of every minute
> dump the counters it holds in memory to db.
>
>
>
> eg:
>
>    input: The peanut is great. The ocean is great.
>
>    Bolt state after input is processed:
>
>      the:2
>
>     peanut:1
>
>     is:2
>
>     great:2
>
>     ocean:1
>
>
>
> *So if the bolt crashes before it commits to db the counters, does
> Zookeeper save that state? *
>
> If not, then do you have suggestions/links on what the best way to do this
> is?
>
>
>
> Thanks
>
> -Adrian
>
>
>
>
>

Reply via email to