Trident topology - High Process latency

2015-12-01 Thread Mohan Pandiyan
Hello, I have a trident drpc topology. I am trying to reduce the complete latency which is around 10 seconds for now. I see that the bottleneck looks like the process latency which is 100x more than the execute latency. It is my assumption that we don’t do explicit ack in Trident so I would exp

RE: Using Storm to parse emails and creates batches

2015-12-01 Thread Kalogeropoulos, Andreas
Thanks Stephen. Really appreciated. Kind Regards, Andréas Kalogéropoulos From: Stephen Powis [mailto:spo...@salesforce.com] Sent: Tuesday, December 01, 2015 12:45 PM To: user@storm.apache.org Subject: Re: Using Storm to parse emails and creates batches Yep, sounds like you got it..you'd want

Re: Writing file to storm hdfs

2015-12-01 Thread Gaurav Agarwal
Thanks, Aaron. On Dec 1, 2015 1:54 AM, "Aaron.Dossett" wrote: > Well, not all of the reasons were entirely unrelated: > - If data stopped flowing from Kafka completely then a rotation might not happen for a very long time and I wanted to guarantee time bounds on when I processed files.

Re: [Discussion] storm local-mode event object reuse bug

2015-12-01 Thread Zhang, Edward (GDI Hadoop)
In my opinion, it is not about immutability of an object. It is about the contract between storm framework and storm application. In this case, it looks like application code has to deep copy every object from input because it can’t be reused. I think that is also fine if the contract is that a

Re: [Discussion] storm local-mode event object reuse bug

2015-12-01 Thread Grant Overby (groverby)
Serialization isn't free. Skipping it where possible, even in a cluster, is worth doing to conserve CPU resources. Using immutable objects is cheaper. Assuming you're coding in Java, consider using ImmutableMap, ImmutableMap.Builder, and similar classes in the Guava library from Google.
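To make the immutability advice concrete, here is a minimal sketch using only the JDK. The class and method names (ImmutableEmitSketch, snapshot) are hypothetical and not part of Storm; Guava's ImmutableMap.copyOf would serve a similar purpose. It assumes the event is a simple String-to-String map.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration; not part of Storm or Guava.
final class ImmutableEmitSketch {

    // Copies the event and wraps the copy so that neither the emitter
    // nor a downstream consumer in the same JVM can modify it.
    static Map<String, String> snapshot(Map<String, String> event) {
        return Collections.unmodifiableMap(new HashMap<>(event));
    }

    public static void main(String[] args) {
        Map<String, String> event = new HashMap<>();
        event.put("subject", "hello");

        // This is the object that would be handed to the collector.
        Map<String, String> emitted = snapshot(event);

        // Reusing the original object later does not affect the snapshot.
        event.put("subject", "changed");
        System.out.println(emitted.get("subject"));

        try {
            emitted.put("subject", "tampered");
        } catch (UnsupportedOperationException e) {
            System.out.println("downstream mutation rejected");
        }
    }
}
```

The copy costs an allocation per emit, which is still likely cheaper than a full serialization round trip; whether that trade-off holds for a given topology is something to measure.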

Re: [Discussion] storm local-mode event object reuse bug

2015-12-01 Thread Nathan Leung
It is bypassed by design. As noted in https://storm.apache.org/apidocs/backtype/storm/task/OutputCollector.html, the emitted objects must be immutable. If you're intent on modifying them, be very careful. On Tue, Dec 1, 2015 at 4:28 AM, Stephen Powis wrote: > I believe anytime tuples are passe

Re: Using Storm to parse emails and creates batches

2015-12-01 Thread Stephen Powis
Yep, sounds like you got it. You'd want to use fields grouping and group on a field that contains the hash. Then every tuple with an identical hash in that field would get sent to the same bolt instance. On Tue, Dec 1, 2015 at 8:23 PM, Kalogeropoulos, Andreas < andreas.kalogeropou...@
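The routing idea behind grouping on a hash field can be sketched in a few lines of plain Java. This is an illustration of the principle only, not Storm's actual implementation; the class name, the chooseTask method, and the task count are hypothetical.

```java
// Hypothetical illustration of hash-based routing; not Storm's code.
final class FieldsGroupingSketch {

    // Maps the value of the grouping field to a task index in [0, numTasks).
    // Equal field values always yield the same index.
    static int chooseTask(String attachmentHash, int numTasks) {
        return Math.floorMod(attachmentHash.hashCode(), numTasks);
    }

    public static void main(String[] args) {
        int numTasks = 4; // assumed parallelism of the downstream bolt
        String[] hashes = {"a1b2", "c3d4", "a1b2"};
        for (String h : hashes) {
            System.out.println(h + " -> task " + chooseTask(h, numTasks));
        }
    }
}
```

Because the mapping depends only on the field value, the two tuples carrying "a1b2" land on the same task, which is what lets a single bolt instance see all duplicates of an attachment.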

RE: Using Storm to parse emails and creates batches

2015-12-01 Thread Kalogeropoulos, Andreas
Making sure that duplicates make it in the same XML file (third bolt). Kind Regards, Andréas Kalogéropoulos From: Stephen Powis [mailto:spo...@salesforce.com] Sent: Tuesday, December 01, 2015 11:59 AM To: user@storm.apache.org Subject: Re: Using Storm to parse emails and creates batches So you w

RE: Using Storm to parse emails and creates batches

2015-12-01 Thread Kalogeropoulos, Andreas
Hello Stephen, To make my example more realistic : The first bolt will analyze a list of 100 tuples, and the last bolt will probably wait for 10 000 lists of tuples before creating the XML. Kind Regards, Andréas Kalogéropoulos From: Kalogeropoulos, Andreas [mailto:andreas.kalogeropou...@emc.com

Re: Using Storm to parse emails and creates batches

2015-12-01 Thread Stephen Powis
So you want to eliminate duplicates or make sure that duplicates make it into the same XML file (third bolt)? On Tue, Dec 1, 2015 at 7:48 PM, Kalogeropoulos, Andreas < andreas.kalogeropou...@emc.com> wrote: > Hello Stephen, > > > > Imagine that the spout is providing me 300 000 emails per hour. >

RE: Using Storm to parse emails and creates batches

2015-12-01 Thread Kalogeropoulos, Andreas
Hello Stephen, Imagine that the spout is providing me 300 000 emails per hour. The first bolt will parse/analyze the information (from, to, cc, subject, object, date, hash of attachments, …) and will probably find the same hash for some attachments (someone forwarding an email). The last bolt

Re: Using Storm to parse emails and creates batches

2015-12-01 Thread Stephen Powis
I'm not sure I follow/understand your question or what you're trying to do. On Tue, Dec 1, 2015 at 7:28 PM, Kalogeropoulos, Andreas < andreas.kalogeropou...@emc.com> wrote: > You are right. Sorry for making you state the obvious ☺. > > > > Last question : If my spout has incoming information that

RE: Using Storm to parse emails and creates batches

2015-12-01 Thread Kalogeropoulos, Andreas
You are right. Sorry for making you state the obvious ☺. Last question : If my spout has incoming information that I want to have in the same last bolt (the one creating the XML) for deduplication logic, what is the best way to achieve this ? My instinct says to try to work with Fields groupin

Re: [Discussion] storm local-mode event object reuse bug

2015-12-01 Thread Stephen Powis
I believe anytime tuples are passed between bolts on the same jvm (either in local mode or in remote mode where the upstream and downstream bolt both reside on the same worker) serialization is bypassed by design. On Tue, Dec 1, 2015 at 1:46 PM, Edward Zhang wrote: > Hi Storm developers, > > Tod

Re: Using Storm to parse emails and creates batches

2015-12-01 Thread Stephen Powis
If you are using Storm's guaranteed message processing there is no need to 'persist' the collection anywhere other than in memory, i.e. List myListOfTuples = new ArrayList(); If the third bolt crashes and loses its in memo
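A minimal sketch of such an in-memory collection, flushed on a size or age threshold, might look like the following. It is plain Java with no Storm dependency; the class name, thresholds, and the add method are hypothetical. The assumption is that the calling bolt acks the buffered tuples only after a returned batch has been written out, so anything lost in a crash stays unacked and can be replayed.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical in-memory batch buffer; not part of Storm.
final class BatchBufferSketch<T> {
    private final int maxSize;
    private final long maxAgeMillis;
    private final List<T> pending = new ArrayList<>();
    private long oldestMillis;

    BatchBufferSketch(int maxSize, long maxAgeMillis) {
        this.maxSize = maxSize;
        this.maxAgeMillis = maxAgeMillis;
    }

    // Buffers one item. Returns the full batch when the size or age
    // threshold is reached, otherwise an empty list.
    List<T> add(T item, long nowMillis) {
        if (pending.isEmpty()) {
            oldestMillis = nowMillis; // age is measured from the first item
        }
        pending.add(item);
        boolean full = pending.size() >= maxSize;
        boolean stale = nowMillis - oldestMillis >= maxAgeMillis;
        if (full || stale) {
            List<T> batch = new ArrayList<>(pending);
            pending.clear();
            return batch; // caller writes the batch, then acks its tuples
        }
        return Collections.emptyList();
    }

    public static void main(String[] args) {
        BatchBufferSketch<String> buffer = new BatchBufferSketch<>(3, 60_000L);
        long now = System.currentTimeMillis();
        for (String email : new String[] {"e1", "e2", "e3"}) {
            List<String> batch = buffer.add(email, now);
            if (!batch.isEmpty()) {
                System.out.println("flush " + batch.size() + " items");
            }
        }
    }
}
```

Note that the age check here only runs when a new item arrives; a real bolt would also need a periodic trigger to flush a batch that has gone quiet.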

Regarding hdfs state in trident

2015-12-01 Thread Gaurav Agarwal
Hello, if we use HDFS state in a Trident topology running on 4 nodes, will it help us maintain the same state of files or tuples across multiple nodes?

RE: Using Storm to parse emails and creates batches

2015-12-01 Thread Kalogeropoulos, Andreas
Hello Stephen, I think you got it correctly. Thanks a lot for the idea. If you have seen limitations, please send the disclaimers ☺. For example, how did you handle persistence of this collection? If the third bolt failed while populating the collection (size and time have not been reached) we j

RE: Using Storm to parse emails and creates batches

2015-12-01 Thread Kalogeropoulos, Andreas
Hello Nick, I think you are right. It is probably the state that I am not taking into consideration in my logic. And it is probably only in the last step. The first is just “extract”, so as you say, I need a “filtering” bolt to just take out what I need. The second is probably going to read from