Hello Stephen,

To make my example more realistic: the first bolt will analyze a list of 100 tuples, and the last bolt will probably wait for 10,000 lists of tuples before creating the XML.
Kind Regards,
Andréas Kalogéropoulos

From: Kalogeropoulos, Andreas [mailto:andreas.kalogeropou...@emc.com]
Sent: Tuesday, December 01, 2015 11:48 AM
To: user@storm.apache.org
Subject: RE: Using Storm to parse emails and creates batches

Hello Stephen,

Imagine that the spout is providing me with 300,000 emails per hour. The first bolt will parse/analyze the information (from, to, cc, subject, object, date, hash of attachments, …), and will probably find the same hash for some attachments (e.g. someone forwarding an email). The last bolt will create an XML based on all this information, but if I can get the tuples containing the same attachment (based on its hash) into the same XML, I can apply a dedup logic: having multiple lines in my XML pointing to the same file. Does this make more sense?

Kind Regards,
Andréas Kalogéropoulos

From: Stephen Powis [mailto:spo...@salesforce.com]
Sent: Tuesday, December 01, 2015 11:36 AM
To: user@storm.apache.org
Subject: Re: Using Storm to parse emails and creates batches

I'm not sure I follow/understand your question or what you're trying to do.

On Tue, Dec 1, 2015 at 7:28 PM, Kalogeropoulos, Andreas <andreas.kalogeropou...@emc.com> wrote:

You are right. Sorry for making you state the obvious ☺. Last question: if my spout has incoming information that I want to arrive at the same instance of the last bolt (the one creating the XML) for deduplication logic, what is the best way to achieve this? My instinct says to work with fields grouping and the correct key (probably the conversation, since I am working with emails).
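The fields-grouping instinct above can be sketched outside the Storm API: a fields grouping routes each tuple to a task as a function of the grouped field's hash, so tuples sharing a key (e.g. an attachment hash or conversation ID) always reach the same bolt instance, where dedup becomes a local lookup. The names below (`FieldsGroupingSketch`, `routeByField`, `NUM_TASKS`) are illustrative, not Storm classes.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of fields-grouping routing: equal keys always map to the same
// task index, which is what makes purely local dedup in the last bolt work.
public class FieldsGroupingSketch {
    static final int NUM_TASKS = 4; // parallelism of the XML-building bolt

    // Same idea as Storm's fields grouping: task = hash(key) mod task count.
    static int routeByField(String key) {
        return Math.abs(key.hashCode()) % NUM_TASKS;
    }

    public static void main(String[] args) {
        String h1 = "sha1:77de68daecd823babbb58edb1c8e14d7106e83bb";
        String h2 = "sha1:77de68daecd823babbb58edb1c8e14d7106e83bb"; // forwarded copy

        // Both copies are routed to the same task...
        System.out.println(routeByField(h1) == routeByField(h2)); // prints true

        // ...so inside that one bolt instance, dedup is just a local map
        // from attachment hash to the file already written for it.
        Map<String, String> seen = new HashMap<>();
        seen.putIfAbsent(h1, "attachment-001.bin");
        seen.putIfAbsent(h2, "attachment-002.bin"); // ignored: hash already seen
        System.out.println(seen.get(h2)); // prints attachment-001.bin
    }
}
```

Grouping on the conversation ID instead of the attachment hash works the same way; the trade-off is that a hot key (one huge conversation) can skew load onto a single task.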
Kind Regards,
Andréas Kalogéropoulos

From: Stephen Powis [mailto:spo...@salesforce.com]
Sent: Tuesday, December 01, 2015 10:27 AM
To: user@storm.apache.org
Subject: Re: Using Storm to parse emails and creates batches

If you are using Storm's guaranteed message processing (http://storm.apache.org/documentation/Guaranteeing-message-processing.html), there is no need to 'persist' the collection anywhere other than in memory, i.e.:

List<Tuple> myListOfTuples = new ArrayList<Tuple>();

If the third bolt crashes and loses its in-memory collection, then after Config.TOPOLOGY_MESSAGE_TIMEOUT_SECS (http://storm.apache.org/javadoc/apidocs/backtype/storm/Config.html#TOPOLOGY_MESSAGE_TIMEOUT_SECS) the tuples will time out and be replayed through your entire Storm topology, and your collection will be repopulated.

On Tue, Dec 1, 2015 at 5:15 PM, Kalogeropoulos, Andreas <andreas.kalogeropou...@emc.com> wrote:

Hello Stephen,

I think you got it correctly. Thanks a lot for the idea. If you have seen limitations, please send the disclaimers ☺. For example, how did you handle persistence of this collection? If the third bolt fails while populating the collection (before the size or time threshold has been reached), we just lose everything, so I need a status loopback of what was really output. Right? Of course, if you can send me the code of your third bolt (especially the collection handling), I'll be grateful. In any case, thanks a lot for your help; even without the code, you gave me excellent advice, and now I can start building something.
Kind Regards,
Andréas Kalogéropoulos

From: Stephen Powis [mailto:spo...@salesforce.com]
Sent: Monday, November 30, 2015 5:55 PM
To: user@storm.apache.org
Subject: Re: Using Storm to parse emails and creates batches

From what I understand from your description, you want bolt 3 to collect results from multiple tuples and build a single XML for them. We've done this by essentially doing the following:

Bolt 3 has a collection of tuples. As a tuple comes in, we add it to the collection and check the size of the collection. Once the size of the collection exceeds some number, we process all of the tuples in one go, and then ack all of them after the processing completes.

Building on that, we've implemented an additional constraint on time: if the collection size > N OR we've waited more than X seconds, process the batch. This way your output won't stall if your topology has a lull in ingested data.

And lastly, there's a corner case: say 10 tuples come in and get held by our collection, but then no other tuples arrive for a long period. If no tuples enter, the size and timeout checks are never executed, and your bolt will hold onto those tuples for a long time (potentially causing timeouts). To handle this, we made use of tick tuples. Tick tuples essentially allow you to send a special tuple to your bolt every Y seconds. We use that to trigger the time-constraint check on a regular basis (e.g. a tick tuple every 1, 5, or 10 seconds).

On Tue, Dec 1, 2015 at 1:42 AM, Kalogeropoulos, Andreas <andreas.kalogeropou...@emc.com> wrote:

Hello,

I want to use Storm to do three things:
1. Parse email data (from / to / cc / subject) from an incoming SMTP source
2. Add additional information (based on the sender's email)
3.
Create an XML based on this data, to inject into another solution

The only issue: I want steps 1 (and 2) to be as fast as possible, so creating the maximum number of bolts/tasks possible, but I want the XML to be as big as possible, so gathering the output of multiple bolts. In this logic, if I have 100 emails per second in the original input, I would want steps 1 and 2 to each work on the smallest number of emails, to go faster, but I still want an XML that represents 10,000+ emails at the end. I can't think of a topology to address this. Can someone give me some pointers to the best way to handle this?

Kind Regards,
Andréas Kalogéropoulos
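The size-or-time batching that Stephen describes above (flush when the collection is full OR too old, with tick tuples covering quiet periods) can be sketched independently of the Storm API. Everything here is illustrative: `Batcher`, `onTick`, and `flushHandler` are stand-ins for the bolt's collection, the tick-tuple branch of `execute()`, and the build-XML-then-ack step.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Sketch of the size-or-time batching logic from the thread above.
// add() models a normal tuple arriving; onTick() models a tick tuple,
// which re-runs the age check even when no data is flowing.
public class Batcher<T> {
    private final int maxSize;
    private final long maxAgeMillis;
    private final Consumer<List<T>> flushHandler; // e.g. build the XML, then ack all tuples
    private final List<T> buffer = new ArrayList<>();
    private long firstItemAt = -1; // timestamp of the oldest buffered item

    public Batcher(int maxSize, long maxAgeMillis, Consumer<List<T>> flushHandler) {
        this.maxSize = maxSize;
        this.maxAgeMillis = maxAgeMillis;
        this.flushHandler = flushHandler;
    }

    // Called for every normal tuple.
    public void add(T item, long nowMillis) {
        if (buffer.isEmpty()) firstItemAt = nowMillis;
        buffer.add(item);
        maybeFlush(nowMillis);
    }

    // Called for every tick tuple, so a lull in traffic still flushes.
    public void onTick(long nowMillis) {
        maybeFlush(nowMillis);
    }

    private void maybeFlush(long nowMillis) {
        boolean full = buffer.size() >= maxSize;
        boolean old = !buffer.isEmpty() && nowMillis - firstItemAt >= maxAgeMillis;
        if (full || old) {
            flushHandler.accept(new ArrayList<>(buffer)); // process batch; ack only here
            buffer.clear();
            firstItemAt = -1;
        }
    }

    public static void main(String[] args) {
        List<List<String>> flushes = new ArrayList<>();
        Batcher<String> b = new Batcher<>(3, 5000, flushes::add);
        b.add("mail-1", 0);
        b.add("mail-2", 1000);
        b.add("mail-3", 2000); // size threshold reached: first flush
        b.add("mail-4", 3000);
        b.onTick(9000);        // age threshold reached: second flush
        System.out.println(flushes.size()); // prints 2
    }
}
```

Because tuples are only acked inside the flush handler, a crash before a flush lets Storm's message timeout replay the unacked tuples, which is exactly the repopulation behaviour described earlier in the thread; the time threshold just needs to stay comfortably below TOPOLOGY_MESSAGE_TIMEOUT_SECS.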