Best practice to persist data in multiple TridentState

2014-05-02 Thread Laurent Thoulon
Hi, 

What would you say is the best way to persist data to multiple states? 
Currently I have 3 options in mind: 

1- Process data and use the stream to send data to both states 
Stream stream = ...each...filter... 
stream.partitionPersist(state1, ...) 
stream.partitionPersist(state2, ...) 

2- Process data and chain the persists 
Stream stream = ...each...filter... 
stream.partitionPersist(state1, ...) 
      .newValuesStream() 
      .partitionPersist(state2, ...) 

3- Do a topology for each state; they would all do mostly the same thing 
except for the persist part. 

My main concerns here are handling failures and efficiency. 

In my use case I actually have 3 states. Two of them can store in a 
non-transactional way, and the third should be opaque transactional but 
actually can't be, as it's just an API call that doesn't recognize 
duplicates. 
That's no big deal as long as we can make sure it's not bound to the 
failures of the other states (meaning that if another state fails, we're 
sure this one hasn't processed data yet). 

This makes option n°1 a bit tricky, as I'm never sure of the order in 
which the states will be processed. Or is there a way to be sure? 
Option 2 would do, I guess, but I'd have to pass along through the first 
state all the data needed for the second. Potentially I would like to 
filter the tuples that go to state 1 or state 2. I would then have to 
write my own updater that uses a filter for the first persist, so that 
it doesn't send everything to the state but still emits everything in 
the end. 
Option 3 would also do, but it wouldn't be that efficient: reading my 
spout twice and processing data the same way in both topologies up until 
the persist part. 
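For option 2, a rough sketch of such a filtering updater against the 
Trident API (MyState, its write() method and the shouldPersist() 
predicate are placeholders I made up, not anything from an actual 
topology):

```java
import java.util.ArrayList;
import java.util.List;

import storm.trident.operation.TridentCollector;
import storm.trident.state.BaseStateUpdater;
import storm.trident.tuple.TridentTuple;

// Writes only the tuples that pass shouldPersist() to the first state,
// but re-emits every incoming tuple so newValuesStream() still carries
// the full stream on to the second partitionPersist.
public class FilteringUpdater extends BaseStateUpdater<MyState> {

    @Override
    public void updateState(MyState state, List<TridentTuple> tuples,
                            TridentCollector collector) {
        List<TridentTuple> toPersist = new ArrayList<TridentTuple>();
        for (TridentTuple t : tuples) {
            if (shouldPersist(t)) {
                toPersist.add(t);
            }
        }
        state.write(toPersist); // however your state persists a batch

        // Pass everything downstream, filtered or not.
        for (TridentTuple t : tuples) {
            collector.emit(t.getValues());
        }
    }

    private boolean shouldPersist(TridentTuple t) {
        return true; // your filtering predicate goes here
    }
}
```

Chained as in option 2, that would look roughly like 
stream.partitionPersist(state1Factory, fields, new FilteringUpdater(), 
fields).newValuesStream().partitionPersist(state2Factory, fields, 
new State2Updater()): state 2 then sees every tuple while state 1 only 
stores the filtered subset.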

Any ideas on the best way to handle this? 
Thanks 


Regards 
Laurent 


Kafka spout connection reset

2014-05-02 Thread Carlos Rodriguez
Hi,

I'm using storm 0.9.1 and kafka 0.8 with the kafka spout provided by
wurstmeister at https://github.com/wurstmeister/storm-kafka-0.8-plus.

I don't really know if I should post this in this mailing list or in
kafka's, but this only happens when using storm.

Once every 4-5 secs I get this error in every Kafka server.log file
across the cluster.

[2014-05-02 11:28:27,887] ERROR Closing socket for /192.168.101.202 because
of error (kafka.network.Processor)
java.io.IOException: Connection reset by peer
at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
at sun.nio.ch.SocketDispatcher.read(Unknown Source)
at sun.nio.ch.IOUtil.readIntoNativeBuffer(Unknown Source)
at sun.nio.ch.IOUtil.read(Unknown Source)
at sun.nio.ch.SocketChannelImpl.read(Unknown Source)
at kafka.utils.Utils$.read(Utils.scala:375)
at kafka.network.BoundedByteBufferReceive.readFrom(BoundedByteBufferReceive.scala:54)
at kafka.network.Processor.read(SocketServer.scala:347)
at kafka.network.Processor.run(SocketServer.scala:245)
at java.lang.Thread.run(Unknown Source)

As you can imagine, I have lots of log files full of these errors.

I've tried with this kafka spout too:
https://github.com/brndnmtthws/storm-kafka-0.8-plus, which is a fork of
wurstmeister's, but the problem persists.

Have any of you had this problem before?
Any ideas?

Thank you! :)


How to delimit persistent aggregate output

2014-05-02 Thread Larry Palmer
I'm kind of new to Trident, and was looking for some help making
persistentAggregates group values over a defined set of tuples.

groupBy() followed by a regular (combiner) aggregate accumulates values
for the duration of a batch, then outputs them all onto an output stream,
which is very convenient.

I'd like to make a persistentAggregate behave similarly, except that it
aggregates across batches, then outputs all its aggregates onto an output
stream when some external signal occurs, and then clears its database to
begin the next aggregation period.

One approach would be to have the spout put a "Done" flag in each tuple,
e.g. emit tuples like (DIMENSION_1, DIMENSION_2, METRIC_TO_BE_AGGREGATED,
DONE), where it's doing .groupBy(new Fields("DIMENSION_1",
"DIMENSION_2")), and the final tuple of each aggregation period has the
"Done" flag set. Or, maybe better, a final batch would be sent consisting
of a single tuple with null values for all the other columns and the
"Done" flag set, to indicate that all previous batches are complete.

The problem with that is I'm not sure how to get the tuple with the Done
flag to go to all the partitions of the persistentAggregate - is there any
way to broadcast a specific tuple to all partitions?
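Trident's Stream does expose a broadcast() repartitioning operation that
replicates every tuple to all tasks of the next operation, so one option
may be to route the "Done" tuple through it into a custom
partitionPersist that flushes the aggregates. A hypothetical sketch
(DoneFilter, FlushUpdater and the field names are inventions for
illustration; I haven't verified how this composes with a grouped
persistentAggregate):

```java
// A second stream off the same spout: keep only the "done" control
// tuples, replicate them to every partition, and let a custom updater
// flush/emit that partition's accumulated aggregates when one arrives.
topology.newStream("control", spout)
        .each(new Fields("done"), new DoneFilter())   // drop normal tuples
        .broadcast()                                  // copy to all partitions
        .partitionPersist(stateFactory, new Fields("done"),
                          new FlushUpdater());        // flush + clear the DB
```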

Another approach would be to split another stream off the spout that
watches for the "Done" flag and generates a state query when it sees it.
But it doesn't appear to be possible for a single state query to
generate more than a single tuple of output: it seems to want to be
given a list of keys to query for and return the tuples for those keys.
I don't see a way for a state query to dump the whole DB.

Is there any way to do what I'm trying to do?


How resilient is Storm w/o supervision

2014-05-02 Thread Albert Chu
I'm attempting to run Storm on a platform that I don't have root on.  So
I won't be able to run it under Redhat's supervisord that's already
installed.

How resilient are the Storm daemons by themselves?  Are they reasonably
resilient or are they programmed to not handle even relatively simple
errors?

I should probably add that this probably wouldn't be run in a production
environment. I'm just trying to understand whether the documentation
writers are saying "you should really do this for production" or "it
won't work if you don't do this."

Thanks,

Al

-- 
Albert Chu
ch...@llnl.gov
Computer Scientist
High Performance Systems Division
Lawrence Livermore National Laboratory




Re: How resilient is Storm w/o supervision

2014-05-02 Thread P. Taylor Goetz
I don’t think you need root to run supervisord:

http://supervisord.org/running.html
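
For what it's worth, a minimal per-user supervisord.conf might look
something like this (paths are illustrative; the point is that the
socket, log and pid files all live under your home directory, so no
root is needed):

```ini
; run with: supervisord -c /home/me/storm/supervisord.conf
[unix_http_server]
file=%(here)s/supervisor.sock

[supervisord]
logfile=%(here)s/supervisord.log
pidfile=%(here)s/supervisord.pid

[rpcinterface:supervisor]
supervisor.rpcinterface_factory = supervisor.rpcinterface:make_main_rpcinterface

[supervisorctl]
serverurl=unix://%(here)s/supervisor.sock

[program:storm-supervisor]
command=/home/me/storm/bin/storm supervisor
autorestart=true
```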

If you’re just testing something out, and don’t mind your cluster going down, 
then running without supervision is okay. But I would NEVER suggest someone run 
Storm’s daemons without supervision in a production environment.

- Taylor

On May 2, 2014, at 2:29 PM, Albert Chu ch...@llnl.gov wrote:

 I'm attempting to run Storm on a platform that I don't have root on.  So
 I won't be able to run it under Redhat's supervisord that's already
 installed.
 
 How resilient are the Storm daemons by themselves?  Are they reasonably
 resilient or are they programmed to not handle even relatively simple
 errors?
 
 I should probably say, this probably wouldn't be run in a production
 environment.  Just trying to understand if the documentation writers are
 saying, you should really do this for production or it won't work if
 you don't do this.
 
 Thanks,
 
 Al
 
 -- 
 Albert Chu
 ch...@llnl.gov
 Computer Scientist
 High Performance Systems Division
 Lawrence Livermore National Laboratory
 
 





Re: How resilient is Storm w/o supervision

2014-05-02 Thread Michael Rose
In practice, we rarely have issues with our Storm processes. We have
Nimbus/supervisors that never restart for months. But when they do have
issues, it's very nice to have them under supervision. Sometimes it's not
even things under their control (e.g. OOM during high memory usage
scenarios).

For development, there shouldn't be any issues foregoing supervision.

Michael Rose (@Xorlev https://twitter.com/xorlev)
Senior Platform Engineer, FullContact http://www.fullcontact.com/
mich...@fullcontact.com


On Fri, May 2, 2014 at 12:41 PM, P. Taylor Goetz ptgo...@gmail.com wrote:

 I don’t think you need root to run supervisord:

 http://supervisord.org/running.html

 If you’re just testing something out, and don’t mind your cluster going
 down, then running without supervision is okay. But I would NEVER suggest
 someone run Storm’s daemons without supervision in a production environment.

 - Taylor

 On May 2, 2014, at 2:29 PM, Albert Chu ch...@llnl.gov wrote:

  I'm attempting to run Storm on a platform that I don't have root on.  So
  I won't be able to run it under Redhat's supervisord that's already
  installed.
 
  How resilient are the Storm daemons by themselves?  Are they reasonably
  resilient or are they programmed to not handle even relatively simple
  errors?
 
  I should probably say, this probably wouldn't be run in a production
  environment.  Just trying to understand if the documentation writers are
  saying, you should really do this for production or it won't work if
  you don't do this.
 
  Thanks,
 
  Al
 
  --
  Albert Chu
  ch...@llnl.gov
  Computer Scientist
  High Performance Systems Division
  Lawrence Livermore National Laboratory