Best practice to persist data to multiple TridentStates
Hi,

What would you say is the best way to persist data to multiple states? Currently I have three options in mind:

1. Process the data and use the same stream to send it to both states:

   Stream stream = ... // each ... filter ...
   stream.partitionPersist(state1, ...)
   stream.partitionPersist(state2, ...)

2. Process the data and chain the persists:

   Stream stream = ... // each ... filter ...
   stream.partitionPersist(state1, ...)
         .newValuesStream()
         .partitionPersist(state2, ...)

3. Build a separate topology for each state, where each topology does mostly the same work except for the persist part.

My main concerns here are failure handling and efficiency. In my use case I actually have three states. Two of them can be stored non-transactionally; the third should be opaque transactional, but can't be, since it's just an API call that doesn't recognize duplicates. That's no big deal as long as it isn't bound to the failures of the other states (meaning that if another state fails, we're sure this one hasn't processed the data yet).

That makes option 1 a bit tricky, as I'm never sure of the order in which the states will be processed. Or is there a way to be sure?

Option 2 would do, I guess, but I'd have to pass along through the first state all the data needed for the second. I'd also potentially like to filter which tuples go to state 1 or state 2, so I'd have to write my own updater that applies a filter for the first persist, so that it doesn't send everything to that state but still emits everything at the end.

Option 3 would also do, but it wouldn't be as efficient: reading my spout twice and processing the data the same way in both topologies up until the persist part.

Any ideas on the best way to handle this?

Thanks

Regards
Laurent
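To make option 2 concrete, here is a minimal sketch of the chained persists. EventSpout, State1Factory/State1Updater and State2Factory/State2Updater are placeholder names, not anything from the question; the point is that newValuesStream() carries only what the first updater emits, so the tuples only reach the second state after they have passed through the first updater.

// Classes from storm.trident.* and backtype.storm.tuple.Fields (Storm 0.9.x).
TridentTopology topology = new TridentTopology();

Stream stream = topology.newStream("events", new EventSpout());
// ... each() / filter() processing elided ...

// First persist: the updater must re-emit every field the second state needs.
TridentState first = stream.partitionPersist(
        new State1Factory(),
        new Fields("key", "value"),
        new State1Updater(),
        new Fields("key", "value"));

// Second persist: fed only by what State1Updater emitted for this batch.
first.newValuesStream()
     .partitionPersist(
        new State2Factory(),
        new Fields("key", "value"),
        new State2Updater());

Whether this gives you the cross-state failure isolation you want still depends on how each state commits, but it does fix the processing order within a batch, which option 1 does not.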
Kafka spout connection reset
Hi,

I'm using Storm 0.9.1 and Kafka 0.8 with the Kafka spout provided by wurstmeister at https://github.com/wurstmeister/storm-kafka-0.8-plus. I don't really know if I should post this in this mailing list or in Kafka's, but it only happens when using Storm. Once every 4-5 seconds I get this error in every Kafka server.log file across the cluster.

[2014-05-02 11:28:27,887] ERROR Closing socket for /192.168.101.202 because of error (kafka.network.Processor)
java.io.IOException: Connection reset by peer
        at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
        at sun.nio.ch.SocketDispatcher.read(Unknown Source)
        at sun.nio.ch.IOUtil.readIntoNativeBuffer(Unknown Source)
        at sun.nio.ch.IOUtil.read(Unknown Source)
        at sun.nio.ch.SocketChannelImpl.read(Unknown Source)
        at kafka.utils.Utils$.read(Utils.scala:375)
        at kafka.network.BoundedByteBufferReceive.readFrom(BoundedByteBufferReceive.scala:54)
        at kafka.network.Processor.read(SocketServer.scala:347)
        at kafka.network.Processor.run(SocketServer.scala:245)
        at java.lang.Thread.run(Unknown Source)

As you can imagine, I have lots of log files full of these errors. I've tried this Kafka spout too: https://github.com/brndnmtthws/storm-kafka-0.8-plus, which is a fork of wurstmeister's, but the problem persists.

Have any of you had this problem before? Any ideas?

Thank you! :)
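For context, that spout is normally wired up along these lines; the ZooKeeper hosts, topic, zkRoot and consumer id below are placeholders, not values from the setup above:

// Classes from storm.kafka.* and backtype.storm.* (storm-kafka-0.8-plus era).
BrokerHosts hosts = new ZkHosts("zk1:2181,zk2:2181");     // placeholder ZooKeeper ensemble
SpoutConfig cfg = new SpoutConfig(hosts, "events",        // placeholder topic
                                  "/kafka-spout",         // placeholder zkRoot
                                  "my-consumer");         // placeholder consumer id
cfg.scheme = new SchemeAsMultiScheme(new StringScheme()); // decode messages as strings

TopologyBuilder builder = new TopologyBuilder();
builder.setSpout("kafka-spout", new KafkaSpout(cfg), 4);  // parallelism is illustrative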
How to delimit persistent aggregate output
I'm kind of new to Trident, and was looking for some help making persistentAggregates group values over a defined set of tuples.

groupBy() followed by a regular (combiner) aggregate accumulates values for the duration of a batch, then outputs them all onto an output stream, which is very convenient. I'd like a persistentAggregate to behave similarly, except that it aggregates across batches, outputs all its aggregates onto an output stream when some external signal occurs, and then clears its database to begin the next aggregation period.

One approach would be to have the spout put a Done flag in each tuple, e.g. emit tuples like (DIMENSION_1, DIMENSION_2, METRIC_TO_BE_AGGREGATED, DONE), where the topology does .groupBy(new Fields(DIMENSION_1, DIMENSION_2)) and the final tuple of each aggregation period has the Done flag set. Or, maybe better, a final batch would be sent consisting of a single tuple with null values for all the other columns and the Done flag set, to indicate that all previous batches are complete. The problem with that is I'm not sure how to get the tuple with the Done flag to reach all the partitions of the persistentAggregate - is there any way to broadcast a specific tuple to all partitions?

Another approach would be to split another stream off the spout that watches for the Done flag and generates a state query when it sees it. But it doesn't appear to be possible for a single state query to generate more than a single tuple of output - it seems to want to be given a list of keys to query for and return the tuples for those keys - I don't see a way for a state query to dump the whole DB.

Is there any way to do what I'm trying to do?
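On the broadcast question specifically: Trident streams have a broadcast() repartitioning operation that sends each tuple to every partition of the next operation. A rough sketch of the marker-tuple approach is below; MetricSpout, DoneFlagFilter (a Filter that keeps only the marker tuple) and DumpAndClearUpdater (a StateUpdater that would emit and reset the aggregates) are hypothetical placeholder classes, while MemoryMapState and Sum are standard Trident classes. The caveat is that the flushing updater would need to talk to the same backing store as the persistentAggregate, which this in-memory sketch glosses over.

// Classes from storm.trident.*, storm.trident.testing.MemoryMapState,
// storm.trident.operation.builtin.Sum and backtype.storm.tuple.Fields.
TridentTopology topology = new TridentTopology();
Stream events = topology.newStream("metrics", new MetricSpout());

// Cross-batch aggregation keyed by the two dimensions.
TridentState totals = events
        .groupBy(new Fields("DIMENSION_1", "DIMENSION_2"))
        .persistentAggregate(new MemoryMapState.Factory(),
                             new Fields("METRIC_TO_BE_AGGREGATED"),
                             new Sum(), new Fields("total"));

// Side branch for the marker tuple: broadcast() repartitions so a copy of
// every tuple that survives the filter reaches every downstream partition.
events.each(new Fields("DONE"), new DoneFlagFilter())
      .broadcast()
      .partitionPersist(new MemoryMapState.Factory(),
                        new Fields("DONE"),
                        new DumpAndClearUpdater());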
How resilient is Storm w/o supervision
I'm attempting to run Storm on a platform that I don't have root on, so I won't be able to run it under Red Hat's supervisord that's already installed.

How resilient are the Storm daemons by themselves? Are they reasonably resilient, or are they programmed not to handle even relatively simple errors?

I should probably say that this probably wouldn't be run in a production environment. I'm just trying to understand whether the documentation writers are saying "you should really do this for production" or "it won't work if you don't do this."

Thanks,

Al
--
Albert Chu
ch...@llnl.gov
Computer Scientist
High Performance Systems Division
Lawrence Livermore National Laboratory
Re: How resilient is Storm w/o supervision
I don’t think you need root to run supervisord: http://supervisord.org/running.html

If you’re just testing something out, and don’t mind your cluster going down, then running without supervision is okay. But I would NEVER suggest someone run Storm’s daemons without supervision in a production environment.

- Taylor

On May 2, 2014, at 2:29 PM, Albert Chu <ch...@llnl.gov> wrote:
> How resilient are the Storm daemons by themselves? Are they reasonably
> resilient or are they programmed to not handle even relatively simple errors?
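As one concrete (illustrative, not prescriptive) example of such a non-root setup, a per-user supervisord configuration for the Storm daemons could contain program sections like the following; program names, the user and the assumption that the storm script is on PATH are all placeholders:

; ~/supervisord.conf, started with: supervisord -c ~/supervisord.conf
[program:storm-nimbus]
command=storm nimbus
autostart=true
autorestart=true        ; restart the daemon if it dies

[program:storm-supervisor]
command=storm supervisor
autostart=true
autorestart=true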
Re: How resilient is Storm w/o supervision
In practice, we rarely have issues with our Storm processes. We have Nimbus/supervisors that go months without a restart. But when they do have issues, it's very nice to have them under supervision. Sometimes it's not even something under their control (e.g. an OOM during high memory usage scenarios).

For development, there shouldn't be any issue foregoing supervision.

Michael Rose (@Xorlev https://twitter.com/xorlev)
Senior Platform Engineer, FullContact http://www.fullcontact.com/
mich...@fullcontact.com

On Fri, May 2, 2014 at 12:41 PM, P. Taylor Goetz <ptgo...@gmail.com> wrote:
> If you’re just testing something out, and don’t mind your cluster going down,
> then running without supervision is okay. But I would NEVER suggest someone
> run Storm’s daemons without supervision in a production environment.