Puppet module for deploying Storm released
Hi everyone, I have released a Puppet module to deploy Storm 0.9 in case anyone is interested. The module uses Puppet parameterized classes and as such decouples code (Puppet manifests) from configuration data -- hence you can use Puppet Hiera to configure the way Storm is deployed without having to write or fork/modify Puppet manifests. The module is available under the Apache v2 license. Any code contributions, bug reports, etc. are of course very welcome. The module including docs and examples is available at: https://github.com/miguno/puppet-storm Enjoy! Michael
RE: Storm Applications
Using normal Storm, any bolt can output to anything at any time, since each bolt runs arbitrary code. So a bolt in the middle of a topology can write to a database, a file, or anything else you need. It will likely be the last bolt in the topology, but it doesn't have to be. If you use Trident, you use specific abstractions to read and write data: to read, a StateFactory and a QueryFunction, and to write, a StateFactory with a StateUpdater. If you want to read data from Flume, you'll have to write a spout to pull data from Flume and emit it into a topology. Start with the IRichSpout interface for normal Storm, or ITridentSpout for Trident. SimonC From: P lva [mailto:ruvi...@gmail.com] Sent: 26 February 2014 02:44 To: user@storm.incubator.apache.org Subject: Storm Applications Hello Everyone, I came across Storm recently and I'm trying to understand it better. Storm, unlike Flume, doesn't really have any code for a sink. I read somewhere that Storm is a real-time stream processing engine where you don't expect data to land anywhere. What kind of situation would that be? One example I envision is a situation where you only want to maintain counters without the actual data itself. Is this right? If yes, I'm assuming that these counters have to be updated in a database. How does this affect performance? Can I route Flume streams through a Storm cluster to compute the counters and store them in HBase (instead of going Flume -> Hive -> top-10 query), effectively decreasing the number of MapReduce jobs on the Hadoop cluster?
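For context, a minimal custom spout under the Storm 0.9 API looks roughly like the sketch below. This is not Storm's Flume integration, just an illustration: the in-memory queue stands in for whatever external source you pull from, and the class and field names are invented.

```java
import java.util.Map;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

import backtype.storm.spout.SpoutOutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseRichSpout;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Values;

// Hypothetical spout that drains events from an in-memory queue
// (a stand-in for a Flume channel or any other external source).
public class ExternalSourceSpout extends BaseRichSpout {
    private SpoutOutputCollector collector;
    private Queue<String> source;

    @Override
    public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
        this.collector = collector;
        this.source = new ConcurrentLinkedQueue<String>(); // wire up the real source here
    }

    @Override
    public void nextTuple() {
        String event = source.poll();
        if (event != null) {
            // Passing a message id as the second argument enables ack/fail replay.
            collector.emit(new Values(event), event);
        }
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("event"));
    }
}
```

You would then register it with a TopologyBuilder via setSpout as with any other spout.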
Storm Message Size
Hi, I have a topology which processes events, aggregates them in some form, and performs some prediction based on a machine learning (ML) model. Every x events, one of the bolts involved in the normal processing emits a trainModel event, which is routed to a bolt dedicated to the training. Once the training is done, the new model should be sent back to the prediction bolt. The topology looks like:

InputSpout -> AggregationBolt -> PredictionBolt -> OutputBolt
                   |                  ^
                   v                  |
                TrainingBolt ---------+

The model can get quite large (>100 MB), so I am not sure how this would impact the performance of my cluster. Does anybody have experience with transmitting large messages? Also, the training might take a while, so the aggregation bolt should not trigger the training bolt if it is busy. Is there an established pattern for how to achieve this kind of synchronization? I could have some streams to send states, but then I would mix the data stream with a control stream, which I would really like to avoid. An alternative would be to use ZooKeeper and perform the synchronization there. Last but not least, I could also make the aggregation bolt write into a database and have the training bolt periodically wake up and read the database. Does anybody have experience with such a setup? Kind Regards, Klaus
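On the concern about mixing data and control traffic: Storm bolts can declare multiple named output streams, which keeps control tuples logically separated from data tuples even though they travel through the same topology. A hedged sketch against the 0.9 API (the stream name, field names, and trigger interval are all invented for illustration):

```java
import java.util.Map;

import backtype.storm.task.OutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseRichBolt;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;

// Hypothetical aggregation bolt that emits results on the default stream
// and trainModel triggers on a separate "control" stream, so downstream
// bolts can subscribe to one without ever seeing the other.
public class AggregationBolt extends BaseRichBolt {
    private OutputCollector collector;
    private long eventCount = 0;

    @Override
    public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
    }

    @Override
    public void execute(Tuple input) {
        eventCount++;
        collector.emit(input, new Values(input.getValue(0))); // data stream
        if (eventCount % 10000 == 0) {
            collector.emit("control", input, new Values("trainModel")); // control stream
        }
        collector.ack(input);
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("event"));
        declarer.declareStream("control", new Fields("command"));
    }
}
```

The training bolt would then subscribe only to the "control" stream when the topology is wired up.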
Re: Storm Message Size
I can't comment on how large tuples fare, but regarding the synchronization, would this not make more sense?

InputSpout -> AggregationBolt -> PredictionBolt -> OutputBolt
                   |                   ^
                   v                   | (polls)
              Agg. State          Model State
                   | (polls)           ^
                   v                   |
                TrainingBolt ----------+

I.e. AggregationBolt writes to AggregationState, which is polled by TrainingBolt, which writes to ModelState. ModelState is then polled by PredictionBolt. This way, you can get rid of the large tuples as well and instead use something like S3 for these large states. On Wed, Feb 26, 2014 at 11:02 AM, Klausen Schaefersinho klaus.schaef...@gmail.com wrote: [...]
Re: Storm Message Size
THX, the idea is good, I will keep that in mind. The only drawback is that it relies on polling, which I do not like too much in the PredictionBolt. Of course I could also pass S3 or file references around in the messages to trigger an update. But for the sake of simplicity I was thinking of keeping everything in Storm and, if possible, not relying on other systems. Cheers, Klaus On Wed, Feb 26, 2014 at 12:22 PM, Enno Shioji eshi...@gmail.com wrote: [...]
Re: Storm Load Balancing
Well, 6700 isn't running at all. There's no uptime, so those workers aren't ever starting. 6701 appears to have died 20 minutes before you took the screenshot; that is going to result in load being shuffled around. So you had 3 functional workers (6701, 6702, 6703), and 6701 went down, leaving 6702 and 6703. Those are both issues to look into. Beyond that, you can try doing a rebalance. What sort of data is being processed? Given you are seeing a wide range on a single worker, it seems like you have data issues: some set of data takes longer, or you are doing a fields grouping on a field that isn't evenly distributed, etc. On Tue, Feb 25, 2014 at 9:25 AM, An Tran tra...@gmail.com wrote: I am having an issue with Storm load balancing. I have a bunch of executors (50) spread across 4 workers, and it looks like some executors are way over capacity while others are idle. See the attached image for more detail. Can you explain to me what's going on and how I can fix it? -- Ce n'est pas une signature
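To make the fields-grouping skew concrete, here is a hypothetical wiring sketch (the spout/bolt classes, field name, and parallelism numbers are invented for illustration):

```java
import backtype.storm.topology.TopologyBuilder;
import backtype.storm.tuple.Fields;

// With a fields grouping, every tuple carrying the same "userId" value lands
// on the same executor. If a handful of keys dominate the stream, the
// executors owning those keys run near capacity while the rest sit idle.
TopologyBuilder builder = new TopologyBuilder();
builder.setSpout("events", new EventSpout(), 4);
builder.setBolt("counter", new CounterBolt(), 50)
       .fieldsGrouping("events", new Fields("userId")); // skew-sensitive
// If per-key locality is not actually required, shuffleGrouping("events")
// spreads tuples evenly across all 50 executors instead.
```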
STORM with MYSQL optimizations
Dear All, Has anybody worked on the configurations/optimizations generally needed for using Storm with MySQL? Our scenario stores data in MySQL tables, but as the data rate increases, MySQL starts responding very slowly (in some cases with a "connection refused" error), causing our DBWriterBolt to slow down. The whole topology is bottlenecked by this issue. We cannot increase the traffic at the source beyond a certain level; the reason we noted is that the sink (MySQL), or the bolt adjacent to the sink, is performing slowly. Any suggestion on how we should proceed will be highly appreciated. Thanks.
Re: STORM with MYSQL optimizations
How much traffic exactly are you pushing at MySQL before the load gets too high and it starts to buckle under the weight? On Wed, Feb 26, 2014 at 8:38 AM, masoom alam masoom.a...@gmail.com wrote: [...] -- Ce n'est pas une signature
Re: Setting up Storm Cluster
There are good basic default configurations for each. There's nothing you should have to do. Older versions of storm 0.9.x defaulted to ZeroMQ, the latest defaults to Netty. I would advise not tuning any parameters of either until you need to and understand what you are doing. On Sat, Feb 22, 2014 at 7:01 AM, An Tran tra...@gmail.com wrote: Hi, I am trying to install the latest version of Storm. The documentation I found ( http://storm.incubator.apache.org/documentation/Setting-up-a-Storm-cluster.html) does not mention ZeroMQ or Netty configuration. Is this information correct and the most up to date? -- Ce n'est pas une signature
Re: STORM with MYSQL optimizations
1000 events per second. On Wed, Feb 26, 2014 at 6:40 PM, Sean Allen s...@monkeysnatchbanana.com wrote: [...]
Re: STORM with MYSQL optimizations
Is your MySQL set up to handle 1000 writes a second? I'm going to guess no. If that is the case then Klaus' suggestions are good ones: batch or shard. On Wed, Feb 26, 2014 at 8:45 AM, masoom alam masoom.a...@gmail.com wrote: [...] -- Ce n'est pas une signature
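The batching idea can be sketched in plain Java. The buffer below is illustrative only: in a real DBWriterBolt the flush step would typically go through JDBC's PreparedStatement.addBatch()/executeBatch() (one round-trip per N rows instead of per row), and you would also flush on a tick tuple so a quiet stream doesn't leave rows stranded.

```java
import java.util.ArrayList;
import java.util.List;

// Minimal batch buffer: collect events and hand them off in groups of
// batchSize, so the sink sees one write per batch instead of per event.
public class BatchBuffer<T> {
    private final int batchSize;
    private final List<T> buffer = new ArrayList<T>();
    private final List<List<T>> flushed = new ArrayList<List<T>>();

    public BatchBuffer(int batchSize) {
        this.batchSize = batchSize;
    }

    public void add(T event) {
        buffer.add(event);
        if (buffer.size() >= batchSize) {
            flush();
        }
    }

    // Call this from a tick-tuple handler too, so partial batches still drain.
    public void flush() {
        if (!buffer.isEmpty()) {
            flushed.add(new ArrayList<T>(buffer)); // here: hand to the DB writer
            buffer.clear();
        }
    }

    public List<List<T>> flushedBatches() {
        return flushed;
    }
}
```

Adding 7 events with a batch size of 3 produces two full batches immediately and a final partial batch on the explicit flush.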
Re: STORM with MYSQL optimizations
@Sean: You are right, MySQL is not configured to handle 1000 events per second. I will post the results of batching, which is also slow in our case. I think we should investigate thoroughly why a batch of, for example, 1000 is also slow for us. BTW, how easy is it to configure/implement shards in MySQL? Any useful pointers? On Wed, Feb 26, 2014 at 6:48 PM, Sean Allen s...@monkeysnatchbanana.com wrote: [...]
Re: STORM with MYSQL optimizations
Sharding is a pain in the ass and should be avoided when possible. If it's possible, I'd look for another data store that can handle a higher load as a cluster, so you don't have to worry about the details of sharding. On Wed, Feb 26, 2014 at 8:54 AM, masoom alam masoom.a...@gmail.com wrote: [...] -- Ce n'est pas une signature
Re: [RELEASE] Apache Storm 0.9.1-incubating released (defaults.yaml)
The defaults.yaml file is part of the source distribution and is packaged into storm's jar when deployed. In a storm cluster deployment, it is not meant to be on the file system in ${storm.home}/conf. Perhaps you are pointing to your source working tree as storm home? -- Derek On 2/26/14, 5:59, Lajos wrote: Quick question on this: defaults.yaml is in both conf and storm-core.jar, so the first time you start nimbus 0.9.1 you get this message: java.lang.RuntimeException: Found multiple defaults.yaml resources. You're probably bundling the Storm jars with your topology jar. [file:/scratch/projects/apache-storm-0.9.1-incubating/conf/defaults.yaml, jar:file:/scratch/projects/apache-storm-0.9.1-incubating/lib/storm-core-0.9.1-incubating.jar!/defaults.yaml] at backtype.storm.utils.Utils.findAndReadConfigFile(Utils.java:133) ~[storm-core-0.9.1-incubating.jar:0.9.1-incubating] ... Shouldn't conf/defaults.yaml be called like conf/defaults.yaml.copy or something? I like that it is in the conf directory, because now I can easily see all the config options instead of having to go to the source directory. But it shouldn't prevent startup ... Thanks, Lajos On 22/02/2014 21:09, P. Taylor Goetz wrote: The Storm team is pleased to announce the release of Apache Storm version 0.9.1-incubating. This is our first Apache release. Storm is a distributed, fault-tolerant, and high-performance realtime computation system that provides strong guarantees on the processing of data. You can read more about Storm on the project website: http://storm.incubator.apache.org Downloads of source and binary distributions are listed in our download section: http://storm.incubator.apache.org/downloads.html Distribution artifacts are available in Maven Central at the following coordinates: groupId: org.apache.storm artifactId: storm-core version: 0.9.1-incubating The full list of changes is available here[1]. Please let us know [2] if you encounter any problems. Enjoy! 
[1]: http://s.apache.org/Ki0 (CHANGELOG) [2]: https://issues.apache.org/jira/browse/STORM
Re: Storm Message Size
Hi Klaus, I've been dealing with similar use cases. I do a couple of things (which may not be a final solution, but it is interesting to discuss alternate approaches): I have passed trained models in the 200 MB range through Storm, but I try to avoid it. The model gets dropped into persistence and then only an ID for the model is passed through the topology. So my training bolt passes the whole model blob to the persistence bolt and that's it... in the future I may even remove that step so that the model blob never gets transferred by Storm at all. Also, I use separate topologies for training, and those tend to have much higher timeouts because the training aggregator can take quite a while. Traditionally this would probably happen in Hadoop or some other batch system, but I'm too busy to do the setup and Storm is handling it fine anyway. I don't have to do any polling because I have model selection running as a logically different step: a tuple shows up for prediction, a selection step runs which finds the model ID for scoring that tuple, then it flows on to an actual scoring bolt which retrieves the model based on the ID and applies it to the tuple. If the creation of a new model leads you to re-score old tuples, you could use the model write to trigger those tuples to be replayed from some source of state, such that they pick up the new model ID and proceed as normal. Best, Adam On Wed, Feb 26, 2014 at 7:54 AM, Klausen Schaefersinho klaus.schaef...@gmail.com wrote: [...]
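The pass-an-ID-not-the-blob pattern described above can be sketched with a toy in-memory store. Everything here is invented for illustration; a real deployment would back this with S3, HBase, or another persistence layer, and only the returned ID would travel through the topology.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

// Toy model store: the training step writes the (large) model blob here and
// receives a small ID; the scoring step later looks the model up by that ID.
// Tuples in the topology only ever carry the ID, never the blob itself.
public class ModelStore {
    private final Map<Long, byte[]> models = new ConcurrentHashMap<Long, byte[]>();
    private final AtomicLong nextId = new AtomicLong();

    public long put(byte[] modelBlob) {
        long id = nextId.incrementAndGet();
        models.put(id, modelBlob);
        return id;
    }

    public byte[] get(long id) {
        return models.get(id);
    }
}
```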
Re: [RELEASE] Apache Storm 0.9.1-incubating released
Hello, Padma! You can create a Storm cluster on Windows with one node as described here: http://ptgoetz.github.io/blog/2013/12/18/running-apache-storm-on-windows/ I was able to set it up following the instructions from this article. I hope that will help you too. Regards, Florin On Wed, Feb 26, 2014 at 2:23 PM, padma priya chitturi padmapriy...@gmail.com wrote: Does the 0.9.1 version have inbuilt support to run on Windows? On Wed, Feb 26, 2014 at 5:29 PM, Lajos la...@protulae.com wrote: [...]
Re: [RELEASE] Apache Storm 0.9.1-incubating released (defaults.yaml)
Hi Derek, Ah! I accidentally unpacked source on top of binary, when I meant to put it in a separate directory. That's the problem, thanks. Cheers, L On 26/02/2014 15:32, Derek Dagit wrote: The defaults.yaml file is part of the source distribution and is packaged into storm's jar when deployed. In a storm cluster deployment, it is not meant to be on the file system in ${storm.home}/conf. Perhaps you are pointing to your source working tree as storm home?
Re: Unexpected behavior on message resend
Hi Adam, ok, good to know. I resolved to create the tuple from scratch in case it needs to be resent. I don't see where else in-place modification could hurt in a linear process. Am I missing something? Thanks, Harald. On 26.02.2014 15:48, Adam Lewis wrote: I've already gotten slapped around on the list for doing in-place modifications, so let me pass it on :) Don't modify tuple objects in place. You shouldn't rely on serialization happening or not happening for correctness. On Mon, Feb 24, 2014 at 11:18 AM, Harald Kirsch harald.kir...@raytion.com wrote: Hi all, my TOPOLOGY_MESSAGE_TIMEOUT_SECS was slightly too low. I got a fail for a tuple and the spout just resent it. One bolt normalizes a date in place in a field of the tuple. After the spout resent the tuple, I got errors from the date parser because the date was already normalized. Since I currently have only one node, I know of course what happened: the tuple was just the very same object that was already partially processed when the timeout hit. In a distributed setup I envisage the bolt to be on another machine with a serialized copy of the spout's tuple, such that changes to the tuple are not reflected in the original. Would that be true? I reckon from this that all processing in bolts needs to be idempotent if I want to be able to replay failed tuples. Is that true or am I doing something wrong? Harald. -- Harald Kirsch Raytion GmbH Kaiser-Friedrich-Ring 74 40547 Duesseldorf Fon +49-211-550266-0 Fax +49-211-550266-19 http://www.raytion.com
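One way to make a normalization step safe against replays, besides rebuilding the tuple from scratch, is to make it idempotent: already-normalized input is a fixed point, so a replayed tuple passes through unchanged. A toy sketch (the two date formats here are invented for illustration):

```java
// Idempotent date normalization: normalized input is returned as-is, so
// replaying a partially processed tuple cannot trip the parser a second time.
public class DateNormalizer {
    // Hypothetical raw form "dd.MM.yyyy" normalized to ISO "yyyy-MM-dd".
    public static String normalize(String date) {
        if (date.matches("\\d{4}-\\d{2}-\\d{2}")) {
            return date; // already normalized: fixed point, nothing to do
        }
        String[] parts = date.split("\\.");
        return parts[2] + "-" + parts[1] + "-" + parts[0];
    }
}
```

Applying the function twice gives the same result as applying it once, which is exactly the property a replayed tuple needs.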
Re: Unexpected behavior on message resend
In my case it was the state objects created as part of a Trident aggregation. Here is the final message in the thread (i.e. read bottom up): http://mail-archives.apache.org/mod_mbox/storm-user/201312.mbox/%3CCAAYLz+p4YhF+i3LAkFoyU3nvngZXOusZWXj=0+bynrx0+tg...@mail.gmail.com%3E On Wed, Feb 26, 2014 at 10:35 AM, Harald Kirsch harald.kir...@raytion.com wrote: [...]
Storm cannot run in combination with a recent Hadoop/HBase version.
Hi, I'm trying to write some Storm bolts and I want them to output the information they produce into HBase. The HBase we have running here is based on CDH 4.5.0, which is fully based on the ZooKeeper versions in the 3.4.x range. The problem I have is that Storm currently still uses ZooKeeper 3.3.3. The important difference in my case between these two is that 3.3.x has org.apache.zookeeper.server.NIOServerCnxn$Factory, while 3.4.x has org.apache.zookeeper.server.NIOServerCnxnFactory. As a consequence I'm getting a ClassNotFoundException. I found that this problem was fixed for a short period but the change was reverted because of a performance problem in Curator: https://github.com/nathanmarz/storm/pull/225 What does it take to get this fixed (i.e. ZooKeeper bumped to a 3.4.x version)? -- Best regards / Met vriendelijke groeten, Niels Basjes
Re: [DISCUSS] Pulling Contrib Modules into Apache
Thanks for the feedback Bobby. To clarify, I'm mainly talking about spout/bolt/Trident state implementations that integrate Storm with *Technology X*, where *Technology X* is not a fundamental part of Storm. Examples would be technologies that are part of or related to the Hadoop/Big Data ecosystem and enable the Lambda Architecture, e.g.: Kafka, HDFS, HBase, Cassandra, etc. The idea behind having one or more Storm committers act as a "sponsor" is to make sure new additions are done carefully and with good reason. To add a new module, it would require committer/PPMC consensus, and assignment of one or more sponsors. Part of a sponsor's job would be to ensure that a module is maintained, which would require enough familiarity with the code to support it long term. If a new module was proposed, but no committers were willing to act as a sponsor, it would not be added. It would be the Committers'/PPMC's responsibility to make sure things didn't get out of hand, and to do something about it if it does. Here's an old Hadoop JIRA thread [1] discussing the addition of Hive as a contrib module, similar to what happened with HBase as Bobby pointed out. Some interesting points are brought up. The difference here is that both HBase and Hive were pretty big codebases relative to Hadoop. With spout/bolt/state implementations I doubt we'd see anything on that scale. - Taylor [1] https://issues.apache.org/jira/browse/HADOOP-3601 On Feb 26, 2014, at 12:35 PM, Bobby Evans ev...@yahoo-inc.com wrote: I can see a lot of value in having a distribution of Storm that comes with batteries included, where everything is tested together and you know it works. But I don't see much long-term developer benefit in building them all together.
If there is strong coupling between Storm and these external projects, such that they break when Storm changes, then we need to understand the coupling and decide if we want to reduce it by stabilizing APIs, improving version numbering and the release process, etc.; or if the functionality is something that should be offered as a base service in Storm. I can see politically the value of giving these other projects a home in Apache, and making them sub-projects is the simplest route to that. I'd love to have storm-on-yarn inside Apache. I just don't want to go overboard with it. There was a time when HBase was a "contrib" module under Hadoop along with a lot of other things, and the Apache board came and told Hadoop to break it up. Bringing storm-kafka into Storm does not sound like it will solve much from a developer's perspective, because there is at least as much coupling with Kafka as there is with Storm. I can see how it is a huge amount of overhead and pain to set up a new project just for a few hundred lines of code, so I am in favor of pulling in closely related projects, especially those that are spouts and state implementations. I just want to be sure that we do it carefully, with a good reason, and with enough people who are familiar with the code to support it long term. If it starts to look like we are pulling in too many projects, perhaps we should look at something more like the Bigtop project https://bigtop.apache.org/ which produces a tested distribution of Hadoop with many different sub-projects included in it. I am also a bit concerned about these sub-projects becoming second-class citizens, where we break something but, because the build is off by default, we don't know it. I would prefer that they are built and tested by default. If the build and test time starts to take too long, to me that means we need to start wondering if we have too many contrib modules.
—Bobby From: Brian Enochson brian.enoch...@gmail.com Reply-To: user@storm.incubator.apache.org Date: Tuesday, February 25, 2014 at 9:50 PM To: user@storm.incubator.apache.org Cc: d...@storm.incubator.apache.org Subject: Re: [DISCUSS] Pulling Contrib Modules into Apache Hi, I am in agreement with Taylor and believe I understand his intent. An incredible tool/framework/application like Storm is only enhanced and gains value from the number of well maintained and vetted modules that can be used for integration and adding further functionality. I am relatively new to the Storm community but have spent quite some time reviewing contrib modules out there, reviewing various duplicates and running into some version incompatibilities. I understand the need to keep Storm itself pure, but do think there needs to be some structure and
Re: [DISCUSS] Pulling Contrib Modules into Apache
Bobby, FWIW, I’d love to see storm-yarn inside. I think we could definitely make things easier on the end-user if they were more cohesive. e.g. Imagine if we had “storm launch yarn” inside of $storm/bin that would kick off a storm-yarn launch, with whatever version was built. It would likely simplify the “create-tarball” and storm-yarn getStormConfig process as well. -brian --- Brian O'Neill Chief Technology Officer Health Market Science The Science of Better Results 2700 Horizon Drive King of Prussia, PA 19406 M: 215.588.6024 @boneill42 http://www.twitter.com/boneill42 healthmarketscience.com This information transmitted in this email message is for the intended recipient only and may contain confidential and/or privileged material. If you received this email in error and are not the intended recipient, or the person responsible to deliver it to the intended recipient, please contact the sender at the email above and delete this email and any attachments and destroy any copies thereof. Any review, retransmission, dissemination, copying or other use of, or taking any action in reliance upon, this information by persons or entities other than the intended recipient is strictly prohibited. On 2/26/14, 4:25 PM, Bobby Evans ev...@yahoo-inc.com wrote: I totally agree and I am +1 on bringing these spout/trident pieces in, assuming there are committers to support them. I am also curious about how people feel about pulling in other projects like storm-starter, storm-deploy, storm-mesos, and storm-yarn? Storm-starter in my opinion seems more like documentation and it would be nice to pull it in so that it stays up to date with storm itself, just like the documentation. The others are more ways to run storm in different environments. They seem like there could be a lot of coupling between them and storm as storm evolves, and they kind of fit with “integrate storm with *Technology X*” except X in this case is a compute environment instead of a data source or store. 
But then again we also just shot down a request to create juju charms for storm. —Bobby 
RE: [DISCUSS] Pulling Contrib Modules into Apache
Bobby, I vote to include both storm-yarn and storm-deploy. Roger 
Re: Spout missing Acks when a Bolt uses JRuby
Thanks Taylor. I was afraid creating the JRuby runtime this way might be expensive. Initially I did create it inside of the prepare() method, but I ran into some trouble because the Ruby class is not serializable. I played around with it a little more today and had some success creating a static class to hold my Ruby objects. That way I could initialize it in prepare() and just call my processing code in execute(). When I initialize my Ruby objects this way I'm generally receiving my Acks the way I expect to, but there are still some that I don't receive. This is different from when I had the Ruby initialization in execute(). In that scenario I failed to receive any Acks at all. I'm still new to Storm, so it's possible that I'm just missing something obvious. The thing is, though, when I take the JRuby code out everything works fine. Is it possible that I'm just not waiting long enough? In my tests it takes about 10 seconds for my test data to flow through the topology (single node) and I shut it down after 30 seconds. The Acks I do receive happen pretty much instantly after I call ack() in the last Bolt, though. On Wed, Feb 26, 2014 at 6:13 PM, P. Taylor Goetz ptgo...@gmail.com wrote: Hi Jonathan, I've used JRuby fairly extensively with storm (though with the trident API), but it's been a while so I'm rusty. Initializing the JRuby runtime is very expensive, so you should do that in the prepare() method of your bolt. That means you'll have to store it as an instance variable in your bolt, which in turn opens the door for potential concurrency issues in your JRuby code. Be warned. It can get kind of crazy. I forget what the magic JRuby runtime configuration was off hand. But it works. I'll try to unarchive those memories and reply. -Taylor On Feb 25, 2014, at 10:32 PM, Jonathan Nilsson jonathan.nils...@gmail.com wrote: I'm trying to write a Storm Bolt that does some processing with JRuby. 
When my data goes through this Bolt I see that the Spout does not appear to be receiving any acks. I'm pretty sure I'm anchoring my tuples correctly. If I take the JRuby Bolt out of the topology everything works fine again. In trying to isolate the problem I wrote a Bolt that does no processing at all but does call Ruby.getGlobalRuntime(). That call alone seems to be enough to stop the acks from flowing. I've boiled the execute method down to:

    public void execute(Tuple input) {
        Ruby.getGlobalRuntime();
        LOG.info("Sending Ack for " + input);
        collector.ack(input);
    }

and I get the log message but no messages in the Spout's ack or fail methods. If I remove the Ruby.getGlobalRuntime(); line everything works. I've tried using Ruby.getThreadLocalRuntime() but it doesn't seem to make a difference. Has anyone seen a similar problem? Are there any tricks to calling JRuby code from within Storm?
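The static-holder approach Jonathan describes above (create the non-serializable runtime once per worker JVM in prepare(), then reuse it in execute()) can be sketched as follows. This is a minimal, self-contained illustration of the pattern only: ExpensiveRuntime and RuntimeHolder are hypothetical stand-ins for the JRuby runtime and its holder class, not actual Storm or JRuby API.

```java
// Storm serializes bolt instances when a topology is submitted, so a
// non-serializable object (like a JRuby runtime) cannot be a field set in
// the constructor. The workaround described in this thread: keep it in a
// static holder, initialized lazily the first time prepare() asks for it.
class ExpensiveRuntime {
    static int constructions = 0;          // counts how often we actually build one
    ExpensiveRuntime() { constructions++; } // imagine this takes seconds (JRuby init)
    String eval(String script) { return "evaluated:" + script; }
}

final class RuntimeHolder {
    private static volatile ExpensiveRuntime runtime;
    private RuntimeHolder() {}

    // Double-checked locking; safe because 'runtime' is volatile.
    static ExpensiveRuntime get() {
        if (runtime == null) {
            synchronized (RuntimeHolder.class) {
                if (runtime == null) runtime = new ExpensiveRuntime();
            }
        }
        return runtime;
    }
}

public class HolderDemo {
    public static void main(String[] args) {
        // Simulate two bolt tasks in the same worker each calling the holder
        // from prepare()/execute(): the runtime is constructed exactly once.
        System.out.println(RuntimeHolder.get().eval("task1"));
        System.out.println(RuntimeHolder.get().eval("task2"));
        System.out.println("constructions=" + ExpensiveRuntime.constructions);
    }
}
```

Note that this gives every task in the worker the same runtime instance, which is exactly why Taylor's warning about concurrency in the Ruby code applies.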
Facing Error in storm-deploy
Hi everyone, I am trying to deploy storm on an AWS cluster and am getting the following error. I am using a Mac, so these are the steps I followed:

1. Downloaded lein, made it executable, moved it to /usr/local/bin and ran it.
2. Did a git clone of the storm-deploy code.
3. cd'd into storm-deploy and ran lein deps.
4. Made a config.clj file as below. Also ran ssh-keygen in ~/.ssh to create a public/private key pair, which was used in config.clj.

(defpallet
  :services
  {:default
   {:blobstore-provider aws-s3
    :provider aws-ec2
    :environment {:user {:username storm ; this must be storm
                         :private-key-path ~/.ssh/id_rsa
                         :public-key-path ~/.ssh/id_rsa.pub}
                  :aws-user-id 2517}
    :identity :credential
    :jclouds.regions us-east-1}})

Then ran: lein deploy-storm --start --name mycluster --branch 0.8.3

The error is below. I have followed the steps as described and cross-checked them in multiple places, but the error remains. Any help or pointers would be really useful. Thanks in advance! 
--Gaurav

INFO execute - Output: /Users/admin/.ssh/id_rsa.pub
DEBUG execute - out = /Users/admin/.ssh/id_rsa.pub\n
INFO execute - Output: /Users/admin/.ssh/id_rsa
DEBUG execute - out = /Users/admin/.ssh/id_rsa\n
INFO execute - Output: storm
DEBUG execute - out = storm\n
INFO execute - Output: /Users/admin/.ssh/id_rsa.pub
DEBUG execute - out = /Users/admin/.ssh/id_rsa.pub\n
INFO execute - Output: /Users/admin/.ssh/id_rsa
DEBUG execute - out = /Users/admin/.ssh/id_rsa\n
INFO execute - Output: /Users/admin/.ssh/id_rsa.pub
DEBUG execute - out = /Users/admin/.ssh/id_rsa.pub\n
INFO execute - Output: /Users/admin/.ssh/id_rsa
DEBUG execute - out = /Users/admin/.ssh/id_rsa\n
DEBUG jclouds - Found jclouds sshj driver
DEBUG jclouds - extensions (:log4j :slf4j :sshj)
DEBUG jclouds - options [:jclouds.regions us-east-1 :blobstore-provider aws-s3]
ERROR logging - Exception in thread main
ERROR logging - com.google.inject.CreationException: Guice creation errors: 1) org.jclouds.rest.RestContext<org.jclouds.aws.ec2.AWSEC2Client, A> cannot be used as a key; It is not fully specified. 
1 error (form-init8975416400432954481.clj:1)
ERROR logging - at clojure.lang.Compiler.eval(Compiler.java:5440)
ERROR logging - at clojure.lang.Compiler.eval(Compiler.java:5415)
ERROR logging - at clojure.lang.Compiler.load(Compiler.java:5857)
ERROR logging - at clojure.lang.Compiler.loadFile(Compiler.java:5820)
ERROR logging - at clojure.main$load_script.invoke(main.clj:221)
ERROR logging - at clojure.main$init_opt.invoke(main.clj:226)
ERROR logging - at clojure.main$initialize.invoke(main.clj:254)
ERROR logging - at clojure.main$null_opt.invoke(main.clj:279)
ERROR logging - at clojure.main$main.doInvoke(main.clj:354)
ERROR logging - at clojure.lang.RestFn.invoke(RestFn.java:422)
ERROR logging - at clojure.lang.Var.invoke(Var.java:369)
ERROR logging - at clojure.lang.AFn.applyToHelper(AFn.java:165)
ERROR logging - at clojure.lang.Var.applyTo(Var.java:482)
ERROR logging - at clojure.main.main(main.java:37)
ERROR logging - Caused by: com.google.inject.CreationException: Guice creation errors: 1) org.jclouds.rest.RestContext<org.jclouds.aws.ec2.AWSEC2Client, A> cannot be used as a key; It is not fully specified. 
1 error
ERROR logging - at com.google.inject.internal.Errors.throwCreationExceptionIfErrorsExist(Errors.java:435)
ERROR logging - at com.google.inject.internal.InternalInjectorCreator.initializeStatically(InternalInjectorCreator.java:154)
ERROR logging - at com.google.inject.internal.InternalInjectorCreator.build(InternalInjectorCreator.java:106)
ERROR logging - at com.google.inject.Guice.createInjector(Guice.java:95)
ERROR logging - at org.jclouds.ContextBuilder.buildInjector(ContextBuilder.java:324)
ERROR logging - at org.jclouds.ContextBuilder.buildInjector(ContextBuilder.java:262)
ERROR logging - at org.jclouds.ContextBuilder.buildView(ContextBuilder.java:524)
ERROR logging - at org.jclouds.ContextBuilder.buildView(ContextBuilder.java:504)
ERROR logging - at org.jclouds.compute2$compute_service.doInvoke(compute2.clj:92)
ERROR logging - at clojure.lang.RestFn.applyTo(RestFn.java:147)
ERROR logging - at clojure.core$apply.doInvoke(core.clj:548)
ERROR logging - at clojure.lang.RestFn.invoke(RestFn.java:562)
ERROR logging - at pallet.compute.jclouds$eval5952$fn__5954.invoke(jclouds.clj:720)
ERROR logging - at clojure.lang.MultiFn.invoke(MultiFn.java:167)
ERROR logging - at pallet.compute$compute_service.doInvoke(compute.clj:36)
ERROR logging - at clojure.lang.RestFn.applyTo(RestFn.java:140)
ERROR logging - at clojure.core$apply.invoke(core.clj:542)
ERROR logging - at