supervisor start error
I installed Storm on a single machine. When I start the supervisor, it generates this error:

org.apache.thrift7.transport.TTransportException: Could not create ServerSocket on address 0.0.0.0/0.0.0.0:6627.
    at org.apache.thrift7.transport.TNonblockingServerSocket.init(TNonblockingServerSocket.java:89) ~[libthrift7-0.7.0-2.jar:0.7.0-2]
    at org.apache.thrift7.transport.TNonblockingServerSocket.init(TNonblockingServerSocket.java:68) ~[libthrift7-0.7.0-2.jar:0.7.0-2]
    at org.apache.thrift7.transport.TNonblockingServerSocket.init(TNonblockingServerSocket.java:61) ~[libthrift7-0.7.0-2.jar:0.7.0-2]
    at backtype.storm.daemon.nimbus$launch_server_BANG_.invoke(nimbus.clj:1137) ~[storm-core-0.9.0.1.jar:na]
    at backtype.storm.daemon.nimbus$_launch.invoke(nimbus.clj:1167) ~[storm-core-0.9.0.1.jar:na]
    at backtype.storm.daemon.nimbus$_main.invoke(nimbus.clj:1189) ~[storm-core-0.9.0.1.jar:na]
    at clojure.lang.AFn.applyToHelper(AFn.java:159) ~[clojure-1.4.0.jar:na]
    at clojure.lang.AFn.applyTo(AFn.java:151) ~[clojure-1.4.0.jar:na]
    at backtype.storm.daemon.nimbus.main(Unknown Source) ~[storm-core-0.9.0.1.jar:na]

I do not know what causes this.
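Two hints in the trace: port 6627 is the default Nimbus Thrift port, and the frames are from the nimbus daemon, so something is already bound to that port (often a Nimbus instance that is still running on the same machine). As a quick diagnostic, here is a generic sketch (not part of Storm) that probes whether the port is already taken:

```python
import socket

def port_in_use(port, host="127.0.0.1"):
    """Return True if something is already listening on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.settimeout(1.0)
        return s.connect_ex((host, port)) == 0

if __name__ == "__main__":
    if port_in_use(6627):
        print("port 6627 is taken: another Nimbus (or other process) is running")
    else:
        print("port 6627 is free")
```

If the port is taken, `netstat -tlnp | grep 6627` (or `lsof -i :6627`) will show the owning process; killing the stale daemon usually clears this error.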
Re: Tuples lost in Storm 0.9.1
Hi, Daria

The lost tuples may be going to one of two places: 1) the message_queue in the netty client, which will cause a memory leak; 2) netty's internal buffer: if the connection is lost, all tuples in it are lost. So check your worker log to see if there is any connection-lost error.

Regards

2014-05-16 12:18 GMT+08:00 李家宏 jh.li...@gmail.com:

I am running into the same issue. Where did the lost tuples go? If they were queueing in the transport layer, memory usage should keep increasing, but I didn't see any noticeable memory leaks. Does Storm guarantee that all tuples sent from task A to task B will be received by task B? Moreover, do they arrive in order? Can anybody give any idea on this issue?

2014-04-02 20:56 GMT+08:00 Daria Mayorova d.mayor...@gmail.com:

Hi everyone,

We are having some issues with our Storm topology. The problem is that some tuples are being lost somewhere in the topology. Just after the topology is deployed it runs pretty well, but after several hours it starts to lose a significant number of tuples. From what we've found in the logs, tuples exit one bolt/spout and never enter the next bolt.

Here is some info about the topology:
- The version is 0.9.1, and netty is used as transport
- The spout extends BaseRichSpout, and the bolts extend BaseBasicBolt
- The spout is using a Kestrel message queue
- The cluster consists of 2 nodes: zookeeper, nimbus and ui run on one node, and the workers run on another node.

I am attaching the content of the config files below. We have also tried running the workers on the other node (the same one as nimbus and zookeeper), and on both nodes, but the behavior is the same. According to the Storm UI there are no failed tuples.

Can anybody give any idea of what might be the reason for the tuples getting lost? Thanks.
*Storm config (storm.yaml)* (when both nodes have workers running, the configuration is the same on both nodes; only the storm.local.hostname parameter changes)

storm.zookeeper.servers:
    - zkserver1
nimbus.host: nimbusserver
storm.local.dir: /mnt/storm
supervisor.slots.ports:
    - 6700
    - 6701
    - 6702
    - 6703
storm.local.hostname: storm1server
nimbus.childopts: -Xmx1024m -Djava.net.preferIPv4Stack=true
ui.childopts: -Xmx768m -Djava.net.preferIPv4Stack=true
supervisor.childopts: -Xmx1024m -Djava.net.preferIPv4Stack=true
worker.childopts: -Xmx3548m -Djava.net.preferIPv4Stack=true
storm.cluster.mode: distributed
storm.local.mode.zmq: false
storm.thrift.transport: backtype.storm.security.auth.SimpleTransportPlugin
storm.messaging.transport: backtype.storm.messaging.netty.Context
storm.messaging.netty.server_worker_threads: 1
storm.messaging.netty.client_worker_threads: 1
storm.messaging.netty.buffer_size: 5242880 # 5MB buffer
storm.messaging.netty.max_retries: 30
storm.messaging.netty.max_wait_ms: 1000
storm.messaging.netty.min_wait_ms: 100

*Zookeeper config (zoo.cfg):*

tickTime=2000
initLimit=10
syncLimit=5
dataDir=/var/zookeeper
clientPort=2181
autopurge.purgeInterval=24
autopurge.snapRetainCount=5
server.1=localhost:2888:3888

*Topology configuration* passed to the StormSubmitter:

Config conf = new Config();
conf.setNumAckers(6);
conf.setNumWorkers(4);
conf.setMaxSpoutPending(100);

Best regards,
Daria Mayorova

--
Gvain
Email: jh.li...@gmail.com
Implications of running multiple topologies without isolation
I am trying to understand the implications of running multiple topologies on a single cluster without using the isolation scheduler. The way this appears to work is isolation at the machine level, not the worker level. Our issue right now is that we only have 5 machines to work with. We have enough resources to run multiple workers per machine, but do not feel comfortable running each topology on fewer than all 5 machines. So the main questions are: 1) what issues do I risk running into if I run multiple topologies on a single cluster without the isolation scheduler, and 2) is there a way to isolate at the worker level, i.e., each worker handles tasks for a single topology? Thanks, Justin
Re: Implications of running multiple topologies without isolation
1) A worker can spawn any number of threads, so you can run into standard shared-resource issues (CPU, network, disk, etc.). RAM is not as big of a problem, since each worker gets a fixed amount. 2) A worker is spawned for a particular topology; it only executes spout/bolt tasks for the topology to which it is assigned.
Re: Implications of running multiple topologies without isolation
Try Storm on Mesos for isolation. http://mesosphere.io/learn/run-storm-on-mesos/

On Fri, Jun 6, 2014 at 10:03 AM, Justin Workman justinjwork...@gmail.com wrote:

Thanks for the responses. I assumed worker isolation worked this way; I had just read a couple of things that made me question it. Justin

On Jun 6, 2014, at 10:25 AM, Derek Dagit der...@yahoo-inc.com wrote:

> So the main questions are, 1) what issues do I risk running into if I run multiple topologies on a single cluster without the isolation scheduler,

Resource contention on shared boxes: CPU (number of cores), network, disk (if applicable). It depends on what these topologies are doing and which resources they will use the most.

> 2) is there a way to isolate at the worker level, i.e., each worker handles tasks for a single topology?

I thought this is the way it normally worked. A single worker JVM runs on behalf of one topology, but can run tasks from multiple different components (bolts/spouts) defined in that topology.

-- Derek

--
Lin Zhao
3101 Park Blvd, Palo Alto, CA 94306
Re: Order of Bolt definition, catching that subscribes from non-existent component [ ...]
I am sorry for the late reply. Yes, you can't have a loop. You can have a chain, though (one that doesn't close on itself!). Thanks :-)

On Wed, May 7, 2014 at 12:50 PM, shahab shahab.mok...@gmail.com wrote:

Thanks Abhishek. But this also implies that we cannot have a loop (of message-processing stages) using Storm, right? best, /Shahab

On Mon, May 5, 2014 at 9:45 PM, Abhishek Bhattacharjee abhishek.bhattacharje...@gmail.com wrote:

I don't think what you are trying to do is achievable. Data in Storm always moves forward, so you can't give it back to the bolt from which it originated. That is, a bolt can only subscribe to bolts that were declared before it. So I think you can create another instance of the A bolt, say D, and then feed the output of C to D.

On Mon, May 5, 2014 at 8:11 PM, shahab shahab.mok...@gmail.com wrote:

Hi,

I am trying to define a topology as follows:

S : a spout
A, B, C : bolts
--> : means "emits messages to"

S --> A
A --> B
B --> C
C --> A

I declare the spout and bolts in the above order in my Java code: first S, then A, B, and finally C. I am using globalGrouping(componentName, streamId) to define which messages each bolt collects. The problem is that while defining bolt A I receive an error saying that it "subscribes from non-existent component [C]". I guess the error happens because component C is not defined yet, but what could be the solution to this?

best, /Shahab

-- *Abhishek Bhattacharjee* *Pune Institute of Computer Technology*
Re: Order of Bolt definition, catching that subscribes from non-existent component [ ...]
You can have a loop on a different stream. It's not always the best thing to do (there are deadlock possibilities from buffers), but we have a production topology with that kind of pattern. In our case, one bolt acts as a coordinator for a recursive search.

Michael Rose (@Xorlev https://twitter.com/xorlev)
Senior Platform Engineer, FullContact http://www.fullcontact.com/
mich...@fullcontact.com
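The declaration-order constraint discussed above can be sketched outside of Storm. Below is a toy builder in Python (hypothetical names, not the Storm API) that, like Storm's TopologyBuilder, rejects a subscription to a component that has not been declared yet, reproducing the shape of shahab's error:

```python
class ToyTopologyBuilder:
    """Toy sketch (not the Storm API): a builder that rejects a
    subscription to a component that has not been declared yet."""

    def __init__(self):
        self.components = {}  # name -> list of (source, stream) inputs

    def set_component(self, name, subscribes_to=()):
        for source, stream in subscribes_to:
            if source not in self.components:
                raise ValueError(
                    "%s subscribes from non-existent component [%s]"
                    % (name, source))
        self.components[name] = list(subscribes_to)


builder = ToyTopologyBuilder()
builder.set_component("S")
try:
    # A wants input from C, but C has not been declared yet: the
    # same shape of failure as in the thread above.
    builder.set_component("A", [("S", "default"), ("C", "loopback")])
except ValueError as e:
    print(e)  # A subscribes from non-existent component [C]
```

In real Storm the workaround follows Michael's advice: declare the forward chain S, A, B, C first, then route the closing C-to-A edge over a separate stream, keeping in mind the buffer-deadlock caveat.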
Time Partitioning of Tuples
Hi Everyone, I'm currently investigating different data processing tools for an application I'm interested in. I have many sensors that I collect data from. However, I would like to group the data from every sensor at predefined time intervals and process it together. Using Storm terminology, I would have each sensor send data to a spout. The spouts would then send tuples to a specific bolt that will process all of the data within a specific time partition. Each spout will tag each event with a time id and each bolt will process data after collecting all of the data with the same time id tags. Is this possible with Storm? I appreciate your help! Jonathan
Re: Time Partitioning of Tuples
You could send a signal tuple from the spout when it knows it has sent the last tuple for a time period, or include a field in the tuple indicating that it is the last member. I'm curious about why you want to do this, since the purpose of Storm is to facilitate stream processing rather than the type of batch processing you're describing. -- Kyle

On 06/06/2014 05:14 PM, Jonathan Poon wrote:

Hi Nathan, The sensor data I have is naturally time-sorted, since it's just collecting data and emitting it to a spout. Is it possible for a bolt to know when all of the tuples with the same time tag have been collected and to start processing them together? Or is it only possible for a bolt to process each tuple one at a time? Thanks!

On Fri, Jun 6, 2014 at 3:07 PM, Nathan Leung ncle...@gmail.com wrote:

You can have your bolt subscribe to the spout using fields grouping and use the time tag as your key.
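Kyle's signal-tuple pattern can be sketched outside of Storm. The following Python sketch (class and tuple shapes are made up for illustration, not Storm API) shows a bolt-like aggregator that buffers tuples per time id and only processes a window once the end-of-window marker for that id arrives:

```python
from collections import defaultdict

class WindowingBolt:
    """Buffers tuples per time id; flushes a window when its marker arrives.
    Mirrors the signal-tuple pattern: the spout emits ("END", time_id, None)
    after the last data tuple for that window."""

    def __init__(self, process_fn):
        self.buffers = defaultdict(list)
        self.process_fn = process_fn

    def execute(self, tup):
        kind, time_id, payload = tup
        if kind == "DATA":
            self.buffers[time_id].append(payload)
        elif kind == "END":
            batch = self.buffers.pop(time_id, [])
            self.process_fn(time_id, batch)


results = {}
bolt = WindowingBolt(lambda tid, batch: results.update({tid: sum(batch)}))
for t in [("DATA", 1, 10), ("DATA", 1, 5), ("DATA", 2, 7), ("END", 1, None)]:
    bolt.execute(t)
print(results)  # window 1 flushed as {1: 15}; window 2 still buffering
```

Combined with Nathan's fields grouping on the time tag, all tuples for one window land on the same bolt instance, so each instance only needs its own local buffers.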
Re: Time Partitioning of Tuples
Hi Kyle,

I'm looking for a real-time batch processing tool. In my case, I'm looking to make correlations between all of the sensors at each time interval. I could use Hadoop (MapReduce), but that requires me to collect all of the data before I can batch-process each time partition of data from each sensor. Another tool I'm looking at is Spark Streaming, which lets me collect data at different time intervals and process each batch of data using MapReduce. However, MapReduce seems inefficient because my sensor data is already naturally time-sorted. In addition, I would like real-time data on the fly. It seems like Storm might be a candidate for this application. Please let me know what you think! Thanks for your help! Jonathan
Re: Time Partitioning of Tuples
Sounds interesting. I don't know much about your project, so I won't speculate about your purposes. One thing to consider is that the computation on a time slice must take longer than the time slice itself for this type of setup to really be worthwhile. Otherwise you could just feed the batches through the same bolt, since it would finish processing one batch before the next one comes in. -- Kyle
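Kyle's observation about computation time versus slice length reduces to simple arithmetic: if processing one window takes longer than the window itself, you need roughly ceil(processing_time / slice_length) windows in flight at once (e.g. via bolt parallelism) to keep up. A rough sketch, assuming stable per-window processing time (the function name is made up for illustration):

```python
import math

def windows_in_flight(processing_secs, slice_secs):
    """How many windows must be processed concurrently so the pipeline
    keeps up with the incoming stream, assuming stable processing time."""
    return max(1, math.ceil(processing_secs / slice_secs))

# A 10 s window that takes 25 s to process needs 3 concurrent consumers;
# if processing fits inside the slice, a single bolt instance keeps up.
print(windows_in_flight(25, 10))  # -> 3
print(windows_in_flight(4, 10))   # -> 1
```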
Re: Time Partitioning of Tuples
I will take a look at Trident as well. Thanks for the tip! Jonathan
RE: Time Partitioning of Tuples
You might look at Esper. I believe someone has even embedded Esper into Storm. -Dan