Originally, we had multiple topologies with a single worker, a maxSpoutPending of 3, a parallelismHint between 1 and 20 depending on the bolt, a batch size of 300, and a maxTask of 90.
By reducing the batch size to 2 records, I saw that my workers always died after about 45 seconds, so I just made many attempts, changing the values of all the parameters.

I still don't fully understand the difference between maxTask and the parallelism hint. I only remember that, originally, we preferred to reduce the parallelismHint to the minimum to avoid the problem caused by https://issues.apache.org/jira/browse/STORM-503

Note that we use Trident, so if I count all the bolts listed under the "Bolt (All time)" section in the Storm UI, I have 137 bolts (including all merges, joins, projects...).

Eric
________________________________
From: Harsha <[email protected]>
Sent: July 23, 2015 11:14
To: [email protected]
Subject: Re: worker dies after view minutes

Thanks for the update, Eric. Could you describe how you found that a maxTask set too high was causing this issue? We are trying to improve the debugging of Storm topologies, so this will be helpful for us.

Thanks,
Harsha

On Thu, Jul 23, 2015, at 07:36 AM, Eric Ruel wrote:

Finally, the problem was caused by a maxTask that was too high.
________________________________
From: Harsha <[email protected]>
Sent: July 22, 2015 10:56
To: [email protected]
Subject: Re: worker dies after view minutes

What does your topology code look like? Are you throwing any errors from a bolt's execute method? It does look like there is a RuntimeException happening: "Error when processing event java.lang.RuntimeException: ". It's up to the user to catch any exception and log it or do something with it, instead of throwing it back to the worker JVM.

-Harsha

On Wed, Jul 22, 2015, at 07:43 AM, Eric Ruel wrote:

Hello,

The workers in my topology die after 1 or 2 minutes. I tried changing the heartbeat config and switching between cluster and local mode, but they always die.

Any idea?

10:38:38.019 ERROR backtype.storm.daemon.worker - Error when processing event
java.lang.RuntimeException: org.apache.storm.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /workerbeats/testeric-1-1437575782/259f61ae-02a5-4a75-be50-68f27054a7b2-1024
    at backtype.storm.util$wrap_in_runtime.invoke(util.clj:44) ~[storm-core-0.9.6.jar:0.9.6]
    at backtype.storm.zookeeper$set_data.invoke(zookeeper.clj:173) ~[storm-core-0.9.6.jar:0.9.6]
    at backtype.storm.cluster$mk_distributed_cluster_state$reify__1919.set_data(cluster.clj:92) ~[storm-core-0.9.6.jar:0.9.6]
    at backtype.storm.cluster$mk_storm_cluster_state$reify__2376.worker_heartbeat_BANG_(cluster.clj:332) ~[storm-core-0.9.6.jar:0.9.6]
    at sun.reflect.GeneratedMethodAccessor135.invoke(Unknown Source) ~[na:na]
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[na:1.7.0_71]
    at java.lang.reflect.Method.invoke(Method.java:606) ~[na:1.7.0_71]
    at clojure.lang.Reflector.invokeMatchingMethod(Reflector.java:93) ~[clojure-1.5.1.jar:na]
    at clojure.lang.Reflector.invokeInstanceMethod(Reflector.java:28) ~[clojure-1.5.1.jar:na]
    at backtype.storm.daemon.worker$do_executor_heartbeats.doInvoke(worker.clj:56) ~[storm-core-0.9.6.jar:0.9.6]
    at clojure.lang.RestFn.invoke(RestFn.java:439) ~[clojure-1.5.1.jar:na]
    at backtype.storm.daemon.worker$fn__3757$exec_fn__1163__auto____3758$fn__3761.invoke(worker.clj:413) ~[storm-core-0.9.6.jar:0.9.6]
    at backtype.storm.timer$schedule_recurring$this__1704.invoke(timer.clj:99) ~[storm-core-0.9.6.jar:0.9.6]
    at backtype.storm.timer$mk_timer$fn__1687$fn__1688.invoke(timer.clj:50) ~[storm-core-0.9.6.jar:0.9.6]
    at backtype.storm.timer$mk_timer$fn__1687.invoke(timer.clj:42) [storm-core-0.9.6.jar:0.9.6]
    at clojure.lang.AFn.run(AFn.java:24) [clojure-1.5.1.jar:na]
    at java.lang.Thread.run(Thread.java:745) [na:1.7.0_71]
Caused by: org.apache.storm.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /workerbeats/testeric-1-1437575782/259f61ae-02a5-4a75-be50-68f27054a7b2-1024
    at org.apache.storm.zookeeper.KeeperException.create(KeeperException.java:99) ~[storm-core-0.9.6.jar:0.9.6]
    at org.apache.storm.zookeeper.KeeperException.create(KeeperException.java:51) ~[storm-core-0.9.6.jar:0.9.6]
    at org.apache.storm.zookeeper.ZooKeeper.setData(ZooKeeper.java:1270) ~[storm-core-0.9.6.jar:0.9.6]
    at org.apache.storm.curator.framework.imps.SetDataBuilderImpl$4.call(SetDataBuilderImpl.java:260) ~[storm-core-0.9.6.jar:0.9.6]
    at org.apache.storm.curator.framework.imps.SetDataBuilderImpl$4.call(SetDataBuilderImpl.java:256) ~[storm-core-0.9.6.jar:0.9.6]
    at org.apache.storm.curator.RetryLoop.callWithRetry(RetryLoop.java:107) ~[storm-core-0.9.6.jar:0.9.6]
    at org.apache.storm.curator.framework.imps.SetDataBuilderImpl.pathInForeground(SetDataBuilderImpl.java:252) ~[storm-core-0.9.6.jar:0.9.6]
    at org.apache.storm.curator.framework.imps.SetDataBuilderImpl.forPath(SetDataBuilderImpl.java:239) ~[storm-core-0.9.6.jar:0.9.6]
    at org.apache.storm.curator.framework.imps.SetDataBuilderImpl.forPath(SetDataBuilderImpl.java:39) ~[storm-core-0.9.6.jar:0.9.6]
    at backtype.storm.zookeeper$set_data.invoke(zookeeper.clj:172) ~[storm-core-0.9.6.jar:0.9.6]
    ... 15 common frames omitted
10:38:38.023 ERROR backtype.storm.util - Halting process: ("Error when processing an event")
java.lang.RuntimeException: ("Error when processing an event")
    at backtype.storm.util$exit_process_BANG_.doInvoke(util.clj:325) [storm-core-0.9.6.jar:0.9.6]
    at clojure.lang.RestFn.invoke(RestFn.java:423) [clojure-1.5.1.jar:na]
    at backtype.storm.daemon.worker$mk_halting_timer$fn__3572.invoke(worker.clj:177) [storm-core-0.9.6.jar:0.9.6]
    at backtype.storm.timer$mk_timer$fn__1687$fn__1688.invoke(timer.clj:68) [storm-core-0.9.6.jar:0.9.6]
    at backtype.storm.timer$mk_timer$fn__1687.invoke(timer.clj:42) [storm-core-0.9.6.jar:0.9.6]
    at clojure.lang.AFn.run(AFn.java:24) [clojure-1.5.1.jar:na]
    at java.lang.Thread.run(Thread.java:745) [na:1.7.0_71]
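
On the debugging side: the trace above shows the worker halting because its heartbeat write to ZooKeeper (a path under /workerbeats) failed with ConnectionLoss, rather than because of an exception thrown from a bolt. A minimal sketch of how one might scan worker logs for that specific signature is given here. The class and method names (HeartbeatLossDetector, lostHeartbeatPath) are hypothetical and not part of Storm; only the message text is taken from the log in this thread.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical log-scanning helper; not part of Storm. */
final class HeartbeatLossDetector {

    // Matches the ZooKeeper message seen in the trace and captures the znode path.
    // \s+ tolerates the path being wrapped onto the next line.
    private static final Pattern CONNECTION_LOSS =
            Pattern.compile("KeeperErrorCode = ConnectionLoss for\\s+(/workerbeats/\\S+)");

    /** Returns the first /workerbeats path whose write failed with ConnectionLoss, if any. */
    static Optional<String> lostHeartbeatPath(String logText) {
        Matcher m = CONNECTION_LOSS.matcher(logText);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        String sample = "java.lang.RuntimeException: "
                + "org.apache.storm.zookeeper.KeeperException$ConnectionLossException: "
                + "KeeperErrorCode = ConnectionLoss for "
                + "/workerbeats/testeric-1-1437575782/259f61ae-02a5-4a75-be50-68f27054a7b2-1024 "
                + "at backtype.storm.util$wrap_in_runtime.invoke(util.clj:44)";
        System.out.println(lostHeartbeatPath(sample).orElse("no heartbeat loss found"));
    }
}
```

A match only says that the heartbeat write failed. It does not say why the ZooKeeper connection was lost, so it is a starting point for triage rather than a diagnosis.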

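
On the maxTask versus parallelism hint question raised at the top of the thread, a rough model may help. It assumes that "maxTask" refers to the topology.max.task.parallelism setting, which the thread does not confirm. As far as I understand Storm, the parallelism hint sets the number of executors (threads) for a component, the task count defaults to the hint unless it is set explicitly, and the max task parallelism setting caps the task count per component. The sketch below is a simplified illustration of that relationship in plain Java, not Storm's actual code, and all names in it are hypothetical.

```java
import java.util.OptionalInt;

/** Simplified, hypothetical model of executor and task counts for one component. */
final class TaskParallelismSketch {

    /**
     * Task count for a component: the explicit task count if one is set,
     * otherwise the parallelism hint, capped by maxTaskParallelism when present.
     */
    static int effectiveTasks(int parallelismHint, OptionalInt numTasks,
                              OptionalInt maxTaskParallelism) {
        int tasks = numTasks.orElse(parallelismHint);
        return maxTaskParallelism.isPresent()
                ? Math.min(tasks, maxTaskParallelism.getAsInt())
                : tasks;
    }

    /** Executors (threads) cannot usefully exceed the number of tasks they run. */
    static int effectiveExecutors(int parallelismHint, int tasks) {
        return Math.min(parallelismHint, tasks);
    }

    public static void main(String[] args) {
        // Values from the thread: hint up to 20, maxTask of 90 (interpretation assumed).
        int tasks = effectiveTasks(20, OptionalInt.empty(), OptionalInt.of(90));
        int executors = effectiveExecutors(20, tasks);
        System.out.println("tasks=" + tasks + " executors=" + executors);

        // A lower cap shrinks both the task count and the executor count.
        int cappedTasks = effectiveTasks(20, OptionalInt.empty(), OptionalInt.of(4));
        System.out.println("tasks=" + cappedTasks
                + " executors=" + effectiveExecutors(20, cappedTasks));
    }
}
```

Under this model the cap applies per component, so with 137 Trident-generated bolts the total number of threads in a single worker depends on both settings together. Whether that is what overloaded the worker in this case is not established by the thread.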