[ https://issues.apache.org/jira/browse/STORM-404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14313819#comment-14313819 ]
Sean Zhong commented on STORM-404: ---------------------------------- [~danielschonfeld], please take a look at patch at STORM-329 https://github.com/apache/storm/pull/268 > Worker on one machine crashes due to a failure of another worker on another > machine > ----------------------------------------------------------------------------------- > > Key: STORM-404 > URL: https://issues.apache.org/jira/browse/STORM-404 > Project: Apache Storm > Issue Type: Sub-task > Affects Versions: 0.9.2-incubating > Reporter: Itai Frenkel > > I have two workers (one on each machine). The first worker(10.30.206.125) had > a problem starting (could not find Nimbus host), however the second worker > crashed too since it could not connect to the first worker. > This looks like a cascading failure, which seems like a bug. > 2014-07-15 17:43:32 b.s.m.n.Client [INFO] Reconnect started for > Netty-Client-ip-10-30-206-125.ec2.internal/10.30.206.125:6700... [17] > 2014-07-15 17:43:33 b.s.m.n.Client [INFO] Reconnect started for > Netty-Client-ip-10-30-206-125.ec2.internal/10.30.206.125:6700... [18] > 2014-07-15 17:43:34 b.s.m.n.Client [INFO] Reconnect started for > Netty-Client-ip-10-30-206-125.ec2.internal/10.30.206.125:6700... [19] > 2014-07-15 17:43:35 b.s.m.n.Client [INFO] Reconnect started for > Netty-Client-ip-10-30-206-125.ec2.internal/10.30.206.125:6700... [20] > 2014-07-15 17:43:36 b.s.m.n.Client [INFO] Reconnect started for > Netty-Client-ip-10-30-206-125.ec2.internal/10.30.206.125:6700... [21] > 2014-07-15 17:43:37 b.s.m.n.Client [INFO] Reconnect started for > Netty-Client-ip-10-30-206-125.ec2.internal/10.30.206.125:6700... [22] > 2014-07-15 17:43:38 b.s.m.n.Client [INFO] Reconnect started for > Netty-Client-ip-10-30-206-125.ec2.internal/10.30.206.125:6700... [23] > 2014-07-15 17:43:39 b.s.m.n.Client [INFO] Reconnect started for > Netty-Client-ip-10-30-206-125.ec2.internal/10.30.206.125:6700... [24] > 2014-07-15 17:43:40 b.s.m.n.Client [INFO] Reconnect started for > Netty-Client-ip-10-30-206-125.ec2.internal/10.30.206.125:6700... [25] > 2014-07-15 17:43:41 b.s.m.n.Client [INFO] Reconnect started for > Netty-Client-ip-10-30-206-125.ec2.internal/10.30.206.125:6700... [26] > 2014-07-15 17:43:42 b.s.m.n.Client [INFO] Reconnect started for > Netty-Client-ip-10-30-206-125.ec2.internal/10.30.206.125:6700... [27] > 2014-07-15 17:43:43 b.s.m.n.Client [INFO] Reconnect started for > Netty-Client-ip-10-30-206-125.ec2.internal/10.30.206.125:6700... [28] > 2014-07-15 17:43:44 b.s.m.n.Client [INFO] Reconnect started for > Netty-Client-ip-10-30-206-125.ec2.internal/10.30.206.125:6700... [29] > 2014-07-15 17:43:45 b.s.m.n.Client [INFO] Reconnect started for > Netty-Client-ip-10-30-206-125.ec2.internal/10.30.206.125:6700... [30] > 2014-07-15 17:43:46 b.s.m.n.Client [INFO] Closing Netty Client > Netty-Client-ip-10-30-206-125.ec2.internal/10.30.206.125:6700 > 2014-07-15 17:43:46 b.s.m.n.Client [INFO] Waiting for pending batchs to be > sent with Netty-Client-ip-10-30-206-125.ec2.internal/10.30.206.125:6700..., > timeout: 600000ms, pendings: 0 > 2014-07-15 17:43:46 b.s.util [ERROR] Async loop died! > java.lang.RuntimeException: java.lang.RuntimeException: Client is being > closed, and does not take requests any more > at > backtype.storm.utils.DisruptorQueue.consumeBatchToCursor(DisruptorQueue.java:128) > ~[storm-core-0.9.2-incubating.jar:0.9.2-incubating] > at > backtype.storm.utils.DisruptorQueue.consumeBatchWhenAvailable(DisruptorQueue.java:99) > ~[storm-core-0.9.2-incubating.jar:0.9.2-incubating] > at > backtype.storm.disruptor$consume_batch_when_available.invoke(disruptor.clj:80) > ~[storm-core-0.9.2-incubating.jar:0.9.2-incubating] > at > backtype.storm.disruptor$consume_loop_STAR_$fn__758.invoke(disruptor.clj:94) > ~[storm-core-0.9.2-incubating.jar:0.9.2-incubating] > at backtype.storm.util$async_loop$fn__457.invoke(util.clj:431) > ~[storm-core-0.9.2-incubating.jar:0.9.2-incubating] > at clojure.lang.AFn.run(AFn.java:24) [clojure-1.5.1.jar:na] > at java.lang.Thread.run(Thread.java:745) [na:1.7.0_60] > Caused by: java.lang.RuntimeException: Client is being closed, and does not > take requests any more > at backtype.storm.messaging.netty.Client.send(Client.java:194) > ~[storm-core-0.9.2-incubating.jar:0.9.2-incubating] > at backtype.storm.utils.TransferDrainer.send(TransferDrainer.java:54) > ~[storm-core-0.9.2-incubating.jar:0.9.2-incubating] > at > backtype.storm.daemon.worker$mk_transfer_tuples_handler$fn__5927$fn__5928.invoke(worker.clj:322) > ~[storm-core-0.9.2-incubating.jar:0.9.2-incubating] > at > backtype.storm.daemon.worker$mk_transfer_tuples_handler$fn__5927.invoke(worker.clj:320) > ~[storm-core-0.9.2-incubating.jar:0.9.2-incubating] > at > backtype.storm.disruptor$clojure_handler$reify__745.onEvent(disruptor.clj:58) > ~[storm-core-0.9.2-incubating.jar:0.9.2-incubating] > at > backtype.storm.utils.DisruptorQueue.consumeBatchToCursor(DisruptorQueue.java:125) > ~[storm-core-0.9.2-incubating.jar:0.9.2-incubating] > ... 6 common frames omitted > 2014-07-15 17:43:46 b.s.util [INFO] Halting process: ("Async loop died!") -- This message was sent by Atlassian JIRA (v6.3.4#6332)