[ https://issues.apache.org/jira/browse/STORM-1560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16326055#comment-16326055 ]
Alexandre Vermeerbergen commented on STORM-1560:
------------------------------------------------

Hello,

FYI - We see the same error about once a month in our production Storm cluster.
Our workaround is to monitor our topologies through the Storm web services and restart the affected topologies when we detect the problem (a poor man's bypass...).

Best regards,
Alexandre Vermeerbergen

> Topology stops processing after Netty catches/swallows Throwable
> ----------------------------------------------------------------
>
>                 Key: STORM-1560
>                 URL: https://issues.apache.org/jira/browse/STORM-1560
>             Project: Apache Storm
>          Issue Type: Bug
>          Components: storm-core
>    Affects Versions: 1.0.0
>            Reporter: P. Taylor Goetz
>            Priority: Major
>         Attachments: fix-lockup.patch
>
>
> In some scenarios, Netty connection problems can leave a topology in an
> unrecoverable state. The likely culprit is the Netty {{HashedWheelTimer}}
> class, which contains the following code:
> {code}
> public void expire() {
>     if(this.compareAndSetState(0, 2)) {
>         try {
>             this.task.run(this);
>         } catch (Throwable var2) {
>             if(HashedWheelTimer.logger.isWarnEnabled()) {
>                 HashedWheelTimer.logger.warn("An exception was thrown by " + TimerTask.class.getSimpleName() + '.', var2);
>             }
>         }
>     }
> }
> {code}
> The exception being swallowed can be seen below:
> {code}
> 2016-02-18 08:46:59.116 o.a.s.m.n.Client [INFO] closing Netty Client Netty-Client-/192.168.202.6:6701
> 2016-02-18 08:46:59.173 o.a.s.m.n.Client [INFO] waiting up to 600000 ms to send 0 pending messages to Netty-Client-/192.168.202.6:6701
> 2016-02-18 08:46:59.271 STDIO [ERROR] Feb 18, 2016 8:46:59 AM org.apache.storm.shade.org.jboss.netty.util.HashedWheelTimer
> WARNING: An exception was thrown by TimerTask.
> java.lang.RuntimeException: Giving up to scheduleConnect to Netty-Client-/192.168.202.6:6701 after 44 failed attempts. 3 messages were lost
>     at org.apache.storm.messaging.netty.Client$Connect.run(Client.java:573)
>     at org.apache.storm.shade.org.jboss.netty.util.HashedWheelTimer$HashedWheelTimeout.expire(HashedWheelTimer.java:546)
>     at org.apache.storm.shade.org.jboss.netty.util.HashedWheelTimer$Worker.notifyExpiredTimeouts(HashedWheelTimer.java:446)
>     at org.apache.storm.shade.org.jboss.netty.util.HashedWheelTimer$Worker.run(HashedWheelTimer.java:395)
>     at org.apache.storm.shade.org.jboss.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
>     at java.lang.Thread.run(Thread.java:745)
> {code}
> The Netty client then never recovers, and the following messages repeat forever:
> {code}
> 2016-02-18 09:42:56.251 o.a.s.m.n.Client [ERROR] discarding 1 messages because the Netty client to Netty-Client-/192.168.202.6:6701 is being closed
> 2016-02-18 09:43:25.248 o.a.s.m.n.Client [ERROR] discarding 1 messages because the Netty client to Netty-Client-/192.168.202.6:6701 is being closed
> 2016-02-18 09:43:55.248 o.a.s.m.n.Client [ERROR] discarding 1 messages because the Netty client to Netty-Client-/192.168.202.6:6701 is being closed
> 2016-02-18 09:43:55.752 o.a.s.m.n.Client [ERROR] discarding 2 messages because the Netty client to Netty-Client-/192.168.202.6:6701 is being closed
> 2016-02-18 09:43:56.252 o.a.s.m.n.Client [ERROR] discarding 1 messages because the Netty client to Netty-Client-/192.168.202.6:6701 is being closed
> 2016-02-18 09:44:25.249 o.a.s.m.n.Client [ERROR] discarding 1 messages because the Netty client to Netty-Client-/192.168.202.6:6701 is being closed
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
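
For readers who want to reproduce the failure mode described in the issue without a cluster, here is a minimal sketch of it. Everything in it is hypothetical: {{FakeClient}}, {{giveUp}}, {{simulate}} and this standalone {{expire}} are stand-ins written for illustration, not Storm's {{Client}} or Netty's {{HashedWheelTimer}}. It assumes, based only on the order of the log lines above, that the reconnect task marks the client as closing before it throws.

```java
import java.util.concurrent.atomic.AtomicInteger;

public class SwallowedTimerFailure {

    /** Hypothetical stand-in for a messaging client; NOT Storm's Client class. */
    static final class FakeClient {
        volatile boolean closing = false;
        final AtomicInteger discarded = new AtomicInteger();

        /** Reconnect task that gives up: marks the client as closing, then throws. */
        void giveUp(int failedAttempts) {
            closing = true;
            throw new RuntimeException("Giving up after " + failedAttempts + " failed attempts");
        }

        /** Once the client is closing, every message is dropped and counted. */
        void send(String message) {
            if (closing) {
                discarded.incrementAndGet();
            }
        }
    }

    /** Same shape as the quoted expire(): run the task, catch Throwable, only log it. */
    static boolean expire(Runnable task) {
        try {
            task.run();
            return true;
        } catch (Throwable t) {
            // The failure stops here; nothing upstream ever learns about it.
            System.err.println("An exception was thrown by TimerTask: " + t.getMessage());
            return false;
        }
    }

    /** Runs the scenario and returns the client so that callers can inspect it. */
    static FakeClient simulate(int messagesAfterFailure) {
        FakeClient client = new FakeClient();
        expire(() -> client.giveUp(44));
        for (int i = 0; i < messagesAfterFailure; i++) {
            client.send("tuple-" + i);
        }
        return client;
    }

    public static void main(String[] args) {
        FakeClient client = simulate(3);
        // prints "closing=true, discarded=3"
        System.out.println("closing=" + client.closing + ", discarded=" + client.discarded.get());
    }
}
```

The point of the sketch is that the caller of {{expire}} keeps running normally after the task has failed, so the only visible symptoms are the warning and the growing count of discarded messages, which matches the repeating "discarding N messages" lines in the report.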