[ 
https://issues.apache.org/jira/browse/STORM-1560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16200404#comment-16200404
 ] 

Robert Joseph Evans commented on STORM-1560:
--------------------------------------------

[~saurav689],

In 1.1.1 the only time that "Giving up to schedule" is thrown is when the 
netty client is closing.  The only time that it is closing is when the 
scheduling has changed and we no longer have a need for that client, which is 
what you have in your logs with the refresh-connections-timer.
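
As a rough illustration of that sequence, here is a simplified sketch. It is 
not Storm's actual Client code: the class, the closing flag, connectTask() and 
expire() are made-up names, and "host:port" is a placeholder. It assumes only 
what is described above: the reconnect task throws once the client is closing, 
and the timer that runs it catches the Throwable and logs a warning.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

public class ClosingClientSketch {
    // Hypothetical stand-in for "the netty client is closing".
    static final AtomicBoolean closing = new AtomicBoolean(false);
    // Counts exceptions swallowed by the timer-like wrapper below.
    static final AtomicInteger swallowed = new AtomicInteger(0);

    // Hypothetical reconnect task: it gives up only once the client is closing.
    static Runnable connectTask(String target, int attempts) {
        return () -> {
            if (closing.get()) {
                throw new RuntimeException("Giving up to scheduleConnect to "
                        + target + " after " + attempts + " failed attempts.");
            }
            // A real client would attempt the reconnect here.
        };
    }

    // Mimics a timer that logs and swallows whatever the task throws.
    static void expire(Runnable task) {
        try {
            task.run();
        } catch (Throwable t) {
            swallowed.incrementAndGet();
            System.out.println("WARNING: An exception was thrown by TimerTask. "
                    + t.getMessage());
        }
    }

    public static void main(String[] args) {
        Runnable task = connectTask("host:port", 3);
        expire(task);        // client still open: nothing is thrown
        closing.set(true);   // scheduling changed, client is being closed
        expire(task);        // task throws, the wrapper swallows and logs it
        System.out.println("swallowed=" + swallowed.get());
    }
}
```

In this sketch the exception only appears after the closing flag is set, which 
matches the point above: the warning is a side effect of closing a client that 
is no longer needed.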

From the logs it looks like your topology had a worker scheduled to be on 
192.168.2.195:6702, but that worker never came up for some reason.  You didn't 
include that worker's logs, so I cannot tell why.  After some time nimbus 
rescheduled the worker to be on a different host/port.  At that point you got 
an exception while we were closing the client.

The later logs for Netty-server-localhost-6702-worker-1 indicate that a worker 
that was connected to this worker broke the connection.  They are most likely 
not related to the first one.

Did your topology eventually recover?  Did you ever look at the logs for 
192.168.2.195:6702 to try and see why it didn't come up?  In 2.x we have added 
better logging, so hopefully we would be able to see which worker 
disconnected from the server.


> Topology stops processing after Netty catches/swallows Throwable
> ----------------------------------------------------------------
>
>                 Key: STORM-1560
>                 URL: https://issues.apache.org/jira/browse/STORM-1560
>             Project: Apache Storm
>          Issue Type: Bug
>          Components: storm-core
>    Affects Versions: 1.0.0
>            Reporter: P. Taylor Goetz
>         Attachments: fix-lockup.patch
>
>
> In some scenarios, netty connection problems can leave a topology in an 
> unrecoverable state. The likely culprit is the Netty {{HashedWheelTimer}} 
> class that contains the following code:
> {code}
>         public void expire() {
>             if(this.compareAndSetState(0, 2)) {
>                 try {
>                     this.task.run(this);
>                 } catch (Throwable var2) {
>                     if(HashedWheelTimer.logger.isWarnEnabled()) {
>                         HashedWheelTimer.logger.warn("An exception was thrown by " + TimerTask.class.getSimpleName() + '.', var2);
>                     }
>                 }
>             }
>         }
> {code}
> The exception being swallowed can be seen below:
> {code}
> 2016-02-18 08:46:59.116 o.a.s.m.n.Client [INFO] closing Netty Client 
> Netty-Client-/192.168.202.6:6701
> 2016-02-18 08:46:59.173 o.a.s.m.n.Client [INFO] waiting up to 600000 ms to 
> send 0 pending messages to Netty-Client-/192.168.202.6:6701
> 2016-02-18 08:46:59.271 STDIO [ERROR] Feb 18, 2016 8:46:59 AM 
> org.apache.storm.shade.org.jboss.netty.util.HashedWheelTimer
> WARNING: An exception was thrown by TimerTask.
> java.lang.RuntimeException: Giving up to scheduleConnect to 
> Netty-Client-/192.168.202.6:6701 after 44 failed attempts. 3 messages were 
> lost
>       at org.apache.storm.messaging.netty.Client$Connect.run(Client.java:573)
>       at 
> org.apache.storm.shade.org.jboss.netty.util.HashedWheelTimer$HashedWheelTimeout.expire(HashedWheelTimer.java:546)
>       at 
> org.apache.storm.shade.org.jboss.netty.util.HashedWheelTimer$Worker.notifyExpiredTimeouts(HashedWheelTimer.java:446)
>       at 
> org.apache.storm.shade.org.jboss.netty.util.HashedWheelTimer$Worker.run(HashedWheelTimer.java:395)
>       at 
> org.apache.storm.shade.org.jboss.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
>       at java.lang.Thread.run(Thread.java:745)
> {code}
> The netty client then never recovers, and the following messages repeat forever:
> {code}
> 2016-02-18 09:42:56.251 o.a.s.m.n.Client [ERROR] discarding 1 messages 
> because the Netty client to Netty-Client-/192.168.202.6:6701 is being closed
> 2016-02-18 09:43:25.248 o.a.s.m.n.Client [ERROR] discarding 1 messages 
> because the Netty client to Netty-Client-/192.168.202.6:6701 is being closed
> 2016-02-18 09:43:55.248 o.a.s.m.n.Client [ERROR] discarding 1 messages 
> because the Netty client to Netty-Client-/192.168.202.6:6701 is being closed
> 2016-02-18 09:43:55.752 o.a.s.m.n.Client [ERROR] discarding 2 messages 
> because the Netty client to Netty-Client-/192.168.202.6:6701 is being closed
> 2016-02-18 09:43:56.252 o.a.s.m.n.Client [ERROR] discarding 1 messages 
> because the Netty client to Netty-Client-/192.168.202.6:6701 is being closed
> 2016-02-18 09:44:25.249 o.a.s.m.n.Client [ERROR] discarding 1 messages 
> because the Netty client to Netty-Client-/192.168.202.6:6701 is being closed
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
