[ https://issues.apache.org/jira/browse/STORM-537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14191839#comment-14191839 ]
ASF GitHub Bot commented on STORM-537:
--------------------------------------

Github user HeartSaVioR commented on a diff in the pull request:

    https://github.com/apache/storm/pull/304#discussion_r19668982

    --- Diff: storm-core/src/jvm/backtype/storm/messaging/netty/Client.java ---
    @@ -153,6 +153,7 @@ private synchronized void connect() {
                     if (!future.isSuccess()) {
                         if (null != current) {
                             current.close();
    +                        channel = null;
    --- End diff --

    I think it's more natural to move it to line 143 or 144.
    Your PR fixes the case where channel is not null but not connected, and we cannot connect later.
    Moving the assignment to just before the while statement helps to explain the loop's precondition.
    And actually we don't need to set channel to null on each failed retry, because if a retry succeeds, we assign the actual connection to channel and "break" out of the loop.

> A worker reconnects infinitely to another dead worker
> -----------------------------------------------------
>
>                 Key: STORM-537
>                 URL: https://issues.apache.org/jira/browse/STORM-537
>             Project: Apache Storm
>          Issue Type: Bug
>    Affects Versions: 0.9.3
>            Reporter: Sergey Tryuber
>
> We're using 0.9.3-rc1. Most probably this wrong behavior was introduced as a side effect of STORM-409. When I kill a worker, another worker starts to print messages like:
> {noformat}
> 2014-10-20 11:45:03 b.s.m.n.Client [INFO] Reconnect started for Netty-Client-<HOST>:4706... [0]
> 2014-10-20 11:45:03 b.s.m.n.Client [INFO] Reconnect started for Netty-Client-<HOST>:4706... [1]
> 2014-10-20 11:45:03 b.s.m.n.Client [INFO] Reconnect started for Netty-Client-<HOST>:4706... [2]
> ..... so on
> {noformat}
> Then it reaches the default 300 max_retries and starts the cycle again:
> {noformat}
> 2014-10-20 11:54:38 b.s.m.n.Client [INFO] connection established to a remote host Netty-Client-<HOST>:4706, [id: 0xec088412, /<HOST>:39795 :> <HOST>:4706]
> 2014-10-20 11:54:38 b.s.m.n.Client [INFO] Reconnect started for Netty-Client-<HOST>:4706... [0]
> 2014-10-20 11:54:38 b.s.m.n.Client [INFO] Reconnect started for Netty-Client-<HOST>:4706... [1]
> 2014-10-20 11:54:38 b.s.m.n.Client [INFO] Reconnect started for Netty-Client-<HOST>:4706... [2]
> {noformat}
> And so on infinitely...
> The issue is most probably in the backtype.storm.messaging.netty.Client#connect method, in the following place, which determines whether we give up on reconnection:
> {code}
>         if (null != channel) {
>             LOG.info("connection established to a remote host " + name() + ", " + channel.toString());
>             channelRef.set(channel);
>         } else {
>             close();
>             throw new RuntimeException("Remote address is not reachable. We will close this client " + name());
>         }
> {code}
> I guess (not tried yet) that the _channel_ object is not _null_ if this is a real reconnection. So the method returns a _channel_ object and then reconnection starts again and again.
> This might be fixed by explicitly adding *current = null;* to the following code block of the same method:
> {code}
>                 if (!future.isSuccess()) {
>                     if (null != current) {
>                         current.close();
>                     }
>                 }
> {code}

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
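
The review suggestion above (clear the stale channel once, before the retry loop, rather than on every failed attempt) can be illustrated with a minimal, self-contained sketch. This is not the Storm Client code: ReconnectSketch, FakeChannel, and the BooleanSupplier standing in for a connect attempt are hypothetical names introduced only for illustration, and the real method also sleeps between retries and logs, which is omitted here.

```java
import java.util.function.BooleanSupplier;

// Simplified model of the retry loop discussed in the review comment.
// All names here are hypothetical; this is not backtype.storm.messaging.netty.Client.
class ReconnectSketch {

    /** Hypothetical stand-in for a Netty channel; only tracks connectivity. */
    static final class FakeChannel {
        private final boolean connected;

        FakeChannel(boolean connected) {
            this.connected = connected;
        }

        boolean isConnected() {
            return connected;
        }
    }

    /**
     * Returns a connected channel, or null if every attempt fails.
     * The attempt supplier (an assumption of this sketch) returns true
     * when a connect attempt succeeds.
     */
    static FakeChannel connect(FakeChannel previous, int maxRetries, BooleanSupplier attempt) {
        if (previous != null && previous.isConnected()) {
            return previous; // still usable, nothing to do
        }

        // Loop precondition: we hold no usable channel. Resetting once here
        // means a stale, disconnected channel can never be returned after
        // all retries fail, so the caller's null check can give up.
        FakeChannel channel = null;

        for (int tried = 0; tried <= maxRetries; tried++) {
            if (attempt.getAsBoolean()) {
                channel = new FakeChannel(true); // success: keep it and stop retrying
                break;
            }
            // Failure: channel is still null, so no per-retry reset is needed.
        }
        return channel;
    }

    public static void main(String[] args) {
        FakeChannel stale = new FakeChannel(false);
        FakeChannel result = connect(stale, 3, () -> false);
        System.out.println(result == null ? "gave up" : "got a channel"); // prints "gave up"
    }
}
```

With the reset placed before the loop, a non-null but disconnected channel no longer survives a run of failed retries, which is the situation the issue description suspects of restarting the reconnect cycle.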