[ https://issues.apache.org/jira/browse/STORM-3055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16463945#comment-16463945 ]
zhangbiao commented on STORM-3055:
----------------------------------

The problem is caused by the context's connection cache. For example, suppose the supervisor with id 'a' restarts with a corrupt local version store; it will then generate another id, say 'b'. Once 'b' is up, nimbus will assign some tasks to 'b'. If the old assignment is [a:6700, c:6700] and the new assignment is [b:6700, c:6700], then the task on c:6700 will first connect to [b:6700] and then close and remove the connection to [a:6700]. Since a and b have the same IP, b:6700 shares the connection of a:6700, but that same connection is then closed by the remove step, which leaves b:6700 with a closed connection that is never refreshed.

> never refresh connection
> ------------------------
>
>                 Key: STORM-3055
>                 URL: https://issues.apache.org/jira/browse/STORM-3055
>             Project: Apache Storm
>          Issue Type: Bug
>          Components: storm-core
>    Affects Versions: 1.1.1
>            Reporter: zhangbiao
>            Priority: Major
>
> In our environment, some workers' connections to other workers are closed and never reconnect.
> The log shows:
> 2018-05-02 10:28:49.302 o.a.s.m.n.Client Thread-90-disruptor-worker-transfer-queue [ERROR] discarding 1 messages because the Netty client to Netty-Client-/192.168.31.1:6800 is being closed
> ......
> 2018-05-02 11:00:29.540 o.a.s.m.n.Client Thread-90-disruptor-worker-transfer-queue [ERROR] discarding 1 messages because the Netty client to Netty-Client-/192.168.31.1:6800 is being closed
> The log shows that it can never reconnect again. I can only fix it by restarting the topology.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
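
The cache behavior described in the comment above can be reproduced in a toy model. The following is a minimal Python sketch, not Storm's code: `Client`, `Context`, `refresh`, and the address 10.0.0.1 are hypothetical names. It assumes the context's cache is keyed by host:port, while the worker's own table is keyed by supervisor id and port.

```python
class Client:
    """Hypothetical stand-in for a network client; only tracks closed state."""

    def __init__(self, host, port):
        self.host = host
        self.port = port
        self.closed = False

    def close(self):
        self.closed = True


class Context:
    """Hypothetical connection cache keyed by "host:port" (an assumption)."""

    def __init__(self):
        self.cache = {}

    def connect(self, host, port):
        key = f"{host}:{port}"
        # Reuse the cached client when the same host:port was seen before.
        if key not in self.cache:
            self.cache[key] = Client(host, port)
        return self.cache[key]


def refresh(context, node_to_host, current, needed):
    """Connect to new endpoints first, then close the ones no longer needed.

    current: dict mapping (node_id, port) -> Client
    needed:  set of (node_id, port) from the new assignment
    """
    updated = dict(current)
    for node, port in needed - set(current):
        updated[(node, port)] = context.connect(node_to_host[node], port)
    for endpoint in set(current) - needed:
        # If 'a' and 'b' resolve to the same host, this closes the very
        # client object that was just handed out for 'b'.
        updated.pop(endpoint).close()
    return updated


if __name__ == "__main__":
    ctx = Context()
    # Supervisor ids 'a' (old) and 'b' (new) live on the same machine.
    hosts = {"a": "10.0.0.1", "b": "10.0.0.1"}
    current = {("a", 6700): ctx.connect("10.0.0.1", 6700)}
    updated = refresh(ctx, hosts, current, {("b", 6700)})
    print(updated[("b", 6700)].closed)  # prints True
```

In this model the entry for b:6700 ends up pointing at an already closed client, which matches the "is being closed" messages repeating in the log.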