[ https://issues.apache.org/jira/browse/STORM-1115?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14961211#comment-14961211 ]
ASF GitHub Bot commented on STORM-1115:
---------------------------------------

Github user danielschonfeld commented on the pull request:

https://github.com/apache/storm/pull/802#issuecomment-148807939

@Parth-Brahmbhatt that's a tricky one. I haven't found a way to reproduce but leaving nimbus work for a day or so with number of nimbuses > 1 and a good load on the system we see the number of ZK nodes/keys go up to (X*nimbuses)+1 under /leader-lock. When that happens, we have problems trying to do anything as no nimbus thinks it's the leader which is exactly what's described in CURATOR-202.

If you can think of a way to disconnect the ZK connection but reconnect using the same session programmatically you'll have a reproduction of this bug as this always starts showing up after something like the following log lines:

```
2015-10-16 18:16:13 o.a.s.s.o.a.c.f.s.ConnectionStateManager [INFO] State change: RECONNECTED
2015-10-16 18:16:14 o.a.s.s.o.a.z.ClientCnxn [INFO] Client session timed out, have not heard from server in 6668ms for sessionid 0x1506caf14ab005f, closing socket connection and attempting reconnect
2015-10-16 18:16:14 o.a.s.s.o.a.z.ClientCnxn [INFO] Client session timed out, have not heard from server in 6672ms for sessionid 0x1506caf14ab0060, closing socket connection and attempting reconnect
2015-10-16 18:16:15 o.a.s.s.o.a.z.ClientCnxn [INFO] Opening socket connection to server 10.101.1.2/10.101.1.2:2181. Will not attempt to authenticate using SASL (unknown error)
2015-10-16 18:16:15 o.a.s.s.o.a.z.ClientCnxn [INFO] Socket connection established to 10.101.1.2/10.101.1.2:2181, initiating session
2015-10-16 18:16:15 o.a.s.s.o.a.z.ClientCnxn [INFO] Session establishment complete on server 10.101.1.2/10.101.1.2:2181, sessionid = 0x1506caf14ab005f, negotiated timeout = 20000
2015-10-16 18:16:15 o.a.s.s.o.a.c.f.s.ConnectionStateManager [INFO] State change: RECONNECTED
```

> Stale leader-lock key effectively bans all nodes from becoming leaders
> ----------------------------------------------------------------------
>
> Key: STORM-1115
> URL: https://issues.apache.org/jira/browse/STORM-1115
> Project: Apache Storm
> Issue Type: Bug
> Affects Versions: 0.11.0
> Reporter: Daniel Schonfeld
>
> I believe this curator bug is what's in play causing the above described
> situation.
> https://issues.apache.org/jira/browse/CURATOR-202
> Whenever we were hit by this bug we'd start seeing problems in submitting
> topologies to nimbus, as well as having problems
> activating/deactivating/killing topologies. Basically any topology that
> utilizes the `is-leader` macro, since no nimbus believes itself to be the
> leader based on LeaderLatch.hasLeadership()

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)