[
https://issues.apache.org/jira/browse/STORM-1115?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14961211#comment-14961211
]
ASF GitHub Bot commented on STORM-1115:
---------------------------------------
Github user danielschonfeld commented on the pull request:
https://github.com/apache/storm/pull/802#issuecomment-148807939
@Parth-Brahmbhatt that's a tricky one. I haven't found a way to reproduce it
on demand, but if we leave nimbus running for a day or so with more than one
nimbus and a good load on the system, we see the number of ZK nodes/keys under
/leader-lock go up to (X*nimbuses)+1. When that happens, we have problems
trying to do anything, as no nimbus thinks it's the leader, which is exactly
what's described in CURATOR-202.
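To make the failure mode concrete, here is a toy model of lowest-sequence-wins election, which is how I understand LeaderLatch to decide leadership. It is not Curator code: the class name, the method names and the `latch-NNNNNNNNNN` child names are made up for illustration, and the real node names under /leader-lock look different.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashSet;
import java.util.List;
import java.util.Optional;
import java.util.Set;

/** Toy model of lowest-sequence-wins leader election; hypothetical, not Curator code. */
class LeaderLockModel {

    /** Parses the numeric suffix after the last '-' in a child name such as "latch-0000000007". */
    static long sequenceOf(String childName) {
        return Long.parseLong(childName.substring(childName.lastIndexOf('-') + 1));
    }

    /**
     * Returns the lowest-sequence child if a live nimbus owns it,
     * or empty when that child is stale (owned by no live nimbus).
     */
    static Optional<String> liveLeader(List<String> children, Set<String> ownedByLiveNimbus) {
        if (children.isEmpty()) {
            return Optional.empty();
        }
        List<String> sorted = new ArrayList<>(children);
        sorted.sort(Comparator.comparingLong(LeaderLockModel::sequenceOf));
        String lowest = sorted.get(0);
        return ownedByLiveNimbus.contains(lowest) ? Optional.of(lowest) : Optional.empty();
    }

    public static void main(String[] args) {
        // Assumed scenario: node 7 was left behind by an earlier latch instance.
        List<String> children = Arrays.asList(
            "latch-0000000012", "latch-0000000007", "latch-0000000013");
        Set<String> live = new HashSet<>(Arrays.asList(
            "latch-0000000012", "latch-0000000013"));
        System.out.println("leader present: " + liveLeader(children, live).isPresent());
    }
}
```

In this model a single stale child with the lowest sequence number is enough to keep every live nimbus out of leadership, which matches what we observe.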
If you can think of a way to drop the ZK connection and then reconnect using
the same session programmatically, you'll have a reproduction of this bug, as
it always starts showing up after something like the following log lines:
```
2015-10-16 18:16:13 o.a.s.s.o.a.c.f.s.ConnectionStateManager [INFO] State change: RECONNECTED
2015-10-16 18:16:14 o.a.s.s.o.a.z.ClientCnxn [INFO] Client session timed out, have not heard from server in 6668ms for sessionid 0x1506caf14ab005f, closing socket connection and attempting reconnect
2015-10-16 18:16:14 o.a.s.s.o.a.z.ClientCnxn [INFO] Client session timed out, have not heard from server in 6672ms for sessionid 0x1506caf14ab0060, closing socket connection and attempting reconnect
2015-10-16 18:16:15 o.a.s.s.o.a.z.ClientCnxn [INFO] Opening socket connection to server 10.101.1.2/10.101.1.2:2181. Will not attempt to authenticate using SASL (unknown error)
2015-10-16 18:16:15 o.a.s.s.o.a.z.ClientCnxn [INFO] Socket connection established to 10.101.1.2/10.101.1.2:2181, initiating session
2015-10-16 18:16:15 o.a.s.s.o.a.z.ClientCnxn [INFO] Session establishment complete on server 10.101.1.2/10.101.1.2:2181, sessionid = 0x1506caf14ab005f, negotiated timeout = 20000
2015-10-16 18:16:15 o.a.s.s.o.a.c.f.s.ConnectionStateManager [INFO] State change: RECONNECTED
```
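For anyone who wants to check their own nimbus logs for this pattern, a rough sketch of a scanner follows. It reports session ids that show a client-side timeout and are later re-established under the same id. The class and method names are hypothetical, the regexes are written only against the message wording shown above, and it assumes each log entry occupies a single line.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: finds ZK session ids that timed out and then resumed with the same id. */
class SameSessionReconnectScan {

    // Patterns assume the ClientCnxn message wording shown in the log excerpt above.
    private static final Pattern TIMED_OUT =
        Pattern.compile("Client session timed out.*for sessionid (0x[0-9a-fA-F]+)");
    private static final Pattern ESTABLISHED =
        Pattern.compile("Session establishment complete.*sessionid = (0x[0-9a-fA-F]+)");

    /** Returns session ids, in log order, that were re-established after a timeout. */
    static List<String> resumedSessions(List<String> lines) {
        Set<String> timedOut = new HashSet<>();
        List<String> resumed = new ArrayList<>();
        for (String line : lines) {
            Matcher m = TIMED_OUT.matcher(line);
            if (m.find()) {
                timedOut.add(m.group(1));
                continue;
            }
            m = ESTABLISHED.matcher(line);
            // Only count an establishment that follows a timeout for the same id.
            if (m.find() && timedOut.remove(m.group(1))) {
                resumed.add(m.group(1));
            }
        }
        return resumed;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
            "[INFO] Client session timed out, have not heard from server in 6668ms"
                + " for sessionid 0x1506caf14ab005f, closing socket connection and attempting reconnect",
            "[INFO] Session establishment complete on server 10.101.1.2/10.101.1.2:2181,"
                + " sessionid = 0x1506caf14ab005f, negotiated timeout = 20000");
        System.out.println(resumedSessions(lines));
    }
}
```

A hit from this scan only says the precondition occurred. Whether /leader-lock actually picked up an extra child still has to be checked in ZK.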
> Stale leader-lock key effectively bans all nodes from becoming leaders
> ----------------------------------------------------------------------
>
> Key: STORM-1115
> URL: https://issues.apache.org/jira/browse/STORM-1115
> Project: Apache Storm
> Issue Type: Bug
> Affects Versions: 0.11.0
> Reporter: Daniel Schonfeld
>
> I believe this Curator bug is what causes the situation described above:
> https://issues.apache.org/jira/browse/CURATOR-202
> Whenever we were hit by this bug we'd start seeing problems submitting
> topologies to nimbus, as well as problems
> activating/deactivating/killing topologies. Basically any operation that
> goes through the `is-leader` macro fails, since no nimbus believes itself
> to be the leader based on LeaderLatch.hasLeadership().