[jira] [Commented] (STORM-1115) Stale leader-lock key effectively bans all nodes from becoming leaders

ASF GitHub Bot (JIRA) Fri, 16 Oct 2015 12:05:05 -0700

    [ 
https://issues.apache.org/jira/browse/STORM-1115?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14961211#comment-14961211
 ]


ASF GitHub Bot commented on STORM-1115:
---------------------------------------

Github user danielschonfeld commented on the pull request:

    https://github.com/apache/storm/pull/802#issuecomment-148807939
  
    @Parth-Brahmbhatt that's a tricky one.  I haven't found a way to reproduce 
it on demand, but after leaving nimbus running for a day or so with more than 
one nimbus and a good load on the system, we see the number of ZK nodes/keys 
under /leader-lock go up to (X*nimbuses)+1.  When that happens, we have 
problems trying to do anything, as no nimbus thinks it's the leader, which is 
exactly what's described in CURATOR-202.
    
    If you can think of a way to programmatically drop the ZK connection and 
then reconnect using the same session, you'll have a reproduction of this bug, 
as it always starts showing up after log lines like the following:
    
    ```
    2015-10-16 18:16:13 o.a.s.s.o.a.c.f.s.ConnectionStateManager [INFO] State change: RECONNECTED
    2015-10-16 18:16:14 o.a.s.s.o.a.z.ClientCnxn [INFO] Client session timed out, have not heard from server in 6668ms for sessionid 0x1506caf14ab005f, closing socket connection and attempting reconnect
    2015-10-16 18:16:14 o.a.s.s.o.a.z.ClientCnxn [INFO] Client session timed out, have not heard from server in 6672ms for sessionid 0x1506caf14ab0060, closing socket connection and attempting reconnect
    2015-10-16 18:16:15 o.a.s.s.o.a.z.ClientCnxn [INFO] Opening socket connection to server 10.101.1.2/10.101.1.2:2181. Will not attempt to authenticate using SASL (unknown error)
    2015-10-16 18:16:15 o.a.s.s.o.a.z.ClientCnxn [INFO] Socket connection established to 10.101.1.2/10.101.1.2:2181, initiating session
    2015-10-16 18:16:15 o.a.s.s.o.a.z.ClientCnxn [INFO] Session establishment complete on server 10.101.1.2/10.101.1.2:2181, sessionid = 0x1506caf14ab005f, negotiated timeout = 20000
    2015-10-16 18:16:15 o.a.s.s.o.a.c.f.s.ConnectionStateManager [INFO] State change: RECONNECTED
    ```
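
The failure mode described in this comment can be illustrated with a small toy
model. This is only a sketch under stated assumptions, not Curator's or Storm's
actual code: it assumes leadership belongs to whichever participant tracks the
lowest-sequence child under the lock path, and that a reconnect within the same
session creates a new child without removing the old one. The class and method
names (`StaleLatchModel`, `join`, `reconnectSameSession`) are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

/** Toy model of a sequence-ordered leader lock. Not Curator's implementation. */
public class StaleLatchModel {
    // Children of a hypothetical /leader-lock path: sequence number -> creator.
    private final TreeMap<Integer, String> children = new TreeMap<>();
    // The single child each participant currently believes is its own.
    private final Map<String, Integer> trackedNode = new HashMap<>();
    private int nextSequence = 0;

    /** Participant creates a fresh child and tracks it, forgetting any earlier one. */
    public void join(String participant) {
        int sequence = nextSequence++;
        children.put(sequence, participant);
        trackedNode.put(participant, sequence);
    }

    /**
     * Assumed failure mode: the session never expired, so the old child is
     * still alive, yet the participant creates and tracks a new child.
     */
    public void reconnectSameSession(String participant) {
        join(participant);
    }

    /** Leader rule in this model: the tracked child must be the lowest-sequence child. */
    public boolean hasLeadership(String participant) {
        Integer mine = trackedNode.get(participant);
        return mine != null && !children.isEmpty() && children.firstKey().equals(mine);
    }

    /** Number of children currently under the modeled lock path. */
    public int childCount() {
        return children.size();
    }

    public static void main(String[] args) {
        StaleLatchModel lock = new StaleLatchModel();
        lock.join("nimbus-a");
        lock.join("nimbus-b");
        System.out.println("a leads before reconnect: " + lock.hasLeadership("nimbus-a"));

        lock.reconnectSameSession("nimbus-a");
        System.out.println("children: " + lock.childCount());
        System.out.println("a leads: " + lock.hasLeadership("nimbus-a"));
        System.out.println("b leads: " + lock.hasLeadership("nimbus-b"));
    }
}
```

In this model, a single stale child at the head of the list is enough to leave
every participant without leadership, which matches the symptom reported here.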


> Stale leader-lock key effectively bans all nodes from becoming leaders
> ----------------------------------------------------------------------
>
>                 Key: STORM-1115
>                 URL: https://issues.apache.org/jira/browse/STORM-1115
>             Project: Apache Storm
>          Issue Type: Bug
>    Affects Versions: 0.11.0
>            Reporter: Daniel Schonfeld
>
> I believe this Curator bug is what causes the situation described above.
> https://issues.apache.org/jira/browse/CURATOR-202
> Whenever we were hit by this bug, we'd start seeing problems submitting 
> topologies to nimbus, as well as problems 
> activating/deactivating/killing topologies.  Basically any topology operation 
> that utilizes the `is-leader` macro fails, since no nimbus believes itself to 
> be the leader based on LeaderLatch.hasLeadership().



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
