[ https://issues.apache.org/jira/browse/CURATOR-498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16754372#comment-16754372 ]
ASF GitHub Bot commented on CURATOR-498:
----------------------------------------

Github user Randgalt commented on a diff in the pull request:

    https://github.com/apache/curator/pull/303#discussion_r251599266

    --- Diff: curator-client/src/main/java/org/apache/curator/utils/InjectSessionExpiration.java ---
    @@ -94,7 +89,7 @@ public static void injectSessionExpiration(ZooKeeper zooKeeper)
                 Object eventThread = eventThreadField.get(clientCnxn);
                 queueEventMethod.invoke(eventThread, event);
                 queueEventOfDeathMethod.invoke(eventThread);
    --- End diff --

    TBH I just wanted to make the minimal change needed to get this to work. It's been this way a very long time.


> Protected Mode creation can mistake closing session's node causing problems
> for many recipes such as LeaderLatch
> ----------------------------------------------------------------------------------------------------------------
>
>                 Key: CURATOR-498
>                 URL: https://issues.apache.org/jira/browse/CURATOR-498
>             Project: Apache Curator
>          Issue Type: Bug
>          Components: Framework
>    Affects Versions: 4.0.1, 4.1.0
>         Environment: ZooKeeper 3.4.13, Curator 4.1.0 (selecting explicitly
> 3.4.13), Linux
>            Reporter: Shay Shimony
>            Assignee: Jordan Zimmerman
>            Priority: Blocker
>             Fix For: 4.1.1
>
>         Attachments: CURATOR-498.png, HaWatcher.log, LeaderLatch0.java,
> ha.tar.gz, logs.tar.gz, reproduction.tar.gz, reproduction2.tar.gz
>
>
> The Curator app I am working on uses the LeaderLatch to select a leader out
> of 6 clients.
> While testing my app, I noticed that when I make ZK lose its quorum for a
> while and then restore it, then after Curator in my app restores its
> connection to ZK, sometimes not all 6 clients are found in the latch
> path (checked using zkCli.sh). That is, I have 5 instead of 6.
> After investigating a little, I suspect that LeaderLatch deleted the
> leader in method setNode.
> To investigate it I copied the LeaderLatch code and added some log messages,
> and from them it seems that a very old create() background callback was
> unexpectedly scheduled and corrupted the current leader with its stale path
> name. That is, this old callback called setNode with its stale name, set
> itself in place of the leader, and deleted the leader. This leaves the
> client running, thinking it is the leader, while another leader is selected.
> If my analysis is correct, then it seems we need to cancel this obsolete
> create callback (I think its session was suspended at 22:38:54 and
> then lost at 22:39:04 - so on SUSPENDED, cancel ongoing callbacks).
> Please see the attached log file and the modified LeaderLatch0.
>
> In the log, note that at 22:39:26 it shows that 0000000485 is replaced by
> 0000000480 and then probably deleted.
> Note also that at 22:38:52, 34 seconds earlier, we can see that it was in the
> reset() method ("RESET OUR PATH") and possibly triggered the creation of
> 0000000480 then.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)