[ https://issues.apache.org/jira/browse/CURATOR-498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16754529#comment-16754529 ]
ASF GitHub Bot commented on CURATOR-498:
----------------------------------------

Github user cammckenzie commented on a diff in the pull request:

https://github.com/apache/curator/pull/303#discussion_r251672915

--- Diff: curator-framework/src/main/java/org/apache/curator/framework/imps/CreateBuilderImpl.java ---
@@ -48,19 +50,21 @@ public class CreateBuilderImpl implements CreateBuilder, CreateBuilder2, BackgroundOperation<PathAndBytes>, ErrorListenerPathAndBytesable<String>
 {
+    private final Logger log = LoggerFactory.getLogger(getClass());
     private final CuratorFrameworkImpl client;
     private CreateMode createMode;
     private Backgrounding backgrounding;
     private boolean createParentsIfNeeded;
     private boolean createParentsAsContainers;
-    private boolean doProtected;
     private boolean compress;
     private boolean setDataIfExists;
     private int setDataIfExistsVersion = -1;
-    private String protectedId;
     private ACLing acling;
     private Stat storingStat;
     private long ttl;
+    private boolean doProtected;
+    private String protectedId;
+    private long initialSessionId;
--- End diff --

I think that maybe the initialSessionId should have a different name as it's a bit misleading. It's not really the initialSessionId as it gets updated during the edge case for handling protected ephemeral nodes. Really, it's the session ID at the time of creation for a protected ephemeral node isn't it? Perhaps it should be named accordingly and only initialised for the case where it will be needed?
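To make the naming suggestion concrete, here is a minimal, self-contained sketch of what "named accordingly and only initialised where needed" could look like. It is not the code in the pull request: the class `ProtectedCreateState`, the field `sessionIdAtCreation`, and the methods `onCreateIssued` and `sessionChangedSinceCreate` are all hypothetical names made up for illustration, and the sketch assumes the session ID is only relevant when the node is both protected and ephemeral.

```java
import java.util.OptionalLong;

// Hypothetical illustration only; none of these names exist in Curator.
final class ProtectedCreateState {
    // Assumption: the session ID only matters for protected ephemeral nodes.
    private final boolean trackSession;

    // Session ID in effect when the create was issued; empty when not tracked.
    private OptionalLong sessionIdAtCreation = OptionalLong.empty();

    ProtectedCreateState(boolean doProtected, boolean ephemeral) {
        this.trackSession = doProtected && ephemeral;
    }

    // Record (or re-record, on a retry) the session ID for the current create.
    void onCreateIssued(long currentSessionId) {
        if (trackSession) {
            sessionIdAtCreation = OptionalLong.of(currentSessionId);
        }
    }

    // True only when a session ID was recorded and it differs from the current one.
    boolean sessionChangedSinceCreate(long currentSessionId) {
        return sessionIdAtCreation.isPresent()
            && sessionIdAtCreation.getAsLong() != currentSessionId;
    }

    OptionalLong sessionIdAtCreation() {
        return sessionIdAtCreation;
    }
}
```

Using `OptionalLong` rather than a bare `long` makes the "not needed in this case" state explicit, so a default of 0 cannot be mistaken for a real session ID. Whether that is worth it over a plain renamed field is a judgment call.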
> Protected Mode creation can mistake closing session's node causing problems
> for many recipes such as LeaderLatch
> ----------------------------------------------------------------------------------------------------------------
>
>                 Key: CURATOR-498
>                 URL: https://issues.apache.org/jira/browse/CURATOR-498
>             Project: Apache Curator
>          Issue Type: Bug
>          Components: Framework
>    Affects Versions: 4.0.1, 4.1.0
>         Environment: ZooKeeper 3.4.13, Curator 4.1.0 (selecting explicitly
> 3.4.13), Linux
>            Reporter: Shay Shimony
>            Assignee: Jordan Zimmerman
>            Priority: Blocker
>             Fix For: 4.1.1
>
>         Attachments: CURATOR-498.png, HaWatcher.log, LeaderLatch0.java,
> ha.tar.gz, logs.tar.gz, reproduction.tar.gz, reproduction2.tar.gz
>
>
> The Curator app I am working on uses the LeaderLatch to select a leader out
> of 6 clients.
> While testing my app, I noticed that when I make ZK lose its quorum for a
> while and then restore it, then after Curator in my app restores its
> connection to ZK, sometimes not all 6 clients are found in the latch
> path (using zkCli.sh). That is, I have 5 instead of 6.
> After investigating a little, I suspect that LeaderLatch deleted the
> leader in method setNode.
> To investigate it I copied the LeaderLatch code and added some log messages,
> and from them it seems that a very old create() background callback was
> unexpectedly scheduled and corrupted the current leader with its stale path
> name. That is, this old callback called setNode with its stale name, set
> itself in place of the leader, and deleted the leader. This leaves the client
> running, thinking it is the leader, while another leader is selected.
> If my analysis is correct then it seems we need to cancel this obsolete
> create callback (I think its session was suspended at 22:38:54 and
> then lost at 22:39:04, so ongoing callbacks should be cancelled on SUSPENDED).
> Please see attached log file and modified LeaderLatch0.
>
> In the log, note that on 22:39:26 it shows that 0000000485 is replaced by
> 0000000480 and then probably deleted.
> Note also that at 22:38:52, 34 seconds before, we can see that it was in the
> reset() method ("RESET OUR PATH") and possibly triggered the creation of
> 0000000480 then.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)