XComp commented on code in PR #430: URL: https://github.com/apache/curator/pull/430#discussion_r970541350
########## curator-recipes/src/main/java/org/apache/curator/framework/recipes/leader/LeaderLatch.java: ########## @@ -586,6 +586,9 @@ private void checkLeadership(List<String> children) throws Exception final String localOurPath = ourPath.get(); List<String> sortedChildren = LockInternals.getSortedChildren(LOCK_NAME, sorter, children); int ourIndex = (localOurPath != null) ? sortedChildren.indexOf(ZKPaths.getNodeFromPath(localOurPath)) : -1; + + log.debug("checkLeadership with ourPath: {}, children: {}", localOurPath, sortedChildren); Review Comment: ```suggestion log.debug("[{}] checkLeadership with ourPath: {}, children: {}", id, localOurPath, sortedChildren); ``` ########## curator-recipes/src/test/java/org/apache/curator/framework/recipes/leader/TestLeaderLatch.java: ########## @@ -218,6 +218,56 @@ public void testWatchedNodeDeletedOnReconnect() throws Exception } } + @Test + public void testLeadershipElectionWhenNodeDisappearsAfterChildrenAreRetrieved() throws Exception + { + final String latchPath = "/foo/bar"; + final Timing2 timing = new Timing2(); + try (CuratorFramework client = CuratorFrameworkFactory.newClient(server.getConnectString(), timing.session(), timing.connection(), new RetryOneTime(1))) + { + client.start(); + LeaderLatch latchInitialLeader = new LeaderLatch(client, latchPath, "initial-leader"); + LeaderLatch latchCandidate0 = new LeaderLatch(client, latchPath, "candidate-0"); + LeaderLatch latchCandidate1 = new LeaderLatch(client, latchPath, "candidate-1"); + + try + { + latchInitialLeader.start(); + + // we want to make sure that the leader gets leadership before other instances are joining the party + waitForALeader(Collections.singletonList(latchInitialLeader), new Timing()); + + // candidate #0 will wait for the leader to go away - this should happen after the child nodes are retrieved by candidate #0 + latchCandidate0.debugCheckLeaderShipLatch = new CountDownLatch(1); + + latchCandidate0.start(); + timing.sleepABit(); + + // no extra CountDownLatch needs to be set here because candidate #1 will rely on candidate #0 + latchCandidate1.start(); + timing.sleepABit(); Review Comment: You need to add `Awaitility` to the pom of the `curator-recipes` module: ``` <dependency> <groupId>org.awaitility</groupId> <artifactId>awaitility</artifactId> <scope>test</scope> </dependency> ``` ########## curator-recipes/src/test/java/org/apache/curator/framework/recipes/leader/TestLeaderLatch.java: ########## @@ -218,6 +218,56 @@ public void testWatchedNodeDeletedOnReconnect() throws Exception } } + @Test + public void testLeadershipElectionWhenNodeDisappearsAfterChildrenAreRetrieved() throws Exception + { + final String latchPath = "/foo/bar"; + final Timing2 timing = new Timing2(); + try (CuratorFramework client = CuratorFrameworkFactory.newClient(server.getConnectString(), timing.session(), timing.connection(), new RetryOneTime(1))) + { + client.start(); + LeaderLatch latchInitialLeader = new LeaderLatch(client, latchPath, "initial-leader"); + LeaderLatch latchCandidate0 = new LeaderLatch(client, latchPath, "candidate-0"); + LeaderLatch latchCandidate1 = new LeaderLatch(client, latchPath, "candidate-1"); + + try + { + latchInitialLeader.start(); + + // we want to make sure that the leader gets leadership before other instances are joining the party + waitForALeader(Collections.singletonList(latchInitialLeader), new Timing()); + + // candidate #0 will wait for the leader to go away - this should happen after the child nodes are retrieved by candidate #0 + latchCandidate0.debugCheckLeaderShipLatch = new CountDownLatch(1); + + latchCandidate0.start(); + timing.sleepABit(); + + // no extra CountDownLatch needs to be set here because candidate #1 will rely on candidate #0 + latchCandidate1.start(); + timing.sleepABit(); Review Comment: ```suggestion // we want to make sure that the leader gets leadership before other instances are going to join the party waitForALeader(Collections.singletonList(latchInitialLeader), new Timing()); // candidate #0 will wait for the leader to go away - this should happen after the child nodes are retrieved by candidate #0 latchCandidate0.debugCheckLeaderShipLatch = new CountDownLatch(1); latchCandidate0.start(); final int expectedChildrenAfterCandidate0Joins = 2; Awaitility.await("There should be " + expectedChildrenAfterCandidate0Joins + " child nodes created after candidate #0 joins the leader election.") .pollInterval(Duration.ofMillis(100)) .pollInSameThread() .until(() -> client.getChildren().forPath(latchPath).size() == expectedChildrenAfterCandidate0Joins); // no extra CountDownLatch needs to be set here because candidate #1 will rely on candidate #0 latchCandidate1.start(); final int expectedChildrenAfterCandidate1Joins = 3; Awaitility.await("There should be " + expectedChildrenAfterCandidate1Joins + " child nodes created after candidate #1 joins the leader election.") .pollInterval(Duration.ofMillis(100)) .pollInSameThread() .until(() -> client.getChildren().forPath(latchPath).size() == expectedChildrenAfterCandidate1Joins); ``` Initially, I played around with checking the `ourPath` of each `LeaderLatch` instance. This didn't work, because that field is set in a `BackgroundCallback` (see [LeaderLatch:540](https://github.com/apache/curator/blob/2775855fc80fdea8433d5dedbb746f58ddd443c4/curator-recipes/src/main/java/org/apache/curator/framework/recipes/leader/LeaderLatch.java#L540)). The background tasks are handled in the `EventThread` of the ZK client. That one is already blocked by the `CountdownLatch` of candidate #0. But we could still use the `SendThread` which is used for outgoing commands. That's why calling `.getChildren` on the client itself still works and is sufficient enough to verify that the child nodes are actually created. ########## curator-recipes/src/main/java/org/apache/curator/framework/recipes/leader/LeaderLatch.java: ########## @@ -667,9 +670,9 @@ protected void handleStateChange(ConnectionState newState) { try { - if ( client.getConnectionStateErrorPolicy().isErrorState(ConnectionState.SUSPENDED) || !hasLeadership.get() ) + if ( client.getConnectionStateErrorPolicy().isErrorState(ConnectionState.SUSPENDED) ) { - reset(); + getChildren(); Review Comment: I'm not convinced anymore that this change is correct. Here, we're covering the case where `ConnectionState.SUSPENDED` is considered an error state, i.e. the leadership is lost if the connection was suspended. In that case, we should do a reset because we're considering the leadership being lost already due to the connection instability having caused an error. Instead, we should add an `else` branch that does the `getChildren` call in case of a reconnect and the policy is allowing reconnects after suspension. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: dev-unsubscr...@curator.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org