jsancio commented on a change in pull request #10309: URL: https://github.com/apache/kafka/pull/10309#discussion_r594031740
##########
File path: raft/src/test/java/org/apache/kafka/raft/RaftEventSimulationTest.java
##########

@@ -401,6 +401,51 @@ private void checkBackToBackLeaderFailures(QuorumConfig config) {
         }
     }
 
+    @Test
+    public void checkSingleNodeCommittedDataLossQuorumSizeThree() {
+        checkSingleNodeCommittedDataLoss(new QuorumConfig(3, 0));
+    }
+
+    private void checkSingleNodeCommittedDataLoss(QuorumConfig config) {
+        assertTrue(config.numVoters > 2,
+            "This test requires the cluster to be able to recover from one failed node");
+
+        for (int seed = 0; seed < 100; seed++) {
+            // We run this test without the `MonotonicEpoch` and `MajorityReachedHighWatermark`
+            // invariants since the loss of committed data on one node can violate them.
+
+            Cluster cluster = new Cluster(config, seed);
+            EventScheduler scheduler = new EventScheduler(cluster.random, cluster.time);
+            scheduler.addInvariant(new MonotonicHighWatermark(cluster));
+            scheduler.addInvariant(new SingleLeader(cluster));
+            scheduler.addValidation(new ConsistentCommittedData(cluster));
+
+            MessageRouter router = new MessageRouter(cluster);
+
+            cluster.startAll();
+            schedulePolling(scheduler, cluster, 3, 5);
+            scheduler.schedule(router::deliverAll, 0, 2, 5);
+            scheduler.schedule(new SequentialAppendAction(cluster), 0, 2, 3);
+            scheduler.runUntil(() -> cluster.anyReachedHighWatermark(10));
+
+            RaftNode node = cluster.randomRunning().orElseThrow(() ->
+                new AssertionError("Failed to find running node")
+            );
+
+            // Kill a random node and drop all of its persistent state. The Raft
+            // protocol guarantees should still ensure we lose no committed data
+            // as long as a new leader is elected before the failed node is restarted.
+            cluster.kill(node.nodeId);
+            cluster.deletePersistentState(node.nodeId);
+            scheduler.runUntil(() -> !cluster.hasLeader(node.nodeId) && cluster.hasConsistentLeader());

Review comment:
       Hmm. I was looking at the implementation of `hasConsistentLeader`: it checks that all of the `LeaderState`s match, which means that every replica must have voted for the same leader. That is not strictly required for having a consistent leader; agreement among a majority of the voters is enough. Maybe this works in this test only because the number of voters is 3 and one of the nodes was killed.

##########
File path: raft/src/test/java/org/apache/kafka/raft/RaftEventSimulationTest.java
##########

@@ -401,6 +401,51 @@ private void checkBackToBackLeaderFailures(QuorumConfig config) {
         }
     }
 
+    @Test
+    public void checkSingleNodeCommittedDataLossQuorumSizeThree() {
+        checkSingleNodeCommittedDataLoss(new QuorumConfig(3, 0));
+    }
+
+    private void checkSingleNodeCommittedDataLoss(QuorumConfig config) {
+        assertTrue(config.numVoters > 2,
+            "This test requires the cluster to be able to recover from one failed node");
+
+        for (int seed = 0; seed < 100; seed++) {
+            // We run this test without the `MonotonicEpoch` and `MajorityReachedHighWatermark`
+            // invariants since the loss of committed data on one node can violate them.
+
+            Cluster cluster = new Cluster(config, seed);
+            EventScheduler scheduler = new EventScheduler(cluster.random, cluster.time);
+            scheduler.addInvariant(new MonotonicHighWatermark(cluster));
+            scheduler.addInvariant(new SingleLeader(cluster));
+            scheduler.addValidation(new ConsistentCommittedData(cluster));
+
+            MessageRouter router = new MessageRouter(cluster);
+
+            cluster.startAll();
+            schedulePolling(scheduler, cluster, 3, 5);
+            scheduler.schedule(router::deliverAll, 0, 2, 5);
+            scheduler.schedule(new SequentialAppendAction(cluster), 0, 2, 3);
+            scheduler.runUntil(() -> cluster.anyReachedHighWatermark(10));
+
+            RaftNode node = cluster.randomRunning().orElseThrow(() ->
+                new AssertionError("Failed to find running node")
+            );
+
+            // Kill a random node and drop all of its persistent state. The Raft
+            // protocol guarantees should still ensure we lose no committed data
+            // as long as a new leader is elected before the failed node is restarted.
+            cluster.kill(node.nodeId);
+            cluster.deletePersistentState(node.nodeId);

Review comment:
       The implementation of `deletePersistentState` assumes that `kill` was, or will be, called for the change to take effect. Should we instead have tests call a single combined method, say `killAndDeletePersistentState`, so that the ordering dependency is enforced in one place?

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
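The first comment's point about majority agreement could be sketched as follows. This is an illustrative, standalone snippet, not the actual simulation-test code: `LeaderView` and `majorityLeader` are hypothetical names, and the real `hasConsistentLeader` implementation differs.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

// Hypothetical sketch: a relaxed leader-consistency check that accepts a
// leader once a majority of the voters agree on the same (epoch, leaderId)
// pair, rather than requiring every replica's LeaderState to match.
public class MajorityLeaderCheck {
    // One voter's current view of leadership.
    record LeaderView(int epoch, int leaderId) {}

    static Optional<LeaderView> majorityLeader(List<LeaderView> views, int numVoters) {
        Map<LeaderView, Integer> counts = new HashMap<>();
        for (LeaderView view : views) {
            counts.merge(view, 1, Integer::sum);
        }
        int majority = numVoters / 2 + 1;
        // At most one view can reach a majority, so findFirst is unambiguous.
        return counts.entrySet().stream()
            .filter(entry -> entry.getValue() >= majority)
            .map(Map.Entry::getKey)
            .findFirst();
    }

    public static void main(String[] args) {
        // Three voters, one killed: the two survivors agreeing on leader 1
        // in epoch 2 is already a majority of 3.
        List<LeaderView> survivors = List.of(new LeaderView(2, 1), new LeaderView(2, 1));
        System.out.println(majorityLeader(survivors, 3).isPresent()); // prints true
    }
}
```

Under this relaxed check, the test would not depend on the killed node's stale `LeaderState` being absent; it only needs two of the three voters to converge.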
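The combined helper suggested in the second comment could look like the toy sketch below. The `Cluster` here is a minimal stand-in with made-up fields, not the simulation test's real `Cluster`; only `kill` and `deletePersistentState` correspond to methods that exist there, and the guard in `deletePersistentState` is an added illustration of the ordering dependency.

```java
import java.util.HashSet;
import java.util.Set;

// Toy model of the suggestion: fold kill + deletePersistentState into one
// method so callers cannot delete state under a still-running node.
public class KillAndDeleteSketch {
    static class Cluster {
        final Set<Integer> running = new HashSet<>(Set.of(0, 1, 2));
        final Set<Integer> persisted = new HashSet<>(Set.of(0, 1, 2));

        void kill(int nodeId) {
            running.remove(nodeId);
        }

        void deletePersistentState(int nodeId) {
            // Guard against the misuse the review points out: deleting state
            // while the node is running would not take effect until restart.
            if (running.contains(nodeId))
                throw new IllegalStateException("kill(" + nodeId + ") must be called first");
            persisted.remove(nodeId);
        }

        // The suggested single entry point for tests.
        void killAndDeletePersistentState(int nodeId) {
            kill(nodeId);
            deletePersistentState(nodeId);
        }
    }

    public static void main(String[] args) {
        Cluster cluster = new Cluster();
        cluster.killAndDeletePersistentState(1);
        System.out.println(cluster.running.contains(1));   // prints false
        System.out.println(cluster.persisted.contains(1)); // prints false
    }
}
```

Exposing only the combined method (or making `deletePersistentState` private) would make the test at the top of this thread impossible to write incorrectly.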