jsancio commented on a change in pull request #10309: URL: https://github.com/apache/kafka/pull/10309#discussion_r594031740
##########
File path: raft/src/test/java/org/apache/kafka/raft/RaftEventSimulationTest.java
##########

@@ -401,6 +401,51 @@ private void checkBackToBackLeaderFailures(QuorumConfig config) {
         }
     }
 
+    @Test
+    public void checkSingleNodeCommittedDataLossQuorumSizeThree() {
+        checkSingleNodeCommittedDataLoss(new QuorumConfig(3, 0));
+    }
+
+    private void checkSingleNodeCommittedDataLoss(QuorumConfig config) {
+        assertTrue(config.numVoters > 2,
+            "This test requires the cluster to be able to recover from one failed node");
+
+        for (int seed = 0; seed < 100; seed++) {
+            // We run this test without the `MonotonicEpoch` and `MajorityReachedHighWatermark`
+            // invariants since the loss of committed data on one node can violate them.
+
+            Cluster cluster = new Cluster(config, seed);
+            EventScheduler scheduler = new EventScheduler(cluster.random, cluster.time);
+            scheduler.addInvariant(new MonotonicHighWatermark(cluster));
+            scheduler.addInvariant(new SingleLeader(cluster));
+            scheduler.addValidation(new ConsistentCommittedData(cluster));
+
+            MessageRouter router = new MessageRouter(cluster);
+
+            cluster.startAll();
+            schedulePolling(scheduler, cluster, 3, 5);
+            scheduler.schedule(router::deliverAll, 0, 2, 5);
+            scheduler.schedule(new SequentialAppendAction(cluster), 0, 2, 3);
+            scheduler.runUntil(() -> cluster.anyReachedHighWatermark(10));
+
+            RaftNode node = cluster.randomRunning().orElseThrow(() ->
+                new AssertionError("Failed to find running node")
+            );
+
+            // Kill a random node and drop all of its persistent state. The Raft
+            // protocol guarantees should still ensure we lose no committed data
+            // as long as a new leader is elected before the failed node is restarted.
+            cluster.kill(node.nodeId);
+            cluster.deletePersistentState(node.nodeId);
+            scheduler.runUntil(() -> !cluster.hasLeader(node.nodeId) && cluster.hasConsistentLeader());

Review comment:
       Hmm. I was looking at the implementation of `hasConsistentLeader`: it checks that all of the `LeaderState`s match, which means that every replica must have voted for the same leader. That is not strictly required for having a consistent leader; agreement among a majority of the voters is enough. Maybe this works in this test only because the number of voters is 3 and one of the nodes was killed.

##########
File path: raft/src/test/java/org/apache/kafka/raft/RaftEventSimulationTest.java
##########

@@ -401,6 +401,51 @@ private void checkBackToBackLeaderFailures(QuorumConfig config) {
         }
     }
 
+    @Test
+    public void checkSingleNodeCommittedDataLossQuorumSizeThree() {
+        checkSingleNodeCommittedDataLoss(new QuorumConfig(3, 0));
+    }
+
+    private void checkSingleNodeCommittedDataLoss(QuorumConfig config) {
+        assertTrue(config.numVoters > 2,
+            "This test requires the cluster to be able to recover from one failed node");
+
+        for (int seed = 0; seed < 100; seed++) {
+            // We run this test without the `MonotonicEpoch` and `MajorityReachedHighWatermark`
+            // invariants since the loss of committed data on one node can violate them.
+
+            Cluster cluster = new Cluster(config, seed);
+            EventScheduler scheduler = new EventScheduler(cluster.random, cluster.time);
+            scheduler.addInvariant(new MonotonicHighWatermark(cluster));
+            scheduler.addInvariant(new SingleLeader(cluster));
+            scheduler.addValidation(new ConsistentCommittedData(cluster));
+
+            MessageRouter router = new MessageRouter(cluster);
+
+            cluster.startAll();
+            schedulePolling(scheduler, cluster, 3, 5);
+            scheduler.schedule(router::deliverAll, 0, 2, 5);
+            scheduler.schedule(new SequentialAppendAction(cluster), 0, 2, 3);
+            scheduler.runUntil(() -> cluster.anyReachedHighWatermark(10));
+
+            RaftNode node = cluster.randomRunning().orElseThrow(() ->
+                new AssertionError("Failed to find running node")
+            );
+
+            // Kill a random node and drop all of its persistent state. The Raft
+            // protocol guarantees should still ensure we lose no committed data
+            // as long as a new leader is elected before the failed node is restarted.
+            cluster.kill(node.nodeId);
+            cluster.deletePersistentState(node.nodeId);

Review comment:
       The implementation of `deletePersistentState` assumes that `kill` was, or will be, called for the change to take effect. Should we instead have tests call a single combined method, say `killAndDeletePersistentState`, so that the ordering dependency is enforced in one place?

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
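The first comment's point about majority agreement could be sketched as follows. This is an illustrative, standalone snippet, not the actual simulation-test code: `LeaderView` and `majorityLeader` are hypothetical names, and the real `hasConsistentLeader` implementation differs.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

// Hypothetical sketch: a relaxed leader-consistency check that accepts a
// leader once a majority of the voters agree on the same (epoch, leaderId)
// pair, rather than requiring every replica's LeaderState to match.
public class MajorityLeaderCheck {
    // One voter's current view of leadership.
    record LeaderView(int epoch, int leaderId) {}

    static Optional<LeaderView> majorityLeader(List<LeaderView> views, int numVoters) {
        Map<LeaderView, Integer> counts = new HashMap<>();
        for (LeaderView view : views) {
            counts.merge(view, 1, Integer::sum);
        }
        int majority = numVoters / 2 + 1;
        // At most one view can reach a majority, so findFirst is unambiguous.
        return counts.entrySet().stream()
            .filter(entry -> entry.getValue() >= majority)
            .map(Map.Entry::getKey)
            .findFirst();
    }

    public static void main(String[] args) {
        // Three voters, one killed: the two survivors agreeing on leader 1
        // in epoch 2 is already a majority of 3.
        List<LeaderView> survivors = List.of(new LeaderView(2, 1), new LeaderView(2, 1));
        System.out.println(majorityLeader(survivors, 3).isPresent()); // prints true
    }
}
```

Under this relaxed check, the test would not depend on the killed node's stale `LeaderState` being absent; it only needs two of the three voters to converge.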
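The combined helper suggested in the second comment could look like the toy sketch below. The `Cluster` here is a minimal stand-in with made-up fields, not the simulation test's real `Cluster`; only `kill` and `deletePersistentState` correspond to methods that exist there, and the guard in `deletePersistentState` is an added illustration of the ordering dependency.

```java
import java.util.HashSet;
import java.util.Set;

// Toy model of the suggestion: fold kill + deletePersistentState into one
// method so callers cannot delete state under a still-running node.
public class KillAndDeleteSketch {
    static class Cluster {
        final Set<Integer> running = new HashSet<>(Set.of(0, 1, 2));
        final Set<Integer> persisted = new HashSet<>(Set.of(0, 1, 2));

        void kill(int nodeId) {
            running.remove(nodeId);
        }

        void deletePersistentState(int nodeId) {
            // Guard against the misuse the review points out: deleting state
            // while the node is running would not take effect until restart.
            if (running.contains(nodeId))
                throw new IllegalStateException("kill(" + nodeId + ") must be called first");
            persisted.remove(nodeId);
        }

        // The suggested single entry point for tests.
        void killAndDeletePersistentState(int nodeId) {
            kill(nodeId);
            deletePersistentState(nodeId);
        }
    }

    public static void main(String[] args) {
        Cluster cluster = new Cluster();
        cluster.killAndDeletePersistentState(1);
        System.out.println(cluster.running.contains(1));   // prints false
        System.out.println(cluster.persisted.contains(1)); // prints false
    }
}
```

Exposing only the combined method (or making `deletePersistentState` private) would make the test at the top of this thread impossible to write incorrectly.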