junrao commented on code in PR #20121:
URL: https://github.com/apache/kafka/pull/20121#discussion_r2193545685
##########
core/src/test/scala/unit/kafka/server/LogRecoveryTest.scala:
##########

@@ -215,7 +215,7 @@ class LogRecoveryTest extends QuorumTestHarness {
     server2.startup()
     updateProducer()
     // check if leader moves to the other server
-    leader = awaitLeaderChange(servers, topicPartition, oldLeaderOpt = Some(leader), timeout = 30000L)
+    leader = awaitLeaderChange(Seq(server2), topicPartition, oldLeaderOpt = Some(leader), timeout = 30000L)

Review Comment:
Thanks, but this doesn't seem to be the root cause. Note that `awaitLeaderChange()` will ignore server1 even if it is included in the brokers, since it is the same as the old leader.

I dug a bit deeper into this; here is what I found.

When the test passes, we have the following sequence. We fence broker 0 first, which causes 0 to be removed from the ISR. We then fence broker 1. Since broker 1 is the last member of the ISR, it is preserved. When broker 1 is restarted, it can be elected as the leader because it is still in the ISR.

```
[2025-07-08 18:09:24,746] INFO [QuorumController id=1000] Fencing broker 0 at epoch 9 because its session has timed out. (org.apache.kafka.controller.ReplicationControlManager:1700)
[2025-07-08 18:09:24,747] INFO [QuorumController id=1000] handleBrokerFenced: changing partition(s): new-topic-0 : PartitionChangeRecord(partitionId=0, topicId=ygLg0ywdTcCWn2_pMjpvKg, isr=[1], leader=1, replicas=null, removingReplicas=null, addingReplicas=null, leaderRecoveryState=-1, directories=null, eligibleLeaderReplicas=null, lastKnownElr=null) (org.apache.kafka.controller.ReplicationControlManager:2057)
[2025-07-08 18:09:24,758] INFO [QuorumController id=1000] Fencing broker 1 at epoch 11 because its session has timed out. (org.apache.kafka.controller.ReplicationControlManager:1700)
[2025-07-08 18:09:24,759] INFO [QuorumController id=1000] handleBrokerFenced: changing partition(s): new-topic-0 : PartitionChangeRecord(partitionId=0, topicId=ygLg0ywdTcCWn2_pMjpvKg, isr=null, leader=-1, replicas=null, removingReplicas=null, addingReplicas=null, leaderRecoveryState=-1, directories=null, eligibleLeaderReplicas=null, lastKnownElr=null) (org.apache.kafka.controller.ReplicationControlManager:2057)
[2025-07-08 18:09:26,145] INFO [QuorumController id=1000] The request from broker 1 to unfence has been granted because it has caught up with the offset of its register broker record 44. (org.apache.kafka.controller.BrokerHeartbeatManager:413)
[2025-07-08 18:09:26,146] INFO [QuorumController id=1000] handleBrokerUnfenced: changing partition(s): new-topic-0 : PartitionChangeRecord(partitionId=0, topicId=ygLg0ywdTcCWn2_pMjpvKg, isr=null, leader=1, replicas=null, removingReplicas=null, addingReplicas=null, leaderRecoveryState=-1, directories=null, eligibleLeaderReplicas=null, lastKnownElr=null) (org.apache.kafka.controller.ReplicationControlManager:2057)
```

When the test fails, we have the following sequence. First, broker 1 is re-registered after starting, which causes broker 1 to be removed from the ISR. We then fence broker 0. Since broker 0 is the last member of the ISR, it is preserved. This leaves the partition with no leader, because broker 1 is not in the ISR and unclean leader election is disabled.

```
[2025-07-08 18:09:37,659] INFO [QuorumController id=1000] handleBrokerShutdown: changing partition(s): new-topic-0 : PartitionChangeRecord(partitionId=0, topicId=jAy5ExUJRHODDr8kttB-cA, isr=[0], leader=-2, replicas=null, removingReplicas=null, addingReplicas=null, leaderRecoveryState=-1, directories=null, eligibleLeaderReplicas=null, lastKnownElr=null) (org.apache.kafka.controller.ReplicationControlManager:2057)
[2025-07-08 18:09:37,659] INFO [QuorumController id=1000] Registering a new incarnation of broker 1. Previous incarnation ID was wCXN8auETdmv29caNaaBAQ; new incarnation ID is In8FF4IGRjig4e5IgmwPVQ. Generated 1 record(s) to clean up previous incarnations. Broker epoch will become 34. (org.apache.kafka.controller.ClusterControlManager:444)
[2025-07-08 18:09:38,012] INFO [QuorumController id=1000] Fencing broker 0 at epoch 9 because its session has timed out. (org.apache.kafka.controller.ReplicationControlManager:1700)
[2025-07-08 18:09:38,013] INFO [QuorumController id=1000] handleBrokerFenced: changing partition(s): new-topic-0 : PartitionChangeRecord(partitionId=0, topicId=jAy5ExUJRHODDr8kttB-cA, isr=null, leader=-1, replicas=null, removingReplicas=null, addingReplicas=null, leaderRecoveryState=-1, directories=null, eligibleLeaderReplicas=null, lastKnownElr=null) (org.apache.kafka.controller.ReplicationControlManager:2057)
```

I tried enabling unclean leader election in the test, but it still fails. That needs further investigation.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: jira-unsubscr...@kafka.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
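The two sequences above both hinge on one rule: a fenced broker is dropped from the ISR unless it is the ISR's last remaining member, in which case the ISR is preserved and the partition is left leaderless rather than emptied. A minimal sketch of that rule, assuming nothing about the real `ReplicationControlManager` internals (the `fence` helper and its signature are hypothetical, purely illustrative):

```java
import java.util.ArrayList;
import java.util.List;

public class IsrShrinkSketch {
    // Hypothetical model of the shrink rule: fencing a broker removes it from
    // the ISR, except when it is the last member, in which case the ISR is
    // preserved (and the partition's leader becomes NO_LEADER, i.e. -1).
    static List<Integer> fence(List<Integer> isr, int brokerId) {
        if (isr.size() == 1 && isr.contains(brokerId)) {
            return isr; // last member: ISR preserved so a clean election stays possible
        }
        List<Integer> newIsr = new ArrayList<>(isr);
        newIsr.remove(Integer.valueOf(brokerId));
        return newIsr;
    }

    public static void main(String[] args) {
        // Passing order: fence 0 first (ISR shrinks to [1]), then fence 1
        // (last member, preserved). Broker 1 can be re-elected on restart.
        System.out.println(fence(fence(List.of(0, 1), 0), 1)); // [1]

        // Failing order: broker 1's re-registration already removed it from
        // the ISR, leaving [0]; fencing 0 then preserves [0], so the restarted
        // broker 1 is not in the ISR and no clean leader election is possible.
        System.out.println(fence(List.of(0), 0)); // [0]
    }
}
```

The asymmetry is visible directly: the ISR ends up as `[1]` or `[0]` depending only on the order in which the two events reach the controller, which is why the test outcome depends on timing.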