junrao commented on code in PR #20121:
URL: https://github.com/apache/kafka/pull/20121#discussion_r2193545685
##########
core/src/test/scala/unit/kafka/server/LogRecoveryTest.scala:
##########

@@ -215,7 +215,7 @@ class LogRecoveryTest extends QuorumTestHarness {
     server2.startup()
     updateProducer()
     // check if leader moves to the other server
-    leader = awaitLeaderChange(servers, topicPartition, oldLeaderOpt = Some(leader), timeout = 30000L)
+    leader = awaitLeaderChange(Seq(server2), topicPartition, oldLeaderOpt = Some(leader), timeout = 30000L)

Review Comment:
Thanks, but this doesn't seem to be the root cause. Note that `awaitLeaderChange()` will ignore server1 even if it is included in the brokers, since it is the same as the old leader.

I dug a bit deeper into this; here is what I found.

When the test passes, we have the following sequence. We fence broker 0 first, which causes 0 to be removed from the ISR. We then fence broker 1. Since broker 1 is the last member of the ISR, it is preserved. When broker 1 is restarted, it can be elected as the leader because it is still in the ISR.

```
[2025-07-08 18:09:24,746] INFO [QuorumController id=1000] Fencing broker 0 at epoch 9 because its session has timed out. (org.apache.kafka.controller.ReplicationControlManager:1700)
[2025-07-08 18:09:24,747] INFO [QuorumController id=1000] handleBrokerFenced: changing partition(s): new-topic-0 : PartitionChangeRecord(partitionId=0, topicId=ygLg0ywdTcCWn2_pMjpvKg, isr=[1], leader=1, replicas=null, removingReplicas=null, addingReplicas=null, leaderRecoveryState=-1, directories=null, eligibleLeaderReplicas=null, lastKnownElr=null) (org.apache.kafka.controller.ReplicationControlManager:2057)
[2025-07-08 18:09:24,758] INFO [QuorumController id=1000] Fencing broker 1 at epoch 11 because its session has timed out. (org.apache.kafka.controller.ReplicationControlManager:1700)
[2025-07-08 18:09:24,759] INFO [QuorumController id=1000] handleBrokerFenced: changing partition(s): new-topic-0 : PartitionChangeRecord(partitionId=0, topicId=ygLg0ywdTcCWn2_pMjpvKg, isr=null, leader=-1, replicas=null, removingReplicas=null, addingReplicas=null, leaderRecoveryState=-1, directories=null, eligibleLeaderReplicas=null, lastKnownElr=null) (org.apache.kafka.controller.ReplicationControlManager:2057)
[2025-07-08 18:09:26,145] INFO [QuorumController id=1000] The request from broker 1 to unfence has been granted because it has caught up with the offset of its register broker record 44. (org.apache.kafka.controller.BrokerHeartbeatManager:413)
[2025-07-08 18:09:26,146] INFO [QuorumController id=1000] handleBrokerUnfenced: changing partition(s): new-topic-0 : PartitionChangeRecord(partitionId=0, topicId=ygLg0ywdTcCWn2_pMjpvKg, isr=null, leader=1, replicas=null, removingReplicas=null, addingReplicas=null, leaderRecoveryState=-1, directories=null, eligibleLeaderReplicas=null, lastKnownElr=null) (org.apache.kafka.controller.ReplicationControlManager:2057)
```

When the test fails, we have the following sequence. First, broker 1 is re-registered after starting, which causes broker 1 to be removed from the ISR. We then fence broker 0. Since broker 0 is the last member of the ISR, it is preserved. This leaves the partition with no leader, because broker 1 is not in the ISR and unclean leader election is disabled.

```
[2025-07-08 18:09:37,659] INFO [QuorumController id=1000] handleBrokerShutdown: changing partition(s): new-topic-0 : PartitionChangeRecord(partitionId=0, topicId=jAy5ExUJRHODDr8kttB-cA, isr=[0], leader=-2, replicas=null, removingReplicas=null, addingReplicas=null, leaderRecoveryState=-1, directories=null, eligibleLeaderReplicas=null, lastKnownElr=null) (org.apache.kafka.controller.ReplicationControlManager:2057)
[2025-07-08 18:09:37,659] INFO [QuorumController id=1000] Registering a new incarnation of broker 1. Previous incarnation ID was wCXN8auETdmv29caNaaBAQ; new incarnation ID is In8FF4IGRjig4e5IgmwPVQ. Generated 1 record(s) to clean up previous incarnations. Broker epoch will become 34. (org.apache.kafka.controller.ClusterControlManager:444)
[2025-07-08 18:09:38,012] INFO [QuorumController id=1000] Fencing broker 0 at epoch 9 because its session has timed out. (org.apache.kafka.controller.ReplicationControlManager:1700)
[2025-07-08 18:09:38,013] INFO [QuorumController id=1000] handleBrokerFenced: changing partition(s): new-topic-0 : PartitionChangeRecord(partitionId=0, topicId=jAy5ExUJRHODDr8kttB-cA, isr=null, leader=-1, replicas=null, removingReplicas=null, addingReplicas=null, leaderRecoveryState=-1, directories=null, eligibleLeaderReplicas=null, lastKnownElr=null) (org.apache.kafka.controller.ReplicationControlManager:2057)
```

I tried enabling unclean leader election in the test, but it still fails. That needs further investigation.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: jira-unsubscr...@kafka.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
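The two sequences above both hinge on one rule: a fenced broker is dropped from the ISR unless it is the ISR's last remaining member, in which case the ISR is preserved and the partition is left leaderless rather than emptied. A minimal sketch of that rule, assuming nothing about the real `ReplicationControlManager` internals (the `fence` helper and its signature are hypothetical, purely illustrative):

```java
import java.util.ArrayList;
import java.util.List;

public class IsrShrinkSketch {
    // Hypothetical model of the shrink rule: fencing a broker removes it from
    // the ISR, except when it is the last member, in which case the ISR is
    // preserved (and the partition's leader becomes NO_LEADER, i.e. -1).
    static List<Integer> fence(List<Integer> isr, int brokerId) {
        if (isr.size() == 1 && isr.contains(brokerId)) {
            return isr; // last member: ISR preserved so a clean election stays possible
        }
        List<Integer> newIsr = new ArrayList<>(isr);
        newIsr.remove(Integer.valueOf(brokerId));
        return newIsr;
    }

    public static void main(String[] args) {
        // Passing order: fence 0 first (ISR shrinks to [1]), then fence 1
        // (last member, preserved). Broker 1 can be re-elected on restart.
        System.out.println(fence(fence(List.of(0, 1), 0), 1)); // [1]

        // Failing order: broker 1's re-registration already removed it from
        // the ISR, leaving [0]; fencing 0 then preserves [0], so the restarted
        // broker 1 is not in the ISR and no clean leader election is possible.
        System.out.println(fence(List.of(0), 0)); // [0]
    }
}
```

The asymmetry is visible directly: the ISR ends up as `[1]` or `[0]` depending only on the order in which the two events reach the controller, which is why the test outcome depends on timing.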