[ https://issues.apache.org/jira/browse/KAFKA-15052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17732454#comment-17732454 ]
Dimitar Dimitrov commented on KAFKA-15052:
------------------------------------------

[~jlprat], [~cadonna], thanks for linking these. After the originally suggested fix the failures initially seemed to disappear, but they probably only became slightly rarer, and they can be seen again in the most recent `trunk` runs:
[https://ci-builds.apache.org/blue/organizations/jenkins/Kafka%2Fkafka/detail/trunk/1920/tests/]

I'm trying to get a local repro. Unfortunately I've had no luck so far with either Gradle or IntelliJ, but I'm still working through the full matrix of JDK and Scala versions.

If nothing more elegant or more reliable than the previous fix can be found, bumping the session timeout by a second seemed reliable the last time, so that is one possible alternative.

I'll update here once I have more to share.

> Fix flaky test QuorumControllerTest.testBalancePartitionLeaders()
> -----------------------------------------------------------------
>
> Key: KAFKA-15052
> URL: https://issues.apache.org/jira/browse/KAFKA-15052
> Project: Kafka
> Issue Type: Test
> Reporter: Dimitar Dimitrov
> Assignee: Dimitar Dimitrov
> Priority: Major
>
> Test failed at
> [https://ci-builds.apache.org/blue/organizations/jenkins/Kafka%2Fkafka/detail/trunk/1892/tests/]
> as well as in various local runs.
> The test creates a topic, fences a broker, notes partition imbalance due to
> another broker taking over the partition the fenced broker lost, re-registers
> and unfences the fenced broker, sends {{AlterPartition}} for the lost
> partition adding the now unfenced broker back to its ISR, then waits for the
> partition imbalance to disappear.
> The local failures seem to happen when the brokers (including the ones that
> never get fenced by the test) accidentally get fenced by losing their session
> due to reaching the (aggressively low for test purposes) session timeout.
> The Cloudbees failure quoted above also seems to indicate that this happened:
> {code:java}
> ...[truncated 738209 chars]...
> 23. (org.apache.kafka.controller.QuorumController:768)
> [2023-06-02 18:17:22,202] DEBUG [QuorumController id=0] Scheduling write
> event for maybeBalancePartitionLeaders because scheduled (DEFERRED),
> checkIntervalNs (OptionalLong[1000000000]) and isImbalanced (true)
> (org.apache.kafka.controller.QuorumController:1401)
> [2023-06-02 18:17:22,202] INFO [QuorumController id=0] Fencing broker 2
> because its session has timed out.
> (org.apache.kafka.controller.ReplicationControlManager:1459)
> [2023-06-02 18:17:22,203] DEBUG [QuorumController id=0] handleBrokerFenced:
> changing partition(s): foo-0, foo-1, foo-2
> (org.apache.kafka.controller.ReplicationControlManager:1750)
> [2023-06-02 18:17:22,203] DEBUG [QuorumController id=0] partition change for
> foo-0 with topic ID 033_QSX7TfitL4SDzoeR4w: leader: 2 -> -1, leaderEpoch: 2
> -> 3, partitionEpoch: 2 -> 3
> (org.apache.kafka.controller.ReplicationControlManager:157)
> [2023-06-02 18:17:22,204] DEBUG [QuorumController id=0] partition change for
> foo-1 with topic ID 033_QSX7TfitL4SDzoeR4w: isr: [2, 3] -> [3], leaderEpoch:
> 3 -> 4, partitionEpoch: 4 -> 5
> (org.apache.kafka.controller.ReplicationControlManager:157)
> [2023-06-02 18:17:22,204] DEBUG [QuorumController id=0] partition change for
> foo-2 with topic ID 033_QSX7TfitL4SDzoeR4w: leader: 2 -> -1, leaderEpoch: 2
> -> 3, partitionEpoch: 2 -> 3
> (org.apache.kafka.controller.ReplicationControlManager:157)
> [2023-06-02 18:17:22,205] DEBUG append(batch=LocalRecordBatch(leaderEpoch=1,
> appendTimestamp=240,
> records=[ApiMessageAndVersion(PartitionChangeRecord(partitionId=0,
> topicId=033_QSX7TfitL4SDzoeR4w, isr=null, leader=-1, replicas=null,
> removingReplicas=null, addingReplicas=null, leaderRecoveryState=-1) at
> version 0), ApiMessageAndVersion(PartitionChangeRecord(partitionId=1,
> topicId=033_QSX7TfitL4SDzoeR4w, isr=[3], leader=3, replicas=null,
> removingReplicas=null, addingReplicas=null, leaderRecoveryState=-1) at
> version 0), ApiMessageAndVersion(PartitionChangeRecord(partitionId=2,
> topicId=033_QSX7TfitL4SDzoeR4w, isr=null, leader=-1, replicas=null,
> removingReplicas=null, addingReplicas=null, leaderRecoveryState=-1) at
> version 0), ApiMessageAndVersion(BrokerRegistrationChangeRecord(brokerId=2,
> brokerEpoch=3, fenced=1, inControlledShutdown=0) at version 0)]),
> prevOffset=27) (org.apache.kafka.metalog.LocalLogManager$SharedLogData:253)
> [2023-06-02 18:17:22,205] DEBUG [QuorumController id=0] Creating in-memory
> snapshot 27 (org.apache.kafka.timeline.SnapshotRegistry:197)
> [2023-06-02 18:17:22,205] DEBUG [LocalLogManager 0] Node 0: running log
> check. (org.apache.kafka.metalog.LocalLogManager:512)
> [2023-06-02 18:17:22,205] DEBUG [QuorumController id=0] Read-write operation
> maybeFenceReplicas(451616131) will be completed when the log reaches offset
> 27. (org.apache.kafka.controller.QuorumController:768)
> [2023-06-02 18:17:22,208] INFO [QuorumController id=0] Fencing broker 3
> because its session has timed out.
> (org.apache.kafka.controller.ReplicationControlManager:1459)
> [2023-06-02 18:17:22,209] DEBUG [QuorumController id=0] handleBrokerFenced:
> changing partition(s): foo-1
> (org.apache.kafka.controller.ReplicationControlManager:1750)
> [2023-06-02 18:17:22,209] DEBUG [QuorumController id=0] partition change for
> foo-1 with topic ID 033_QSX7TfitL4SDzoeR4w: leader: 3 -> -1, leaderEpoch: 4
> -> 5, partitionEpoch: 5 -> 6
> (org.apache.kafka.controller.ReplicationControlManager:157){code}

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
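
One way to check a captured test log for the unintended fencings described in the issue is to scan it for the "session has timed out" message that appears in the excerpt. The following is only a rough sketch under stated assumptions: the class `SessionTimeoutFencingScan` and the method `brokersFencedByTimeout` are hypothetical names introduced here and are not part of Kafka, and the only thing taken from the source is the message text logged by `ReplicationControlManager`. The pattern tolerates the `> ` quote markers and line wraps that the e-mail quoting inserts in the middle of a message.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Kafka: lists the brokers that a log
// reports as fenced because their session timed out.
public class SessionTimeoutFencingScan {

    // Message text copied from the excerpt. "[\\s>]+" allows a plain space,
    // or a line wrap followed by a "> " quote marker, between the two halves.
    private static final Pattern FENCED_BY_TIMEOUT = Pattern.compile(
            "Fencing broker (\\d+)[\\s>]+because its session has timed out");

    /** Returns the broker IDs fenced by session timeout, in order of appearance. */
    static List<Integer> brokersFencedByTimeout(String log) {
        List<Integer> ids = new ArrayList<>();
        Matcher m = FENCED_BY_TIMEOUT.matcher(log);
        while (m.find()) {
            ids.add(Integer.parseInt(m.group(1)));
        }
        return ids;
    }

    public static void main(String[] args) {
        // Lines taken from the quoted excerpt, including the quote markers.
        String excerpt = String.join("\n",
                "> [2023-06-02 18:17:22,202] INFO [QuorumController id=0] Fencing broker 2",
                "> because its session has timed out.",
                "> (org.apache.kafka.controller.ReplicationControlManager:1459)",
                "> [2023-06-02 18:17:22,208] INFO [QuorumController id=0] Fencing broker 3",
                "> because its session has timed out.",
                "> (org.apache.kafka.controller.ReplicationControlManager:1459)");
        System.out.println(brokersFencedByTimeout(excerpt)); // prints [2, 3]
    }
}
```

If the returned list contains brokers that the test never fences on purpose, that would be consistent with the session-timeout explanation above, though it does not by itself prove the cause.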