[ https://issues.apache.org/jira/browse/KAFKA-15052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Luke Chen resolved KAFKA-15052.
-------------------------------
    Resolution: Fixed

> Fix flaky test QuorumControllerTest.testBalancePartitionLeaders()
> -----------------------------------------------------------------
>
>                 Key: KAFKA-15052
>                 URL: https://issues.apache.org/jira/browse/KAFKA-15052
>             Project: Kafka
>          Issue Type: Test
>            Reporter: Dimitar Dimitrov
>            Assignee: Dimitar Dimitrov
>            Priority: Major
>              Labels: flaky-test
>             Fix For: 3.6.0
>
>
> Test failed at
> [https://ci-builds.apache.org/blue/organizations/jenkins/Kafka%2Fkafka/detail/trunk/1892/tests/]
> as well as in various local runs.
> The test creates a topic, fences a broker, notes partition imbalance due to
> another broker taking over the partition the fenced broker lost, re-registers
> and unfences the fenced broker, sends {{AlterPartition}} for the lost
> partition adding the now unfenced broker back to its ISR, then waits for the
> partition imbalance to disappear.
> The local failures seem to happen when the brokers (including the ones that
> never get fenced by the test) accidentally get fenced by losing their session
> due to reaching the (aggressively low for test purposes) session timeout.
> The Cloudbees failure quoted above also seems to indicate that this happened:
> {code:java}
> ...[truncated 738209 chars]...
> 23. (org.apache.kafka.controller.QuorumController:768)
> [2023-06-02 18:17:22,202] DEBUG [QuorumController id=0] Scheduling write
> event for maybeBalancePartitionLeaders because scheduled (DEFERRED),
> checkIntervalNs (OptionalLong[1000000000]) and isImbalanced (true)
> (org.apache.kafka.controller.QuorumController:1401)
> [2023-06-02 18:17:22,202] INFO [QuorumController id=0] Fencing broker 2
> because its session has timed out.
> (org.apache.kafka.controller.ReplicationControlManager:1459)
> [2023-06-02 18:17:22,203] DEBUG [QuorumController id=0] handleBrokerFenced:
> changing partition(s): foo-0, foo-1, foo-2
> (org.apache.kafka.controller.ReplicationControlManager:1750)
> [2023-06-02 18:17:22,203] DEBUG [QuorumController id=0] partition change for
> foo-0 with topic ID 033_QSX7TfitL4SDzoeR4w: leader: 2 -> -1, leaderEpoch: 2
> -> 3, partitionEpoch: 2 -> 3
> (org.apache.kafka.controller.ReplicationControlManager:157)
> [2023-06-02 18:17:22,204] DEBUG [QuorumController id=0] partition change for
> foo-1 with topic ID 033_QSX7TfitL4SDzoeR4w: isr: [2, 3] -> [3], leaderEpoch:
> 3 -> 4, partitionEpoch: 4 -> 5
> (org.apache.kafka.controller.ReplicationControlManager:157)
> [2023-06-02 18:17:22,204] DEBUG [QuorumController id=0] partition change for
> foo-2 with topic ID 033_QSX7TfitL4SDzoeR4w: leader: 2 -> -1, leaderEpoch: 2
> -> 3, partitionEpoch: 2 -> 3
> (org.apache.kafka.controller.ReplicationControlManager:157)
> [2023-06-02 18:17:22,205] DEBUG append(batch=LocalRecordBatch(leaderEpoch=1,
> appendTimestamp=240,
> records=[ApiMessageAndVersion(PartitionChangeRecord(partitionId=0,
> topicId=033_QSX7TfitL4SDzoeR4w, isr=null, leader=-1, replicas=null,
> removingReplicas=null, addingReplicas=null, leaderRecoveryState=-1) at
> version 0), ApiMessageAndVersion(PartitionChangeRecord(partitionId=1,
> topicId=033_QSX7TfitL4SDzoeR4w, isr=[3], leader=3, replicas=null,
> removingReplicas=null, addingReplicas=null, leaderRecoveryState=-1) at
> version 0), ApiMessageAndVersion(PartitionChangeRecord(partitionId=2,
> topicId=033_QSX7TfitL4SDzoeR4w, isr=null, leader=-1, replicas=null,
> removingReplicas=null, addingReplicas=null, leaderRecoveryState=-1) at
> version 0), ApiMessageAndVersion(BrokerRegistrationChangeRecord(brokerId=2,
> brokerEpoch=3, fenced=1, inControlledShutdown=0) at version 0)]),
> prevOffset=27) (org.apache.kafka.metalog.LocalLogManager$SharedLogData:253)
> [2023-06-02 18:17:22,205] DEBUG [QuorumController id=0] Creating in-memory
> snapshot 27 (org.apache.kafka.timeline.SnapshotRegistry:197)
> [2023-06-02 18:17:22,205] DEBUG [LocalLogManager 0] Node 0: running log
> check. (org.apache.kafka.metalog.LocalLogManager:512)
> [2023-06-02 18:17:22,205] DEBUG [QuorumController id=0] Read-write operation
> maybeFenceReplicas(451616131) will be completed when the log reaches offset
> 27. (org.apache.kafka.controller.QuorumController:768)
> [2023-06-02 18:17:22,208] INFO [QuorumController id=0] Fencing broker 3
> because its session has timed out.
> (org.apache.kafka.controller.ReplicationControlManager:1459)
> [2023-06-02 18:17:22,209] DEBUG [QuorumController id=0] handleBrokerFenced:
> changing partition(s): foo-1
> (org.apache.kafka.controller.ReplicationControlManager:1750)
> [2023-06-02 18:17:22,209] DEBUG [QuorumController id=0] partition change for
> foo-1 with topic ID 033_QSX7TfitL4SDzoeR4w: leader: 3 -> -1, leaderEpoch: 4
> -> 5, partitionEpoch: 5 -> 6
> (org.apache.kafka.controller.ReplicationControlManager:157){code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
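
The failure mode described in the issue (brokers that the test never fences on purpose losing their session because the session timeout is very low) can be illustrated with a toy model. This is only a sketch in Python, not Kafka's controller code: `SessionTracker`, `simulate`, and all timeout and stall values are hypothetical names and numbers chosen for illustration, and the strict `>` comparison is an assumption of the sketch rather than a claim about how the controller compares timestamps.

```python
from dataclasses import dataclass, field


@dataclass
class SessionTracker:
    """Toy heartbeat/session model (hypothetical, not Kafka's API)."""

    session_timeout_ms: int
    last_heartbeat_ms: dict = field(default_factory=dict)
    fenced: set = field(default_factory=set)

    def heartbeat(self, broker_id: int, now_ms: int) -> None:
        # A heartbeat renews the broker's session and clears any fencing.
        self.last_heartbeat_ms[broker_id] = now_ms
        self.fenced.discard(broker_id)

    def expire(self, now_ms: int) -> list:
        # Fence every unfenced broker whose last heartbeat is older than
        # the session timeout; return the newly fenced broker IDs.
        newly_fenced = sorted(
            broker_id
            for broker_id, seen_ms in self.last_heartbeat_ms.items()
            if broker_id not in self.fenced
            and now_ms - seen_ms > self.session_timeout_ms
        )
        self.fenced.update(newly_fenced)
        return newly_fenced


def simulate(session_timeout_ms: int, stall_ms: int) -> list:
    """Three brokers heartbeat at t=0, then nothing runs for stall_ms."""
    tracker = SessionTracker(session_timeout_ms)
    for broker_id in (1, 2, 3):
        tracker.heartbeat(broker_id, now_ms=0)
    # The stall stands in for a slow CI host or a long pause in the test.
    return tracker.expire(now_ms=stall_ms)


if __name__ == "__main__":
    print("low timeout :", simulate(session_timeout_ms=100, stall_ms=500))
    print("high timeout:", simulate(session_timeout_ms=1000, stall_ms=500))
```

With the low timeout, the same stall that a more generous timeout tolerates fences every broker, which matches the "Fencing broker 2" and "Fencing broker 3" entries in the quoted log.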