[ https://issues.apache.org/jira/browse/KAFKA-13173?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jason Gustafson updated KAFKA-13173:
------------------------------------
    Fix Version/s: 3.0.0

> KRaft controller does not handle simultaneous broker expirations correctly
> --------------------------------------------------------------------------
>
>                 Key: KAFKA-13173
>                 URL: https://issues.apache.org/jira/browse/KAFKA-13173
>             Project: Kafka
>          Issue Type: Bug
>            Reporter: Jason Gustafson
>            Priority: Blocker
>             Fix For: 3.0.0
>
>
> In `ReplicationControlManager.fenceStaleBrokers`, we find all of the current
> stale replicas and attempt to remove them from the ISR. However, when
> multiple expirations occur at once, we do not properly accumulate the ISR
> changes. For example, I ran a test where the ISR of a partition was
> initialized to [1, 2, 3]. Then I triggered a timeout of replicas 2 and 3 at
> the same time. The records that were generated by `fenceStaleBrokers` were
> the following:
> {code}
> ApiMessageAndVersion(PartitionChangeRecord(partitionId=0,
> topicId=_seg8hBuSymBHUQ1sMKr2g, isr=[1, 3], leader=1, replicas=null,
> removingReplicas=null, addingReplicas=null) at version 0),
> ApiMessageAndVersion(FenceBrokerRecord(id=2, epoch=102) at version 0),
> ApiMessageAndVersion(PartitionChangeRecord(partitionId=0,
> topicId=_seg8hBuSymBHUQ1sMKr2g, isr=[1, 2], leader=1, replicas=null,
> removingReplicas=null, addingReplicas=null) at version 0),
> ApiMessageAndVersion(FenceBrokerRecord(id=3, epoch=103) at version 0)]
> {code}
> First the ISR is shrunk to [1, 3] as broker 2 is fenced. We also see the
> record to fence broker 2. Then the ISR is modified to [1, 2] as the fencing
> of broker 3 is handled. So we did not account for the fact that we had
> already fenced broker 2 in the request.
> A simple solution for now is to change the logic to handle fencing only one
> broker at a time.
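
The accumulation problem described above can be reproduced in isolation. The following is a minimal sketch, not Kafka's actual controller code: the class `FenceSketch` and both method names are hypothetical, and the ISR is modeled as a plain list of broker ids with no leader election or record types. It contrasts computing every ISR change from the same starting snapshot (which yields the [1, 3] then [1, 2] sequence from the report) with carrying each change forward into the next one.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; none of these names exist in Kafka.
final class FenceSketch {

    // Mirrors the reported behavior: every ISR change is derived from the
    // same original ISR, so earlier removals are forgotten.
    static List<List<Integer>> fenceFromSnapshot(List<Integer> isr, List<Integer> staleBrokers) {
        List<List<Integer>> changes = new ArrayList<>();
        for (int broker : staleBrokers) {
            List<Integer> next = new ArrayList<>(isr);   // always starts from the original ISR
            next.remove(Integer.valueOf(broker));        // remove by value, not by index
            changes.add(next);
        }
        return changes;
    }

    // Accumulating variant: each ISR change builds on the previous one.
    static List<List<Integer>> fenceAccumulating(List<Integer> isr, List<Integer> staleBrokers) {
        List<List<Integer>> changes = new ArrayList<>();
        List<Integer> current = new ArrayList<>(isr);
        for (int broker : staleBrokers) {
            current.remove(Integer.valueOf(broker));
            changes.add(new ArrayList<>(current));       // copy so later removals do not alter it
        }
        return changes;
    }

    public static void main(String[] args) {
        List<Integer> isr = List.of(1, 2, 3);
        List<Integer> stale = List.of(2, 3);
        System.out.println(fenceFromSnapshot(isr, stale)); // prints [[1, 3], [1, 2]]
        System.out.println(fenceAccumulating(isr, stale)); // prints [[1, 3], [1]]
    }
}
```

Fencing one broker per pass, as the report proposes, sidesteps the issue in a different way: with a single stale broker there is only one change per partition, so the snapshot and the accumulated state cannot diverge.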