Jason Gustafson created KAFKA-13173:
---------------------------------------

             Summary: KRaft controller does not handle simultaneous broker 
expirations correctly
                 Key: KAFKA-13173
                 URL: https://issues.apache.org/jira/browse/KAFKA-13173
             Project: Kafka
          Issue Type: Bug
            Reporter: Jason Gustafson


In `ReplicationControlManager.fenceStaleBrokers`, we find all of the current 
stale replicas and attempt to remove them from the ISR. However, when multiple 
expirations occur at once, we do not properly accumulate the ISR changes. For 
example, I ran a test where the ISR of a partition was initialized to [1, 2, 
3]. Then I triggered a timeout of replicas 2 and 3 at the same time. The 
records that were generated by `fenceStaleBrokers` were the following:

{code}
[ApiMessageAndVersion(PartitionChangeRecord(partitionId=0, topicId=_seg8hBuSymBHUQ1sMKr2g, isr=[1, 3], leader=1, replicas=null, removingReplicas=null, addingReplicas=null) at version 0),
 ApiMessageAndVersion(FenceBrokerRecord(id=2, epoch=102) at version 0),
 ApiMessageAndVersion(PartitionChangeRecord(partitionId=0, topicId=_seg8hBuSymBHUQ1sMKr2g, isr=[1, 2], leader=1, replicas=null, removingReplicas=null, addingReplicas=null) at version 0),
 ApiMessageAndVersion(FenceBrokerRecord(id=3, epoch=103) at version 0)]
{code}

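First the ISR is shrunk to [1, 3] as broker 2 is fenced, and we see the record 
to fence broker 2. Then the ISR is changed to [1, 2] as the fencing of broker 
3 is handled. The second change appears to be computed from the original ISR 
[1, 2, 3], so it does not account for the fact that broker 2 had already been 
fenced earlier in the same call. Once the records are applied in order, the 
fenced broker 2 is back in the ISR, whereas the expected final ISR is [1].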
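
The difference between the two behaviors can be sketched with a small, self-contained model. This is only an illustration under simplifying assumptions: the class and method names (`IsrFencingSketch`, `fromCommittedState`, `accumulated`) are hypothetical and are not taken from `ReplicationControlManager`, and the ISR is modeled as a plain list of broker ids.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified model; not the actual ReplicationControlManager code.
class IsrFencingSketch {
    // Returns a copy of the ISR with the given broker removed.
    static List<Integer> without(List<Integer> isr, int brokerId) {
        List<Integer> result = new ArrayList<>(isr);
        result.remove(Integer.valueOf(brokerId)); // remove by value, not by index
        return result;
    }

    // Problematic pattern: every ISR change is derived from the same
    // committed ISR, so earlier removals in the batch are lost.
    static List<List<Integer>> fromCommittedState(List<Integer> committedIsr,
                                                  List<Integer> staleBrokers) {
        List<List<Integer>> changes = new ArrayList<>();
        for (int brokerId : staleBrokers) {
            changes.add(without(committedIsr, brokerId));
        }
        return changes;
    }

    // Accumulating pattern: each ISR change builds on the previous one.
    static List<List<Integer>> accumulated(List<Integer> committedIsr,
                                           List<Integer> staleBrokers) {
        List<List<Integer>> changes = new ArrayList<>();
        List<Integer> pendingIsr = committedIsr;
        for (int brokerId : staleBrokers) {
            pendingIsr = without(pendingIsr, brokerId);
            changes.add(pendingIsr);
        }
        return changes;
    }

    public static void main(String[] args) {
        List<Integer> isr = List.of(1, 2, 3);
        List<Integer> stale = List.of(2, 3);
        System.out.println(fromCommittedState(isr, stale)); // [[1, 3], [1, 2]]
        System.out.println(accumulated(isr, stale));        // [[1, 3], [1]]
    }
}
```

The first method reproduces the sequence of ISR values seen in the records above; the second shows what the sequence would look like if pending changes were carried forward within the batch.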

A simple solution for now is to change the logic to handle fencing only one 
broker at a time. 
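
As a rough sketch of what "one broker at a time" could look like, the model below handles a single stale broker per call and applies the resulting ISR change before the next call runs. Everything here is an assumption made for illustration: `OneAtATimeFencingSketch` and `fenceNextStaleBroker` are hypothetical names, the state is a single partition's ISR, and the real controller would emit and replay records rather than mutate a field directly.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

// Hypothetical, simplified model; names do not come from the Kafka code base.
class OneAtATimeFencingSketch {
    private List<Integer> isr = new ArrayList<>(List.of(1, 2, 3));
    private final TreeSet<Integer> staleBrokers = new TreeSet<>(List.of(2, 3));

    // Handles at most one stale broker per call and applies the resulting
    // ISR change immediately, so the next call starts from up-to-date state.
    // Returns the new ISR, or null when no stale broker remains.
    List<Integer> fenceNextStaleBroker() {
        Integer brokerId = staleBrokers.pollFirst();
        if (brokerId == null) {
            return null;
        }
        List<Integer> newIsr = new ArrayList<>(isr);
        newIsr.remove(brokerId); // Integer argument selects remove(Object)
        isr = newIsr;
        return newIsr;
    }

    public static void main(String[] args) {
        OneAtATimeFencingSketch controller = new OneAtATimeFencingSketch();
        System.out.println(controller.fenceNextStaleBroker()); // [1, 3]
        System.out.println(controller.fenceNextStaleBroker()); // [1]
        System.out.println(controller.fenceNextStaleBroker()); // null
    }
}
```

Because each call sees the state left by the previous one, no accumulation logic is needed inside a single call, at the cost of taking several passes to fence several brokers.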



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
