Jason Gustafson created KAFKA-9484:
--------------------------------------

             Summary: Unnecessary LeaderAndIsr update following reassignment 
completion
                 Key: KAFKA-9484
                 URL: https://issues.apache.org/jira/browse/KAFKA-9484
             Project: Kafka
          Issue Type: Bug
            Reporter: Jason Gustafson


Following the completion of a reassignment, the controller executes two 
steps: first, it elects a new leader (if needed) and sends a LeaderAndIsr 
update (in any case) with the new target replica set; second, it removes 
unneeded replicas from the replica set and sends another round of LeaderAndIsr 
updates. I doubt the need for the first round of updates in the case that 
the leader doesn't need changing. 

For example, suppose we have the following reassignment state: 

replicas=[1,2,3,4], adding=[4], removing=[1], isr=[1,2,3,4], leader=2, epoch=10

First the controller will bump the epoch with the target replica set, which 
will result in a round of LeaderAndIsr requests to the target replica set with 
the following state: 

replicas=[2,3,4], adding=[], removing=[], isr=[1,2,3,4], leader=2, epoch=11 

Immediately following this, the controller will bump the epoch again and remove 
the unneeded replica. This will result in another round of LeaderAndIsr 
requests with the following state: 

replicas=[2,3,4], adding=[], removing=[], isr=[2,3,4], leader=2, epoch=12 
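
To make the two rounds concrete, here is a minimal Python sketch of the state 
transitions described above. It is only a model for discussion, not controller 
code: PartitionState, shrink_to_target and drop_removed_from_isr are 
hypothetical names, and the second step assumes that the removed replica (1) 
is the one dropped from the ISR. 

```python
from dataclasses import dataclass, replace
from typing import Tuple


@dataclass(frozen=True)
class PartitionState:
    """Hypothetical model of the per-partition state in this example."""
    replicas: Tuple[int, ...]
    adding: Tuple[int, ...]
    removing: Tuple[int, ...]
    isr: Tuple[int, ...]
    leader: int
    epoch: int


def shrink_to_target(state: PartitionState) -> PartitionState:
    """First round: bump the epoch and set RS = TRS. The ISR is untouched."""
    target = tuple(r for r in state.replicas if r not in state.removing)
    return replace(state, replicas=target, adding=(), removing=(),
                   epoch=state.epoch + 1)


def drop_removed_from_isr(state: PartitionState) -> PartitionState:
    """Second round: bump the epoch and drop non-replicas from the ISR."""
    isr = tuple(r for r in state.isr if r in state.replicas)
    return replace(state, isr=isr, epoch=state.epoch + 1)


initial = PartitionState(replicas=(1, 2, 3, 4), adding=(4,), removing=(1,),
                         isr=(1, 2, 3, 4), leader=2, epoch=10)
after_round_one = shrink_to_target(initial)
after_round_two = drop_removed_from_isr(after_round_one)
print(after_round_one)
print(after_round_two)
```

In this model the leader is the same in all three states, so the only thing 
the first round changes for the brokers is the replica set and the epoch. 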

The first round of LeaderAndIsr updates puzzles me a bit. It is justified in 
the code with this comment: 

{code} 
B3. Send a LeaderAndIsr request with RS = TRS. This will prevent the leader 
from adding any replica in TRS - ORS back in the isr. 
{code} 
(I think the comment is backwards. It should be ORS (original replica set) - 
TRS (target replica set).) 

It sounds like we are trying to prevent a removed member of ORS from being 
added back to the ISR, but even if it were added, it would be removed in the 
next step anyway. In the uncommon case that an ORS replica is out of sync, 
there does not seem to be any benefit to this first update, since it pays the 
certain cost of one write in order to save the speculative cost of one write. 
Additionally, it would be useful if the protocol could enforce the invariant 
that the ISR is always a subset of the replica set.
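
As a rough illustration of that invariant, the following Python sketch checks 
it against the states in the example. The helper names are hypothetical and 
nothing here is existing Kafka code; it only shows that the intermediate state 
sent in the first round (replicas=[2,3,4], isr=[1,2,3,4]) would be rejected by 
such a check. 

```python
from typing import Iterable


def isr_within_replicas(replicas: Iterable[int], isr: Iterable[int]) -> bool:
    """Return True if every ISR member is also in the replica set."""
    return set(isr) <= set(replicas)


def validate_update(replicas: Iterable[int], isr: Iterable[int]) -> None:
    """Hypothetical guard: reject an update that violates the invariant."""
    replicas, isr = list(replicas), list(isr)
    if not isr_within_replicas(replicas, isr):
        extra = sorted(set(isr) - set(replicas))
        raise ValueError(f"ISR members not in replica set: {extra}")


# Initial reassignment state: the invariant holds.
validate_update([1, 2, 3, 4], [1, 2, 3, 4])

# State sent in the first round: replica 1 is in the ISR but not a replica.
try:
    validate_update([2, 3, 4], [1, 2, 3, 4])
    first_round_ok = True
except ValueError:
    first_round_ok = False
```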



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
