Jason Gustafson created KAFKA-9484:
--------------------------------------
Summary: Unnecessary LeaderAndIsr update following reassignment completion
Key: KAFKA-9484
URL: https://issues.apache.org/jira/browse/KAFKA-9484
Project: Kafka
Issue Type: Bug
Reporter: Jason Gustafson
Following the completion of the reassignment, the controller executes two
steps: first, it elects a new leader (if needed) and sends a LeaderAndIsr
update (in any case) with the new target replica set; second, it removes
unneeded replicas from the replica set and sends another round of LeaderAndIsr
updates. I doubt that the first round of updates is needed when the leader
does not need to change.
For example, suppose we have the following reassignment state:
replicas=[1,2,3,4], adding=[4], removing=[1], isr=[1,2,3,4], leader=2, epoch=10
First the controller will bump the epoch with the target replica set, which
will result in a round of LeaderAndIsr requests to the target replica set with
the following state:
replicas=[2,3,4], adding=[], removing=[], isr=[1,2,3,4], leader=2, epoch=11
Immediately following this, the controller will bump the epoch again and remove
the unneeded replica. This will result in another round of LeaderAndIsr
requests with the following state:
replicas=[2,3,4], adding=[], removing=[], isr=[2,3,4], leader=2, epoch=12
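To make the two rounds concrete, here is a minimal sketch that models the state transitions described above. The PartitionState class and complete_reassignment function are hypothetical names used only for illustration and are not the controller's actual code. The sketch assumes the current leader is already in the target replica set, so no election is needed.

```python
from dataclasses import dataclass, replace
from typing import List, Tuple


@dataclass(frozen=True)
class PartitionState:
    # Hypothetical model of the fields shown in the example above.
    replicas: Tuple[int, ...]
    adding: Tuple[int, ...]
    removing: Tuple[int, ...]
    isr: Tuple[int, ...]
    leader: int
    epoch: int


def complete_reassignment(state: PartitionState) -> List[PartitionState]:
    """Return the states sent in the two successive LeaderAndIsr rounds."""
    target = tuple(r for r in state.replicas if r not in state.removing)
    # Round 1: replica set becomes the target set, ISR is left as-is, epoch bumped.
    first = replace(state, replicas=target, adding=(), removing=(),
                    epoch=state.epoch + 1)
    # Round 2: replicas outside the target set are dropped from the ISR, epoch bumped again.
    second = replace(first, isr=tuple(r for r in first.isr if r in target),
                     epoch=first.epoch + 1)
    return [first, second]


start = PartitionState(replicas=(1, 2, 3, 4), adding=(4,), removing=(1,),
                       isr=(1, 2, 3, 4), leader=2, epoch=10)
for s in complete_reassignment(start):
    print(s)
```

In this model the first round leaves replica 1 in the ISR even though it is no longer in the replica set, and only the second round removes it.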
The first round of LeaderAndIsr updates puzzles me a bit. It is justified in
the code with this comment:
{code}
B3. Send a LeaderAndIsr request with RS = TRS. This will prevent the leader
from adding any replica in TRS - ORS back in the isr.
{code}
(I think the comment is backwards. It should be ORS (original replica set) -
TRS (target replica set).)
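As a quick illustration of the two set differences, here is a small sketch using the example above. It assumes ORS is the replica set before the reassignment, [1,2,3], and TRS is [2,3,4]. The function name set_differences is made up for this note.

```python
def set_differences(ors, trs):
    """Return both differences between the original and target replica sets."""
    ors, trs = set(ors), set(trs)
    return {
        "trs_minus_ors": trs - ors,  # replicas added by the reassignment
        "ors_minus_trs": ors - trs,  # replicas removed by the reassignment
    }


diffs = set_differences(ors=[1, 2, 3], trs=[2, 3, 4])
print(diffs["trs_minus_ors"])  # the newly added replica, 4
print(diffs["ors_minus_trs"])  # the replica being removed, 1
```

Only the second difference names a replica that should be kept out of the ISR, which is why the comment reads as backwards.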
It sounds like we are trying to prevent a member of ORS from being added back
to the ISR, but even if it did get added, it would be removed in the next step
anyway. In the uncommon case that an ORS replica is out of sync, there does not
seem to be any benefit to this first update since it is basically paying the
cost of one write in order to save the speculative cost of one write.
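For comparison, here is a rough sketch of what a single-round completion might look like when the leader does not change. This is only one reading of the idea above, not a proposed patch. The function name and the dict layout are hypothetical.

```python
def complete_reassignment_single_round(state):
    """Collapse both updates into one epoch bump (unchanged-leader case only)."""
    target = [r for r in state["replicas"] if r not in state["removing"]]
    if state["leader"] not in target:
        # A leader change still needs an election; that path is out of scope here.
        raise ValueError("leader is being removed; an election is required")
    return {
        "replicas": target,
        "adding": [],
        "removing": [],
        # Drop removed replicas from the ISR in the same update.
        "isr": [r for r in state["isr"] if r in target],
        "leader": state["leader"],
        "epoch": state["epoch"] + 1,
    }


start = {"replicas": [1, 2, 3, 4], "adding": [4], "removing": [1],
         "isr": [1, 2, 3, 4], "leader": 2, "epoch": 10}
print(complete_reassignment_single_round(start))
```

With the example state this ends at epoch 11 rather than 12, with one write and one round of requests instead of two.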
Additionally, it would be useful if the protocol could enforce the invariant
that the ISR is always a subset of the replica set.
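The invariant itself is simple to state in code. A minimal sketch, with a hypothetical helper name:

```python
def isr_within_replicas(replicas, isr):
    """Return True when every ISR member is also in the replica set."""
    return set(isr) <= set(replicas)


# The intermediate state after the first round violates the invariant,
# because replica 1 is still in the ISR but no longer in the replica set.
print(isr_within_replicas(replicas=[2, 3, 4], isr=[1, 2, 3, 4]))
# A state whose ISR only contains current replicas satisfies it.
print(isr_within_replicas(replicas=[2, 3, 4], isr=[2, 3, 4]))
```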