Jason Gustafson created KAFKA-13790:
---------------------------------------

             Summary: ReplicaManager should be robust to all partition updates 
from kraft metadata log
                 Key: KAFKA-13790
                 URL: https://issues.apache.org/jira/browse/KAFKA-13790
             Project: Kafka
          Issue Type: Bug
            Reporter: Jason Gustafson
            Assignee: Jason Gustafson


There are two ways that partition state can be updated in the zk world: one is 
through `LeaderAndIsr` requests and one is through `AlterPartition` responses. 
All changes made to partition state result in new LeaderAndIsr requests, but 
replicas ignore them if the leader epoch is less than or equal to the currently 
known leader epoch. Basically it works like this (a simplified sketch of these 
rules follows the list):
 * Changes made by the leader go through AlterPartition requests. These changes 
bump the partition epoch (or zk version) but leave the leader epoch unchanged. 
LeaderAndIsr requests are still sent by the controller, but replicas ignore them 
because the leader epoch has not changed; partition state is instead updated 
only when the AlterPartition response is received.
 * Changes initiated by the controller are made directly by the controller and 
always result in a leader epoch bump. These changes are sent to replicas through 
LeaderAndIsr requests and are applied by the replicas.
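
For illustration, here is a minimal, hypothetical sketch of those two rules. The 
types and names (`PartitionState`, `shouldApplyLeaderAndIsr`, 
`applyAlterPartitionResponse`) are simplified stand-ins, not Kafka's actual 
classes or APIs:

```scala
// Hypothetical, simplified model of the zk-world update rules described above.
final case class PartitionState(leaderEpoch: Int, partitionEpoch: Int, isr: Set[Int])

object ZkStyleUpdateRules {
  // A LeaderAndIsr update from the controller is applied only if it carries a
  // strictly larger leader epoch than the replica already knows.
  def shouldApplyLeaderAndIsr(current: PartitionState, incomingLeaderEpoch: Int): Boolean =
    incomingLeaderEpoch > current.leaderEpoch

  // An AlterPartition response bumps the partition epoch (zk version) but
  // leaves the leader epoch unchanged.
  def applyAlterPartitionResponse(current: PartitionState,
                                  newIsr: Set[Int],
                                  newPartitionEpoch: Int): PartitionState =
    current.copy(isr = newIsr, partitionEpoch = newPartitionEpoch)
}

object ZkStyleUpdateRulesDemo extends App {
  val state = PartitionState(leaderEpoch = 5, partitionEpoch = 10, isr = Set(1, 2, 3))

  // Controller-driven change: leader epoch bumped, so the replica applies it.
  println(ZkStyleUpdateRules.shouldApplyLeaderAndIsr(state, incomingLeaderEpoch = 6)) // true

  // A LeaderAndIsr that only reflects an ISR change made by the leader itself:
  // same leader epoch, so the replica ignores it and relies on the
  // AlterPartition response instead.
  println(ZkStyleUpdateRules.shouldApplyLeaderAndIsr(state, incomingLeaderEpoch = 5)) // false
  println(ZkStyleUpdateRules.applyAlterPartitionResponse(state, Set(1, 2), newPartitionEpoch = 11))
}
```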

The code in `kafka.server.ReplicaManager` and `kafka.cluster.Partition` is built 
on top of these assumptions. The logic in `makeLeader`, for example, assumes that 
the leader epoch has indeed been bumped. Specifically, follower state gets reset 
and a new entry is written to the leader epoch cache.
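
A hedged sketch of that assumption (the `SimplePartition` and 
`SimpleLeaderAndIsr` types below are illustrative stand-ins, not the real 
`kafka.cluster.Partition`): a `makeLeader`-style method that unconditionally 
resets follower state and appends to the leader epoch cache is only safe if the 
leader epoch really was bumped.

```scala
import scala.collection.mutable

// Illustrative stand-ins; not Kafka's actual classes.
final case class SimpleLeaderAndIsr(leaderEpoch: Int, isr: Set[Int])

class SimplePartition {
  private var currentLeaderEpoch: Int = -1
  // Stand-in for the leader epoch cache: leader epoch -> start offset of that epoch.
  private val leaderEpochCache = mutable.LinkedHashMap.empty[Int, Long]
  // Stand-in for per-follower replica state (e.g. fetch offsets).
  private val followerEndOffsets = mutable.Map.empty[Int, Long]

  def leaderEpoch: Int = currentLeaderEpoch

  // Assumes state.leaderEpoch is strictly greater than currentLeaderEpoch. If an
  // update with the same epoch is replayed (as kraft metadata updates without an
  // epoch bump can do), follower state is wrongly reset and a stale or duplicate
  // entry may end up in the epoch cache.
  def makeLeader(state: SimpleLeaderAndIsr, logEndOffset: Long): Unit = {
    currentLeaderEpoch = state.leaderEpoch
    leaderEpochCache.put(state.leaderEpoch, logEndOffset)
    followerEndOffsets.clear()
    state.isr.foreach(replicaId => followerEndOffsets.put(replicaId, -1L))
  }
}
```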

In KRaft, we also have two paths for updating partition state. One is 
AlterPartition, just like in the zk world. The second is updates received from 
the metadata log. These follow the same path as LeaderAndIsr requests for the 
most part, but a big difference is that all changes are sent down to 
`kafka.cluster.Partition`, even those that do not bump the leader epoch. This 
breaks the assumptions mentioned above in `makeLeader`, which could result in 
leader epoch cache inconsistency. Another side effect on the follower side is 
that replica fetchers for updated partitions get unnecessarily restarted. There 
may be other side effects as well.
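
To make the difference concrete, here is a hypothetical sketch of a 
metadata-delta applier that forwards every changed partition without checking 
for a leader epoch bump. `NaiveMetadataApplier`, `applyTopicDelta`, and the 
other names are invented for illustration and do not correspond to the actual 
broker code:

```scala
import scala.collection.mutable

// Invented types for illustration only.
final case class PartitionChange(topicPartition: String,
                                 leaderId: Int,
                                 leaderEpoch: Int,
                                 isr: Set[Int])

class NaiveMetadataApplier(localBrokerId: Int) {
  private val knownLeaderEpochs = mutable.Map.empty[String, Int]

  def applyTopicDelta(changes: Seq[PartitionChange]): Unit =
    changes.foreach { change =>
      val currentEpoch = knownLeaderEpochs.getOrElse(change.topicPartition, -1)
      println(s"${change.topicPartition}: known epoch $currentEpoch, incoming epoch ${change.leaderEpoch}")
      // Missing check: nothing verifies change.leaderEpoch > currentEpoch, so an
      // ISR-only update (same leader epoch) still flows into the leader/follower
      // transition handling and restarts the replica fetcher on the follower.
      knownLeaderEpochs.put(change.topicPartition, change.leaderEpoch)
      if (change.leaderId == localBrokerId) becomeLeader(change)
      else becomeFollower(change)
    }

  private def becomeLeader(change: PartitionChange): Unit =
    println(s"make leader: ${change.topicPartition} epoch=${change.leaderEpoch} isr=${change.isr}")

  private def becomeFollower(change: PartitionChange): Unit =
    println(s"restart fetcher: ${change.topicPartition} epoch=${change.leaderEpoch}")
}
```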

We need to either replicate the same filtering logic as the zookeeper path in 
the kraft metadata handling, or make the logic robust to all updates, including 
those without a leader epoch bump.
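
As a rough illustration of the first option (not a proposed patch), the handling 
could classify each incoming update by comparing leader epochs, so that only a 
genuine bump triggers the full leadership transition. The names below are 
hypothetical:

```scala
// Illustrative only: mirrors the zk-side epoch check so that epoch-unchanged
// updates are treated as ISR-only changes rather than leadership transitions.
object RobustUpdateHandling {

  sealed trait UpdateAction
  // Reset follower state, write the leader epoch cache, restart fetchers as needed.
  case object FullLeadershipTransition extends UpdateAction
  // Apply only the ISR / partition epoch change; leave follower state and fetchers alone.
  case object IsrOnlyUpdate extends UpdateAction
  // Stale update carrying an older leader epoch; drop it.
  case object Ignore extends UpdateAction

  def classify(currentLeaderEpoch: Int, incomingLeaderEpoch: Int): UpdateAction =
    if (incomingLeaderEpoch > currentLeaderEpoch) FullLeadershipTransition
    else if (incomingLeaderEpoch == currentLeaderEpoch) IsrOnlyUpdate
    else Ignore
}
```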


