Jason Gustafson created KAFKA-13790:
---------------------------------------
Summary: ReplicaManager should be robust to all partition updates
from kraft metadata log
Key: KAFKA-13790
URL: https://issues.apache.org/jira/browse/KAFKA-13790
Project: Kafka
Issue Type: Bug
Reporter: Jason Gustafson
Assignee: Jason Gustafson
There are two ways that partition state can be updated in the zk world: one is
through `LeaderAndIsr` requests and one is through `AlterPartition` responses.
All changes made to partition state result in new LeaderAndIsr requests, but
replicas will ignore them if the leader epoch is less than or equal to the
current known leader epoch. Basically it works like this:
* Changes made by the leader are done through AlterPartition requests. These
changes bump the partition epoch (or zk version), but leave the leader epoch
unchanged. LeaderAndIsr requests are sent by the controller, but replicas
ignore them. Partition state is instead only updated when the AlterPartition
response is received.
* Changes made by the controller are made directly by the controller and
always result in a leader epoch bump. These changes are sent to replicas
through LeaderAndIsr requests and are applied by replicas.
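The two zk-mode paths above can be sketched as a small model. This is only an illustrative simplification, not Kafka's actual code: the class `PartitionStateModel` and its methods are hypothetical names, and real handling involves far more state (logs, fetchers, locking).

```java
import java.util.List;

/** Hypothetical, simplified model of zk-mode partition state; not Kafka's real classes. */
final class PartitionStateModel {
    private int leaderEpoch;
    private int partitionEpoch; // the zk version in zk mode
    private List<Integer> isr;

    PartitionStateModel(int leaderEpoch, int partitionEpoch, List<Integer> isr) {
        this.leaderEpoch = leaderEpoch;
        this.partitionEpoch = partitionEpoch;
        this.isr = List.copyOf(isr);
    }

    /**
     * Controller path: a LeaderAndIsr request is applied only when it carries
     * a leader epoch greater than the current one. Returns true if applied.
     */
    boolean onLeaderAndIsr(int newLeaderEpoch, int newPartitionEpoch, List<Integer> newIsr) {
        if (newLeaderEpoch <= leaderEpoch) {
            return false; // no leader epoch bump: the replica ignores the request
        }
        leaderEpoch = newLeaderEpoch;
        partitionEpoch = newPartitionEpoch;
        isr = List.copyOf(newIsr);
        return true;
    }

    /** Leader path: an AlterPartition response bumps only the partition epoch. */
    void onAlterPartitionResponse(int newPartitionEpoch, List<Integer> newIsr) {
        partitionEpoch = newPartitionEpoch;
        isr = List.copyOf(newIsr);
    }

    int leaderEpoch() { return leaderEpoch; }
    int partitionEpoch() { return partitionEpoch; }
    List<Integer> isr() { return isr; }
}
```

In this model, a leader-initiated ISR change reaches the replica through `onAlterPartitionResponse`, and the LeaderAndIsr request that follows it (same leader epoch) is dropped by the epoch check.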
The code in `kafka.server.ReplicaManager` and `kafka.cluster.Partition` is
built on top of these assumptions. The logic in `makeLeader`, for example,
assumes that the leader epoch has indeed been bumped. Specifically, follower
state gets reset and a new entry is written to the leader epoch cache.
In KRaft, we also have two paths to update partition state. One is
AlterPartition, just like in the zk world. The second is updates received from
the metadata log. These follow the same path as LeaderAndIsr requests for the
most part, but a big difference is that all changes are sent down to
`kafka.cluster.Partition`, even those which do not have a bumped leader epoch.
This breaks the assumptions mentioned above in `makeLeader`, which could result
in leader epoch cache inconsistency. Another side effect of this on the
follower side is that replica fetchers for updated partitions get unnecessarily
restarted. There may be others as well.
We need to either replicate the zookeeper-side logic in KRaft (ignoring updates
without a leader epoch bump) or make the logic robust to all updates, including
those without a leader epoch bump.
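As a rough sketch of the second option, a `makeLeader`-style method could reset follower state and write to the leader epoch cache only when the leader epoch has actually been bumped. Everything here is an assumption for illustration: `LeaderPartitionSketch`, its fields, and its method signatures are hypothetical and do not mirror `kafka.cluster.Partition`.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch of an epoch-aware makeLeader; not Kafka's Partition class. */
final class LeaderPartitionSketch {
    private int leaderEpoch = -1;
    private int partitionEpoch = -1;
    // follower id -> last known fetch offset (-1 means unknown)
    private final Map<Integer, Long> followerFetchOffsets = new HashMap<>();
    // leader epoch -> start offset of that epoch
    private final TreeMap<Integer, Long> leaderEpochCache = new TreeMap<>();

    /** Applies an update from the metadata log. Returns true only on a leader epoch bump. */
    boolean makeLeader(int newLeaderEpoch, int newPartitionEpoch,
                       List<Integer> followers, long logEndOffset) {
        if (newLeaderEpoch < leaderEpoch) {
            return false; // stale leader epoch: ignore entirely
        }
        boolean isNewLeaderEpoch = newLeaderEpoch > leaderEpoch;
        if (!isNewLeaderEpoch && newPartitionEpoch <= partitionEpoch) {
            return false; // same leader epoch and no newer partition epoch: nothing to do
        }
        if (isNewLeaderEpoch) {
            // Only a real leader epoch bump resets followers and adds a cache entry.
            followerFetchOffsets.clear();
            leaderEpochCache.put(newLeaderEpoch, logEndOffset);
            leaderEpoch = newLeaderEpoch;
        }
        // Track the current follower set without discarding surviving state.
        followerFetchOffsets.keySet().retainAll(followers);
        for (int id : followers) {
            followerFetchOffsets.putIfAbsent(id, -1L);
        }
        partitionEpoch = newPartitionEpoch;
        return isNewLeaderEpoch;
    }

    void recordFollowerFetch(int followerId, long offset) {
        followerFetchOffsets.put(followerId, offset);
    }

    long followerFetchOffset(int followerId) {
        return followerFetchOffsets.getOrDefault(followerId, -1L);
    }

    int epochCacheSize() { return leaderEpochCache.size(); }
    int partitionEpoch() { return partitionEpoch; }
}
```

With this shape, an update that changes only the partition epoch still reaches the partition, but it leaves follower state and the epoch cache untouched. A similar guard on the follower side could avoid restarting fetchers when the leader epoch is unchanged.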