Jason Gustafson created KAFKA-14154:
---------------------------------------
Summary: Persistent URP after controller soft failure
Key: KAFKA-14154
URL: https://issues.apache.org/jira/browse/KAFKA-14154
Project: Kafka
Issue Type: Bug
Reporter: Jason Gustafson
Assignee: Jason Gustafson
We ran into a scenario where a partition leader was unable to expand the ISR
after a soft controller failover. Here is what happened:
Initial state: leader=1, isr=[1,2], leader epoch=10. Broker 1 is acting as the
current controller.
1. Broker 1 loses its session in Zookeeper.
2. Broker 2 becomes the new controller.
3. During initialization, controller 2 removes 1 from the ISR. So the state is
updated: leader=2, isr=[2], leader epoch=11.
4. Broker 2 receives `LeaderAndIsr` from the new controller with leader
epoch=11.
5. Broker 2 immediately tries to add replica 1 back to the ISR since it is
still fetching and is caught up. However, the
`BrokerToControllerChannelManager` is still pointed at controller 1, so that is
where the `AlterPartition` is sent.
6. Controller 1 does not yet realize that it is not the controller, so it
processes the `AlterPartition` request. It sees the leader epoch of 11, which
is higher than what it has in its own context. Following the changes to the
`AlterPartition` validation in https://github.com/apache/kafka/pull/12032/files,
the controller returns FENCED_LEADER_EPOCH (sketched below).
7. After receiving FENCED_LEADER_EPOCH from the old controller, the leader is
stuck: it assumes the error means that another LeaderAndIsr request is
expected, so it does not retry.
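To make step 6 concrete, here is a minimal, self-contained sketch of the
stricter post-PR-12032 validation as described above. The names and values are
illustrative, not Kafka's actual code: the stale controller fences any request
whose leader epoch differs from the epoch in its own context, even when the
request's epoch is newer.

```scala
object FencedValidationSketch {
  sealed trait AlterPartitionError
  case object NoError extends AlterPartitionError
  case object FencedLeaderEpoch extends AlterPartitionError

  // Leader epoch in the stale controller's context (still 10 after the soft failure).
  val contextLeaderEpoch = 10

  // Post-PR-12032 style check: any mismatch is fenced, including a newer epoch.
  def validate(requestLeaderEpoch: Int): AlterPartitionError =
    if (requestLeaderEpoch != contextLeaderEpoch) FencedLeaderEpoch
    else NoError

  def main(args: Array[String]): Unit =
    println(validate(11)) // FencedLeaderEpoch: leader 2's epoch-11 request is rejected
}
```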
Prior to https://github.com/apache/kafka/pull/12032/files,
the way we handled this case was a little different. We only verified that the
leader epoch in the request was at least as large as the current epoch in the
controller context. Anything higher was accepted. The controller would have
attempted to write the updated state to Zookeeper. This update would have
failed because of the controller epoch check; however, in that case we would
have returned NOT_CONTROLLER, which is handled in `AlterPartitionManager`.
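For contrast, here is a sketch of that older flow under the same assumptions
(again illustrative, not the actual controller code): validation accepted any
epoch at least as large as the context epoch, and the zombie controller only
failed at the conditional Zookeeper write, which surfaced as NOT_CONTROLLER.

```scala
object PreviousValidationSketch {
  sealed trait AlterPartitionError
  case object NoError extends AlterPartitionError
  case object FencedLeaderEpoch extends AlterPartitionError
  case object NotController extends AlterPartitionError

  val contextLeaderEpoch = 10     // stale controller's context
  val contextControllerEpoch = 20 // zombie controller's own epoch
  val zkControllerEpoch = 21      // epoch the new controller has recorded in Zookeeper

  def handleAlterPartition(requestLeaderEpoch: Int): AlterPartitionError = {
    // Old check: only strictly smaller leader epochs were fenced.
    if (requestLeaderEpoch < contextLeaderEpoch) return FencedLeaderEpoch
    // Validation passed, so the conditional write is attempted; the controller
    // epoch check fails for the zombie and maps to NOT_CONTROLLER.
    if (contextControllerEpoch < zkControllerEpoch) NotController else NoError
  }

  def main(args: Array[String]): Unit =
    println(handleAlterPartition(11)) // NotController, which AlterPartitionManager handles by retrying
}
```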
It is tempting to revert the logic, but the risk is in the idempotency check:
https://github.com/apache/kafka/pull/12032/files#diff-3e042c962e80577a4cc9bbcccf0950651c6b312097a86164af50003c00c50d37L2369.
If the AlterPartition request happened to match the state inside the old
controller, the controller would consider the update successful and return no
error. But if its state was already stale at that point, the leader might be
misled into assuming that the state had been updated.
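The hazard can be shown with a small sketch (hypothetical state, not the real
check): when the request happens to equal the stale cached state, the
idempotency short-circuit reports success without ever reaching the Zookeeper
write that would have produced NOT_CONTROLLER.

```scala
object IdempotencyHazardSketch {
  case class PartitionState(leader: Int, isr: List[Int], leaderEpoch: Int)

  // Whatever happens to be cached in the zombie controller's context.
  val staleContextState = PartitionState(leader = 2, isr = List(1, 2), leaderEpoch = 11)

  def handleAlterPartition(requested: PartitionState): String =
    if (requested == staleContextState)
      "NONE" // short-circuit: success reported, no Zookeeper write, leader misled
    else
      "attempt conditional Zookeeper write" // would fail with NOT_CONTROLLER here

  def main(args: Array[String]): Unit =
    println(handleAlterPartition(PartitionState(2, List(1, 2), 11))) // prints NONE
}
```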
One way to fix this problem without weakening the validation is to rely on the
controller epoch in `AlterPartitionManager`. When we discover a new controller,
we also discover its epoch, so we can pass that through. The `LeaderAndIsr`
request already includes the controller epoch of the controller that sent it,
and we already propagate this through to `AlterPartition.submit`. Hence all we
need to do is verify that the epoch of the current controller target is at
least as large as the one discovered through the `LeaderAndIsr`.
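A sketch of the proposed guard (names are hypothetical; this is not a patch):
`AlterPartitionManager` would compare the epoch of the controller it is
currently targeting against the controller epoch carried by the `LeaderAndIsr`
request, and hold off on sending until the target is at least that new.

```scala
object ControllerEpochGuardSketch {
  case class ControllerTarget(brokerId: Int, controllerEpoch: Int)

  // Controller epoch carried by the LeaderAndIsr request that made us leader
  // with epoch 11; propagated through to the submit path.
  val epochFromLeaderAndIsr = 21

  // Only send AlterPartition to a controller at least as new as the one that
  // sent the LeaderAndIsr; otherwise wait for controller rediscovery.
  def canSubmit(target: ControllerTarget): Boolean =
    target.controllerEpoch >= epochFromLeaderAndIsr

  def main(args: Array[String]): Unit = {
    println(canSubmit(ControllerTarget(brokerId = 1, controllerEpoch = 20))) // false: zombie controller 1
    println(canSubmit(ControllerTarget(brokerId = 2, controllerEpoch = 21))) // true: new controller 2
  }
}
```

Under this check, the request in step 5 above would never be sent to the stale
controller 1 in the first place; the manager would wait until it discovers
controller 2, whose epoch satisfies the bound.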