[
https://issues.apache.org/jira/browse/KAFKA-14154?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Jason Gustafson resolved KAFKA-14154.
-------------------------------------
Resolution: Fixed
> Persistent URP after controller soft failure
> --------------------------------------------
>
> Key: KAFKA-14154
> URL: https://issues.apache.org/jira/browse/KAFKA-14154
> Project: Kafka
> Issue Type: Bug
> Reporter: Jason Gustafson
> Assignee: Jason Gustafson
> Priority: Blocker
> Fix For: 3.3.0
>
>
> We ran into a scenario where a partition leader was unable to expand the ISR
> after a soft controller failover. Here is what happened:
> Initial state: leader=1, isr=[1,2], leader epoch=10. Broker 1 is acting as
> the current controller.
> 1. Broker 1 loses its session in ZooKeeper.
> 2. Broker 2 becomes the new controller.
> 3. During initialization, controller 2 removes 1 from the ISR. So state is
> updated: leader=2, isr=[2], leader epoch=11.
> 4. Broker 2 receives `LeaderAndIsr` from the new controller with leader
> epoch=11.
> 5. Broker 2 immediately tries to add replica 1 back to the ISR since it is
> still fetching and is caught up. However, the
> `BrokerToControllerChannelManager` is still pointed at controller 1, so that
> is where the `AlterPartition` is sent.
> 6. Controller 1 does not yet realize that it is not the controller, so it
> processes the `AlterPartition` request. It sees the leader epoch of 11, which
> is higher than what it has in its own context. Following the changes to the
> `AlterPartition` validation in
> [https://github.com/apache/kafka/pull/12032/files], the controller returns
> FENCED_LEADER_EPOCH (see the sketch after these steps).
> 7. After receiving the FENCED_LEADER_EPOCH from the old controller, the
> leader is stuck because it assumes that the error implies that another
> LeaderAndIsr request should be sent.
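> Here is a rough sketch, for illustration only, of how that epoch comparison
> plays out on the stale controller. The names are hypothetical (the real
> validation lives in the controller's `AlterPartition` handling, which is
> Scala), but the logic follows the description above:
> {code:java}
> // Hypothetical sketch of the post-PR-12032 validation as described above.
> // The stale controller 1 still has leader epoch 10 in its context, while the
> // leader sends epoch 11 (assigned by the new controller 2).
> final class AlterPartitionValidationSketch {
>     enum Error { NONE, FENCED_LEADER_EPOCH }
>
>     static Error validateLeaderEpoch(int requestLeaderEpoch, int contextLeaderEpoch) {
>         // The stricter validation rejects any epoch that does not match the
>         // controller's own context, including a *newer* one, even though the
>         // real problem is that this controller has already been deposed.
>         if (requestLeaderEpoch != contextLeaderEpoch) {
>             return Error.FENCED_LEADER_EPOCH;
>         }
>         return Error.NONE;
>     }
>
>     public static void main(String[] args) {
>         System.out.println(validateLeaderEpoch(11, 10)); // FENCED_LEADER_EPOCH
>     }
> }
> {code}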
> Prior to
> [https://github.com/apache/kafka/pull/12032/files],
> the way we handled this case was a little different. We only verified that
> the leader epoch in the request was at least as large as the current epoch in
> the controller context. Anything higher was accepted. The controller would
> have attempted to write the updated state to ZooKeeper. That write would
> have failed because of the controller epoch check, and in that case we would
> have returned NOT_CONTROLLER, which is handled in
> `AlterPartitionManager`.
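> For comparison, a sketch of the pre-12032 flow on a stale controller (again
> with hypothetical names), showing why the request used to fall through to
> NOT_CONTROLLER instead:
> {code:java}
> // Hypothetical sketch of the older handling described above.
> final class OldAlterPartitionFlowSketch {
>     enum Error { NONE, FENCED_LEADER_EPOCH, NOT_CONTROLLER }
>
>     static Error handleAlterPartition(int requestLeaderEpoch,
>                                       int contextLeaderEpoch,
>                                       boolean zkControllerEpochCheckPasses) {
>         // Old validation: only epochs *older* than the controller context were rejected.
>         if (requestLeaderEpoch < contextLeaderEpoch) {
>             return Error.FENCED_LEADER_EPOCH;
>         }
>         // Anything at least as large was accepted, so a stale controller went on
>         // to the ZooKeeper write, which its controller epoch check rejects.
>         if (!zkControllerEpochCheckPasses) {
>             return Error.NOT_CONTROLLER; // handled by AlterPartitionManager, which retries
>         }
>         return Error.NONE;
>     }
> }
> {code}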
> It is tempting to revert the logic, but the risk is in the idempotency check:
> [https://github.com/apache/kafka/pull/12032/files#diff-3e042c962e80577a4cc9bbcccf0950651c6b312097a86164af50003c00c50d37L2369].
> If the AlterPartition request happened to match the state inside the old
> controller, the controller would consider the update successful and return no
> error. But if its state was already stale at that point, then that might
> cause the leader to incorrectly assume that the state had been updated.
> One way to fix this problem without weakening the validation is to rely on
> the controller epoch in `AlterPartitionManager`. When we discover a new
> controller, we also discover its epoch, so we can pass that through. The
> `LeaderAndIsr` request already includes the controller epoch of the
> controller that sent it and we already propagate this through to
> `AlterPartition.submit`. Hence all we need to do is verify that the epoch of
> the controller we are currently targeting is at least as large as the one
> discovered through the `LeaderAndIsr`.
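> As a sketch of that check (hypothetical names again; the real
> `AlterPartitionManager` is Scala), the guard could look something like this:
> {code:java}
> // Hypothetical sketch of the proposed guard in AlterPartitionManager.
> final class ControllerEpochGuardSketch {
>     // Only send AlterPartition if the controller we currently know about is at
>     // least as new as the controller that sent the LeaderAndIsr which triggered
>     // this ISR change; otherwise wait for the channel manager to discover the
>     // newer controller.
>     static boolean shouldSendToController(int targetControllerEpoch,
>                                           int epochFromLeaderAndIsr) {
>         return targetControllerEpoch >= epochFromLeaderAndIsr;
>     }
>
>     public static void main(String[] args) {
>         // In the scenario above the channel manager still points at the old
>         // controller (lower epoch), so the request is held back rather than
>         // being fenced by the stale controller. Epoch values are illustrative.
>         System.out.println(shouldSendToController(1, 2)); // false -> do not send yet
>     }
> }
> {code}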
--
This message was sent by Atlassian Jira
(v8.20.10#820010)