Jason Gustafson created KAFKA-14154:
---------------------------------------
             Summary: Persistent URP after controller soft failure
                 Key: KAFKA-14154
                 URL: https://issues.apache.org/jira/browse/KAFKA-14154
             Project: Kafka
          Issue Type: Bug
            Reporter: Jason Gustafson
            Assignee: Jason Gustafson

We ran into a scenario where a partition leader was unable to expand the ISR after a soft controller failover. Here is what happened:

Initial state: leader=1, isr=[1,2], leader epoch=10. Broker 1 is acting as the current controller.

1. Broker 1 loses its session in Zookeeper.
2. Broker 2 becomes the new controller.
3. During initialization, controller 2 removes broker 1 from the ISR. So the state is updated: leader=2, isr=[2], leader epoch=11.
4. Broker 2 receives the `LeaderAndIsr` request from the new controller with leader epoch=11.
5. Broker 2 immediately tries to add replica 1 back to the ISR since it is still fetching and is caught up. However, the `BrokerToControllerChannelManager` is still pointed at controller 1, so that is where the `AlterPartition` request is sent.
6. Controller 1 does not yet realize that it is no longer the controller, so it processes the `AlterPartition` request. It sees a leader epoch of 11, which is higher than the epoch in its own context. Following the changes to `AlterPartition` validation in https://github.com/apache/kafka/pull/12032, the controller returns FENCED_LEADER_EPOCH.
7. After receiving FENCED_LEADER_EPOCH from the old controller, the leader is stuck: it assumes the error means another `LeaderAndIsr` request is on the way, but none will come.

Prior to https://github.com/apache/kafka/pull/12032, we handled this case a little differently. We only verified that the leader epoch in the request was at least as large as the current epoch in the controller context; anything higher was accepted. The controller would then have attempted to write the updated state to Zookeeper. This update would have failed because of the controller epoch check, but in that case we returned NOT_CONTROLLER, which is handled in `AlterPartitionManager`.

It is tempting to revert the logic, but the risk is in the idempotency check: https://github.com/apache/kafka/pull/12032/files#diff-3e042c962e80577a4cc9bbcccf0950651c6b312097a86164af50003c00c50d37L2369. If the `AlterPartition` request happened to match the state inside the old controller, the controller would consider the update successful and return no error. But if its state was already stale at that point, the leader could incorrectly assume that the state had been updated.

One way to fix this problem without weakening the validation is to rely on the controller epoch in `AlterPartitionManager`. When we discover a new controller, we also discover its epoch, so we can pass that through. The `LeaderAndIsr` request already includes the controller epoch of the controller that sent it, and we already propagate it through to `AlterPartitionManager.submit`. Hence all we need to do is verify that the epoch of the current controller target is at least as large as the one discovered through the `LeaderAndIsr` request.
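To make the validation change concrete, here is a simplified, hypothetical sketch of the two policies as described above (these are not the actual `KafkaController` methods, just an illustration of the comparison that changed):

{code:scala}
import org.apache.kafka.common.protocol.Errors

// Simplified, hypothetical sketch -- not the actual KafkaController code.

// Behavior before apache/kafka#12032, as described above: only a request
// leader epoch *smaller* than the controller context was fenced. A larger
// epoch was accepted and proceeded to the Zookeeper write, which failed the
// controller epoch check and surfaced NOT_CONTROLLER, an error that
// AlterPartitionManager handles.
def validateBefore(requestLeaderEpoch: Int, contextLeaderEpoch: Int): Errors =
  if (requestLeaderEpoch < contextLeaderEpoch) Errors.FENCED_LEADER_EPOCH
  else Errors.NONE

// Behavior after apache/kafka#12032, per step 6 above: a request leader
// epoch *higher* than the stale context is also fenced, so the zombie
// controller returns FENCED_LEADER_EPOCH and the leader gets stuck.
def validateAfter(requestLeaderEpoch: Int, contextLeaderEpoch: Int): Errors =
  if (requestLeaderEpoch != contextLeaderEpoch) Errors.FENCED_LEADER_EPOCH
  else Errors.NONE
{code}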
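And a minimal sketch of the proposed guard, with hypothetical names (`targetControllerEpoch` is the epoch discovered along with the current controller target; `requestControllerEpoch` is the epoch carried by the triggering `LeaderAndIsr` request and passed through `AlterPartitionManager.submit`):

{code:scala}
// Hypothetical sketch; the real change would live in AlterPartitionManager,
// threading through the epoch discovered by BrokerToControllerChannelManager.
def canSendToCurrentTarget(targetControllerEpoch: Int,
                           requestControllerEpoch: Int): Boolean = {
  // Only send the AlterPartition request if the discovered controller is at
  // least as new as the controller that sent the LeaderAndIsr. Otherwise we
  // are still pointed at the old controller and should wait for controller
  // rediscovery instead of sending and receiving a spurious fencing error.
  targetControllerEpoch >= requestControllerEpoch
}
{code}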