Jason Gustafson created KAFKA-14154:
---------------------------------------

             Summary: Persistent URP after controller soft failure
                 Key: KAFKA-14154
                 URL: https://issues.apache.org/jira/browse/KAFKA-14154
             Project: Kafka
          Issue Type: Bug
            Reporter: Jason Gustafson
            Assignee: Jason Gustafson


We ran into a scenario where a partition leader was unable to expand the ISR 
after a soft controller failover. Here is what happened:

Initial state: leader=1, isr=[1,2], leader epoch=10. Broker 1 is acting as the 
current controller.

1. Broker 1 loses its session in Zookeeper.  

2. Broker 2 becomes the new controller.

3. During initialization, controller 2 removes 1 from the ISR. So state is 
updated: leader=2, isr=[2], leader epoch=11.

4. Broker 2 receives `LeaderAndIsr` from the new controller with leader 
epoch=11.

5. Broker 2 immediately tries to add replica 1 back to the ISR since it is 
still fetching and is caught up. However, the 
`BrokerToControllerChannelManager` is still pointed at controller 1, so that is 
where the `AlterPartition` is sent.

6. Controller 1 does not yet realize that it is not the controller, so it 
processes the `AlterPartition` request. It sees the leader epoch of 11, which 
is higher than what it has in its own context. Following the changes to 
`AlterPartition` validation in https://github.com/apache/kafka/pull/12032/files, 
the controller returns FENCED_LEADER_EPOCH (see the sketch after this list).

7. After receiving FENCED_LEADER_EPOCH from the old controller, the leader is 
stuck: it assumes the error means that another `LeaderAndIsr` request will be 
sent, so it does not retry the `AlterPartition`.
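
The validation described in step 6 can be sketched as follows. This is a 
minimal, self-contained approximation, not the actual controller code; 
`PartitionState`, `validateLeaderEpoch`, and the error names are stand-ins 
used only for illustration:

object AlterPartitionFencingSketch {

  // Stand-in for the controller's cached view of a partition.
  final case class PartitionState(leader: Int, isr: List[Int], leaderEpoch: Int)

  sealed trait ApiError
  case object NoError extends ApiError
  case object FencedLeaderEpoch extends ApiError

  // Approximation of the post-change check: a request whose leader epoch does
  // not match the cached epoch is fenced, even when the request epoch is newer
  // (as it is when the cache belongs to a zombie controller).
  def validateLeaderEpoch(requestLeaderEpoch: Int, cached: PartitionState): ApiError =
    if (requestLeaderEpoch != cached.leaderEpoch) FencedLeaderEpoch
    else NoError

  def main(args: Array[String]): Unit = {
    // Zombie controller 1 still holds the pre-failover state (leader epoch 10),
    // while the leader's AlterPartition carries the new leader epoch 11.
    val zombieView = PartitionState(leader = 1, isr = List(1, 2), leaderEpoch = 10)
    println(validateLeaderEpoch(requestLeaderEpoch = 11, cached = zombieView)) // FencedLeaderEpoch
  }
}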

Prior to https://github.com/apache/kafka/pull/12032/files, the way we handled 
this case was a little different. We only verified that the leader epoch in 
the request was at least as large as the current epoch in the controller 
context; anything higher was accepted. The controller would have attempted to 
write the updated state to Zookeeper. That write would have failed because of 
the controller epoch check, but in that case we would have returned 
NOT_CONTROLLER, which is handled in `AlterPartitionManager`.
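
For contrast, here is a rough sketch of the pre-PR-12032 flow described above. 
The names (`tryAlterPartitionOld`, `zkWriteSucceeded`) are illustrative and do 
not come from the real `KafkaController` code:

object PreChangeAlterPartitionSketch {

  sealed trait ApiError
  case object NoError extends ApiError
  case object FencedLeaderEpoch extends ApiError
  case object NotController extends ApiError

  // Old rule: any request epoch at least as large as the cached epoch was
  // accepted, and the conditional Zookeeper write (guarded by the controller
  // epoch) decided whether the update actually went through.
  def tryAlterPartitionOld(requestLeaderEpoch: Int,
                           cachedLeaderEpoch: Int,
                           zkWriteSucceeded: Boolean): ApiError =
    if (requestLeaderEpoch < cachedLeaderEpoch) FencedLeaderEpoch
    else if (!zkWriteSucceeded) NotController // controller epoch check failed in ZK
    else NoError

  def main(args: Array[String]): Unit = {
    // Step 6 replayed under the old rules: epoch 11 >= 10 is accepted, the ZK
    // write fails on the zombie controller, and the leader gets NOT_CONTROLLER,
    // which AlterPartitionManager knows how to handle (look up the new controller).
    println(tryAlterPartitionOld(requestLeaderEpoch = 11, cachedLeaderEpoch = 10, zkWriteSucceeded = false))
  }
}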

It is tempting to revert the logic, but the risk is in the idempotency check: 
https://github.com/apache/kafka/pull/12032/files#diff-3e042c962e80577a4cc9bbcccf0950651c6b312097a86164af50003c00c50d37L2369. 
If the `AlterPartition` request happened to match the state inside the old 
controller, the controller would consider the update successful and return no 
error. But if its state was already stale at that point, then that might cause 
the leader to incorrectly assume that the state had been updated.
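
To make the risk concrete, here is a hedged sketch of that idempotency 
short-circuit; `LeaderAndIsrState` and `alreadyApplied` are hypothetical names 
used only for illustration:

object IdempotentUpdateRiskSketch {

  // Stand-in for the state carried by an AlterPartition request and for the
  // state cached in the controller context.
  final case class LeaderAndIsrState(leader: Int, isr: List[Int], leaderEpoch: Int)

  // The check in question: if the requested state already equals the cached
  // state, the controller reports success without writing to Zookeeper.
  def alreadyApplied(requested: LeaderAndIsrState, cached: LeaderAndIsrState): Boolean =
    requested == cached

  def main(args: Array[String]): Unit = {
    // Hypothetical case: the zombie controller's stale cache happens to match
    // the request, so it would report success and the leader would wrongly
    // assume the ISR change was committed, even though the real controller
    // never saw it.
    val requested = LeaderAndIsrState(leader = 2, isr = List(1, 2), leaderEpoch = 11)
    val staleCache = LeaderAndIsrState(leader = 2, isr = List(1, 2), leaderEpoch = 11)
    println(alreadyApplied(requested, staleCache)) // true
  }
}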

One way to fix this problem without weakening the validation is to rely on the 
controller epoch in `AlterPartitionManager`. When we discover a new controller, 
we also discover its epoch, so we can pass that through. The `LeaderAndIsr` 
request already includes the controller epoch of the controller that sent it 
and we already propagate this through to `AlterPartition.submit`. Hence all we 
need to do is verify that the epoch of the current controller target is at 
least as large as the one discovered through the `LeaderAndIsr`.
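
Here is a minimal sketch of that guard, assuming `AlterPartitionManager` can 
compare the controller epoch it learned when discovering its current target 
with the controller epoch carried by the triggering `LeaderAndIsr`. The names 
and the epoch values are illustrative:

object ControllerEpochGuardSketch {

  // Stand-in for the controller target tracked by BrokerToControllerChannelManager.
  final case class ControllerTarget(brokerId: Int, controllerEpoch: Int)

  // Proposed rule: only send AlterPartition once the discovered controller's
  // epoch is at least as large as the controller epoch carried by the
  // LeaderAndIsr request that triggered the ISR change.
  def safeToSend(target: ControllerTarget, leaderAndIsrControllerEpoch: Int): Boolean =
    target.controllerEpoch >= leaderAndIsrControllerEpoch

  def main(args: Array[String]): Unit = {
    // Illustrative controller epochs: the channel manager still points at the
    // old controller (epoch 5 here), while the LeaderAndIsr came from the new
    // controller (epoch 6), so the request should be held back.
    println(safeToSend(ControllerTarget(brokerId = 1, controllerEpoch = 5), leaderAndIsrControllerEpoch = 6)) // false
    println(safeToSend(ControllerTarget(brokerId = 2, controllerEpoch = 6), leaderAndIsrControllerEpoch = 6)) // true
  }
}

With a check along these lines, the `AlterPartition` request in the scenario 
above would simply be held until the channel manager discovers controller 2, 
rather than being sent to the zombie controller and fenced.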


