Alyssa Huang created KAFKA-20515:
------------------------------------

             Summary: ZK leader failover during ZK migration can block migration
                 Key: KAFKA-20515
                 URL: https://issues.apache.org/jira/browse/KAFKA-20515
             Project: Kafka
          Issue Type: Bug
    Affects Versions: 3.9.0
            Reporter: Alyssa Huang


A similar issue was fixed in https://issues.apache.org/jira/browse/KAFKA-16171, 
and the symptoms are the same:
{code:java}
org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode = 
BadVersion for /migration
...
java.lang.RuntimeException: Conditional update on KRaft Migration ZNode failed. 
Sent zkVersion = X. The failed write was: 
ZkMigrationLeadershipState{kraftControllerId=<>, kraftControllerEpoch=<>, 
kraftMetadataOffset=<>, kraftMetadataEpoch=<>, lastUpdatedTimeMs=-1, 
migrationZkVersion=X, controllerZkEpoch=-1, controllerZkVersion=-2}. This 
indicates that another KRaft controller is making writes to ZooKeeper.{code}

But the underlying root cause is different:

In KAFKA-16171, the trigger was the KRaft controller failing over during 
migration. In this issue, the trigger is the ZK leader failing over before 
sending an ACK back to the controller for a successful /migration node change:
 * The KRaft controller sends to ZK "set /migration, expected current 
dataVersion = N"
 * The ZK leader receives this request, writes new value to /migration and sets 
dataVersion to `N + 1`. ZK leader replicates this to follower nodes but shuts 
down before sending the success reply back to the controller.
 * The controller retries the same request after reconnecting to ZK, but it 
still expects dataVersion `N`, whereas the ensemble has already moved on to 
`N + 1`, so the conditional write fails with BadVersion.
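The race above can be sketched in a few lines. This is a hypothetical, 
self-contained model (not Kafka or ZooKeeper code): the ZNode class and its 
conditionalSet method stand in for ZK's conditional setData semantics, to show 
why the controller's retry hits BadVersion after an applied-but-unacknowledged 
write.

{code:java}
// Hypothetical sketch: models ZK's conditional setData to show why a retry
// of an already-applied write fails with a version mismatch.
public class MigrationZNodeSketch {
    static class ZNode {
        private String data;
        private int dataVersion;

        ZNode(String data, int dataVersion) {
            this.data = data;
            this.dataVersion = dataVersion;
        }

        // Mirrors conditional setData: succeeds only if the caller's
        // expected version matches the node's current dataVersion.
        synchronized int conditionalSet(String newData, int expectedVersion) {
            if (expectedVersion != dataVersion) {
                throw new IllegalStateException("BadVersion: expected "
                        + expectedVersion + " but /migration is at " + dataVersion);
            }
            data = newData;
            return ++dataVersion;
        }

        synchronized int version() { return dataVersion; }
    }

    public static void main(String[] args) {
        ZNode migration = new ZNode("state-0", 5); // dataVersion = N = 5

        // 1. Controller sends "set /migration, expected dataVersion = 5".
        //    The ZK leader applies it (version becomes 6) and replicates it
        //    to followers, but shuts down before the ACK reaches the controller.
        migration.conditionalSet("state-1", 5);

        // 2. The controller never saw the ACK, so after reconnecting it
        //    retries with expectedVersion = 5 -- but the ensemble is already
        //    at 6, so the retry fails with BadVersion.
        boolean badVersion = false;
        try {
            migration.conditionalSet("state-1", 5);
        } catch (IllegalStateException e) {
            badVersion = true;
        }
        System.out.println("retry failed with BadVersion: " + badVersion
                + ", dataVersion now " + migration.version());
    }
}
{code}

Note that from the controller's point of view the first write may or may not 
have been applied, which is why a bare retry of a conditional write is unsafe 
after a connection loss.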



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
