Alyssa Huang created KAFKA-20515:
------------------------------------
Summary: ZK leader failover during ZK migration can block migration
Key: KAFKA-20515
URL: https://issues.apache.org/jira/browse/KAFKA-20515
Project: Kafka
Issue Type: Bug
Affects Versions: 3.9.0
Reporter: Alyssa Huang
A similar issue was fixed in https://issues.apache.org/jira/browse/KAFKA-16171, and the symptoms are the same:
{code:java}
org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode =
BadVersion for /migration
...
java.lang.RuntimeException: Conditional update on KRaft Migration ZNode failed.
Sent zkVersion = X. The failed write was:
ZkMigrationLeadershipState{kraftControllerId=<>, kraftControllerEpoch=<>,
kraftMetadataOffset=<>, kraftMetadataEpoch=<>, lastUpdatedTimeMs=-1,
migrationZkVersion=X, controllerZkEpoch=-1, controllerZkVersion=-2}. This
indicates that another KRaft controller is making writes to ZooKeeper.{code}
But the underlying root cause is different. In KAFKA-16171 the trigger was a KRaft controller failover during migration; in this issue the trigger is the ZK leader failing over before it sends an ACK back to the controller for a successful /migration node change:
* The KRaft controller sends to ZK "set /migration, expected current
dataVersion = N".
* The ZK leader receives this request, writes the new value to /migration and
bumps dataVersion to `N + 1`. The ZK leader replicates this to follower nodes
but shuts down before sending the success reply back to the controller.
* The controller retries the same request after reconnecting to ZK, but it
still expects version `N` whereas ZK has already moved on to `N + 1`, so the
conditional update fails with BadVersion.
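The steps above can be sketched as a toy model. This is not Kafka or ZooKeeper code; the `ZNode` class and its `conditionalSet` method are hypothetical stand-ins that mimic the versioned-write semantics of ZooKeeper's `setData(path, data, expectedVersion)`:

```java
// Minimal sketch of the version race, assuming hypothetical ZNode/conditionalSet
// names that model ZooKeeper's conditional setData semantics.
final class ZNode {
    private String data;
    private int version;

    ZNode(String data, int version) {
        this.data = data;
        this.version = version;
    }

    // Succeeds only if the caller's expected version matches the current one,
    // analogous to ZooKeeper rejecting a stale setData with BadVersionException.
    synchronized boolean conditionalSet(String newData, int expectedVersion) {
        if (expectedVersion != version) {
            return false; // BadVersion
        }
        data = newData;
        version++;
        return true;
    }

    synchronized int version() {
        return version;
    }
}

public class MigrationRace {
    public static void main(String[] args) {
        // /migration currently at dataVersion N = 5.
        ZNode migration = new ZNode("state-0", 5);

        // 1. Controller sends "set /migration, expected dataVersion = 5".
        //    ZK applies the write and replicates it (version is now 6),
        //    but the leader shuts down before the ACK reaches the controller.
        boolean firstWrite = migration.conditionalSet("state-1", 5);
        System.out.println("first write applied: " + firstWrite
                + ", version now: " + migration.version());

        // 2. The controller never saw the ACK, so after reconnecting it
        //    retries with the stale expected version 5 and is rejected.
        boolean retry = migration.conditionalSet("state-1", 5);
        System.out.println("retry with stale version applied: " + retry);
    }
}
```

In this model the retry fails permanently because the controller keeps resending the stale expected version, which is why the migration blocks until the expected version is refreshed from ZK.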
--
This message was sent by Atlassian Jira
(v8.20.10#820010)