[ 
https://issues.apache.org/jira/browse/KAFKA-20022?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shubham Raj updated KAFKA-20022:
--------------------------------
    Description: 
Hi,

We migrated our Kafka cluster (v3.9) to *dual-write mode* three weeks ago as 
part of a planned one-month transition away from ZooKeeper. Recently, the 
controller sync between the ZooKeeper and KRaft metadata fell out of alignment, 
and the cluster is no longer in dual-write mode. Following the remediation 
proposed in KAFKA-16171, we restarted the controllers, but this did not restore 
sync, and per the logs below the ZooKeeper metadata is now lagging behind KRaft.

*Impact*
 * Dual-write mode is no longer active, increasing the risk of metadata divergence.
 * ZooKeeper metadata is stale compared to KRaft.
 * The migration timeline is at risk.

*Repeating logs on the leader controller:*
{code:java}
[2025-12-29 01:44:23,852] ERROR Encountered zk migration fault: Unhandled error 
in SyncKRaftMetadataEvent (org.apache.kafka.server.fault.LoggingFaultHandler)
java.lang.RuntimeException: Check op on KRaft Migration ZNode failed. Sent 
zkVersion = 5349155. This indicates that another KRaft controller is making 
writes to ZooKeeper.
        at 
kafka.zk.KafkaZkClient.handleUnwrappedMigrationResult$1(KafkaZkClient.scala:2050)
        at 
kafka.zk.KafkaZkClient.unwrapMigrationResponse$1(KafkaZkClient.scala:2076)
        at 
kafka.zk.KafkaZkClient.$anonfun$retryMigrationRequestsUntilConnected$2(KafkaZkClient.scala:2101)
        at 
scala.collection.StrictOptimizedIterableOps.map(StrictOptimizedIterableOps.scala:100)
        at 
scala.collection.StrictOptimizedIterableOps.map$(StrictOptimizedIterableOps.scala:87)
        at scala.collection.mutable.ArrayBuffer.map(ArrayBuffer.scala:42)
        at 
kafka.zk.KafkaZkClient.retryMigrationRequestsUntilConnected(KafkaZkClient.scala:2101)
        at 
kafka.zk.migration.ZkTopicMigrationClient.$anonfun$createTopic$1(ZkTopicMigrationClient.scala:137)
        at 
kafka.zk.migration.ZkTopicMigrationClient.createTopic(ZkTopicMigrationClient.scala:111)
        at 
org.apache.kafka.metadata.migration.KRaftMigrationZkWriter.lambda$null$3(KRaftMigrationZkWriter.java:233)
        at 
org.apache.kafka.metadata.migration.KRaftMigrationDriver.applyMigrationOperation(KRaftMigrationDriver.java:246)
        at 
org.apache.kafka.metadata.migration.KRaftMigrationDriver.applyMigrationOperation(KRaftMigrationDriver.java:240)
        at 
org.apache.kafka.metadata.migration.KRaftMigrationDriver.access$300(KRaftMigrationDriver.java:63)
        at 
org.apache.kafka.metadata.migration.KRaftMigrationDriver$SyncKRaftMetadataEvent.lambda$run$0(KRaftMigrationDriver.java:844)
        at 
org.apache.kafka.metadata.migration.KRaftMigrationDriver.lambda$countingOperationConsumer$6(KRaftMigrationDriver.java:970)
        at 
org.apache.kafka.metadata.migration.KRaftMigrationZkWriter.lambda$handleTopicsSnapshot$4(KRaftMigrationZkWriter.java:230)
        at java.base/java.lang.Iterable.forEach(Iterable.java:75)
        at 
org.apache.kafka.metadata.migration.KRaftMigrationZkWriter.handleTopicsSnapshot(KRaftMigrationZkWriter.java:228)
        at 
org.apache.kafka.metadata.migration.KRaftMigrationZkWriter.handleSnapshot(KRaftMigrationZkWriter.java:96)
        at 
org.apache.kafka.metadata.migration.KRaftMigrationDriver$SyncKRaftMetadataEvent.run(KRaftMigrationDriver.java:843)
        at 
org.apache.kafka.queue.KafkaEventQueue$EventContext.run(KafkaEventQueue.java:132)
        at 
org.apache.kafka.queue.KafkaEventQueue$EventHandler.handleEvents(KafkaEventQueue.java:215)
        at 
org.apache.kafka.queue.KafkaEventQueue$EventHandler.run(KafkaEventQueue.java:186)
        at java.base/java.lang.Thread.run(Thread.java:840)
[2025-12-29 01:44:23,852] TRACE [KRaftMigrationDriver id=10002] Received 
metadata delta, but the controller is not in dual-write mode. Ignoring this 
metadata update. (org.apache.kafka.metadata.migration.KRaftMigrationDriver)
[2025-12-29 01:44:23,852] TRACE [KRaftMigrationDriver id=10002] Received 
metadata delta, but the controller is not in dual-write mode. Ignoring this 
metadata update. (org.apache.kafka.metadata.migration.KRaftMigrationDriver)
[2025-12-29 01:44:23,852] TRACE [KRaftMigrationDriver id=10002] Received 
metadata delta, but the controller is not in dual-write mode. Ignoring this 
metadata update. (org.apache.kafka.metadata.migration.KRaftMigrationDriver)
[2025-12-29 01:44:23,852] TRACE [KRaftMigrationDriver id=10002] Received 
metadata delta, but the controller is not in dual-write mode. Ignoring this 
metadata update. (org.apache.kafka.metadata.migration.KRaftMigrationDriver)
[2025-12-29 01:44:23,852] TRACE [KRaftMigrationDriver id=10002] Received 
metadata delta, but the controller is not in dual-write mode. Ignoring this 
metadata update. (org.apache.kafka.metadata.migration.KRaftMigrationDriver)
 {code}
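
For context: in dual-write mode, the migration driver guards each batch of 
ZooKeeper writes with a check op on the {{/migration}} ZNode, sending the 
zkVersion it last read; if anything else has modified that ZNode in the 
meantime, ZooKeeper rejects the whole batch, which is exactly the condition 
the RuntimeException above reports. A minimal kazoo sketch of the same 
version-guard pattern (the ensemble address is a placeholder; the 
{{/kafka/qa/migration}} path and versions come from the outputs below; the 
real driver uses a multi-op check rather than a conditional set):
{code:python}
from kazoo.client import KazooClient
from kazoo.exceptions import BadVersionError

zk = KazooClient(hosts="zookeeper:2181")  # placeholder ensemble address
zk.start()

data, stat = zk.get("/kafka/qa/migration")

# Conditional write guarded by an expected version. If any other writer
# bumped the ZNode after that version was read, ZooKeeper rejects the op,
# the same condition reported as "Check op on KRaft Migration ZNode failed".
stale_version = stat.version - 1  # e.g. 5349155 while the znode is at 5349156
try:
    zk.set("/kafka/qa/migration", data, version=stale_version)
except BadVersionError:
    print("rejected: cached zkVersion is stale")

zk.stop()
{code}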
 

 

*Cluster status*
{code:java}
ClusterId:              QBC8K1kNS02Sl9930_QDAA
LeaderId:               10002
LeaderEpoch:            253
HighWatermark:          12515984
MaxFollowerLag:         0
MaxFollowerLagTimeMs:   111
CurrentVoters:          [10001,10002,10003]
CurrentObservers:       
[5,4,1104,6,3,1,1103,1112,1108,1107,1109,1101,1102,2,1106,1110,1111,1105]
 {code}
 

*Migration data in ZooKeeper*
{code:java}
In [4]: zk_client.get('/kafka/qa/migration')
Out[4]: 
(b'{"version":0,"kraft_metadata_offset":10840503,"kraft_controller_id":10002,"kraft_metadata_epoch":170,"kraft_controller_epoch":253}',
 ZnodeStat(czxid=7129652820618, mzxid=7176892421062, ctime=1765176701970, 
mtime=1766986727917, version=5349156, cversion=0, aversion=0, ephemeralOwner=0, 
dataLength=130, numChildren=0, pzxid=7129652820618))
 {code}
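
Putting the outputs together: the ZNode {{version}} (5349156) is exactly one 
ahead of the zkVersion the driver sent (5349155), so the leader's cached 
version is stale by a single write. The {{kraft_metadata_offset}} recorded in 
ZooKeeper (10840503) is also well behind the quorum HighWatermark (12515984), 
which quantifies how far the ZooKeeper copy of the metadata lags. A small 
sketch of both checks, assuming kazoo and the values above:
{code:python}
import json

from kazoo.client import KazooClient

zk = KazooClient(hosts="zookeeper:2181")  # placeholder ensemble address
zk.start()

data, stat = zk.get("/kafka/qa/migration")
state = json.loads(data)

# zkVersion the driver sent (from the ERROR log) vs. the actual version:
# 5349156 == 5349155 + 1, i.e. stale by exactly one write.
sent_zk_version = 5349155
print(f"znode version {stat.version}, driver sent {sent_zk_version}")

# Offset last synced to ZooKeeper vs. the quorum HighWatermark from the
# cluster status above: 12515984 - 10840503 = 1675481 offsets behind.
high_watermark = 12515984
print(f"ZK metadata lag: {high_watermark - state['kraft_metadata_offset']} offsets")

zk.stop()
{code}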


> Kafka Dual Write Mode Sync Failure
> ----------------------------------
>
>                 Key: KAFKA-20022
>                 URL: https://issues.apache.org/jira/browse/KAFKA-20022
>             Project: Kafka
>          Issue Type: Bug
>          Components: controller
>    Affects Versions: 3.9.0
>            Reporter: Shubham Raj
>            Assignee: David Arthur
>            Priority: Blocker
>


