[ 
https://issues.apache.org/jira/browse/KAFKA-16171?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Arthur resolved KAFKA-16171.
----------------------------------
    Resolution: Fixed

> Controller failover during ZK migration can prevent metadata updates to ZK 
> brokers
> ----------------------------------------------------------------------------------
>
>                 Key: KAFKA-16171
>                 URL: https://issues.apache.org/jira/browse/KAFKA-16171
>             Project: Kafka
>          Issue Type: Bug
>          Components: controller, kraft, migration
>    Affects Versions: 3.6.0, 3.7.0, 3.6.1
>            Reporter: David Arthur
>            Assignee: David Arthur
>            Priority: Blocker
>             Fix For: 3.6.2, 3.7.0
>
>
> h2. Description
> During the ZK migration, once KRaft becomes the active controller the cluster 
> enters a state called hybrid mode, in which ZK brokers and KRaft brokers 
> coexist. The KRaft controller updates the ZK brokers using the deprecated 
> controller RPCs (LeaderAndIsr, UpdateMetadata, etc.).
>  
> A race condition exists where the KRaft controller can get stuck in a retry 
> loop while initializing itself after a failover, which prevents it from 
> sending these RPCs to the ZK brokers.
> h2. Impact
> Since the KRaft controller cannot send any RPCs to the ZK brokers, the ZK 
> brokers will not receive any metadata updates. The ZK brokers can still send 
> requests to the controller (such as AlterPartitions), but the metadata 
> updates that result from those requests will never reach them. From the ZK 
> brokers' perspective, the controller is effectively unavailable.
> h2. Detection and Mitigation
> This bug can be seen by observing failed ZK writes from a recently elected 
> controller.
> The tell-tale error message is:
> {code:java}
> Check op on KRaft Migration ZNode failed. Expected zkVersion = 507823. This 
> indicates that another KRaft controller is making writes to ZooKeeper. {code}
> with a stacktrace like:
> {noformat}
> java.lang.RuntimeException: Check op on KRaft Migration ZNode failed. 
> Expected zkVersion = 507823. This indicates that another KRaft controller is 
> making writes to ZooKeeper.
>       at 
> kafka.zk.KafkaZkClient.handleUnwrappedMigrationResult$1(KafkaZkClient.scala:2613)
>       at 
> kafka.zk.KafkaZkClient.unwrapMigrationResponse$1(KafkaZkClient.scala:2639)
>       at 
> kafka.zk.KafkaZkClient.$anonfun$retryMigrationRequestsUntilConnected$2(KafkaZkClient.scala:2664)
>       at 
> scala.collection.StrictOptimizedIterableOps.map(StrictOptimizedIterableOps.scala:100)
>       at 
> scala.collection.StrictOptimizedIterableOps.map$(StrictOptimizedIterableOps.scala:87)
>       at scala.collection.mutable.ArrayBuffer.map(ArrayBuffer.scala:43)
>       at 
> kafka.zk.KafkaZkClient.retryMigrationRequestsUntilConnected(KafkaZkClient.scala:2664)
>       at 
> kafka.zk.migration.ZkTopicMigrationClient.$anonfun$createTopic$1(ZkTopicMigrationClient.scala:158)
>       at 
> kafka.zk.migration.ZkTopicMigrationClient.createTopic(ZkTopicMigrationClient.scala:141)
>       at 
> org.apache.kafka.metadata.migration.KRaftMigrationZkWriter.lambda$handleTopicsSnapshot$27(KRaftMigrationZkWriter.java:441)
>       at 
> org.apache.kafka.metadata.migration.KRaftMigrationDriver.applyMigrationOperation(KRaftMigrationDriver.java:262)
>       at 
> org.apache.kafka.metadata.migration.KRaftMigrationDriver.access$300(KRaftMigrationDriver.java:64)
>       at 
> org.apache.kafka.metadata.migration.KRaftMigrationDriver$SyncKRaftMetadataEvent.lambda$run$0(KRaftMigrationDriver.java:791)
>       at 
> org.apache.kafka.metadata.migration.KRaftMigrationDriver.lambda$countingOperationConsumer$6(KRaftMigrationDriver.java:880)
>       at 
> org.apache.kafka.metadata.migration.KRaftMigrationZkWriter.lambda$handleTopicsSnapshot$28(KRaftMigrationZkWriter.java:438)
>       at java.base/java.lang.Iterable.forEach(Iterable.java:75)
>       at 
> org.apache.kafka.metadata.migration.KRaftMigrationZkWriter.handleTopicsSnapshot(KRaftMigrationZkWriter.java:436)
>       at 
> org.apache.kafka.metadata.migration.KRaftMigrationZkWriter.handleSnapshot(KRaftMigrationZkWriter.java:115)
>       at 
> org.apache.kafka.metadata.migration.KRaftMigrationDriver$SyncKRaftMetadataEvent.run(KRaftMigrationDriver.java:790)
>       at 
> org.apache.kafka.queue.KafkaEventQueue$EventContext.run(KafkaEventQueue.java:127)
>       at 
> org.apache.kafka.queue.KafkaEventQueue$EventHandler.handleEvents(KafkaEventQueue.java:210)
>       at 
> org.apache.kafka.queue.KafkaEventQueue$EventHandler.run(KafkaEventQueue.java:181)
>       at java.base/java.lang.Thread.run(Thread.java:1583)
>       at 
> org.apache.kafka.common.utils.KafkaThread.run(KafkaThread.java:66){noformat}
> To mitigate this problem, a new KRaft controller should be elected. This can 
> be done by restarting the problematic active controller. To verify that the 
> new controller does not encounter the race condition, look for a log line 
> like:
> {code:java}
> [KRaftMigrationDriver id=9991] 9991 transitioning from SYNC_KRAFT_TO_ZK to 
> KRAFT_CONTROLLER_TO_BROKER_COMM state {code}
>  
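> For convenience, a small helper along these lines (purely illustrative; the 
> class name, default log path, and log layout are assumptions, not anything 
> shipped with Kafka) could scan a controller log for both the failure 
> signature and the healthy transition message:
> {code:java}
> // Hypothetical helper (not part of Kafka): flag the KAFKA-16171 failure
> // signature and the healthy transition past SYNC_KRAFT_TO_ZK.
> import java.io.IOException;
> import java.nio.file.Files;
> import java.nio.file.Path;
> import java.util.stream.Stream;
> 
> public class MigrationLogCheck {
>     public static void main(String[] args) throws IOException {
>         // Assumed default path; pass the real controller log as the first argument.
>         Path log = Path.of(args.length > 0 ? args[0] : "controller.log");
>         try (Stream<String> lines = Files.lines(log)) {
>             lines.forEach(line -> {
>                 if (line.contains("Check op on KRaft Migration ZNode failed")) {
>                     System.out.println("Possibly hit KAFKA-16171: " + line);
>                 } else if (line.contains("transitioning from SYNC_KRAFT_TO_ZK to KRAFT_CONTROLLER_TO_BROKER_COMM")) {
>                     System.out.println("Healthy transition: " + line);
>                 }
>             });
>         }
>     }
> }
> {code}
> 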
> h2. Details
> Controller A loses leadership via a Raft event (e.g., a timeout in the Raft 
> layer). A KRaftLeaderEvent is added to the KRaftMigrationDriver's event queue 
> behind any pending MetadataChangeEvents.
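> 
> As a deliberately simplified sketch (a plain FIFO queue, not Kafka's actual 
> KafkaEventQueue), the ordering problem looks like this: the leadership-change 
> event is only handled after every earlier metadata event, so Controller A 
> keeps attempting ZK writes after it has already lost Raft leadership.
> {code:java}
> // Simplified illustration only (not Kafka's KafkaEventQueue): events run in
> // strict FIFO order, so the "you lost leadership" event waits behind the
> // metadata writes that were queued before it.
> import java.util.ArrayDeque;
> import java.util.Queue;
> 
> public class EventOrderSketch {
>     interface Event { void run(); }
> 
>     public static void main(String[] args) {
>         Queue<Event> queue = new ArrayDeque<>();
>         // Dual-writes queued before leadership was lost.
>         queue.add(() -> System.out.println("MetadataChangeEvent: multi-op write to ZK"));
>         queue.add(() -> System.out.println("MetadataChangeEvent: multi-op write to ZK"));
>         // Raft notifies the driver that it is no longer the leader.
>         queue.add(() -> System.out.println("KRaftLeaderEvent: Controller A steps down"));
> 
>         // The leader event runs only after the earlier writes have been attempted.
>         while (!queue.isEmpty()) {
>             queue.poll().run();
>         }
>     }
> }
> {code}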
>  
> Controller B is elected, and a KRaftLeaderEvent is added to its 
> KRaftMigrationDriver's queue. Since this controller was previously inactive, 
> its queue is empty and it processes the event immediately. This event simply 
> loads the migration state from ZK (/migration) to check whether the migration 
> has been completed; that information determines the downstream transitions in 
> the state machine. Because the migration is done, Controller B passes through 
> WAIT_FOR_ACTIVE_CONTROLLER and transitions to BECOME_CONTROLLER. While 
> handling the BecomeZkControllerEvent, the controller forcibly takes ZK 
> controller leadership by writing its ID into /controller and its epoch into 
> /controller_epoch.
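> 
> Very roughly, claiming ZK controllership looks like the sketch below, written 
> against the plain ZooKeeper client (an illustration of the idea, not the 
> actual KafkaZkClient code; the connection string, payloads, and the 
> assumption that /controller already exists are made up for the example):
> {code:java}
> // Illustration only: overwrite /controller with this controller's ID and bump
> // /controller_epoch in a single ZooKeeper transaction.
> import java.nio.charset.StandardCharsets;
> import java.util.List;
> import org.apache.zookeeper.CreateMode;
> import org.apache.zookeeper.Op;
> import org.apache.zookeeper.ZooDefs;
> import org.apache.zookeeper.ZooKeeper;
> 
> public class ClaimControllershipSketch {
>     public static void main(String[] args) throws Exception {
>         ZooKeeper zk = new ZooKeeper("localhost:2181", 30_000, event -> { });
> 
>         byte[] controllerInfo = "{\"brokerid\":9991}".getBytes(StandardCharsets.UTF_8);
>         byte[] newEpoch = "508".getBytes(StandardCharsets.UTF_8);
> 
>         // Delete the previous owner's znode (assumed to exist here), register
>         // ourselves ephemerally, and bump the epoch atomically. Version -1 means
>         // "any version" for the delete and setData ops.
>         zk.multi(List.of(
>             Op.delete("/controller", -1),
>             Op.create("/controller", controllerInfo,
>                 ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL),
>             Op.setData("/controller_epoch", newEpoch, -1)));
> 
>         zk.close();
>     }
> }
> {code}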
>  
> The change to /controller_epoch causes all of the pending writes on 
> Controller A to fail, since those writes include a check op on 
> /controller_epoch as part of their multi-op ZK transactions.
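> 
> The fencing pattern can be sketched like this (again an illustration against 
> the plain ZooKeeper client, not the actual KafkaZkClient code; the topic path 
> and payload are made up): the expected zkVersion of /controller_epoch is 
> checked in the same multi-op as the metadata write, so bumping that znode 
> makes the old controller's in-flight writes fail atomically.
> {code:java}
> // Illustration of epoch fencing via a ZooKeeper multi-op: the write only
> // succeeds if /controller_epoch still has the zkVersion this controller
> // observed when it became leader.
> import java.nio.charset.StandardCharsets;
> import java.util.List;
> import org.apache.zookeeper.KeeperException;
> import org.apache.zookeeper.Op;
> import org.apache.zookeeper.ZooKeeper;
> import org.apache.zookeeper.data.Stat;
> 
> public class FencedWriteSketch {
>     public static void main(String[] args) throws Exception {
>         ZooKeeper zk = new ZooKeeper("localhost:2181", 30_000, event -> { });
> 
>         // Remember the zkVersion of /controller_epoch from election time.
>         Stat epochStat = new Stat();
>         zk.getData("/controller_epoch", false, epochStat);
>         int expectedEpochZkVersion = epochStat.getVersion();
> 
>         byte[] topicPayload = "{\"partitions\":{}}".getBytes(StandardCharsets.UTF_8);
>         try {
>             zk.multi(List.of(
>                 Op.check("/controller_epoch", expectedEpochZkVersion), // fencing check
>                 Op.setData("/brokers/topics/foo", topicPayload, -1))); // metadata write
>         } catch (KeeperException e) {
>             // A newer controller bumped /controller_epoch: this controller is fenced.
>             System.out.println("Fenced by a newer controller epoch: " + e);
>         }
>         zk.close();
>     }
> }
> {code}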
>  
> However, there is a race between Controller B loading the state from 
> /migration and updating /controller_epoch. In that window, Controller A can 
> still successfully write to ZK with its older epoch. Each such write 
> increments the znode version of /migration, which causes Controller B to get 
> stuck.
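> 
> The race can be pictured as below (illustration only; a single ZooKeeper 
> handle plays both controllers and the payloads are made up): Controller B 
> caches the zkVersion of /migration, Controller A completes one more write 
> that bumps it, and every subsequent check op from B fails with the error 
> shown above.
> {code:java}
> // Illustration of the /migration zkVersion race (not Kafka's real code).
> import java.nio.charset.StandardCharsets;
> import java.util.List;
> import org.apache.zookeeper.KeeperException;
> import org.apache.zookeeper.Op;
> import org.apache.zookeeper.ZooKeeper;
> import org.apache.zookeeper.data.Stat;
> 
> public class MigrationVersionRaceSketch {
>     public static void main(String[] args) throws Exception {
>         ZooKeeper zk = new ZooKeeper("localhost:2181", 30_000, event -> { });
>         byte[] state = "{\"kraft_controller_id\":9991}".getBytes(StandardCharsets.UTF_8);
> 
>         // Controller B loads the migration state and caches the znode version.
>         Stat stat = new Stat();
>         zk.getData("/migration", false, stat);
>         int versionSeenByB = stat.getVersion();
> 
>         // Controller A, not yet fenced, completes one more dual-write that also
>         // updates /migration, bumping its version past what B cached.
>         zk.setData("/migration", state, -1);
> 
>         // Every write B now attempts carries a check op against its stale
>         // version, so the whole multi-op fails and the driver retries forever.
>         try {
>             zk.multi(List.of(
>                 Op.check("/migration", versionSeenByB),
>                 Op.setData("/migration", state, -1)));
>         } catch (KeeperException e) {
>             System.out.println("Controller B is stuck retrying: " + e);
>         }
>         zk.close();
>     }
> }
> {code}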
>  
> It is safe for the old controller to be making these writes, since we only 
> dual-write committed state from KRaft (i.e., "write-behind"), but this race 
> leaves the new controller with a stale version of /migration.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
