[ https://issues.apache.org/jira/browse/KAFKA-16171?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
David Arthur updated KAFKA-16171:
---------------------------------
    Fix Version/s: 3.7.0

Controller failover during ZK migration can prevent metadata updates to ZK brokers
-----------------------------------------------------------------------------------

                Key: KAFKA-16171
                URL: https://issues.apache.org/jira/browse/KAFKA-16171
            Project: Kafka
         Issue Type: Bug
         Components: controller, kraft, migration
   Affects Versions: 3.6.0, 3.7.0, 3.6.1
           Reporter: David Arthur
           Assignee: David Arthur
           Priority: Blocker
            Fix For: 3.7.0

h2. Description

During the ZK migration, after KRaft becomes the active controller we enter a state called hybrid mode, in which the cluster contains a mixture of ZK and KRaft brokers. The KRaft controller updates the ZK brokers using the deprecated controller RPCs (LeaderAndIsr, UpdateMetadata, etc.).

A race condition exists where the KRaft controller can get stuck in a retry loop while initializing itself after a failover, which prevents it from sending these RPCs to the ZK brokers.

h2. Impact

Since the KRaft controller cannot send any RPCs to the ZK brokers, the ZK brokers will not receive any metadata updates. The ZK brokers are still able to send requests to the controller (such as AlterPartitions), but the metadata updates that result from those requests will never be seen. From the ZK brokers' perspective, the controller is effectively unavailable.

h2. Detection and Mitigation

This bug can be detected by observing failed ZK writes from a recently elected controller.

The tell-tale error message is:
{code:java}
Check op on KRaft Migration ZNode failed. Expected zkVersion = 507823. This indicates that another KRaft controller is making writes to ZooKeeper.
{code}
with a stacktrace like:
{noformat}
java.lang.RuntimeException: Check op on KRaft Migration ZNode failed. Expected zkVersion = 507823. This indicates that another KRaft controller is making writes to ZooKeeper.
    at kafka.zk.KafkaZkClient.handleUnwrappedMigrationResult$1(KafkaZkClient.scala:2613)
    at kafka.zk.KafkaZkClient.unwrapMigrationResponse$1(KafkaZkClient.scala:2639)
    at kafka.zk.KafkaZkClient.$anonfun$retryMigrationRequestsUntilConnected$2(KafkaZkClient.scala:2664)
    at scala.collection.StrictOptimizedIterableOps.map(StrictOptimizedIterableOps.scala:100)
    at scala.collection.StrictOptimizedIterableOps.map$(StrictOptimizedIterableOps.scala:87)
    at scala.collection.mutable.ArrayBuffer.map(ArrayBuffer.scala:43)
    at kafka.zk.KafkaZkClient.retryMigrationRequestsUntilConnected(KafkaZkClient.scala:2664)
    at kafka.zk.migration.ZkTopicMigrationClient.$anonfun$createTopic$1(ZkTopicMigrationClient.scala:158)
    at kafka.zk.migration.ZkTopicMigrationClient.createTopic(ZkTopicMigrationClient.scala:141)
    at org.apache.kafka.metadata.migration.KRaftMigrationZkWriter.lambda$handleTopicsSnapshot$27(KRaftMigrationZkWriter.java:441)
    at org.apache.kafka.metadata.migration.KRaftMigrationDriver.applyMigrationOperation(KRaftMigrationDriver.java:262)
    at org.apache.kafka.metadata.migration.KRaftMigrationDriver.access$300(KRaftMigrationDriver.java:64)
    at org.apache.kafka.metadata.migration.KRaftMigrationDriver$SyncKRaftMetadataEvent.lambda$run$0(KRaftMigrationDriver.java:791)
    at org.apache.kafka.metadata.migration.KRaftMigrationDriver.lambda$countingOperationConsumer$6(KRaftMigrationDriver.java:880)
    at org.apache.kafka.metadata.migration.KRaftMigrationZkWriter.lambda$handleTopicsSnapshot$28(KRaftMigrationZkWriter.java:438)
    at java.base/java.lang.Iterable.forEach(Iterable.java:75)
    at org.apache.kafka.metadata.migration.KRaftMigrationZkWriter.handleTopicsSnapshot(KRaftMigrationZkWriter.java:436)
    at org.apache.kafka.metadata.migration.KRaftMigrationZkWriter.handleSnapshot(KRaftMigrationZkWriter.java:115)
    at org.apache.kafka.metadata.migration.KRaftMigrationDriver$SyncKRaftMetadataEvent.run(KRaftMigrationDriver.java:790)
    at org.apache.kafka.queue.KafkaEventQueue$EventContext.run(KafkaEventQueue.java:127)
    at org.apache.kafka.queue.KafkaEventQueue$EventHandler.handleEvents(KafkaEventQueue.java:210)
    at org.apache.kafka.queue.KafkaEventQueue$EventHandler.run(KafkaEventQueue.java:181)
    at java.base/java.lang.Thread.run(Thread.java:1583)
    at org.apache.kafka.common.utils.KafkaThread.run(KafkaThread.java:66)
{noformat}
To mitigate this problem, a new KRaft controller should be elected. This can be done by restarting the problematic active controller. To verify that the new controller does not encounter the race condition, look for:
{code:java}
[KRaftMigrationDriver id=9991] 9991 transitioning from SYNC_KRAFT_TO_ZK to KRAFT_CONTROLLER_TO_BROKER_COMM state
{code}

h2. Details

Controller A loses leadership via a Raft event (e.g., from a timeout in the Raft layer). A KRaftLeaderEvent is added to the KRaftMigrationDriver event queue behind any pending MetadataChangeEvents.

Controller B is elected and a KRaftLeaderEvent is added to KRaftMigrationDriver's queue. Since this controller is inactive, it processes the event immediately. This event simply loads the migration state from ZK (/migration) to check whether the migration has been completed. This information is used to determine the downstream transitions in the state machine. Controller B passes through WAIT_FOR_ACTIVE_CONTROLLER and transitions to BECOME_CONTROLLER since the migration is done.
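As an illustration of what loading the migration state involves at the ZooKeeper level, here is a minimal sketch using the plain ZooKeeper client API rather than Kafka's actual ZkMigrationClient; the class name, connection string, and timeout are illustrative assumptions. The point is that the controller caches the zkVersion of /migration at this moment, and that cached version is what its later conditional writes are checked against.
{code:java}
import java.nio.charset.StandardCharsets;

import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

// Hypothetical illustration, not Kafka code: read /migration and remember its zkVersion.
public class MigrationStateReadSketch {
    public static void main(String[] args) throws Exception {
        // Placeholder connection string and session timeout.
        ZooKeeper zk = new ZooKeeper("localhost:2181", 30_000, event -> { });
        try {
            Stat stat = new Stat();
            byte[] data = zk.getData("/migration", false, stat);
            // stat.getVersion() is the zkVersion the new controller caches when it loads
            // the migration state; if another controller bumps /migration after this read,
            // the cached value is stale and later check ops will fail.
            System.out.printf("/migration zkVersion=%d payload=%s%n",
                    stat.getVersion(), new String(data, StandardCharsets.UTF_8));
        } finally {
            zk.close();
        }
    }
}
{code}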
While handling the BecomeZkControllerEvent, the controller forcibly takes ZK controller leadership by writing its ID into /controller and its epoch into /controller_epoch.

The change to /controller_epoch causes all of the pending writes on Controller A to fail, since those writes include a check op on /controller_epoch as part of their multi-op writes to ZK.

However, there is a race between Controller B loading the state in /migration and updating /controller_epoch. During this window it is possible for Controller A to successfully write to ZK with its older epoch. This increases the znode version of /migration past the value Controller B just loaded, which causes Controller B to get stuck: its subsequent check ops on /migration fail against the now-stale expected zkVersion.

It is safe for the old controller to be making these writes, since we only dual-write committed state from KRaft (i.e., "write-behind"), but this race leaves the new controller with a stale version of /migration.
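To make the fencing mechanism above concrete, below is a minimal sketch of the check-then-write pattern using the plain ZooKeeper client API. It is not Kafka's actual KafkaZkClient/ZkMigrationClient code; the method name, parameters, and the exact set of ops are illustrative assumptions based on the description in this ticket.
{code:java}
import java.util.Arrays;
import java.util.List;

import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.Op;
import org.apache.zookeeper.OpResult;
import org.apache.zookeeper.ZooKeeper;

// Hypothetical illustration of an epoch- and version-fenced dual-write.
public class FencedMigrationWriteSketch {

    static List<OpResult> fencedWrite(ZooKeeper zk,
                                      int expectedControllerEpochZkVersion,
                                      int expectedMigrationZkVersion,
                                      byte[] migrationState,
                                      String metadataPath,
                                      byte[] metadataPayload)
            throws KeeperException, InterruptedException {
        return zk.multi(Arrays.asList(
                // Epoch fencing: fails once a newer controller has rewritten /controller_epoch.
                Op.check("/controller_epoch", expectedControllerEpochZkVersion),
                // Conditional update of /migration: succeeds only if the cached zkVersion is
                // still current, and bumps the version on success; this is how a successful
                // write from the old controller leaves the new controller holding a stale version.
                Op.setData("/migration", migrationState, expectedMigrationZkVersion),
                // The metadata znode being dual-written from KRaft (version -1 matches any version).
                Op.setData(metadataPath, metadataPayload, -1)));
    }
}
{code}
Because the multi is atomic, a stale expected version on either znode causes ZooKeeper to reject the entire batch (KeeperException.BadVersionException for a version mismatch), which the migration client then surfaces as the "Check op on KRaft Migration ZNode failed" error shown above.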