[ https://issues.apache.org/jira/browse/KAFKA-17146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17870579#comment-17870579 ]
Paolo Patierno commented on KAFKA-17146:
----------------------------------------

[~saimon46] it's not clear to me, sorry for the stupid question, when you did the rollback and then tried to run the migration again. From the KRaft migration documentation you can see that it says: "After the migration has been finalized, it is not possible to revert back to ZooKeeper mode." So it means that once you have completed the migration, with the controllers (and of course the brokers) no longer connected to ZooKeeper, you cannot roll back to ZooKeeper. I am saying this because, as part of our automation of the KRaft migration in the Strimzi project (running Kafka on Kubernetes), I rolled back the migration at the point where the controllers are still connected to ZooKeeper (which is your last chance), going back to a fully ZooKeeper-based cluster by deleting only the /controller znode and not /migration. Then I restarted the migration and it worked like a charm. I did this several times to be sure.
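For reference, that rollback step boils down to something like the following. This is only a minimal sketch: it assumes a plaintext ZooKeeper client connection, and the ensemble address zk1:2181 is a placeholder, so adjust it (and any TLS/SASL client settings) for your environment.

{code:java}
# Placeholder address: replace zk1:2181 with one of your ZooKeeper ensemble members.
# Delete only the /controller znode (a leaf, ephemeral node) so that one of the
# ZooKeeper-mode brokers can be re-elected as controller; /migration is left untouched.
bin/zookeeper-shell.sh zk1:2181 delete /controller

# Optional sanity check: the migration state znode should still be there.
bin/zookeeper-shell.sh zk1:2181 get /migration
{code}

Deleting /controller simply forces a new controller election among the ZooKeeper-mode brokers, while /migration, which holds the migration state, is deliberately left alone, matching the procedure described above.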
> ZK to KRAFT migration stuck in pre-migration mode
> -------------------------------------------------
>
>                 Key: KAFKA-17146
>                 URL: https://issues.apache.org/jira/browse/KAFKA-17146
>             Project: Kafka
>          Issue Type: Bug
>          Components: controller, kraft, migration
>    Affects Versions: 3.7.0, 3.7.1
>         Environment: Isolated virtual machines: 3 VMs with Kafka brokers + 3 with ZooKeeper/KRaft
>            Reporter: Simone Brundu
>            Priority: Blocker
>              Labels: kraft, migration, zookeeper
>
> I'm performing a migration from ZooKeeper to KRaft on a Kafka 3.7.1 cluster (EDIT: the same issue happens with version 3.7.0).
> I'm using this configuration to enable SSL everywhere, with SCRAM authentication for the brokers and PLAIN authentication for the controllers:
> {code:java}
> listener.security.protocol.map=EXTERNAL_SASL:SASL_SSL,CONTROLLER:SASL_SSL
> inter.broker.listener.name=EXTERNAL_SASL
> sasl.enabled.mechanisms=SCRAM-SHA-512,PLAIN
> sasl.mechanism=SCRAM-SHA-512
> sasl.mechanism.controller.protocol=PLAIN
> sasl.mechanism.inter.broker.protocol=SCRAM-SHA-512 {code}
> The cluster initially has 3 brokers and 3 ZooKeeper nodes; a quorum of 3 KRaft controllers is then configured and running in parallel, as per the documentation for the migration process.
> I've started the migration with the 3 controllers enrolled with SASL_SSL and PLAIN authentication, and I already see a strange TRACE log:
> {code:java}
> TRACE [KRaftMigrationDriver id=3000] Received metadata delta, but the controller is not in dual-write mode. Ignoring the change to be replicated to Zookeeper (org.apache.kafka.metadata.migration.KRaftMigrationDriver) {code}
> followed later by this message, where KRaft is waiting for the brokers to connect:
> {code:java}
> INFO [KRaftMigrationDriver id=1000] No brokers are known to KRaft, waiting for brokers to register. (org.apache.kafka.metadata.migration.KRaftMigrationDriver) {code}
> As soon as I start reconfiguring the brokers so that they connect to the new controllers, everything looks good on the KRaft controllers, with notifications that the brokers are correctly connecting and registering:
> {code:java}
> INFO [QuorumController id=1000] Replayed initial RegisterBrokerRecord for broker 1: RegisterBrokerRecord(brokerId=1, isMigratingZkBroker=true, incarnationId=xxxxxx, brokerEpoch=2638, endPoints=[BrokerEndpoint(name='EXTERNAL_SASL', host='vmk-tdtkafka-01', port=9095, securityProtocol=3)], features=[BrokerFeature(name='metadata.version', minSupportedVersion=19, maxSupportedVersion=19)], rack='zur1', fenced=true, inControlledShutdown=false, logDirs=[xxxxxx]) (org.apache.kafka.controller.ClusterControlManager)
> [...]
> INFO [KRaftMigrationDriver id=1000] Still waiting for ZK brokers [2, 3] to register with KRaft. (org.apache.kafka.metadata.migration.KRaftMigrationDriver)
> [...]
> INFO [KRaftMigrationDriver id=1000] Still waiting for ZK brokers [2] to register with KRaft. (org.apache.kafka.metadata.migration.KRaftMigrationDriver) {code}
> As soon as the first broker is connected, we start to see these INFO logs related to the migration process on the controller:
> {code:java}
> INFO [QuorumController id=1000] Cannot run write operation maybeFenceReplicas in pre-migration mode. Returning NOT_CONTROLLER. (org.apache.kafka.controller.QuorumController)
> INFO [QuorumController id=1000] maybeFenceReplicas: event failed with NotControllerException in 355 microseconds. Exception message: The controller is in pre-migration mode. (org.apache.kafka.controller.QuorumController){code}
> as well as requests to auto-create topics that already exist, looping every 30 seconds, on the last restarted broker:
> {code:java}
> INFO Sent auto-creation request for Set(_schemas) to the active controller. (kafka.server.DefaultAutoTopicCreationManager)
> INFO Sent auto-creation request for Set(_schemas) to the active controller. (kafka.server.DefaultAutoTopicCreationManager)
> INFO Sent auto-creation request for Set(_schemas) to the active controller. (kafka.server.DefaultAutoTopicCreationManager) {code}
> As long as there is still a controller in the old cluster (i.e., on the Kafka brokers), everything runs fine. As soon as the last node is restarted, things go off the rails. This last node never gets any partition assigned, and the cluster stays forever with under-replicated partitions. This is the log from the registration of the last node, which should start the migration, but the cluster stays forever in the *SYNC_KRAFT_TO_ZK* state in *pre-migration* mode:
> {code:java}
> INFO [QuorumController id=1000] The request from broker 2 to unfence has been granted because it has caught up with the offset of its register broker record 4101
> [...]
> INFO [KRaftMigrationDriver id=1000] Ignoring image MetadataProvenance(lastContainedOffset=4127, lastContainedEpoch=5, lastContainedLogTimeMs=1721133091831) which does not contain a superset of the metadata in ZK. Staying in SYNC_KRAFT_TO_ZK until a newer image is loaded (org.apache.kafka.metadata.migration.KRaftMigrationDriver) {code}
> The only way to recover the cluster is to revert everything: stop the clusters, remove /controller from ZooKeeper, and restore the ZooKeeper-only configuration on the brokers. A cleanup of the controllers is necessary too.
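>
> For completeness, the broker-side half of that revert is sketched below. This is only an illustrative example based on the standard migration properties from the documentation: all host names, ports and node IDs are placeholders, not the actual values of this cluster.
> {code:java}
> # Migration-era additions that are removed again from each broker's server.properties
> # when reverting to a ZooKeeper-only configuration:
> #   zookeeper.metadata.migration.enable=true
> #   controller.quorum.voters=1000@controller-01:9093,2000@controller-02:9093,3000@controller-03:9093
> #   controller.listener.names=CONTROLLER
>
> # What remains is the plain ZooKeeper-mode broker configuration:
> broker.id=1
> zookeeper.connect=zk-01:2181,zk-02:2181,zk-03:2181
> inter.broker.listener.name=EXTERNAL_SASL {code}
> With the brokers rolled back onto this configuration and the /controller znode removed, one of the brokers becomes the ZooKeeper-mode controller again; the KRaft controllers then have to be cleaned up separately, as noted above.
>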
> The migration never starts and the controllers never understand that they have to migrate the data from ZooKeeper. More than that, the new controller claims to be the CONTROLLER but then refuses to act as one.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)