[ https://issues.apache.org/jira/browse/KAFKA-16883?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
David Arthur closed KAFKA-16883.
--------------------------------

> Zookeeper-Kraft failing migration - RPC got timed out before it could be sent
> -----------------------------------------------------------------------------
>
> Key: KAFKA-16883
> URL: https://issues.apache.org/jira/browse/KAFKA-16883
> Project: Kafka
> Issue Type: Bug
> Components: kraft
> Affects Versions: 3.7.0, 3.6.1, 3.6.2
> Reporter: Nicolas Henneaux
> Priority: Major
> Fix For: 3.7.1
>
>
> Despite several attempts to migrate from a ZooKeeper cluster to KRaft, the
> migration failed to complete properly.
> We spawned a new, fully healthy cluster with 3 Kafka nodes connected to 3
> ZooKeeper nodes. 3 additional Kafka nodes are there for the new controllers.
> It was tested with Kafka 3.6.1, 3.6.2 and 3.7.0.
> It might be linked to KAFKA-15330.
> The controllers are started without issue. When the brokers are then
> configured for the migration, the migration does not start. Once the last
> broker is restarted, we get the following logs.
> {code:java}
> [2024-06-03 15:11:48,192] INFO [ReplicaFetcherThread-0-11]: Stopped (kafka.server.ReplicaFetcherThread)
> [2024-06-03 15:11:48,193] INFO [ReplicaFetcherThread-0-11]: Shutdown completed (kafka.server.ReplicaFetcherThread)
> {code}
> After that, we only get the following message every 30s
> {code:java}
> [2024-06-03 15:12:04,163] INFO [BrokerLifecycleManager id=12 isZkBroker=true] Unable to register the broker because the RPC got timed out before it could be sent. (kafka.server.BrokerLifecycleManager)
> [2024-06-03 15:12:34,297] INFO [BrokerLifecycleManager id=12 isZkBroker=true] Unable to register the broker because the RPC got timed out before it could be sent. (kafka.server.BrokerLifecycleManager)
> [2024-06-03 15:13:04,536] INFO [BrokerLifecycleManager id=12 isZkBroker=true] Unable to register the broker because the RPC got timed out before it could be sent.
> (kafka.server.BrokerLifecycleManager)
> {code}
> The config on the controller node is the following
> {code:java}
> kafka0202e1 ~]$ sudo grep -v '^\s*$\|^\s*\#' /etc/kafka/server.properties | grep -v password | sort
> advertised.host.name=kafka0202e1.ahub.sb.eu.ginfra.net
> broker.rack=e1
> controller.listener.names=CONTROLLER
> controller.quorum.voters=2...@kafka0202e1.ahub.sb.eu.ginfra.net:9093,2...@kafka0202e2.ahub.sb.eu.ginfra.net:9093,2...@kafka0202e3.ahub.sb.eu.ginfra.net:9093
> default.replication.factor=3
> delete.topic.enable=false
> group.initial.rebalance.delay.ms=3000
> inter.broker.protocol.version=3.7
> listeners=CONTROLLER://kafka0202e1.ahub.sb.eu.ginfra.net:9093
> listener.security.protocol.map=CONTROLLER:SSL,PLAINTEXT:PLAINTEXT,SSL:SSL,SASL_PLAINTEXT:SASL_PLAINTEXT,SASL_SSL:SASL_SSL
> log.dirs=/data/kafka
> log.message.format.version=3.6
> log.retention.check.interval.ms=300000
> log.retention.hours=240
> log.segment.bytes=1073741824
> min.insync.replicas=2
> node.id=20
> num.io.threads=8
> num.network.threads=3
> num.partitions=1
> num.recovery.threads.per.data.dir=1
> offsets.topic.replication.factor=3
> process.roles=controller
> security.inter.broker.protocol=SSL
> socket.receive.buffer.bytes=102400
> socket.request.max.bytes=104857600
> socket.send.buffer.bytes=102400
> ssl.cipher.suites=TLS_AES_256_GCM_SHA384
> ssl.client.auth=required
> ssl.enabled.protocols=TLSv1.3
> ssl.endpoint.identification.algorithm=HTTPS
> ssl.keystore.location=/etc/kafka/ssl/keystore.ts
> ssl.keystore.type=JKS
> ssl.secure.random.implementation=SHA1PRNG
> ssl.truststore.location=/etc/kafka/ssl/truststore.ts
> transaction.state.log.min.isr=3
> transaction.state.log.replication.factor=3
> unclean.leader.election.enable=false
> zookeeper.connect=10.135.65.199:2181,10.133.65.199:2181,10.137.64.56:2181,
> zookeeper.metadata.migration.enable=true
> {code}
> The config on the broker node is the following
> {code}
> $ sudo grep -v '^\s*$\|^\s*\#' /etc/kafka/server.properties
> | grep -v password | sort
> advertised.host.name=kafka0201e3.ahub.sb.eu.ginfra.net
> advertised.listeners=SSL://kafka0201e3.ahub.sb.eu.ginfra.net:9092
> broker.id=12
> broker.rack=e3
> controller.listener.names=CONTROLLER # added once all controllers were started
> controller.quorum.voters=2...@kafka0202e1.ahub.sb.eu.ginfra.net:9093,2...@kafka0202e2.ahub.sb.eu.ginfra.net:9093,2...@kafka0202e3.ahub.sb.eu.ginfra.net:9093 # added once all controllers were started
> default.replication.factor=3
> delete.topic.enable=false
> group.initial.rebalance.delay.ms=3000
> inter.broker.protocol.version=3.7
> listener.security.protocol.map=CONTROLLER:SSL,PLAINTEXT:PLAINTEXT,SSL:SSL,SASL_PLAINTEXT:SASL_PLAINTEXT,SASL_SSL:SASL_SSL
> listeners=SSL://kafka0201e3.ahub.sb.eu.ginfra.net:9092
> log.dirs=/data/kafka
> log.retention.check.interval.ms=300000
> log.retention.hours=240
> log.segment.bytes=1073741824
> min.insync.replicas=2
> num.io.threads=8
> num.network.threads=3
> num.partitions=1
> num.recovery.threads.per.data.dir=1
> offsets.topic.replication.factor=3
> security.inter.broker.protocol=SSL
> socket.receive.buffer.bytes=102400
> socket.request.max.bytes=104857600
> socket.send.buffer.bytes=102400
> ssl.cipher.suites=TLS_AES_256_GCM_SHA384
> ssl.client.auth=required
> ssl.enabled.protocols=TLSv1.3
> ssl.endpoint.identification.algorithm=HTTPS
> ssl.keystore.location=/etc/kafka/ssl/keystore.ts
> ssl.keystore.type=JKS
> ssl.secure.random.implementation=SHA1PRNG
> ssl.truststore.location=/etc/kafka/ssl/truststore.ts
> transaction.state.log.min.isr=3
> transaction.state.log.replication.factor=3
> unclean.leader.election.enable=false
> zookeeper.connect=10.133.65.199:2181,10.135.65.199:2181,10.137.64.56:2181,
> zookeeper.connection.timeout.ms=6000
> zookeeper.metadata.migration.enable=true # added once all controllers were started
> {code}
> When trying to move to the next step (`Migrating brokers to KRaft`), the
> broker fails to register with the controller quorum and crashes.
> {code}
> [2024-06-03 15:33:21,553] INFO [BrokerLifecycleManager id=12] Unable to register the broker because the RPC got timed out before it could be sent. (kafka.server.BrokerLifecycleManager)
> [2024-06-03 15:33:32,549] ERROR [BrokerLifecycleManager id=12] Shutting down because we were unable to register with the controller quorum. (kafka.server.BrokerLifecycleManager)
> [2024-06-03 15:33:32,550] INFO [BrokerLifecycleManager id=12] Transitioning from STARTING to SHUTTING_DOWN. (kafka.server.BrokerLifecycleManager)
> [2024-06-03 15:33:32,551] INFO [broker-12-to-controller-heartbeat-channel-manager]: Shutting down (kafka.server.NodeToControllerRequestThread)
> [2024-06-03 15:33:32,551] INFO [broker-12-to-controller-heartbeat-channel-manager]: Shutdown completed (kafka.server.NodeToControllerRequestThread)
> [2024-06-03 15:33:32,551] ERROR [BrokerServer id=12] Received a fatal error while waiting for the controller to acknowledge that we are caught up (kafka.server.BrokerServer)
> java.util.concurrent.CancellationException
> {code}

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
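
The broker settings that the report annotates as "added once all controllers were started", together with the controller listener's entry in listener.security.protocol.map, can be sanity-checked before a restart. The following is a minimal Python sketch and is not part of the report. The helper names parse_properties and missing_migration_settings are hypothetical, and the check only confirms that the settings are present. The broker configuration quoted in the report contains all of them and still failed to register, so a clean result does not rule out this issue.

```python
def parse_properties(text):
    """Parse simple key=value lines, skipping blank lines and # comment lines.

    Assumption: values carry no inline annotations; a trailing "# ..." would
    be kept as part of the value.
    """
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def missing_migration_settings(props):
    """Return a list of problems with the migration-related broker settings."""
    problems = []
    # The three settings the report marks as added for the migration.
    for key in ("zookeeper.metadata.migration.enable",
                "controller.quorum.voters",
                "controller.listener.names"):
        if not props.get(key):
            problems.append(f"{key} is not set")
    flag = props.get("zookeeper.metadata.migration.enable", "").lower()
    if flag not in ("", "true"):
        problems.append("zookeeper.metadata.migration.enable is not true")
    # Each controller listener name should be mapped to a security protocol.
    mapped = {entry.split(":", 1)[0].strip()
              for entry in props.get("listener.security.protocol.map", "").split(",")
              if ":" in entry}
    for name in props.get("controller.listener.names", "").split(","):
        name = name.strip()
        if name and name not in mapped:
            problems.append(
                f"listener {name} missing from listener.security.protocol.map")
    return problems


if __name__ == "__main__":
    # Trimmed excerpt of the broker config, with controller.quorum.voters
    # deliberately left out to show what a finding looks like.
    sample = """\
controller.listener.names=CONTROLLER
listener.security.protocol.map=CONTROLLER:SSL,PLAINTEXT:PLAINTEXT,SSL:SSL
zookeeper.metadata.migration.enable=true
"""
    print(missing_migration_settings(parse_properties(sample)))
    # prints ['controller.quorum.voters is not set']
```

An empty list means all checked settings are present. Running the check against each broker's server.properties is a quick way to spot a node that missed one of the additions.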