[ 
https://issues.apache.org/jira/browse/KAFKA-16883?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Arthur closed KAFKA-16883.
--------------------------------

> Zookeeper-Kraft failing migration - RPC got timed out before it could be sent
> -----------------------------------------------------------------------------
>
>                 Key: KAFKA-16883
>                 URL: https://issues.apache.org/jira/browse/KAFKA-16883
>             Project: Kafka
>          Issue Type: Bug
>          Components: kraft
>    Affects Versions: 3.7.0, 3.6.1, 3.6.2
>            Reporter: Nicolas Henneaux
>            Priority: Major
>             Fix For: 3.7.1
>
>
> Despite several attempts to migrate a ZooKeeper-based cluster to KRaft, the 
> migration never completed.
> We spawned a new, fully healthy cluster with 3 Kafka nodes connected to 3 
> ZooKeeper nodes. 3 additional Kafka nodes act as the new controllers.
> This was tested with Kafka 3.6.1, 3.6.2 and 3.7.0.
> It might be linked to KAFKA-15330.
> The controllers start without issue. When the brokers are then configured 
> for the migration, the migration does not start. Once the last broker is 
> restarted, we get the following logs.
> {code:java}
> [2024-06-03 15:11:48,192] INFO [ReplicaFetcherThread-0-11]: Stopped (kafka.server.ReplicaFetcherThread)
> [2024-06-03 15:11:48,193] INFO [ReplicaFetcherThread-0-11]: Shutdown completed (kafka.server.ReplicaFetcherThread)
> {code}
> After that, we only get the following message every 30 seconds:
> {code:java}
> [2024-06-03 15:12:04,163] INFO [BrokerLifecycleManager id=12 isZkBroker=true] Unable to register the broker because the RPC got timed out before it could be sent. (kafka.server.BrokerLifecycleManager)
> [2024-06-03 15:12:34,297] INFO [BrokerLifecycleManager id=12 isZkBroker=true] Unable to register the broker because the RPC got timed out before it could be sent. (kafka.server.BrokerLifecycleManager)
> [2024-06-03 15:13:04,536] INFO [BrokerLifecycleManager id=12 isZkBroker=true] Unable to register the broker because the RPC got timed out before it could be sent. (kafka.server.BrokerLifecycleManager)
> {code}
> The config on the controller node is the following
> {code:java}
> kafka0202e1 ~]$  sudo grep -v '^\s*$\|^\s*\#' /etc/kafka/server.properties  | 
> grep -v password | sort
> advertised.host.name=kafka0202e1.ahub.sb.eu.ginfra.net
> broker.rack=e1
> controller.listener.names=CONTROLLER
> controller.quorum.voters=2...@kafka0202e1.ahub.sb.eu.ginfra.net:9093,2...@kafka0202e2.ahub.sb.eu.ginfra.net:9093,2...@kafka0202e3.ahub.sb.eu.ginfra.net:9093
> default.replication.factor=3
> delete.topic.enable=false
> group.initial.rebalance.delay.ms=3000
> inter.broker.protocol.version=3.7
> listeners=CONTROLLER://kafka0202e1.ahub.sb.eu.ginfra.net:9093
> listener.security.protocol.map=CONTROLLER:SSL,PLAINTEXT:PLAINTEXT,SSL:SSL,SASL_PLAINTEXT:SASL_PLAINTEXT,SASL_SSL:SASL_SSL
> log.dirs=/data/kafka
> log.message.format.version=3.6
> log.retention.check.interval.ms=300000
> log.retention.hours=240
> log.segment.bytes=1073741824
> min.insync.replicas=2
> node.id=20
> num.io.threads=8
> num.network.threads=3
> num.partitions=1
> num.recovery.threads.per.data.dir=1
> offsets.topic.replication.factor=3
> process.roles=controller
> security.inter.broker.protocol=SSL
> socket.receive.buffer.bytes=102400
> socket.request.max.bytes=104857600
> socket.send.buffer.bytes=102400
> ssl.cipher.suites=TLS_AES_256_GCM_SHA384
> ssl.client.auth=required
> ssl.enabled.protocols=TLSv1.3
> ssl.endpoint.identification.algorithm=HTTPS
> ssl.keystore.location=/etc/kafka/ssl/keystore.ts
> ssl.keystore.type=JKS
> ssl.secure.random.implementation=SHA1PRNG
> ssl.truststore.location=/etc/kafka/ssl/truststore.ts
> transaction.state.log.min.isr=3
> transaction.state.log.replication.factor=3
> unclean.leader.election.enable=false
> zookeeper.connect=10.135.65.199:2181,10.133.65.199:2181,10.137.64.56:2181,
> zookeeper.metadata.migration.enable=true
>  {code}
> The config on the broker node is the following
> {code}
> $ sudo grep -v '^\s*$\|^\s*\#' /etc/kafka/server.properties  | grep -v 
> password | sort
> advertised.host.name=kafka0201e3.ahub.sb.eu.ginfra.net
> advertised.listeners=SSL://kafka0201e3.ahub.sb.eu.ginfra.net:9092
> broker.id=12
> broker.rack=e3
> controller.listener.names=CONTROLLER # added once all controllers were started
> controller.quorum.voters=2...@kafka0202e1.ahub.sb.eu.ginfra.net:9093,2...@kafka0202e2.ahub.sb.eu.ginfra.net:9093,2...@kafka0202e3.ahub.sb.eu.ginfra.net:9093 # added once all controllers were started
> default.replication.factor=3
> delete.topic.enable=false
> group.initial.rebalance.delay.ms=3000
> inter.broker.protocol.version=3.7
> listener.security.protocol.map=CONTROLLER:SSL,PLAINTEXT:PLAINTEXT,SSL:SSL,SASL_PLAINTEXT:SASL_PLAINTEXT,SASL_SSL:SASL_SSL
> listeners=SSL://kafka0201e3.ahub.sb.eu.ginfra.net:9092
> log.dirs=/data/kafka
> log.retention.check.interval.ms=300000
> log.retention.hours=240
> log.segment.bytes=1073741824
> min.insync.replicas=2
> num.io.threads=8
> num.network.threads=3
> num.partitions=1
> num.recovery.threads.per.data.dir=1
> offsets.topic.replication.factor=3
> security.inter.broker.protocol=SSL
> socket.receive.buffer.bytes=102400
> socket.request.max.bytes=104857600
> socket.send.buffer.bytes=102400
> ssl.cipher.suites=TLS_AES_256_GCM_SHA384
> ssl.client.auth=required
> ssl.enabled.protocols=TLSv1.3
> ssl.endpoint.identification.algorithm=HTTPS
> ssl.keystore.location=/etc/kafka/ssl/keystore.ts
> ssl.keystore.type=JKS
> ssl.secure.random.implementation=SHA1PRNG
> ssl.truststore.location=/etc/kafka/ssl/truststore.ts
> transaction.state.log.min.isr=3
> transaction.state.log.replication.factor=3
> unclean.leader.election.enable=false
> zookeeper.connect=10.133.65.199:2181,10.135.65.199:2181,10.137.64.56:2181,
> zookeeper.connection.timeout.ms=6000
> zookeeper.metadata.migration.enable=true # added once all controllers were started
> {code}
> When trying to move to the next step (`Migrating brokers to KRaft`), the 
> broker fails to register with the controller quorum and shuts down.
> {code}
> [2024-06-03 15:33:21,553] INFO [BrokerLifecycleManager id=12] Unable to register the broker because the RPC got timed out before it could be sent. (kafka.server.BrokerLifecycleManager)
> [2024-06-03 15:33:32,549] ERROR [BrokerLifecycleManager id=12] Shutting down because we were unable to register with the controller quorum. (kafka.server.BrokerLifecycleManager)
> [2024-06-03 15:33:32,550] INFO [BrokerLifecycleManager id=12] Transitioning from STARTING to SHUTTING_DOWN. (kafka.server.BrokerLifecycleManager)
> [2024-06-03 15:33:32,551] INFO [broker-12-to-controller-heartbeat-channel-manager]: Shutting down (kafka.server.NodeToControllerRequestThread)
> [2024-06-03 15:33:32,551] INFO [broker-12-to-controller-heartbeat-channel-manager]: Shutdown completed (kafka.server.NodeToControllerRequestThread)
> [2024-06-03 15:33:32,551] ERROR [BrokerServer id=12] Received a fatal error while waiting for the controller to acknowledge that we are caught up (kafka.server.BrokerServer)
> java.util.concurrent.CancellationException
> {code}
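
For readers comparing their own setup against the broker configuration quoted above, here is a minimal sketch (not part of the original report, and not a diagnosis of this issue) that checks a server.properties file for the three settings the reporter marks as "added once all controllers were started". The function names parse_properties and check_migration_config are hypothetical. The sketch assumes simple key=value lines and only inspects listener.security.protocol.map when that key is present. It also flags a '#' inside a value, because Java-style .properties files treat '#' as a comment only at the start of a line; whether the annotations shown in the report were present in the real file is not known.

```python
# Keys the report marks as added to the broker for the migration.
MIGRATION_KEYS = (
    "controller.listener.names",
    "controller.quorum.voters",
    "zookeeper.metadata.migration.enable",
)


def parse_properties(text):
    """Parse simple key=value lines, skipping blanks and full-line comments."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", "!")):
            continue
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    return props


def check_migration_config(props):
    """Return a list of human-readable problems; an empty list means none found."""
    problems = [f"missing {key}" for key in MIGRATION_KEYS if key not in props]
    for key in MIGRATION_KEYS:
        # A trailing '# ...' is not a comment in .properties syntax,
        # so it would become part of the value.
        if "#" in props.get(key, ""):
            problems.append(f"{key} has a trailing '#' in its value")
    # Assumption: only check the protocol map when it is set explicitly.
    if "listener.security.protocol.map" in props:
        mapped = {
            entry.split(":", 1)[0].strip()
            for entry in props["listener.security.protocol.map"].split(",")
            if entry.strip()
        }
        for name in props.get("controller.listener.names", "").split(","):
            name = name.strip()
            if name and "#" not in name and name not in mapped:
                problems.append(
                    f"listener {name} not in listener.security.protocol.map"
                )
    return problems
```

Calling check_migration_config(parse_properties(open(path).read())) on a broker's file returns an empty list when none of these specific problems are found; it does not validate connectivity, TLS, or anything else about the cluster.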



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
