Nicolas Henneaux created KAFKA-16883:
----------------------------------------

             Summary: Zookeeper-Kraft failing migration - RPC got timed out 
before it could be sent
                 Key: KAFKA-16883
                 URL: https://issues.apache.org/jira/browse/KAFKA-16883
             Project: Kafka
          Issue Type: Bug
          Components: kraft
    Affects Versions: 3.6.2, 3.6.1, 3.7.0
            Reporter: Nicolas Henneaux


Despite several attempts, the migration from a ZooKeeper cluster to KRaft fails to complete.

We spawned a new, fully healthy cluster with 3 Kafka nodes connected to 3 ZooKeeper nodes. 3 additional Kafka nodes were provisioned for the new controllers.

The controllers start without issue. When the brokers are then configured for the migration, the migration does not start. Once the last broker is restarted, we get the following logs.
{code:java}
[2024-06-03 15:11:48,192] INFO [ReplicaFetcherThread-0-11]: Stopped (kafka.server.ReplicaFetcherThread)
[2024-06-03 15:11:48,193] INFO [ReplicaFetcherThread-0-11]: Shutdown completed (kafka.server.ReplicaFetcherThread)
{code}
Then we only get the following message every 30 s:
{code:java}
[2024-06-03 15:12:04,163] INFO [BrokerLifecycleManager id=12 isZkBroker=true] Unable to register the broker because the RPC got timed out before it could be sent. (kafka.server.BrokerLifecycleManager)
[2024-06-03 15:12:34,297] INFO [BrokerLifecycleManager id=12 isZkBroker=true] Unable to register the broker because the RPC got timed out before it could be sent. (kafka.server.BrokerLifecycleManager)
[2024-06-03 15:13:04,536] INFO [BrokerLifecycleManager id=12 isZkBroker=true] Unable to register the broker because the RPC got timed out before it could be sent. (kafka.server.BrokerLifecycleManager){code}
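As a side note, one way to check whether the migration has actually begun is to inspect the state the KRaft controller records in ZooKeeper (KIP-866). This is a diagnostic sketch against a live cluster; the ZooKeeper address below is one of the nodes from our {{zookeeper.connect}}, and the znodes shown are only present once a controller has claimed leadership or started the migration.
{code:java}
# State of the ZK-to-KRaft migration, written by the KRaft controller (KIP-866).
# Absence of this znode suggests the migration never started.
bin/zookeeper-shell.sh 10.135.65.199:2181 get /migration

# Current controller registration; during a migration this should reflect
# one of the new KRaft controller nodes rather than a ZK-mode broker.
bin/zookeeper-shell.sh 10.135.65.199:2181 get /controller
{code}
In our case no migration state appeared, which is consistent with the brokers never managing to register with the controller quorum.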

The config on the controller node is the following:
{code:java}
kafka0202e1 ~]$ sudo grep -v '^\s*$\|^\s*\#' /etc/kafka/server.properties | grep -v password | sort
advertised.host.name=kafka0202e1.ahub.sb.eu.ginfra.net
broker.rack=e1
controller.listener.names=CONTROLLER
controller.quorum.voters=2...@kafka0202e1.ahub.sb.eu.ginfra.net:9093,2...@kafka0202e2.ahub.sb.eu.ginfra.net:9093,2...@kafka0202e3.ahub.sb.eu.ginfra.net:9093
default.replication.factor=3
delete.topic.enable=false
group.initial.rebalance.delay.ms=3000
inter.broker.protocol.version=3.7
listeners=CONTROLLER://kafka0202e1.ahub.sb.eu.ginfra.net:9093
listener.security.protocol.map=CONTROLLER:SSL,PLAINTEXT:PLAINTEXT,SSL:SSL,SASL_PLAINTEXT:SASL_PLAINTEXT,SASL_SSL:SASL_SSL
log.dirs=/data/kafka
log.message.format.version=3.6
log.retention.check.interval.ms=300000
log.retention.hours=240
log.segment.bytes=1073741824
min.insync.replicas=2
node.id=20
num.io.threads=8
num.network.threads=3
num.partitions=1
num.recovery.threads.per.data.dir=1
offsets.topic.replication.factor=3
process.roles=controller
security.inter.broker.protocol=SSL
socket.receive.buffer.bytes=102400
socket.request.max.bytes=104857600
socket.send.buffer.bytes=102400
ssl.cipher.suites=TLS_AES_256_GCM_SHA384
ssl.client.auth=required
ssl.enabled.protocols=TLSv1.3
ssl.endpoint.identification.algorithm=HTTPS
ssl.keystore.location=/etc/kafka/ssl/keystore.ts
ssl.keystore.type=JKS
ssl.secure.random.implementation=SHA1PRNG
ssl.truststore.location=/etc/kafka/ssl/truststore.ts
transaction.state.log.min.isr=3
transaction.state.log.replication.factor=3
unclean.leader.election.enable=false
zookeeper.connect=10.135.65.199:2181,10.133.65.199:2181,10.137.64.56:2181,
zookeeper.metadata.migration.enable=true
{code}

The config on the broker node is the following:
{code}
$ sudo grep -v '^\s*$\|^\s*\#' /etc/kafka/server.properties | grep -v password | sort
advertised.host.name=kafka0201e3.ahub.sb.eu.ginfra.net
advertised.listeners=SSL://kafka0201e3.ahub.sb.eu.ginfra.net:9092
broker.id=12
broker.rack=e3
controller.listener.names=CONTROLLER # added once all controllers were started
controller.quorum.voters=2...@kafka0202e1.ahub.sb.eu.ginfra.net:9093,2...@kafka0202e2.ahub.sb.eu.ginfra.net:9093,2...@kafka0202e3.ahub.sb.eu.ginfra.net:9093 # added once all controllers were started
default.replication.factor=3
delete.topic.enable=false
group.initial.rebalance.delay.ms=3000
inter.broker.protocol.version=3.7
listener.security.protocol.map=CONTROLLER:SSL,PLAINTEXT:PLAINTEXT,SSL:SSL,SASL_PLAINTEXT:SASL_PLAINTEXT,SASL_SSL:SASL_SSL
listeners=SSL://kafka0201e3.ahub.sb.eu.ginfra.net:9092
log.dirs=/data/kafka
log.retention.check.interval.ms=300000
log.retention.hours=240
log.segment.bytes=1073741824
min.insync.replicas=2
num.io.threads=8
num.network.threads=3
num.partitions=1
num.recovery.threads.per.data.dir=1
offsets.topic.replication.factor=3
security.inter.broker.protocol=SSL
socket.receive.buffer.bytes=102400
socket.request.max.bytes=104857600
socket.send.buffer.bytes=102400
ssl.cipher.suites=TLS_AES_256_GCM_SHA384
ssl.client.auth=required
ssl.enabled.protocols=TLSv1.3
ssl.endpoint.identification.algorithm=HTTPS
ssl.keystore.location=/etc/kafka/ssl/keystore.ts
ssl.keystore.type=JKS
ssl.secure.random.implementation=SHA1PRNG
ssl.truststore.location=/etc/kafka/ssl/truststore.ts
transaction.state.log.min.isr=3
transaction.state.log.replication.factor=3
unclean.leader.election.enable=false
zookeeper.connect=10.133.65.199:2181,10.135.65.199:2181,10.137.64.56:2181,
zookeeper.connection.timeout.ms=6000
zookeeper.metadata.migration.enable=true # added once all controllers were started
{code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)