dennis lucero created KAFKA-19318:
-------------------------------------
Summary: Zookeeper-Kraft failing migration - RPC got timed out
before it could be sent
Key: KAFKA-19318
URL: https://issues.apache.org/jira/browse/KAFKA-19318
Project: Kafka
Issue Type: Bug
Components: kraft
Affects Versions: 3.6.1, 3.6.2, 3.7.0
Reporter: dennis lucero
Fix For: 3.7.1
Several attempts to migrate a ZooKeeper-based cluster to KRaft have failed.
We start from a new, fully healthy cluster with 3 Kafka brokers connected to 3
ZooKeeper nodes. 3 additional Kafka nodes are provisioned as the new controllers.
This was tested with Kafka 3.6.1, 3.6.2 and 3.7.0.
It might be related to KAFKA-15330.
The controllers start without issue. When the brokers are then configured
for the migration, the migration does not start. Once the last broker has been
restarted, we get the following logs.
{code:java}
[2024-06-03 15:11:48,192] INFO [ReplicaFetcherThread-0-11]: Stopped (kafka.server.ReplicaFetcherThread)
[2024-06-03 15:11:48,193] INFO [ReplicaFetcherThread-0-11]: Shutdown completed (kafka.server.ReplicaFetcherThread)
{code}
After that, the only output is the following message, repeated every 30 seconds:
{code:java}
[2024-06-03 15:12:04,163] INFO [BrokerLifecycleManager id=12 isZkBroker=true] Unable to register the broker because the RPC got timed out before it could be sent. (kafka.server.BrokerLifecycleManager)
[2024-06-03 15:12:34,297] INFO [BrokerLifecycleManager id=12 isZkBroker=true] Unable to register the broker because the RPC got timed out before it could be sent. (kafka.server.BrokerLifecycleManager)
[2024-06-03 15:13:04,536] INFO [BrokerLifecycleManager id=12 isZkBroker=true] Unable to register the broker because the RPC got timed out before it could be sent. (kafka.server.BrokerLifecycleManager)
{code}
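As a quick sanity check on the "every 30 seconds" observation, the gaps between the three timestamps above can be computed directly. This is only a sketch: the timestamps are copied from the log excerpt, and the helper name `retry_gaps` is made up for illustration.

```python
from datetime import datetime

# Timestamps copied from the three log lines in the excerpt above.
STAMPS = [
    "2024-06-03 15:12:04,163",
    "2024-06-03 15:12:34,297",
    "2024-06-03 15:13:04,536",
]


def retry_gaps(stamps):
    """Return the gaps, in seconds, between consecutive log timestamps."""
    # Kafka's default log layout writes milliseconds after a comma.
    times = [datetime.strptime(s, "%Y-%m-%d %H:%M:%S,%f") for s in stamps]
    return [(later - earlier).total_seconds()
            for earlier, later in zip(times, times[1:])]


if __name__ == "__main__":
    for gap in retry_gaps(STAMPS):
        print(f"{gap:.3f}s")
```

Both gaps come out slightly above 30 seconds, which is consistent with a fixed retry or timeout interval rather than random failures.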
The configuration on the controller node is as follows:
{code:java}
kafka0202e1 ~]$ sudo grep -v '^\s*$\|^\s*\#' /etc/kafka/server.properties | grep -v password | sort
advertised.host.name=kafka0202e1.ahub.sb.eu.ginfra.net
broker.rack=e1
controller.listener.names=CONTROLLER
[email protected]:9093,[email protected]:9093,[email protected]:9093
default.replication.factor=3
delete.topic.enable=false
group.initial.rebalance.delay.ms=3000
inter.broker.protocol.version=3.7
listeners=CONTROLLER://kafka0202e1.ahub.sb.eu.ginfra.net:9093
listener.security.protocol.map=CONTROLLER:SSL,PLAINTEXT:PLAINTEXT,SSL:SSL,SASL_PLAINTEXT:SASL_PLAINTEXT,SASL_SSL:SASL_SSL
log.dirs=/data/kafka
log.message.format.version=3.6
log.retention.check.interval.ms=300000
log.retention.hours=240
log.segment.bytes=1073741824
min.insync.replicas=2
node.id=20
num.io.threads=8
num.network.threads=3
num.partitions=1
num.recovery.threads.per.data.dir=1
offsets.topic.replication.factor=3
process.roles=controller
security.inter.broker.protocol=SSL
socket.receive.buffer.bytes=102400
socket.request.max.bytes=104857600
socket.send.buffer.bytes=102400
ssl.cipher.suites=TLS_AES_256_GCM_SHA384
ssl.client.auth=required
ssl.enabled.protocols=TLSv1.3
ssl.endpoint.identification.algorithm=HTTPS
ssl.keystore.location=/etc/kafka/ssl/keystore.ts
ssl.keystore.type=JKS
ssl.secure.random.implementation=SHA1PRNG
ssl.truststore.location=/etc/kafka/ssl/truststore.ts
transaction.state.log.min.isr=3
transaction.state.log.replication.factor=3
unclean.leader.election.enable=false
zookeeper.connect=10.135.65.199:2181,10.133.65.199:2181,10.137.64.56:2181,
zookeeper.metadata.migration.enable=true
{code}
The configuration on the broker node is as follows:
{code}
$ sudo grep -v '^\s*$\|^\s*\#' /etc/kafka/server.properties | grep -v password | sort
advertised.host.name=kafka0201e3.ahub.sb.eu.ginfra.net
advertised.listeners=SSL://kafka0201e3.ahub.sb.eu.ginfra.net:9092
broker.id=12
broker.rack=e3
controller.listener.names=CONTROLLER # added once all controllers were started
[email protected]:9093,[email protected]:9093,[email protected]:9093 # added once all controllers were started
default.replication.factor=3
delete.topic.enable=false
group.initial.rebalance.delay.ms=3000
inter.broker.protocol.version=3.7
listener.security.protocol.map=CONTROLLER:SSL,PLAINTEXT:PLAINTEXT,SSL:SSL,SASL_PLAINTEXT:SASL_PLAINTEXT,SASL_SSL:SASL_SSL
listeners=SSL://kafka0201e3.ahub.sb.eu.ginfra.net:9092
log.dirs=/data/kafka
log.retention.check.interval.ms=300000
log.retention.hours=240
log.segment.bytes=1073741824
min.insync.replicas=2
num.io.threads=8
num.network.threads=3
num.partitions=1
num.recovery.threads.per.data.dir=1
offsets.topic.replication.factor=3
security.inter.broker.protocol=SSL
socket.receive.buffer.bytes=102400
socket.request.max.bytes=104857600
socket.send.buffer.bytes=102400
ssl.cipher.suites=TLS_AES_256_GCM_SHA384
ssl.client.auth=required
ssl.enabled.protocols=TLSv1.3
ssl.endpoint.identification.algorithm=HTTPS
ssl.keystore.location=/etc/kafka/ssl/keystore.ts
ssl.keystore.type=JKS
ssl.secure.random.implementation=SHA1PRNG
ssl.truststore.location=/etc/kafka/ssl/truststore.ts
transaction.state.log.min.isr=3
transaction.state.log.replication.factor=3
unclean.leader.election.enable=false
zookeeper.connect=10.133.65.199:2181,10.135.65.199:2181,10.137.64.56:2181,
zookeeper.connection.timeout.ms=6000
zookeeper.metadata.migration.enable=true # added once all controllers were started
{code}
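To rule out a simple omission in the broker settings, a small checker along the following lines could be run against the real `server.properties`. It is a sketch, not a diagnosis: the key list is a checklist based on the migration-related settings shown in this report, the function names are hypothetical, and the sample values (`zk-1.example`, `controller-1.example`) are placeholders. Values are compared exactly as written, because Java properties files do not support trailing `#` comments on a value line.

```python
# Checklist of keys a ZooKeeper-mode broker is expected to carry during migration.
# Assumption: this mirrors the settings discussed in the report, nothing more.
REQUIRED_KEYS = (
    "inter.broker.protocol.version",
    "zookeeper.connect",
    "zookeeper.metadata.migration.enable",
    "controller.quorum.voters",
    "controller.listener.names",
    "listener.security.protocol.map",
)


def parse_properties(text):
    """Parse key=value lines, skipping blank lines and full-line comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    return props


def check_migration_config(props):
    """Return a list of human-readable findings; an empty list means no findings."""
    problems = [f"missing or empty: {key}"
                for key in REQUIRED_KEYS if not props.get(key)]

    enable = props.get("zookeeper.metadata.migration.enable", "")
    if enable and enable.lower() != "true":
        problems.append(f"zookeeper.metadata.migration.enable is {enable!r}, not 'true'")

    # Every controller listener name should have a security protocol mapping.
    mapped = {entry.split(":", 1)[0].strip()
              for entry in props.get("listener.security.protocol.map", "").split(",")
              if ":" in entry}
    for name in props.get("controller.listener.names", "").split(","):
        name = name.strip()
        if name and name not in mapped:
            problems.append(
                f"controller listener {name} has no entry in listener.security.protocol.map")
    return problems


# Placeholder configuration for demonstration only.
SAMPLE = """
broker.id=12
inter.broker.protocol.version=3.7
zookeeper.connect=zk-1.example:2181
zookeeper.metadata.migration.enable=true
controller.quorum.voters=20@controller-1.example:9093
controller.listener.names=CONTROLLER
listener.security.protocol.map=CONTROLLER:SSL,SSL:SSL
"""

if __name__ == "__main__":
    findings = check_migration_config(parse_properties(SAMPLE))
    print(findings or "no findings")
```

An empty result only means the checklist keys are present and consistent with each other. It says nothing about whether the controllers are reachable or whether the TLS settings match on both sides.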
When trying to move to the next step (`Migrating brokers to KRaft`), the broker
fails to register with the controller quorum and shuts down.
{code}
[2024-06-03 15:33:21,553] INFO [BrokerLifecycleManager id=12] Unable to register the broker because the RPC got timed out before it could be sent. (kafka.server.BrokerLifecycleManager)
[2024-06-03 15:33:32,549] ERROR [BrokerLifecycleManager id=12] Shutting down because we were unable to register with the controller quorum. (kafka.server.BrokerLifecycleManager)
[2024-06-03 15:33:32,550] INFO [BrokerLifecycleManager id=12] Transitioning from STARTING to SHUTTING_DOWN. (kafka.server.BrokerLifecycleManager)
[2024-06-03 15:33:32,551] INFO [broker-12-to-controller-heartbeat-channel-manager]: Shutting down (kafka.server.NodeToControllerRequestThread)
[2024-06-03 15:33:32,551] INFO [broker-12-to-controller-heartbeat-channel-manager]: Shutdown completed (kafka.server.NodeToControllerRequestThread)
[2024-06-03 15:33:32,551] ERROR [BrokerServer id=12] Received a fatal error while waiting for the controller to acknowledge that we are caught up (kafka.server.BrokerServer)
java.util.concurrent.CancellationException
{code}
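Since the registration RPC is reported as timing out before it could be sent, one thing worth ruling out is plain network reachability from the broker to each controller endpoint. The sketch below parses a `controller.quorum.voters`-style value and attempts a TCP connection to every entry. The hostnames are placeholders, the helper names are hypothetical, and the check is TCP-only: it does not perform the TLS handshake or client authentication that the `CONTROLLER:SSL` listener requires.

```python
import socket


def parse_voters(value):
    """Split 'id@host:port,...' into (node_id, host, port) tuples."""
    voters = []
    for entry in value.split(","):
        entry = entry.strip()
        if not entry:
            continue  # tolerate a trailing comma
        node_id, _, endpoint = entry.partition("@")
        host, _, port = endpoint.rpartition(":")
        voters.append((int(node_id), host, int(port)))
    return voters


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    # Placeholder value; substitute the real controller.quorum.voters setting.
    voters = "20@controller-1.example:9093,21@controller-2.example:9093"
    for node_id, host, port in parse_voters(voters):
        status = "reachable" if can_connect(host, port) else "NOT reachable"
        print(f"node {node_id} at {host}:{port}: {status}")
```

If every endpoint is reachable at the TCP level, the next candidates to examine would be the TLS configuration of the controller listener and whether the controllers have elected an active leader.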