Hello all,

I have recently upgraded a number of Kafka clusters from Kafka 2.5.1
(protocol 2.5) to Kafka 3.5.0 (protocol 3.5) following the steps
outlined here: https://kafka.apache.org/documentation/#upgrade_350_zk
For the majority of these clusters, the new version and protocol have
been running smoothly, with no recorded produce and/or consume
availability issues. However, for some of my larger clusters, I have
noticed that, during a controller failover and subsequent election, a
subset of brokers occasionally has trouble acknowledging the new
controller. The affected brokers lock up and become completely
unresponsive to requests from other brokers in the cluster as well as
from clients, so the ISR count for various partitions shrinks below
min.insync.replicas and some partitions eventually go offline. I have
to manually bounce these troublesome brokers before they acknowledge
the new controller and resume normal operation. It is important to
note that this issue is sporadic: it does not occur during every
failover election, but it happens often enough to cause pain for
producers and consumers of the cluster. This is not something I
observed with our old 2.5.x configuration, and I'm wondering how best
to proceed with debugging it.
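
When a broker gets into this state, this is roughly how I confirm the
blast radius and capture its state before bouncing it (the bootstrap
hostname, PID lookup, and output path are placeholders for our
environment):

  # list partitions that have dropped below min.insync.replicas
  bin/kafka-topics.sh --bootstrap-server kafka1-app.prod.net:9092 \
    --describe --under-min-isr-partitions

  # thread dump of the stuck broker, to see what the request handler
  # and broker-to-controller threads are blocked on
  jcmd $(pgrep -f kafka.Kafka) Thread.print > /tmp/broker-threads.txt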

All of these clusters are deployed as Docker containers on hosts
distributed across the cluster. I've seen no indication of resource
exhaustion aside from the "queued.max.requests" limit
outlined below. I've also included some suspicious logs I am seeing in
the cluster, as well as our broker configuration for reference:

PROBLEMATIC BROKER LOGS
---------
[2023-10-24 01:30:00,104] INFO
[zk-broker-4-to-controller-forwarding-channel-manager]: Recorded new
controller, from now on will use node kafka3-app.prod.net:9192 (id: 3
rack: ) (kafka.server.BrokerToControllerRequestThread)

[2023-10-24 01:30:00,104] WARN [BrokerToControllerChannelManager id=4
name=forwarding] Connection to node 3 (kafka3-app.prod.net:9192) could
not be established. Broker may not be available.
(org.apache.kafka.clients.NetworkClient)
[2023-10-24 01:30:00,104] INFO [BrokerToControllerChannelManager id=4
name=forwarding] Node 3 disconnected.
(org.apache.kafka.clients.NetworkClient)

...
---------
These log lines repeat indefinitely until the broker is restarted. For
reference, kafka3-app.prod.net:9192 is the OLD controller, which is no
longer running. It seems that, for whatever reason, the
broker-to-controller channel was not able to update its reference to
the new controller, even though ZooKeeper successfully registered the
new /controller znode.
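
While an affected broker is looping like this, the controller znode in
ZooKeeper does look correct. I've been checking it with something
along these lines (the ZooKeeper address is a placeholder, and the
exact JSON fields vary by version):

  # read the current controller znode; "brokerid" should point at the
  # NEW controller (id 11 in the logs below), not the old one (id 3)
  bin/zookeeper-shell.sh <zookeeper-host>:2181 get /controller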


NEW CONTROLLER LOGS
---------
[2023-10-24 01:33:20,569] WARN [RequestSendThread controllerId=11]
Controller 11's connection to broker kafka4-app.prod.net:9192 (id: 4
rack: ) was unsuccessful (kafka.controller.RequestSendThread)

java.net.SocketTimeoutException: Failed to connect within 30000 ms
    at kafka.controller.RequestSendThread.brokerReady(ControllerChannelManager.scala:297)
    at kafka.controller.RequestSendThread.doWork(ControllerChannelManager.scala:249)
    at org.apache.kafka.server.util.ShutdownableThread.run(ShutdownableThread.java:127)

...
---------
Simultaneously, the new controller is unable to connect to the
problematic broker, probably because that broker's request queue has
filled up to the "queued.max.requests" limit of 500 and it has stopped
accepting new requests. My hunch is that the looping, unsuccessful
attempts to connect to the old controller are the cause of that, but
my understanding could be incorrect and the full queue could merely be
an effect of something else going on.
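
We export broker metrics to statsd (see the configs below), but for
spot checks I've also been reading the queue depth straight from JMX,
roughly like this (the JMX port is a placeholder for our setup):

  # sample the request queue depth on the affected broker
  # (bounded by queued.max.requests, which is 500 here)
  bin/kafka-run-class.sh kafka.tools.JmxTool \
    --jmx-url service:jmx:rmi:///jndi/rmi://kafka4-app.prod.net:9999/jmxrmi \
    --object-name 'kafka.network:type=RequestChannel,name=RequestQueueSize' \
    --one-time true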


BROKER CONFIGS
---------
broker.id=4
listeners=PLAINTEXT://0.0.0.0:9092,SASL_SSL://0.0.0.0:9192
advertised.listeners=PLAINTEXT://kafka4-app.prod.net:9092,SASL_SSL://kafka4-app.prod:9192
num.network.threads=16
num.io.threads=8
socket.send.buffer.bytes=102400
socket.receive.buffer.bytes=102400
socket.request.max.bytes=104857600
log.dirs=/opt/qualtrics/kafka/data-logs
num.partitions=20
num.recovery.threads.per.data.dir=8
offsets.topic.replication.factor=3
transaction.state.log.replication.factor=1
transaction.state.log.min.isr=1
log.retention.hours=336
log.retention.bytes=1000000000
log.segment.bytes=50000000
log.retention.check.interval.ms=300000
zookeeper.connect=<IP of external Zookeeper nodes, redacted>
zookeeper.connection.timeout.ms=18000
group.initial.rebalance.delay.ms=0
default.replication.factor=3
ssl.truststore.password=kafkajks
port=9092
sasl.mechanism.inter.broker.protocol=SCRAM-SHA-256
auto.create.topics.enable=false
listener.name.sasl_ssl.scram-sha-256.sasl.server.callback.handler.class=com.qualtrics.core.kafka.VaultAuthenticationHandler
controlled.shutdown.max.retries=5
delete.topic.enable=true
offsets.topic.num.partitions=50
broker.rack=
ssl.client.auth=requested
inter.broker.protocol.version=3.5
security.inter.broker.protocol=SASL_SSL
auto.leader.rebalance.enable=true
offsets.commit.required.acks=-1
ssl.truststore.location=certs/truststore.jks
message.max.bytes=1000000
ssl.keystore.password=kafkajks
unclean.leader.election.enable=false
num.replica.fetchers=8
external.kafka.statsd.port=8125
min.insync.replicas=2
external.kafka.statsd.host=dockerhost
sasl.enabled.mechanisms=SCRAM-SHA-256
ssl.keystore.location=certs/keystore.jks
host.name=kafka4-app.prod.net
external.kafka.statsd.metrics.prefix=core
authorizer.class.name=kafka.security.authorizer.AclAuthorizer
allow.everyone.if.no.acl.found=true
reserved.broker.max.id=1000000
super.users=User:broker-admin;User:tooling-admin
controlled.shutdown.retry.backoff.ms=5000
controlled.shutdown.enable=true
log.segment.hours=24
offsets.retention.minutes=10080
---------
Most of these configurations are set to their defaults.
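
Those are the static settings from server.properties; when I want the
effective values on a broker (including defaults such as
queued.max.requests=500), I pull them with something like this (the
bootstrap address is a placeholder):

  # dump all effective configs for broker.id=4, defaults included
  bin/kafka-configs.sh --bootstrap-server kafka4-app.prod.net:9092 \
    --entity-type brokers --entity-name 4 --describe --all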


Has anyone seen this on newer versions of Kafka, and/or does anyone
have guidance on troubleshooting it further? I have not yet created a
JIRA ticket in the Kafka project, as I don't believe the information I
have is specific enough and I have been unable to reproduce the issue
consistently.

Thanks for any and all help!
