Hello all, I have recently upgraded a number of Kafka clusters from Kafka 2.5.1 (protocol 2.5) to Kafka 3.5.0 (protocol 3.5) according to the steps outlined here: https://kafka.apache.org/documentation/#upgrade_350_zk

For the majority of these clusters, the new version and protocol have been running smoothly, with no recorded produce and/or consume availability issues. However, on some of my larger clusters, I have noticed that during a controller failover and subsequent election, a subset of brokers occasionally has trouble acknowledging the new controller. The affected brokers lock up and become completely unresponsive to requests from other brokers in the cluster as well as from clients, so the number of ISRs for various partitions sinks below the configured minimum and the partitions eventually go offline. I have to manually bounce these brokers for them to acknowledge the new controller and resume normal operation. Importantly, the issue is sporadic: it does not occur during every failover election, but it is common enough to cause pain for producers/consumers of the cluster. This is not something I observed in our old 2.5.x configuration, and I'm wondering how best to proceed with debugging it.
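For what it's worth, the election itself does appear to complete on the ZooKeeper side. This is roughly how I've been confirming which broker actually won the election when a failover happens (a rough sketch; our ZooKeeper hosts are redacted, so the connect string below is a placeholder):

---------
# Ask ZooKeeper which broker currently holds the controller znode;
# the JSON payload includes the active controller's brokerid
bin/zookeeper-shell.sh <zookeeper-host>:2181 get /controller
---------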
All of these clusters are deployed via Docker containers, with brokers spread across separate hosts. I've seen no indication of resource exhaustion aside from the "queued.max.requests" limit described below. I've also included some suspicious logs I'm seeing in the cluster, as well as our broker configuration, for reference:

PROBLEMATIC BROKER LOGS
---------
[2023-10-24 01:30:00,104] INFO [zk-broker-4-to-controller-forwarding-channel-manager]: Recorded new controller, from now on will use node kafka3-app.prod.net:9192 (id: 3 rack: ) (kafka.server.BrokerToControllerRequestThread)
[2023-10-24 01:30:00,104] WARN [BrokerToControllerChannelManager id=4 name=forwarding] Connection to node 3 (kafka3-app.prod.net:9192) could not be established. Broker may not be available. (org.apache.kafka.clients.NetworkClient)
[2023-10-24 01:30:00,104] INFO [BrokerToControllerChannelManager id=4 name=forwarding] Node 3 disconnected. (org.apache.kafka.clients.NetworkClient)
...
---------

These log lines loop indefinitely until the broker is restarted. For reference, kafka3-app.prod.net:9192 is the OLD controller, which is no longer running. For whatever reason the broker-to-controller channel was not able to update its reference to the new controller, despite ZooKeeper successfully registering the new controller znode.

NEW CONTROLLER LOGS
---------
[2023-10-24 01:33:20,569] WARN [RequestSendThread controllerId=11] Controller 11's connection to broker kafka4-app.prod.net:9192 (id: 4 rack: ) was unsuccessful (kafka.controller.RequestSendThread)
java.net.SocketTimeoutException: Failed to connect within 30000 ms
    at kafka.controller.RequestSendThread.brokerReady(ControllerChannelManager.scala:297)
    at kafka.controller.RequestSendThread.doWork(ControllerChannelManager.scala:249)
    at org.apache.kafka.server.util.ShutdownableThread.run(ShutdownableThread.java:127)
...
---------

Simultaneously, the new controller is unable to connect to the problematic broker, probably because that broker's request queue has filled to the "queued.max.requests" limit of 500 and it is rejecting any new requests. My hunch is that the looping, unsuccessful attempts to connect to the old controller are the cause of that, but my understanding could be incorrect and the full queue could just as easily be an effect of something else going on.
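In case it helps with suggestions, this is roughly the extra data I plan to capture the next time a broker gets into this state (a sketch only; the container name, PID, JMX port, and JmxTool class are assumptions about my setup, not something taken from the logs above):

---------
# Thread dumps from the stuck broker, to see what the request handler threads
# and the zk-broker-*-to-controller thread are blocked on
# (assumes the Kafka JVM is PID 1 in the container and jstack is on the image)
docker exec kafka-broker-4 jstack 1 > /tmp/broker4-threads-$(date +%s).txt

# Request queue depth; if it sits pinned at queued.max.requests (500 here),
# the broker has stopped draining its request queue
# (assumes JMX is enabled; the class may be org.apache.kafka.tools.JmxTool on newer builds)
bin/kafka-run-class.sh kafka.tools.JmxTool \
  --jmx-url service:jmx:rmi:///jndi/rmi://kafka4-app.prod.net:<jmx-port>/jmxrmi \
  --object-name kafka.network:type=RequestChannel,name=RequestQueueSize
---------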
BROKER CONFIGS
---------
broker.id=4
listeners=PLAINTEXT://0.0.0.0:9092,SASL_SSL://0.0.0.0:9192
advertised.listeners=PLAINTEXT://kafka4-app.prod.net:9092,SASL_SSL://kafka4-app.prod:9192
num.network.threads=16
num.io.threads=8
socket.send.buffer.bytes=102400
socket.receive.buffer.bytes=102400
socket.request.max.bytes=104857600
log.dirs=/opt/qualtrics/kafka/data-logs
num.partitions=20
num.recovery.threads.per.data.dir=8
offsets.topic.replication.factor=3
transaction.state.log.replication.factor=1
transaction.state.log.min.isr=1
log.retention.hours=336
log.retention.bytes=1000000000
log.segment.bytes=50000000
log.retention.check.interval.ms=300000
zookeeper.connect=<IP of external Zookeeper nodes, redacted>
zookeeper.connection.timeout.ms=18000
group.initial.rebalance.delay.ms=0
default.replication.factor=3
ssl.truststore.password=kafkajks
port=9092
sasl.mechanism.inter.broker.protocol=SCRAM-SHA-256
auto.create.topics.enable=false
listener.name.sasl_ssl.scram-sha-256.sasl.server.callback.handler.class=com.qualtrics.core.kafka.VaultAuthenticationHandler
controlled.shutdown.max.retries=5
delete.topic.enable=true
offsets.topic.num.partitions=50
broker.rack=
ssl.client.auth=requested
inter.broker.protocol.version=3.5
security.inter.broker.protocol=SASL_SSL
auto.leader.rebalance.enable=true
offsets.commit.required.acks=-1
ssl.truststore.location=certs/truststore.jks
message.max.bytes=1000000
ssl.keystore.password=kafkajks
unclean.leader.election.enable=false
num.replica.fetchers=8
external.kafka.statsd.port=8125
min.insync.replicas=2
external.kafka.statsd.host=dockerhost
sasl.enabled.mechanisms=SCRAM-SHA-256
ssl.keystore.location=certs/keystore.jks
host.name=kafka4-app.prod.net
external.kafka.statsd.metrics.prefix=core
authorizer.class.name=kafka.security.authorizer.AclAuthorizer
allow.everyone.if.no.acl.found=true
reserved.broker.max.id=1000000
super.users=User:broker-admin;User:tooling-admin
controlled.shutdown.retry.backoff.ms=5000
controlled.shutdown.enable=true
log.segment.hours=24
offsets.retention.minutes=10080
---------

Most of these configurations are set to the defaults.

Has anyone seen this on newer versions of Kafka and/or have guidance on troubleshooting it further? I have not yet created a JIRA ticket in the Kafka project, as I don't believe the information I have is specific enough and I've been unable to come up with a consistent reproduction. Thanks for any and all help!