[
https://issues.apache.org/jira/browse/KAFKA-15766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Cameron resolved KAFKA-15766.
-----------------------------
Resolution: Abandoned
Closing this issue as it's a few years old now without much traction; I have
since moved to a KRaft based Kafka cluster and this issue is no longer
reproducible
> Possible request handler thread exhaustion in controller election
> -----------------------------------------------------------------
>
> Key: KAFKA-15766
> URL: https://issues.apache.org/jira/browse/KAFKA-15766
> Project: Kafka
> Issue Type: Bug
> Affects Versions: 3.5.0
> Reporter: Cameron
> Priority: Major
>
> Hello,
> This is my first dive into the Kafka source code, so I may be
> mis-interpreting some of the logic outlined below. Please let me know if my
> understanding is not correct.
> ---
> After upgrading a relatively large Kafka cluster from IPB 2.5 to 3.5 (Kafka
> version v3.5.0), we've had numerous problems with controller failover and
> subsequent elections not being properly acknowledged by a subset of the
> cluster's brokers. These affected brokers will temporarily become
> unresponsive due to complete request handler thread starvation via failed
> ALLOCATE_PRODUCER_IDS and ALTER_PARTITION requests to the old stopped
> controller. In the case of a disconnected response, as seen here, the
> broker-to-controller channel will place these failed requests at the
> *beginning* of the queue, effectively deadlocking the request handler thread
> pool. Depending on how quickly the configured queue.max.requests request
> queue fills, the new controller's UPDATE_METADATA request to notify the
> broker of the controller election result may not even be able to contact the
> affected broker to update the new controller Id.
> For context, the relevant configurations of my cluster are:
> * num.network.threads = 16
> * num.io.threads = 8
> * queued.max.requests = 500
> * inter.broker.protocol.version = 3.5
> * security.inter.broker.protocol = SASL_SSL
> * sasl.enabled.mechanisms = SCRAM-SHA-256
> And the cluster itself is set up as:
> * 9 brokers
> * ~4000 partitions / broker [aggregate leader + replica]
> * ~20 MB/s data in across cluster
>
> I am not able to consistently repro the issue, but it happens frequently
> enough to cause major pain in our set up. The general order of events are:
> # Controller election takes place (due to intentional failover or otherwise)
> # New controller successfully registers its’ znode with Zookeeper, initiates
> UPDATE_METADATA request to notify other brokers in the cluster
> # Problematic broker, noticing it has lost connection to the old controller
> but has not yet received and/or processed the incoming UPDATE_METADATA
> request, spams (every ~50ms as hardcoded) ALLOCATE_PRODUCER_IDS and
> ALTER_PARTITION requests to the old controller, receiving disconnect
> responses and placing the request back in the front of the request queue.
> In this scenario, the needed UPDATE_METADATA request will take much longer to
> process. In our specific cluster this will sometimes auto-resolve in 3-5m
> whereas other times it will require manual intervention to bounce the broker
> and allow it to initialize a connection with the new controller. During this
> time, I see the corresponding Kafka server logs:
> +Affected Broker Logs+
> {code:java}
> [2023-10-24 01:30:00,104] INFO
> [zk-broker-4-to-controller-alter-partition-channel-manager]: Recorded new
> controller, from now on will use node kafka3-app.prod.net:9192 (id: 3 rack: )
> (kafka.server.BrokerToControllerRequestThread)
> [2023-10-24 01:30:00,104] WARN [BrokerToControllerChannelManager id=4
> name=alter-partition] Connection to node 3 (kafka3-app.prod.net:9192) could
> not be established. Broker may not be available.
> (org.apache.kafka.clients.NetworkClient)
> [2023-10-24 01:30:00,104] INFO [BrokerToControllerChannelManager id=4
> name=alter-partition] Node 3 disconnected.
> (org.apache.kafka.clients.NetworkClient)
> ...{code}
> {code:java}
> [2023-10-26 02:42:35,166] WARN [RPC ProducerId Manager 103]: Timed out when
> requesting AllocateProducerIds from the controller.
> (kafka.coordinator.transaction.RPCProducerIdManager)
> org.apache.kafka.common.errors.CoordinatorLoadInProgressException: Timed out
> waiting for next producer ID block
> ...{code}
>
> +New Controller Logs+
> {code:java}
> [2023-10-24 01:33:20,569] WARN [RequestSendThread controllerId=11] Controller
> 11's connection to broker kafka4-app.prod.net:9192 (id: 4 rack: ) was
> unsuccessful (kafka.controller.RequestSendThread)
> java.net.SocketTimeoutException: Failed to connect within 30000 ms
> at
> kafka.controller.RequestSendThread.brokerReady(ControllerChannelManager.scala:297)
> at
> kafka.controller.RequestSendThread.doWork(ControllerChannelManager.scala:249)
> at
> org.apache.kafka.server.util.ShutdownableThread.run(ShutdownableThread.java:127)
>
> ...{code}
>
> This issue seems very similar to some previous tickets I found:
> * https://issues.apache.org/jira/browse/KAFKA-7697
> * https://issues.apache.org/jira/browse/KAFKA-14694
> Unfortunately I have not been able to grab a threaddump because of the
> sporadic nature of the issue, but will update the ticket if I am able.
>
> This may also affect other versions of Kafka, but we've only experienced it
> with v3.5.0 + IBP 3.5. Curiously enough I haven't encountered this issue with
> numerous smaller throughput clusters, so there seems to be some sort of
> load-induced trigger here. I have also not seen this issue with v3.5.0 + IBP
> 2.5, assumedly because the ALLOCATE_PRODUCER_IDS and ALTER_PARTITION features
> are not yet enabled until IBP 2.7 and 3.0, respectively.
>
> Is there any guidance on whether this failure path is possible as described
> and what steps can be taken to mitigate the issue?
> Thank you!
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)