[ 
https://issues.apache.org/jira/browse/KAFKA-15766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cameron resolved KAFKA-15766.
-----------------------------
    Resolution: Abandoned

Closing this issue as it's a few years old now without much traction; I have 
since moved to a KRaft based Kafka cluster and this issue is no longer 
reproducible

> Possible request handler thread exhaustion in controller election
> -----------------------------------------------------------------
>
>                 Key: KAFKA-15766
>                 URL: https://issues.apache.org/jira/browse/KAFKA-15766
>             Project: Kafka
>          Issue Type: Bug
>    Affects Versions: 3.5.0
>            Reporter: Cameron
>            Priority: Major
>
> Hello,
> This is my first dive into the Kafka source code, so I may be 
> mis-interpreting some of the logic outlined below. Please let me know if my 
> understanding is not correct.
> --- 
> After upgrading a relatively large Kafka cluster from IPB 2.5 to 3.5 (Kafka 
> version v3.5.0), we've had numerous problems with controller failover and 
> subsequent elections not being properly acknowledged by a subset of the 
> cluster's brokers.  These affected brokers will temporarily become 
> unresponsive due to complete request handler thread starvation via failed 
> ALLOCATE_PRODUCER_IDS and ALTER_PARTITION requests to the old stopped 
> controller. In the case of a disconnected response, as seen here, the 
> broker-to-controller channel will place these failed requests at the 
> *beginning* of the queue, effectively deadlocking the request handler thread 
> pool. Depending on how quickly the configured queue.max.requests request 
> queue fills, the new controller's UPDATE_METADATA request to notify the 
> broker of the controller election result may not even be able to contact the 
> affected broker to update the new controller Id. 
> For context, the relevant configurations of my cluster are:
>  * num.network.threads = 16
>  * num.io.threads = 8
>  * queued.max.requests = 500
>  * inter.broker.protocol.version = 3.5
>  * security.inter.broker.protocol = SASL_SSL
>  * sasl.enabled.mechanisms = SCRAM-SHA-256
> And the cluster itself is set up as:
>  * 9 brokers
>  * ~4000 partitions / broker [aggregate leader + replica]
>  * ~20 MB/s data in across cluster
>  
> I am not able to consistently repro the issue, but it happens frequently 
> enough to cause major pain in our set up. The general order of events are:
>  # Controller election takes place (due to intentional failover or otherwise)
>  # New controller successfully registers its’ znode with Zookeeper, initiates 
> UPDATE_METADATA request to notify other brokers in the cluster
>  # Problematic broker, noticing it has lost connection to the old controller 
> but has not yet received and/or processed the incoming UPDATE_METADATA 
> request, spams (every ~50ms as hardcoded) ALLOCATE_PRODUCER_IDS and 
> ALTER_PARTITION requests to the old controller, receiving disconnect 
> responses and placing the request back in the front of the request queue.
> In this scenario, the needed UPDATE_METADATA request will take much longer to 
> process. In our specific cluster this will sometimes auto-resolve in 3-5m 
> whereas other times it will require manual intervention to bounce the broker 
> and allow it to initialize a connection with the new controller. During this 
> time, I see the corresponding Kafka server logs:
> +Affected Broker Logs+
> {code:java}
> [2023-10-24 01:30:00,104] INFO
> [zk-broker-4-to-controller-alter-partition-channel-manager]: Recorded new 
> controller, from now on will use node kafka3-app.prod.net:9192 (id: 3 rack: ) 
> (kafka.server.BrokerToControllerRequestThread)
> [2023-10-24 01:30:00,104] WARN [BrokerToControllerChannelManager id=4 
> name=alter-partition] Connection to node 3 (kafka3-app.prod.net:9192) could 
> not be established. Broker may not be available. 
> (org.apache.kafka.clients.NetworkClient)
> [2023-10-24 01:30:00,104] INFO [BrokerToControllerChannelManager id=4 
> name=alter-partition] Node 3 disconnected. 
> (org.apache.kafka.clients.NetworkClient) 
> ...{code}
> {code:java}
> [2023-10-26 02:42:35,166] WARN [RPC ProducerId Manager 103]: Timed out when 
> requesting AllocateProducerIds from the controller. 
> (kafka.coordinator.transaction.RPCProducerIdManager)  
> org.apache.kafka.common.errors.CoordinatorLoadInProgressException: Timed out 
> waiting for next producer ID block
> ...{code}
>  
> +New Controller Logs+
> {code:java}
> [2023-10-24 01:33:20,569] WARN [RequestSendThread controllerId=11] Controller 
> 11's connection to broker kafka4-app.prod.net:9192 (id: 4 rack: ) was 
> unsuccessful (kafka.controller.RequestSendThread)
> java.net.SocketTimeoutException: Failed to connect within 30000 ms
> at 
> kafka.controller.RequestSendThread.brokerReady(ControllerChannelManager.scala:297)
> at 
> kafka.controller.RequestSendThread.doWork(ControllerChannelManager.scala:249)
> at 
> org.apache.kafka.server.util.ShutdownableThread.run(ShutdownableThread.java:127)
>  
> ...{code}
>  
> This issue seems very similar to some previous tickets I found:
>  * https://issues.apache.org/jira/browse/KAFKA-7697 
>  * https://issues.apache.org/jira/browse/KAFKA-14694 
> Unfortunately I have not been able to grab a threaddump because of the 
> sporadic nature of the issue, but will update the ticket if I am able.
>  
> This may also affect other versions of Kafka, but we've only experienced it 
> with v3.5.0 + IBP 3.5. Curiously enough I haven't encountered this issue with 
> numerous smaller throughput clusters, so there seems to be some sort of 
> load-induced trigger here. I have also not seen this issue with v3.5.0 + IBP 
> 2.5, assumedly because the ALLOCATE_PRODUCER_IDS and ALTER_PARTITION features 
> are not yet enabled until IBP 2.7 and 3.0, respectively.
>  
> Is there any guidance on whether this failure path is possible as described 
> and what steps can be taken to mitigate the issue?
> Thank you!
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to