Graham Campbell created KAFKA-16902:
---------------------------------------

             Summary: Sender ignores socket connection timeout when 
reconnecting to transaction coordinator
                 Key: KAFKA-16902
                 URL: https://issues.apache.org/jira/browse/KAFKA-16902
             Project: Kafka
          Issue Type: Bug
          Components: producer 
    Affects Versions: 3.6.2, 3.7.0, 3.5.2, 3.4.1, 3.3.2, 3.2.3, 3.1.2, 2.8.2
            Reporter: Graham Campbell


maybeSendAndPollTransactionalRequest checks that the required coordinator node 
is ready before sending transactional requests. It uses 
{{Sender::awaitNodeReady}} and {{NetworkClientUtils::awaitReady}} to do so in a 
blocking manner. 

If the NetworkClient is in the middle of reconnecting to the coordinator due to 
a broker side disconnection (eg. broker restart), the 
socket.connection.setup.timeout.ms config (default 10 seconds) is ignored and 
request.timeout.ms is used (35 seconds). This results in up to 25 extra seconds 
of waiting before hitting the expected timeout. In my connectivity-related 
exceptions are sometimes thrown before the full 35 second timeout (eg. 
NoRouteToHostException, UnknownHostException).

[awaitNodeReady uses 
requestTimeoutMs|https://github.com/apache/kafka/blob/896af1b2f2f2a7d579e0aef074bcb2004c0246f2/clients/src/main/java/org/apache/kafka/clients/producer/internals/Sender.java#L568]
 as the timeout, which [awaitReady passes to the NetworkClient as the poll 
timeout|https://github.com/apache/kafka/blob/896af1b2f2f2a7d579e0aef074bcb2004c0246f2/clients/src/main/java/org/apache/kafka/clients/NetworkClientUtils.java#L73-L74],
 so disconnections aren't processed until after the [selector.poll 
completes|https://github.com/apache/kafka/blob/896af1b2f2f2a7d579e0aef074bcb2004c0246f2/clients/src/main/java/org/apache/kafka/clients/NetworkClient.java#L585].

Logs from a local test with request.timeout.ms=15000 and a patched client with 
some extra logging show the timeout being violated:

 
{code:java}
2024-05-29 21:31:49:736 +0000 [kafka-producer-network-thread | t7] INFO 
org.apache.kafka.clients.NetworkClient - [Producer clientId=t7, 
transactionalId=MyT7] Disconnecting from node 11 due to socket connection setup 
timeout. The timeout value is 10351 ms. 15013 ms since last attempt
2024-05-29 21:31:49:736 +0000 [kafka-producer-network-thread | t7] INFO 
org.apache.kafka.clients.producer.internals.Sender - [Producer clientId=t7, 
transactionalId=MyT7] awaitReady timed out after 15013 ms (threshold of 15000 
ms)
2024-05-29 21:31:49:752 +0000 [kafka-producer-network-thread | t7] INFO 
org.apache.kafka.clients.producer.internals.TransactionManager - [Producer 
clientId=t7, transactionalId=MyT7] Discovered transaction coordinator 
kafka-2.redacted:9093 (id: 2 rack: null) {code}
 

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to