[ https://issues.apache.org/jira/browse/KAFKA-13388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17457274#comment-17457274 ]
David Jacot commented on KAFKA-13388:
-------------------------------------

I can think of a few ways to fix this:

1) We could decide to include the ApiVersions step as part of the connection time. This means that we would not reset the connection time while transitioning to the CHECKING_API_VERSIONS state but only when transitioning to the READY state.

2) We could reset the connection time in `handleInitiateApiVersionRequests` only when the ApiVersions request is sent out.

3) We might be able to just remove the `selector.isChannelReady(node) && inFlightRequests.canSendMore(node)` condition in `handleInitiateApiVersionRequests`. I don't really understand what would prevent us from sending the request directly. This way the connection would time out based on the request timeout.

> Kafka Producer nodes stuck in CHECKING_API_VERSIONS
> ---------------------------------------------------
>
>                 Key: KAFKA-13388
>                 URL: https://issues.apache.org/jira/browse/KAFKA-13388
>             Project: Kafka
>          Issue Type: Bug
>          Components: core
>            Reporter: David Hoffman
>            Priority: Critical
>         Attachments: Screen Shot 2021-10-25 at 10.28.48 AM.png,
> image-2021-10-21-13-42-06-528.png
>
>
> I have been seeing expired batch errors in my app.
> {code:java}
> org.apache.kafka.common.errors.TimeoutException: Expiring 51 record(s) for
> xxx-17:120002 ms has passed since batch creation
> {code}
> I would have assumed a request timeout or connection timeout should have also
> been logged. I could not find any other associated errors.
> I added some instrumentation to my app and have traced this down to broker
> connections hanging in CHECKING_API_VERSIONS state. -It appears there is no
> effective timeout for Kafka Producer broker connections in
> CHECKING_API_VERSIONS state.-
> In the code I see that after the NetworkClient connects to a broker node it
> makes a request to check api versions; when it receives the response it marks
> the node as ready. -I am seeing that sometimes a reply is not received for
> the check api versions request and the connection just hangs in
> CHECKING_API_VERSIONS state until it is disposed, I assume after the idle
> connection timeout.-
> Update: not actually sure what causes the connection to get stuck in
> CHECKING_API_VERSIONS.
> -I am guessing the connection setup timeout should still be in play for this,
> but it is not.-
> -There is a connectingNodes set that is consulted when checking timeouts and
> the node is removed-
> -when ClusterConnectionStates.checkingApiVersions(String id) is called to
> transition the node into CHECKING_API_VERSIONS-

--
This message was sent by Atlassian Jira
(v8.20.1#820001)
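
To make the timeout gap in the report and option 1 from the comment concrete, here is a minimal Java sketch. It is a simplified model under stated assumptions, not Kafka's actual NetworkClient or ClusterConnectionStates code: the class `ConnectionTimeoutSketch`, its methods, and the `trackUntilReady` flag are hypothetical names introduced only for illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical, simplified model of per-node connection state tracking.
// It is NOT Kafka's implementation; it only illustrates the idea discussed above.
class ConnectionTimeoutSketch {
    enum State { CONNECTING, CHECKING_API_VERSIONS, READY }

    private final Map<String, State> states = new HashMap<>();
    // Nodes still subject to the setup timeout, mapped to the time the connect started (ms).
    private final Map<String, Long> setupStartMs = new HashMap<>();
    private final long setupTimeoutMs;
    // false: stop tracking on CHECKING_API_VERSIONS (the behaviour described in the report).
    // true:  keep tracking until READY (option 1 in the comment).
    private final boolean trackUntilReady;

    ConnectionTimeoutSketch(long setupTimeoutMs, boolean trackUntilReady) {
        this.setupTimeoutMs = setupTimeoutMs;
        this.trackUntilReady = trackUntilReady;
    }

    void connecting(String id, long nowMs) {
        states.put(id, State.CONNECTING);
        setupStartMs.put(id, nowMs);
    }

    void checkingApiVersions(String id) {
        states.put(id, State.CHECKING_API_VERSIONS);
        if (!trackUntilReady) {
            // The node drops out of timeout tracking before it is usable.
            setupStartMs.remove(id);
        }
    }

    void ready(String id) {
        states.put(id, State.READY);
        setupStartMs.remove(id);
    }

    State state(String id) {
        return states.get(id);
    }

    // True if the node is still tracked and has exceeded the setup timeout.
    boolean isSetupTimedOut(String id, long nowMs) {
        Long start = setupStartMs.get(id);
        return start != null && nowMs - start > setupTimeoutMs;
    }

    public static void main(String[] args) {
        ConnectionTimeoutSketch reported = new ConnectionTimeoutSketch(10_000L, false);
        reported.connecting("node-1", 0L);
        reported.checkingApiVersions("node-1");
        // The node is stuck, but nothing flags it as timed out.
        System.out.println(reported.isSetupTimedOut("node-1", 30_000L)); // false

        ConnectionTimeoutSketch optionOne = new ConnectionTimeoutSketch(10_000L, true);
        optionOne.connecting("node-1", 0L);
        optionOne.checkingApiVersions("node-1");
        // The ApiVersions step now counts toward the setup timeout.
        System.out.println(optionOne.isSetupTimedOut("node-1", 30_000L)); // true
    }
}
```

In this model the only difference between the two runs is whether the node stays in the tracked set while in CHECKING_API_VERSIONS. Options 2 and 3 from the comment would need a model of request sending and are not covered by this sketch.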