[jira] [Comment Edited] (KAFKA-13388) Kafka Producer nodes stuck in CHECKING_API_VERSIONS
[ https://issues.apache.org/jira/browse/KAFKA-13388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17457216#comment-17457216 ]

David Mao edited comment on KAFKA-13388 at 12/10/21, 4:43 PM:
--

We should probably bump up the priority of this Jira to Major or Critical, since this prevents a producer from recovering its connection to a node until it restarts or the connection gets idle-killed. I'm not sure what the impact is on the consumer or admin client.

was (Author: david.mao):
We should probably bump up the priority of this Jira to Major or Critical, since this prevents a producer from recovering its connection to a node until it restarts or the connection gets idle-killed.

> Kafka Producer nodes stuck in CHECKING_API_VERSIONS
> ---
>
> Key: KAFKA-13388
> URL: https://issues.apache.org/jira/browse/KAFKA-13388
> Project: Kafka
> Issue Type: Bug
> Components: core
> Reporter: David Hoffman
> Priority: Critical
> Attachments: Screen Shot 2021-10-25 at 10.28.48 AM.png, image-2021-10-21-13-42-06-528.png
>
>
> I have been seeing expired batch errors in my app.
> {code:java}
> org.apache.kafka.common.errors.TimeoutException: Expiring 51 record(s) for xxx-17:120002 ms has passed since batch creation
> {code}
> I would have assumed a request timeout or connection timeout should also have been logged. I could not find any other associated errors.
> I added some instrumentation to my app and have traced this down to broker connections hanging in CHECKING_API_VERSIONS state. -It appears there is no effective timeout for Kafka Producer broker connections in CHECKING_API_VERSIONS state.-
> In the code, after the NetworkClient connects to a broker node, it makes a request to check API versions; when it receives the response, it marks the node as ready. -I am seeing that sometimes a reply is not received for the check API versions request, and the connection just hangs in CHECKING_API_VERSIONS state until it is disposed, I assume after the idle connection timeout.-
> Update: not actually sure what causes the connection to get stuck in CHECKING_API_VERSIONS.
> -I am guessing the connection setup timeout should still be in play for this, but it is not.-
> -There is a connectingNodes set that is consulted when checking timeouts and the node is removed-
> -when ClusterConnectionStates.checkingApiVersions(String id) is called to transition the node into CHECKING_API_VERSIONS-

--
This message was sent by Atlassian Jira
(v8.20.1#820001)
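For orientation, the timeouts mentioned in the report correspond roughly to the producer configuration keys below. This is a sketch, not taken from the reporter's application: the class name `ProducerTimeoutSettings` is hypothetical, and the values are intended to mirror commonly documented producer defaults, which should be checked against the client version in use. The 120002 ms in the error message is consistent with a `delivery.timeout.ms` of about 120000 ms.

```java
import java.util.Properties;

// Hypothetical helper for illustration; not part of Kafka or of the reporter's app.
final class ProducerTimeoutSettings {

    // Collects the timeout-related producer settings discussed in the report.
    // Values are assumptions meant to mirror documented defaults; verify them
    // against the documentation for your client version.
    static Properties timeouts() {
        Properties props = new Properties();
        // Upper bound on the time to report success or failure of a send;
        // batches older than this are expired.
        props.setProperty("delivery.timeout.ms", "120000");
        // Maximum wait for the response to a request that was actually sent.
        props.setProperty("request.timeout.ms", "30000");
        // Maximum wait for the socket connection to be established.
        props.setProperty("socket.connection.setup.timeout.ms", "10000");
        // Idle connections are closed after this long.
        props.setProperty("connections.max.idle.ms", "540000");
        return props;
    }

    public static void main(String[] args) {
        timeouts().forEach((key, value) -> System.out.println(key + "=" + value));
    }
}
```

Note that none of these settings, as described in this thread, bounds the time a connection spends waiting in CHECKING_API_VERSIONS except the idle timeout.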
[jira] [Comment Edited] (KAFKA-13388) Kafka Producer nodes stuck in CHECKING_API_VERSIONS
[ https://issues.apache.org/jira/browse/KAFKA-13388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17456854#comment-17456854 ]

David Mao edited comment on KAFKA-13388 at 12/10/21, 2:10 AM:
--

[~dhofftgt]

Looking at where the NetworkClient enters the CHECKING_API_VERSIONS state, we see:
{code:java}
if (discoverBrokerVersions) {
    this.connectionStates.checkingApiVersions(node);
    nodesNeedingApiVersionsFetch.put(node, new ApiVersionsRequest.Builder());
    // ... (rest of the branch omitted in this excerpt)
}
{code}
which is a separate queue for nodes needing to send the API versions request. Then in
{code:java}
private void handleInitiateApiVersionRequests(long now) {
    Iterator<Map.Entry<String, ApiVersionsRequest.Builder>> iter = nodesNeedingApiVersionsFetch.entrySet().iterator();
    while (iter.hasNext()) {
        Map.Entry<String, ApiVersionsRequest.Builder> entry = iter.next();
        String node = entry.getKey();
        // The request is only sent once the channel has finished its handshakes.
        if (selector.isChannelReady(node) && inFlightRequests.canSendMore(node)) {
            log.debug("Initiating API versions fetch from node {}.", node);
            ApiVersionsRequest.Builder apiVersionRequestBuilder = entry.getValue();
            ClientRequest clientRequest = newClientRequest(node, apiVersionRequestBuilder, now, true);
            doSend(clientRequest, true, now);
            iter.remove();
        }
    }
}
{code}
we only send out the API versions request if the channel is ready (TLS handshake complete, SASL handshake complete).

This is actually a pretty insidious bug, because I think we end up in a state where we do not apply any request timeout to the channel if there is some delay in completing any of the handshaking/authentication steps, since the in-flight requests are empty.

was (Author: david.mao):
[~dhofftgt]

Looking at where the NetworkClient enters the CHECKING_API_VERSIONS state, we see:
{code:java}
if (discoverBrokerVersions) {
    this.connectionStates.checkingApiVersions(node);
    nodesNeedingApiVersionsFetch.put(node, new ApiVersionsRequest.Builder());
    // ... (rest of the branch omitted in this excerpt)
}
{code}
which is a separate queue for nodes needing to send the API versions request. Then in
{code:java}
private void handleInitiateApiVersionRequests(long now) {
    Iterator<Map.Entry<String, ApiVersionsRequest.Builder>> iter = nodesNeedingApiVersionsFetch.entrySet().iterator();
    while (iter.hasNext()) {
        Map.Entry<String, ApiVersionsRequest.Builder> entry = iter.next();
        String node = entry.getKey();
        // The request is only sent once the channel has finished its handshakes.
        if (selector.isChannelReady(node) && inFlightRequests.canSendMore(node)) {
            log.debug("Initiating API versions fetch from node {}.", node);
            ApiVersionsRequest.Builder apiVersionRequestBuilder = entry.getValue();
            ClientRequest clientRequest = newClientRequest(node, apiVersionRequestBuilder, now, true);
            doSend(clientRequest, true, now);
            iter.remove();
        }
    }
}
{code}
we only send out the API versions request if the channel is ready (TLS handshake complete, SASL handshake complete).

This is actually a pretty insidious bug, because I think we end up in a state where we do not apply any request timeout to the channel if there is some problem completing any of the handshaking/authentication steps, since the in-flight requests are empty.

> Kafka Producer nodes stuck in CHECKING_API_VERSIONS
> ---
>
> Key: KAFKA-13388
> URL: https://issues.apache.org/jira/browse/KAFKA-13388
> Project: Kafka
> Issue Type: Bug
> Components: core
> Reporter: David Hoffman
> Priority: Minor
> Attachments: Screen Shot 2021-10-25 at 10.28.48 AM.png, image-2021-10-21-13-42-06-528.png
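The gap described in the comment (a node queued for an API versions request while its in-flight set stays empty) can be reproduced in a small toy model. This is a sketch under stated assumptions: `ToyNetworkClient` and all of its methods are hypothetical stand-ins, not Kafka's `NetworkClient`, `Selector`, or `InFlightRequests`. It only illustrates why a timeout check that iterates over in-flight requests never fires for a node whose channel has not become ready.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Toy model only: hypothetical names, not Kafka classes.
final class ToyNetworkClient {
    private final long requestTimeoutMs;
    // Nodes waiting to send their API versions request.
    private final Set<String> nodesNeedingApiVersionsFetch = new LinkedHashSet<>();
    // Send timestamps of requests that were actually written, per node.
    private final Map<String, List<Long>> inFlightSendTimes = new HashMap<>();
    // Nodes whose channel has completed its TLS/SASL handshakes.
    private final Set<String> readyChannels = new LinkedHashSet<>();

    ToyNetworkClient(long requestTimeoutMs) {
        this.requestTimeoutMs = requestTimeoutMs;
    }

    // Socket connected: the node is queued, but nothing is in flight yet.
    void connectionEstablished(String node) {
        nodesNeedingApiVersionsFetch.add(node);
    }

    void channelBecameReady(String node) {
        readyChannels.add(node);
    }

    // Sends the queued request only for nodes whose channel is ready.
    void initiateApiVersionRequests(long now) {
        nodesNeedingApiVersionsFetch.removeIf(node -> {
            if (!readyChannels.contains(node)) {
                return false; // still handshaking: stays queued, nothing in flight
            }
            inFlightSendTimes.computeIfAbsent(node, k -> new ArrayList<>()).add(now);
            return true;
        });
    }

    // The request timeout is evaluated only against requests that were sent.
    List<String> nodesWithTimedOutRequests(long now) {
        List<String> timedOut = new ArrayList<>();
        for (Map.Entry<String, List<Long>> e : inFlightSendTimes.entrySet()) {
            for (long sentAt : e.getValue()) {
                if (now - sentAt > requestTimeoutMs) {
                    timedOut.add(e.getKey());
                    break;
                }
            }
        }
        return timedOut;
    }

    boolean isWaitingToSendApiVersions(String node) {
        return nodesNeedingApiVersionsFetch.contains(node);
    }
}
```

In this model, a node whose channel never becomes ready remains queued indefinitely and is never reported by `nodesWithTimedOutRequests`, however large `now` grows; only a separate mechanism (in the thread, the idle-connection kill) would remove it.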
[jira] [Comment Edited] (KAFKA-13388) Kafka Producer nodes stuck in CHECKING_API_VERSIONS
[ https://issues.apache.org/jira/browse/KAFKA-13388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17456854#comment-17456854 ]

David Mao edited comment on KAFKA-13388 at 12/10/21, 1:56 AM:
--

[~dhofftgt]

Looking at where the NetworkClient enters the CHECKING_API_VERSIONS state, we see:
{code:java}
if (discoverBrokerVersions) {
    this.connectionStates.checkingApiVersions(node);
    nodesNeedingApiVersionsFetch.put(node, new ApiVersionsRequest.Builder());
    // ... (rest of the branch omitted in this excerpt)
}
{code}
which is a separate queue for nodes needing to send the API versions request. Then in
{code:java}
private void handleInitiateApiVersionRequests(long now) {
    Iterator<Map.Entry<String, ApiVersionsRequest.Builder>> iter = nodesNeedingApiVersionsFetch.entrySet().iterator();
    while (iter.hasNext()) {
        Map.Entry<String, ApiVersionsRequest.Builder> entry = iter.next();
        String node = entry.getKey();
        // The request is only sent once the channel has finished its handshakes.
        if (selector.isChannelReady(node) && inFlightRequests.canSendMore(node)) {
            log.debug("Initiating API versions fetch from node {}.", node);
            ApiVersionsRequest.Builder apiVersionRequestBuilder = entry.getValue();
            ClientRequest clientRequest = newClientRequest(node, apiVersionRequestBuilder, now, true);
            doSend(clientRequest, true, now);
            iter.remove();
        }
    }
}
{code}
we only send out the API versions request if the channel is ready (TLS handshake complete, SASL handshake complete).

This is actually a pretty insidious bug, because I think we end up in a state where we do not apply any request timeout to the channel if there is some problem completing any of the handshaking/authentication steps, since the in-flight requests are empty.

was (Author: david.mao):
[~dhofftgt]

Looking at where the NetworkClient enters the CHECKING_API_VERSIONS state, we see:
{code:java}
if (discoverBrokerVersions) {
    this.connectionStates.checkingApiVersions(node);
    nodesNeedingApiVersionsFetch.put(node, new ApiVersionsRequest.Builder());
    // ... (rest of the branch omitted in this excerpt)
}
{code}
which is a separate queue for nodes needing to send the API versions request. Then in
{code:java}
private void handleInitiateApiVersionRequests(long now) {
    Iterator<Map.Entry<String, ApiVersionsRequest.Builder>> iter = nodesNeedingApiVersionsFetch.entrySet().iterator();
    while (iter.hasNext()) {
        Map.Entry<String, ApiVersionsRequest.Builder> entry = iter.next();
        String node = entry.getKey();
        // The request is only sent once the channel has finished its handshakes.
        if (selector.isChannelReady(node) && inFlightRequests.canSendMore(node)) {
            log.debug("Initiating API versions fetch from node {}.", node);
            ApiVersionsRequest.Builder apiVersionRequestBuilder = entry.getValue();
            ClientRequest clientRequest = newClientRequest(node, apiVersionRequestBuilder, now, true);
            doSend(clientRequest, true, now);
            iter.remove();
        }
    }
}
{code}
we only send out the API versions request if the channel is ready (TLS handshake complete, SASL handshake complete).

This is actually a pretty insidious bug, because I think we end up in a state where we do not apply any request timeout to the channel, since the in-flight requests are empty.

> Kafka Producer nodes stuck in CHECKING_API_VERSIONS
> ---
>
> Key: KAFKA-13388
> URL: https://issues.apache.org/jira/browse/KAFKA-13388
> Project: Kafka
> Issue Type: Bug
> Components: core
> Reporter: David Hoffman
> Priority: Minor
> Attachments: Screen Shot 2021-10-25 at 10.28.48 AM.png, image-2021-10-21-13-42-06-528.png
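One conceivable way to bound the wait discussed in this comment is to record when a node starts waiting to send its API versions request and to expire it after a setup deadline. The sketch below is hypothetical and is not presented as the change adopted for KAFKA-13388: `HandshakeDeadlineTracker`, its methods, and the idea of reusing a setup-timeout value are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration; not Kafka code and not the merged fix.
final class HandshakeDeadlineTracker {
    private final long setupTimeoutMs;
    // Time at which each node started waiting to send its first request.
    private final Map<String, Long> waitingSince = new HashMap<>();

    HandshakeDeadlineTracker(long setupTimeoutMs) {
        this.setupTimeoutMs = setupTimeoutMs;
    }

    // Called when the node is queued for its API versions request.
    void startedWaiting(String node, long now) {
        waitingSince.put(node, now);
    }

    // Called once the request is written; the normal request timeout
    // can take over from here.
    void requestSent(String node) {
        waitingSince.remove(node);
    }

    // Nodes that have been waiting longer than the setup timeout and
    // would be candidates for disconnect-and-retry.
    List<String> expired(long now) {
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, Long> e : waitingSince.entrySet()) {
            if (now - e.getValue() > setupTimeoutMs) {
                result.add(e.getKey());
            }
        }
        return result;
    }
}
```

The design point is simply that some timer must cover the interval between "socket connected" and "first request in flight"; which existing timeout should own that interval is a separate question for the actual fix.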
[jira] [Comment Edited] (KAFKA-13388) Kafka Producer nodes stuck in CHECKING_API_VERSIONS
[ https://issues.apache.org/jira/browse/KAFKA-13388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17439526#comment-17439526 ]

David Mao edited comment on KAFKA-13388 at 11/5/21, 10:20 PM:
--

[~dhofftgt]

Why do we expect a connection in CHECKING_API_VERSIONS to have in-flight requests?

I would expect the opposite: if a connection is in CHECKING_API_VERSIONS, it will *not* be ready for requests (at this point, the client does not know what API versions the broker supports, so it can't serialize requests to the appropriate version), so it should not have any in-flight requests.

was (Author: david.mao):
[~dhofftgt]

Why do we expect a connection in CHECKING_API_VERSIONS to have in-flight requests?

I would expect the opposite: if a connection is in CHECKING_API_VERSIONS, it will *not* be ready for requests (at this point, the client does not know what API versions the broker supports, so it can't serialize requests to the appropriate version).

> Kafka Producer nodes stuck in CHECKING_API_VERSIONS
> ---
>
> Key: KAFKA-13388
> URL: https://issues.apache.org/jira/browse/KAFKA-13388
> Project: Kafka
> Issue Type: Bug
> Components: core
> Reporter: David Hoffman
> Priority: Minor
> Attachments: Screen Shot 2021-10-25 at 10.28.48 AM.png, image-2021-10-21-13-42-06-528.png

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
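The reasoning in this comment can be pictured as a small state machine in which ordinary requests are accepted only once the node is READY. This is a simplified toy model: `ToyConnectionState` and `ToyNodeConnection` are hypothetical names, not Kafka's own classes, and the model deliberately covers only ordinary client requests, leaving the API versions request itself out of scope.

```java
// Toy model only: a simplified stand-in, not Kafka's connection-state tracking.
enum ToyConnectionState { DISCONNECTED, CONNECTING, CHECKING_API_VERSIONS, READY }

final class ToyNodeConnection {
    private ToyConnectionState state = ToyConnectionState.DISCONNECTED;
    private int inFlight = 0;

    void startConnecting() {
        state = ToyConnectionState.CONNECTING;
    }

    // Socket connected; the broker's supported API versions are still unknown.
    void connected() {
        state = ToyConnectionState.CHECKING_API_VERSIONS;
    }

    // API versions are known, so requests can be serialized at a supported version.
    void apiVersionsReceived() {
        state = ToyConnectionState.READY;
    }

    ToyConnectionState state() {
        return state;
    }

    // Ordinary requests are rejected unless the node is READY, so nothing
    // accumulates in flight while the node is in CHECKING_API_VERSIONS.
    boolean trySend() {
        if (state != ToyConnectionState.READY) {
            return false;
        }
        inFlight++;
        return true;
    }

    int inFlightCount() {
        return inFlight;
    }
}
```

Under this model a node in CHECKING_API_VERSIONS always reports zero in-flight requests, which is why an empty in-flight set cannot by itself be taken as a sign that the connection is healthy.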