[jira] [Comment Edited] (KAFKA-13388) Kafka Producer nodes stuck in CHECKING_API_VERSIONS

2021-12-10 Thread David Mao (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-13388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17457216#comment-17457216
 ] 

David Mao edited comment on KAFKA-13388 at 12/10/21, 4:43 PM:
--

We should probably bump up the priority of this Jira to Major or Critical since 
this prevents a producer from being able to recover its connection to a node 
until it restarts, or the connection gets idle-killed. I'm not sure what the 
impact is on the consumer or admin client.


was (Author: david.mao):
We should probably bump up the priority of this Jira to Major or Critical since 
this prevents a producer from being able to recover its connection to a node 
until it restarts, or the connection gets idle-killed.

> Kafka Producer nodes stuck in CHECKING_API_VERSIONS
> ---
>
> Key: KAFKA-13388
> URL: https://issues.apache.org/jira/browse/KAFKA-13388
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Reporter: David Hoffman
>Priority: Critical
> Attachments: Screen Shot 2021-10-25 at 10.28.48 AM.png, 
> image-2021-10-21-13-42-06-528.png
>
>
> I have been seeing expired batch errors in my app.
> {code:java}
> org.apache.kafka.common.errors.TimeoutException: Expiring 51 record(s) for 
> xxx-17:120002 ms has passed since batch creation
> {code}
>  I would have assumed a request timout or connection timeout should have also 
> been logged. I could not find any other associated errors. 
> I added some instrumenting to my app and have traced this down to broker 
> connections hanging in CHECKING_API_VERSIONS state. -It appears there is no 
> effective timeout for Kafka Producer broker connections in 
> CHECKING_API_VERSIONS state.-
> In the code see the after the NetworkClient connects to a broker node it 
> makes a request to check api versions, when it receives the response it marks 
> the node as ready. -I am seeing that sometimes a reply is not received for 
> the check api versions request the connection just hangs in 
> CHECKING_API_VERSIONS state until it is disposed I assume after the idle 
> connection timeout.-
> Update: not actually sure what causes the connection to get stuck in 
> CHECKING_API_VERSIONS.
> -I am guessing the connection setup timeout should be still in play for this, 
> but it is not.- 
>  -There is a connectingNodes set that is consulted when checking timeouts and 
> the node is removed- 
>  -when ClusterConnectionStates.checkingApiVersions(String id) is called to 
> transition the node into CHECKING_API_VERSIONS-



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Comment Edited] (KAFKA-13388) Kafka Producer nodes stuck in CHECKING_API_VERSIONS

2021-12-09 Thread David Mao (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-13388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17456854#comment-17456854
 ] 

David Mao edited comment on KAFKA-13388 at 12/10/21, 2:10 AM:
--

[~dhofftgt] 

Looking at where the NetworkClient enters the CHECKING_API_VERSIONS state, we 
see:

 
{code:java}
if (discoverBrokerVersions) {
 this.connectionStates.checkingApiVersions(node); 
 nodesNeedingApiVersionsFetch.put(node, new ApiVersionsRequest.Builder());
 {code}
which is a separate queue for nodes needing to send the api versions request.

Then in 

 
{code:java}
private void handleInitiateApiVersionRequests(long now) {
Iterator> iter = 
nodesNeedingApiVersionsFetch.entrySet().iterator();
while (iter.hasNext()) {
Map.Entry entry = iter.next();
String node = entry.getKey();
if (selector.isChannelReady(node) && 
inFlightRequests.canSendMore(node)) {
log.debug("Initiating API versions fetch from node {}.", node);
ApiVersionsRequest.Builder apiVersionRequestBuilder = 
entry.getValue();
ClientRequest clientRequest = newClientRequest(node, 
apiVersionRequestBuilder, now, true);
doSend(clientRequest, true, now);
iter.remove();
}{code}
we only send out the api versions request if the channel is ready (TLS 
handshake complete, SASL handshake complete).

This is actually a pretty insidious bug because I think we end up in a state 
where we do not apply any request timeout to the channel if there is some delay 
in completing any of the handshaking/authentication steps, since the inflight 
requests are empty.


was (Author: david.mao):
[~dhofftgt] 

Looking at where the NetworkClient enters the CHECKING_API_VERSIONS state, we 
see:

 
{code:java}
if (discoverBrokerVersions) {
 this.connectionStates.checkingApiVersions(node); 
 nodesNeedingApiVersionsFetch.put(node, new ApiVersionsRequest.Builder());
 {code}
which is a separate queue for nodes needing to send the api versions request.

Then in 

 
{code:java}
private void handleInitiateApiVersionRequests(long now) {
Iterator> iter = 
nodesNeedingApiVersionsFetch.entrySet().iterator();
while (iter.hasNext()) {
Map.Entry entry = iter.next();
String node = entry.getKey();
if (selector.isChannelReady(node) && 
inFlightRequests.canSendMore(node)) {
log.debug("Initiating API versions fetch from node {}.", node);
ApiVersionsRequest.Builder apiVersionRequestBuilder = 
entry.getValue();
ClientRequest clientRequest = newClientRequest(node, 
apiVersionRequestBuilder, now, true);
doSend(clientRequest, true, now);
iter.remove();
}{code}
we only send out the api versions request if the channel is ready (TLS 
handshake complete, SASL handshake complete).

This is actually a pretty insidious bug because I think we end up in a state 
where we do not apply any request timeout to the channel if there is some 
problem completing any of the handshaking/authentication steps, since the 
inflight requests are empty.

> Kafka Producer nodes stuck in CHECKING_API_VERSIONS
> ---
>
> Key: KAFKA-13388
> URL: https://issues.apache.org/jira/browse/KAFKA-13388
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Reporter: David Hoffman
>Priority: Minor
> Attachments: Screen Shot 2021-10-25 at 10.28.48 AM.png, 
> image-2021-10-21-13-42-06-528.png
>
>
> I have been seeing expired batch errors in my app.
> {code:java}
> org.apache.kafka.common.errors.TimeoutException: Expiring 51 record(s) for 
> xxx-17:120002 ms has passed since batch creation
> {code}
>  I would have assumed a request timout or connection timeout should have also 
> been logged. I could not find any other associated errors. 
> I added some instrumenting to my app and have traced this down to broker 
> connections hanging in CHECKING_API_VERSIONS state. -It appears there is no 
> effective timeout for Kafka Producer broker connections in 
> CHECKING_API_VERSIONS state.-
> In the code see the after the NetworkClient connects to a broker node it 
> makes a request to check api versions, when it receives the response it marks 
> the node as ready. -I am seeing that sometimes a reply is not received for 
> the check api versions request the connection just hangs in 
> CHECKING_API_VERSIONS state until it is disposed I assume after the idle 
> connection timeout.-
> Update: not actually sure what causes the connection to get stuck in 
> CHECKING_API_VERSIONS.
> -I am guessing the connection setup timeout should be still in play for this, 
> but it is not.- 
>  -There is a connectingNodes set that is consulted when checking timeouts and 
> the node is removed- 
>  -when 

[jira] [Comment Edited] (KAFKA-13388) Kafka Producer nodes stuck in CHECKING_API_VERSIONS

2021-12-09 Thread David Mao (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-13388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17456854#comment-17456854
 ] 

David Mao edited comment on KAFKA-13388 at 12/10/21, 1:56 AM:
--

[~dhofftgt] 

Looking at where the NetworkClient enters the CHECKING_API_VERSIONS state, we 
see:

 
{code:java}
if (discoverBrokerVersions) {
 this.connectionStates.checkingApiVersions(node); 
 nodesNeedingApiVersionsFetch.put(node, new ApiVersionsRequest.Builder());
 {code}
which is a separate queue for nodes needing to send the api versions request.

Then in 

 
{code:java}
private void handleInitiateApiVersionRequests(long now) {
Iterator> iter = 
nodesNeedingApiVersionsFetch.entrySet().iterator();
while (iter.hasNext()) {
Map.Entry entry = iter.next();
String node = entry.getKey();
if (selector.isChannelReady(node) && 
inFlightRequests.canSendMore(node)) {
log.debug("Initiating API versions fetch from node {}.", node);
ApiVersionsRequest.Builder apiVersionRequestBuilder = 
entry.getValue();
ClientRequest clientRequest = newClientRequest(node, 
apiVersionRequestBuilder, now, true);
doSend(clientRequest, true, now);
iter.remove();
}{code}
we only send out the api versions request if the channel is ready (TLS 
handshake complete, SASL handshake complete).

This is actually a pretty insidious bug because I think we end up in a state 
where we do not apply any request timeout to the channel if there is some 
problem completing any of the handshaking/authentication steps, since the 
inflight requests are empty.


was (Author: david.mao):
[~dhofftgt] 

Looking at where the NetworkClient enters the CHECKING_API_VERSIONS state, we 
see:

 
{code:java}
if (discoverBrokerVersions) {
 this.connectionStates.checkingApiVersions(node); 
 nodesNeedingApiVersionsFetch.put(node, new ApiVersionsRequest.Builder());
 {code}
which is a separate queue for nodes needing to send the api versions request.

Then in 

 
{code:java}
private void handleInitiateApiVersionRequests(long now) {
Iterator> iter = 
nodesNeedingApiVersionsFetch.entrySet().iterator();
while (iter.hasNext()) {
Map.Entry entry = iter.next();
String node = entry.getKey();
if (selector.isChannelReady(node) && 
inFlightRequests.canSendMore(node)) {
log.debug("Initiating API versions fetch from node {}.", node);
ApiVersionsRequest.Builder apiVersionRequestBuilder = 
entry.getValue();
ClientRequest clientRequest = newClientRequest(node, 
apiVersionRequestBuilder, now, true);
doSend(clientRequest, true, now);
iter.remove();
}{code}
we only send out the api versions request if the channel is ready (TLS 
handshake complete, SASL handshake complete).

This is actually a pretty insidious bug because I think we end up in a state 
where we do not apply any request timeout to the channel, since the inflight 
requests are empty.

> Kafka Producer nodes stuck in CHECKING_API_VERSIONS
> ---
>
> Key: KAFKA-13388
> URL: https://issues.apache.org/jira/browse/KAFKA-13388
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Reporter: David Hoffman
>Priority: Minor
> Attachments: Screen Shot 2021-10-25 at 10.28.48 AM.png, 
> image-2021-10-21-13-42-06-528.png
>
>
> I have been seeing expired batch errors in my app.
> {code:java}
> org.apache.kafka.common.errors.TimeoutException: Expiring 51 record(s) for 
> xxx-17:120002 ms has passed since batch creation
> {code}
>  I would have assumed a request timout or connection timeout should have also 
> been logged. I could not find any other associated errors. 
> I added some instrumenting to my app and have traced this down to broker 
> connections hanging in CHECKING_API_VERSIONS state. -It appears there is no 
> effective timeout for Kafka Producer broker connections in 
> CHECKING_API_VERSIONS state.-
> In the code see the after the NetworkClient connects to a broker node it 
> makes a request to check api versions, when it receives the response it marks 
> the node as ready. -I am seeing that sometimes a reply is not received for 
> the check api versions request the connection just hangs in 
> CHECKING_API_VERSIONS state until it is disposed I assume after the idle 
> connection timeout.-
> Update: not actually sure what causes the connection to get stuck in 
> CHECKING_API_VERSIONS.
> -I am guessing the connection setup timeout should be still in play for this, 
> but it is not.- 
>  -There is a connectingNodes set that is consulted when checking timeouts and 
> the node is removed- 
>  -when ClusterConnectionStates.checkingApiVersions(String id) is called to 
> transition the node into CHECKING_API_VERSIONS-




[jira] [Comment Edited] (KAFKA-13388) Kafka Producer nodes stuck in CHECKING_API_VERSIONS

2021-11-05 Thread David Mao (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-13388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17439526#comment-17439526
 ] 

David Mao edited comment on KAFKA-13388 at 11/5/21, 10:20 PM:
--

[~dhofftgt]

Why do we expect a connection in CHECKING_API_VERSIONS to have in-flight 
requests?

I would expect the opposite: 

if a connection is in CHECKING_API_VERSIONS, it will *not* be ready for 
requests (at this point, the client does not know what API versions the broker 
supports, so it can't serialize requests to the appropriate version), so it 
should not have any inflight requests.


was (Author: david.mao):
[~dhofftgt]

Why do we expect a connection in CHECKING_API_VERSIONS to have in-flight 
requests?

I would expect the opposite: 

if a connection is in CHECKING_API_VERSIONS, it will *not* be ready for 
requests (at this point, the client does not know what API versions the broker 
supports, so it can't serialize requests to the appropriate version).

> Kafka Producer nodes stuck in CHECKING_API_VERSIONS
> ---
>
> Key: KAFKA-13388
> URL: https://issues.apache.org/jira/browse/KAFKA-13388
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Reporter: David Hoffman
>Priority: Minor
> Attachments: Screen Shot 2021-10-25 at 10.28.48 AM.png, 
> image-2021-10-21-13-42-06-528.png
>
>
> I have been seeing expired batch errors in my app.
> {code:java}
> org.apache.kafka.common.errors.TimeoutException: Expiring 51 record(s) for 
> xxx-17:120002 ms has passed since batch creation
> {code}
>  I would have assumed a request timout or connection timeout should have also 
> been logged. I could not find any other associated errors. 
> I added some instrumenting to my app and have traced this down to broker 
> connections hanging in CHECKING_API_VERSIONS state. -It appears there is no 
> effective timeout for Kafka Producer broker connections in 
> CHECKING_API_VERSIONS state.-
> In the code see the after the NetworkClient connects to a broker node it 
> makes a request to check api versions, when it receives the response it marks 
> the node as ready. -I am seeing that sometimes a reply is not received for 
> the check api versions request the connection just hangs in 
> CHECKING_API_VERSIONS state until it is disposed I assume after the idle 
> connection timeout.-
> Update: not actually sure what causes the connection to get stuck in 
> CHECKING_API_VERSIONS.
> -I am guessing the connection setup timeout should be still in play for this, 
> but it is not.- 
>  -There is a connectingNodes set that is consulted when checking timeouts and 
> the node is removed- 
>  -when ClusterConnectionStates.checkingApiVersions(String id) is called to 
> transition the node into CHECKING_API_VERSIONS-



--
This message was sent by Atlassian Jira
(v8.3.4#803005)