[jira] [Commented] (KAFKA-14317) ProduceRequest timeouts are logged as network exceptions

Kirk True (Jira) Mon, 24 Oct 2022 13:57:06 -0700


    [ 
https://issues.apache.org/jira/browse/KAFKA-14317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17623413#comment-17623413
 ]


Kirk True commented on KAFKA-14317:
-----------------------------------

This looks related to KAFKA-10228, but that Jira is still open and seems to 
suggest only a logging change.

I _believe_ we want to change the behavior to complete the batch using a 
different {{Errors}} type.

> ProduceRequest timeouts are logged as network exceptions
> --------------------------------------------------------
>
>                 Key: KAFKA-14317
>                 URL: https://issues.apache.org/jira/browse/KAFKA-14317
>             Project: Kafka
>          Issue Type: Bug
>          Components: clients, logging, producer 
>    Affects Versions: 3.3.0
>            Reporter: Kirk True
>            Assignee: Kirk True
>            Priority: Major
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> In NetworkClient.handleTimedOutRequests, we disconnect the broker connection:
>  
> {code:java}
> private void handleTimedOutRequests(List<ClientResponse> responses, long now) {
>     List<String> nodeIds = this.inFlightRequests.nodesWithTimedOutRequests(now);
>     for (String nodeId : nodeIds) {
>         // close connection to the node
>         this.selector.close(nodeId);
>         log.debug("Disconnecting from node {} due to request timeout.", nodeId);
>         processDisconnection(responses, nodeId, now, ChannelState.LOCAL_CLOSE);
>     }
> }
> {code}
> This eventually calls cancelInFlightRequests:
> {code:java}
> for (InFlightRequest request : inFlightRequests) {
>     log.trace("Cancelled request {} {} with correlation id {} due to node {} being disconnected",
>         request.header.apiKey(), request.request, request.header.correlationId(), nodeId);
>     
>     if (!request.isInternalRequest) {
>         if (responses != null)
>             responses.add(request.disconnected(now, null));
>     } else if (request.header.apiKey() == ApiKeys.METADATA) {
>         metadataUpdater.handleFailedRequest(now, Optional.empty());
>     }
> }
> {code}
> We set the response to disconnected. In the producer, we complete the record 
> batch with:
> {code:java}
> if (response.wasDisconnected()) {
>     log.trace("Cancelled request with header {} due to node {} being disconnected",
>         requestHeader, response.destination());
>     for (ProducerBatch batch : batches.values())
>         completeBatch(batch, new ProduceResponse.PartitionResponse(Errors.NETWORK_EXCEPTION,
>                 String.format("Disconnected from node %s", response.destination())),
>                 correlationId, now);
> }
> {code}
> This could be confusing for users: after a request timeout they see a 
> network exception rather than a timeout error.
> One implication of completing the batch with a network exception is that the 
> producer will try to refresh metadata after a request timeout. I can see 
> arguments for why this is necessary.
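
One way to picture the change suggested in the comment (completing the batch with a different {{Errors}} type) is the rough sketch below. It is not the actual Kafka code and not a proposed patch. SketchError only mirrors two constants that exist in Kafka's Errors enum, and SketchResponse with its wasTimedOut() flag is a hypothetical stand-in, since the code quoted in the issue exposes only wasDisconnected().

```java
import java.util.Objects;

final class TimeoutAwareCompletion {

    // Stand-in for two constants that exist in Kafka's Errors enum.
    enum SketchError { NETWORK_EXCEPTION, REQUEST_TIMED_OUT }

    // Hypothetical, simplified view of a client response. The timedOut flag is
    // an assumption; the quoted code only shows wasDisconnected().
    static final class SketchResponse {
        private final String destination;
        private final boolean disconnected;
        private final boolean timedOut;

        SketchResponse(String destination, boolean disconnected, boolean timedOut) {
            this.destination = Objects.requireNonNull(destination);
            this.disconnected = disconnected;
            this.timedOut = timedOut;
        }

        String destination() { return destination; }
        boolean wasDisconnected() { return disconnected; }
        boolean wasTimedOut() { return timedOut; }
    }

    // Picks the error used to complete the batches of a disconnected request:
    // a timeout error if the disconnect was caused by a request timeout,
    // otherwise the network exception used today.
    static SketchError errorFor(SketchResponse response) {
        if (!response.wasDisconnected())
            throw new IllegalArgumentException("response was not disconnected");
        return response.wasTimedOut()
                ? SketchError.REQUEST_TIMED_OUT
                : SketchError.NETWORK_EXCEPTION;
    }

    // Builds the matching error message for the completed batch.
    static String messageFor(SketchResponse response) {
        return response.wasTimedOut()
                ? String.format("Request to node %s timed out", response.destination())
                : String.format("Disconnected from node %s", response.destination());
    }

    public static void main(String[] args) {
        SketchResponse timedOut = new SketchResponse("1", true, true);
        SketchResponse dropped = new SketchResponse("2", true, false);
        System.out.println(errorFor(timedOut) + ": " + messageFor(timedOut));
        System.out.println(errorFor(dropped) + ": " + messageFor(dropped));
    }
}
```

In this sketch a timed-out request still counts as disconnected, because the timeout path closes the connection. Whether a timeout error should also trigger the metadata refresh that the network exception triggers today is the open question raised in the description, and the sketch does not answer it.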



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
