[ https://issues.apache.org/jira/browse/KAFKA-14317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17623413#comment-17623413 ]

Kirk True commented on KAFKA-14317:
-----------------------------------

This looks related to KAFKA-10228, but that Jira is still open and seems to suggest only a logging change. I _believe_ we want to change the behavior to complete the batch using a different {{Errors}} type.

> ProduceRequest timeouts are logged as network exceptions
> --------------------------------------------------------
>
>                 Key: KAFKA-14317
>                 URL: https://issues.apache.org/jira/browse/KAFKA-14317
>             Project: Kafka
>          Issue Type: Bug
>          Components: clients, logging, producer
>    Affects Versions: 3.3.0
>            Reporter: Kirk True
>            Assignee: Kirk True
>            Priority: Major
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> In NetworkClient.handleTimedOutRequests, we disconnect the broker connection:
>
> {code:java}
> private void handleTimedOutRequests(List<ClientResponse> responses, long now) {
>     List<String> nodeIds = this.inFlightRequests.nodesWithTimedOutRequests(now);
>     for (String nodeId : nodeIds) {
>         // close connection to the node
>         this.selector.close(nodeId);
>         log.debug("Disconnecting from node {} due to request timeout.", nodeId);
>         processDisconnection(responses, nodeId, now, ChannelState.LOCAL_CLOSE);
>     }
> }
> {code}
> This eventually calls cancelInFlightRequests:
> {code:java}
> for (InFlightRequest request : inFlightRequests) {
>     log.trace("Cancelled request {} {} with correlation id {} due to node {} being disconnected",
>         request.header.apiKey(), request.request, request.header.correlationId(), nodeId);
>
>     if (!request.isInternalRequest) {
>         if (responses != null)
>             responses.add(request.disconnected(now, null));
>     } else if (request.header.apiKey() == ApiKeys.METADATA) {
>         metadataUpdater.handleFailedRequest(now, Optional.empty());
>     }
> }
> {code}
> We set the response to disconnected.
> In the producer, we complete the record batch with:
> {code:java}
> if (response.wasDisconnected()) {
>     log.trace("Cancelled request with header {} due to node {} being disconnected",
>         requestHeader, response.destination());
>     for (ProducerBatch batch : batches.values())
>         completeBatch(batch, new ProduceResponse.PartitionResponse(Errors.NETWORK_EXCEPTION,
>             String.format("Disconnected from node %s", response.destination())), correlationId, now);
> }
> {code}
> This could be confusing for customers, who would see a network exception on a request timeout instead of a timeout error.
> One implication of completing the batch with a network exception is that the producer will try to refresh metadata after a request timeout. I can see arguments for why this is necessary.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
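
One way to picture the behavior change suggested in the comment is the minimal, self-contained sketch below. It is not the Kafka client code: {{SketchError}}, {{SketchResponse}}, and the {{timedOut}} flag are hypothetical stand-ins introduced only for illustration, on the assumption that the response could record whether the disconnect was caused by a request timeout. Under that assumption, the producer could choose a timeout error rather than a network exception when completing the batch.

```java
import java.util.Optional;

public class TimeoutVsDisconnectSketch {

    // Hypothetical stand-in for the two relevant error types.
    enum SketchError { NETWORK_EXCEPTION, REQUEST_TIMED_OUT }

    // Hypothetical stand-in for a client response. The timedOut flag is an
    // assumption of this sketch, not something the issue text confirms exists.
    static final class SketchResponse {
        final String destination;
        final boolean disconnected;
        final boolean timedOut;

        SketchResponse(String destination, boolean disconnected, boolean timedOut) {
            this.destination = destination;
            this.disconnected = disconnected;
            this.timedOut = timedOut;
        }
    }

    // Picks the error used to complete a batch. A timeout is checked first
    // because a timed-out request also closes the connection, so both flags
    // would be set in that case.
    static Optional<SketchError> errorFor(SketchResponse response) {
        if (response.timedOut)
            return Optional.of(SketchError.REQUEST_TIMED_OUT);
        if (response.disconnected)
            return Optional.of(SketchError.NETWORK_EXCEPTION);
        return Optional.empty();
    }

    // Builds the message attached to the failed batch, or an empty string
    // when the response is not a failure of either kind.
    static String messageFor(SketchResponse response) {
        return errorFor(response)
            .map(error -> error == SketchError.REQUEST_TIMED_OUT
                ? String.format("Request to node %s timed out", response.destination)
                : String.format("Disconnected from node %s", response.destination))
            .orElse("");
    }

    public static void main(String[] args) {
        SketchResponse timeout = new SketchResponse("1", true, true);
        SketchResponse disconnect = new SketchResponse("2", true, false);
        System.out.println(errorFor(timeout) + ": " + messageFor(timeout));
        System.out.println(errorFor(disconnect) + ": " + messageFor(disconnect));
    }
}
```

The sketch leaves open whether a metadata refresh should still follow a request timeout, which is the question raised at the end of the description.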