[ https://issues.apache.org/jira/browse/KAFKA-14317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17685353#comment-17685353 ]
Kirk True commented on KAFKA-14317: ----------------------------------- [~dajac] could you take another look at the PR? Thanks! > ProduceRequest timeouts are logged as network exceptions > -------------------------------------------------------- > > Key: KAFKA-14317 > URL: https://issues.apache.org/jira/browse/KAFKA-14317 > Project: Kafka > Issue Type: Bug > Components: clients, logging, producer > Affects Versions: 3.3.0 > Reporter: Kirk True > Assignee: Kirk True > Priority: Major > Original Estimate: 48h > Remaining Estimate: 48h > > In {{{}NetworkClient.handleTimedOutRequests{}}}, we disconnect the broker > connection: > {code:java} > private void handleTimedOutRequests(List<ClientResponse> responses, long now) > { > List<String> nodeIds = > this.inFlightRequests.nodesWithTimedOutRequests(now); > for (String nodeId : nodeIds) { > // close connection to the node > this.selector.close(nodeId); > log.info("Disconnecting from node {} due to request timeout.", > nodeId); > processDisconnection(responses, nodeId, now, > ChannelState.LOCAL_CLOSE); > } > } > {code} > This calls {{processDisconnection}} which calls > {{{}cancelInFlightRequests{}}}: > {code:java} > for (InFlightRequest request : inFlightRequests) { > if (log.isDebugEnabled()) { > log.debug("Cancelled in-flight {} request with correlation id {} due > to node {} being disconnected " + > "(elapsed time since creation: {}ms, elapsed time since send: > {}ms, request timeout: {}ms): {}", > request.header.apiKey(), request.header.correlationId(), nodeId, > request.timeElapsedSinceCreateMs(now), > request.timeElapsedSinceSendMs(now), > request.requestTimeoutMs, request.request); > } else { > log.info("Cancelled in-flight {} request with correlation id {} due > to node {} being disconnected " + > "(elapsed time since creation: {}ms, elapsed time since send: > {}ms, request timeout: {}ms)", > request.header.apiKey(), request.header.correlationId(), nodeId, > request.timeElapsedSinceCreateMs(now), > request.timeElapsedSinceSendMs(now), > request.requestTimeoutMs); > } > if (!request.isInternalRequest) { > if (responses != null) > responses.add(request.disconnected(now, null)); > } else if (request.header.apiKey() == ApiKeys.METADATA) { > metadataUpdater.handleFailedRequest(now, Optional.empty()); > } > } > {code} > We create a new {{ClientResponse}} in which the {{disconnected}} flag is set. > We then complete the record batch In {{Sender.handleProduceResponse}} with: > {code:java} > if (response.wasDisconnected()) { > log.trace("Cancelled request with header {} due to node {} being > disconnected", > requestHeader, response.destination()); > for (ProducerBatch batch : batches.values()) > completeBatch(batch, new > ProduceResponse.PartitionResponse(Errors.NETWORK_EXCEPTION, > String.format("Disconnected from node %s", response.destination())), > correlationId, now); > } > {code} > This seems like it could be confusing for customers that they would see > network exceptions on a request timeout instead of a timeout error. > One implication of completing the batch with a network exception is that the > producer will try to refresh metadata after a request timeout. I can see > arguments for why this is necessary. -- This message was sent by Atlassian Jira (v8.20.10#820010)