[ https://issues.apache.org/jira/browse/KAFKA-10228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Christian Becker updated KAFKA-10228:
-------------------------------------
    Description:
We're currently seeing an issue with the Java client (producer) when message producing runs into a timeout. Namely, a NETWORK_EXCEPTION is thrown instead of a timeout exception.

*Situation and relevant code:*

Config
{code:java}
request.timeout.ms: 200
retries: 3
acks: all{code}
{code:java}
// Collects one future per send so we can wait for the whole batch below.
List<CompletableFuture<SendResult<String, String>>> futures = new ArrayList<>();
for (UnpublishedEvent event : unpublishedEvents) {
    ListenableFuture<SendResult<String, String>> future;
    future = kafkaTemplate.send(new ProducerRecord<>(event.getTopic(), event.getKafkaKey(), event.getPayload()));
    futures.add(future.completable());
}
// Block until every send in the batch has completed (or failed).
CompletableFuture.allOf(futures.stream().toArray(CompletableFuture[]::new)).join();{code}
We're using the KafkaTemplate from Spring Boot here, but it shouldn't matter, as it's merely a wrapper. We hand it batches of messages to be sent.

200 ms later, we can see the following in the logs (not sure about the order: both entries arrived in the same millisecond, so our logging system might not display them in the right order):
{code:java}
[Producer clientId=producer-1] Received invalid metadata error in produce request on partition events-6 due to org.apache.kafka.common.errors.NetworkException: The server disconnected before a response was received.. Going to request metadata update now
[Producer clientId=producer-1] Got error produce response with correlation id 3094 on topic-partition events-6, retrying (2 attempts left). Error: NETWORK_EXCEPTION
{code}
This was somewhat unexpected and sent us on a hunt across the infrastructure for possible connection issues, but we found none.

Side note: In some cases the retries worked and the messages were successfully produced.

Only after many hours of heavy debugging did we notice that the error might be related to the low timeout setting. We've removed that setting now, as it was a remnant from the past and no longer valid for our use-case.
However, in order to avoid other people having that issue again and to simplify future debugging, some form of timeout exception should be thrown.

> producer: NETWORK_EXCEPTION is thrown instead of a request timeout
> ------------------------------------------------------------------
>
>                 Key: KAFKA-10228
>                 URL: https://issues.apache.org/jira/browse/KAFKA-10228
>             Project: Kafka
>          Issue Type: Improvement
>          Components: clients
>    Affects Versions: 2.3.1
>            Reporter: Christian Becker
>            Priority: Major

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
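
For readers who run into the same symptom, the following is a minimal sketch of producer settings that avoid the very low request timeout described in this issue. It is not taken from the report: the class `ProducerTimeoutSketch` and the helper `producerProps` are hypothetical names, and the values 30000 and 120000 are assumed here because they are the documented 2.x producer defaults for `request.timeout.ms` and `delivery.timeout.ms`. The check reflects the documented requirement that `delivery.timeout.ms` be at least `linger.ms` plus `request.timeout.ms`.

```java
import java.util.Properties;

// Hypothetical helper, not part of Kafka or of the reporter's code base.
public class ProducerTimeoutSketch {

    // Builds producer properties and rejects combinations in which the
    // overall delivery timeout is shorter than linger.ms + request.timeout.ms.
    static Properties producerProps(int requestTimeoutMs, int lingerMs, int deliveryTimeoutMs) {
        if (deliveryTimeoutMs < lingerMs + requestTimeoutMs) {
            throw new IllegalArgumentException(
                "delivery.timeout.ms must be >= linger.ms + request.timeout.ms");
        }
        Properties props = new Properties();
        props.setProperty("acks", "all");   // as in the report
        props.setProperty("retries", "3");  // as in the report
        props.setProperty("request.timeout.ms", Integer.toString(requestTimeoutMs));
        props.setProperty("linger.ms", Integer.toString(lingerMs));
        props.setProperty("delivery.timeout.ms", Integer.toString(deliveryTimeoutMs));
        return props;
    }

    public static void main(String[] args) {
        // Assumed values: the 2.x defaults instead of the 200 ms from the report.
        Properties props = producerProps(30_000, 0, 120_000);
        props.forEach((key, value) -> System.out.println(key + "=" + value));
    }
}
```

The resulting `Properties` object could be passed to a `KafkaProducer` constructor or mapped onto the equivalent Spring configuration. Whether these values suit a given workload is something only the operator can judge.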