[ 
https://issues.apache.org/jira/browse/KAFKA-10335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17530152#comment-17530152
 ] 

Sean Chang commented on KAFKA-10335:
------------------------------------

hi,how to resovle the problem?it is a perfmance issue with producer?

 

> Blocking of producer IO thread when calling send() from callback
> ----------------------------------------------------------------
>
>                 Key: KAFKA-10335
>                 URL: https://issues.apache.org/jira/browse/KAFKA-10335
>             Project: Kafka
>          Issue Type: Bug
>          Components: clients, producer 
>            Reporter: Alexander Sibiryakov
>            Priority: Minor
>
> We had application which supposed to be using KafkaProducer to deliver 
> results of some work. Sometimes delivery of results weren't successful 
> because of network connectivity errors or maintenance happening on the broker 
> side. In such cases we wanted application to send an event with error and 
> original message details. All good, but we wanted errors to be delivered to a 
> separate topic. So we implemented a callback in send() method, using the same 
> producer instance and calling send() from there.
> This application worked for some time, but then we encountered that its 
> producer was stuck. Almost no CPU consumption and expiring batches for hours. 
> After connecting with debugger it turned out that sender IO thread is 
> blocking. When record is expired, a callback was called, which contained a 
> call to send(), implying usage of a new topic, which metadata is not present 
> in producer's client cache. When send() is missing metadata, it is allowed to 
> block for up to max.block.ms interval, which is 60 secs by default. If 
> application is active, then it will quickly result in a large amount of 
> accumulated records. Every record will block IO thread for 60s. Therefore 
> sender IO thread is essentially blocked. In Producer only Sender IO thread 
> contains a call to client's poll() method, which is responsible for all the 
> network communication and metadata update. If poll() is executed with 
> significant delay then it will result to errors, connected with various 
> timeouts. That's it we've got a stuck producer with little chance to recover.
> To summarise, pre-requisites for the problem are sending from callback, using 
> the same producer instance and usage of topic which wasn't seen before.
> I think it is important to decide if the issue is KafkaProducer misuse or its 
> bug. Code is callbacks shouldn't block, that is clear, but at the same time, 
> no one expects already initialised producer to block.
> Depending on decision I could produce a fix, it can be either a warning when 
> user is trying to call a send() from callback, or reduction of max allowed 
> blocking time for metadata update. It could be just docs changes, or even 
> nothing.
> Here is code to reproduce the issue, the output it is producing follows the 
> code snippet. Tested on Confluent Cloud, from my desktop with 100 Mbps 
> connection.
> {code:java}
>     public static void main(String[] args) throws IOException {
>         byte[] blob = new byte[262144];
>         Properties properties = new Properties();
>         properties.load(new FileReader("kafka-staging.properties"));
>         properties.setProperty("key.serializer", 
> "org.apache.kafka.common.serialization.StringSerializer");
>         properties.setProperty("value.serializer", 
> "org.apache.kafka.common.serialization.ByteArraySerializer");
>         properties.setProperty("request.timeout.ms", "5000");
>         properties.setProperty("delivery.timeout.ms", "5000");
>         KafkaProducer<String, byte[]> producer = new 
> KafkaProducer(properties);
>         while (true) {
>             ProducerRecord<String, byte[]> record = new 
> ProducerRecord<>("alex-test-valid-data", blob);
>             producer.send(record, new Callback() {
>                 @Override
>                 public void onCompletion(RecordMetadata metadata, Exception 
> exception) {
>                     if (exception != null) {
>                         System.err.println(exception);
>                         long start = System.currentTimeMillis();
>                         ProducerRecord<String, byte[]> record = new 
> ProducerRecord<>("alex-test-errors", blob);
>                         producer.send(record);  // blocking caused by 
> metadata update
>                         long timeElapsed = System.currentTimeMillis() - start;
>                         System.err.println("time spent blocking IO thread: " 
> + timeElapsed);
>                     }
>                 }
>             });
>         }
>     }
> {code}
> {noformat}
> [2020-07-31 14:35:51,936: INFO/main] (AbstractConfig.java:347) - 
> ProducerConfig values: 
>       acks = 1
>       batch.size = 16384
>       bootstrap.servers = [pkc-l915e.europe-west1.gcp.confluent.cloud:9092]
>       buffer.memory = 33554432
>       client.dns.lookup = default
>       client.id = 
>       compression.type = none
>       connections.max.idle.ms = 540000
>       delivery.timeout.ms = 5000
>       enable.idempotence = false
>       interceptor.classes = []
>       key.serializer = class 
> org.apache.kafka.common.serialization.StringSerializer
>       linger.ms = 0
>       max.block.ms = 60000
>       max.in.flight.requests.per.connection = 5
>       max.request.size = 1048576
>       metadata.max.age.ms = 300000
>       metric.reporters = []
>       metrics.num.samples = 2
>       metrics.recording.level = INFO
>       metrics.sample.window.ms = 30000
>       partitioner.class = class 
> org.apache.kafka.clients.producer.internals.DefaultPartitioner
>       receive.buffer.bytes = 32768
>       reconnect.backoff.max.ms = 1000
>       reconnect.backoff.ms = 50
>       request.timeout.ms = 5000
>       retries = 2147483647
>       retry.backoff.ms = 100
>       sasl.client.callback.handler.class = null
>       sasl.jaas.config = [hidden]
>       sasl.kerberos.kinit.cmd = /usr/bin/kinit
>       sasl.kerberos.min.time.before.relogin = 60000
>       sasl.kerberos.service.name = null
>       sasl.kerberos.ticket.renew.jitter = 0.05
>       sasl.kerberos.ticket.renew.window.factor = 0.8
>       sasl.login.callback.handler.class = null
>       sasl.login.class = null
>       sasl.login.refresh.buffer.seconds = 300
>       sasl.login.refresh.min.period.seconds = 60
>       sasl.login.refresh.window.factor = 0.8
>       sasl.login.refresh.window.jitter = 0.05
>       sasl.mechanism = PLAIN
>       security.protocol = SASL_SSL
>       security.providers = null
>       send.buffer.bytes = 131072
>       ssl.cipher.suites = null
>       ssl.enabled.protocols = [TLSv1.2, TLSv1.1, TLSv1]
>       ssl.endpoint.identification.algorithm = https
>       ssl.key.password = null
>       ssl.keymanager.algorithm = SunX509
>       ssl.keystore.location = null
>       ssl.keystore.password = null
>       ssl.keystore.type = JKS
>       ssl.protocol = TLS
>       ssl.provider = null
>       ssl.secure.random.implementation = null
>       ssl.trustmanager.algorithm = PKIX
>       ssl.truststore.location = null
>       ssl.truststore.password = null
>       ssl.truststore.type = JKS
>       transaction.timeout.ms = 60000
>       transactional.id = null
>       value.serializer = class 
> org.apache.kafka.common.serialization.ByteArraySerializer
> [2020-07-31 14:35:52,099: INFO/main] (AbstractLogin.java:61) - Successfully 
> logged in.
> [2020-07-31 14:35:52,291: INFO/main] (AppInfoParser.java:117) - Kafka 
> version: 5.4.0-ccs
> [2020-07-31 14:35:52,291: INFO/main] (AppInfoParser.java:118) - Kafka 
> commitId: f4201a82bea68cc7
> [2020-07-31 14:35:52,291: INFO/main] (AppInfoParser.java:119) - Kafka 
> startTimeMs: 1596198952287
> [2020-07-31 14:35:52,853: INFO/kafka-producer-network-thread | producer-1] 
> (Metadata.java:261) - [Producer clientId=producer-1] Cluster ID: lkc-43m2m
> org.apache.kafka.common.errors.TimeoutException: Expiring 1 record(s) for 
> alex-test-valid-data-0:5001 ms has passed since batch creation
> org.apache.kafka.common.errors.TimeoutException: Failed to allocate memory 
> within the configured max blocking time 60000 ms.
> time spent blocking IO thread: 60001
> org.apache.kafka.common.errors.NetworkException: The server disconnected 
> before a response was received.
> time spent blocking IO thread: 60002
> time spent blocking IO thread: 60017
> org.apache.kafka.common.errors.NetworkException: The server disconnected 
> before a response was received.
> [2020-07-31 14:38:07,219: WARN/kafka-producer-network-thread | producer-1] 
> (Sender.java:682) - [Producer clientId=producer-1] Received invalid metadata 
> error in produce request on partition alex-test-valid-data-3 due to 
> org.apache.kafka.common.errors.NetworkException: The server disconnected 
> before a response was received.. Going to request metadata update now
> org.apache.kafka.common.errors.TimeoutException: Failed to allocate memory 
> within the configured max blocking time 60000 ms.
> time spent blocking IO thread: 60003
> [2020-07-31 14:39:07,223: WARN/kafka-producer-network-thread | producer-1] 
> (Sender.java:682) - [Producer clientId=producer-1] Received invalid metadata 
> error in produce request on partition alex-test-valid-data-0 due to 
> org.apache.kafka.common.errors.NetworkException: The server disconnected 
> before a response was received.. Going to request metadata update now
> org.apache.kafka.common.errors.NetworkException: The server disconnected 
> before a response was received.
> time spent blocking IO thread: 60002
> time spent blocking IO thread: 60001
> [2020-07-31 14:40:07,224: WARN/kafka-producer-network-thread | producer-1] 
> (Sender.java:682) - [Producer clientId=producer-1] Received invalid metadata 
> error in produce request on partition alex-test-valid-data-5 due to 
> org.apache.kafka.common.errors.NetworkException: The server disconnected 
> before a response was received.. Going to request metadata update now
> org.apache.kafka.common.errors.NetworkException: The server disconnected 
> before a response was received.
> org.apache.kafka.common.errors.TimeoutException: Failed to allocate memory 
> within the configured max blocking time 60000 ms.
> time spent blocking IO thread: 60001
> [2020-07-31 14:41:07,225: WARN/kafka-producer-network-thread | producer-1] 
> (Sender.java:682) - [Producer clientId=producer-1] Received invalid metadata 
> error in produce request on partition alex-test-valid-data-1 due to 
> org.apache.kafka.common.errors.NetworkException: The server disconnected 
> before a response was received.. Going to request metadata update now
> org.apache.kafka.common.errors.NetworkException: The server disconnected 
> before a response was received.
> time spent blocking IO thread: 60004
> time spent blocking IO thread: 60004
> [2020-07-31 14:42:07,229: WARN/kafka-producer-network-thread | producer-1] 
> (Sender.java:682) - [Producer clientId=producer-1] Received invalid metadata 
> error in produce request on partition alex-test-valid-data-4 due to 
> org.apache.kafka.common.errors.NetworkException: The server disconnected 
> before a response was received.. Going to request metadata update now
> org.apache.kafka.common.errors.NetworkException: The server disconnected 
> before a response was received.
> org.apache.kafka.common.errors.TimeoutException: Failed to allocate memory 
> within the configured max blocking time 60000 ms.
> time spent blocking IO thread: 60000
> [2020-07-31 14:43:07,229: WARN/kafka-producer-network-thread | producer-1] 
> (Sender.java:682) - [Producer clientId=producer-1] Received invalid metadata 
> error in produce request on partition alex-test-valid-data-2 due to 
> org.apache.kafka.common.errors.NetworkException: The server disconnected 
> before a response was received.. Going to request metadata update now
> org.apache.kafka.common.errors.TimeoutException: Expiring 1 record(s) for 
> alex-test-valid-data-5:422600 ms has passed since batch creation
> time spent blocking IO thread: 60003
> time spent blocking IO thread: 60001
> org.apache.kafka.common.errors.TimeoutException: Expiring 1 record(s) for 
> alex-test-valid-data-5:422490 ms has passed since batch creation
> org.apache.kafka.common.errors.TimeoutException: Failed to allocate memory 
> within the configured max blocking time 60000 ms.
> time spent blocking IO thread: 60002
> org.apache.kafka.common.errors.TimeoutException: Expiring 1 record(s) for 
> alex-test-valid-data-5:422315 ms has passed since batch creation
> time spent blocking IO thread: 60003
> time spent blocking IO thread: 60003
> org.apache.kafka.common.errors.TimeoutException: Expiring 1 record(s) for 
> alex-test-valid-data-5:422124 ms has passed since batch creation
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

Reply via email to