[ https://issues.apache.org/jira/browse/KAFKA-8325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16901429#comment-16901429 ]
Bob Barrett commented on KAFKA-8325:
------------------------------------

It looks like the problem is that when handling a MESSAGE_TOO_LARGE error, we don't correctly remove the original batch from the list of in-flight batches, but we do deallocate it in the accumulator. When we later check the in-flight batches for expiration, we try to deallocate the batch a second time, which causes this error. I'll have a fix out this week. Thanks for the report and the logs, [~mbarbon] and [~lukestephenson]!

[~lukestephenson], thanks for providing that demo code! Regarding the OutOfMemoryError you found, I think the underlying cause is the same: because we don't remove the batch from the list of in-flight batches, and because we retry MESSAGE_TOO_LARGE errors indefinitely, the batches build up and eventually consume all the available memory. I'll run your program with my fix and confirm that it resolves the issue.

As for why we don't decrement retries after splitting batches: we want to treat the new, smaller batches as separate requests that get the same number of attempts as any other request. If we didn't, and the producer batch size were too high relative to the number of retries, we could run out of retries before splitting down to a safe size and fail to produce the records, even though each individual record is viable. Eventually we split large batches down to a single record, and if that is still too large we don't retry. In the case of your demo, I suspect memory ran out before the split batches got below the broker size limit, but that should be addressed by the fix for this bug.

> Remove from the incomplete set failed. 
> This should be impossible
> ----------------------------------------------------------------
>
>                 Key: KAFKA-8325
>                 URL: https://issues.apache.org/jira/browse/KAFKA-8325
>             Project: Kafka
>          Issue Type: Bug
>          Components: producer
>    Affects Versions: 2.1.0, 2.3.0
>            Reporter: Mattia Barbon
>            Assignee: Bob Barrett
>            Priority: Major
>
> I got this error when using the Kafka producer. So far it happened twice,
> with an interval of about 1 week.
> {{ERROR [2019-05-05 08:43:07,505] org.apache.kafka.clients.producer.internals.Sender: [Producer clientId=<redacted>, transactionalId=<redacted>] Uncaught error in kafka producer I/O thread:}}
> {{ ! java.lang.IllegalStateException: Remove from the incomplete set failed. This should be impossible.}}
> {{ ! at org.apache.kafka.clients.producer.internals.IncompleteBatches.remove(IncompleteBatches.java:44)}}
> {{ ! at org.apache.kafka.clients.producer.internals.RecordAccumulator.deallocate(RecordAccumulator.java:645)}}
> {{ ! at org.apache.kafka.clients.producer.internals.Sender.failBatch(Sender.java:717)}}
> {{ ! at org.apache.kafka.clients.producer.internals.Sender.sendProducerData(Sender.java:365)}}
> {{ ! at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:308)}}
> {{ ! at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:233)}}
> {{ ! at java.lang.Thread.run(Thread.java:748)}}

--
This message was sent by Atlassian JIRA
(v7.6.14#76016)
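For illustration, the split-until-it-fits behavior described in the comment can be sketched as below. This is a minimal model, not Kafka's actual RecordAccumulator/Sender code: the `BatchSplitSketch` class, the `BROKER_LIMIT` constant, and using record counts as a stand-in for byte sizes are all assumptions made for the sketch. Each half produced by a split re-enters the queue as a fresh request (mirroring why retries are not decremented on split), and a single record that still exceeds the limit fails hard rather than retrying.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical sketch of the retry/split policy described in the comment above.
public class BatchSplitSketch {
    // Stand-in for the broker's size limit: max records per batch
    // (real Kafka compares serialized bytes, not record counts).
    static final int BROKER_LIMIT = 4;

    /** Splits oversized batches until every batch fits; returns the accepted batches. */
    static List<List<Integer>> send(List<Integer> batch) {
        List<List<Integer>> accepted = new ArrayList<>();
        Deque<List<Integer>> inFlight = new ArrayDeque<>();
        inFlight.add(batch);
        while (!inFlight.isEmpty()) {
            List<Integer> b = inFlight.poll();
            if (b.size() <= BROKER_LIMIT) {
                accepted.add(b); // the broker accepts this batch
            } else if (b.size() == 1) {
                // A single record that is still too large cannot be split
                // further, so it is a hard failure rather than a retry.
                throw new IllegalStateException("record too large");
            } else {
                // MESSAGE_TOO_LARGE: split in half. The halves are enqueued
                // as new requests with a full retry budget of their own.
                int mid = b.size() / 2;
                inFlight.add(new ArrayList<>(b.subList(0, mid)));
                inFlight.add(new ArrayList<>(b.subList(mid, b.size())));
            }
        }
        return accepted;
    }

    public static void main(String[] args) {
        List<Integer> records = new ArrayList<>();
        for (int i = 0; i < 10; i++) records.add(i);
        List<List<Integer>> out = send(records);
        int total = out.stream().mapToInt(List::size).sum();
        // 10 records with a limit of 4 split as 10 -> 5+5 -> (2+3)+(2+3)
        System.out.println(out.size() + " batches, " + total + " records");
    }
}
```

Running the sketch prints `4 batches, 10 records`: no records are lost, and each split half got its own chance at delivery instead of inheriting a depleted retry count.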