[ https://issues.apache.org/jira/browse/KAFKA-8325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16901429#comment-16901429 ]

Bob Barrett commented on KAFKA-8325:
------------------------------------

Looks like the problem is that when handling a MESSAGE_TOO_LARGE error, we 
don't correctly remove the original batch from the list of in-flight batches, 
but we do deallocate it in the accumulator. When we check the in-flight batches 
for expiration, we then try to deallocate the batch a second time, which causes 
this error. I'll have a fix out this week. Thanks for the report and the logs, 
[~mbarbon] and [~lukestephenson]!
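
To make the failure mode concrete, here is a minimal, self-contained sketch (hypothetical class and method names, not the actual producer internals) of why a second deallocation trips that IllegalStateException:

```java
import java.util.HashSet;
import java.util.Set;

// Simplified stand-in for the producer's incomplete-batch tracking:
// removing a batch that is no longer in the set throws, just like the
// "Remove from the incomplete set failed" error in the logs.
class IncompleteSet {
    private final Set<String> incomplete = new HashSet<>();

    void add(String batch) {
        incomplete.add(batch);
    }

    void remove(String batch) {
        if (!incomplete.remove(batch)) {
            throw new IllegalStateException(
                "Remove from the incomplete set failed. This should be impossible.");
        }
    }
}

public class DoubleDeallocateDemo {
    public static void main(String[] args) {
        IncompleteSet incomplete = new IncompleteSet();
        incomplete.add("batch-1");

        // MESSAGE_TOO_LARGE handling deallocates the batch once...
        incomplete.remove("batch-1");

        // ...but the batch was never dropped from the in-flight list, so
        // the expiration check deallocates it a second time and throws.
        try {
            incomplete.remove("batch-1");
        } catch (IllegalStateException e) {
            System.out.println("second deallocate: " + e.getMessage());
        }
    }
}
```

The fix, per the diagnosis above, is to drop the batch from the in-flight list at the same point it is deallocated, so the expiration path never sees it again.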

[~lukestephenson], thanks for providing that demo code! Regarding the 
OutOfMemoryError you found, I think the underlying cause is the same: because we 
don't remove the batch from the list of in-flight batches, and because we retry 
MESSAGE_TOO_LARGE errors infinitely, the batches build up and eventually use 
all the available memory. I'll run your program with my fix to confirm it resolves 
the issue. As for why we don't decrement retries after splitting batches, it's 
because we want to treat the new, smaller batches as separate requests that get 
the same number of attempts as any other request. If we didn't do this, and the 
producer batch size was too high relative to the number of retries, we might 
run out of retries before splitting down to a safe size and fail to produce the 
records, even though each individual record is small enough to be accepted. Eventually we'll split 
large batches down to a single record, and if that is still too large we don't 
retry. In the case of your demo, I suspect the memory ran out before the split 
batches got down below the broker size limit, but that should be addressed by 
the fix for this bug.
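
The splitting behavior described above can be sketched like this (a simplified model with hypothetical names and a made-up 100-byte limit, not the real Sender logic): each half of a split is resubmitted as a fresh request, and only a single record that still exceeds the limit fails without retrying.

```java
import java.util.ArrayList;
import java.util.List;

public class SplitRetryDemo {
    // Hypothetical broker-side limit on total batch size (bytes).
    static final int BROKER_LIMIT = 100;

    // Simulated produce path: a too-large multi-record batch is split in
    // half and each half is sent as a new request with a fresh retry
    // budget; a too-large single record fails instead of retrying forever.
    static List<List<Integer>> send(List<Integer> recordSizes) {
        int total = recordSizes.stream().mapToInt(Integer::intValue).sum();
        List<List<Integer>> delivered = new ArrayList<>();
        if (total <= BROKER_LIMIT) {
            delivered.add(recordSizes); // fits: the broker accepts it
            return delivered;
        }
        if (recordSizes.size() == 1) {
            // A single record that exceeds the limit can never succeed.
            throw new IllegalArgumentException(
                "record of size " + total + " exceeds broker limit; not retried");
        }
        int mid = recordSizes.size() / 2;
        delivered.addAll(send(new ArrayList<>(recordSizes.subList(0, mid))));
        delivered.addAll(send(new ArrayList<>(recordSizes.subList(mid, recordSizes.size()))));
        return delivered;
    }

    public static void main(String[] args) {
        // Four 60-byte records (240 total) split down to four single-record
        // batches, each of which fits under the 100-byte limit.
        List<List<Integer>> out = send(new ArrayList<>(List.of(60, 60, 60, 60)));
        System.out.println("delivered batches: " + out.size());
    }
}
```

In this model the number of attempts scales with the depth of splitting rather than consuming a fixed retry budget, which is why retries are not decremented on a split.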

> Remove from the incomplete set failed. This should be impossible
> ----------------------------------------------------------------
>
>                 Key: KAFKA-8325
>                 URL: https://issues.apache.org/jira/browse/KAFKA-8325
>             Project: Kafka
>          Issue Type: Bug
>          Components: producer 
>    Affects Versions: 2.1.0, 2.3.0
>            Reporter: Mattia Barbon
>            Assignee: Bob Barrett
>            Priority: Major
>
> I got this error when using the Kafka producer. So far it happened twice, 
> with an interval of about 1 week.
> {{ERROR [2019-05-05 08:43:07,505] org.apache.kafka.clients.producer.internals.Sender: [Producer clientId=<redacted>, transactionalId=<redacted>] Uncaught error in kafka producer I/O thread:}}
> {{ ! java.lang.IllegalStateException: Remove from the incomplete set failed. This should be impossible.}}
> {{ ! at org.apache.kafka.clients.producer.internals.IncompleteBatches.remove(IncompleteBatches.java:44)}}
> {{ ! at org.apache.kafka.clients.producer.internals.RecordAccumulator.deallocate(RecordAccumulator.java:645)}}
> {{ ! at org.apache.kafka.clients.producer.internals.Sender.failBatch(Sender.java:717)}}
> {{ ! at org.apache.kafka.clients.producer.internals.Sender.sendProducerData(Sender.java:365)}}
> {{ ! at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:308)}}
> {{ ! at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:233)}}
> {{ ! at java.lang.Thread.run(Thread.java:748)}}



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)
