Hi all,

We have been seeing this issue intermittently, so it is difficult to give step-by-step instructions to reproduce it. I have been studying the code of Sender.java (org.apache.kafka.clients.producer.internals.Sender.java), but haven't been able to find a possible bug.
Our setup is a 3-node Kafka cluster. Here are some relevant logs.

Producer side:

2018-03-28 09:50:54,290 ERROR [kafka-producer-network-thread | producer-1] o.a.k.c.producer.internals.Sender:301 - [Producer clientId=producer-1] The broker returned org.apache.kafka.common.errors.UnknownProducerIdException: This exception is raised by the broker if it could not locate the producer metadata associated with the producerId in question. This could happen if, for instance, the producer's records were deleted because their retention time had elapsed. Once the last records of the producerId are removed, the producer's metadata is removed from the broker, and future appends by the producer will return this exception. for topic-partition pipeline-0 at offset -1. This indicates data loss on the broker, and should be investigated.

2018-03-28 09:51:13,394 WARN [kafka-producer-network-thread | producer-1] o.a.k.c.producer.internals.Sender:251 - [Producer clientId=producer-1] Got error produce response with correlation id 1000 on topic-partition pipeline-3, retrying (2147483459 attempts left). Error: OUT_OF_ORDER_SEQUENCE_NUMBER

2018-03-28 10:48:33,365 WARN [kafka-producer-network-thread | producer-1] o.a.k.c.producer.internals.Sender:251 - [Producer clientId=producer-1] Got error produce response with correlation id 34893 on topic-partition pipeline-3, retrying (2147449585 attempts left). Error: OUT_OF_ORDER_SEQUENCE_NUMBER

Broker side:

[2018-03-28 09:50:54,421] ERROR [ReplicaManager broker=1001] Error processing append operation on partition pipeline-3 (kafka.server.ReplicaManager)
org.apache.kafka.common.errors.OutOfOrderSequenceException: Out of order sequence number for producerId 5102: 2 (incoming seq. number), 7 (current end sequence number)

Some context:

1. We have an Admin API of sorts which deletes and recreates topics (and loads them). When we delete a topic, a new producerId gets created, while the same producer instance keeps being used to write messages. (This might be a problem, but we don't know for sure.)

2. We don't always get stuck in these INT_MAX retries (the retry count is INT_MAX because we have enabled idempotence). Many times it stops after 30 seconds, as expected, and sets a new producerId, but sometimes that timeout exception doesn't get triggered:

2018-03-29 10:16:54,826 INFO [kafka-producer-network-thread | producer-1] o.a.k.c.p.i.TransactionManager:346 - [Producer clientId=producer-1] ProducerId set to -1 with epoch -1

2018-03-29 10:16:54,827 INFO [kafka-producer-network-thread | producer-1] o.a.k.c.p.i.TransactionManager:346 - [Producer clientId=producer-1] ProducerId set to 9002 with epoch 0

We are looking to eliminate this non-deterministic behaviour by handling OUT_OF_ORDER_SEQUENCE_NUMBER in a better way (maybe by re-instantiating the producer, though we are not sure that would solve anything, since Kafka already has ways to reset the producerId after a timeout).

Any ideas/comments on why this is happening despite the default timeout of 30 seconds? Please let me know if you need more information to understand the problem we are facing.

Regards,
Saheb Motiani
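P.S. To make the broker-side error above concrete, here is a minimal Java sketch of how I understand the per-producer sequence check. It is a simplified model written for illustration, not Kafka's actual code: the names SequenceCheckSketch, PartitionState and append are hypothetical, and epochs, duplicate-batch detection and sequence wrap-around are left out. It assumes the broker expects each batch to start at the last accepted sequence plus one, and that recreating a topic discards whatever state was kept for the old producerId.

```java
import java.util.HashMap;
import java.util.Map;

// Simplified, hypothetical model of a broker's per-partition sequence check.
// Not Kafka's implementation: epochs, duplicate detection and wrap-around are omitted.
class SequenceCheckSketch {

    enum Result { OK, OUT_OF_ORDER_SEQUENCE_NUMBER, UNKNOWN_PRODUCER_ID }

    static final class PartitionState {
        // Last accepted sequence number per producerId for this partition.
        private final Map<Long, Integer> lastSeqByProducer = new HashMap<>();

        Result append(long producerId, int firstSeq, int lastSeq) {
            Integer currentEnd = lastSeqByProducer.get(producerId);
            if (currentEnd == null) {
                // No state for this producer: only a fresh sequence (0) is accepted.
                if (firstSeq != 0) {
                    return Result.UNKNOWN_PRODUCER_ID;
                }
            } else if (firstSeq != currentEnd + 1) {
                // Known producer, but the batch does not continue where the last one ended.
                return Result.OUT_OF_ORDER_SEQUENCE_NUMBER;
            }
            lastSeqByProducer.put(producerId, lastSeq);
            return Result.OK;
        }
    }

    public static void main(String[] args) {
        PartitionState partition = new PartitionState();
        System.out.println(partition.append(5102L, 0, 7)); // OK
        // Same shape as the broker log: incoming seq 2, current end seq 7.
        System.out.println(partition.append(5102L, 2, 4)); // OUT_OF_ORDER_SEQUENCE_NUMBER

        // Assumed effect of deleting and recreating the topic: state starts empty.
        PartitionState recreated = new PartitionState();
        System.out.println(recreated.append(5102L, 8, 9)); // UNKNOWN_PRODUCER_ID
        System.out.println(recreated.append(9002L, 0, 0)); // OK
    }
}
```

If this model is roughly right, a producer that keeps its old producerId and sequence numbers after the topic is recreated cannot satisfy the check just by retrying the same batch, which would be consistent with the retries we see.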