[ https://issues.apache.org/jira/browse/KAFKA-7572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17486780#comment-17486780 ]
Guozhang Wang commented on KAFKA-7572: -------------------------------------- Thanks for pinging me. I've just made a pass on the PR. > Producer should not send requests with negative partition id > ------------------------------------------------------------ > > Key: KAFKA-7572 > URL: https://issues.apache.org/jira/browse/KAFKA-7572 > Project: Kafka > Issue Type: Bug > Components: clients > Affects Versions: 1.0.1, 1.1.1 > Reporter: Yaodong Yang > Assignee: Wenhao Ji > Priority: Major > Labels: patch-available > > h3. Issue: > In one Kafka producer log from our users, we found the following weird one: > timestamp="2018-10-09T17:37:41,237-0700",level="ERROR", Message="Write to > Kafka failed with: ",exception="java.util.concurrent.ExecutionException: > org.apache.kafka.common.errors.TimeoutException: Expiring 1 record(s) for > topicName--2: 30042 ms has passed since batch creation plus linger time > at > org.apache.kafka.clients.producer.internals.FutureRecordMetadata.valueOrError(FutureRecordMetadata.java:94) > at > org.apache.kafka.clients.producer.internals.FutureRecordMetadata.get(FutureRecordMetadata.java:64) > at > org.apache.kafka.clients.producer.internals.FutureRecordMetadata.get(FutureRecordMetadata.java:29) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) > at > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > Caused by: org.apache.kafka.common.errors.TimeoutException: Expiring 1 > record(s) for topicName--2: 30042 ms has passed since batch creation plus > linger time" > After a few hours debugging, we finally understood the root cause of this > issue: > # The producer used a buggy custom Partitioner, which sometimes generates > negative partition ids for new records. > # The corresponding produce requests were rejected by brokers, because it's > illegal to have a partition with a negative id. > # The client kept refreshing its local cluster metadata, but could not send > produce requests successfully. > # From the above log, we found a suspicious string "topicName--2": > # According to the source code, the format of this string in the log is > TopicName+"-"+PartitionId. > # It's not easy to notice that there were 2 consecutive dash in the above > log. > # Eventually, we found that the second dash was a negative sign. Therefore, > the partition id is -2, rather than 2. > # The bug the custom Partitioner. > h3. Proposal: > # Producer code should check the partitionId before sending requests to > brokers. > # If there is a negative partition Id, just throw an IllegalStateException{{ > }}exception. > # Such a quick check can save lots of time for people debugging their > producer code. -- This message was sent by Atlassian Jira (v8.20.1#820001)