[ 
https://issues.apache.org/jira/browse/KAFKA-7572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yaodong Yang updated KAFKA-7572:
--------------------------------
                Flags: Patch
    Affects Version/s: 1.1.1

> Producer should not send requests with negative partition id
> ------------------------------------------------------------
>
>                 Key: KAFKA-7572
>                 URL: https://issues.apache.org/jira/browse/KAFKA-7572
>             Project: Kafka
>          Issue Type: Bug
>          Components: clients
>    Affects Versions: 1.0.1, 1.1.1
>            Reporter: Yaodong Yang
>            Priority: Major
>
> h3. Issue:
> In a Kafka producer log from one of our users, we found the following unusual entry:
> timestamp="2018-10-09T17:37:41,237-0700",level="ERROR", Message="Write to 
> Kafka failed with: ",exception="java.util.concurrent.ExecutionException: 
> org.apache.kafka.common.errors.TimeoutException: Expiring 1 record(s) for 
> topicName--2: 30042 ms has passed since batch creation plus linger time
>  at 
> org.apache.kafka.clients.producer.internals.FutureRecordMetadata.valueOrError(FutureRecordMetadata.java:94)
>  at 
> org.apache.kafka.clients.producer.internals.FutureRecordMetadata.get(FutureRecordMetadata.java:64)
>  at 
> org.apache.kafka.clients.producer.internals.FutureRecordMetadata.get(FutureRecordMetadata.java:29)
>  at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>  at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
>  at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
>  at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>  at java.lang.Thread.run(Thread.java:748)
> Caused by: org.apache.kafka.common.errors.TimeoutException: Expiring 1 
> record(s) for topicName--2: 30042 ms has passed since batch creation plus 
> linger time"
> After a few hours of debugging, we finally understood the root cause of this 
> issue:
>  # The producer used a buggy custom Partitioner, which sometimes generates 
> negative partition ids for new records.
>  # The corresponding produce requests were rejected by brokers, because it's 
> illegal to have a partition with a negative id.
>  # The client kept refreshing its local cluster metadata, but could not send 
> produce requests successfully.
>  # From the above log, we found a suspicious string "topicName--2":
>  # According to the source code, the format of this string in the log is 
> TopicName+"-"+PartitionId.
>  # It's not easy to notice that there are two consecutive dashes in the above 
> log.
>  # Eventually, we found that the second dash was a negative sign. Therefore, 
> the partition id is -2, rather than 2.
>  # The bug was in the custom Partitioner.
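
The report does not say how the custom Partitioner produced negative ids. One common cause, shown here purely as an assumption, is taking a hash with Java's % operator, which keeps the sign of the dividend. The class and method names below are hypothetical and are not taken from the reporter's code or from Kafka. The sketch also shows how a negative id turns into the "topicName--2" string once it is formatted as TopicName+"-"+PartitionId.

```java
// Hypothetical sketch; not the reporter's Partitioner and not Kafka code.
final class NegativePartitionSketch {

    // Buggy: Java's % keeps the sign of the dividend, so a negative hash
    // can yield a negative partition id.
    static int buggyPartition(int keyHash, int numPartitions) {
        return keyHash % numPartitions;
    }

    // Safer: Math.floorMod returns a value in [0, numPartitions)
    // whenever numPartitions is positive.
    static int safePartition(int keyHash, int numPartitions) {
        return Math.floorMod(keyHash, numPartitions);
    }

    // Mirrors the TopicName + "-" + PartitionId format described in the issue.
    static String describe(String topic, int partition) {
        return topic + "-" + partition;
    }

    public static void main(String[] args) {
        int keyHash = -8; // a negative hash, chosen for illustration
        System.out.println(describe("topicName", buggyPartition(keyHash, 3))); // topicName--2
        System.out.println(describe("topicName", safePartition(keyHash, 3)));  // topicName-1
    }
}
```

With a hash of -8 and 3 partitions, the buggy variant returns -2, and the second dash in "topicName--2" is the minus sign.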
> h3. Proposal:
>  # Producer code should check the partitionId before sending requests to 
> brokers.
>  # If there is a negative partition id, just throw an IllegalStateException.
>  # Such a quick check can save lots of time for people debugging their 
> producer code.
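
A minimal sketch of the kind of check proposed above, assuming it runs on the partition id returned by the Partitioner before the record is queued. The class name, method name, and error message are hypothetical. This is not the attached patch and not an existing Kafka API.

```java
// Hypothetical sketch of the proposed validation; not the actual Kafka patch.
final class PartitionCheckSketch {

    // Returns the partition unchanged when it is valid, and fails fast with a
    // descriptive message when a Partitioner has produced a negative id.
    static int requireNonNegativePartition(String topic, int partition) {
        if (partition < 0) {
            throw new IllegalStateException(
                "Invalid partition " + partition + " for topic " + topic
                    + ": partition ids must be non-negative");
        }
        return partition;
    }
}
```

Failing at this point would surface the bad id immediately, with the topic and value in the message, instead of as a TimeoutException about 30 seconds later.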



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
