[ 
https://issues.apache.org/jira/browse/KAFKA-13574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17468275#comment-17468275
 ] 

Jun Rao commented on KAFKA-13574:
---------------------------------

[~aphyr] : Thanks for reporting this issue. [~hachikuji] : Thanks for the 
investigation.

NOT_LEADER_OR_FOLLOWER is considered an indefinite error, which means that the 
produce request may or may not have succeeded. We could wait a bit longer to 
return a more precise response. However, we need to make sure that the record 
committed at offset N is indeed the one from the producer to be acknowledged. 
To do that, we could save the leader epoch after the produced records are 
appended to the leader's log. When checking the purgatory for completion, we 
could compare the records' committed leader epoch (probably through the leader 
epoch cache) with the expected one.
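
A minimal sketch of the epoch check described above, not broker code. Every 
name in it (EpochCache, PendingProduce, tryComplete, Outcome) is hypothetical 
and only models the idea: remember the leader epoch at append time, and 
acknowledge only if the record committed at that offset still carries the same 
epoch.

```java
import java.util.Map;
import java.util.OptionalInt;
import java.util.TreeMap;

public class EpochCheckedAck {

    /** Hypothetical stand-in for a leader epoch cache: epoch start offset -> epoch. */
    static final class EpochCache {
        private final TreeMap<Long, Integer> startOffsetToEpoch = new TreeMap<>();

        /** Records that `epoch` begins at `startOffset`, dropping entries it overwrites. */
        void assign(int epoch, long startOffset) {
            startOffsetToEpoch.tailMap(startOffset, true).clear();
            startOffsetToEpoch.put(startOffset, epoch);
        }

        /** Epoch of the record currently stored at `offset`, if known. */
        OptionalInt epochForOffset(long offset) {
            Map.Entry<Long, Integer> entry = startOffsetToEpoch.floorEntry(offset);
            return entry == null ? OptionalInt.empty() : OptionalInt.of(entry.getValue());
        }
    }

    enum Outcome { PENDING, ACKNOWLEDGE, INDEFINITE_ERROR }

    /** Hypothetical delayed produce: offset N plus the leader epoch saved at append time. */
    static final class PendingProduce {
        final long offset;
        final int appendEpoch;

        PendingProduce(long offset, int appendEpoch) {
            this.offset = offset;
            this.appendEpoch = appendEpoch;
        }
    }

    /** Completion check: succeed only if offset N is committed under the saved epoch. */
    static Outcome tryComplete(PendingProduce produce, long highWatermark, EpochCache cache) {
        // A record at offset N is committed once the high watermark is above N.
        if (highWatermark <= produce.offset) {
            return Outcome.PENDING;
        }
        OptionalInt committedEpoch = cache.epochForOffset(produce.offset);
        if (committedEpoch.isPresent() && committedEpoch.getAsInt() == produce.appendEpoch) {
            return Outcome.ACKNOWLEDGE;
        }
        // Offset N is committed, but by a different leader epoch: the record there
        // may not be the producer's, so the outcome stays indefinite.
        return Outcome.INDEFINITE_ERROR;
    }
}
```

In this model, a record appended at offset 10 in epoch 5 is acknowledged once 
the high watermark passes 10, unless a later epoch has since taken over that 
offset range, in which case the response stays indefinite.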

> NotLeaderOrFollowerException thrown for a successful send
> ---------------------------------------------------------
>
>                 Key: KAFKA-13574
>                 URL: https://issues.apache.org/jira/browse/KAFKA-13574
>             Project: Kafka
>          Issue Type: Bug
>          Components: clients
>    Affects Versions: 3.0.0
>         Environment: openjdk version "11.0.13" 2021-10-19
>            Reporter: Kyle Kingsbury
>            Priority: Minor
>              Labels: error-handling
>
> With org.apache.kafka/kafka-clients 3.0.0, under rare circumstances involving 
> multiple node and network failures, I've observed a call to `producer.send()` 
> throw `NotLeaderOrFollowerException` for a message which later appears in 
> `consumer.poll()` return values.
> I don't have a reliable repro case for this yet, but the case I hit involved 
> retries=1000, acks=all, and idempotence enabled. I suspect what might be 
> happening here is that an initial attempt to send the message makes it to the 
> server and is committed, but the acknowledgement is lost e.g. due to timeout; 
> the Kafka producer then automatically retries the send attempt, and on that 
> retry hits a NotLeaderOrFollowerException, which is thrown back to the 
> caller. If we interpret NotLeaderOrFollowerException as a definite failure, 
> then this would constitute an aborted read.
> I've seen issues like this in a number of databases around client or 
> server-internal retry mechanisms, and I think the thing to do is: rather than 
> throwing the most *recent* error, throw the *most indefinite*. That way 
> clients know that their request may have actually succeeded, and they won't 
> (e.g.) attempt to re-submit a non-idempotent request.
> As a side note: is there... perhaps documentation on which errors in Kafka 
> are supposed to be definite vs indefinite? NotLeaderOrFollowerException is a 
> subclass of RetriableException, but it looks like RetriableException is more 
> about transient vs permanent errors than whether it's safe to retry.
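
The reporter's suggestion quoted above (surface the most indefinite error 
rather than the most recent one) could be sketched roughly as follows. This is 
an illustration only, not the producer's actual retry logic: the Certainty 
enum, the AttemptError class, and the classification of each error are 
assumptions made for the example.

```java
import java.util.List;

public class MostIndefiniteError {

    /** Hypothetical classification: did the attempt definitely fail, or is the outcome unknown? */
    enum Certainty { DEFINITE_FAILURE, INDEFINITE }

    /** Hypothetical record of one failed send attempt. */
    static final class AttemptError {
        final String name;
        final Certainty certainty;

        AttemptError(String name, Certainty certainty) {
            this.name = name;
            this.certainty = certainty;
        }
    }

    /**
     * Picks the error to report to the caller after all retries are exhausted.
     * If any attempt had an unknown outcome, that error wins, because the
     * request may have succeeded; otherwise the most recent error is reported.
     */
    static AttemptError errorToSurface(List<AttemptError> errorsInOrder) {
        if (errorsInOrder.isEmpty()) {
            throw new IllegalArgumentException("no errors recorded");
        }
        for (AttemptError error : errorsInOrder) {
            if (error.certainty == Certainty.INDEFINITE) {
                return error;
            }
        }
        return errorsInOrder.get(errorsInOrder.size() - 1);
    }

    public static void main(String[] args) {
        // Scenario from the report: the first attempt times out (outcome unknown),
        // the retry fails with an error the caller would treat as definite.
        // The classifications here are illustrative assumptions only.
        List<AttemptError> errors = List.of(
            new AttemptError("REQUEST_TIMED_OUT", Certainty.INDEFINITE),
            new AttemptError("NOT_LEADER_OR_FOLLOWER", Certainty.DEFINITE_FAILURE));
        System.out.println(errorToSurface(errors).name);
    }
}
```

With this policy the caller sees the timeout from the first attempt, which 
signals that the message may already have been written.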


