[
https://issues.apache.org/jira/browse/KAFKA-14920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17900192#comment-17900192
]
Justine Olshan commented on KAFKA-14920:
----------------------------------------
Hey there. I think I meant: https://issues.apache.org/jira/browse/KAFKA-14884.
From the comment in the PR linked:
??While working on this change, we also noticed that when the add partitions
response returns a retriable error, sometimes the next batch succeeds before
the old one can retry. This exposed the known out of order sequence issue when
we have no state yet in the PSM. (We accept a later sequence and the earlier
one will get stuck in a retry loop.) In order to prevent this issue from
getting worse when verifying, we also set state about the tentative first
sequence. When we start to verify for a partition, and we have no batch
information for a given epoch, we also store the lowest sequence we see. That
way, in the case above, we will return a retriable error and prevent writing a
later sequence before the earlier one can be retried. Once any batch exists in
the state, we no longer track this sequence, since we have reliable sequence
information in the batch.??
For transactions in KIP-890 part 2, we plan to re-introduce an error that will
prevent the first produce request from writing if the sequence is not zero.
This should fix the issue fully for transactions, but idempotent producers will
still require some changes.
https://cwiki.apache.org/confluence/display/KAFKA/KIP-890%3A+Transactions+Server-Side+Defense#KIP890:TransactionsServerSideDefense-BumpEpochonEachTransactionforNewClients(1)
*Return Error for Non-Zero Sequence on New Producers*
For new clients, we can once again return an error for sequences that are
non-zero when there is no producer state present on the server. This will
indicate we missed the 0 sequence and we don't yet want to write to the log.
Previously this error was {{UNKNOWN_PRODUCER_ID}} but historically, this error
code was handled in a complicated way. Now this scenario can be covered with
the retriable OutOfOrderSequence error.
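As a rough illustration of the rule quoted from the KIP, the check on the first produce request can be sketched as below. This is only a sketch under assumptions: the class, method, and enum names are hypothetical, and the normal sequence validation that applies once producer state exists is left out.

```java
// Hypothetical sketch of the "non-zero sequence on new producers" rule.
// Names are illustrative only and do not come from the Kafka code base.
final class FirstSequenceCheckSketch {
    enum Outcome { PROCEED, RETRIABLE_OUT_OF_ORDER_SEQUENCE }

    // hasProducerState: whether the server already holds state for this producer.
    // firstSeq: first sequence number of the incoming batch.
    static Outcome checkFirstProduce(boolean hasProducerState, int firstSeq) {
        if (!hasProducerState && firstSeq != 0) {
            // Sequence 0 was missed, so do not write to the log yet.
            return Outcome.RETRIABLE_OUT_OF_ORDER_SEQUENCE;
        }
        // Either this is sequence 0, or existing state handles validation (omitted here).
        return Outcome.PROCEED;
    }
}
```

With no producer state, `checkFirstProduce(false, 5)` yields the retriable outcome, while `checkFirstProduce(false, 0)` proceeds to the write.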
> Address timeouts and out of order sequences
> -------------------------------------------
>
> Key: KAFKA-14920
> URL: https://issues.apache.org/jira/browse/KAFKA-14920
> Project: Kafka
> Issue Type: Sub-task
> Affects Versions: 3.6.0
> Reporter: Justine Olshan
> Assignee: Justine Olshan
> Priority: Blocker
> Fix For: 3.6.0
>
>
> KAFKA-14844 showed the destructive nature of a timeout on the first produce
> request for a topic partition (i.e. one that has no state in the PSM).
> Since we currently don't validate the first sequence (we will in part 2 of
> KIP-890), any transient error on the first produce can lead to out of order
> sequences that never recover.
> Originally, KAFKA-14561 relied on the producer's retry mechanism for these
> transient issues, but until that is fixed, we may need to retry from within the
> AddPartitionsManager instead. We addressed the concurrent transactions error, but
> there are other errors, like coordinator loading, that we could run into and
> that could increase out of order issues.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)