[
https://issues.apache.org/jira/browse/KAFKA-14053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17564222#comment-17564222
]
Daniel Urban commented on KAFKA-14053:
--
About to submit a PR, but wanted to bring up a few more points:
* When we bump in the middle of an ongoing transaction, the coordinator
handles it as a fencing operation and does not return a valid epoch - the
coordinator assumes that a new producer was started with the same ID, and
forces the InitProducerIdRequest to be retried. This results in a 2nd bump,
in which the epoch is returned.
* If we go with the simplest solution and bump only once, the producer must
transition into a fatal state. The producer cannot safely get a new epoch from
the coordinator, as sending the original epoch multiple times will first result
in CONCURRENT_TRANSACTIONS (while the ongoing transaction is being aborted) and
then in PRODUCER_FENCED (once the coordinator realizes that the epoch was
already bumped). It is also not safe to send an empty InitProducerIdRequest, as
that might fence a newer producer instance.
* In my PR, I try to bump twice. First, using the original epoch (which
will eventually get a PRODUCER_FENCED), then I increase the epoch by 1 on the
client side and try again. Conceptually, this might sound wrong (the producer
is using an epoch which wasn't returned by the coordinator), but I think that
it is theoretically safe to do. The first bump during an ongoing transaction
produces an epoch which is never returned to any producer, so retrying with
the locally increased epoch cannot collide with any other producer. If the bump
with the increased epoch succeeds, then there was no concurrent bump from
another instance, and the producer can continue using the new epoch. If the
increased epoch receives a PRODUCER_FENCED, then the producer was actually
fenced by another instance.
* Due to the previous point, the producer needs to make sure that it can
safely increase the epoch by 1 at any point. This means that if the producer
got Short.MAX_VALUE as the epoch, it won't be able to do the +1 increase
locally. To avoid this, I made a change in the producer to force an epoch reset
from the coordinator when the epoch is Short.MAX_VALUE.
* I understand that the local epoch increase and the intentional epoch reset
might not be ideal, but without them, the producer cannot be safely used after
a delivery timeout. So in my current understanding, a delivery timeout either
means a fatal error, or we need to accept the local epoch increase and the
epoch reset at Short.MAX_VALUE.
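
The double-bump recovery described above can be sketched roughly as follows. This is only an illustrative simulation - the Coordinator class, its initProducerId method, and recoverEpoch are hypothetical stand-ins for the real coordinator and producer internals, not Kafka's actual APIs:

```java
// Hypothetical sketch of the client-side double-bump after a delivery timeout.
// The coordinator's fencing behavior is simulated, not Kafka's real implementation.
public class EpochBumpSketch {

    // Simulated coordinator: a bump during an ongoing transaction fences the
    // epoch and aborts the transaction, but does not return the new epoch.
    static class Coordinator {
        short currentEpoch;
        boolean ongoingTxn;

        Coordinator(short epoch, boolean ongoingTxn) {
            this.currentEpoch = epoch;
            this.ongoingTxn = ongoingTxn;
        }

        // Returns the new epoch, or -1 to stand in for PRODUCER_FENCED.
        short initProducerId(short requestEpoch) {
            if (requestEpoch < currentEpoch) {
                return -1; // PRODUCER_FENCED: a newer epoch already exists
            }
            if (ongoingTxn) {
                // Bump to fence in-flight requests and abort the transaction,
                // but don't hand the new epoch to the caller; it must retry.
                currentEpoch++;
                ongoingTxn = false;
                return -1;
            }
            currentEpoch++;
            return currentEpoch; // 2nd bump: epoch is returned
        }
    }

    // Client side: bump with the original epoch; if fenced, retry once with a
    // locally incremented epoch before declaring a fatal state.
    static short recoverEpoch(Coordinator coordinator, short producerEpoch) {
        if (producerEpoch == Short.MAX_VALUE) {
            // Cannot do the +1 locally; a full epoch reset would be needed.
            throw new IllegalStateException("epoch exhausted; force a reset");
        }
        short result = coordinator.initProducerId(producerEpoch);
        if (result >= 0) {
            return result;
        }
        // The fenced-off epoch was never given to any producer, so retrying
        // with epoch + 1 cannot collide with another instance.
        short retried = coordinator.initProducerId((short) (producerEpoch + 1));
        if (retried < 0) {
            throw new IllegalStateException("fenced by another producer instance");
        }
        return retried;
    }

    public static void main(String[] args) {
        Coordinator c = new Coordinator((short) 5, true);
        System.out.println("recovered epoch: " + recoverEpoch(c, (short) 5));
    }
}
```

With an ongoing transaction and epoch 5, the first bump silently moves the coordinator to 6 and fences the caller; the retry with 6 succeeds and the producer ends up on epoch 7, consistent with the two-bump behavior described above.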
> Transactional producer should bump the epoch when a batch encounters delivery
> timeout
> -
>
> Key: KAFKA-14053
> URL: https://issues.apache.org/jira/browse/KAFKA-14053
> Project: Kafka
> Issue Type: Bug
> Reporter: Daniel Urban
> Assignee: Daniel Urban
> Priority: Major
>
> When a batch fails due to delivery timeout, it is possible that the batch is
> still in-flight. Due to underlying infra issues, it is possible that an
> EndTxnRequest and a WriteTxnMarkerRequest are processed before the in-flight
> batch is processed on the leader. This can cause transactional batches to be
> appended to the log after the corresponding abort marker.
> This can cause the LSO to be infinitely blocked in the partition, or can even
> violate processing guarantees, as the out-of-order batch can become part of
> the next transaction.
> Because of this, the producer should skip aborting the partition, and bump
> the epoch to fence the in-flight requests.
>
> More detail can be found here:
> [https://lists.apache.org/thread/8d2oblsjtdv7740glc37v79f0r7p99dp]
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)