[ https://issues.apache.org/jira/browse/KAFKA-20090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18054041#comment-18054041 ]

sanghyeok An edited comment on KAFKA-20090 at 1/24/26 2:17 AM:
---------------------------------------------------------------

[~jolshan] [~chia7712] 
{quote}agreed. Since this is a rare edge case, maybe we could just return error 
if isProducerEpochExhausted is true when handling an abort request for a 
time-out (aborted) transaction.
{quote}
 
I also think this suggestion is reasonable. 
In addition, if we go with this approach, it seems we may need to introduce a 
new checked error on the producer client side as well.
 

I’d also like to propose an additional idea.
Given TV2’s per-transaction epoch bumps, once the epoch reaches 
{{Short.MAX_VALUE}} it can no longer advance. In a rare/unexpected edge 
case—where a timeout-driven abort races with a late client-initiated ABORT near 
the max-epoch boundary—treating the late ABORT as an idempotent retry can 
inadvertently hand {{MAX_EPOCH}} back to the producer as a usable epoch, which 
may lead to an {{ONGOING}} transaction that neither completes nor times out.
 
As an additional defense-in-depth option (if we want to be stricter), we could 
also consider rejecting exhausted epochs ({{Short.MAX_VALUE}}) on “transaction 
entry” requests that effectively transition a {{transactionalId}} back to 
{{ONGOING}} (e.g., {{AddPartitionsToTxn}}), returning a ProducerEpochExhausted 
error.
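To make the idea concrete, here is a minimal sketch of such a guard. The class, 
method, and error names are illustrative assumptions, not actual coordinator 
code:

```java
// Hypothetical sketch of the exhausted-epoch guard proposed above;
// names are illustrative, not real Kafka coordinator code.
class EpochGuard {
    // Under TV2 the epoch is bumped per transaction, so once it reaches
    // Short.MAX_VALUE it can no longer advance.
    public static boolean isProducerEpochExhausted(short epoch) {
        return epoch == Short.MAX_VALUE;
    }

    // A "transaction entry" request (e.g. AddPartitionsToTxn) would be
    // rejected instead of transitioning the transactionalId to ONGOING.
    public static String admitTransactionEntry(short epoch) {
        return isProducerEpochExhausted(epoch)
                ? "PRODUCER_EPOCH_EXHAUSTED"
                : "NONE";
    }
}
```

The producer client would then need to surface this as a new checked error 
rather than retrying, as discussed above.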

 

What do you think? 

But I agree we should prioritize the targeted guard first to minimize behavior 
changes. :)



> TV2 can allow for ongoing transactions with max epoch that never complete
> -------------------------------------------------------------------------
>
>                 Key: KAFKA-20090
>                 URL: https://issues.apache.org/jira/browse/KAFKA-20090
>             Project: Kafka
>          Issue Type: Task
>            Reporter: Justine Olshan
>            Priority: Major
>
> When transaction version 2 was introduced, epoch bumps happen on every 
> transaction. 
> The original EndTransaction logic accounts for retries, and because of epoch 
> bumps we wanted to be careful not to fence ourselves. This means that for 
> EndTransaction retries, we have to check whether the epoch has been bumped 
> to recognize a retry.
> The original logic returns the current producer ID and epoch in the 
> transaction metadata when a retry has been identified. The normal end 
> transaction case with max epoch - 1 was considered and accounted for – the 
> state there is safe to return to the producer.
> However, we didn't consider the case of a fencing epoch bump at max 
> epoch - 1, where we also bump the epoch but don't create a new producer ID 
> and epoch. In that scenario the producer was expected to be fenced and call 
> InitProducerId, so this isn't a problem on its own, but it is a problem if 
> we return that state to the producer.
> There is a scenario where we race a timeout abort and an end transaction 
> abort at max epoch - 1: we can consider the EndTxn request a "retry" and 
> return max epoch as the current producer's epoch instead of fencing.
> 1. The fencing abort on transactional timeout bumps the epoch to max
> 2. The EndTxn request with max epoch - 1 is considered a "retry" and we 
> return max epoch
> 3. The producer can start a transaction since we don't check epochs on 
> starting transactions
> 4. We cannot commit this transaction with TV2 and we cannot timeout the 
> transaction. It is stuck in Ongoing forever. 
> I modified 
> [https://github.com/apache/kafka/blob/aad33e4e41aaa94b06f10a5be0094b717b98900f/core/src/test/scala/unit/kafka/coordinator/transaction/TransactionCoordinatorTest.scala#L1329]
>  to capture this behavior. I added the following code to the end:
> {code:java}
> // Transition to COMPLETE_ABORT since we can't do it via the
> // write-markers response callback
> txnMetadata.completeTransitionTo(new TxnTransitMetadata(
>   txnMetadata.producerId(),
>   txnMetadata.prevProducerId(),
>   txnMetadata.nextProducerId(),
>   Short.MaxValue,
>   (Short.MaxValue - 1).toShort,
>   txnTimeoutMs,
>   txnMetadata.pendingState().get(),
>   new util.HashSet[TopicPartition](),
>   txnMetadata.txnLastUpdateTimestamp(),
>   txnMetadata.txnLastUpdateTimestamp(),
>   TV_2))
>
> coordinator.handleEndTransaction(transactionalId, producerId,
>   epochAtMaxBoundary, TransactionResult.ABORT, TV_2, endTxnCallback)
>
> assertEquals(10, newProducerId)
> assertEquals(Short.MaxValue, newEpoch)
> assertEquals(Errors.NONE, error)
> {code}
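The race in steps 1-4 of the description can be modeled with a small 
hypothetical sketch. This is not the real coordinator logic; the retry check 
and names below are simplified assumptions about the behavior described above:

```java
// Simplified, hypothetical model of the max-epoch race described in the
// issue; not actual coordinator code.
class MaxEpochRace {
    // Step 1: a timeout-driven fencing abort bumps the epoch, landing on
    // Short.MAX_VALUE when the producer was at max epoch - 1.
    public static short fenceOnTimeout(short currentEpoch) {
        return (short) (currentEpoch + 1);
    }

    // Step 2: an EndTxn request is treated as a retry when the stored
    // epoch is exactly one above the request's epoch -- which also matches
    // the fencing bump, so the late abort is misread as a retry and
    // MAX_VALUE is handed back to the producer as a usable epoch.
    public static boolean looksLikeRetry(short storedEpoch, short requestEpoch) {
        return storedEpoch == (short) (requestEpoch + 1);
    }
}
```

With these assumptions, a fencing bump from max epoch - 1 is indistinguishable 
from a legitimate EndTxn retry, which is the crux of the stuck-ONGOING bug.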



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
