Re: [PR] KAFKA-19999 Transaction coordinator livelock caused by invalid producer epoch [kafka]

via GitHub Mon, 22 Dec 2025 21:00:30 -0800


m1a2st commented on code in PR #21176:
URL: https://github.com/apache/kafka/pull/21176#discussion_r2641934124



##########
storage/src/main/java/org/apache/kafka/storage/internals/log/ProducerAppendInfo.java:
##########
@@ -119,6 +119,21 @@ private void checkProducerEpoch(short producerEpoch, long 
offset, short transact
         boolean invalidEpoch = (transactionVersion >= 2) ? (producerEpoch <= 
current) : (producerEpoch < current);
 
         if (invalidEpoch) {
+            // TV2 Idempotent Retry Detection:
+            // When markerEpoch == currentEpoch and no transaction is ongoing, 
this is a retry
+            // of a marker that was already successfully written. Common 
scenarios:
+            // 1. Coordinator recovery: reloading PREPARE_COMMIT/ABORT from 
the transaction log
+            // 2. Network retry: marker was written but response was lost due 
to disconnection
+            // In both cases, the transaction has already ended 
(currentTxnFirstOffset is empty),
+            // so we can safely treat this as idempotent success.
+            if (transactionVersion >= 2 &&
+                    producerEpoch == current &&
+                    updatedEntry.currentTxnFirstOffset().isEmpty()) {
+                log.debug("Received duplicate transaction marker for producer 
{} with epoch {} " +
+                                "but transaction is no longer ongoing, 
treating as idempotent success",
+                        producerId, producerEpoch);
+                throw new IdempotentTransactionMarkerException();

Review Comment:
   If we simply return early from `ProducerAppendInfo#checkProducerEpoch`, we 
still need to prevent duplicate data from being written to the log. The 
validation happens before the actual append operation, so just returning allows 
the code to continue and write the duplicate marker anyway.
   
   The general produce path can handle duplicate writes more easily through 
conditional logic after validation. However, the write path for the offset 
topic appears to lack built-in idempotency logic. Therefore, throwing an 
exception provides a clean way to immediately break the append flow once we 
detect an idempotent retry. The exception propagates up through the call stack, 
preventing the duplicate write regardless of which code path is handling the 
marker. The caller can then catch this specific exception type and treat it as 
a successful operation.
   
   While using exceptions for control flow is not ideal, it's a pragmatic 
solution given the current code structure.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Re: [PR] KAFKA-19999 Transaction coordinator livelock caused by invalid producer epoch [kafka]

Reply via email to