jolshan commented on code in PR #18730:
URL: https://github.com/apache/kafka/pull/18730#discussion_r1936172868
##########
core/src/main/scala/kafka/coordinator/transaction/TransactionCoordinator.scala:
##########
@@ -408,13 +408,16 @@ class TransactionCoordinator(txnConfig: TransactionConfig,
// generate the new transaction metadata with added partitions
txnMetadata.inLock {
- if (txnMetadata.producerId != producerId) {
+ if (txnMetadata.pendingTransitionInProgress) {
+ // return a retriable exception to let the client backoff and retry
+ // This check is performed first so that the pending transition can complete before subsequent checks.
+ // With TV2, we may be transitioning over a producer epoch overflow, and the producer may be using the
+ // new producer ID that is still only in pending state.
Review Comment:
We were hitting the invalid producer ID mapping error in the overflow case.
Let me explain briefly.
For EndTxn, we don't return until the PrepareX transition has completed on
the state machine. For TV2, in both the epoch overflow and the normal case, the
epoch in the PrepareX state is the previous epoch + 1. (In the overflow case,
this is max short.)
At this point, the metadata is pending the CompleteX state. This is where the
values differ depending on the epoch. If the epoch overflowed, the CompleteX
state will contain a new producer ID and epoch 0. Otherwise it is the same as
PrepareX (same producer ID and previous epoch + 1).
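As a rough sketch of the two states described above (the names
`ProducerIdAndEpoch`, `prepareState`, and `completeState` are hypothetical and
simplified, not the actual TransactionMetadata code, and the new producer ID is
assumed to be supplied by the caller):

```scala
object EpochTransitionSketch {
  // Hypothetical, simplified (producerId, epoch) pair held in the transaction metadata.
  final case class ProducerIdAndEpoch(producerId: Long, epoch: Short)

  // PrepareX under TV2: same producer ID, previous epoch + 1.
  // Assumes current.epoch < Short.MaxValue, so the bump itself does not wrap.
  def prepareState(current: ProducerIdAndEpoch): ProducerIdAndEpoch =
    current.copy(epoch = (current.epoch + 1).toShort)

  // CompleteX: if the bumped epoch reached max short, move to a new producer ID
  // at epoch 0; otherwise the values are the same as in PrepareX.
  def completeState(prepared: ProducerIdAndEpoch, newProducerId: Long): ProducerIdAndEpoch =
    if (prepared.epoch == Short.MaxValue) ProducerIdAndEpoch(newProducerId, 0.toShort)
    else prepared
}
```

In the overflow case the two states disagree on both producer ID and epoch,
which is why it matters which of them is returned to the producer.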
We intended to return the values of the CompleteX state to the producer so
the producer can use the correct producer ID and epoch going forward, but we
were accidentally returning the PrepareX state instead. This was the first bug.
We would hit invalid pid mapping when the transition completed because the
state would contain the new producer ID while the producer was still trying to
use the one whose epoch had overflowed. Thus, producer ID mismatch.
When I fixed this bug by returning the correct values to the producer, we
had the opposite problem. If the producer started using the new producer ID
while the CompleteX state was still pending, the metadata would still hold the
old producer ID, so we would get the mismatch in the other direction. To avoid
this, we should return a retriable error and wait for the state to complete its
transition rather than returning the fatal invalid pid mapping.
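To illustrate the ordering of the checks, here is a minimal sketch. The
`Metadata` class and the `Result` values are hypothetical stand-ins for the real
transaction metadata and error codes; only `pendingTransitionInProgress` and the
producer ID comparison come from the diff above.

```scala
object AddPartitionsCheckSketch {
  sealed trait Result
  case object Retriable extends Result          // client backs off and retries
  case object InvalidPidMapping extends Result  // fatal for the producer
  case object Proceed extends Result

  // Hypothetical, simplified view of the coordinator's transaction metadata.
  final case class Metadata(producerId: Long, pendingTransitionInProgress: Boolean)

  // The pending-transition check runs first, so a producer that is already using
  // the new producer ID gets a retriable error instead of a fatal mismatch.
  def validate(md: Metadata, requestProducerId: Long): Result =
    if (md.pendingTransitionInProgress) Retriable
    else if (md.producerId != requestProducerId) InvalidPidMapping
    else Proceed
}
```

For example, with the metadata still holding the old producer ID 5 and a
transition pending, a request that uses the new producer ID 9 yields `Retriable`.
Once the transition completes and the metadata holds 9, the same request yields
`Proceed`.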