[
https://issues.apache.org/jira/browse/KAFKA-17877?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
David Jacot resolved KAFKA-17877.
---------------------------------
Fix Version/s: 4.0.0
Resolution: Fixed
> IllegalStateException: missing producer id from the WriteTxnMarkersResponse
> ---------------------------------------------------------------------------
>
> Key: KAFKA-17877
> URL: https://issues.apache.org/jira/browse/KAFKA-17877
> Project: Kafka
> Issue Type: Bug
> Reporter: Calvin Liu
> Assignee: Calvin Liu
> Priority: Major
> Fix For: 4.0.0
>
>
> {code:java}
> java.lang.IllegalStateException: WriteTxnMarkerResponse for
> lkc-devcv9jg9n_transaction-bench-transaction-id-72UwIuNVQkOxl4y_OEBAlA does
> not contain expected error map for producer id 8308
> {code}
> [https://github.com/apache/kafka/blob/trunk/core/src/main/scala/kafka/coordinator/transaction/TransactionMarkerRequestCompletionHandler.scala#L100]
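> For reference, a minimal, self-contained sketch of the check behind that
> exception (the method and parameter names here are illustrative stand-ins for
> the linked handler code, not the exact source):
> {code:scala}
> // Illustrative sketch only: errorsByProducerId stands in for the per-producer
> // error maps carried in the WriteTxnMarkersResponse.
> def checkMarkerErrors(errorsByProducerId: Map[Long, Map[String, Short]],
>                       transactionalId: String,
>                       producerId: Long): Unit = {
>   // a marker whose producer id has no entry triggers the exception quoted above
>   if (!errorsByProducerId.contains(producerId))
>     throw new IllegalStateException(s"WriteTxnMarkerResponse for $transactionalId does " +
>       s"not contain expected error map for producer id $producerId")
> }
> {code}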
> ------
> It is a bug on the data partition side. The leader may return the response
> early without all of the producer IDs included in the response.
> Consider the following 2 cases:
> # We have 2 markers to append, one for producer-0, one for producer-1
> # When we first process producer-0, it appends a marker to
> __consumer_offsets.
> # The __consumer_offsets append finishes very quickly because the group
> coordinator is no longer the leader, so the coordinator directly returns
> NOT_LEADER_OR_FOLLOWER. In its callback, it calls {{maybeComplete()}} for the
> first time, and because there is only one partition to append, it goes on to
> call {{maybeSendResponseCallback()}} and decrement {{numAppends}}.
> # Then it calls the replica manager append even though there is nothing left
> to write; in that callback, it calls {{maybeComplete()}} for the second time,
> which also decrements {{numAppends}} (see the sketch after this list).
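> A minimal, runnable sketch of this sequential double completion (the names
> {{numAppends}}, {{markerResults}}, {{maybeComplete}} and
> {{maybeSendResponseCallback}} mirror the description above, not the exact
> broker source):
> {code:scala}
> import java.util.concurrent.atomic.AtomicInteger
>
> object Case1Sketch extends App {
>   // one slot per producer marker: producer-0 and producer-1
>   val numAppends = new AtomicInteger(2)
>   // producer-0 only targets __consumer_offsets
>   val partitionsToAppend = 1
>   // partition -> error code, filled by the append callbacks
>   var markerResults = Map.empty[String, String]
>
>   def maybeSendResponseCallback(): Unit =
>     if (numAppends.decrementAndGet() == 0)
>       println(s"response sent covering only: ${markerResults.keys.mkString(", ")}")
>
>   def maybeComplete(): Unit =
>     if (markerResults.size == partitionsToAppend)
>       maybeSendResponseCallback()
>
>   // 1st callback: the group coordinator answers immediately with NOT_LEADER_OR_FOLLOWER
>   markerResults += "__consumer_offsets-0" -> "NOT_LEADER_OR_FOLLOWER"
>   maybeComplete()   // numAppends: 2 -> 1
>   // 2nd callback: the replica manager append with nothing to write still completes
>   maybeComplete()   // numAppends: 1 -> 0, response sent before producer-1 is processed
> }
> {code}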
>
> # We have 2 markers to append, one for producer-0, one for producer-1
> # When we first process producer-0, it appends markers to
> __consumer_offsets and to a data topic foo.
> # The 2 appends are handled by the group coordinator and the replica manager
> asynchronously.
> # There can be a race where both appends finish together: they fill
> {{markerResults}} at the same time and then each calls {{maybeComplete()}}.
> Because the {{partitionsWithCompatibleMessageFormat.size ==
> markerResults.size}} condition is satisfied for both, both {{maybeComplete}}
> calls go through, decrement {{numAppends}} twice and cause a premature
> response (see the sketch after this list).
> Remember, because we only have 2 markers, the initial value of
> {{numAppends}} is also 2. So in step 4, the request can finish without ever
> processing producer-1, which causes producer-1 to be missing from the
> WriteTxnMarkers response.
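> A sketch of this race, again with illustrative names: the two append callbacks
> for producer-0 run concurrently and, depending on the interleaving, both may
> observe the "all partitions answered" condition and both decrement
> {{numAppends}}, which already completes the whole request:
> {code:scala}
> import java.util.concurrent.{ConcurrentHashMap, CountDownLatch}
> import java.util.concurrent.atomic.AtomicInteger
>
> object Case2Sketch extends App {
>   val numAppends = new AtomicInteger(2)            // producer-0 and producer-1
>   val partitionsWithCompatibleMessageFormat = 2    // __consumer_offsets-0 and foo-0
>   val markerResults = new ConcurrentHashMap[String, String]()
>
>   def maybeSendResponseCallback(): Unit =
>     if (numAppends.decrementAndGet() == 0)
>       println("premature response: producer-1 was never appended")
>
>   def maybeComplete(): Unit =
>     // if both callbacks fill markerResults before either checks the size,
>     // both calls pass this condition and both decrement numAppends
>     if (markerResults.size == partitionsWithCompatibleMessageFormat)
>       maybeSendResponseCallback()
>
>   val start = new CountDownLatch(1)
>   val callbacks = Seq("__consumer_offsets-0", "foo-0").map { partition =>
>     new Thread(() => {
>       start.await()
>       markerResults.put(partition, "NONE")
>       maybeComplete()
>     })
>   }
>   callbacks.foreach(_.start())
>   start.countDown()                                // release both callbacks at once
>   callbacks.foreach(_.join())
> }
> {code}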
> ----
> As a result, the txn coordinator does not update the txn state correctly,
> even though the markers may have been written to the data partitions. This
> impacts clients: the client believes the txn is completed, but when it tries
> to send any request for the new transaction with the same transactional ID,
> the request fails with CONCURRENT_TRANSACTIONS.
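> For illustration, the kind of client sequence that hits this (the bootstrap
> server, topic and transactional id below are placeholders, not taken from the
> report): the first transaction looks committed to the client, but because the
> coordinator never completed the state transition, the broker rejects the next
> transaction's requests with CONCURRENT_TRANSACTIONS.
> {code:scala}
> import java.util.Properties
> import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
>
> object ClientImpactSketch extends App {
>   val props = new Properties()
>   props.put("bootstrap.servers", "localhost:9092")       // placeholder
>   props.put("transactional.id", "bench-transaction-id")  // placeholder
>   props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
>   props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
>
>   val producer = new KafkaProducer[String, String](props)
>   producer.initTransactions()
>
>   producer.beginTransaction()
>   producer.send(new ProducerRecord("foo", "k", "v"))
>   producer.commitTransaction()   // the client sees the commit as completed
>
>   // next transaction with the same transactional id: per the description above,
>   // the coordinator still considers the previous transaction in progress, so
>   // the broker answers these requests with CONCURRENT_TRANSACTIONS
>   producer.beginTransaction()
>   producer.send(new ProducerRecord("foo", "k", "v"))
>   producer.commitTransaction()
>
>   producer.close()
> }
> {code}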
> Note, this can only happen with the KIP-848 coordinator enabled.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)