[ 
https://issues.apache.org/jira/browse/KAFKA-17877?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Calvin Liu updated KAFKA-17877:
-------------------------------
    Description: 
{code:java}
java.lang.IllegalStateException: WriteTxnMarkerResponse for 
lkc-devcv9jg9n_transaction-bench-transaction-id-72UwIuNVQkOxl4y_OEBAlA does not 
contain expected error map for producer id 8308
{code}
[https://github.com/apache/kafka/blob/trunk/core/src/main/scala/kafka/coordinator/transaction/TransactionMarkerRequestCompletionHandler.scala#L100]

------

This is a bug on the data partition side. The partition leader may return the 
WriteTxnMarkers response early, before every producer ID has been included in it.

Consider the following case:
 # There are 2 markers to append, one for producer-0 and one for producer-1.
 # Producer-0 is processed first, and its marker is appended to 
__consumer_offsets.
 # The __consumer_offsets append finishes very quickly because the group 
coordinator is no longer the leader, so the coordinator returns 
NOT_LEADER_OR_FOLLOWER directly. Its callback calls {{maybeComplete()}} for 
the first time and, because there is only one partition to append, goes on to 
call {{maybeSendResponseCallback()}} and decrement {{numAppends}}.
 # The replica manager append is then called with nothing to append. Its 
callback calls {{maybeComplete()}} a second time, which decrements 
{{numAppends}} again.

Because there are only 2 markers, the initial value of {{numAppends}} is also 
2. Step 4 therefore brings it to 0 and completes the request before producer-1 
is processed, so producer-1 is missing from the WriteTxnMarkers response.
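
The double decrement described above can be illustrated with a minimal, 
self-contained sketch. It is not the broker code: the class 
{{NumAppendsSketch}}, the {{simulate}} method and its boolean flag are 
hypothetical names introduced here, and only the {{numAppends}} counter and the 
two callbacks per marker are modeled after the description.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

public class NumAppendsSketch {

    /**
     * Hypothetical model of the counter logic. Returns the producers that are
     * present in the response at the moment the response is "sent".
     */
    public static List<String> simulate(boolean decrementTwicePerMarker) {
        List<String> markers = List.of("producer-0", "producer-1");
        // One expected append per marker, as in the description.
        AtomicInteger numAppends = new AtomicInteger(markers.size());
        List<String> errorsByProducer = new ArrayList<>();
        List<String> sentResponse = new ArrayList<>();

        // Stand-in for maybeSendResponseCallback(): send once the counter hits 0.
        Runnable maybeSendResponse = () -> {
            if (numAppends.decrementAndGet() == 0) {
                sentResponse.addAll(errorsByProducer);
            }
        };

        for (String producer : markers) {
            errorsByProducer.add(producer);
            // Callback of the __consumer_offsets append (fails fast).
            maybeSendResponse.run();
            if (decrementTwicePerMarker) {
                // Callback of the empty replica manager append (step 4).
                maybeSendResponse.run();
            }
        }
        return sentResponse;
    }

    public static void main(String[] args) {
        System.out.println("two decrements per marker: " + simulate(true));
        System.out.println("one decrement per marker:  " + simulate(false));
    }
}
```

With two decrements per marker the counter reaches 0 while only producer-0 has 
been recorded, which matches the missing producer ID in the exception. With one 
decrement per marker both producers are in the response when it is sent.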
----
As a result, the txn coordinator will not update the txn state correctly, even 
though the markers may have been written to the data partitions. This affects 
clients: the client believes the txn is completed, but when it sends any 
request for a new transaction with the same transaction ID, the request fails 
with CONCURRENT_TRANSACTIONS.

Note: this can only happen when the KIP-848 coordinator is enabled.

> IllegalStateException: missing producer id from the WriteTxnMarkersResponse
> ---------------------------------------------------------------------------
>
>                 Key: KAFKA-17877
>                 URL: https://issues.apache.org/jira/browse/KAFKA-17877
>             Project: Kafka
>          Issue Type: Bug
>            Reporter: Calvin Liu
>            Assignee: Calvin Liu
>            Priority: Major



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
