[ 
https://issues.apache.org/jira/browse/KAFKA-16352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17824512#comment-17824512
 ] 

Justine Olshan commented on KAFKA-16352:
----------------------------------------

Thanks for filing [~alivshits]. If I understand correctly, the transaction is 
completed on the data partition side (in terms of writing the marker, lso, etc) 
but the coordinator is not able to complete the final book-keeping so we are 
unable to continue using the transactional id.

> Transaction may get get stuck in PrepareCommit or PrepareAbort state
> --------------------------------------------------------------------
>
>                 Key: KAFKA-16352
>                 URL: https://issues.apache.org/jira/browse/KAFKA-16352
>             Project: Kafka
>          Issue Type: Bug
>          Components: core
>            Reporter: Artem Livshits
>            Assignee: Artem Livshits
>            Priority: Major
>
> A transaction took a long time to complete, trying to restart a producer 
> would lead to CONCURRENT_TRANSACTION errors.  Investigation has shown that 
> the transaction was stuck in PrepareCommit for a few days:
> (current time when the investigation happened: Feb 27 2024), transaction 
> state:
> {{Type   |Name                  |Value}}
> {{-----------------------------------------------------------------------------------------}}
> {{ref    |transactionalId       |xxx-yyy}}
> {{long   |producerId            |299364}}
> {{ref    |state                 |kafka.coordinator.transaction.PrepareCommit$ 
> @ 0x44fe22760}}
> {{long   |txnStartTimestamp     |1708619624810  Thu Feb 22 2024 16:33:44.810 
> GMT+0000}}
> {{long   |txnLastUpdateTimestamp|1708619625335  Thu Feb 22 2024 16:33:45.335 
> GMT+0000}}
> {{-----------------------------------------------------------------------------------------}}
> The partition list was empty and transactionsWithPendingMarkers didn't 
> contain the reference to the transactional state.  In the log there were the 
> following relevant messages:
> {{22 Feb 2024 @ 16:33:45.623 UTC [Transaction State Manager 1]: Completed 
> loading transaction metadata from __transaction_state-3 for coordinator epoch 
> 611}}
> (this is the partition that contains the transactional id).  After the data 
> is loaded, it sends out markers and etc.
> Then there is this message:
> {{22 Feb 2024 @ 16:33:45.696 UTC [Transaction Marker Request Completion 
> Handler 4]: Transaction coordinator epoch for xxx-yyy has changed from 610 to 
> 611; cancel sending transaction markers TxnMarkerEntry\{producerId=299364, 
> producerEpoch=1005, coordinatorEpoch=610, result=COMMIT, 
> partitions=[foo-bar]} to the brokers}}
> this message is logged just before the state is removed 
> transactionsWithPendingMarkers, but the state apparently contained the entry 
> that was created by the load operation.  So the sequence of events probably 
> looked like the following:
>  # partition load completed
>  # commit markers were sent for transactional id xxx-yyy; entry in 
> transactionsWithPendingMarkers was created
>  # zombie reply from the previous epoch completed, removed entry from 
> transactionsWithPendingMarkers
>  # commit markers properly completed, but couldn't transition to 
> CommitComplete state because transactionsWithPendingMarkers didn't have the 
> proper entry, so it got stuck there until the broker was restarted
> Looking at the code there are a few cases that could lead to similar race 
> conditions.  The fix it to keep track of the PendingCompleteTxn value that 
> was used when sending the marker, so that we can only remove the state that 
> was created when the marker was sent and not accidentally remove the state 
> someone else created.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to