dajac opened a new pull request, #18206:
URL: https://github.com/apache/kafka/pull/18206

   We have observed the error below in some clusters:
   
   Uncaught exception in scheduled task 'handleTxnCompletion-902667' 
exception.message:Trying to complete a transactional offset commit for 
producerId *** and groupId *** even though the offset commit record itself 
hasn't been appended to the log.
   
   When a transaction is completed, the transaction coordinator sends a 
WriteTxnMarkers request to all the partitions involved in the transaction to 
write the markers to them. When the broker receives it, it writes the markers 
and if markers are written to the __consumer_offsets partitions, it informs the 
group coordinator that it can materialize the pending transactional offsets in 
its main cache. The group coordinator does this asynchronously since Apache 
Kafka 2.0, see this patch.
   
   The above error happens when the asynchronous operation is executed by the 
scheduler and the operation finds that there are pending transactional offsets 
that were not written yet. How come?
   
   There is actually an issue in the steps described above. The group 
coordinator does not wait until the asynchronous operation completes before 
returning to the API layer. Hence the WriteTxnMarkers response may be sent back 
to the transaction coordinator before the async operation has actually 
completed. Hence it is also possible for the next transactional produce to 
start before the operation is completed. This could explain why the group 
coordinator has pending transactional offsets that are not written yet.
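   
   The race described above can be illustrated with a small, deterministic toy 
model. This is only a sketch under stated assumptions: the class, its methods, 
and the manually drained task queue standing in for the scheduler are all 
hypothetical and are not the actual group coordinator code. The 
`waitForCompletion` flag models a handler that lets the scheduled operation 
finish before the WriteTxnMarkers response is returned.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model only: these names are hypothetical, not Kafka's real classes.
public class TxnMarkerRaceSketch {

    // A pending transactional offset; 'appended' is true once the offset
    // commit record itself has been written to the log.
    static final class PendingOffset {
        final long offset;
        final boolean appended;

        PendingOffset(long offset, boolean appended) {
            this.offset = offset;
            this.appended = appended;
        }
    }

    // Stand-in for the scheduler: tasks run only when runScheduledTasks() is called.
    private final Deque<Runnable> scheduler = new ArrayDeque<>();
    // producerId -> pending transactional offsets for that producer
    private final Map<Long, List<PendingOffset>> pending = new HashMap<>();

    void addPendingOffset(long producerId, long offset, boolean appended) {
        pending.computeIfAbsent(producerId, id -> new ArrayList<>())
               .add(new PendingOffset(offset, appended));
    }

    // Commit marker written: schedule the materialization and return immediately.
    void onCommitMarkerWritten(long producerId) {
        scheduler.add(() -> completePending(producerId));
    }

    // Moves pending offsets to the main cache (omitted here) and checks that
    // every one of them was actually appended to the log.
    void completePending(long producerId) {
        List<PendingOffset> offsets = pending.remove(producerId);
        if (offsets == null) {
            return;
        }
        for (PendingOffset o : offsets) {
            if (!o.appended) {
                throw new IllegalStateException(
                    "Completing offset commit that has not been appended: " + o.offset);
            }
        }
    }

    void runScheduledTasks() {
        Runnable task;
        while ((task = scheduler.poll()) != null) {
            task.run();
        }
    }

    // Returns true if the scheduled completion ran without error.
    public static boolean simulate(boolean waitForCompletion) {
        TxnMarkerRaceSketch gc = new TxnMarkerRaceSketch();
        gc.addPendingOffset(1L, 100L, true);   // transaction 1, record appended
        gc.onCommitMarkerWritten(1L);          // marker for transaction 1 written
        if (waitForCompletion) {
            gc.runScheduledTasks();            // finish before "responding"
        }
        // The response has been sent; the next transaction starts and its
        // offset commit record is still in flight.
        gc.addPendingOffset(1L, 200L, false);
        try {
            gc.runScheduledTasks();
            return true;
        } catch (IllegalStateException e) {
            return false;
        }
    }
}
```

   In this model, `simulate(false)` returns `false` because the late completion 
task sees the second transaction's unappended offset, while `simulate(true)` 
returns `true`.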
   
   There is a similar issue when the transaction is aborted. However, on this 
path we don't have any checks to verify whether all the pending transactional 
offsets have been written, so we don't see any errors in our logs. Due to 
the same race condition, it is possible to actually remove the wrong pending 
transactional offsets.
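   
   The abort path can be sketched the same way. Again, this is a hypothetical 
toy model (invented class and method names, a manually drained queue in place 
of the scheduler), not the real coordinator code. It only shows how an 
unchecked, late-running cleanup could silently drop offsets staged by the next 
transaction.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model only: these names are hypothetical, not Kafka's real classes.
public class TxnAbortRaceSketch {

    // Stand-in for the scheduler: tasks run only when runScheduledTasks() is called.
    private final Deque<Runnable> scheduler = new ArrayDeque<>();
    // producerId -> offsets staged by that producer's open transaction
    private final Map<Long, List<Long>> pending = new HashMap<>();

    void stageOffset(long producerId, long offset) {
        pending.computeIfAbsent(producerId, id -> new ArrayList<>()).add(offset);
    }

    // Abort marker written: schedule the cleanup and return immediately.
    // The cleanup drops everything pending for the producer, with no checks.
    void onAbortMarkerWritten(long producerId) {
        scheduler.add(() -> pending.remove(producerId));
    }

    void runScheduledTasks() {
        Runnable task;
        while ((task = scheduler.poll()) != null) {
            task.run();
        }
    }

    // Returns the offsets still pending for producer 1 at the end of the run.
    public static List<Long> simulate(boolean waitForCompletion) {
        TxnAbortRaceSketch gc = new TxnAbortRaceSketch();
        gc.stageOffset(1L, 100L);       // transaction 1 stages an offset
        gc.onAbortMarkerWritten(1L);    // transaction 1 is aborted
        if (waitForCompletion) {
            gc.runScheduledTasks();     // finish cleanup before "responding"
        }
        gc.stageOffset(1L, 200L);       // transaction 2 stages an offset
        gc.runScheduledTasks();         // late cleanup (if any) runs now
        return pending(gc);
    }

    private static List<Long> pending(TxnAbortRaceSketch gc) {
        return gc.pending.getOrDefault(1L, new ArrayList<>());
    }
}
```

   Here `simulate(false)` returns an empty list: the offset staged by the 
second transaction was removed without any error. `simulate(true)` keeps it.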
   
   PS: The new group coordinator is not impacted by this bug.
   
   Reviewers: Justine Olshan <[email protected]>
   
   ### Committer Checklist (excluded from commit message)
   - [ ] Verify design and implementation 
   - [ ] Verify test coverage and CI build status
   - [ ] Verify documentation (including upgrade notes)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
