Lianet Magrans created KAFKA-16954:
--------------------------------------
Summary: Move consumer leave operations on close to background
thread
Key: KAFKA-16954
URL: https://issues.apache.org/jira/browse/KAFKA-16954
Project: Kafka
Issue Type: Bug
Reporter: Lianet Magrans
When a consumer unsubscribes, the app thread simply triggers an Unsubscribe
event that will take care of it all in the background thread: release
assignment (callbacks), clear assigned partitions, and send leave group HB.
On the contrary, when a consumer is closed, these actions happen in both
threads:
* release assignment -> in the app thread by directly running the callbacks
* clear assignment -> in app thread by updating the subscriptionState
* send leave group HB -> in the background thread via an event LeaveOnClose
This situation could lead to race conditions, mainly because of the close
updating the subscription state in the app thread, when other operations in the
background could be already running based on it. Ex.
* unsubscribe in app thread (triggers background UnsubscribeEvent to revoke
and leave)
* unsubscribe fails (ex. interrupted, leaving operation running in the
background thread to revoke partitions and leave)
* consumer close (will revoke and clear assignment in the app thread)
* UnsubscribeEvent in the background may fail by trying to revoke partitions
that it does not own anymore - _No current assignment for partition ..._
A basic check has been added to the background thread revocation to avoid the
race condition, ensuring that we only revoke partitions we own, but still we
should avoid the root cause, which is updating the assignment on the app
thread. We should consider having the close operation as a single LeaveOnClose
event handled in the background. That even already takes cares of revoking the
partitions and clearing assignment on the background, so no need to take care
of it in the app thread. We should only ensure that we processBackgroundEvents
until the LeaveOnClose completes (to allow for callbacks to run in the app
thread)
Trying to understand the current approach, I imagine the initial motivation to
have the callabacks (and assignment cleared) in the app thread was to avoid the
back-and-forth: app thread close -> background thread leave event -> app thread
to run callback -> background thread to clear assignment and send HB. But
updating the assignment on the app thread ends up being problematic, as it
mainly happens in the background so it opens up the door for race conditions on
the subscription state.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)