Guozhang Wang created KAFKA-10614:
-------------------------------------
Summary: Group coordinator onElection/onResignation should guard
against leader epoch
Key: KAFKA-10614
URL: https://issues.apache.org/jira/browse/KAFKA-10614
Project: Kafka
Issue Type: Bug
Components: core
Reporter: Guozhang Wang
When a sequence of LeaderAndISR or StopReplica requests is sent from
different controllers, causing the group coordinator to elect / resign, the
resulting events may be re-ordered due to a race condition. For example:
1) The first LeaderAndISR request is received from the old controller, asking
the broker to resign as the group coordinator.
2) The second LeaderAndISR request is received from the new controller, asking
the broker to be elected as the group coordinator.
3) Although the threads handling requests 1) and 2) are synchronized on the
replica manager, their callback {{onLeadershipChange}} triggers
{{onElection/onResignation}}, which schedule the loading / unloading on
background threads that are not synchronized.
4) As a result, the work scheduled by {{onElection}} may be executed first,
followed by that of {{onResignation}}. The coordinator would then not
recognize itself as the coordinator and hence would respond to any coordinator
request with {{NOT_COORDINATOR}}.
Here are two proposals off the top of my head:
1) Let the scheduled load / unload function keep the passed-in leader epoch,
and also materialize the epoch in memory. Then, when executing the unloading,
check against the leader epoch.
2) This may be a bit simpler: use a single background thread working on a
FIFO queue of loading / unloading jobs. Since the callers are actually
synchronized on the replica manager and their order is preserved, the enqueued
loading / unloading jobs would be correctly ordered as well. In that case we
would avoid the reordering.
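
To make the two proposals concrete, here is a minimal Java sketch. It is not
the Kafka code: the class names, method signatures, and the use of an
in-memory set as a stand-in for loading / unloading group metadata are all
assumptions made for illustration. Treating only a strictly smaller epoch as
stale is likewise an assumption. The first class guards each job with the
leader epoch (proposal 1); the second runs jobs on a single worker thread in
submission order (proposal 2).

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Proposal 1 (hypothetical sketch): remember the last leader epoch applied per
// partition and skip any load / unload job that carries an older epoch.
final class EpochGuardedCoordinator {
    private final Map<Integer, Integer> lastEpoch = new HashMap<>();
    private final Set<Integer> owned = new HashSet<>();

    // Returns false when the job is stale and was skipped.
    synchronized boolean onElection(int partition, int leaderEpoch) {
        if (isStale(partition, leaderEpoch)) return false;
        lastEpoch.put(partition, leaderEpoch);
        owned.add(partition);    // stands in for loading group metadata
        return true;
    }

    synchronized boolean onResignation(int partition, int leaderEpoch) {
        if (isStale(partition, leaderEpoch)) return false;
        lastEpoch.put(partition, leaderEpoch);
        owned.remove(partition); // stands in for unloading group metadata
        return true;
    }

    synchronized boolean isCoordinator(int partition) {
        return owned.contains(partition);
    }

    // Assumption: only a strictly smaller epoch counts as stale.
    private boolean isStale(int partition, int leaderEpoch) {
        Integer seen = lastEpoch.get(partition);
        return seen != null && leaderEpoch < seen;
    }

    public static void main(String[] args) {
        EpochGuardedCoordinator guarded = new EpochGuardedCoordinator();
        guarded.onElection(0, 6);                       // new controller's job runs first
        boolean applied = guarded.onResignation(0, 5);  // old controller's job runs late
        System.out.println("stale resignation applied: " + applied);
        System.out.println("still coordinator: " + guarded.isCoordinator(0));

        FifoCoordinatorJobs fifo = new FifoCoordinatorJobs();
        fifo.scheduleUnload(0);  // resignation enqueued first
        fifo.scheduleLoad(0);    // election enqueued second
        fifo.shutdownAndWait();
        System.out.println("coordinator after FIFO jobs: " + fifo.isCoordinator(0));
    }
}

// Proposal 2 (hypothetical sketch): a single worker thread drains the jobs in
// the order in which they were submitted, so no reordering can occur.
final class FifoCoordinatorJobs {
    private final ExecutorService worker = Executors.newSingleThreadExecutor();
    private final Set<Integer> owned = ConcurrentHashMap.newKeySet();

    void scheduleLoad(int partition) {
        worker.execute(() -> owned.add(partition));
    }

    void scheduleUnload(int partition) {
        worker.execute(() -> owned.remove(partition));
    }

    boolean isCoordinator(int partition) {
        return owned.contains(partition);
    }

    // Stops accepting jobs and waits for the queued ones to finish.
    void shutdownAndWait() {
        worker.shutdown();
        try {
            if (!worker.awaitTermination(5, TimeUnit.SECONDS)) {
                throw new IllegalStateException("jobs still running");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }
}
```

With the epoch guard, a late resignation from the old controller is skipped
even if its background job runs after the election. With the FIFO worker, the
ordering established under the replica manager lock carries over to the
background jobs, provided the jobs are enqueued while that lock is held.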
--
This message was sent by Atlassian Jira
(v8.3.4#803005)