[jira] [Created] (KAFKA-10614) Group coordinator onElection/onResignation should guard against leader epoch

Guozhang Wang (Jira) Wed, 14 Oct 2020 21:42:11 -0700

Guozhang Wang created KAFKA-10614:
-------------------------------------

             Summary: Group coordinator onElection/onResignation should guard against leader epoch
                 Key: KAFKA-10614
                 URL: https://issues.apache.org/jira/browse/KAFKA-10614
             Project: Kafka
          Issue Type: Bug
          Components: core
            Reporter: Guozhang Wang



When a sequence of LeaderAndISR or StopReplica requests sent from different 
controllers causes the group coordinator to elect / resign, the resulting 
events may be reordered due to a race condition. For example:

1) The first LeaderAndISR request is received from the old controller, telling 
the broker to resign as the group coordinator.
2) The second LeaderAndISR request is received from the new controller, telling 
the broker to be elected as the group coordinator.
3) Although the threads handling requests 1) and 2) are synchronized on the 
replica manager, their callback {{onLeadershipChange}} triggers 
{{onElection/onResignation}}, which schedule the loading / unloading on 
background threads that are not synchronized with each other.
4) As a result, the loading scheduled by {{onElection}} may be executed first 
and the unloading scheduled by {{onResignation}} afterwards. The coordinator 
then no longer recognizes itself as the coordinator and responds to every 
coordinator request with {{NOT_COORDINATOR}}.
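To make step 4) concrete, here is a toy Python simulation of the reordering. {{ToyCoordinator}}, {{load}}, {{unload}}, {{handle_request}} and {{run_in_order}} are hypothetical stand-ins, not Kafka's actual group coordinator code, and the unsynchronized background threads are modeled by simply running the two scheduled jobs in reverse order.

```python
from dataclasses import dataclass, field


@dataclass
class ToyCoordinator:
    """Hypothetical stand-in for the coordinator's partition-ownership state."""
    owned_partitions: set = field(default_factory=set)

    def load(self, partition: int) -> None:
        # Stands in for the background loading job scheduled by onElection.
        self.owned_partitions.add(partition)

    def unload(self, partition: int) -> None:
        # Stands in for the background unloading job scheduled by onResignation.
        self.owned_partitions.discard(partition)

    def handle_request(self, partition: int) -> str:
        # "NONE" means no error; otherwise the broker rejects the request.
        return "NONE" if partition in self.owned_partitions else "NOT_COORDINATOR"


def run_in_order(jobs) -> None:
    # Unsynchronized background threads give no ordering guarantee;
    # here we force one specific interleaving by running jobs in list order.
    for job in jobs:
        job()


coord = ToyCoordinator()
partition = 7
# Requests arrive as 1) resign, then 2) elect ...
scheduled_in_request_order = [lambda: coord.unload(partition), lambda: coord.load(partition)]
# ... but the background jobs execute in the opposite order.
run_in_order(list(reversed(scheduled_in_request_order)))
print(coord.handle_request(partition))  # NOT_COORDINATOR
```

Running the same two jobs in request order would leave the partition owned and the request answered with "NONE".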

Here are two proposals off the top of my head:

1) Let the scheduled load / unload functions keep the passed-in leader epoch, 
and also materialize the epoch in memory. Then, when executing an unload, check 
its epoch against the materialized leader epoch, so that a stale unload from an 
older epoch is skipped.
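A minimal sketch of proposal 1), assuming each LeaderAndISR-triggered job carries the leader epoch from its request and that the resignation from the old controller carries an older epoch than the election from the new one. {{EpochGuardedCoordinator}} and its methods are hypothetical names; following the proposal literally, only loads materialize the epoch and only unloads are checked against it.

```python
class EpochGuardedCoordinator:
    """Hypothetical sketch: jobs carry the leader epoch they were scheduled with."""

    def __init__(self) -> None:
        self.owned = set()
        # partition -> leader epoch materialized in memory by the latest load
        self.load_epochs = {}

    def load(self, partition: int, leader_epoch: int) -> None:
        self.load_epochs[partition] = leader_epoch
        self.owned.add(partition)

    def unload(self, partition: int, leader_epoch: int) -> bool:
        loaded_epoch = self.load_epochs.get(partition)
        if loaded_epoch is not None and leader_epoch < loaded_epoch:
            # Stale resignation from an older epoch: skip it.
            return False
        self.owned.discard(partition)
        return True


coord = EpochGuardedCoordinator()
# Reordered execution: the election (epoch 6) runs before the older resignation (epoch 5).
coord.load(7, leader_epoch=6)
skipped = not coord.unload(7, leader_epoch=5)
print(skipped, 7 in coord.owned)  # True True
```

The guard makes the outcome independent of execution order for this pair of jobs, at the cost of tracking one epoch per partition.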

2) This may be a bit simpler: use a single background thread working on a FIFO 
queue of loading / unloading jobs. Since the callers are actually synchronized 
on the replica manager and their order is preserved, the enqueued loading / 
unloading jobs would be correctly ordered as well, which avoids the reordering.
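Proposal 2) can be sketched as one worker thread draining a FIFO queue. {{SerialScheduler}}, {{load}} and {{unload}} are hypothetical names for illustration; the sketch assumes, as the ticket states, that jobs are enqueued in the order the synchronized request threads call {{onResignation}} / {{onElection}}.

```python
import queue
import threading


class SerialScheduler:
    """Hypothetical single background thread draining a FIFO queue of jobs."""

    def __init__(self) -> None:
        self._jobs = queue.Queue()  # queue.Queue is FIFO
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def _run(self) -> None:
        while True:
            job = self._jobs.get()
            if job is None:  # sentinel: stop after all earlier jobs ran
                return
            job()

    def schedule(self, job) -> None:
        self._jobs.put(job)

    def shutdown(self) -> None:
        self._jobs.put(None)
        self._worker.join()


owned = set()
events = []


def unload(partition: int) -> None:
    owned.discard(partition)
    events.append(("unload", partition))


def load(partition: int) -> None:
    owned.add(partition)
    events.append(("load", partition))


scheduler = SerialScheduler()
# Enqueued in request order: 1) resign, then 2) elect.
scheduler.schedule(lambda: unload(7))
scheduler.schedule(lambda: load(7))
scheduler.shutdown()  # waits until the worker has run every queued job
print(events)  # [('unload', 7), ('load', 7)]
```

Because a single worker executes jobs strictly in enqueue order, the final state follows request order without any epoch bookkeeping; the trade-off is that loads and unloads for different partitions can no longer run in parallel.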



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
