David Morávek created FLINK-32010:
-------------------------------------

             Summary: KubernetesLeaderRetrievalDriver always waits for lease update to resolve leadership
                 Key: FLINK-32010
                 URL: https://issues.apache.org/jira/browse/FLINK-32010
             Project: Flink
          Issue Type: Bug
    Affects Versions: 1.16.1, 1.17.0, 1.18.0
            Reporter: David Morávek


The k8s-based leader retrieval works by watching a ConfigMap (CM). The CM 
lifecycle (from the consumer's point of view) is handled as a series of events 
with the following types:
 * ADDED -> the first time the consumer has seen the CM
 * UPDATED -> any further changes to the CM
 * DELETED -> the CM has been removed

The implementation assumes that the ElectionDriver (the one that creates the 
CM) and the ElectionRetriever are started simultaneously, and it therefore 
ignores ADDED events: the CM is always created empty and is only updated with 
the leadership information later on.

This assumption is incorrect in the following cases (I might be missing some, 
but that's not important, the goal is to illustrate the problem):
 * A TM joins the cluster after the leaders have been established and needs to 
discover the RM / JM
 * The RM tries to discover the JM when 
MultipleComponentLeaderElectionDriver is used

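In these cases the CM already contains the leadership information when the 
retriever first sees it, but because the ADDED event is ignored, the leader is 
only resolved on the next UPDATED event, i.e. the next lease update. This, for 
example, leads to higher job submission latencies: a submission can be held 
back unnecessarily for up to the lease retry period [1].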
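
The late-joiner case can be illustrated with a minimal sketch. This is not 
Flink's actual KubernetesLeaderRetrievalDriver; every class, method, and key 
name below (SketchLeaderRetriever, LateJoinerDemo, "leaderAddress", 
"jm-address") is hypothetical and only models the event handling described 
in this issue. The handleAdded flag switches between ignoring ADDED events 
and treating them like UPDATED events, which is one possible way to resolve 
the leader on first sight of the CM.

```java
import java.util.Collections;
import java.util.Map;
import java.util.Optional;

/** Hypothetical event types mirroring the CM lifecycle described above. */
enum EventType { ADDED, UPDATED, DELETED }

/** Toy retriever (assumption: not the real Flink driver, illustration only). */
class SketchLeaderRetriever {
    /** Hypothetical CM key that holds the leader address. */
    static final String LEADER_KEY = "leaderAddress";

    private final boolean handleAdded;
    private Optional<String> leader = Optional.empty();

    SketchLeaderRetriever(boolean handleAdded) {
        this.handleAdded = handleAdded;
    }

    /** Processes one watch event together with the CM data seen at that time. */
    void onEvent(EventType type, Map<String, String> data) {
        if (type == EventType.DELETED) {
            leader = Optional.empty();
        } else if (type == EventType.UPDATED
                || (type == EventType.ADDED && handleAdded)) {
            // Resolve the leader from whatever the CM currently contains.
            leader = Optional.ofNullable(data.get(LEADER_KEY));
        }
        // Otherwise: ADDED is ignored, which is the behaviour reported here.
    }

    Optional<String> leader() {
        return leader;
    }
}

/** Simulates a retriever that starts after the leader was already elected. */
class LateJoinerDemo {
    static Optional<String> leaderAfterFirstSight(boolean handleAdded) {
        SketchLeaderRetriever retriever = new SketchLeaderRetriever(handleAdded);
        // The CM already carries leader information when it is first seen,
        // so the very first event is ADDED with a populated CM.
        Map<String, String> populated =
                Collections.singletonMap(SketchLeaderRetriever.LEADER_KEY, "jm-address");
        retriever.onEvent(EventType.ADDED, populated);
        return retriever.leader();
    }
}
```

In this sketch, LateJoinerDemo.leaderAfterFirstSight(false) returns an empty 
Optional, so the retriever stays without a leader until some later UPDATED 
event arrives, while leaderAfterFirstSight(true) resolves the leader 
immediately from the ADDED event.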

[1] Configured by _high-availability.kubernetes.leader-election.retry-period_



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
