[ https://issues.apache.org/jira/browse/FLINK-34007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17805960#comment-17805960 ]

Matthias Pohl commented on FLINK-34007:
---------------------------------------

I couldn't find anything in the Flink code that triggers a ConfigMap cleanup 
when leadership is lost (Flink's 
[KubernetesLeaderElector:82|https://github.com/apache/flink/blob/c5808b04fdce9ca0b705b6cc7a64666ab6426875/flink-kubernetes/src/main/java/org/apache/flink/kubernetes/kubeclient/resources/KubernetesLeaderElector.java#L82]
 sets up the [callback for the leadership 
loss|https://github.com/apache/flink/blob/c5808b04fdce9ca0b705b6cc7a64666ab6426875/flink-kubernetes/src/main/java/org/apache/flink/kubernetes/kubeclient/resources/KubernetesLeaderElector.java#L124],
 which is implemented by 
[KubernetesLeaderElectionDriver#LeaderCallbackHandlerImpl|https://github.com/apache/flink/blob/11259ef52466889157ef473f422ecced72bab169/flink-kubernetes/src/main/java/org/apache/flink/kubernetes/highavailability/KubernetesLeaderElectionDriver.java#L212]).
 This behavior is on par with the old (i.e. pre-FLINK-24038) version of the 
Flink 1.15 codebase (see the 1.15 class 
[KubernetesLeaderElectionDriver:202|https://github.com/apache/flink/blob/bba7c417217be878fffb12efedeac50dec2a7459/flink-kubernetes/src/main/java/org/apache/flink/kubernetes/highavailability/KubernetesLeaderElectionDriver.java#L202]).

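For reference, here is a minimal sketch of how such a LeaderElector is 
typically wired with the fabric8 client (this uses the fabric8 API directly 
rather than Flink's wrapper, and the election/lock names are made up). The 
point to note is that the leadership-loss callback is purely a notification; 
it doesn't touch the ConfigMap contents:

{code:java}
import io.fabric8.kubernetes.client.KubernetesClient;
import io.fabric8.kubernetes.client.KubernetesClientBuilder;
import io.fabric8.kubernetes.client.extended.leaderelection.LeaderCallbacks;
import io.fabric8.kubernetes.client.extended.leaderelection.LeaderElectionConfig;
import io.fabric8.kubernetes.client.extended.leaderelection.LeaderElectionConfigBuilder;
import io.fabric8.kubernetes.client.extended.leaderelection.resourcelock.ConfigMapLock;

import java.time.Duration;

public class LeaderElectionSketch {
    public static void main(String[] args) {
        try (KubernetesClient client = new KubernetesClientBuilder().build()) {
            LeaderElectionConfig config = new LeaderElectionConfigBuilder()
                    .withName("example-election")
                    .withLock(new ConfigMapLock("default", "example-configmap", "pod-identity"))
                    .withLeaseDuration(Duration.ofSeconds(15))
                    .withRenewDeadline(Duration.ofSeconds(10))
                    .withRetryPeriod(Duration.ofSeconds(2))
                    .withLeaderCallbacks(new LeaderCallbacks(
                            // leadership granted
                            () -> System.out.println("grant: this contender is now the leader"),
                            // leadership lost: only a notification, no ConfigMap cleanup happens here
                            () -> System.out.println("revoke: leadership lost"),
                            // another contender became the leader
                            newLeader -> System.out.println("new leader: " + newLeader)))
                    .build();
            client.leaderElector().withConfig(config).build().run();
        }
    }
}
{code}
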
That's also not something I would expect. It should be handled by the 
LeaderElector instead, because the LeaderElector knows the state of the leader 
election and can trigger a cleanup before the new leader information is 
written to the ConfigMap entry. Flink shouldn't trigger a cleanup because it 
doesn't know whether a new leader was already elected (in which case cleaning 
up the ConfigMap entry would mean losing the leadership information of the 
new leader). And the Flink process wouldn't be able to clean it up anyway, 
because the process isn't the leader anymore. Or am I missing something here?

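The race described above can be shown with a self-contained simulation (plain 
Java, no Kubernetes involved; the AtomicReference stands in for the 
ConfigMap's leader entry). A former leader that blindly clears the entry can 
wipe out the record the new leader has already written, whereas a guarded, 
compare-and-set style cleanup, which only a component that knows the election 
state can perform safely, leaves it intact:

{code:java}
import java.util.concurrent.atomic.AtomicReference;

public class RevocationRaceSketch {
    public static void main(String[] args) {
        // t0: contender A is the leader and has written its record.
        AtomicReference<String> leaderEntry = new AtomicReference<>("A");

        // t1: A's lease expires; B is elected and writes its own record.
        leaderEntry.set("B");

        // t2 (unsafe): A's revoke callback fires late and blindly clears the
        // entry, losing B's leadership information.
        leaderEntry.set(null);
        System.out.println("after blind cleanup: " + leaderEntry.get()); // null

        // Guarded variant: only clear the entry if it still names A. After B
        // was elected this is a no-op, so B's record survives.
        leaderEntry.set("B");
        boolean cleaned = leaderEntry.compareAndSet("A", null);
        System.out.println("guarded cleanup applied: " + cleaned); // false
        System.out.println("after guarded cleanup: " + leaderEntry.get()); // B
    }
}
{code}
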
On another note: I came across [this 
change|https://github.com/apache/flink/pull/22540/files#diff-0e859df42954459619211d2ec60957742b24c9fc6ce55526616fddc540f0f8ffL59-R60]
 in the FLINK-31997 PR (k8s client update to 6.6.2): we changed the thread 
pool size from 1 to 3, essentially allowing the same internal LeaderElector 
to be executed multiple times concurrently (because we trigger another 
{{KubernetesLeaderElector#run}} call when the leadership is revoked). The old 
version of the code used a single thread, which meant that the run call 
would be queued until the previous {{LeaderElector#run}} call failed for 
whatever reason. That change sounds strange but shouldn't be the cause of 
this Jira issue, because the change only went into 1.18 and we're 
experiencing this in older versions of Flink as well.

> Flink Job stuck in suspend state after losing leadership in HA Mode
> -------------------------------------------------------------------
>
>                 Key: FLINK-34007
>                 URL: https://issues.apache.org/jira/browse/FLINK-34007
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination
>    Affects Versions: 1.16.3, 1.17.2, 1.18.1, 1.18.2
>            Reporter: Zhenqiu Huang
>            Priority: Major
>         Attachments: Debug.log, job-manager.log
>
>
> The observation is that the JobManager goes into a suspended state, with a 
> failed container unable to register itself with the ResourceManager after 
> the timeout.
> JM log: see attached.
>  


