[ 
https://issues.apache.org/jira/browse/FLINK-34007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17805411#comment-17805411
 ] 

Matthias Pohl commented on FLINK-34007:
---------------------------------------

I still don't fully understand the error you shared: shouldn't the 
KubernetesClientException resolve itself, since the logic runs in a loop? Was 
the stacktrace you shared a one-time occurrence, or does it recur (which 
would confirm that the logic is executed in a loop and indicate that the 
ConfigMap is in some odd state)? I'm also wondering why the ConfigMap was 
updated concurrently (which, as far as I understand, caused the 
KubernetesClientException) when there's only one JM running. Are there other 
processes accessing the ConfigMap?
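
For readers following the thread, here is a minimal sketch of why a conflict caused by a concurrent update would normally resolve itself when the update runs in a retry loop. It is an illustration only: Store, Versioned, ConflictException and checkAndUpdate are hypothetical names invented for this sketch, not Flink's or the Kubernetes client's API. The in-memory store merely imitates optimistic locking on a resource version.

```java
import java.util.function.UnaryOperator;

class ConfigMapRetrySketch {

    /** Hypothetical stand-in for a ConfigMap: some data plus a resource version. */
    static final class Versioned {
        final long resourceVersion;
        final String data;

        Versioned(long resourceVersion, String data) {
            this.resourceVersion = resourceVersion;
            this.data = data;
        }
    }

    /** Hypothetical stand-in for the conflict raised by a concurrent update. */
    static final class ConflictException extends RuntimeException {
        ConflictException(String message) {
            super(message);
        }
    }

    /** In-memory store that rejects writes based on a stale resource version. */
    static final class Store {
        private Versioned current = new Versioned(0, "");

        synchronized Versioned read() {
            return current;
        }

        synchronized void replace(long expectedVersion, String newData) {
            if (current.resourceVersion != expectedVersion) {
                throw new ConflictException("stale version " + expectedVersion);
            }
            current = new Versioned(expectedVersion + 1, newData);
        }
    }

    /** Re-reads and retries on conflict; returns the number of attempts used. */
    static int checkAndUpdate(Store store, UnaryOperator<String> update, int maxAttempts) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            Versioned snapshot = store.read();
            try {
                store.replace(snapshot.resourceVersion, update.apply(snapshot.data));
                return attempt;
            } catch (ConflictException e) {
                // Another writer got in first: loop, re-read, and try again.
            }
        }
        throw new IllegalStateException("gave up after " + maxAttempts + " attempts");
    }

    public static void main(String[] args) {
        Store store = new Store();
        boolean[] interfered = {false};
        // Simulate exactly one competing write that lands during the first attempt.
        int attempts = checkAndUpdate(store, data -> {
            if (!interfered[0]) {
                interfered[0] = true;
                store.replace(store.read().resourceVersion, "other-writer");
            }
            return data + "|leader=jm-1";
        }, 5);
        System.out.println(attempts + " " + store.read().data);
    }
}
```

In this sketch a single competing write costs one extra attempt and the second attempt succeeds. A conflict that keeps recurring would therefore point to a writer that keeps modifying the resource, which is what the questions above are trying to establish.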

{quote}
[...] flink will not able to restart services (such RM and dispatcher) as 
DefaultLeaderRetrievalService is stopped also [...]
{quote}
The DefaultLeaderRetrievalService is not in charge of restarting any services. 
As soon as the JobManager regains leadership, the LeaderElectionService 
triggers the restart of any services that were shut down. In this case, that 
is the SessionDispatcherLeaderProcess, which is started by the 
DefaultDispatcherRunner; the latter maintains the Dispatcher's leader 
election.
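
The division of responsibilities described here can be sketched roughly as follows. This is a simplified illustration and not Flink's actual code: Contender, Runner and LeaderProcess are hypothetical names that loosely mirror the roles of a leader contender, the DefaultDispatcherRunner and the SessionDispatcherLeaderProcess. The sketch assumes that the election side calls grantLeadership and revokeLeadership on the runner.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.UUID;

class LeadershipRestartSketch {

    /** Hypothetical callback interface invoked by the leader election side. */
    interface Contender {
        void grantLeadership(UUID sessionId);

        void revokeLeadership();
    }

    /** Hypothetical per-leadership-session process. */
    static final class LeaderProcess {
        final UUID sessionId;
        private boolean running = true;

        LeaderProcess(UUID sessionId) {
            this.sessionId = sessionId;
        }

        void close() {
            running = false;
        }

        boolean isRunning() {
            return running;
        }
    }

    /** Hypothetical runner: owns the process and recreates it on every grant. */
    static final class Runner implements Contender {
        private LeaderProcess current; // null while not leader
        final List<UUID> startedSessions = new ArrayList<>();

        @Override
        public synchronized void grantLeadership(UUID sessionId) {
            stopCurrent();
            current = new LeaderProcess(sessionId);
            startedSessions.add(sessionId);
        }

        @Override
        public synchronized void revokeLeadership() {
            stopCurrent();
        }

        private void stopCurrent() {
            if (current != null) {
                current.close();
                current = null;
            }
        }

        synchronized boolean hasRunningProcess() {
            return current != null && current.isRunning();
        }
    }

    public static void main(String[] args) {
        Runner runner = new Runner();
        runner.grantLeadership(UUID.randomUUID()); // leadership acquired
        runner.revokeLeadership();                 // leadership lost, process stopped
        runner.grantLeadership(UUID.randomUUID()); // leadership regained, new process
        System.out.println("running=" + runner.hasRunningProcess()
                + ", sessions started=" + runner.startedSessions.size());
    }
}
```

Under these assumptions, the restart depends only on the election side granting leadership again. The retrieval side plays no role in it.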

> Flink Job stuck in suspend state after losing leadership in HA Mode
> -------------------------------------------------------------------
>
>                 Key: FLINK-34007
>                 URL: https://issues.apache.org/jira/browse/FLINK-34007
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination
>    Affects Versions: 1.16.3, 1.17.2, 1.18.1, 1.18.2
>            Reporter: Zhenqiu Huang
>            Priority: Major
>         Attachments: Debug.log, job-manager.log
>
>
> The observation is that the JobManager goes into the suspended state, with a 
> failed container unable to register itself with the ResourceManager after a 
> timeout.
> See the attached JM log.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
