[ 
https://issues.apache.org/jira/browse/FLINK-34007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17807017#comment-17807017
 ] 

Zhenqiu Huang commented on FLINK-34007:
---------------------------------------

[~mapohl]
I am intensively testing Flink 1.18. Within two days, users reported the 
stuck job manager issue in 1.17 and 1.16 as well. The 1.18 and 1.17 job 
instances are running in the same cluster; 1.16 is in a different cluster.

I attached another LeaderElector-Debug.json file that contains the debug log 
of a Flink 1.18 app. The issue happened several times, with two causes:
1. The ConfigMap was not accessible from the API server, so the renew timeout 
was exceeded.
2. A patch on an already updated ConfigMap failed.
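
The two cases above can be sketched with a toy model. This is only an 
illustration under assumptions, not Flink or fabric8 code: FakeConfigMap, 
ToyLeader, tick, and the 10-second renew deadline are hypothetical names and 
values. Both failures are treated as failed renew attempts, and leadership is 
dropped once the deadline passes without a successful renew. Whether the real 
elector gives up sooner on a conflict is not modeled.

```python
class ApiUnavailable(Exception):
    """Simulated: the API server cannot serve the ConfigMap."""


class Conflict(Exception):
    """Simulated: the patch was based on a stale resource version."""


class FakeConfigMap:
    """In-memory stand-in for the leader ConfigMap (hypothetical)."""

    def __init__(self, holder):
        self.holder = holder
        self.resource_version = 1
        self.reachable = True

    def patch(self, holder, expected_version):
        """Update the holder if the caller saw the latest version."""
        if not self.reachable:
            raise ApiUnavailable("configmap not accessible from API server")
        if expected_version != self.resource_version:
            raise Conflict("patch on an already updated configmap")
        self.holder = holder
        self.resource_version += 1
        return self.resource_version


class ToyLeader:
    """Toy leader that must renew within renew_deadline seconds."""

    def __init__(self, cm, identity, renew_deadline=10.0):
        self.cm = cm
        self.identity = identity
        self.renew_deadline = renew_deadline  # assumed value
        self.seen_version = cm.resource_version
        self.last_renew = 0.0
        self.is_leader = True

    def tick(self, now):
        """Attempt one renewal at time `now` and report the outcome."""
        try:
            self.seen_version = self.cm.patch(self.identity, self.seen_version)
            self.last_renew = now
            return "renewed"
        except (ApiUnavailable, Conflict) as exc:
            if now - self.last_renew > self.renew_deadline:
                self.is_leader = False
                return f"lost leadership: {exc}"
            return f"retrying: {exc}"


if __name__ == "__main__":
    # Case 1: the ConfigMap becomes unreachable after one good renewal.
    cm = FakeConfigMap("jm-1")
    jm = ToyLeader(cm, "jm-1")
    print(jm.tick(2.0))
    cm.reachable = False
    print(jm.tick(8.0))
    print(jm.tick(14.0))

    # Case 2: another writer updates the ConfigMap first.
    cm2 = FakeConfigMap("jm-1")
    jm2 = ToyLeader(cm2, "jm-1")
    cm2.patch("jm-2", cm2.resource_version)
    print(jm2.tick(5.0))
    print(jm2.tick(11.0))
```

In this toy the stale version is never refreshed, so the conflict in case 2 
repeats until the deadline expires.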

The interesting part of the behavior over the last several days is that the 
job manager did not get stuck but exited directly. A new job manager pod then 
started correctly, which is why a new leader is selected in the log above. 
Hopefully this is useful for your diagnosis.






> Flink Job stuck in suspend state after losing leadership in HA Mode
> -------------------------------------------------------------------
>
>                 Key: FLINK-34007
>                 URL: https://issues.apache.org/jira/browse/FLINK-34007
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination
>    Affects Versions: 1.16.3, 1.17.2, 1.18.1, 1.18.2
>            Reporter: Zhenqiu Huang
>            Priority: Major
>         Attachments: Debug.log, LeaderElector-Debug.json, job-manager.log
>
>
> The observation is that the job manager goes into the suspend state, with a 
> failed container unable to register itself with the resource manager after 
> the timeout.
> See the attached JM log.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
