[jira] [Commented] (FLINK-34007) Flink Job stuck in suspend state after losing leadership in HA Mode

Matthias Pohl (Jira) Thu, 01 Feb 2024 02:13:20 -0800


    [ 
https://issues.apache.org/jira/browse/FLINK-34007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17813150#comment-17813150
 ]


Matthias Pohl commented on FLINK-34007:
---------------------------------------

Hi [~yunta], we're not aware of any issues related to FLINK-34007 in Flink 1.17.
The issue started to appear with the upgrade of the k8s client dependency to
v6.6.2 in FLINK-31997 (which ended up in Flink 1.18).

That said, [~ZhenqiuHuang] also reported similar errors in Flink 1.17 and 1.16
deployments, which we cannot explain. We were not able to investigate the
cause because the logs were missing. We agreed to cover any other problems in a
separate Jira issue if [~ZhenqiuHuang] comes up with new information (see his
comment above).

> Flink Job stuck in suspend state after losing leadership in HA Mode
> -------------------------------------------------------------------
>
>                 Key: FLINK-34007
>                 URL: https://issues.apache.org/jira/browse/FLINK-34007
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination
>    Affects Versions: 1.19.0, 1.18.1, 1.18.2
>            Reporter: Zhenqiu Huang
>            Assignee: Matthias Pohl
>            Priority: Blocker
>              Labels: pull-request-available
>             Fix For: 1.19.0
>
>         Attachments: Debug.log, LeaderElector-Debug.json, job-manager.log
>
>
> The observation is that the JobManager goes into the suspended state, with a
> failed container that is not able to register itself with the ResourceManager
> after a timeout.
> See the attached JM log.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
