[jira] [Commented] (FLINK-34007) Flink Job stuck in suspend state after losing leadership in HA Mode

Yang Wang (Jira) Mon, 15 Jan 2024 18:14:06 -0800


    [ 
https://issues.apache.org/jira/browse/FLINK-34007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17807009#comment-17807009
 ]


Yang Wang commented on FLINK-34007:
-----------------------------------

{quote}At least based on the reports of this Jira issue, there must have been 
an incident (which caused the lease to not be renewed)
{quote}
I am afraid we cannot draw this conclusion until we have the K8s APIServer 
audit logs to verify that the lease annotation was not renewed. It could also 
be that the lease annotation was renewed normally while the onStartLeading 
callback was somehow not executed. 
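
To make the two possibilities concrete, here is a minimal sketch of the check 
I have in mind once the audit logs are available. Everything in it is 
hypothetical: the class LeaseAuditCheck, the diagnose method and its 
parameters are illustration only and are not Flink or fabric8 APIs. It assumes 
we can extract the last renew time of the lease annotation from the audit log 
and can tell from the JobManager log whether onStartLeading ran.

```java
import java.time.Duration;
import java.time.Instant;

/** Hypothetical helper for illustration; not part of Flink or the fabric8 client. */
public class LeaseAuditCheck {

    /** The two explanations discussed in this comment, plus the healthy case. */
    public enum Diagnosis { LEASE_NOT_RENEWED, CALLBACK_NOT_EXECUTED, HEALTHY }

    /**
     * @param lastRenewTime  last renew time of the lease annotation (assumed to come from the audit log)
     * @param leaseDuration  configured lease duration
     * @param observedAt     time at which the job was observed to be stuck
     * @param callbackLogged whether the JobManager log shows that onStartLeading was executed
     */
    public static Diagnosis diagnose(
            Instant lastRenewTime, Duration leaseDuration, Instant observedAt, boolean callbackLogged) {
        // The lease counts as expired if nobody renewed it within the lease duration.
        boolean leaseExpired = lastRenewTime.plus(leaseDuration).isBefore(observedAt);
        if (leaseExpired) {
            return Diagnosis.LEASE_NOT_RENEWED;
        }
        // The lease was renewed in time, so a stuck job points at the callback instead.
        return callbackLogged ? Diagnosis.HEALTHY : Diagnosis.CALLBACK_NOT_EXECUTED;
    }

    public static void main(String[] args) {
        Instant renew = Instant.parse("2024-01-15T00:00:00Z");
        Duration lease = Duration.ofSeconds(15);
        // Renewed 60s ago with a 15s lease: the lease was not renewed.
        System.out.println(diagnose(renew, lease, renew.plusSeconds(60), false));
        // Renewed 5s ago, but the callback never showed up in the log.
        System.out.println(diagnose(renew, lease, renew.plusSeconds(5), false));
    }
}
```

Only the first outcome would confirm the quoted statement. The second one 
would point at the callback path instead.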

 
{quote}Therefore, the issue should exist in the entire version range [5.12.3, 
6.6.2].
{quote}
If this issue only happens in Flink 1.18, then it should be related to the 
behavior change in the fabric8 K8s client 6.6.2. Otherwise, we still have not 
found the root cause.

 

You are right. The slight difference in the revocation protocol in the 
[FLIP-285|https://cwiki.apache.org/confluence/display/FLINK/FLIP-285%3A+Refactoring+LeaderElection+to+make+Flink+support+multi-component+leader+election+out-of-the-box]
 changes, i.e. how the leader information in the ConfigMap is cleared, is not 
related to this issue.

 

BTW, if we knew how to reproduce this issue, it would be easier to find the 
root cause, because we might also need the K8s APIServer audit log for a 
deeper analysis.

> Flink Job stuck in suspend state after losing leadership in HA Mode
> -------------------------------------------------------------------
>
>                 Key: FLINK-34007
>                 URL: https://issues.apache.org/jira/browse/FLINK-34007
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination
>    Affects Versions: 1.16.3, 1.17.2, 1.18.1, 1.18.2
>            Reporter: Zhenqiu Huang
>            Priority: Major
>         Attachments: Debug.log, job-manager.log
>
>
> The observation is that the JobManager goes into the suspended state, with a 
> failed container unable to register itself with the ResourceManager after a 
> timeout.
> JM log: see attached.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
