[jira] [Commented] (FLINK-34007) Flink Job stuck in suspend state after losing leadership in HA Mode

Zhenqiu Huang (Jira) Thu, 18 Jan 2024 10:04:23 -0800


    [ 
https://issues.apache.org/jira/browse/FLINK-34007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17808334#comment-17808334
 ]


Zhenqiu Huang commented on FLINK-34007:
---------------------------------------

[~mapohl] Ack. The last 2 days of testing produced no new observations. 
The only thing probably worth mentioning is that when the LeaderElector (a 
3-thread executor) exits because the renew deadline timed out, only one of the 
threads actually exits the loop. In the debug log, I can still see 2 threads 
consistently failing to acquire leadership because of the stop flag.


For 1.17, I will create an instance for testing in the same cluster today. 
Let's see what the result is.

> Flink Job stuck in suspend state after losing leadership in HA Mode
> -------------------------------------------------------------------
>
>                 Key: FLINK-34007
>                 URL: https://issues.apache.org/jira/browse/FLINK-34007
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination
>    Affects Versions: 1.19.0, 1.18.1, 1.18.2
>            Reporter: Zhenqiu Huang
>            Priority: Blocker
>              Labels: pull-request-available
>         Attachments: Debug.log, LeaderElector-Debug.json, job-manager.log
>
>
> The observation is that the Job Manager goes into the suspended state, with a 
> failed container unable to register itself with the resource manager after 
> the timeout.
> JM log: see attached.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
