[jira] [Updated] (FLINK-37811) Flink Job stuck in suspend state after losing leadership in Zookeeper HA

Arun Lakshman (Jira) Fri, 16 May 2025 17:42:27 -0700


     [ 
https://issues.apache.org/jira/browse/FLINK-37811?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Arun Lakshman updated FLINK-37811:
----------------------------------
    Description: 
We have observed inconsistent behavior in which the JobManager encounters 
ZooKeeper session timeout exceptions, causing leadership loss across multiple 
components, including the ResourceManager, JobMaster, and Dispatcher. While 
these components are shutting down, the ZooKeeper connection is RECONNECTED, 
but the jobs still enter the SUSPENDED state. Notably, the JobManager process 
keeps running and does not perform a system exit. The initial trigger appears 
to be a session timeout exception with the message "Client session timed out, 
have not heard from server in 26678ms".

I have attached the JobManager logs.



> Flink Job stuck in suspend state after losing leadership in Zookeeper HA
> ------------------------------------------------------------------------
>
>                 Key: FLINK-37811
>                 URL: https://issues.apache.org/jira/browse/FLINK-37811
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination
>    Affects Versions: 1.15.0, 1.20.0
>            Reporter: Arun Lakshman
>            Priority: Major
>         Attachments: notRecovered.csv
>
>
> We have observed inconsistent behavior in which the JobManager encounters 
> ZooKeeper session timeout exceptions, causing leadership loss across 
> multiple components, including the ResourceManager, JobMaster, and 
> Dispatcher. While these components are shutting down, the ZooKeeper 
> connection is RECONNECTED, but the jobs still enter the SUSPENDED state. 
> Notably, the JobManager process keeps running and does not perform a system 
> exit. The initial trigger appears to be a session timeout exception with the 
> message "Client session timed out, have not heard from server in 26678ms".
>
> I have attached the JobManager logs.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
