Arun Lakshman created FLINK-37811:
-------------------------------------
Summary: Flink Job stuck in suspend state after losing leadership
in Zookeeper HA
Key: FLINK-37811
URL: https://issues.apache.org/jira/browse/FLINK-37811
Project: Flink
Issue Type: Bug
Components: Runtime / Coordination
Affects Versions: 1.20.0, 1.15.0
Reporter: Arun Lakshman
Attachments: notRecovered.csv
We have observed inconsistent behavior in which the JobManager encounters
ZooKeeper session timeout exceptions, leading to leadership loss across
multiple components, including the ResourceManager, JobMaster, and
Dispatcher. When this occurs, the system goes through an unexpected sequence:
while the components are still shutting down, the ZooKeeper connection is
RECONNECTED, but the jobs still enter the SUSPENDED state. Notably, the
JobManager process keeps running and does not perform a system exit. The
initial trigger shows up as a session timeout exception with the message
"Client session timed out, have not heard from server in 26678ms".
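
To make the reported sequence easier to follow, here is a toy model of it. It
is only a sketch and not Flink or Curator code: the class, method, and event
names are hypothetical, and it makes no claim about the root cause. It only
encodes the observed order of events (session timeout, leadership loss,
RECONNECTED) and the resulting symptom (process alive, job SUSPENDED).

```python
from enum import Enum


class JobState(Enum):
    RUNNING = "RUNNING"
    SUSPENDED = "SUSPENDED"


class ToyJobManager:
    """Hypothetical stand-in for a JobManager process; not Flink code."""

    def __init__(self):
        self.is_leader = True
        self.job_state = JobState.RUNNING
        self.process_alive = True
        self.events = []

    def on_session_timeout(self):
        # Observed: the session timeout leads to leadership loss,
        # and the job ends up SUSPENDED.
        self.events.append("SESSION_TIMEOUT")
        self.is_leader = False
        self.job_state = JobState.SUSPENDED

    def on_reconnected(self):
        # Observed: the connection comes back, but in this model
        # (as in the report) nothing resumes the job.
        self.events.append("RECONNECTED")

    def on_leadership_granted(self):
        # Assumed recovery path: regaining leadership resumes the job.
        self.events.append("LEADERSHIP_GRANTED")
        self.is_leader = True
        self.job_state = JobState.RUNNING


def is_stuck(jm):
    """True if the process is alive, has reconnected, and the job is still SUSPENDED."""
    return (
        jm.process_alive
        and "RECONNECTED" in jm.events
        and jm.job_state is JobState.SUSPENDED
    )


jm = ToyJobManager()
jm.on_session_timeout()
jm.on_reconnected()
print(is_stuck(jm))
```

In this model the stuck condition clears only if on_leadership_granted() is
called after the reconnect. Whether and why that step is skipped in the real
system is the open question of this ticket.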