[
https://issues.apache.org/jira/browse/FLINK-37811?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Arun Lakshman updated FLINK-37811:
----------------------------------
Description:
We have observed inconsistent behavior in which the JobManager encounters
ZooKeeper session timeout exceptions, leading to leadership loss across
multiple components, including the Resource Manager, Job Master, and
Dispatcher. When this occurs, the system goes through an unexpected sequence:
while the components are still shutting down, the ZooKeeper connection is
RECONNECTED, but the jobs still enter a SUSPENDED state. Notably, the
JobManager process keeps running and does not perform a system exit. The
initial trigger appears to be a session timeout exception with the message
"Client session timed out, have not heard from server in 26678ms".
I have attached the JobManager logs.
> Flink Job stuck in suspend state after losing leadership in Zookeeper HA
> ------------------------------------------------------------------------
>
> Key: FLINK-37811
> URL: https://issues.apache.org/jira/browse/FLINK-37811
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Coordination
> Affects Versions: 1.15.0, 1.20.0
> Reporter: Arun Lakshman
> Priority: Major
> Attachments: notRecovered.csv
>
>
> We have observed inconsistent behavior in which the JobManager encounters
> ZooKeeper session timeout exceptions, leading to leadership loss across
> multiple components, including the Resource Manager, Job Master, and
> Dispatcher. When this occurs, the system goes through an unexpected sequence:
> while the components are still shutting down, the ZooKeeper connection is
> RECONNECTED, but the jobs still enter a SUSPENDED state. Notably, the
> JobManager process keeps running and does not perform a system exit. The
> initial trigger appears to be a session timeout exception with the message
> "Client session timed out, have not heard from server in 26678ms".
>
> I have attached the JobManager logs.
>