Lucas Borges created FLINK-38325:
------------------------------------
Summary: Checkpoints are hanging and timing out frequently
Key: FLINK-38325
URL: https://issues.apache.org/jira/browse/FLINK-38325
Project: Flink
Issue Type: Bug
Components: Runtime / Checkpointing
Affects Versions: 2.0.0, 2.1.0
Environment: Flink 2.1 (also observed on 2.0) with the ForSt state
backend.
Running on Kubernetes using the Apache Flink Kubernetes Operator.
Reporter: Lucas Borges
Attachments: Screenshot 2025-09-03 at 14.53.56.png, Screenshot
2025-09-03 at 14.54.21.png, Screenshot 2025-09-03 at 14.54.36.png
This issue is observed on a Flink 2.1 job running with the ForSt state
backend. Checkpoints on this job fail due to timeouts or hanging more
frequently than on our Flink 1.x jobs.
We suspect a deadlock somewhere, based on a thread dump from one task
manager (it could not be attached to the Jira issue due to size limits).
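Since the full thread dump cannot be shared here, the following is a minimal sketch of how a JVM-level deadlock could be checked for from inside a task manager process, using the standard java.lang.management API. DeadlockProbe is a hypothetical helper written for illustration and is not part of Flink. Note that findDeadlockedThreads() only reports cycles on object monitors and ownable synchronizers; a stall where threads wait on futures or on I/O would not show up, so an empty result does not rule out a hang.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not part of Flink.
public class DeadlockProbe {

    /**
     * Returns one description per thread that the JVM reports as part of a
     * deadlock cycle. The list is empty when no cycle is detected.
     */
    public static List<String> findDeadlocks() {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        List<String> report = new ArrayList<>();

        // Covers object monitors and java.util.concurrent ownable synchronizers
        // when supported; otherwise falls back to monitor-only detection.
        long[] ids = mx.isSynchronizerUsageSupported()
                ? mx.findDeadlockedThreads()
                : mx.findMonitorDeadlockedThreads();
        if (ids == null) {
            return report; // null means no deadlock was detected
        }

        for (ThreadInfo info : mx.getThreadInfo(ids, Integer.MAX_VALUE)) {
            if (info == null) {
                continue; // thread terminated between the two calls
            }
            report.add(info.getThreadName()
                    + " waits for " + info.getLockName()
                    + " held by " + info.getLockOwnerName());
        }
        return report;
    }

    public static void main(String[] args) {
        List<String> report = findDeadlocks();
        if (report.isEmpty()) {
            System.out.println("No JVM-level deadlock detected");
        } else {
            report.forEach(System.out::println);
        }
    }
}
```

The check only sees threads of the JVM it runs in, so it would have to be invoked inside the affected task manager, or the same ThreadMXBean would have to be queried remotely over JMX. Running jstack against the task manager process performs a comparable check and prints any Java-level deadlock it finds at the end of the dump.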