dongwoo.kim created FLINK-33324:
-----------------------------------
Summary: Add flink managed timeout mechanism for backend restore
operation
Key: FLINK-33324
URL: https://issues.apache.org/jira/browse/FLINK-33324
Project: Flink
Issue Type: Improvement
Components: Runtime / Checkpointing, Runtime / State Backends
Reporter: dongwoo.kim
Attachments: image-2023-10-20-15-16-53-324.png,
image-2023-10-20-17-30-02-290.png
Hello community, I would like to share an issue our team recently faced and
propose a feature to mitigate similar problems in the future.
h2. Issue
Our Flink streaming job encountered consecutive checkpoint failures and
subsequently attempted a restart.
This failure occurred due to timeouts in two subtasks located within the same
task manager.
The restore operation for this particular task manager also got stuck,
resulting in an "initializing" state lasting over an hour.
Once we realized the hang during the restore operation, we terminated the task
manager pod, resolving the issue.
!image-2023-10-20-15-16-53-324.png|width=847,height=749!
The sequence of events was as follows:
1. Checkpoint timed out for subtasks within the task manager, referred to as
tm-32.
2. The Flink job failed and initiated a restart.
3. Restoration was successful for 282 subtasks, but got stuck for the 2
subtasks in tm-32.
4. While the Flink tasks weren't fully in running state, checkpointing was
still being triggered, leading to consecutive checkpoint failures.
5. These checkpoint failures seemed to be ignored, and did not count to the
execution.checkpointing.tolerable-failed-checkpoints configuration. As a
result, the job remained in the initialization phase for very long period.
6. Once we found this, we terminated the tm-32 pod, leading to a successful
Flink job restart.
h2. Suggestion
I feel that, a Flink job remaining in the initializing state indefinitely is
not ideal.
To enhance resilience, I think it would be helpful if we could add timeout
feature for restore operation.
If the restore operation exceeds a specified duration, an exception should be
thrown, causing the job to fail.
This way, we can address restore-related issues similarly to how we handle
checkpoint failures.
h2. Notes
Just to add, I've made a basic version of this feature to see if it works.
I've attached a picture from the Flink UI that shows the timeout exception
happened during restore operation.
It's just a start, but I hope it helps with our discussion.
I've simulated network chaos, using litmus chaos engineering tool.
!image-2023-10-20-17-30-10-751.png|width=839,height=283!
Thank you for considering my proposal. I'm looking forward to hear your
thoughts.
If there's agreement on this, I'd be happy to work on implementing this feature.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)