dongwoo.kim created FLINK-33324:
-----------------------------------

             Summary: Add Flink-managed timeout mechanism for backend restore operation
                 Key: FLINK-33324
                 URL: https://issues.apache.org/jira/browse/FLINK-33324
             Project: Flink
          Issue Type: Improvement
          Components: Runtime / Checkpointing, Runtime / State Backends
            Reporter: dongwoo.kim
         Attachments: image-2023-10-20-15-16-53-324.png, 
image-2023-10-20-17-30-02-290.png

Hello community, I would like to share an issue our team recently faced and 
propose a feature to mitigate similar problems in the future.
h2. Issue

Our Flink streaming job encountered consecutive checkpoint failures and subsequently attempted a restart.
The checkpoint failures were caused by timeouts in two subtasks located in the same task manager.
The restore operation for that task manager also got stuck, leaving the job in the "initializing" state for over an hour.
Once we noticed that the restore operation was hanging, we terminated the task manager pod, which resolved the issue.

!image-2023-10-20-15-16-53-324.png|width=847,height=749!

The sequence of events was as follows:

1. Checkpoints timed out for the subtasks in one task manager, referred to below as tm-32.
2. The Flink job failed and initiated a restart.
3. Restoration succeeded for 282 subtasks, but got stuck for the 2 subtasks in tm-32.
4. Although the tasks were not yet fully in the running state, checkpoints were still being triggered, leading to consecutive checkpoint failures.
5. These checkpoint failures appeared to be ignored and did not count toward the execution.checkpointing.tolerable-failed-checkpoints limit (see the configuration sketch after this list). As a result, the job remained in the initialization phase for a very long period.
6. Once we noticed this, we terminated the tm-32 pod, which led to a successful restart of the Flink job.
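
For context, here is roughly how the checkpoint failure safety net mentioned in step 5 is normally configured through the DataStream API (the values are illustrative, not our production settings). The problem above is that failures raised while tasks are still initializing never seem to reach this counter.

{code:java}
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointFailureConfig {
    public static void main(String[] args) {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Trigger a checkpoint every 60 seconds and fail any checkpoint
        // that takes longer than 10 minutes.
        env.enableCheckpointing(60_000L);
        env.getCheckpointConfig().setCheckpointTimeout(600_000L);

        // Programmatic counterpart of execution.checkpointing.tolerable-failed-checkpoints:
        // the job fails once more than this many consecutive checkpoint failures occur.
        env.getCheckpointConfig().setTolerableCheckpointFailureNumber(3);
    }
}
{code}
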
h2. Suggestion

I feel that a Flink job remaining in the initializing state indefinitely is not ideal.
To enhance resilience, I think it would be helpful to add a timeout feature for the restore operation.
If the restore operation exceeds a specified duration, an exception should be thrown, causing the job to fail.
This way, we could address restore-related issues in the same way we handle checkpoint failures.
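
To make the idea concrete, here is a minimal, simplified sketch of what such a timeout could look like. It is not the actual Flink restore code path; restoreStateBackend() and the 10-minute limit are placeholders for illustration only.

{code:java}
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class RestoreTimeoutSketch {

    // Hypothetical stand-in for the real backend restore work
    // (downloading state handles, rebuilding local state, ...).
    static void restoreStateBackend() throws Exception {
        Thread.sleep(Long.MAX_VALUE); // simulate a restore that hangs
    }

    public static void main(String[] args) throws Exception {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        Future<?> restore = executor.submit(() -> {
            restoreStateBackend();
            return null;
        });
        try {
            // Illustrative limit; in the proposal this would come from a new
            // Flink configuration option rather than a hard-coded value.
            restore.get(10, TimeUnit.MINUTES);
        } catch (TimeoutException e) {
            restore.cancel(true);
            // Surface the hang as a failure so the normal failover/restart machinery
            // can react, instead of the task staying in the initializing state indefinitely.
            throw new RuntimeException("State backend restore timed out", e);
        } finally {
            executor.shutdownNow();
        }
    }
}
{code}

In a real implementation the timeout would presumably be wired into the task's restore/failover path and surfaced as a task failure, but the basic shape is the same.
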
h2. Notes

Just to add, I've built a basic version of this feature to see whether it works.
I've attached a screenshot from the Flink UI that shows the timeout exception thrown during the restore operation.
It's just a start, but I hope it helps with the discussion.
I simulated the failure with network chaos, using the Litmus chaos engineering tool.
!image-2023-10-20-17-30-10-751.png|width=839,height=283!

Thank you for considering my proposal. I'm looking forward to hearing your thoughts.
If there's agreement on this, I'd be happy to work on implementing this feature.


