[ https://issues.apache.org/jira/browse/FLINK-33324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17777755#comment-17777755 ]
dongwoo.kim edited comment on FLINK-33324 at 10/20/23 1:25 PM: --------------------------------------------------------------- Hi, [~pnowojski] Thanks for the opinion. First, about the code: I simply wrapped the main restore logic [here|https://github.com/apache/flink/blob/72e302310ba55bb5f35966ed448243aae36e193e/flink-streaming-java/src/main/java/org/apache/flink/streaming/api/operators/BackendRestorerProcedure.java#L94] in a Callable and combined it with future.get(timeout). Please keep in mind that this was only an initial feasibility check, without a deep dive into the Flink code. Regarding manual action by a human, I agree that solving this issue with an alert system seems practical. However, our goal for handling the failover loop was to automate operations using the failure-rate restart strategy and a cronJob that monitors the Flink job's status. Instead of adding ambiguous conditions to the cronJob, treating an unusually long restore operation as a failure simplifies our process. Still, I understand from the feedback that this approach may fit our team's particular needs more than everyone else's.
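The wrapping described in the comment could look roughly like the following. This is a minimal, self-contained sketch of the idea, not the actual Flink change: attemptCreateAndRestore is a hypothetical stand-in for the restore logic in BackendRestorerProcedure, and the 30-second timeout is an arbitrary placeholder.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class RestoreTimeoutSketch {

    // Hypothetical stand-in for the restore logic that
    // BackendRestorerProcedure runs; the real code builds a state backend
    // from the restore state handles.
    static String attemptCreateAndRestore() throws Exception {
        return "restored-backend";
    }

    // Runs the restore attempt on a separate thread and bounds it with
    // future.get(timeout), so a hung restore surfaces as a failure instead
    // of leaving the task in the initializing state indefinitely.
    static String restoreWithTimeout(long timeout, TimeUnit unit) throws Exception {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        try {
            Callable<String> restoreTask = RestoreTimeoutSketch::attemptCreateAndRestore;
            Future<String> future = executor.submit(restoreTask);
            return future.get(timeout, unit);
        } catch (TimeoutException e) {
            throw new RuntimeException(
                    "State restore did not finish within " + timeout + " " + unit, e);
        } finally {
            // Interrupt the restore thread if it is still running.
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(restoreWithTimeout(30, TimeUnit.SECONDS));
    }
}
```

With this shape, a restore that hangs past the timeout fails the task, so it is handled by the configured restart strategy just like any other failure.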
> Add flink managed timeout mechanism for backend restore operation
> -----------------------------------------------------------------
>
>                 Key: FLINK-33324
>                 URL: https://issues.apache.org/jira/browse/FLINK-33324
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Checkpointing, Runtime / State Backends
>            Reporter: dongwoo.kim
>            Priority: Minor
>         Attachments: image-2023-10-20-15-16-53-324.png, image-2023-10-20-17-42-11-504.png
>
> Hello community, I would like to share an issue our team recently faced and propose a feature to mitigate similar problems in the future.
> h2. Issue
> Our Flink streaming job encountered consecutive checkpoint failures and subsequently attempted a restart. The failure occurred due to timeouts in two subtasks located within the same task manager. The restore operation for that task manager also got stuck, leaving the job in an "initializing" state for over an hour. Once we noticed the hang during the restore operation, we terminated the task manager pod, which resolved the issue.
> !image-2023-10-20-15-16-53-324.png|width=683,height=604!
> The sequence of events was as follows:
> 1. Checkpoints timed out for subtasks within one task manager, referred to as tm-32.
> 2. The Flink job failed and initiated a restart.
> 3. Restoration succeeded for 282 subtasks but got stuck for the 2 subtasks in tm-32.
> 4. Although the Flink tasks were not yet fully running, checkpointing was still being triggered, leading to consecutive checkpoint failures.
> 5. These checkpoint failures appeared to be ignored and did not count toward the execution.checkpointing.tolerable-failed-checkpoints configuration.
> As a result, the job remained in the initialization phase for a very long period.
> 6. Once we found this, we terminated the tm-32 pod, leading to a successful Flink job restart.
> h2. Suggestion
> I feel that a Flink job remaining in the initializing state indefinitely is not ideal. To enhance resilience, I think it would be helpful to add a timeout feature for the restore operation. If the restore operation exceeds a specified duration, an exception should be thrown, causing the job to fail. This way, we can address restore-related issues similarly to how we handle checkpoint failures.
> h2. Notes
> Just to add, I've built a basic version of this feature to check that it works as expected. I've attached a picture from the Flink UI that shows the timeout exception thrown during the restore operation. It's just a start, but I hope it helps with our discussion. (I simulated network chaos using the [litmus|https://litmuschaos.github.io/litmus/experiments/categories/pods/pod-network-latency/#destination-ips-and-destination-hosts] chaos engineering tool.)
> !image-2023-10-20-17-42-11-504.png|width=940,height=317!
>
> Thank you for considering my proposal. I'm looking forward to hearing your thoughts. If there's agreement on this, I'd be happy to work on implementing this feature.

-- This message was sent by Atlassian Jira (v8.20.10#820010)