[ https://issues.apache.org/jira/browse/FLINK-34519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17824599#comment-17824599 ]
Hangxiang Yu commented on FLINK-34519:
--------------------------------------

Thanks for reporting this.
{quote}stopCheckpointScheduler() only needs to cancel previous periodic checkpoints, while the current behavior cancels ongoing savepoints as well.
{quote}
I agree that it's not reasonable. It seems this only happens when there is more than one ongoing checkpoint and they have different checkpoint types, right?
{quote}However, as the Batch-Streaming Unification optimizations need to change some of these assumptions, the checkpoint coordinator should fix this problem.
{quote}
Could you share more about how the "Batch-Streaming Unification optimizations" suffer from it? It may help me better understand the affected scope. Thanks.

> Refine checkpoint scheduling and canceling logic
> ------------------------------------------------
>
>                 Key: FLINK-34519
>                 URL: https://issues.apache.org/jira/browse/FLINK-34519
>             Project: Flink
>          Issue Type: Technical Debt
>          Components: Runtime / Checkpointing
>    Affects Versions: 1.20.0
>            Reporter: Yunfeng Zhou
>            Priority: Major
>
> In the current implementation, CheckpointCoordinator#startCheckpointScheduler
> would stop the checkpoint scheduler before starting it, and
> CheckpointCoordinator#stopCheckpointScheduler would cancel all ongoing and
> pending checkpoints. When a stop-with-savepoint request is received, the
> checkpoint coordinator would trigger stopCheckpointScheduler before creating
> the savepoint, and start the scheduler afterwards if the savepoint fails.
> The problem with this behavior is that it mixes up the behavior of different
> checkpointing types. For example, stopCheckpointScheduler() only needs to
> cancel previous periodic checkpoints, while the current behavior cancels
> ongoing savepoints as well. This behavior is still acceptable for now, given
> that there have only been periodic checkpoints and manual savepoints, and
> savepoints are the only type that changes checkpointing behavior once a Flink
> job starts.
> However, as the Batch-Streaming Unification optimizations need to change
> some of these assumptions, the checkpoint coordinator should fix this
> problem.
> To be exact, the checkpoint coordinator should at least distinguish between
> the following semantics.
> - Periodic checkpoint is enabled to ensure that failover recovery time is
> kept within a time limit.
> - Periodic checkpoint is disabled to reduce the corresponding performance
> overhead, but the ability to checkpoint still exists and users can trigger a
> savepoint anytime.
> - Checkpoint or savepoint is not allowed due to job status or topological
> requirements. There might be multiple requirements applicable to a Flink job
> at the same time, and releasing one of them is not enough to enable
> checkpoints.
> A Flink job should also be able to switch between the checkpointing
> semantics mentioned above dynamically during runtime.
> Besides, checkpoints canceled in stopCheckpointScheduler() would fail with an
> error message saying "Checkpoint Coordinator is suspending", which is
> ambiguous for debugging. The detailed reason should be recorded as well.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
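
For illustration only, the three semantics listed in the issue description could be modeled roughly as below. This is a minimal sketch, not Flink code: `CheckpointGate`, `Mode`, `Kind`, `block`/`unblock`, and `stopPeriodic` are hypothetical names invented for this example and do not exist in `CheckpointCoordinator`. It assumes that blocking requirements are tracked as a set of reasons (so releasing one of several does not re-enable checkpoints), that stopping periodic scheduling cancels only periodic checkpoints, and that the cancellation message carries the detailed reason.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.Iterator;
import java.util.List;
import java.util.Set;

/** Hypothetical model of the three checkpointing semantics; not part of Flink. */
class CheckpointGate {
    public enum Mode { PERIODIC_ENABLED, PERIODIC_DISABLED, BLOCKED }
    public enum Kind { PERIODIC_CHECKPOINT, SAVEPOINT }

    /** A pending checkpoint reduced to its id and kind. */
    private static final class Pending {
        final long id;
        final Kind kind;
        Pending(long id, Kind kind) { this.id = id; this.kind = kind; }
    }

    private boolean periodicEnabled;
    // Every active requirement that forbids checkpointing (assumed to be named by a string).
    private final Set<String> blockers = new HashSet<>();
    private final List<Pending> pending = new ArrayList<>();

    /** BLOCKED wins over the periodic flag as long as any blocker remains. */
    public Mode mode() {
        if (!blockers.isEmpty()) {
            return Mode.BLOCKED;
        }
        return periodicEnabled ? Mode.PERIODIC_ENABLED : Mode.PERIODIC_DISABLED;
    }

    public void setPeriodicEnabled(boolean enabled) { periodicEnabled = enabled; }
    public void block(String requirement) { blockers.add(requirement); }
    public void unblock(String requirement) { blockers.remove(requirement); }
    public int pendingCount() { return pending.size(); }

    /** Registers a checkpoint; returns false if the current mode rejects it. */
    public boolean trigger(long id, Kind kind) {
        Mode current = mode();
        if (current == Mode.BLOCKED) {
            return false;
        }
        if (kind == Kind.PERIODIC_CHECKPOINT && current != Mode.PERIODIC_ENABLED) {
            return false;
        }
        pending.add(new Pending(id, kind));
        return true;
    }

    /** Disables periodic scheduling and cancels only periodic checkpoints, keeping savepoints. */
    public List<String> stopPeriodic(String reason) {
        periodicEnabled = false;
        List<String> messages = new ArrayList<>();
        Iterator<Pending> it = pending.iterator();
        while (it.hasNext()) {
            Pending p = it.next();
            if (p.kind == Kind.PERIODIC_CHECKPOINT) {
                it.remove();
                // Record the detailed reason instead of a generic "suspending" message.
                messages.add("Checkpoint " + p.id + " canceled: " + reason);
            }
        }
        return messages;
    }

    public static void main(String[] args) {
        CheckpointGate gate = new CheckpointGate();
        gate.setPeriodicEnabled(true);
        gate.trigger(1L, Kind.PERIODIC_CHECKPOINT);
        gate.trigger(2L, Kind.SAVEPOINT);
        // Only the periodic checkpoint is canceled; the savepoint stays pending.
        System.out.println(gate.stopPeriodic("stop-with-savepoint requested"));
        System.out.println(gate.mode() + ", pending=" + gate.pendingCount());
    }
}
```

Keeping the blockers in a set rather than a single boolean is what lets several independent requirements forbid checkpointing at once; how a real coordinator would identify those requirements is left open here.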