[jira] [Commented] (FLINK-34519) Refine checkpoint scheduling and canceling logic

Hangxiang Yu (Jira) Thu, 07 Mar 2024 18:44:04 -0800


    [ https://issues.apache.org/jira/browse/FLINK-34519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17824599#comment-17824599 ]


Hangxiang Yu commented on FLINK-34519:
--------------------------------------

Thanks for reporting this.
{quote}stopCheckpointScheduler() only needs to cancel previous periodic 
checkpoints, while the current behavior cancels ongoing savepoints as well.
{quote}
I agree that it's not reasonable. It seems this only happens when there is more 
than one ongoing checkpoint and they have different checkpoint types, right?
{quote}However, as the Batch-Streaming Unification optimizations need to change 
some of these assumptions, the checkpoint coordinator should fix this problem.
{quote}
Could you share more about how the "Batch-Streaming Unification 
optimizations" are affected by it? It would help me better understand the 
affected scope. Thanks.
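
To make the quoted point concrete, here is a minimal sketch of a stop operation 
that cancels only periodic checkpoints and leaves an ongoing savepoint alone. 
It is not Flink's CheckpointCoordinator: the class SelectiveCancelSketch, the 
enum TriggerKind, and the methods register, stopPeriodicScheduling, and 
isPending are all hypothetical names invented for this illustration.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch; this is NOT Flink's CheckpointCoordinator. */
final class SelectiveCancelSketch {

    /** Hypothetical trigger kinds; Flink's real type model is richer. */
    enum TriggerKind { PERIODIC_CHECKPOINT, SAVEPOINT }

    // Pending checkpoints keyed by checkpoint id, kept in trigger order.
    private final Map<Long, TriggerKind> pending = new LinkedHashMap<>();

    void register(long checkpointId, TriggerKind kind) {
        pending.put(checkpointId, kind);
    }

    /**
     * Cancels only the periodic checkpoints and leaves savepoints pending.
     * Returns one message per canceled checkpoint, including the given reason.
     */
    List<String> stopPeriodicScheduling(String reason) {
        List<String> canceled = new ArrayList<>();
        Iterator<Map.Entry<Long, TriggerKind>> it = pending.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<Long, TriggerKind> entry = it.next();
            if (entry.getValue() == TriggerKind.PERIODIC_CHECKPOINT) {
                canceled.add("Checkpoint " + entry.getKey() + " canceled: " + reason);
                it.remove();
            }
        }
        return canceled;
    }

    boolean isPending(long checkpointId) {
        return pending.containsKey(checkpointId);
    }

    public static void main(String[] args) {
        SelectiveCancelSketch sketch = new SelectiveCancelSketch();
        sketch.register(1L, TriggerKind.PERIODIC_CHECKPOINT);
        sketch.register(2L, TriggerKind.SAVEPOINT);
        // Only checkpoint 1 is canceled; savepoint 2 stays pending.
        sketch.stopPeriodicScheduling("stop-with-savepoint requested")
                .forEach(System.out::println);
        System.out.println("savepoint still pending: " + sketch.isPending(2L));
    }
}
```

Passing the reason into the cancel call also addresses the ambiguity of a 
generic "Checkpoint Coordinator is suspending" message, since each canceled 
checkpoint records why it was canceled.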

 

> Refine checkpoint scheduling and canceling logic
> ------------------------------------------------
>
>                 Key: FLINK-34519
>                 URL: https://issues.apache.org/jira/browse/FLINK-34519
>             Project: Flink
>          Issue Type: Technical Debt
>          Components: Runtime / Checkpointing
>    Affects Versions: 1.20.0
>            Reporter: Yunfeng Zhou
>            Priority: Major
>
> In the current implementation, CheckpointCoordinator#startCheckpointScheduler 
> would stop the checkpoint scheduler before starting it, and 
> CheckpointCoordinator#stopCheckpointScheduler would cancel all ongoing and 
> pending checkpoints. When a stop-with-savepoint request is received, the 
> checkpoint coordinator would trigger stopCheckpointScheduler before creating 
> the savepoint, and start the scheduler afterwards if the savepoint fails.
> The problem with this behavior is that it mixes up the behavior of different 
> checkpointing types. For example, stopCheckpointScheduler() only needs to 
> cancel previous periodic checkpoints, while the current behavior cancels 
> ongoing savepoints as well. This behavior is still acceptable for now, given 
> that there are only periodic checkpoints and manual savepoints, and 
> savepoints are the only operation that changes checkpointing behavior once a Flink job 
> starts. However, as the Batch-Streaming Unification optimizations need to 
> change some of these assumptions, the checkpoint coordinator should fix this 
> problem.
> To be exact, the checkpoint coordinator should at least distinguish between 
> the following semantics.
> - Periodic checkpointing is enabled so that failover recovery time stays 
> within a time limit.
> - Periodic checkpoint is disabled to reduce corresponding performance 
> overhead, but the ability to checkpoint still exists and users can trigger a 
> savepoint anytime.
> - Checkpoint or savepoint is not allowed due to job status or topological 
> requirements. There might be multiple requirements applicable to a Flink job 
> at the same time, and releasing one of them is not enough to enable 
> checkpoints.
> A Flink job should also be able to switch between the checkpointing 
> semantics mentioned above dynamically at runtime.
> Besides, checkpoints canceled in stopCheckpointScheduler() would fail with an 
> error message saying "Checkpoint Coordinator is suspending", which is 
> ambiguous for debugging. The detailed reason should be recorded as well.
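
The three semantics listed in the description, and the rule that releasing one 
requirement is not enough while another still applies, can be sketched as a 
small state holder. This is only an illustration under stated assumptions: the 
class CheckpointingModeSketch, the enums Mode and Blocker, and every method 
name below are hypothetical and are not part of Flink's API.

```java
import java.util.EnumSet;
import java.util.Set;

/** Hypothetical sketch; names are illustrative and are NOT Flink APIs. */
final class CheckpointingModeSketch {

    /** The three semantics from the issue description. */
    enum Mode { PERIODIC_ENABLED, PERIODIC_DISABLED, NOT_ALLOWED }

    /** Hypothetical reasons that forbid both checkpoints and savepoints. */
    enum Blocker { JOB_STATUS, TOPOLOGY }

    // Every blocker must be released before checkpointing is allowed again.
    private final Set<Blocker> blockers = EnumSet.noneOf(Blocker.class);
    private boolean periodicEnabled;

    CheckpointingModeSketch(boolean periodicEnabled) {
        this.periodicEnabled = periodicEnabled;
    }

    void setPeriodicEnabled(boolean enabled) {
        this.periodicEnabled = enabled;
    }

    void addBlocker(Blocker blocker) {
        blockers.add(blocker);
    }

    void removeBlocker(Blocker blocker) {
        blockers.remove(blocker);
    }

    /** Derives the current mode; any remaining blocker wins. */
    Mode mode() {
        if (!blockers.isEmpty()) {
            return Mode.NOT_ALLOWED;
        }
        return periodicEnabled ? Mode.PERIODIC_ENABLED : Mode.PERIODIC_DISABLED;
    }

    /** Savepoints stay possible whenever checkpointing is not blocked. */
    boolean canTriggerSavepoint() {
        return mode() != Mode.NOT_ALLOWED;
    }

    public static void main(String[] args) {
        CheckpointingModeSketch state = new CheckpointingModeSketch(true);
        state.addBlocker(Blocker.JOB_STATUS);
        state.addBlocker(Blocker.TOPOLOGY);
        state.removeBlocker(Blocker.JOB_STATUS);
        System.out.println(state.mode()); // still blocked by TOPOLOGY
        state.removeBlocker(Blocker.TOPOLOGY);
        state.setPeriodicEnabled(false);
        System.out.println(state.mode() + ", savepoint allowed: "
                + state.canTriggerSavepoint());
    }
}
```

Tracking blockers as a set rather than a single flag is one way to let 
independent components forbid and re-allow checkpointing at runtime without 
overriding each other.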



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
