Hi Dongwoo,

If the checkpoint has failed
`execution.checkpointing.tolerable-failed-checkpoints` times, then
stopWithSavepoint is likely to fail as well.
Whether stopWithSavepoint succeeds or fails, will the job simply stop? I am
more curious about how this option would interact with the restart strategy.

Best,
Yanfei


Dongwoo Kim <dongwoo7....@gmail.com> wrote on Mon, Sep 4, 2023 at 22:17:
>
> Hi all,
> I have a proposal that aims to enhance a Flink application's resilience in
> cases of unexpected failures of checkpoint storage like S3 or HDFS.
>
> [Background]
> When using self-managed S3-compatible object storage, we faced asynchronous
> checkpoint failures lasting for an extended period (more than 30 minutes),
> leading to multiple job restarts and causing lag in our streaming
> application.
>
> [Current Behavior]
> Currently, when the number of checkpoint failures exceeds a predefined
> tolerable limit, Flink will either restart or fail the job, depending on how
> it's configured.
> In my opinion, this does not handle scenarios where the checkpoint storage
> itself is unreliable or experiencing downtime.
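> For context, here is a minimal flink-conf.yaml sketch of the knobs involved
> today (the values are just illustrative):
>
>     execution.checkpointing.interval: 60 s
>     execution.checkpointing.tolerable-failed-checkpoints: 5
>     restart-strategy: fixed-delay
>     restart-strategy.fixed-delay.attempts: 3
>     restart-strategy.fixed-delay.delay: 10 s
>
> With a setup like this, once checkpoints fail more than 5 times in a row the
> job fails, and the restart strategy restarts it (here, up to 3 times with a
> 10 s delay).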
>
> [Proposed Feature]
> I propose a config that allows for a graceful job stop with a savepoint when 
> the tolerable checkpoint failure limit is reached.
> Instead of restarting/failing the job when the tolerable checkpoint failure
> count is exceeded, Flink would simply trigger stopWithSavepoint when this new
> config is set to true.
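> As a rough sketch of what this could look like (the option name below is
> purely hypothetical and does not exist in Flink today):
>
>     # hypothetical option: when tolerable checkpoint failures are exceeded,
>     # stop the job with a savepoint instead of failing/restarting it
>     execution.checkpointing.stop-with-savepoint-on-tolerable-failure: true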
>
> This could offer the following benefits:
> - Indication of Checkpoint Storage State: Exceeding tolerable checkpoint 
> failures could indicate unstable checkpoint storage.
> - Automated Fallback Strategy: When combined with a monitoring cron job, this 
> feature could act as an automated fallback strategy for handling unstable 
> checkpoint storage.
>   The job would stop safely, take a savepoint, and then you could
> automatically restart it with a different checkpoint storage configured,
> such as switching from S3 to HDFS.
>
> For example, let's say the checkpoint path is configured to S3 and the
> savepoint path is configured to HDFS.
> When the new config is set to true and the tolerable checkpoint failure count
> is exceeded, the job stops with a savepoint.
> We can then restart the job from that savepoint with the checkpoint storage
> reconfigured to HDFS.
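> A rough sketch of that setup (the paths below are just illustrative):
>
>     state.checkpoints.dir: s3://my-checkpoint-bucket/flink-checkpoints
>     state.savepoints.dir: hdfs:///flink/savepoints
>
> After the job stops with a savepoint, a monitoring script could resubmit it
> from that savepoint with checkpoints redirected to HDFS, for example:
>
>     bin/flink run \
>         -s hdfs:///flink/savepoints/savepoint-abc123 \
>         -D state.checkpoints.dir=hdfs:///flink/checkpoints \
>         <job jar and args>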
>
> Looking forward to hearing the community's thoughts on this proposal.
> I'd also like to ask how the community handles long-lasting unstable
> checkpoint storage issues.
>
> Thanks in advance.
>
> Best, Dongwoo
