Hi Dongwoo,

If checkpoints have already failed `execution.checkpointing.tolerable-failed-checkpoints` times, then stopWithSavepoint is likely to fail as well. Whether stopWithSavepoint succeeds or fails, will the job simply stop? I am also curious how this option would interact with the restart strategy.
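For reference, a minimal sketch of how the two existing knobs fit together today (the interval, failure count, and restart values below are placeholders):

    import org.apache.flink.api.common.restartstrategy.RestartStrategies;
    import org.apache.flink.api.common.time.Time;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class CheckpointFailureConfig {
        public static void main(String[] args) {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

            // Checkpoint every 10s; fail the job after 5 tolerated checkpoint
            // failures (execution.checkpointing.tolerable-failed-checkpoints: 5).
            env.enableCheckpointing(10_000L);
            env.getCheckpointConfig().setTolerableCheckpointFailureNumber(5);

            // Once the job fails, the restart strategy takes over: here, at
            // most 3 restart attempts with a 10s delay between them.
            env.setRestartStrategy(RestartStrategies.fixedDelayRestart(3, Time.seconds(10)));
        }
    }

So the question is where the proposed stop-with-savepoint trigger would sit relative to this restart loop.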
Best,
Yanfei

Dongwoo Kim <dongwoo7....@gmail.com> wrote on Mon, Sep 4, 2023 at 22:17:
>
> Hi all,
> I have a proposal that aims to enhance a Flink application's resilience in
> cases of unexpected failures of checkpoint storage such as S3 or HDFS.
>
> [Background]
> When using self-managed S3-compatible object storage, we faced async
> checkpoint failures lasting more than 30 minutes, leading to multiple job
> restarts and causing lag in our streaming application.
>
> [Current Behavior]
> Currently, when the number of checkpoint failures exceeds a predefined
> tolerable limit, Flink either restarts or fails the job, depending on how
> it is configured.
> In my opinion this does not handle scenarios where the checkpoint storage
> itself is unreliable or experiencing downtime.
>
> [Proposed Feature]
> I propose a config option that allows a graceful job stop with a savepoint
> when the tolerable checkpoint failure limit is reached.
> Instead of restarting or failing the job when tolerable checkpoint
> failures are exceeded, setting this new option to true would simply
> trigger stopWithSavepoint.
>
> This could offer the following benefits:
> - Indication of checkpoint storage state: exceeding tolerable checkpoint
> failures can indicate unstable checkpoint storage.
> - Automated fallback strategy: combined with a monitoring cron job, this
> feature could act as an automated fallback for handling unstable
> checkpoint storage. The job would stop safely, take a savepoint, and could
> then be restarted automatically with different checkpoint storage
> configured, e.g. switching from S3 to HDFS.
>
> For example, suppose the checkpoint path is configured to S3 and the
> savepoint path to HDFS.
> When the new option is set to true, the job stops with a savepoint once
> the tolerable checkpoint failure limit is exceeded, and we can then
> restart the job from that savepoint with the checkpoint path reconfigured
> to HDFS.
>
> Looking forward to hearing the community's thoughts on this proposal.
> I would also like to ask how the community handles long-lasting unstable
> checkpoint storage issues today.
>
> Thanks in advance.
>
> Best,
> dongwoo
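As far as I understand, the fallback described above is already possible from the outside: a monitoring job can trigger stop-with-savepoint through the REST API (POST /jobs/:jobid/stop) and then resubmit the job with the checkpoint path pointed at HDFS. A rough sketch, assuming Java 11+; the endpoint host, job id, and savepoint directory are placeholders:

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    public class StopWithSavepointFallback {
        public static void main(String[] args) throws Exception {
            String restEndpoint = "http://jobmanager:8081";   // placeholder
            String jobId = "<job-id>";                        // placeholder
            String savepointDir = "hdfs:///flink/savepoints"; // placeholder

            // POST /jobs/:jobid/stop triggers stop-with-savepoint;
            // "drain": false stops without emitting MAX_WATERMARK.
            String body = "{\"targetDirectory\": \"" + savepointDir + "\", \"drain\": false}";
            HttpRequest request = HttpRequest.newBuilder()
                    .uri(URI.create(restEndpoint + "/jobs/" + jobId + "/stop"))
                    .header("Content-Type", "application/json")
                    .POST(HttpRequest.BodyPublishers.ofString(body))
                    .build();

            HttpResponse<String> response = HttpClient.newHttpClient()
                    .send(request, HttpResponse.BodyHandlers.ofString());

            // The response carries a trigger id; completion can be polled via
            // GET /jobs/:jobid/savepoints/:triggerid.
            System.out.println(response.body());
        }
    }

The proposal, as I read it, would move this trigger inside Flink, so that exceeding the tolerable failure count itself initiates the stop-with-savepoint instead of an external watcher.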