Hi Yanfei,
Thanks for the reply.

So uploading the changelog does not count as part of the checkpointing phase,
and I now understand that "execution.checkpointing.tolerable-failed-checkpoints"
is therefore not related to changelog upload failures.
However, how about introducing something like a tolerable-failed-changelog
configuration? It would allow the system to keep the changelog in memory when
an upload failure occurs and attempt the upload again during the next cycle.
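
To make the idea concrete, here is a rough, purely illustrative sketch of the
behaviour I have in mind. None of this is actual Flink code; the class name and
the tolerableFailedUploads knob are hypothetical stand-ins for the proposed
configuration:

import java.util.ArrayDeque;
import java.util.Queue;

/** Hypothetical sketch of "tolerate N failed changelog uploads" behaviour. */
public class TolerantChangelogUploader {

    // Stand-in for the proposed tolerable-failed-changelog setting.
    private final int tolerableFailedUploads;
    // Changes that could not be uploaded yet are kept in memory.
    private final Queue<byte[]> pendingChanges = new ArrayDeque<>();
    private int consecutiveFailures = 0;

    public TolerantChangelogUploader(int tolerableFailedUploads) {
        this.tolerableFailedUploads = tolerableFailedUploads;
    }

    /** Called on every upload cycle; 'uploader' stands in for the real S3 upload. */
    public void persist(byte[] newChanges, Uploader uploader) throws Exception {
        pendingChanges.add(newChanges);
        try {
            // Try to flush everything still buffered, including changes
            // that failed in previous cycles.
            while (!pendingChanges.isEmpty()) {
                uploader.upload(pendingChanges.peek());
                pendingChanges.poll();
            }
            consecutiveFailures = 0;
        } catch (Exception e) {
            consecutiveFailures++;
            if (consecutiveFailures > tolerableFailedUploads) {
                // Only now give up and fail the job, similar in spirit to
                // execution.checkpointing.tolerable-failed-checkpoints.
                throw e;
            }
            // Otherwise keep the changes in memory and retry in the next cycle.
        }
    }

    /** Stand-in for the real upload path. */
    public interface Uploader {
        void upload(byte[] changes) throws Exception;
    }
}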

We believe this could help us avoid unnecessary application restarts caused
by temporary S3 issues.
Are there any expected side effects with this approach?
Currently we are working around the issue by configuring a higher number of
retries and a larger timeout for the upload process, but it would be great if
we could simply tolerate a configured number of changelog upload failures.
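
For reference, our workaround tunes the changelog upload options, roughly along
these lines in flink-conf.yaml (the values below are illustrative rather than our
exact production settings, and I am going from memory on the dstl.dfs.upload.*
keys, so please correct me if they differ from the docs):

    execution.checkpointing.tolerable-failed-checkpoints: 5
    dstl.dfs.upload.retry-policy: fixed
    dstl.dfs.upload.max-attempts: 10
    dstl.dfs.upload.timeout: 2 min
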
Thanks in advance

Best regards
dongwoo

On Wed, Jun 21, 2023 at 12:38 PM, Yanfei Lei <fredia...@gmail.com> wrote:

> Hi Dongwoo,
>
> State changelogs are continuously uploaded to durable storage when the
> Changelog state backend is enabled. In other words, it also persists
> data **outside the checkpoint phase**, and an exception at that time
> will directly cause the job to fail. Only exceptions in the checkpoint
> phase are counted as checkpoint failures.
>
> On Tue, Jun 20, 2023 at 18:31, Dongwoo Kim <dongwoo7....@gmail.com> wrote:
> >
> > Hello all, I have a question about a changelog persist failure.
> > When the changelog persist process fails due to an S3 timeout, it seems to
> > lead to job failure regardless of our
> > "execution.checkpointing.tolerable-failed-checkpoints" configuration being
> > set to 5, with this log:
> >
> > Caused by: java.io.IOException: The upload for 522 has already failed previously
> >
> > Upon digging into the source code, I observed that Flink consistently
> > checks the sequence number against the latest failed sequence number,
> > resulting in an IOException. I am curious about the reasoning behind this
> > check, as it seems to interfere with the "tolerable-failed-checkpoints"
> > configuration working as expected.
> > Can anyone explain the goal behind this design?
> > Additionally, I'd like to propose a potential solution: what if we
> > adjusted this section to allow failed changelogs to be uploaded on
> > subsequent attempts, up to the "tolerable-failed-checkpoints" limit,
> > before declaring the job failed?
> >
> > Thanks in advance
> >
> > Best regards
> > dongwoo
> >
>
>
> --
> Best,
> Yanfei
>
