Hi,

This is a very good proposal. As far as I know, it addresses some very
critical production operations problems in certain scenarios. I have a few
minor questions:

1. As far as I know, some deployments keep multiple job managers on
   standby. Is your design still effective in that case? Have you run any
   tests? For instance, a standby job manager might take over the failed
   jobs more quickly.
2. Regarding the operator coordinator, how do you ensure that the
   checkpoint mechanism can restore the operator coordinator's state? For
   example:
   - How do you rule out that some state still lives only in the memory of
     the original operator coordinator? After all, the existing
     implementation was written under the assumption that the job manager
     does not fail.
   - Additionally, using NO_CHECKPOINT seems a bit odd. Why not use a
     normal checkpoint ID greater than 0 and record it in the event store?
3. If the issues raised in point 2 cannot be resolved in the short term,
   would it be possible to consider not supporting failover with a source
   job manager?
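
To make the NO_CHECKPOINT question a bit more concrete, here is a rough
sketch of the alternative: snapshot the coordinator under a normal,
positive ID, record that snapshot in an event store, and rebuild a fresh
coordinator from the latest record only. Every name in it (EventStore,
ToyCoordinator, snapshotTo, recover) is hypothetical and made up for
illustration; none of it is Flink's or the FLIP's actual API, and the
in-memory list only stands in for durable storage.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

/** Hypothetical sketch; none of these types are Flink or FLIP-383 APIs. */
class CoordinatorSnapshotSketch {

    /** One recorded coordinator snapshot, tagged with a positive ID. */
    static final class SnapshotEvent {
        final long snapshotId;
        final byte[] state;

        SnapshotEvent(long snapshotId, byte[] state) {
            this.snapshotId = snapshotId;
            this.state = state;
        }
    }

    /** Stand-in for the event store; a real one would write to durable storage. */
    static final class EventStore {
        private final List<SnapshotEvent> events = new ArrayList<>();

        void append(SnapshotEvent event) {
            events.add(event);
        }

        Optional<SnapshotEvent> latest() {
            return events.isEmpty()
                    ? Optional.empty()
                    : Optional.of(events.get(events.size() - 1));
        }
    }

    /** Toy coordinator whose entire state is a comma-separated list of splits. */
    static final class ToyCoordinator {
        private String assignedSplits = "";

        void assign(String split) {
            assignedSplits = assignedSplits.isEmpty() ? split : assignedSplits + "," + split;
        }

        byte[] snapshot() {
            return assignedSplits.getBytes(StandardCharsets.UTF_8);
        }

        void restore(byte[] data) {
            assignedSplits = new String(data, StandardCharsets.UTF_8);
        }

        String assignedSplits() {
            return assignedSplits;
        }
    }

    // IDs start at 1, so every recorded snapshot ID is greater than 0.
    private long nextSnapshotId = 1;

    /** Snapshots the coordinator and records the result in the event store. */
    long snapshotTo(ToyCoordinator coordinator, EventStore store) {
        long id = nextSnapshotId++;
        store.append(new SnapshotEvent(id, coordinator.snapshot()));
        return id;
    }

    /** Builds a fresh coordinator from the latest recorded snapshot, if any. */
    static ToyCoordinator recover(EventStore store) {
        ToyCoordinator fresh = new ToyCoordinator();
        store.latest().ifPresent(event -> fresh.restore(event.state));
        return fresh;
    }

    public static void main(String[] args) {
        CoordinatorSnapshotSketch sketch = new CoordinatorSnapshotSketch();
        EventStore store = new EventStore();
        ToyCoordinator original = new ToyCoordinator();

        original.assign("split-0");
        long id = sketch.snapshotTo(original, store);
        original.assign("split-1"); // never snapshotted, exists only in memory

        ToyCoordinator recovered = recover(store);
        System.out.println("snapshot id = " + id);
        System.out.println("recovered   = " + recovered.assignedSplits());
    }
}
```

In this toy version, split-1 is lost on recovery because it existed only in
the memory of the original coordinator, which is exactly the kind of state
the first sub-question under point 2 is asking about.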

Best,
Guowei


On Thu, Nov 2, 2023 at 6:01 PM Lijie Wang <wangdachui9...@gmail.com> wrote:

> Hi devs,
>
> Zhu Zhu and I would like to start a discussion about FLIP-383: Support Job
> Recovery for Batch Jobs[1]
>
> Currently, when Flink’s job manager crashes or gets killed, possibly due to
> unexpected errors or planned node decommissioning, one of the following
> two things happens:
> 1. The job fails, if it does not enable HA.
> 2. The job restarts, if it enables HA. A streaming job is resumed from the
> last successful checkpoint. A batch job has to run from the beginning, and
> all previous progress is lost.
>
> In view of this, we think a JM crash can cause a great regression for batch
> jobs, especially long-running batch jobs. This FLIP mainly aims to solve this
> problem so that batch jobs can recover most of their progress after a JM
> crash. Our goal is that most finished tasks do not need to be re-run.
>
> You can find more details in the FLIP-383[1]. Looking forward to your
> feedback.
>
> [1]
>
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-383%3A+Support+Job+Recovery+for+Batch+Jobs
>
> Best,
> Lijie
>
