Thanks for addressing my comments, Lijie. LGTM

Best,

Xintong



On Tue, Dec 5, 2023 at 2:56 PM Paul Lam <paullin3...@gmail.com> wrote:

> Hi Lijie,
>
> Recovery for batch jobs is no doubt a long-awaited feature. Thanks for
> the proposal!
>
> I’m concerned about the multi-job scenario. In session mode, users could
> use web submission to upload and run jars which may produce multiple
> Flink jobs. However, these jobs may not be submitted at once and run in
> parallel. Instead, they could be dependent on other jobs like a DAG. The
> schedule of the jobs is controlled by the user's main method.
>
> IIUC, in the FLIP, the main method is lost after the recovery, and only
> submitted jobs would be recovered. Is that right?
>
> Best,
> Paul Lam
>
> > 2023年11月2日 18:00,Lijie Wang <wangdachui9...@gmail.com> 写道:
> >
> > Hi devs,
> >
> > Zhu Zhu and I would like to start a discussion about FLIP-383: Support
> Job
> > Recovery for Batch Jobs[1]
> >
> > Currently, when Flink’s job manager crashes or gets killed, possibly due
> to
> > unexpected errors or planned nodes decommission, it will cause the
> > following two situations:
> > 1. Failed, if the job does not enable HA.
> > 2. Restart, if the job enable HA. If it’s a streaming job, the job will
> be
> > resumed from the last successful checkpoint. If it’s a batch job, it has
> to
> > run from beginning, all previous progress will be lost.
> >
> > In view of this, we think the JM crash may cause great regression for
> batch
> > jobs, especially long running batch jobs. This FLIP is mainly to solve
> this
> > problem so that batch jobs can recover most job progress after JM
> crashes.
> > In this FLIP, our goal is to let most finished tasks not need to be
> re-run.
> >
> > You can find more details in the FLIP-383[1]. Looking forward to your
> > feedback.
> >
> > [1]
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-383%3A+Support+Job+Recovery+for+Batch+Jobs
> >
> > Best,
> > Lijie
>
>

Reply via email to