Thanks for addressing my comments, Lijie. LGTM.

Best,
Xintong

On Tue, Dec 5, 2023 at 2:56 PM Paul Lam <paullin3...@gmail.com> wrote:

> Hi Lijie,
>
> Recovery for batch jobs is no doubt a long-awaited feature. Thanks for
> the proposal!
>
> I’m concerned about the multi-job scenario. In session mode, users could
> use web submission to upload and run jars which may produce multiple
> Flink jobs. However, these jobs may not be submitted at once and run in
> parallel. Instead, they could depend on other jobs, forming a DAG. The
> scheduling of the jobs is controlled by the user's main method.
>
> IIUC, in the FLIP, the main method is lost after the recovery, and only
> submitted jobs would be recovered. Is that right?
>
> Best,
> Paul Lam
>
> > On Nov 2, 2023, at 18:00, Lijie Wang <wangdachui9...@gmail.com> wrote:
> >
> > Hi devs,
> >
> > Zhu Zhu and I would like to start a discussion about FLIP-383: Support
> > Job Recovery for Batch Jobs [1].
> >
> > Currently, when Flink’s job manager crashes or gets killed, possibly due
> > to unexpected errors or planned node decommissioning, one of the
> > following two things happens:
> > 1. The job fails, if it does not enable HA.
> > 2. The job restarts, if it enables HA. A streaming job resumes from the
> > last successful checkpoint. A batch job has to run from the beginning,
> > and all previous progress is lost.
> >
> > In view of this, we think a JM crash may cause a great regression for
> > batch jobs, especially long-running batch jobs. This FLIP mainly aims to
> > solve this problem so that batch jobs can recover most of their progress
> > after a JM crash. Our goal in this FLIP is that most finished tasks do
> > not need to be re-run.
> >
> > You can find more details in FLIP-383 [1]. Looking forward to your
> > feedback.
> >
> > [1]
> > https://cwiki.apache.org/confluence/display/FLINK/FLIP-383%3A+Support+Job+Recovery+for+Batch+Jobs
> >
> > Best,
> > Lijie