Hi Zhu Zhu,
thanks for your reply. Your concern is valid. Our goal is to touch only the
CompletedCheckpointStore and the CheckpointIDCounter without instantiating
the JobMaster/Scheduler/ExecutionGraph. We would only have to initialize
these two classes (and, for the CompletedCheckpointStore, reload the
CompletedCheckpoints) in order to rerun their shutdown functionality. That
can be achieved through the CheckpointRecoveryFactory, which is provided by
the HighAvailabilityServices. The shutdown call itself only requires the
JobID and the JobStatus, both of which are provided by the JobResultStore.
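To make the interaction concrete, here is a rough, self-contained sketch of
that cleanup path. The interfaces below are simplified stand-ins for the real
Flink classes (the actual signatures differ); only the shape of the
interaction is meant to be accurate: both components are re-instantiated
through the recovery factory and shut down using nothing but the JobID and
the terminal JobStatus.

```java
// Hedged sketch: simplified stand-ins for the Flink classes named above;
// real method signatures differ, only the interaction pattern is shown.
import java.util.ArrayList;
import java.util.List;

public class CheckpointCleanupSketch {

    enum JobStatus { FINISHED, CANCELED, FAILED }

    interface CheckpointIDCounter {
        void shutdown(JobStatus terminalStatus);
    }

    interface CompletedCheckpointStore {
        void shutdown(JobStatus terminalStatus);
    }

    /** Stand-in for the factory obtained via the HighAvailabilityServices. */
    interface CheckpointRecoveryFactory {
        CheckpointIDCounter createCheckpointIDCounter(String jobId);

        // Creating the store implies reloading its CompletedCheckpoints.
        CompletedCheckpointStore createCompletedCheckpointStore(String jobId);
    }

    /**
     * Re-runs the shutdown of both checkpoint components for a terminated
     * job, using only the JobID and JobStatus (as provided by the
     * JobResultStore) -- no JobMaster/Scheduler/ExecutionGraph involved.
     */
    static void cleanupCheckpointComponents(
            CheckpointRecoveryFactory factory, String jobId, JobStatus terminalStatus) {
        factory.createCheckpointIDCounter(jobId).shutdown(terminalStatus);
        factory.createCompletedCheckpointStore(jobId).shutdown(terminalStatus);
    }

    public static void main(String[] args) {
        // In-memory fakes that record the shutdown calls, for illustration.
        List<String> calls = new ArrayList<>();
        CheckpointRecoveryFactory factory = new CheckpointRecoveryFactory() {
            @Override
            public CheckpointIDCounter createCheckpointIDCounter(String jobId) {
                return status -> calls.add("counter:" + jobId + ":" + status);
            }

            @Override
            public CompletedCheckpointStore createCompletedCheckpointStore(String jobId) {
                return status -> calls.add("store:" + jobId + ":" + status);
            }
        };

        cleanupCheckpointComponents(factory, "job-1", JobStatus.FINISHED);
        System.out.println(calls);
        // -> [counter:job-1:FINISHED, store:job-1:FINISHED]
    }
}
```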

We're not planning to touch the process that is currently in place on
master for the case where the failover happens before the job is marked as
globally terminated. In that case, the JobMaster still holds references to
the CompletedCheckpointStore and the CheckpointIDCounter via the scheduler.
Therefore, the JobMaster remains responsible for cleaning up these
components.

Matthias

On Mon, Nov 29, 2021 at 1:21 PM Zhu Zhu <reed...@gmail.com> wrote:

> Thanks for drafting this FLIP, Matthias, Mika and David.
>
> I like the proposed JobResultStore. Besides addressing the problem of
> re-executing finished jobs, it's also an important step towards HA of
> multi-job Flink applications.
>
> I have one question: the "Cleanup" section shows that the JobMaster is
> responsible for cleaning up the CheckpointCounter/CheckpointStore. Does
> this mean that Flink will have to re-create the
> JobMaster/Scheduler/ExecutionGraph for a terminated job to do the cleanup?
> If so, this can be heavy in certain cases, because the ExecutionGraph
> creation may conduct connector initialization. So I'm wondering whether it
> would be possible to make the CheckpointCounter/CheckpointStore a
> component of the Dispatcher?
>
> Thanks,
> Zhu
>
> Till Rohrmann <trohrm...@apache.org> wrote on Sat, Nov 27, 2021 at 1:29 AM:
>
>> Thanks for creating this FLIP Matthias, Mika and David.
>>
>> I think the JobResultStore is an important piece for fixing Flink's last
>> high-availability problem (afaik). Once we have this piece in place, users
>> no longer risk re-executing a successfully completed job.
>>
>> I have one comment concerning breaking interfaces:
>>
>> If we don't want to break interfaces, then we could keep the
>> HighAvailabilityServices.getRunningJobsRegistry() method and add a default
>> implementation for HighAvailabilityServices.getJobResultStore(). We could
>> then deprecate the former method and remove it in the subsequent release
>> (1.16).
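A minimal sketch of the non-breaking interface evolution described here
(type names follow the discussion, but all signatures and the default body
are invented for illustration): the old accessor is kept and deprecated,
while the new accessor gets a default implementation, so existing
implementations of the interface keep compiling unchanged.

```java
// Hedged sketch of the deprecate-then-default evolution pattern; the
// interfaces are empty placeholders, not the real Flink definitions.
public class HaServicesEvolutionSketch {

    // Marker stand-ins for the real Flink interfaces.
    interface RunningJobsRegistry {}

    interface JobResultStore {}

    interface HighAvailabilityServices {
        /**
         * Old accessor, kept for one release so the change is non-breaking.
         *
         * @deprecated use {@link #getJobResultStore()}; planned for removal
         *     in the subsequent release (1.16).
         */
        @Deprecated
        RunningJobsRegistry getRunningJobsRegistry();

        /**
         * New accessor with a default implementation, so existing
         * implementations of the interface need no immediate change.
         * (The default body here is a placeholder, not the real one.)
         */
        default JobResultStore getJobResultStore() {
            return new JobResultStore() {};
        }
    }

    /** A pre-existing implementation that only knows the old method. */
    static class LegacyHaServices implements HighAvailabilityServices {
        @Override
        public RunningJobsRegistry getRunningJobsRegistry() {
            return new RunningJobsRegistry() {};
        }
    }

    public static void main(String[] args) {
        HighAvailabilityServices services = new LegacyHaServices();
        // The legacy implementation transparently picks up the default.
        System.out.println(services.getJobResultStore() != null);
        // -> true
    }
}
```

Callers migrate to getJobResultStore() during the deprecation window; once
the old method is removed, implementations override the new accessor.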
>>
>> Apart from that, +1 for the FLIP.
>>
>> Cheers,
>> Till
>>
>> On Wed, Nov 17, 2021 at 6:05 PM David Morávek <d...@apache.org> wrote:
>>
>> > Hi everyone,
>> >
>> > Matthias, Mika and I want to start a discussion about the introduction
>> > of a new Flink component, the *JobResultStore*.
>> >
>> > The main motivation is to address the shortcomings of the
>> > *RunningJobsRegistry* and to supersede it with the new component. These
>> > shortcomings were first described in FLINK-11813 [1].
>> >
>> > This change should improve the overall stability of the JobManager's
>> > components and address race conditions in some of the failover
>> > scenarios during the job cleanup lifecycle.
>> >
>> > It should also help to ensure that Flink doesn't leave any uncleaned
>> > resources behind.
>> >
>> > We've prepared FLIP-194 [2], which outlines the design and the
>> > reasoning behind this new component.
>> >
>> > [1] https://issues.apache.org/jira/browse/FLINK-11813
>> > [2]
>> >
>> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=195726435
>> >
>> > We're looking forward to your feedback ;)
>> >
>> > Best,
>> > Matthias, Mika and David
>> >
>
>
