Thanks for creating this FLIP, Matthias, Mika and David. I think the JobResultStore is an important piece for fixing Flink's last high-availability problem (afaik). Once we have this piece in place, users no longer risk re-executing a successfully completed job.
I have one comment concerning breaking interfaces: If we don't want to break interfaces, then we could keep the HighAvailabilityServices.getRunningJobsRegistry() method and add a default implementation for HighAvailabilityServices.getJobResultStore(). We could then deprecate the former method and then remove it in the subsequent release (1.16).

Apart from that, +1 for the FLIP.

Cheers,
Till

On Wed, Nov 17, 2021 at 6:05 PM David Morávek <d...@apache.org> wrote:

> Hi everyone,
>
> Matthias, Mika and I want to start a discussion about introduction of a new
> Flink component, the *JobResultStore*.
>
> The main motivation is to address shortcomings of the *RunningJobsRegistry*
> and surpass it with the new component. These shortcomings have been first
> described in FLINK-11813 [1].
>
> This change should improve the overall stability of the JobManager's
> components and address the race conditions in some of the fail over
> scenarios during the job cleanup lifecycle.
>
> It should also help to ensure that Flink doesn't leave any uncleaned
> resources behind.
>
> We've prepared a FLIP-194 [2], which outlines the design and reasoning
> behind this new component.
>
> [1] https://issues.apache.org/jira/browse/FLINK-11813
> [2]
> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=195726435
>
> We're looking forward for your feedback ;)
>
> Best,
> Matthias, Mika and David
>
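
PS: To make the interface suggestion in my reply concrete (keep getRunningJobsRegistry(), add getJobResultStore() with a default implementation), here is a rough sketch. The types are simplified stand-ins and not Flink's real interfaces: the method names on RunningJobsRegistry and JobResultStore, the String job ids and the LegacyOnlyServices class are all hypothetical. Having the default adapt the old registry is just one option and my own assumption; the default could equally well throw an UnsupportedOperationException.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

final class HaServicesSketch {

    /** Simplified stand-in for the existing registry (hypothetical methods). */
    interface RunningJobsRegistry {
        void setJobFinished(String jobId);
        boolean isJobFinished(String jobId);
    }

    /** Simplified stand-in for the proposed component (hypothetical methods). */
    interface JobResultStore {
        void markAsFinished(String jobId);
        boolean hasJobResult(String jobId);
    }

    interface HighAvailabilityServices {

        /** Kept for one more release so that existing implementations still compile. */
        @Deprecated
        RunningJobsRegistry getRunningJobsRegistry();

        /**
         * New method with a default implementation, so implementors are not broken.
         * ASSUMPTION: the default simply adapts the legacy registry.
         */
        default JobResultStore getJobResultStore() {
            final RunningJobsRegistry legacy = getRunningJobsRegistry();
            return new JobResultStore() {
                @Override
                public void markAsFinished(String jobId) {
                    legacy.setJobFinished(jobId);
                }

                @Override
                public boolean hasJobResult(String jobId) {
                    return legacy.isJobFinished(jobId);
                }
            };
        }
    }

    /** A pre-existing implementation that knows nothing about the new method. */
    static final class LegacyOnlyServices implements HighAvailabilityServices {
        private final Set<String> finished = ConcurrentHashMap.newKeySet();

        private final RunningJobsRegistry registry = new RunningJobsRegistry() {
            @Override
            public void setJobFinished(String jobId) {
                finished.add(jobId);
            }

            @Override
            public boolean isJobFinished(String jobId) {
                return finished.contains(jobId);
            }
        };

        @Override
        @Deprecated
        public RunningJobsRegistry getRunningJobsRegistry() {
            return registry;
        }
    }

    public static void main(String[] args) {
        HighAvailabilityServices ha = new LegacyOnlyServices();
        JobResultStore store = ha.getJobResultStore();
        store.markAsFinished("job-1");
        System.out.println(store.hasJobResult("job-1")); // prints "true"
        System.out.println(store.hasJobResult("job-2")); // prints "false"
    }
}
```

The point is only that LegacyOnlyServices compiles unchanged and still hands out a JobResultStore; new implementations would override getJobResultStore() directly.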