Re: [DISCUSS] FLIP-194: Introduce the JobResultStore

David Morávek Mon, 06 Dec 2021 03:05:20 -0800

>
> I also hope that we could remove all the pointers in the HA store (ZK,
> ConfigMap) in the future.



I'll open a new thread with {user,dev}@f.a.o to verify the thoughts around
strong read-after-write consistency for FileSystems. If that goes well, I can
see it as one of the possible topics for 1.16 ;)

On Mon, Dec 6, 2021 at 11:03 AM Yang Wang <[email protected]> wrote:

> Thanks for the fruitful discussion. I also hope that we could remove all
> the pointers in the HA store (ZK, ConfigMap) in the future.
> After that, we would rely on ZK/ConfigMap only for leader election/retrieval.
>
>
> Best,
> Yang
>
> On Mon, Dec 6, 2021 at 4:57 PM David Morávek <[email protected]> wrote:
>
> > As all of the concerns seem to be addressed, I'd like to proceed with the
> > vote to move things forward.
> >
> > Thanks everyone for the feedback, it was really helpful!
> >
> > Best,
> > D.
> >
> > On Wed, Dec 1, 2021 at 6:39 AM Zhu Zhu <[email protected]> wrote:
> >
> > > Thanks for the explanation Matthias. The solution sounds good to me.
> > > I have no more concerns and +1 for the FLIP.
> > >
> > > Thanks,
> > > Zhu
> > >
> > > On Wed, Dec 1, 2021 at 12:56 PM Xintong Song <[email protected]> wrote:
> > >
> > > > @David,
> > > >
> > > > Thanks for the clarification.
> > > >
> > > > No more concerns from my side. +1 for this FLIP.
> > > >
> > > > Thank you~
> > > >
> > > > Xintong Song
> > > >
> > > >
> > > >
> > > > On Wed, Dec 1, 2021 at 12:28 AM Till Rohrmann <[email protected]>
> > > > wrote:
> > > >
> > > > > Given the other breaking changes, I think that it is ok to remove the
> > > > > `RunningJobsRegistry` completely.
> > > > >
> > > > > Since we allow users to specify a HighAvailabilityServices implementation
> > > > > when starting Flink via `high-availability: FQDN`, I think we should mark
> > > > > the interface as at least @Experimental.
> > > > >
> > > > > Cheers,
> > > > > Till
> > > > >
> > > > > On Tue, Nov 30, 2021 at 2:29 PM Mika Naylor <[email protected]> wrote:
> > > > >
> > > > > > Hi Till,
> > > > > >
> > > > > > We thought that breaking interfaces, specifically
> > > > > > HighAvailabilityServices and RunningJobsRegistry, was acceptable in this
> > > > > > instance because:
> > > > > >
> > > > > > - Neither of these interfaces is marked @Public, so they carry no
> > > > > >    guarantees about being public and stable.
> > > > > > - As far as we are aware, we currently have no users with custom
> > > > > >    HighAvailabilityServices implementations.
> > > > > > - The interface was already broken in 1.14 with the changes to
> > > > > >    CheckpointRecoveryFactory, and will likely be changed again in 1.15
> > > > > >    due to further changes in that factory.
> > > > > >
> > > > > > Given that, we thought changes to the interface would not be disruptive.
> > > > > > Perhaps it could be annotated as @Internal - I'm not sure exactly what
> > > > > > guarantees we try to give for the stability of the
> > > > > > HighAvailabilityServices interface.
> > > > > >
> > > > > > Kind regards,
> > > > > > Mika
> > > > > >
> > > > > > On 26.11.2021 18:28, Till Rohrmann wrote:
> > > > > > >Thanks for creating this FLIP Matthias, Mika and David.
> > > > > > >
> > > > > > >I think the JobResultStore is an important piece for fixing Flink's last
> > > > > > >high-availability problem (afaik). Once we have this piece in place, users
> > > > > > >no longer risk re-executing a successfully completed job.
> > > > > > >
> > > > > > >I have one comment concerning breaking interfaces:
> > > > > > >
> > > > > > >If we don't want to break interfaces, then we could keep the
> > > > > > >HighAvailabilityServices.getRunningJobsRegistry() method and add a default
> > > > > > >implementation for HighAvailabilityServices.getJobResultStore(). We could
> > > > > > >then deprecate the former method and remove it in the subsequent
> > > > > > >release (1.16).
> > > > > > >
> > > > > > >Apart from that, +1 for the FLIP.
> > > > > > >
> > > > > > >Cheers,
> > > > > > >Till
> > > > > > >
> > > > > > >On Wed, Nov 17, 2021 at 6:05 PM David Morávek <[email protected]> wrote:
> > > > > > >
> > > > > > >> Hi everyone,
> > > > > > >>
> > > > > > >> Matthias, Mika and I want to start a discussion about the introduction of
> > > > > > >> a new Flink component, the *JobResultStore*.
> > > > > > >>
> > > > > > >> The main motivation is to address shortcomings of the *RunningJobsRegistry*
> > > > > > >> and supersede it with the new component. These shortcomings were first
> > > > > > >> described in FLINK-11813 [1].
> > > > > > >>
> > > > > > >> This change should improve the overall stability of the JobManager's
> > > > > > >> components and address the race conditions in some of the failover
> > > > > > >> scenarios during the job cleanup lifecycle.
> > > > > > >>
> > > > > > >> It should also help to ensure that Flink doesn't leave any uncleaned
> > > > > > >> resources behind.
> > > > > > >>
> > > > > > >> We've prepared FLIP-194 [2], which outlines the design and reasoning
> > > > > > >> behind this new component.
> > > > > > >>
> > > > > > >> [1] https://issues.apache.org/jira/browse/FLINK-11813
> > > > > > >> [2] https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=195726435
> > > > > > >>
> > > > > > >> We're looking forward to your feedback ;)
> > > > > > >>
> > > > > > >> Best,
> > > > > > >> Matthias, Mika and David
> > > > > > >>
> > > > > >
> > > > > > Mika Naylor
> > > > > > https://autophagy.io
> > > > > >
> > > > >
> > > >
> > >
> >
>
