Thanks a lot for your replies! Yes, feedback is very much appreciated, especially regarding the approach in general.
I think it's a good idea to discuss some details in follow-up docs or
tickets (but I'd be happy to discuss them here as well). As for the PoC, I
hope we'll publish it soon (logging only), so all interested users will be
able to try it out.

Regards,
Roman

On Thu, Jan 28, 2021 at 1:58 PM Yuan Mei <yuanmei.w...@gmail.com> wrote:

> Big +1 to this FLIP!
>
> Great to see it stepping forward, since this idea has been discussed for
> quite a while. :-)
>
> 1. I totally agree that the critical part is the overhead added during
> normal state updates (forwarding each state update to the log in addition
> to applying it to the state backend). Once we have this part ready and
> assessed, we can better understand, and be more confident about, the
> impact that writing the additional log has on normal processing.
>
> 2. Besides that, it would also be helpful to understand how this idea
> fits into the overall fault-tolerance story. But that question might be
> out of the scope of this FLIP, I guess.
>
> Best
> Yuan
>
> On Thu, Jan 28, 2021 at 8:12 PM Stephan Ewen <se...@apache.org> wrote:
>
> > +1 to this FLIP in general.
> >
> > I like the general idea very much (full disclosure: I have been
> > involved in the discussions and drafting of the design for a while,
> > so I am not a new/neutral reviewer here).
> >
> > One thing I would like to see us do here is reach out to users early
> > and validate this approach. It is a very fundamental change that also
> > shifts certain tradeoffs, like "cost during execution" vs. "cost on
> > recovery". This approach will increase the data write rate to
> > S3/HDFS/...
> > So before we build every bit of the complex implementation, let's try
> > to validate/test the critical bits with the users.
> >
> > In my assessment, the most critical bit is the continuous log
> > writing, which adds overhead during execution time. Recovery is less
> > critical; there will be no overhead or additional load, so recovery
> > should be strictly better than it is currently.
> > I would therefore propose we focus on the implementation of the
> > logging first (ignoring recovery, focusing on one target
> > FileSystem/Object store) and test-run this with a few users, to see
> > that it works well and whether they like the new characteristics.
> >
> > I am also trying to contribute some adjustments to the FLIP text,
> > like more illustrations/explanations, to make it easier to share this
> > FLIP with a wider audience, so we can get the above-mentioned user
> > input and validation.
> >
> > Best,
> > Stephan
> >
> > On Thu, Jan 28, 2021 at 10:46 AM Piotr Nowojski
> > <pnowoj...@apache.org> wrote:
> >
> > > Hi Roman,
> > >
> > > +1 from my side on this proposal. Also a big +1 for the recent
> > > changes to this FLIP's motivation and high-level overview sections.
> > >
> > > For me there are still quite a few unanswered questions about how
> > > to actually implement the proposed changes, and especially how to
> > > integrate them with the state backends and checkpointing, but maybe
> > > we can address those in follow-up design docs, discuss them in the
> > > tickets, or even in a PoC.
> > >
> > > Piotrek
> > >
> > > On Fri, Jan 15, 2021 at 07:49 Khachatryan Roman
> > > <khachatryan.ro...@gmail.com> wrote:
> > >
> > > > Hi devs,
> > > >
> > > > I'd like to start a discussion of FLIP-158: Generalized
> > > > incremental checkpoints [1]
> > > >
> > > > FLIP motivation:
> > > > Low end-to-end latency is a much-demanded property in many Flink
> > > > setups. With exactly-once, this latency depends on the checkpoint
> > > > interval/duration, which in turn is defined by the slowest node
> > > > (usually the one doing a full, non-incremental snapshot). In
> > > > large setups with many nodes, the probability of at least one
> > > > node being slow gets higher, making almost every checkpoint slow.
> > > >
> > > > This FLIP proposes a mechanism to deal with this by materializing
> > > > and uploading state continuously, and uploading only the changed
> > > > part during the checkpoint itself. It differs from other
> > > > approaches in that 1) checkpoints are always incremental, and
> > > > 2) it works for any state backend.
> > > >
> > > > [1]
> > > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-158%3A+Generalized+incremental+checkpoints
> > > >
> > > > Any feedback is highly appreciated!
> > > >
> > > > Regards,
> > > > Roman
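To make the changelog idea in the FLIP motivation concrete, here is a minimal, purely illustrative sketch (plain Python, all names hypothetical; this is not Flink's actual API or implementation): every state update is applied to the backend and also appended to a changelog, so a checkpoint only has to ship the delta accumulated since the previous one, regardless of which backend holds the state.

```python
# Illustrative sketch of changelog-based incremental checkpoints
# (hypothetical names; NOT Flink's actual API or implementation).

class ChangelogKeyedState:
    """Toy key-value state that logs every update for incremental checkpoints."""

    def __init__(self):
        self._state = {}      # the "real" state backend (any backend would do)
        self._changelog = []  # changes not yet shipped in a checkpoint

    def update(self, key, value):
        # Normal-path overhead: one extra append per state update.
        self._state[key] = value
        self._changelog.append(("PUT", key, value))

    def checkpoint(self):
        # A checkpoint ships only the delta since the last checkpoint,
        # so it is incremental by construction.
        delta = self._changelog
        self._changelog = []
        return delta

    def restore(self, snapshot, deltas):
        # Recovery: load the last materialized snapshot, then replay
        # the logged deltas in order.
        self._state = dict(snapshot)
        for op, key, value in (c for d in deltas for c in d):
            if op == "PUT":
                self._state[key] = value


s = ChangelogKeyedState()
s.update("a", 1)
d1 = s.checkpoint()   # first delta contains only the update to "a"
s.update("a", 2)
s.update("b", 3)
d2 = s.checkpoint()   # second delta contains only changes made since d1

r = ChangelogKeyedState()
r.restore({}, [d1, d2])
print(r._state)       # {'a': 2, 'b': 3}
```

The sketch glosses over everything the actual design has to solve (continuous upload to S3/HDFS, periodic materialization to truncate the log, failure handling), but it shows where the normal-path overhead discussed above comes from: one log append per state update.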