Thanks a lot for your replies! Yes, feedback is very much appreciated, especially regarding the approach in general.
I think it's a good idea to discuss some details in follow-up docs or
tickets (but I'd be happy to discuss them here as well). As for the PoC, I
hope we'll publish it soon (logging only), so all interested users will be
able to try it out.

Regards,
Roman

On Thu, Jan 28, 2021 at 1:58 PM Yuan Mei <yuanmei.w...@gmail.com> wrote:

> Big +1 to this FLIP!
>
> Great to see it stepping forward, since this idea has been discussed for
> quite a while. :-)
>
> 1. I totally agree that the critical part is the overhead added during
> normal state updates (forwarding each state update to the log in addition
> to applying it to the state backend). Once we have this part ready and
> assessed, we can better understand, and be more confident about, the
> impact that writing the additional log has on normal processing.
>
> 2. Besides that, it would also be helpful to understand how this idea
> fits into the overall fault-tolerance story. But that question might be
> out of the scope of this FLIP, I guess.
>
> Best
> Yuan
>
> On Thu, Jan 28, 2021 at 8:12 PM Stephan Ewen <se...@apache.org> wrote:
>
> > +1 to this FLIP in general.
> >
> > I like the general idea very much (full disclosure: I have been
> > involved in the discussions and drafting of the design for a while,
> > so I am not a new/neutral reviewer here).
> >
> > One thing I would like to see us do here is reach out to users early
> > and validate this approach. It is a very fundamental change that also
> > shifts certain tradeoffs, like "cost during execution" vs. "cost on
> > recovery". This approach will increase the data write rate to
> > S3/HDFS/...
> > So before we build every bit of the complex implementation, let's try
> > to validate/test the critical bits with the users.
> >
> > In my assessment, the most critical bit is the continuous log
> > writing, which adds overhead during execution time. Recovery is less
> > critical; there will be no overhead or additional load, so recovery
> > should be strictly better than it is currently.
> > I would therefore propose we focus on the implementation of the
> > logging first (ignoring recovery, focusing on one target
> > FileSystem/Object store) and test-run this with a few users, to see
> > that it works well and whether they like the new characteristics.
> >
> > I am also trying to contribute some adjustments to the FLIP text,
> > like more illustrations/explanations, to make it easier to share this
> > FLIP with a wider audience, so we can get the above-mentioned user
> > input and validation.
> >
> > Best,
> > Stephan
> >
> > On Thu, Jan 28, 2021 at 10:46 AM Piotr Nowojski
> > <pnowoj...@apache.org> wrote:
> >
> > > Hi Roman,
> > >
> > > +1 from my side on this proposal. Also a big +1 for the recent
> > > changes to this FLIP's motivation and high-level overview sections.
> > >
> > > For me there are still quite a few unanswered questions about how
> > > to actually implement the proposed changes, and especially how to
> > > integrate them with the state backends and checkpointing, but maybe
> > > we can address those in follow-up design docs, discuss them in the
> > > tickets, or even in a PoC.
> > >
> > > Piotrek
> > >
> > > On Fri, Jan 15, 2021 at 07:49 Khachatryan Roman
> > > <khachatryan.ro...@gmail.com> wrote:
> > >
> > > > Hi devs,
> > > >
> > > > I'd like to start a discussion of FLIP-158: Generalized
> > > > incremental checkpoints [1]
> > > >
> > > > FLIP motivation:
> > > > Low end-to-end latency is a much-demanded property in many Flink
> > > > setups. With exactly-once, this latency depends on the checkpoint
> > > > interval/duration, which in turn is defined by the slowest node
> > > > (usually the one doing a full, non-incremental snapshot). In
> > > > large setups with many nodes, the probability of at least one
> > > > node being slow gets higher, making almost every checkpoint slow.
> > > >
> > > > This FLIP proposes a mechanism to deal with this by materializing
> > > > and uploading state continuously, and uploading only the changed
> > > > part during the checkpoint itself. It differs from other
> > > > approaches in that 1) checkpoints are always incremental, and
> > > > 2) it works for any state backend.
> > > >
> > > > [1]
> > > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-158%3A+Generalized+incremental+checkpoints
> > > >
> > > > Any feedback is highly appreciated!
> > > >
> > > > Regards,
> > > > Roman
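To make the changelog idea in the FLIP motivation concrete, here is a minimal, purely illustrative sketch (plain Python, all names hypothetical; this is not Flink's actual API or implementation): every state update is applied to the backend and also appended to a changelog, so a checkpoint only has to ship the delta accumulated since the previous one, regardless of which backend holds the state.

```python
# Illustrative sketch of changelog-based incremental checkpoints
# (hypothetical names; NOT Flink's actual API or implementation).

class ChangelogKeyedState:
    """Toy key-value state that logs every update for incremental checkpoints."""

    def __init__(self):
        self._state = {}      # the "real" state backend (any backend would do)
        self._changelog = []  # changes not yet shipped in a checkpoint

    def update(self, key, value):
        # Normal-path overhead: one extra append per state update.
        self._state[key] = value
        self._changelog.append(("PUT", key, value))

    def checkpoint(self):
        # A checkpoint ships only the delta since the last checkpoint,
        # so it is incremental by construction.
        delta = self._changelog
        self._changelog = []
        return delta

    def restore(self, snapshot, deltas):
        # Recovery: load the last materialized snapshot, then replay
        # the logged deltas in order.
        self._state = dict(snapshot)
        for op, key, value in (c for d in deltas for c in d):
            if op == "PUT":
                self._state[key] = value


s = ChangelogKeyedState()
s.update("a", 1)
d1 = s.checkpoint()   # first delta contains only the update to "a"
s.update("a", 2)
s.update("b", 3)
d2 = s.checkpoint()   # second delta contains only changes made since d1

r = ChangelogKeyedState()
r.restore({}, [d1, d2])
print(r._state)       # {'a': 2, 'b': 3}
```

The sketch glosses over everything the actual design has to solve (continuous upload to S3/HDFS, periodic materialization to truncate the log, failure handling), but it shows where the normal-path overhead discussed above comes from: one log append per state update.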