Hi Kostas,

It makes a lot of sense to have just one underlying mechanism (snapshot) to save the state of a Flink job. We can then use that mechanism in different scenarios, including checkpoints and user-triggered savepoints.

To facilitate the discussion, maybe it is useful to clarify a few design goals, for example:

1. One unified snapshot format that supports
   - both incremental and global state saving
   - rescaling on recovery
   - compatibility check / migration across different Flink versions?
2. The snapshot can easily be managed by users.

I also have two questions regarding the FLIP.

1. What are the side effects of taking a snapshot? Do you mean that taking a snapshot may trigger some action other than saving the state of the job? Technically speaking, taking a snapshot should be a "read-only" action on the Flink job. So I assume that by side effects you mean it is no longer read-only. If so, can you be more specific about which side effects you are referring to?
2. In the rejected alternative, you mentioned an A/B testing scenario. It seems that if execution A and execution B run with different configurations after the savepoint, the histories of the two jobs will always be different from then on, right?

Thanks,

Jiangjie (Becket) Qin

On Mon, Jul 8, 2019 at 9:53 PM Kostas Kloudas <kklou...@gmail.com> wrote:

> Hi Devs,
>
> Currently there is a number of efforts around checkpoints/savepoints, as
> reflected by the number of FLIPs. From a quick look FLIP-34, FLIP-41,
> FLIP-43, and FLIP-45 are all directly related to these topics. This
> reflects the importance of these two notions/features to the users of the
> framework.
>
> Although many efforts are centred around these notions, their semantics and
> the interplay between them is not always clearly defined. This makes them
> difficult to explain them to the users (all the different combinations of
> state-backends, formats and tradeoffs) and in some cases it may have
> negative effects to the users (e.g. the already-fixed-some-time-ago issue
> of savepoints not being considered for recovery although they committed
> side-effects).
>
> FLIP-47 [1] and the related Document [2] is aiming at starting a discussion
> around the semantics of savepoints/checkpoints and their interplay, and to
> some extent help us fix the future steps concerning these notions. As an
> example, should we work towards bringing them closer, or moving them
> further apart.
>
> This is not a complete proposal (by no means), as many of the practical
> implications can only be fleshed out after we agree on the basic semantics
> and the general frame around these notions. To that end, there are no
> concrete implementation steps and the FLIP is going to be updated as the
> discussion continues.
>
> I am really looking forward to your opinions on the topic.
>
> Cheers,
> Kostas
>
> [1]
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-47%3A+Checkpoints+vs.+Savepoints
> [2]
> https://docs.google.com/document/d/1_1FF8D3u0tT_zHWtB-hUKCP_arVsxlmjwmJ-TvZd4fs/edit?usp=sharing