Hi Kostas,

It makes a lot of sense to have just one underlying mechanism (snapshot) to
save the state of a Flink job. We can then use that mechanism in different
scenarios, including checkpoints and user-triggered savepoints.

To facilitate the discussion, maybe it is useful to clarify a few design
goals, for example:

1. One unified snapshot format that supports
     - both incremental and global state saving
     - rescaling on recovery
     - compatibility check / migration across different Flink versions?
2. The snapshot can easily be managed by users.


And I have two questions regarding the FLIP.

1. What are the side-effects when taking a snapshot? Do you mean that taking
a snapshot may trigger some action other than saving the state of the job?
Technically speaking, taking a snapshot should be a "read-only" action on the
Flink job. So I assume that by side-effects you mean it is no longer
read-only. If so, can you be more specific about which side-effects you are
referring to?

2. In the rejected alternative, you mentioned an A/B testing scenario. It
seems that if execution A and execution B run different configurations
after the savepoint, the histories of the two jobs will always be different
after that, right?

Thanks,

Jiangjie (Becket) Qin

On Mon, Jul 8, 2019 at 9:53 PM Kostas Kloudas <kklou...@gmail.com> wrote:

> Hi Devs,
>
> Currently there are a number of efforts around checkpoints/savepoints, as
> reflected by the number of FLIPs. From a quick look FLIP-34, FLIP-41,
> FLIP-43, and FLIP-45 are all directly related to these topics. This
> reflects the importance of these two notions/features to the users of the
> framework.
>
> Although many efforts are centred around these notions, their semantics and
> the interplay between them are not always clearly defined. This makes them
> difficult to explain to users (all the different combinations of
> state-backends, formats and tradeoffs) and in some cases it may have
> negative effects on users (e.g. the already-fixed-some-time-ago issue
> of savepoints not being considered for recovery although they committed
> side-effects).
>
> FLIP-47 [1] and the related document [2] aim to start a discussion
> around the semantics of savepoints/checkpoints and their interplay, and to
> some extent help us fix the future steps concerning these notions. As an
> example, should we work towards bringing them closer, or moving them
> further apart?
>
> This is not a complete proposal (by any means), as many of the practical
> implications can only be fleshed out after we agree on the basic semantics
> and the general frame around these notions. To that end, there are no
> concrete implementation steps and the FLIP is going to be updated as the
> discussion continues.
>
> I am really looking forward to your opinions on the topic.
>
> Cheers,
> Kostas
>
> [1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-47%3A+Checkpoints+vs.+Savepoints
> [2] https://docs.google.com/document/d/1_1FF8D3u0tT_zHWtB-hUKCP_arVsxlmjwmJ-TvZd4fs/edit?usp=sharing
>
