Hi Mate,
thanks for proposing this, I'm really excited about your FLIP. I hope my
questions make sense to you:

1. I would like to discuss the "FlinkStateSnapshot" name and the fact that
users have to use either the savepoint or checkpoint spec inside the
FlinkStateSnapshot.
Wouldn't it be more intuitive to introduce two CRs:
FlinkSavepoint and FlinkCheckpoint
Ideally they can internally share a lot of code paths, but from a users
perspective, the abstraction is much clearer.

2. I also would like to discuss SavepointSpec.completed, as this name is
not intuitive to me. How about "ignoreExisting"?

3. The FLIP proposal seems to leave error handling to the user, e.g. when
you create a FlinkStateSnapshot, it will just move to status FAILED.
Typically in K8s with the control loop stuff, resources are tried to get
created until success. I think it would be really nice if the
FlinkStateSnapshot or FlinkSavepoint resource would retry based on a
property in the resource. A "FlinkStateSnapshot.retries" number would
indicate how often the user wants the operator to retry creating a
savepoint, "retries = -1" means retry forever. In addition, we could
consider a timeout as well, however, I haven't seen such a concept in K8s
CRs yet.
The benefit of this is that other tools relying on the K8s operator
wouldn't have to implement this retry loop (which is quite natural for
K8s), they would just have to wait for the CR they've created to transition
into COMPLETED:

3. FlinkStateSnapshotStatus.error will only show the last error. What about
using Events, so that we can show multiple errors and use the
FlinkStateSnapshotState to report errors?

4. I wonder if it makes sense to use something like Pod Conditions (e.g.
Savepoint Conditions):
https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#pod-conditions
to track the completion status. We could have the following conditions:
- Triggered
- Completed
- Failed
The only benefit of this proposal that I see is that it would tell the user
how long it took to create the savepoint.

5. You mention that "JobSpec.initialSavepointPath" will be deprecated. I
assume we will introduce a new field for referencing a FlinkStateSnapshot
CR? I think it would be good to cover this in the FLIP.


One minor comment:

"/** Dispose the savepoints upon CRD deletion. */"

I think this should be "upon CR deletion", not "CRD deletion".

Thanks again for this great FLIP!

Best,
Robert


On Fri, Apr 19, 2024 at 9:01 AM Gyula Fóra <gyula.f...@gmail.com> wrote:

> Cc'ing some folks who gave positive feedback on this idea in the past.
>
> I would love to hear your thoughts on the proposed design
>
> Gyula
>
> On Tue, Apr 16, 2024 at 6:31 PM Őrhidi Mátyás <matyas.orh...@gmail.com>
> wrote:
>
>> +1 Looking forward to it
>>
>> On Tue, Apr 16, 2024 at 8:56 AM Mate Czagany <czmat...@gmail.com> wrote:
>>
>> > Thank you Gyula!
>> >
>> > I think that is a great idea. I have updated the Google doc to only
>> have 1
>> > new configuration option of boolean type, which can be used to signal
>> the
>> > Operator to use the old mode.
>> >
>> > Also described in the configuration description, the Operator will
>> fallback
>> > to the old mode if the FlinkStateSnapshot CRD cannot be found on the
>> > Kubernetes cluster.
>> >
>> > Regards,
>> > Mate
>> >
>> > Gyula Fóra <gyula.f...@gmail.com> ezt írta (időpont: 2024. ápr. 16., K,
>> > 17:01):
>> >
>> > > Thanks Mate, this is great stuff.
>> > >
>> > > Mate, I think the new configs should probably default to the new mode
>> and
>> > > they should only be useful for users to fall back to the old
>> behaviour.
>> > > We could by default use the new Snapshot CRD if the CRD is installed,
>> > > otherwise use the old mode by default and log a warning on startup.
>> > >
>> > > So I am suggesting a "dynamic" default behaviour based on whether the
>> new
>> > > CRD was installed or not because we don't want to break operator
>> startup.
>> > >
>> > > Gyula
>> > >
>> > > On Tue, Apr 16, 2024 at 4:48 PM Mate Czagany <czmat...@gmail.com>
>> wrote:
>> > >
>> > > > Hi Ferenc,
>> > > >
>> > > > Thank you for your comments, I have updated the Google docs with a
>> new
>> > > > section for the new configs.
>> > > > All of the newly added config keys will have defaults set, and by
>> > default
>> > > > all the savepoint/checkpoint operations will use the old system:
>> write
>> > > > their results to the FlinkDeployment/FlinkSessionJob status field.
>> > > >
>> > > > I have also added a default for the checkpoint type to be FULL
>> (which
>> > is
>> > > > also the default currently). That was an oversight on my part to
>> miss
>> > > that.
>> > > >
>> > > > Regards,
>> > > > Mate
>> > > >
>> > > > Ferenc Csaky <ferenc.cs...@pm.me.invalid> ezt írta (időpont: 2024.
>> > ápr.
>> > > > 16., K, 16:10):
>> > > >
>> > > > > Thank you Mate for initiating this discussion. +1 for this idea.
>> > > > > Some Qs:
>> > > > >
>> > > > > Can you specify the newly introduced configurations in more
>> > > > > details? Currently, it is not fully clear to me what are the
>> > > > > possible values of `kubernetes.operator.periodic.savepoint.mode`,
>> > > > > is it optional, has a default value?
>> > > > >
>> > > > > I see that in `SavepointSpec.formatType` has a default, although
>> > > > > `CheckppointSpec.checkpointType` not. Are we inferring that from
>> > > > > the config? My point is, in general I think it would be good to
>> > > > > handle the two snapshot types in a similar way when it makes sense
>> > > > > to minimize any kind of confusion.
>> > > > >
>> > > > > Best,
>> > > > > Ferenc
>> > > > >
>> > > > >
>> > > > >
>> > > > > On Tuesday, April 16th, 2024 at 11:34, Mate Czagany <
>> > > czmat...@gmail.com>
>> > > > > wrote:
>> > > > >
>> > > > > >
>> > > > > >
>> > > > > > Hi Everyone,
>> > > > > >
>> > > > > > I would like to start a discussion on FLIP-446: Kubernetes
>> Operator
>> > > > State
>> > > > > > Snapshot CRD.
>> > > > > >
>> > > > > > This FLIP adds a new custom resource for Operator users to
>> create
>> > and
>> > > > > > manage their savepoints and checkpoints. I have also developed
>> an
>> > > > initial
>> > > > > > POC to prove that this approach is feasible, you can find the
>> link
>> > > for
>> > > > > that
>> > > > > > in the FLIP.
>> > > > > >
>> > > > > > There is a Confluence page [1] and a Google Docs page [2] as I
>> do
>> > not
>> > > > > have
>> > > > > > a Confluence account yet.
>> > > > > >
>> > > > > > [1]
>> > > > > >
>> > > > >
>> > > >
>> > >
>> >
>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-446%3A+Kubernetes+Operator+State+Snapshot+CRD
>> > > > > > [2]
>> > > > > >
>> > > > >
>> > > >
>> > >
>> >
>> https://docs.google.com/document/d/1VdfLFaE4i6ESbCQ38CH7TKOiPQVvXeOxNV2FeSMnOTg
>> > > > > >
>> > > > > >
>> > > > > > Regards,
>> > > > > > Mate
>> > > > >
>> > > >
>> > >
>> >
>>
>

Reply via email to