Re: [DISCUSS] FLIP-446: Kubernetes Operator State Snapshot CRD

Gyula Fóra Fri, 19 Apr 2024 12:45:16 -0700

Hey!

Regarding the question of initialSavepointPath and
flinkStateSnapshotReference new object, I think we could simply keep an
extra field as part of the flinkStateSnapshotReference object called path.


Then the fields could be:
namespace, name, path

If path is defined we would use that (to support the simple way also)
otherwise use the resource. I would still deprecate the
initialSavepointPath field in the jobSpec.

Regarding the Savepoint/Checkpoint vs FlinkStateSnapshot.
What we need:
 1. Easy way to list all state snapshots (to select latest)
 2. Easy way to reference a savepoint/checkpoint from a jobspec
 3. Differentiate state snapshot types (in some cases users may prefer to
use checkpoint/savepoint for certain upgrades) -> we should add a label or
something for easy selection
 4. Be able to delete savepoints (and checkpoints maybe)

I am personally slightly more in favor of having a single resource as that
ticks all the boxes, while having 2 separate resources will make both
listing and referencing harder. We would have to introduce state type in
the reference (name + namespace would not be enough to uniquely identify a
state snapshot)

I wonder if I am missing any good argument against the single
FlinkStateSnapshot here.

Cheers,
Gyula


On Fri, Apr 19, 2024 at 9:09 PM Mate Czagany <[email protected]> wrote:

> Hi Robert and Thomas,
>
> Thank you for sharing your thoughts, I will try to address your questions
> and suggestions:
>
> 1. I would really love to hear others' inputs as well about separating the
> snapshot CRD into two different CRDs instead for savepoints and
> checkpoints. I think the main upside is that we would not need the
> mandatory savepoint or checkpoint field inside the spec. The two CRs could
> share the same status fields, and their specs would be different.
> I personally like both solutions, and would love to hear others' thoughts
> as well.
>
> 2. I agree with you that "completed" is not very clear, but I would
> suggest the name "alreadyExists". WDYT?
>
> 3. I think having a retry loop inside the operator does not add too much
> complexity to the FLIP. On failure, we check if we have reached the max
> retries. If we did, the state will be set to "FAILED", else it will be set
> to "TRIGGER_PENDING", causing the operator to retry the task. The "error"
> field will always be populated with the latest error. Kubernetes Jobs
> already has a similar field called "backoffLimit", maybe we could use the
> same name, with the same logic applied, WDYT?
> About error events, I think we should keep the "error" field, and upon
> successful snapshot, we clear it. I will add to the FLIP that there will be
> an event generated for each unsuccessful snapshots.
>
> 4. I really like the idea of having something like Pod Conditions, but I
> think it wouldn't add too much value here, because the only 2 stages
> important to the user are "Triggered" and "Completed", and those timestamps
> will already be included in the status field. I think it would make more
> sense to implement this if there were more lifecycle stages.
>
> 5. There will be a new field in JobSpec called
> "flinkStateSnapshotReference" to reference a FlinkStateSnapshot to restore
> from.
>
> > How do you see potential effects on API server performance wrt. number of
> objects vs mutations? Is the proposal more or less neutral in that regard?
>
> While I am not an expert in Kubernetes internals, my understanding is that
> for the api-server, editing an existing resource or creating a new one is
> not different performance-wise, because the whole resource will always be
> written to etcd anyways.
> Retrieving the savepoints from etcd will be different though for some
> use-cases, e.g. retrieving all snapshots for a specific FlinkDeployment
> would require the api-server to retrieve every snapshots first in a
> namespace from etcd, then filter them for that specific FlinkDeployment. I
> think this is a worst-case scenario, and it will be up to the user to
> optimize their queries via e.g. watch queries [1] or resourceVersions [2].
>
> > Does that mean one would have to create a FlinkStateSnapshot CR when
> starting a new deployment from savepoint? If so, that's rather complicated.
> I would prefer something more simple/concise and would rather
> keep initialSavepointPath
>
> Starting a job from a savepoint path will indeed be deprecated with this
> FLIP. I agree that it will be more complicated to restore from a savepoint
> in those cases, but if the user decides to move away from the deprecated
> savepoint mechanisms, every savepoint will result in a new
> FlinkStateSnapshot CR. So the only situation I expect this to be an
> inconvenience is when the user onboards a new Flink job to the operator.
> But I may not be thinking this through, so please let me know if you
> disagree.
>
> Thank you very much for your questions and suggestions!
>
> [1]
> https://kubernetes.io/docs/reference/using-api/api-concepts/#efficient-detection-of-changes
> [2]
> https://kubernetes.io/docs/reference/using-api/api-concepts/#resource-versions
>
> Regards,
> Mate
>
> Thomas Weise <[email protected]> ezt írta (időpont: 2024. ápr. 19., P,
> 11:31):
>
>> Thanks for the proposal.
>>
>> How do you see potential effects on API server performance wrt. number of
>> objects vs mutations? Is the proposal more or less neutral in that regard?
>>
>> Thanks for the thorough feedback Robert.
>>
>> Couple more questions below.
>>
>> -->
>>
>> On Fri, Apr 19, 2024 at 5:07 AM Robert Metzger <[email protected]>
>> wrote:
>>
>> > Hi Mate,
>> > thanks for proposing this, I'm really excited about your FLIP. I hope my
>> > questions make sense to you:
>> >
>> > 1. I would like to discuss the "FlinkStateSnapshot" name and the fact
>> that
>> > users have to use either the savepoint or checkpoint spec inside the
>> > FlinkStateSnapshot.
>> > Wouldn't it be more intuitive to introduce two CRs:
>> > FlinkSavepoint and FlinkCheckpoint
>> > Ideally they can internally share a lot of code paths, but from a users
>> > perspective, the abstraction is much clearer.
>> >
>>
>> There are probably pros and cons either way. For example it is desirable
>> to
>> have a single list of state snapshots when looking for the initial
>> savepoint for a new deployment etc.
>>
>>
>> >
>> > 2. I also would like to discuss SavepointSpec.completed, as this name is
>> > not intuitive to me. How about "ignoreExisting"?
>> >
>> > 3. The FLIP proposal seems to leave error handling to the user, e.g.
>> when
>> > you create a FlinkStateSnapshot, it will just move to status FAILED.
>> > Typically in K8s with the control loop stuff, resources are tried to get
>> > created until success. I think it would be really nice if the
>> > FlinkStateSnapshot or FlinkSavepoint resource would retry based on a
>> > property in the resource. A "FlinkStateSnapshot.retries" number would
>> > indicate how often the user wants the operator to retry creating a
>> > savepoint, "retries = -1" means retry forever. In addition, we could
>> > consider a timeout as well, however, I haven't seen such a concept in
>> K8s
>> > CRs yet.
>> > The benefit of this is that other tools relying on the K8s operator
>> > wouldn't have to implement this retry loop (which is quite natural for
>> > K8s), they would just have to wait for the CR they've created to
>> transition
>> > into COMPLETED:
>> >
>> > 3. FlinkStateSnapshotStatus.error will only show the last error. What
>> > about using Events, so that we can show multiple errors and use the
>> > FlinkStateSnapshotState to report errors?
>> >
>> > 4. I wonder if it makes sense to use something like Pod Conditions (e.g.
>> > Savepoint Conditions):
>> >
>> https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#pod-conditions
>> > to track the completion status. We could have the following conditions:
>> > - Triggered
>> > - Completed
>> > - Failed
>> > The only benefit of this proposal that I see is that it would tell the
>> > user how long it took to create the savepoint.
>> >
>> > 5. You mention that "JobSpec.initialSavepointPath" will be deprecated. I
>> > assume we will introduce a new field for referencing a
>> FlinkStateSnapshot
>> > CR? I think it would be good to cover this in the FLIP.
>> >
>> > Does that mean one would have to create a FlinkStateSnapshot CR when
>> starting a new deployment from savepoint? If so, that's rather
>> complicated.
>> I would prefer something more simple/concise and would rather
>> keep initialSavepointPath
>>
>>
>> >
>> > One minor comment:
>> >
>> > "/** Dispose the savepoints upon CRD deletion. */"
>> >
>> > I think this should be "upon CR deletion", not "CRD deletion".
>> >
>> > Thanks again for this great FLIP!
>> >
>> > Best,
>> > Robert
>> >
>> >
>> > On Fri, Apr 19, 2024 at 9:01 AM Gyula Fóra <[email protected]>
>> wrote:
>> >
>> >> Cc'ing some folks who gave positive feedback on this idea in the past.
>> >>
>> >> I would love to hear your thoughts on the proposed design
>> >>
>> >> Gyula
>> >>
>> >> On Tue, Apr 16, 2024 at 6:31 PM Őrhidi Mátyás <[email protected]
>> >
>> >> wrote:
>> >>
>> >>> +1 Looking forward to it
>> >>>
>> >>> On Tue, Apr 16, 2024 at 8:56 AM Mate Czagany <[email protected]>
>> wrote:
>> >>>
>> >>> > Thank you Gyula!
>> >>> >
>> >>> > I think that is a great idea. I have updated the Google doc to only
>> >>> have 1
>> >>> > new configuration option of boolean type, which can be used to
>> signal
>> >>> the
>> >>> > Operator to use the old mode.
>> >>> >
>> >>> > Also described in the configuration description, the Operator will
>> >>> fallback
>> >>> > to the old mode if the FlinkStateSnapshot CRD cannot be found on the
>> >>> > Kubernetes cluster.
>> >>> >
>> >>> > Regards,
>> >>> > Mate
>> >>> >
>> >>> > Gyula Fóra <[email protected]> ezt írta (időpont: 2024. ápr.
>> 16.,
>> >>> K,
>> >>> > 17:01):
>> >>> >
>> >>> > > Thanks Mate, this is great stuff.
>> >>> > >
>> >>> > > Mate, I think the new configs should probably default to the new
>> >>> mode and
>> >>> > > they should only be useful for users to fall back to the old
>> >>> behaviour.
>> >>> > > We could by default use the new Snapshot CRD if the CRD is
>> installed,
>> >>> > > otherwise use the old mode by default and log a warning on
>> startup.
>> >>> > >
>> >>> > > So I am suggesting a "dynamic" default behaviour based on whether
>> >>> the new
>> >>> > > CRD was installed or not because we don't want to break operator
>> >>> startup.
>> >>> > >
>> >>> > > Gyula
>> >>> > >
>> >>> > > On Tue, Apr 16, 2024 at 4:48 PM Mate Czagany <[email protected]>
>> >>> wrote:
>> >>> > >
>> >>> > > > Hi Ferenc,
>> >>> > > >
>> >>> > > > Thank you for your comments, I have updated the Google docs
>> with a
>> >>> new
>> >>> > > > section for the new configs.
>> >>> > > > All of the newly added config keys will have defaults set, and
>> by
>> >>> > default
>> >>> > > > all the savepoint/checkpoint operations will use the old system:
>> >>> write
>> >>> > > > their results to the FlinkDeployment/FlinkSessionJob status
>> field.
>> >>> > > >
>> >>> > > > I have also added a default for the checkpoint type to be FULL
>> >>> (which
>> >>> > is
>> >>> > > > also the default currently). That was an oversight on my part to
>> >>> miss
>> >>> > > that.
>> >>> > > >
>> >>> > > > Regards,
>> >>> > > > Mate
>> >>> > > >
>> >>> > > > Ferenc Csaky <[email protected]> ezt írta (időpont:
>> 2024.
>> >>> > ápr.
>> >>> > > > 16., K, 16:10):
>> >>> > > >
>> >>> > > > > Thank you Mate for initiating this discussion. +1 for this
>> idea.
>> >>> > > > > Some Qs:
>> >>> > > > >
>> >>> > > > > Can you specify the newly introduced configurations in more
>> >>> > > > > details? Currently, it is not fully clear to me what are the
>> >>> > > > > possible values of
>> `kubernetes.operator.periodic.savepoint.mode`,
>> >>> > > > > is it optional, has a default value?
>> >>> > > > >
>> >>> > > > > I see that in `SavepointSpec.formatType` has a default,
>> although
>> >>> > > > > `CheckppointSpec.checkpointType` not. Are we inferring that
>> from
>> >>> > > > > the config? My point is, in general I think it would be good
>> to
>> >>> > > > > handle the two snapshot types in a similar way when it makes
>> >>> sense
>> >>> > > > > to minimize any kind of confusion.
>> >>> > > > >
>> >>> > > > > Best,
>> >>> > > > > Ferenc
>> >>> > > > >
>> >>> > > > >
>> >>> > > > >
>> >>> > > > > On Tuesday, April 16th, 2024 at 11:34, Mate Czagany <
>> >>> > > [email protected]>
>> >>> > > > > wrote:
>> >>> > > > >
>> >>> > > > > >
>> >>> > > > > >
>> >>> > > > > > Hi Everyone,
>> >>> > > > > >
>> >>> > > > > > I would like to start a discussion on FLIP-446: Kubernetes
>> >>> Operator
>> >>> > > > State
>> >>> > > > > > Snapshot CRD.
>> >>> > > > > >
>> >>> > > > > > This FLIP adds a new custom resource for Operator users to
>> >>> create
>> >>> > and
>> >>> > > > > > manage their savepoints and checkpoints. I have also
>> developed
>> >>> an
>> >>> > > > initial
>> >>> > > > > > POC to prove that this approach is feasible, you can find
>> the
>> >>> link
>> >>> > > for
>> >>> > > > > that
>> >>> > > > > > in the FLIP.
>> >>> > > > > >
>> >>> > > > > > There is a Confluence page [1] and a Google Docs page [2]
>> as I
>> >>> do
>> >>> > not
>> >>> > > > > have
>> >>> > > > > > a Confluence account yet.
>> >>> > > > > >
>> >>> > > > > > [1]
>> >>> > > > > >
>> >>> > > > >
>> >>> > > >
>> >>> > >
>> >>> >
>> >>>
>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-446%3A+Kubernetes+Operator+State+Snapshot+CRD
>> >>> > > > > > [2]
>> >>> > > > > >
>> >>> > > > >
>> >>> > > >
>> >>> > >
>> >>> >
>> >>>
>> https://docs.google.com/document/d/1VdfLFaE4i6ESbCQ38CH7TKOiPQVvXeOxNV2FeSMnOTg
>> >>> > > > > >
>> >>> > > > > >
>> >>> > > > > > Regards,
>> >>> > > > > > Mate
>> >>> > > > >
>> >>> > > >
>> >>> > >
>> >>> >
>> >>>
>> >>
>>
>

Re: [DISCUSS] FLIP-446: Kubernetes Operator State Snapshot CRD

Reply via email to