Re: [DISCUSS] FLIP-446: Kubernetes Operator State Snapshot CRD

2024-04-23 Thread Robert Metzger
Thank you all for the great discussion.

> 2. I agree with you that "completed" is not very clear, but I would
suggest the name "alreadyExists". WDYT?

I'm fine with "alreadyExists" and for 3. also with "backoffLimit".


>  really like the idea of having something like Pod Conditions, but I
think it wouldn't add too much value here

Okay, let's leave it out.

I'm happy to start a VOTE and +1 the FLIP as is.

Looking forward to having this feature in the operator!


On Sat, Apr 20, 2024 at 4:31 PM Mate Czagany  wrote:

> Hi,
>
> Thanks for your comments, Gyula, I really appreciate it!
>
> I have updated the following things in the FLIP, please comment on these
> changes if you have any suggestions or concerns:
> - Added path field to FlinkStateSnapshotReference
> - Added two examples at the bottom.
> - Added error handling section and the new fields associated
> ("backoffLimit" and "failures") to the interfaces.
> - Renamed field "completed" to "alreadyExists".
>
> Regarding the separate resources, I don't think that any of the two
> solutions would bring too much (dis)advantage to the table, so I am still
> neutral, and waiting for others to chime in as well with their thoughts and
> feedback!
>
> Regards,
> Mate
>
>
> Gyula Fóra  ezt írta (időpont: 2024. ápr. 19., P,
> 21:43):
>
> > Hey!
> >
> > Regarding the question of initialSavepointPath and
> > flinkStateSnapshotReference new object, I think we could simply keep an
> > extra field as part of the flinkStateSnapshotReference object called
> path.
> >
> > Then the fields could be:
> > namespace, name, path
> >
> > If path is defined we would use that (to support the simple way also)
> > otherwise use the resource. I would still deprecate the
> > initialSavepointPath field in the jobSpec.
> >
> > Regarding the Savepoint/Checkpoint vs FlinkStateSnapshot.
> > What we need:
> >  1. Easy way to list all state snapshots (to select latest)
> >  2. Easy way to reference a savepoint/checkpoint from a jobspec
> >  3. Differentiate state snapshot types (in some cases users may prefer to
> > use checkpoint/savepoint for certain upgrades) -> we should add a label
> or
> > something for easy selection
> >  4. Be able to delete savepoints (and checkpoints maybe)
> >
> > I am personally slightly more in favor of having a single resource as
> that
> > ticks all the boxes, while having 2 separate resources will make both
> > listing and referencing harder. We would have to introduce state type in
> > the reference (name + namespace would not be enough to uniquely identify
> a
> > state snapshot)
> >
> > I wonder if I am missing any good argument against the single
> > FlinkStateSnapshot here.
> >
> > Cheers,
> > Gyula
> >
> >
> > On Fri, Apr 19, 2024 at 9:09 PM Mate Czagany  wrote:
> >
> >> Hi Robert and Thomas,
> >>
> >> Thank you for sharing your thoughts, I will try to address your
> questions
> >> and suggestions:
> >>
> >> 1. I would really love to hear others' inputs as well about separating
> >> the snapshot CRD into two different CRDs instead for savepoints and
> >> checkpoints. I think the main upside is that we would not need the
> >> mandatory savepoint or checkpoint field inside the spec. The two CRs
> could
> >> share the same status fields, and their specs would be different.
> >> I personally like both solutions, and would love to hear others'
> thoughts
> >> as well.
> >>
> >> 2. I agree with you that "completed" is not very clear, but I would
> >> suggest the name "alreadyExists". WDYT?
> >>
> >> 3. I think having a retry loop inside the operator does not add too much
> >> complexity to the FLIP. On failure, we check if we have reached the max
> >> retries. If we did, the state will be set to "FAILED", else it will be
> set
> >> to "TRIGGER_PENDING", causing the operator to retry the task. The
> "error"
> >> field will always be populated with the latest error. Kubernetes Jobs
> >> already has a similar field called "backoffLimit", maybe we could use
> the
> >> same name, with the same logic applied, WDYT?
> >> About error events, I think we should keep the "error" field, and upon
> >> successful snapshot, we clear it. I will add to the FLIP that there
> will be
> >> an event generated for each unsuccessful snapshots.
> >>
> >> 4. I really like the idea of having something like Pod Conditions, but I
> >> think it wouldn't add too much value here, because the only 2 stages
> >> important to the user are "Triggered" and "Completed", and those
> timestamps
> >> will already be included in the status field. I think it would make more
> >> sense to implement this if there were more lifecycle stages.
> >>
> >> 5. There will be a new field in JobSpec called
> >> "flinkStateSnapshotReference" to reference a FlinkStateSnapshot to
> restore
> >> from.
> >>
> >> > How do you see potential effects on API server performance wrt. number
> >> of
> >> objects vs mutations? Is the proposal more or less neutral in that
> regard?
> >>
> >> While I am not an expert in Kuber

Re: [DISCUSS] FLIP-446: Kubernetes Operator State Snapshot CRD

2024-04-20 Thread Mate Czagany
Hi,

Thanks for your comments, Gyula, I really appreciate it!

I have updated the following things in the FLIP, please comment on these
changes if you have any suggestions or concerns:
- Added path field to FlinkStateSnapshotReference
- Added two examples at the bottom.
- Added error handling section and the new fields associated
("backoffLimit" and "failures") to the interfaces.
- Renamed field "completed" to "alreadyExists".

Regarding the separate resources, I don't think that any of the two
solutions would bring too much (dis)advantage to the table, so I am still
neutral, and waiting for others to chime in as well with their thoughts and
feedback!

Regards,
Mate


Gyula Fóra  ezt írta (időpont: 2024. ápr. 19., P,
21:43):

> Hey!
>
> Regarding the question of initialSavepointPath and
> flinkStateSnapshotReference new object, I think we could simply keep an
> extra field as part of the flinkStateSnapshotReference object called path.
>
> Then the fields could be:
> namespace, name, path
>
> If path is defined we would use that (to support the simple way also)
> otherwise use the resource. I would still deprecate the
> initialSavepointPath field in the jobSpec.
>
> Regarding the Savepoint/Checkpoint vs FlinkStateSnapshot.
> What we need:
>  1. Easy way to list all state snapshots (to select latest)
>  2. Easy way to reference a savepoint/checkpoint from a jobspec
>  3. Differentiate state snapshot types (in some cases users may prefer to
> use checkpoint/savepoint for certain upgrades) -> we should add a label or
> something for easy selection
>  4. Be able to delete savepoints (and checkpoints maybe)
>
> I am personally slightly more in favor of having a single resource as that
> ticks all the boxes, while having 2 separate resources will make both
> listing and referencing harder. We would have to introduce state type in
> the reference (name + namespace would not be enough to uniquely identify a
> state snapshot)
>
> I wonder if I am missing any good argument against the single
> FlinkStateSnapshot here.
>
> Cheers,
> Gyula
>
>
> On Fri, Apr 19, 2024 at 9:09 PM Mate Czagany  wrote:
>
>> Hi Robert and Thomas,
>>
>> Thank you for sharing your thoughts, I will try to address your questions
>> and suggestions:
>>
>> 1. I would really love to hear others' inputs as well about separating
>> the snapshot CRD into two different CRDs instead for savepoints and
>> checkpoints. I think the main upside is that we would not need the
>> mandatory savepoint or checkpoint field inside the spec. The two CRs could
>> share the same status fields, and their specs would be different.
>> I personally like both solutions, and would love to hear others' thoughts
>> as well.
>>
>> 2. I agree with you that "completed" is not very clear, but I would
>> suggest the name "alreadyExists". WDYT?
>>
>> 3. I think having a retry loop inside the operator does not add too much
>> complexity to the FLIP. On failure, we check if we have reached the max
>> retries. If we did, the state will be set to "FAILED", else it will be set
>> to "TRIGGER_PENDING", causing the operator to retry the task. The "error"
>> field will always be populated with the latest error. Kubernetes Jobs
>> already has a similar field called "backoffLimit", maybe we could use the
>> same name, with the same logic applied, WDYT?
>> About error events, I think we should keep the "error" field, and upon
>> successful snapshot, we clear it. I will add to the FLIP that there will be
>> an event generated for each unsuccessful snapshots.
>>
>> 4. I really like the idea of having something like Pod Conditions, but I
>> think it wouldn't add too much value here, because the only 2 stages
>> important to the user are "Triggered" and "Completed", and those timestamps
>> will already be included in the status field. I think it would make more
>> sense to implement this if there were more lifecycle stages.
>>
>> 5. There will be a new field in JobSpec called
>> "flinkStateSnapshotReference" to reference a FlinkStateSnapshot to restore
>> from.
>>
>> > How do you see potential effects on API server performance wrt. number
>> of
>> objects vs mutations? Is the proposal more or less neutral in that regard?
>>
>> While I am not an expert in Kubernetes internals, my understanding is
>> that for the api-server, editing an existing resource or creating a new one
>> is not different performance-wise, because the whole resource will always
>> be written to etcd anyways.
>> Retrieving the savepoints from etcd will be different though for some
>> use-cases, e.g. retrieving all snapshots for a specific FlinkDeployment
>> would require the api-server to retrieve every snapshots first in a
>> namespace from etcd, then filter them for that specific FlinkDeployment. I
>> think this is a worst-case scenario, and it will be up to the user to
>> optimize their queries via e.g. watch queries [1] or resourceVersions [2].
>>
>> > Does that mean one would have to create a FlinkStateSnapshot CR when
>>

Re: [DISCUSS] FLIP-446: Kubernetes Operator State Snapshot CRD

2024-04-19 Thread Gyula Fóra
Hey!

Regarding the question of initialSavepointPath and
flinkStateSnapshotReference new object, I think we could simply keep an
extra field as part of the flinkStateSnapshotReference object called path.

Then the fields could be:
namespace, name, path

If path is defined we would use that (to support the simple way also)
otherwise use the resource. I would still deprecate the
initialSavepointPath field in the jobSpec.

Regarding the Savepoint/Checkpoint vs FlinkStateSnapshot.
What we need:
 1. Easy way to list all state snapshots (to select latest)
 2. Easy way to reference a savepoint/checkpoint from a jobspec
 3. Differentiate state snapshot types (in some cases users may prefer to
use checkpoint/savepoint for certain upgrades) -> we should add a label or
something for easy selection
 4. Be able to delete savepoints (and checkpoints maybe)

I am personally slightly more in favor of having a single resource as that
ticks all the boxes, while having 2 separate resources will make both
listing and referencing harder. We would have to introduce state type in
the reference (name + namespace would not be enough to uniquely identify a
state snapshot)

I wonder if I am missing any good argument against the single
FlinkStateSnapshot here.

Cheers,
Gyula


On Fri, Apr 19, 2024 at 9:09 PM Mate Czagany  wrote:

> Hi Robert and Thomas,
>
> Thank you for sharing your thoughts, I will try to address your questions
> and suggestions:
>
> 1. I would really love to hear others' inputs as well about separating the
> snapshot CRD into two different CRDs instead for savepoints and
> checkpoints. I think the main upside is that we would not need the
> mandatory savepoint or checkpoint field inside the spec. The two CRs could
> share the same status fields, and their specs would be different.
> I personally like both solutions, and would love to hear others' thoughts
> as well.
>
> 2. I agree with you that "completed" is not very clear, but I would
> suggest the name "alreadyExists". WDYT?
>
> 3. I think having a retry loop inside the operator does not add too much
> complexity to the FLIP. On failure, we check if we have reached the max
> retries. If we did, the state will be set to "FAILED", else it will be set
> to "TRIGGER_PENDING", causing the operator to retry the task. The "error"
> field will always be populated with the latest error. Kubernetes Jobs
> already has a similar field called "backoffLimit", maybe we could use the
> same name, with the same logic applied, WDYT?
> About error events, I think we should keep the "error" field, and upon
> successful snapshot, we clear it. I will add to the FLIP that there will be
> an event generated for each unsuccessful snapshots.
>
> 4. I really like the idea of having something like Pod Conditions, but I
> think it wouldn't add too much value here, because the only 2 stages
> important to the user are "Triggered" and "Completed", and those timestamps
> will already be included in the status field. I think it would make more
> sense to implement this if there were more lifecycle stages.
>
> 5. There will be a new field in JobSpec called
> "flinkStateSnapshotReference" to reference a FlinkStateSnapshot to restore
> from.
>
> > How do you see potential effects on API server performance wrt. number of
> objects vs mutations? Is the proposal more or less neutral in that regard?
>
> While I am not an expert in Kubernetes internals, my understanding is that
> for the api-server, editing an existing resource or creating a new one is
> not different performance-wise, because the whole resource will always be
> written to etcd anyways.
> Retrieving the savepoints from etcd will be different though for some
> use-cases, e.g. retrieving all snapshots for a specific FlinkDeployment
> would require the api-server to retrieve every snapshots first in a
> namespace from etcd, then filter them for that specific FlinkDeployment. I
> think this is a worst-case scenario, and it will be up to the user to
> optimize their queries via e.g. watch queries [1] or resourceVersions [2].
>
> > Does that mean one would have to create a FlinkStateSnapshot CR when
> starting a new deployment from savepoint? If so, that's rather complicated.
> I would prefer something more simple/concise and would rather
> keep initialSavepointPath
>
> Starting a job from a savepoint path will indeed be deprecated with this
> FLIP. I agree that it will be more complicated to restore from a savepoint
> in those cases, but if the user decides to move away from the deprecated
> savepoint mechanisms, every savepoint will result in a new
> FlinkStateSnapshot CR. So the only situation I expect this to be an
> inconvenience is when the user onboards a new Flink job to the operator.
> But I may not be thinking this through, so please let me know if you
> disagree.
>
> Thank you very much for your questions and suggestions!
>
> [1]
> https://kubernetes.io/docs/reference/using-api/api-concepts/#efficient-detection-of-changes
> [2]
>

Re: [DISCUSS] FLIP-446: Kubernetes Operator State Snapshot CRD

2024-04-19 Thread Mate Czagany
Hi Robert and Thomas,

Thank you for sharing your thoughts, I will try to address your questions
and suggestions:

1. I would really love to hear others' inputs as well about separating the
snapshot CRD into two different CRDs instead for savepoints and
checkpoints. I think the main upside is that we would not need the
mandatory savepoint or checkpoint field inside the spec. The two CRs could
share the same status fields, and their specs would be different.
I personally like both solutions, and would love to hear others' thoughts
as well.

2. I agree with you that "completed" is not very clear, but I would suggest
the name "alreadyExists". WDYT?

3. I think having a retry loop inside the operator does not add too much
complexity to the FLIP. On failure, we check if we have reached the max
retries. If we did, the state will be set to "FAILED", else it will be set
to "TRIGGER_PENDING", causing the operator to retry the task. The "error"
field will always be populated with the latest error. Kubernetes Jobs
already has a similar field called "backoffLimit", maybe we could use the
same name, with the same logic applied, WDYT?
About error events, I think we should keep the "error" field, and upon
successful snapshot, we clear it. I will add to the FLIP that there will be
an event generated for each unsuccessful snapshots.

4. I really like the idea of having something like Pod Conditions, but I
think it wouldn't add too much value here, because the only 2 stages
important to the user are "Triggered" and "Completed", and those timestamps
will already be included in the status field. I think it would make more
sense to implement this if there were more lifecycle stages.

5. There will be a new field in JobSpec called
"flinkStateSnapshotReference" to reference a FlinkStateSnapshot to restore
from.

> How do you see potential effects on API server performance wrt. number of
objects vs mutations? Is the proposal more or less neutral in that regard?

While I am not an expert in Kubernetes internals, my understanding is that
for the api-server, editing an existing resource or creating a new one is
not different performance-wise, because the whole resource will always be
written to etcd anyways.
Retrieving the savepoints from etcd will be different though for some
use-cases, e.g. retrieving all snapshots for a specific FlinkDeployment
would require the api-server to retrieve every snapshots first in a
namespace from etcd, then filter them for that specific FlinkDeployment. I
think this is a worst-case scenario, and it will be up to the user to
optimize their queries via e.g. watch queries [1] or resourceVersions [2].

> Does that mean one would have to create a FlinkStateSnapshot CR when
starting a new deployment from savepoint? If so, that's rather complicated.
I would prefer something more simple/concise and would rather
keep initialSavepointPath

Starting a job from a savepoint path will indeed be deprecated with this
FLIP. I agree that it will be more complicated to restore from a savepoint
in those cases, but if the user decides to move away from the deprecated
savepoint mechanisms, every savepoint will result in a new
FlinkStateSnapshot CR. So the only situation I expect this to be an
inconvenience is when the user onboards a new Flink job to the operator.
But I may not be thinking this through, so please let me know if you
disagree.

Thank you very much for your questions and suggestions!

[1]
https://kubernetes.io/docs/reference/using-api/api-concepts/#efficient-detection-of-changes
[2]
https://kubernetes.io/docs/reference/using-api/api-concepts/#resource-versions

Regards,
Mate

Thomas Weise  ezt írta (időpont: 2024. ápr. 19., P, 11:31):

> Thanks for the proposal.
>
> How do you see potential effects on API server performance wrt. number of
> objects vs mutations? Is the proposal more or less neutral in that regard?
>
> Thanks for the thorough feedback Robert.
>
> Couple more questions below.
>
> -->
>
> On Fri, Apr 19, 2024 at 5:07 AM Robert Metzger 
> wrote:
>
> > Hi Mate,
> > thanks for proposing this, I'm really excited about your FLIP. I hope my
> > questions make sense to you:
> >
> > 1. I would like to discuss the "FlinkStateSnapshot" name and the fact
> that
> > users have to use either the savepoint or checkpoint spec inside the
> > FlinkStateSnapshot.
> > Wouldn't it be more intuitive to introduce two CRs:
> > FlinkSavepoint and FlinkCheckpoint
> > Ideally they can internally share a lot of code paths, but from a users
> > perspective, the abstraction is much clearer.
> >
>
> There are probably pros and cons either way. For example it is desirable to
> have a single list of state snapshots when looking for the initial
> savepoint for a new deployment etc.
>
>
> >
> > 2. I also would like to discuss SavepointSpec.completed, as this name is
> > not intuitive to me. How about "ignoreExisting"?
> >
> > 3. The FLIP proposal seems to leave error handling to the user, e.g. when
> > you create a FlinkStateSn

Re: [DISCUSS] FLIP-446: Kubernetes Operator State Snapshot CRD

2024-04-19 Thread Thomas Weise
Thanks for the proposal.

How do you see potential effects on API server performance wrt. number of
objects vs mutations? Is the proposal more or less neutral in that regard?

Thanks for the thorough feedback Robert.

Couple more questions below.

-->

On Fri, Apr 19, 2024 at 5:07 AM Robert Metzger  wrote:

> Hi Mate,
> thanks for proposing this, I'm really excited about your FLIP. I hope my
> questions make sense to you:
>
> 1. I would like to discuss the "FlinkStateSnapshot" name and the fact that
> users have to use either the savepoint or checkpoint spec inside the
> FlinkStateSnapshot.
> Wouldn't it be more intuitive to introduce two CRs:
> FlinkSavepoint and FlinkCheckpoint
> Ideally they can internally share a lot of code paths, but from a users
> perspective, the abstraction is much clearer.
>

There are probably pros and cons either way. For example it is desirable to
have a single list of state snapshots when looking for the initial
savepoint for a new deployment etc.


>
> 2. I also would like to discuss SavepointSpec.completed, as this name is
> not intuitive to me. How about "ignoreExisting"?
>
> 3. The FLIP proposal seems to leave error handling to the user, e.g. when
> you create a FlinkStateSnapshot, it will just move to status FAILED.
> Typically in K8s with the control loop stuff, resources are tried to get
> created until success. I think it would be really nice if the
> FlinkStateSnapshot or FlinkSavepoint resource would retry based on a
> property in the resource. A "FlinkStateSnapshot.retries" number would
> indicate how often the user wants the operator to retry creating a
> savepoint, "retries = -1" means retry forever. In addition, we could
> consider a timeout as well, however, I haven't seen such a concept in K8s
> CRs yet.
> The benefit of this is that other tools relying on the K8s operator
> wouldn't have to implement this retry loop (which is quite natural for
> K8s), they would just have to wait for the CR they've created to transition
> into COMPLETED:
>
> 3. FlinkStateSnapshotStatus.error will only show the last error. What
> about using Events, so that we can show multiple errors and use the
> FlinkStateSnapshotState to report errors?
>
> 4. I wonder if it makes sense to use something like Pod Conditions (e.g.
> Savepoint Conditions):
> https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#pod-conditions
> to track the completion status. We could have the following conditions:
> - Triggered
> - Completed
> - Failed
> The only benefit of this proposal that I see is that it would tell the
> user how long it took to create the savepoint.
>
> 5. You mention that "JobSpec.initialSavepointPath" will be deprecated. I
> assume we will introduce a new field for referencing a FlinkStateSnapshot
> CR? I think it would be good to cover this in the FLIP.
>
> Does that mean one would have to create a FlinkStateSnapshot CR when
starting a new deployment from savepoint? If so, that's rather complicated.
I would prefer something more simple/concise and would rather
keep initialSavepointPath


>
> One minor comment:
>
> "/** Dispose the savepoints upon CRD deletion. */"
>
> I think this should be "upon CR deletion", not "CRD deletion".
>
> Thanks again for this great FLIP!
>
> Best,
> Robert
>
>
> On Fri, Apr 19, 2024 at 9:01 AM Gyula Fóra  wrote:
>
>> Cc'ing some folks who gave positive feedback on this idea in the past.
>>
>> I would love to hear your thoughts on the proposed design
>>
>> Gyula
>>
>> On Tue, Apr 16, 2024 at 6:31 PM Őrhidi Mátyás 
>> wrote:
>>
>>> +1 Looking forward to it
>>>
>>> On Tue, Apr 16, 2024 at 8:56 AM Mate Czagany  wrote:
>>>
>>> > Thank you Gyula!
>>> >
>>> > I think that is a great idea. I have updated the Google doc to only
>>> have 1
>>> > new configuration option of boolean type, which can be used to signal
>>> the
>>> > Operator to use the old mode.
>>> >
>>> > Also described in the configuration description, the Operator will
>>> fallback
>>> > to the old mode if the FlinkStateSnapshot CRD cannot be found on the
>>> > Kubernetes cluster.
>>> >
>>> > Regards,
>>> > Mate
>>> >
>>> > Gyula Fóra  ezt írta (időpont: 2024. ápr. 16.,
>>> K,
>>> > 17:01):
>>> >
>>> > > Thanks Mate, this is great stuff.
>>> > >
>>> > > Mate, I think the new configs should probably default to the new
>>> mode and
>>> > > they should only be useful for users to fall back to the old
>>> behaviour.
>>> > > We could by default use the new Snapshot CRD if the CRD is installed,
>>> > > otherwise use the old mode by default and log a warning on startup.
>>> > >
>>> > > So I am suggesting a "dynamic" default behaviour based on whether
>>> the new
>>> > > CRD was installed or not because we don't want to break operator
>>> startup.
>>> > >
>>> > > Gyula
>>> > >
>>> > > On Tue, Apr 16, 2024 at 4:48 PM Mate Czagany 
>>> wrote:
>>> > >
>>> > > > Hi Ferenc,
>>> > > >
>>> > > > Thank you for your comments, I have updated the Google docs with a
>>> new
>>> > > > section for the new co

Re: [DISCUSS] FLIP-446: Kubernetes Operator State Snapshot CRD

2024-04-19 Thread Robert Metzger
Hi Mate,
thanks for proposing this, I'm really excited about your FLIP. I hope my
questions make sense to you:

1. I would like to discuss the "FlinkStateSnapshot" name and the fact that
users have to use either the savepoint or checkpoint spec inside the
FlinkStateSnapshot.
Wouldn't it be more intuitive to introduce two CRs:
FlinkSavepoint and FlinkCheckpoint
Ideally they can internally share a lot of code paths, but from a users
perspective, the abstraction is much clearer.

2. I also would like to discuss SavepointSpec.completed, as this name is
not intuitive to me. How about "ignoreExisting"?

3. The FLIP proposal seems to leave error handling to the user, e.g. when
you create a FlinkStateSnapshot, it will just move to status FAILED.
Typically in K8s with the control loop stuff, resources are tried to get
created until success. I think it would be really nice if the
FlinkStateSnapshot or FlinkSavepoint resource would retry based on a
property in the resource. A "FlinkStateSnapshot.retries" number would
indicate how often the user wants the operator to retry creating a
savepoint, "retries = -1" means retry forever. In addition, we could
consider a timeout as well, however, I haven't seen such a concept in K8s
CRs yet.
The benefit of this is that other tools relying on the K8s operator
wouldn't have to implement this retry loop (which is quite natural for
K8s), they would just have to wait for the CR they've created to transition
into COMPLETED:

3. FlinkStateSnapshotStatus.error will only show the last error. What about
using Events, so that we can show multiple errors and use the
FlinkStateSnapshotState to report errors?

4. I wonder if it makes sense to use something like Pod Conditions (e.g.
Savepoint Conditions):
https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#pod-conditions
to track the completion status. We could have the following conditions:
- Triggered
- Completed
- Failed
The only benefit of this proposal that I see is that it would tell the user
how long it took to create the savepoint.

5. You mention that "JobSpec.initialSavepointPath" will be deprecated. I
assume we will introduce a new field for referencing a FlinkStateSnapshot
CR? I think it would be good to cover this in the FLIP.


One minor comment:

"/** Dispose the savepoints upon CRD deletion. */"

I think this should be "upon CR deletion", not "CRD deletion".

Thanks again for this great FLIP!

Best,
Robert


On Fri, Apr 19, 2024 at 9:01 AM Gyula Fóra  wrote:

> Cc'ing some folks who gave positive feedback on this idea in the past.
>
> I would love to hear your thoughts on the proposed design
>
> Gyula
>
> On Tue, Apr 16, 2024 at 6:31 PM Őrhidi Mátyás 
> wrote:
>
>> +1 Looking forward to it
>>
>> On Tue, Apr 16, 2024 at 8:56 AM Mate Czagany  wrote:
>>
>> > Thank you Gyula!
>> >
>> > I think that is a great idea. I have updated the Google doc to only
>> have 1
>> > new configuration option of boolean type, which can be used to signal
>> the
>> > Operator to use the old mode.
>> >
>> > Also described in the configuration description, the Operator will
>> fallback
>> > to the old mode if the FlinkStateSnapshot CRD cannot be found on the
>> > Kubernetes cluster.
>> >
>> > Regards,
>> > Mate
>> >
>> > Gyula Fóra  ezt írta (időpont: 2024. ápr. 16., K,
>> > 17:01):
>> >
>> > > Thanks Mate, this is great stuff.
>> > >
>> > > Mate, I think the new configs should probably default to the new mode
>> and
>> > > they should only be useful for users to fall back to the old
>> behaviour.
>> > > We could by default use the new Snapshot CRD if the CRD is installed,
>> > > otherwise use the old mode by default and log a warning on startup.
>> > >
>> > > So I am suggesting a "dynamic" default behaviour based on whether the
>> new
>> > > CRD was installed or not because we don't want to break operator
>> startup.
>> > >
>> > > Gyula
>> > >
>> > > On Tue, Apr 16, 2024 at 4:48 PM Mate Czagany 
>> wrote:
>> > >
>> > > > Hi Ferenc,
>> > > >
>> > > > Thank you for your comments, I have updated the Google docs with a
>> new
>> > > > section for the new configs.
>> > > > All of the newly added config keys will have defaults set, and by
>> > default
>> > > > all the savepoint/checkpoint operations will use the old system:
>> write
>> > > > their results to the FlinkDeployment/FlinkSessionJob status field.
>> > > >
>> > > > I have also added a default for the checkpoint type to be FULL
>> (which
>> > is
>> > > > also the default currently). That was an oversight on my part to
>> miss
>> > > that.
>> > > >
>> > > > Regards,
>> > > > Mate
>> > > >
>> > > > Ferenc Csaky  ezt írta (időpont: 2024.
>> > ápr.
>> > > > 16., K, 16:10):
>> > > >
>> > > > > Thank you Mate for initiating this discussion. +1 for this idea.
>> > > > > Some Qs:
>> > > > >
>> > > > > Can you specify the newly introduced configurations in more
>> > > > > details? Currently, it is not fully clear to me what are the
>> > > > > possible values of `kubernetes.operator.periodic.s

Re: [DISCUSS] FLIP-446: Kubernetes Operator State Snapshot CRD

2024-04-19 Thread Gyula Fóra
Cc'ing some folks who gave positive feedback on this idea in the past.

I would love to hear your thoughts on the proposed design

Gyula

On Tue, Apr 16, 2024 at 6:31 PM Őrhidi Mátyás 
wrote:

> +1 Looking forward to it
>
> On Tue, Apr 16, 2024 at 8:56 AM Mate Czagany  wrote:
>
> > Thank you Gyula!
> >
> > I think that is a great idea. I have updated the Google doc to only have
> 1
> > new configuration option of boolean type, which can be used to signal the
> > Operator to use the old mode.
> >
> > Also described in the configuration description, the Operator will
> fallback
> > to the old mode if the FlinkStateSnapshot CRD cannot be found on the
> > Kubernetes cluster.
> >
> > Regards,
> > Mate
> >
> > Gyula Fóra  ezt írta (időpont: 2024. ápr. 16., K,
> > 17:01):
> >
> > > Thanks Mate, this is great stuff.
> > >
> > > Mate, I think the new configs should probably default to the new mode
> and
> > > they should only be useful for users to fall back to the old behaviour.
> > > We could by default use the new Snapshot CRD if the CRD is installed,
> > > otherwise use the old mode by default and log a warning on startup.
> > >
> > > So I am suggesting a "dynamic" default behaviour based on whether the
> new
> > > CRD was installed or not because we don't want to break operator
> startup.
> > >
> > > Gyula
> > >
> > > On Tue, Apr 16, 2024 at 4:48 PM Mate Czagany 
> wrote:
> > >
> > > > Hi Ferenc,
> > > >
> > > > Thank you for your comments, I have updated the Google docs with a
> new
> > > > section for the new configs.
> > > > All of the newly added config keys will have defaults set, and by
> > default
> > > > all the savepoint/checkpoint operations will use the old system:
> write
> > > > their results to the FlinkDeployment/FlinkSessionJob status field.
> > > >
> > > > I have also added a default for the checkpoint type to be FULL (which
> > is
> > > > also the default currently). That was an oversight on my part to miss
> > > that.
> > > >
> > > > Regards,
> > > > Mate
> > > >
> > > > Ferenc Csaky  ezt írta (időpont: 2024.
> > ápr.
> > > > 16., K, 16:10):
> > > >
> > > > > Thank you Mate for initiating this discussion. +1 for this idea.
> > > > > Some Qs:
> > > > >
> > > > > Can you specify the newly introduced configurations in more
> > > > > details? Currently, it is not fully clear to me what are the
> > > > > possible values of `kubernetes.operator.periodic.savepoint.mode`,
> > > > > is it optional, has a default value?
> > > > >
> > > > > I see that in `SavepointSpec.formatType` has a default, although
> > > > > `CheckppointSpec.checkpointType` not. Are we inferring that from
> > > > > the config? My point is, in general I think it would be good to
> > > > > handle the two snapshot types in a similar way when it makes sense
> > > > > to minimize any kind of confusion.
> > > > >
> > > > > Best,
> > > > > Ferenc
> > > > >
> > > > >
> > > > >
> > > > > On Tuesday, April 16th, 2024 at 11:34, Mate Czagany <
> > > czmat...@gmail.com>
> > > > > wrote:
> > > > >
> > > > > >
> > > > > >
> > > > > > Hi Everyone,
> > > > > >
> > > > > > I would like to start a discussion on FLIP-446: Kubernetes
> Operator
> > > > State
> > > > > > Snapshot CRD.
> > > > > >
> > > > > > This FLIP adds a new custom resource for Operator users to create
> > and
> > > > > > manage their savepoints and checkpoints. I have also developed an
> > > > initial
> > > > > > POC to prove that this approach is feasible, you can find the
> link
> > > for
> > > > > that
> > > > > > in the FLIP.
> > > > > >
> > > > > > There is a Confluence page [1] and a Google Docs page [2] as I do
> > not
> > > > > have
> > > > > > a Confluence account yet.
> > > > > >
> > > > > > [1]
> > > > > >
> > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-446%3A+Kubernetes+Operator+State+Snapshot+CRD
> > > > > > [2]
> > > > > >
> > > > >
> > > >
> > >
> >
> https://docs.google.com/document/d/1VdfLFaE4i6ESbCQ38CH7TKOiPQVvXeOxNV2FeSMnOTg
> > > > > >
> > > > > >
> > > > > > Regards,
> > > > > > Mate
> > > > >
> > > >
> > >
> >
>


Re: [DISCUSS] FLIP-446: Kubernetes Operator State Snapshot CRD

2024-04-16 Thread Őrhidi Mátyás
+1 Looking forward to it

On Tue, Apr 16, 2024 at 8:56 AM Mate Czagany  wrote:

> Thank you Gyula!
>
> I think that is a great idea. I have updated the Google doc to only have 1
> new configuration option of boolean type, which can be used to signal the
> Operator to use the old mode.
>
> Also described in the configuration description, the Operator will fallback
> to the old mode if the FlinkStateSnapshot CRD cannot be found on the
> Kubernetes cluster.
>
> Regards,
> Mate
>
> Gyula Fóra  ezt írta (időpont: 2024. ápr. 16., K,
> 17:01):
>
> > Thanks Mate, this is great stuff.
> >
> > Mate, I think the new configs should probably default to the new mode and
> > they should only be useful for users to fall back to the old behaviour.
> > We could by default use the new Snapshot CRD if the CRD is installed,
> > otherwise use the old mode by default and log a warning on startup.
> >
> > So I am suggesting a "dynamic" default behaviour based on whether the new
> > CRD was installed or not because we don't want to break operator startup.
> >
> > Gyula
> >
> > On Tue, Apr 16, 2024 at 4:48 PM Mate Czagany  wrote:
> >
> > > Hi Ferenc,
> > >
> > > Thank you for your comments, I have updated the Google docs with a new
> > > section for the new configs.
> > > All of the newly added config keys will have defaults set, and by
> default
> > > all the savepoint/checkpoint operations will use the old system: write
> > > their results to the FlinkDeployment/FlinkSessionJob status field.
> > >
> > > I have also added a default for the checkpoint type to be FULL (which
> is
> > > also the default currently). That was an oversight on my part to miss
> > that.
> > >
> > > Regards,
> > > Mate
> > >
> > > Ferenc Csaky  ezt írta (időpont: 2024.
> ápr.
> > > 16., K, 16:10):
> > >
> > > > Thank you Mate for initiating this discussion. +1 for this idea.
> > > > Some Qs:
> > > >
> > > > Can you specify the newly introduced configurations in more
> > > > details? Currently, it is not fully clear to me what are the
> > > > possible values of `kubernetes.operator.periodic.savepoint.mode`,
> > > > is it optional, has a default value?
> > > >
> > > > I see that in `SavepointSpec.formatType` has a default, although
> > > > `CheckppointSpec.checkpointType` not. Are we inferring that from
> > > > the config? My point is, in general I think it would be good to
> > > > handle the two snapshot types in a similar way when it makes sense
> > > > to minimize any kind of confusion.
> > > >
> > > > Best,
> > > > Ferenc
> > > >
> > > >
> > > >
> > > > On Tuesday, April 16th, 2024 at 11:34, Mate Czagany <
> > czmat...@gmail.com>
> > > > wrote:
> > > >
> > > > >
> > > > >
> > > > > Hi Everyone,
> > > > >
> > > > > I would like to start a discussion on FLIP-446: Kubernetes Operator
> > > State
> > > > > Snapshot CRD.
> > > > >
> > > > > This FLIP adds a new custom resource for Operator users to create
> and
> > > > > manage their savepoints and checkpoints. I have also developed an
> > > initial
> > > > > POC to prove that this approach is feasible, you can find the link
> > for
> > > > that
> > > > > in the FLIP.
> > > > >
> > > > > There is a Confluence page [1] and a Google Docs page [2] as I do
> not
> > > > have
> > > > > a Confluence account yet.
> > > > >
> > > > > [1]
> > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-446%3A+Kubernetes+Operator+State+Snapshot+CRD
> > > > > [2]
> > > > >
> > > >
> > >
> >
> https://docs.google.com/document/d/1VdfLFaE4i6ESbCQ38CH7TKOiPQVvXeOxNV2FeSMnOTg
> > > > >
> > > > >
> > > > > Regards,
> > > > > Mate
> > > >
> > >
> >
>


Re: [DISCUSS] FLIP-446: Kubernetes Operator State Snapshot CRD

2024-04-16 Thread Mate Czagany
Thank you Gyula!

I think that is a great idea. I have updated the Google doc to only have 1
new configuration option of boolean type, which can be used to signal the
Operator to use the old mode.

Also described in the configuration description, the Operator will fallback
to the old mode if the FlinkStateSnapshot CRD cannot be found on the
Kubernetes cluster.

Regards,
Mate

Gyula Fóra  ezt írta (időpont: 2024. ápr. 16., K,
17:01):

> Thanks Mate, this is great stuff.
>
> Mate, I think the new configs should probably default to the new mode and
> they should only be useful for users to fall back to the old behaviour.
> We could by default use the new Snapshot CRD if the CRD is installed,
> otherwise use the old mode by default and log a warning on startup.
>
> So I am suggesting a "dynamic" default behaviour based on whether the new
> CRD was installed or not because we don't want to break operator startup.
>
> Gyula
>
> On Tue, Apr 16, 2024 at 4:48 PM Mate Czagany  wrote:
>
> > Hi Ferenc,
> >
> > Thank you for your comments, I have updated the Google docs with a new
> > section for the new configs.
> > All of the newly added config keys will have defaults set, and by default
> > all the savepoint/checkpoint operations will use the old system: write
> > their results to the FlinkDeployment/FlinkSessionJob status field.
> >
> > I have also added a default for the checkpoint type to be FULL (which is
> > also the default currently). That was an oversight on my part to miss
> that.
> >
> > Regards,
> > Mate
> >
> > Ferenc Csaky  ezt írta (időpont: 2024. ápr.
> > 16., K, 16:10):
> >
> > > Thank you Mate for initiating this discussion. +1 for this idea.
> > > Some Qs:
> > >
> > > Can you specify the newly introduced configurations in more
> > > details? Currently, it is not fully clear to me what are the
> > > possible values of `kubernetes.operator.periodic.savepoint.mode`,
> > > is it optional, has a default value?
> > >
> > > I see that in `SavepointSpec.formatType` has a default, although
> > > `CheckppointSpec.checkpointType` not. Are we inferring that from
> > > the config? My point is, in general I think it would be good to
> > > handle the two snapshot types in a similar way when it makes sense
> > > to minimize any kind of confusion.
> > >
> > > Best,
> > > Ferenc
> > >
> > >
> > >
> > > On Tuesday, April 16th, 2024 at 11:34, Mate Czagany <
> czmat...@gmail.com>
> > > wrote:
> > >
> > > >
> > > >
> > > > Hi Everyone,
> > > >
> > > > I would like to start a discussion on FLIP-446: Kubernetes Operator
> > State
> > > > Snapshot CRD.
> > > >
> > > > This FLIP adds a new custom resource for Operator users to create and
> > > > manage their savepoints and checkpoints. I have also developed an
> > initial
> > > > POC to prove that this approach is feasible, you can find the link
> for
> > > that
> > > > in the FLIP.
> > > >
> > > > There is a Confluence page [1] and a Google Docs page [2] as I do not
> > > have
> > > > a Confluence account yet.
> > > >
> > > > [1]
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-446%3A+Kubernetes+Operator+State+Snapshot+CRD
> > > > [2]
> > > >
> > >
> >
> https://docs.google.com/document/d/1VdfLFaE4i6ESbCQ38CH7TKOiPQVvXeOxNV2FeSMnOTg
> > > >
> > > >
> > > > Regards,
> > > > Mate
> > >
> >
>


Re: [DISCUSS] FLIP-446: Kubernetes Operator State Snapshot CRD

2024-04-16 Thread Gyula Fóra
Thanks Mate, this is great stuff.

Mate, I think the new configs should probably default to the new mode and
they should only be useful for users to fall back to the old behaviour.
We could by default use the new Snapshot CRD if the CRD is installed,
otherwise use the old mode by default and log a warning on startup.

So I am suggesting a "dynamic" default behaviour based on whether the new
CRD was installed or not because we don't want to break operator startup.

Gyula

On Tue, Apr 16, 2024 at 4:48 PM Mate Czagany  wrote:

> Hi Ferenc,
>
> Thank you for your comments, I have updated the Google docs with a new
> section for the new configs.
> All of the newly added config keys will have defaults set, and by default
> all the savepoint/checkpoint operations will use the old system: write
> their results to the FlinkDeployment/FlinkSessionJob status field.
>
> I have also added a default for the checkpoint type to be FULL (which is
> also the default currently). That was an oversight on my part to miss that.
>
> Regards,
> Mate
>
> Ferenc Csaky  ezt írta (időpont: 2024. ápr.
> 16., K, 16:10):
>
> > Thank you Mate for initiating this discussion. +1 for this idea.
> > Some Qs:
> >
> > Can you specify the newly introduced configurations in more
> > details? Currently, it is not fully clear to me what are the
> > possible values of `kubernetes.operator.periodic.savepoint.mode`,
> > is it optional, has a default value?
> >
> > I see that in `SavepointSpec.formatType` has a default, although
> > `CheckppointSpec.checkpointType` not. Are we inferring that from
> > the config? My point is, in general I think it would be good to
> > handle the two snapshot types in a similar way when it makes sense
> > to minimize any kind of confusion.
> >
> > Best,
> > Ferenc
> >
> >
> >
> > On Tuesday, April 16th, 2024 at 11:34, Mate Czagany 
> > wrote:
> >
> > >
> > >
> > > Hi Everyone,
> > >
> > > I would like to start a discussion on FLIP-446: Kubernetes Operator
> State
> > > Snapshot CRD.
> > >
> > > This FLIP adds a new custom resource for Operator users to create and
> > > manage their savepoints and checkpoints. I have also developed an
> initial
> > > POC to prove that this approach is feasible, you can find the link for
> > that
> > > in the FLIP.
> > >
> > > There is a Confluence page [1] and a Google Docs page [2] as I do not
> > have
> > > a Confluence account yet.
> > >
> > > [1]
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-446%3A+Kubernetes+Operator+State+Snapshot+CRD
> > > [2]
> > >
> >
> https://docs.google.com/document/d/1VdfLFaE4i6ESbCQ38CH7TKOiPQVvXeOxNV2FeSMnOTg
> > >
> > >
> > > Regards,
> > > Mate
> >
>


Re: [DISCUSS] FLIP-446: Kubernetes Operator State Snapshot CRD

2024-04-16 Thread Mate Czagany
Hi Ferenc,

Thank you for your comments, I have updated the Google docs with a new
section for the new configs.
All of the newly added config keys will have defaults set, and by default
all the savepoint/checkpoint operations will use the old system: write
their results to the FlinkDeployment/FlinkSessionJob status field.

I have also added a default for the checkpoint type to be FULL (which is
also the default currently). That was an oversight on my part to miss that.

Regards,
Mate

Ferenc Csaky  ezt írta (időpont: 2024. ápr.
16., K, 16:10):

> Thank you Mate for initiating this discussion. +1 for this idea.
> Some Qs:
>
> Can you specify the newly introduced configurations in more
> details? Currently, it is not fully clear to me what are the
> possible values of `kubernetes.operator.periodic.savepoint.mode`,
> is it optional, has a default value?
>
> I see that in `SavepointSpec.formatType` has a default, although
> `CheckppointSpec.checkpointType` not. Are we inferring that from
> the config? My point is, in general I think it would be good to
> handle the two snapshot types in a similar way when it makes sense
> to minimize any kind of confusion.
>
> Best,
> Ferenc
>
>
>
> On Tuesday, April 16th, 2024 at 11:34, Mate Czagany 
> wrote:
>
> >
> >
> > Hi Everyone,
> >
> > I would like to start a discussion on FLIP-446: Kubernetes Operator State
> > Snapshot CRD.
> >
> > This FLIP adds a new custom resource for Operator users to create and
> > manage their savepoints and checkpoints. I have also developed an initial
> > POC to prove that this approach is feasible, you can find the link for
> that
> > in the FLIP.
> >
> > There is a Confluence page [1] and a Google Docs page [2] as I do not
> have
> > a Confluence account yet.
> >
> > [1]
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-446%3A+Kubernetes+Operator+State+Snapshot+CRD
> > [2]
> >
> https://docs.google.com/document/d/1VdfLFaE4i6ESbCQ38CH7TKOiPQVvXeOxNV2FeSMnOTg
> >
> >
> > Regards,
> > Mate
>


Re: [DISCUSS] FLIP-446: Kubernetes Operator State Snapshot CRD

2024-04-16 Thread Ferenc Csaky
Thank you Mate for initiating this discussion. +1 for this idea.
Some Qs:

Can you specify the newly introduced configurations in more
details? Currently, it is not fully clear to me what are the
possible values of `kubernetes.operator.periodic.savepoint.mode`,
is it optional, has a default value?

I see that in `SavepointSpec.formatType` has a default, although
`CheckppointSpec.checkpointType` not. Are we inferring that from
the config? My point is, in general I think it would be good to
handle the two snapshot types in a similar way when it makes sense
to minimize any kind of confusion.

Best,
Ferenc



On Tuesday, April 16th, 2024 at 11:34, Mate Czagany  wrote:

> 
> 
> Hi Everyone,
> 
> I would like to start a discussion on FLIP-446: Kubernetes Operator State
> Snapshot CRD.
> 
> This FLIP adds a new custom resource for Operator users to create and
> manage their savepoints and checkpoints. I have also developed an initial
> POC to prove that this approach is feasible, you can find the link for that
> in the FLIP.
> 
> There is a Confluence page [1] and a Google Docs page [2] as I do not have
> a Confluence account yet.
> 
> [1]
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-446%3A+Kubernetes+Operator+State+Snapshot+CRD
> [2]
> https://docs.google.com/document/d/1VdfLFaE4i6ESbCQ38CH7TKOiPQVvXeOxNV2FeSMnOTg
> 
> 
> Regards,
> Mate


[DISCUSS] FLIP-446: Kubernetes Operator State Snapshot CRD

2024-04-16 Thread Mate Czagany
Hi Everyone,

I would like to start a discussion on FLIP-446: Kubernetes Operator State
Snapshot CRD.

This FLIP adds a new custom resource for Operator users to create and
manage their savepoints and checkpoints. I have also developed an initial
POC to prove that this approach is feasible, you can find the link for that
in the FLIP.

There is a Confluence page [1] and a Google Docs page [2] as I do not have
a Confluence account yet.

[1]
https://cwiki.apache.org/confluence/display/FLINK/FLIP-446%3A+Kubernetes+Operator+State+Snapshot+CRD
[2]
https://docs.google.com/document/d/1VdfLFaE4i6ESbCQ38CH7TKOiPQVvXeOxNV2FeSMnOTg


Regards,
Mate