Re: [DISCUSS] FLIP-XXX Add K8S conditions to Flink CRD

2024-06-06 Thread Lajith Koova
Thank you Gyula for the feedback.

>From the above proposed conditions, so will be having  two conditions as
below

status:
conditions:
- type: JobReady
message: The Job is running
reason: running
status: 'True'
lastTransitionTime: ''
- type: ReconciliationSucceed
message: The resource deployment is considered to be stable and won’t be
rolled back
reason: stable
status: 'True'
lastTransitionTime: ''


Condition JobReady is derived from JobStatus  and Condition
ReconciliationSucceed
derived from LifecycleState.

Please correct me if I missed anything.


Thanks
Lajith K

On Thu, May 30, 2024 at 2:23 PM Gyula Fóra  wrote:

> David,
>
> The problem is exactly that ResourceLifecycleStates do not correspond to
> specific Job statuses (JobReady condition) in most cases. Let me give you a
> concrete example:
>
> ResourceLifecycleState.STABLE means that app/job defined in the spec has
> been successfully deployed and was observed running, and this spec is now
> considered to be stable (won't be rolled back). Once a resource
> (FlinkDeployment) reached STABLE state, it won't change unless the user
> changes the spec. At the same time, this doesn't really say anything about
> job health/readiness at any given future time. 10 minutes later the job can
> go in an unrecoverable failure loop and never reach a running status, the
> ResourceLifecycleState will remain STABLE.
>
> This is actually not a problem with the ResourceLifecycleState but more
> with the understanding of it. It's called ResourceLifecycleState and not
> JobState exactly because it refers to the upgrade/rollback/suspend etc
> lifecycle of the FlinkDeployment/FlinkSessionJob resource and not the
> underlying flink job itself.
>
> But this is a crucial detail here that we need to consider otherwise the
> "Ready" condition that we may create will be practically useless.
>
> This is the reason why @morh...@apache.org  and
> I suggest separating this to at least 2 independent conditions. One could
> be the UpgradeCompleted/ReconciliationCompleted or something along these
> lines computed based on LifecycleState (as described in your proposal but
> with a different name). The other should be JobReady which could initially
> work based on the JobStatus.state field but ideally would be user
> configurable ready condition such as (job running at least 10 minutes,
> running and have taken checkpoints etcetc).
>
> These 2 conditions should be enough to start with and would actually
> provide a tangible value to users. We can probably leave out ClusterReady
> on a second thought.
>
> Cheers,
> Gyula
>
>
> On Wed, May 29, 2024 at 5:16 PM David Radley 
> wrote:
>
> > Hi Gyula,
> > Thank you for the quick response and confirmation we need a Flip. I am
> not
> > an expert at K8s, Lajith will answer in more detail. Some questions I had
> > anyway:
> >
> > I assume each of the ResourceLifecycleState do have a corresponding
> > jobReady status. You point out some mistakes in the table, for example
> that
> > STABLE should be NotReady; thankyou.  If we put a reason mentioning the
> > stable state, this would help us understand the jobStatus.
> >
> > I guess the jobReady is one perspective that we know is useful (with
> > corrected  mappings from ResourceLifecycleState and with reasons). Can I
> > check that the  2 proposed conditions would also be useful additions? I
> > assume that in your proposal  when jobReady is true, then
> UpgradeCompleted
> > condition would not be present and ClusterReady would always be true? I
> > know conditions do not need to be orthogonal, but I wanted to check what
> > your thoughts are.
> >
> > Kind regards, David.
> >
> >
> >
> >
> > From: Gyula Fóra 
> > Date: Wednesday, 29 May 2024 at 15:28
> > To: dev@flink.apache.org 
> > Cc: morh...@apache.org 
> > Subject: [EXTERNAL] Re: [DISCUSS] FLIP-XXX Add K8S conditions to Flink
> CRD
> > Hi David!
> >
> > This change definitely warrants a FLIP even if the code change is not
> huge,
> > there are quite some implications going forward.
> >
> > Looping in @morh...@apache.org  for this discussion.
> >
> > I have some questions / suggestions regarding the condition's meaning and
> > naming.
> >
> > In your proposal you have:
> >  - Ready (True/False) -> This condition is intended for resources which
> are
> > fully ready and operational
> >  - Error (True) -> This condition can be used in scenarios where any
> > exception/error during resource reconcile process
> >
> > The problem with the above is that the implementation does not well
> reflect
> >

Re: [DISCUSS] FLIP-XXX Add K8S conditions to Flink CRD

2024-06-03 Thread Lajith Koova
Thank you  Thomas for the valuable feedback.  I have included couple of
screenshots in the Proposal and included screenshots of how  status will
looks like  if we have a Condition type other than Ready.

Thanks
Lajith K

On Mon, Jun 3, 2024 at 6:48 AM Thomas Weise  wrote:

> Thanks for the proposal. As I understand it the idea is to make the status
> of a Flink deployment more accessible to standard k8s tooling, which would
> be a nice improvement and further strengthen the k8s native experience!
>
> Regarding the FLIP document's overall structure: Before diving into the
> implementation details, can we please expand the intro with the
> motivation/rationale for this change? A few examples of the audience that
> would benefit from this change. Examples of tools that would pick up the
> condition and how that would look like (link or screenshot if you have it).
>
> Regarding multiple conditions: +1 for not commingling reconciliation status
> and job status. It would make the resulting condition confusing. I believe
> what the user would expect under "Ready" is the representation of the job
> status. We can then add another separate condition as suggested, however
> can the FLIP document also outline if/how conditions other than "Ready"
> would appear in the generic k8s tooling?
>
> Thanks,
> Thomas
>
>
>
> On Fri, May 31, 2024 at 10:37 AM David Radley 
> wrote:
>
> > Hi Mate and Gyula,
> > Thank you very much for your clarifications; it is clearer for me now. I
> > agree that a reconciliation condition would be useful – maybe reconciled
> > instead of ready for the boolean, so it is very explicit.
> >
> > Your suggestion of a job related readiness condition related to it’s
> > health would be useful; you suggest it be user configurable – this seems
> > closer to a liveliness / readiness probe.
> >
> > Kind regards, David.
> >
> > From: Mate Czagany 
> > Date: Thursday, 30 May 2024 at 10:39
> > To: dev@flink.apache.org 
> > Cc: morh...@apache.org 
> > Subject: [EXTERNAL] Re: [DISCUSS] FLIP-XXX Add K8S conditions to Flink
> CRD
> > Hi,
> >
> > I would definitely keep this as a FLIP. Not all FLIPs have to be big
> > changes, and this format makes it easier for others to chime in and
> follow.
> >
> > I am not a Kubernetes expert, but my understanding is that we don't have
> to
> > follow any strict convention for the type names in the conditions, e.g.
> > "Ready" or "Error". And as Gyula said it doesn't add too much value in
> the
> > currently proposed way, it might even be confusing for users who have not
> > read this email thread or FLIP because "Ready" might suggest that the job
> > is running and is healthy. So my suggestion is the same as Gyulas, to
> have
> > more explicit type names instead of just "Ready" and "Error". However
> > "ClusterReady" sounds weird in case of FlinkSessionJobs.
> >
> > Regarding appending to the conditions field: if I understand the FLIP
> > correctly, we would allow multiple elements of the same type to exist in
> > the conditions list if the message and reason fields are different. From
> > the Kubernetes documentation it seems like the correct way would be to
> use
> > the "type" field as the map key and merge the fields [1].
> >
> >
> > [1]
> >
> >
> https://github.com/kubernetes/kubernetes/blob/bce55b94cdc3a4592749aa919c591fa7df7453eb/staging/src/k8s.io/apimachinery/pkg/apis/meta/v1/types.go#L1528
> >
> > Best regards,
> > Mate
> >
> > Gyula Fóra  ezt írta (időpont: 2024. máj. 30., Cs,
> > 10:53):
> >
> > > David,
> > >
> > > The problem is exactly that ResourceLifecycleStates do not correspond
> to
> > > specific Job statuses (JobReady condition) in most cases. Let me give
> > you a
> > > concrete example:
> > >
> > > ResourceLifecycleState.STABLE means that app/job defined in the spec
> has
> > > been successfully deployed and was observed running, and this spec is
> now
> > > considered to be stable (won't be rolled back). Once a resource
> > > (FlinkDeployment) reached STABLE state, it won't change unless the user
> > > changes the spec. At the same time, this doesn't really say anything
> > about
> > > job health/readiness at any given future time. 10 minutes later the job
> > can
> > > go in an unrecoverable failure loop and never reach a running status,
> the
> > > ResourceLifecycleState will remain STABLE.
> > >
> > >

Re: [DISCUSS] FLIP-XXX Add K8S conditions to Flink CRD

2024-06-02 Thread Thomas Weise
Thanks for the proposal. As I understand it the idea is to make the status
of a Flink deployment more accessible to standard k8s tooling, which would
be a nice improvement and further strengthen the k8s native experience!

Regarding the FLIP document's overall structure: Before diving into the
implementation details, can we please expand the intro with the
motivation/rationale for this change? A few examples of the audience that
would benefit from this change. Examples of tools that would pick up the
condition and how that would look like (link or screenshot if you have it).

Regarding multiple conditions: +1 for not commingling reconciliation status
and job status. It would make the resulting condition confusing. I believe
what the user would expect under "Ready" is the representation of the job
status. We can then add another separate condition as suggested, however
can the FLIP document also outline if/how conditions other than "Ready"
would appear in the generic k8s tooling?

Thanks,
Thomas



On Fri, May 31, 2024 at 10:37 AM David Radley 
wrote:

> Hi Mate and Gyula,
> Thank you very much for your clarifications; it is clearer for me now. I
> agree that a reconciliation condition would be useful – maybe reconciled
> instead of ready for the boolean, so it is very explicit.
>
> Your suggestion of a job related readiness condition related to it’s
> health would be useful; you suggest it be user configurable – this seems
> closer to a liveliness / readiness probe.
>
> Kind regards, David.
>
> From: Mate Czagany 
> Date: Thursday, 30 May 2024 at 10:39
> To: dev@flink.apache.org 
> Cc: morh...@apache.org 
> Subject: [EXTERNAL] Re: [DISCUSS] FLIP-XXX Add K8S conditions to Flink CRD
> Hi,
>
> I would definitely keep this as a FLIP. Not all FLIPs have to be big
> changes, and this format makes it easier for others to chime in and follow.
>
> I am not a Kubernetes expert, but my understanding is that we don't have to
> follow any strict convention for the type names in the conditions, e.g.
> "Ready" or "Error". And as Gyula said it doesn't add too much value in the
> currently proposed way, it might even be confusing for users who have not
> read this email thread or FLIP because "Ready" might suggest that the job
> is running and is healthy. So my suggestion is the same as Gyulas, to have
> more explicit type names instead of just "Ready" and "Error". However
> "ClusterReady" sounds weird in case of FlinkSessionJobs.
>
> Regarding appending to the conditions field: if I understand the FLIP
> correctly, we would allow multiple elements of the same type to exist in
> the conditions list if the message and reason fields are different. From
> the Kubernetes documentation it seems like the correct way would be to use
> the "type" field as the map key and merge the fields [1].
>
>
> [1]
>
> https://github.com/kubernetes/kubernetes/blob/bce55b94cdc3a4592749aa919c591fa7df7453eb/staging/src/k8s.io/apimachinery/pkg/apis/meta/v1/types.go#L1528
>
> Best regards,
> Mate
>
> Gyula Fóra  ezt írta (időpont: 2024. máj. 30., Cs,
> 10:53):
>
> > David,
> >
> > The problem is exactly that ResourceLifecycleStates do not correspond to
> > specific Job statuses (JobReady condition) in most cases. Let me give
> you a
> > concrete example:
> >
> > ResourceLifecycleState.STABLE means that app/job defined in the spec has
> > been successfully deployed and was observed running, and this spec is now
> > considered to be stable (won't be rolled back). Once a resource
> > (FlinkDeployment) reached STABLE state, it won't change unless the user
> > changes the spec. At the same time, this doesn't really say anything
> about
> > job health/readiness at any given future time. 10 minutes later the job
> can
> > go in an unrecoverable failure loop and never reach a running status, the
> > ResourceLifecycleState will remain STABLE.
> >
> > This is actually not a problem with the ResourceLifecycleState but more
> > with the understanding of it. It's called ResourceLifecycleState and not
> > JobState exactly because it refers to the upgrade/rollback/suspend etc
> > lifecycle of the FlinkDeployment/FlinkSessionJob resource and not the
> > underlying flink job itself.
> >
> > But this is a crucial detail here that we need to consider otherwise the
> > "Ready" condition that we may create will be practically useless.
> >
> > This is the reason why @morh...@apache.org  and
> > I suggest separating this to at least 2 independent conditions. One could
> > be the UpgradeCompleted/ReconciliationCompleted or something along these
> > lines 

RE: [DISCUSS] FLIP-XXX Add K8S conditions to Flink CRD

2024-05-31 Thread David Radley
Hi Mate and Gyula,
Thank you very much for your clarifications; it is clearer for me now. I agree 
that a reconciliation condition would be useful – maybe reconciled instead of 
ready for the boolean, so it is very explicit.

Your suggestion of a job related readiness condition related to it’s health 
would be useful; you suggest it be user configurable – this seems closer to a 
liveliness / readiness probe.

Kind regards, David.

From: Mate Czagany 
Date: Thursday, 30 May 2024 at 10:39
To: dev@flink.apache.org 
Cc: morh...@apache.org 
Subject: [EXTERNAL] Re: [DISCUSS] FLIP-XXX Add K8S conditions to Flink CRD
Hi,

I would definitely keep this as a FLIP. Not all FLIPs have to be big
changes, and this format makes it easier for others to chime in and follow.

I am not a Kubernetes expert, but my understanding is that we don't have to
follow any strict convention for the type names in the conditions, e.g.
"Ready" or "Error". And as Gyula said it doesn't add too much value in the
currently proposed way, it might even be confusing for users who have not
read this email thread or FLIP because "Ready" might suggest that the job
is running and is healthy. So my suggestion is the same as Gyulas, to have
more explicit type names instead of just "Ready" and "Error". However
"ClusterReady" sounds weird in case of FlinkSessionJobs.

Regarding appending to the conditions field: if I understand the FLIP
correctly, we would allow multiple elements of the same type to exist in
the conditions list if the message and reason fields are different. From
the Kubernetes documentation it seems like the correct way would be to use
the "type" field as the map key and merge the fields [1].


[1]
https://github.com/kubernetes/kubernetes/blob/bce55b94cdc3a4592749aa919c591fa7df7453eb/staging/src/k8s.io/apimachinery/pkg/apis/meta/v1/types.go#L1528

Best regards,
Mate

Gyula Fóra  ezt írta (időpont: 2024. máj. 30., Cs,
10:53):

> David,
>
> The problem is exactly that ResourceLifecycleStates do not correspond to
> specific Job statuses (JobReady condition) in most cases. Let me give you a
> concrete example:
>
> ResourceLifecycleState.STABLE means that app/job defined in the spec has
> been successfully deployed and was observed running, and this spec is now
> considered to be stable (won't be rolled back). Once a resource
> (FlinkDeployment) reached STABLE state, it won't change unless the user
> changes the spec. At the same time, this doesn't really say anything about
> job health/readiness at any given future time. 10 minutes later the job can
> go in an unrecoverable failure loop and never reach a running status, the
> ResourceLifecycleState will remain STABLE.
>
> This is actually not a problem with the ResourceLifecycleState but more
> with the understanding of it. It's called ResourceLifecycleState and not
> JobState exactly because it refers to the upgrade/rollback/suspend etc
> lifecycle of the FlinkDeployment/FlinkSessionJob resource and not the
> underlying flink job itself.
>
> But this is a crucial detail here that we need to consider otherwise the
> "Ready" condition that we may create will be practically useless.
>
> This is the reason why @morh...@apache.org  and
> I suggest separating this to at least 2 independent conditions. One could
> be the UpgradeCompleted/ReconciliationCompleted or something along these
> lines computed based on LifecycleState (as described in your proposal but
> with a different name). The other should be JobReady which could initially
> work based on the JobStatus.state field but ideally would be user
> configurable ready condition such as (job running at least 10 minutes,
> running and have taken checkpoints etcetc).
>
> These 2 conditions should be enough to start with and would actually
> provide a tangible value to users. We can probably leave out ClusterReady
> on a second thought.
>
> Cheers,
> Gyula
>
>
> On Wed, May 29, 2024 at 5:16 PM David Radley 
> wrote:
>
> > Hi Gyula,
> > Thank you for the quick response and confirmation we need a Flip. I am
> not
> > an expert at K8s, Lajith will answer in more detail. Some questions I had
> > anyway:
> >
> > I assume each of the ResourceLifecycleState do have a corresponding
> > jobReady status. You point out some mistakes in the table, for example
> that
> > STABLE should be NotReady; thankyou.  If we put a reason mentioning the
> > stable state, this would help us understand the jobStatus.
> >
> > I guess the jobReady is one perspective that we know is useful (with
> > corrected  mappings from ResourceLifecycleState and with reasons). Can I
> > check that the  2 proposed conditions would also be useful additions? I
> > assume that in your 

Re: [DISCUSS] FLIP-XXX Add K8S conditions to Flink CRD

2024-05-30 Thread Mate Czagany
Hi,

I would definitely keep this as a FLIP. Not all FLIPs have to be big
changes, and this format makes it easier for others to chime in and follow.

I am not a Kubernetes expert, but my understanding is that we don't have to
follow any strict convention for the type names in the conditions, e.g.
"Ready" or "Error". And as Gyula said it doesn't add too much value in the
currently proposed way, it might even be confusing for users who have not
read this email thread or FLIP because "Ready" might suggest that the job
is running and is healthy. So my suggestion is the same as Gyulas, to have
more explicit type names instead of just "Ready" and "Error". However
"ClusterReady" sounds weird in case of FlinkSessionJobs.

Regarding appending to the conditions field: if I understand the FLIP
correctly, we would allow multiple elements of the same type to exist in
the conditions list if the message and reason fields are different. From
the Kubernetes documentation it seems like the correct way would be to use
the "type" field as the map key and merge the fields [1].


[1]
https://github.com/kubernetes/kubernetes/blob/bce55b94cdc3a4592749aa919c591fa7df7453eb/staging/src/k8s.io/apimachinery/pkg/apis/meta/v1/types.go#L1528

Best regards,
Mate

Gyula Fóra  ezt írta (időpont: 2024. máj. 30., Cs,
10:53):

> David,
>
> The problem is exactly that ResourceLifecycleStates do not correspond to
> specific Job statuses (JobReady condition) in most cases. Let me give you a
> concrete example:
>
> ResourceLifecycleState.STABLE means that app/job defined in the spec has
> been successfully deployed and was observed running, and this spec is now
> considered to be stable (won't be rolled back). Once a resource
> (FlinkDeployment) reached STABLE state, it won't change unless the user
> changes the spec. At the same time, this doesn't really say anything about
> job health/readiness at any given future time. 10 minutes later the job can
> go in an unrecoverable failure loop and never reach a running status, the
> ResourceLifecycleState will remain STABLE.
>
> This is actually not a problem with the ResourceLifecycleState but more
> with the understanding of it. It's called ResourceLifecycleState and not
> JobState exactly because it refers to the upgrade/rollback/suspend etc
> lifecycle of the FlinkDeployment/FlinkSessionJob resource and not the
> underlying flink job itself.
>
> But this is a crucial detail here that we need to consider otherwise the
> "Ready" condition that we may create will be practically useless.
>
> This is the reason why @morh...@apache.org  and
> I suggest separating this to at least 2 independent conditions. One could
> be the UpgradeCompleted/ReconciliationCompleted or something along these
> lines computed based on LifecycleState (as described in your proposal but
> with a different name). The other should be JobReady which could initially
> work based on the JobStatus.state field but ideally would be user
> configurable ready condition such as (job running at least 10 minutes,
> running and have taken checkpoints etcetc).
>
> These 2 conditions should be enough to start with and would actually
> provide a tangible value to users. We can probably leave out ClusterReady
> on a second thought.
>
> Cheers,
> Gyula
>
>
> On Wed, May 29, 2024 at 5:16 PM David Radley 
> wrote:
>
> > Hi Gyula,
> > Thank you for the quick response and confirmation we need a Flip. I am
> not
> > an expert at K8s, Lajith will answer in more detail. Some questions I had
> > anyway:
> >
> > I assume each of the ResourceLifecycleState do have a corresponding
> > jobReady status. You point out some mistakes in the table, for example
> that
> > STABLE should be NotReady; thankyou.  If we put a reason mentioning the
> > stable state, this would help us understand the jobStatus.
> >
> > I guess the jobReady is one perspective that we know is useful (with
> > corrected  mappings from ResourceLifecycleState and with reasons). Can I
> > check that the  2 proposed conditions would also be useful additions? I
> > assume that in your proposal  when jobReady is true, then
> UpgradeCompleted
> > condition would not be present and ClusterReady would always be true? I
> > know conditions do not need to be orthogonal, but I wanted to check what
> > your thoughts are.
> >
> > Kind regards, David.
> >
> >
> >
> >
> > From: Gyula Fóra 
> > Date: Wednesday, 29 May 2024 at 15:28
> > To: dev@flink.apache.org 
> > Cc: morh...@apache.org 
> > Subject: [EXTERNAL] Re: [DISCUSS] FLIP-XXX Add K8S conditions to Flink
> CRD
> > Hi David!
> >
> > This cha

Re: [DISCUSS] FLIP-XXX Add K8S conditions to Flink CRD

2024-05-30 Thread Gyula Fóra
David,

The problem is exactly that ResourceLifecycleStates do not correspond to
specific Job statuses (JobReady condition) in most cases. Let me give you a
concrete example:

ResourceLifecycleState.STABLE means that app/job defined in the spec has
been successfully deployed and was observed running, and this spec is now
considered to be stable (won't be rolled back). Once a resource
(FlinkDeployment) reached STABLE state, it won't change unless the user
changes the spec. At the same time, this doesn't really say anything about
job health/readiness at any given future time. 10 minutes later the job can
go in an unrecoverable failure loop and never reach a running status, the
ResourceLifecycleState will remain STABLE.

This is actually not a problem with the ResourceLifecycleState but more
with the understanding of it. It's called ResourceLifecycleState and not
JobState exactly because it refers to the upgrade/rollback/suspend etc
lifecycle of the FlinkDeployment/FlinkSessionJob resource and not the
underlying flink job itself.

But this is a crucial detail here that we need to consider otherwise the
"Ready" condition that we may create will be practically useless.

This is the reason why @morh...@apache.org  and
I suggest separating this to at least 2 independent conditions. One could
be the UpgradeCompleted/ReconciliationCompleted or something along these
lines computed based on LifecycleState (as described in your proposal but
with a different name). The other should be JobReady which could initially
work based on the JobStatus.state field but ideally would be user
configurable ready condition such as (job running at least 10 minutes,
running and have taken checkpoints etcetc).

These 2 conditions should be enough to start with and would actually
provide a tangible value to users. We can probably leave out ClusterReady
on a second thought.

Cheers,
Gyula


On Wed, May 29, 2024 at 5:16 PM David Radley 
wrote:

> Hi Gyula,
> Thank you for the quick response and confirmation we need a Flip. I am not
> an expert at K8s, Lajith will answer in more detail. Some questions I had
> anyway:
>
> I assume each of the ResourceLifecycleState do have a corresponding
> jobReady status. You point out some mistakes in the table, for example that
> STABLE should be NotReady; thankyou.  If we put a reason mentioning the
> stable state, this would help us understand the jobStatus.
>
> I guess the jobReady is one perspective that we know is useful (with
> corrected  mappings from ResourceLifecycleState and with reasons). Can I
> check that the  2 proposed conditions would also be useful additions? I
> assume that in your proposal  when jobReady is true, then UpgradeCompleted
> condition would not be present and ClusterReady would always be true? I
> know conditions do not need to be orthogonal, but I wanted to check what
> your thoughts are.
>
> Kind regards, David.
>
>
>
>
> From: Gyula Fóra 
> Date: Wednesday, 29 May 2024 at 15:28
> To: dev@flink.apache.org 
> Cc: morh...@apache.org 
> Subject: [EXTERNAL] Re: [DISCUSS] FLIP-XXX Add K8S conditions to Flink CRD
> Hi David!
>
> This change definitely warrants a FLIP even if the code change is not huge,
> there are quite some implications going forward.
>
> Looping in @morh...@apache.org  for this discussion.
>
> I have some questions / suggestions regarding the condition's meaning and
> naming.
>
> In your proposal you have:
>  - Ready (True/False) -> This condition is intended for resources which are
> fully ready and operational
>  - Error (True) -> This condition can be used in scenarios where any
> exception/error during resource reconcile process
>
> The problem with the above is that the implementation does not well reflect
> this. ResourceLifecycleState STABLE/ROLLED_BACK does not actually mean the
> job is running, it just means that the resource is fully reconciled and it
> will not be rolled back (so the current pending upgrade is completed). This
> is mainly a fault of the ResourceLifecycleState as it doesn't capture the
> job status but one could argue that it was "designed" this way.
>
> I think we should probably have more condition types to capture the
> difference:
>  - JobReady (True/False) -> Flink job is running (Basically job status but
> with transition time)
>  - ClusterReady (True/False) -> Session / Application cluster is deployed
> (Basically JM deployment status but with transition time)
> -  UpgradeCompleted (True/False) -> Similar to what you call Ready now
> which should correspond to the STABLE/ROLLED_BACK states and mostly tracks
> in-progress CR updates
>
> This is my best idea at the moment, not great as it feels a little
> redundant with the current status fields. But maybe thats not a problem or
> a way to elimina

RE: [DISCUSS] FLIP-XXX Add K8S conditions to Flink CRD

2024-05-29 Thread David Radley
Hi Gyula,
Thank you for the quick response and confirmation we need a Flip. I am not an 
expert at K8s, Lajith will answer in more detail. Some questions I had anyway:

I assume each of the ResourceLifecycleState do have a corresponding jobReady 
status. You point out some mistakes in the table, for example that STABLE 
should be NotReady; thankyou.  If we put a reason mentioning the stable state, 
this would help us understand the jobStatus.

I guess the jobReady is one perspective that we know is useful (with corrected  
mappings from ResourceLifecycleState and with reasons). Can I check that the  2 
proposed conditions would also be useful additions? I assume that in your 
proposal  when jobReady is true, then UpgradeCompleted condition would not be 
present and ClusterReady would always be true? I know conditions do not need to 
be orthogonal, but I wanted to check what your thoughts are.

Kind regards, David.




From: Gyula Fóra 
Date: Wednesday, 29 May 2024 at 15:28
To: dev@flink.apache.org 
Cc: morh...@apache.org 
Subject: [EXTERNAL] Re: [DISCUSS] FLIP-XXX Add K8S conditions to Flink CRD
Hi David!

This change definitely warrants a FLIP even if the code change is not huge,
there are quite some implications going forward.

Looping in @morh...@apache.org  for this discussion.

I have some questions / suggestions regarding the condition's meaning and
naming.

In your proposal you have:
 - Ready (True/False) -> This condition is intended for resources which are
fully ready and operational
 - Error (True) -> This condition can be used in scenarios where any
exception/error during resource reconcile process

The problem with the above is that the implementation does not well reflect
this. ResourceLifecycleState STABLE/ROLLED_BACK does not actually mean the
job is running, it just means that the resource is fully reconciled and it
will not be rolled back (so the current pending upgrade is completed). This
is mainly a fault of the ResourceLifecycleState as it doesn't capture the
job status but one could argue that it was "designed" this way.

I think we should probably have more condition types to capture the
difference:
 - JobReady (True/False) -> Flink job is running (Basically job status but
with transition time)
 - ClusterReady (True/False) -> Session / Application cluster is deployed
(Basically JM deployment status but with transition time)
-  UpgradeCompleted (True/False) -> Similar to what you call Ready now
which should correspond to the STABLE/ROLLED_BACK states and mostly tracks
in-progress CR updates

This is my best idea at the moment, not great as it feels a little
redundant with the current status fields. But maybe thats not a problem or
a way to eliminate the old fields later?

I am not so sure of the Error status and what this means in practice. Why
do we want to track the last error in 2 places? It's already in the status.

What do you think?
Gyula

On Wed, May 29, 2024 at 3:55 PM David Radley 
wrote:

> Hi,
> Thanks Lajith for raising this discussion thread under the Flip title.
>
> To summarise the concerns from the other discussion thread.
>
> “
> - I echo Gyula that including some examples and further explanations might
> ease reader's work. With the current version, the FLIP is a bit hard to
> follow. - Will the usage of Conditions be enabled by default? Or will there
> be any disadvantages for Flink users? If Conditions with the same type
> already exist in the Status Conditions
>
> - Do you think we should have clear rules about handling rules for how
> these Conditions should be managed, especially when multiple Conditions of
> the same type are present? For example, resource has multiple causes for
> the same condition (e.g., Error due to network and Error due to I/O). Then,
> overriding the old condition with the new one is not the best approach no?
> Please correct me if I misunderstood.
> “
>
> I see the Google doc link has been reformatted to match the Flip template.
>
> To explicitly answer the questions from Jeyhun and Gyula:
> - “Will the usage of Conditions be enabled by default?” Yes, but this is
> just making the status content useful, whereas before it was not useful.
> - in terms of examples, I am not sure what you would like to see, the
> table Lajith provided shows the status for various ResourceLifecycleStates.
> How the operator gets into these states is the current behaviour. The
> change just shows the appropriate corresponding high level status – that
> could be shown on the User Interfaces.
> - “will there be any disadvantages for Flink users?” None , there is just
> more information in the status, without this it is more difficult to work
> out the status of the job.
> - Multiple conditions question. The status is showing whether the job is
> ready or not, so as long as the last condition is the one that is shown -
> all 

Re: [DISCUSS] FLIP-XXX Add K8S conditions to Flink CRD

2024-05-29 Thread Gyula Fóra
Hi David!

This change definitely warrants a FLIP even if the code change is not huge,
there are quite some implications going forward.

Looping in @morh...@apache.org  for this discussion.

I have some questions / suggestions regarding the condition's meaning and
naming.

In your proposal you have:
 - Ready (True/False) -> This condition is intended for resources which are
fully ready and operational
 - Error (True) -> This condition can be used in scenarios where any
exception/error during resource reconcile process

The problem with the above is that the implementation does not well reflect
this. ResourceLifecycleState STABLE/ROLLED_BACK does not actually mean the
job is running, it just means that the resource is fully reconciled and it
will not be rolled back (so the current pending upgrade is completed). This
is mainly a fault of the ResourceLifecycleState as it doesn't capture the
job status but one could argue that it was "designed" this way.

I think we should probably have more condition types to capture the
difference:
 - JobReady (True/False) -> Flink job is running (Basically job status but
with transition time)
 - ClusterReady (True/False) -> Session / Application cluster is deployed
(Basically JM deployment status but with transition time)
-  UpgradeCompleted (True/False) -> Similar to what you call Ready now
which should correspond to the STABLE/ROLLED_BACK states and mostly tracks
in-progress CR updates

This is my best idea at the moment, not great as it feels a little
redundant with the current status fields. But maybe thats not a problem or
a way to eliminate the old fields later?

I am not so sure of the Error status and what this means in practice. Why
do we want to track the last error in 2 places? It's already in the status.

What do you think?
Gyula

On Wed, May 29, 2024 at 3:55 PM David Radley 
wrote:

> Hi,
> Thanks Lajith for raising this discussion thread under the Flip title.
>
> To summarise the concerns from the other discussion thread.
>
> “
> - I echo Gyula that including some examples and further explanations might
> ease reader's work. With the current version, the FLIP is a bit hard to
> follow. - Will the usage of Conditions be enabled by default? Or will there
> be any disadvantages for Flink users? If Conditions with the same type
> already exist in the Status Conditions
>
> - Do you think we should have clear rules about handling rules for how
> these Conditions should be managed, especially when multiple Conditions of
> the same type are present? For example, resource has multiple causes for
> the same condition (e.g., Error due to network and Error due to I/O). Then,
> overriding the old condition with the new one is not the best approach no?
> Please correct me if I misunderstood.
> “
>
> I see the Google doc link has been reformatted to match the Flip template.
>
> To explicitly answer the questions from Jeyhun and Gyula:
> - “Will the usage of Conditions be enabled by default?” Yes, but this is
> just making the status content useful, whereas before it was not useful.
> - in terms of examples, I am not sure what you would like to see, the
> table Lajith provided shows the status for various ResourceLifecycleStates.
> How the operator gets into these states is the current behaviour. The
> change just shows the appropriate corresponding high level status – that
> could be shown on the User Interfaces.
> - “will there be any disadvantages for Flink users?” None , there is just
> more information in the status, without this it is more difficult to work
> out the status of the job.
> - Multiple conditions question. The status is showing whether the job is
> ready or not, so as long as the last condition is the one that is shown -
> all is as expected. I don’t think this needs rules for precedence and the
> like.
> - The condition’s Reason is going to be more specific.
>
> Gyula and Jeyhun, is the google doc clear enough for you now? Do you feel
> you feedback has been addressed? Lajith and I are happy to provide more
> details.
>
> I wonder whether this change is big enough to warrant a Flip, as it is so
> small. We could do this in an issue. WDYT?
>
> Kind regards, David.
>
>
> From: Lajith Koova 
> Date: Wednesday, 29 May 2024 at 13:41
> To: dev@flink.apache.org 
> Subject: [EXTERNAL] [DISCUSS] FLIP-XXX Add K8S conditions to Flink CRD
> Hello ,
>
>
> Discussion thread here:
> https://lists.apache.org/thread/dvy8w17pyjv68c3t962w49frl9odoz4z  to
> discuss a proposal to add Conditions field in the CR status of Flink
> Deployment and FlinkSessionJob.
>
>
> Note : Starting this new thread as discussion thread title has been
> modified to follow the FLIP process.
>
>
> Thank you.
>
> Unless otherwise stated above:
>
> IBM United Kingdom Limited
> Registered in England and Wales with number 741598
> Registered office: PO Box 41, North Harbour, Portsmouth, Hants. PO6 3AU
>


Re: [DISCUSS] FLIP-XXX Add K8S conditions to Flink CRD

2024-05-29 Thread David Radley
Hi,
Thanks Lajith for raising this discussion thread under the Flip title.

To summarise the concerns from the other discussion thread.

“
- I echo Gyula that including some examples and further explanations might ease 
reader's work. With the current version, the FLIP is a bit hard to follow. - 
Will the usage of Conditions be enabled by default? Or will there be any 
disadvantages for Flink users? If Conditions with the same type already exist 
in the Status Conditions

- Do you think we should have clear rules about handling rules for how these 
Conditions should be managed, especially when multiple Conditions of the same 
type are present? For example, resource has multiple causes for the same 
condition (e.g., Error due to network and Error due to I/O). Then, overriding 
the old condition with the new one is not the best approach no? Please correct 
me if I misunderstood.
“

I see the Google doc link has been reformatted to match the Flip template.

To explicitly answer the questions from Jeyhun and Gyula:
- “Will the usage of Conditions be enabled by default?” Yes, but this is just 
making the status content useful, whereas before it was not useful.
- in terms of examples, I am not sure what you would like to see, the table 
Lajith provided shows the status for various ResourceLifecycleStates. How the 
operator gets into these states is the current behaviour. The change just shows 
the appropriate corresponding high level status – that could be shown on the 
User Interfaces.
- “will there be any disadvantages for Flink users?” None , there is just more 
information in the status, without this it is more difficult to work out the 
status of the job.
- Multiple conditions question. The status is showing whether the job is ready 
or not, so as long as the last condition is the one that is shown - all is as 
expected. I don’t think this needs rules for precedence and the like.
- The condition’s Reason is going to be more specific.

Gyula and Jeyhun, is the google doc clear enough for you now? Do you feel you 
feedback has been addressed? Lajith and I are happy to provide more details.

I wonder whether this change is big enough to warrant a Flip, as it is so 
small. We could do this in an issue. WDYT?

Kind regards, David.


From: Lajith Koova 
Date: Wednesday, 29 May 2024 at 13:41
To: dev@flink.apache.org 
Subject: [EXTERNAL] [DISCUSS] FLIP-XXX Add K8S conditions to Flink CRD
Hello ,


Discussion thread here:
https://lists.apache.org/thread/dvy8w17pyjv68c3t962w49frl9odoz4z  to
discuss a proposal to add Conditions field in the CR status of Flink
Deployment and FlinkSessionJob.


Note : Starting this new thread as discussion thread title has been
modified to follow the FLIP process.


Thank you.

Unless otherwise stated above:

IBM United Kingdom Limited
Registered in England and Wales with number 741598
Registered office: PO Box 41, North Harbour, Portsmouth, Hants. PO6 3AU