Re: [DISCUSS] FLIP-XXX Add K8S conditions to Flink CRD

Lajith Koova Mon, 03 Jun 2024 22:14:14 -0700

Thank you  Thomas for the valuable feedback.  I have included couple of
screenshots in the Proposal and included screenshots of how  status will
looks like  if we have a Condition type other than Ready.


Thanks
Lajith K

On Mon, Jun 3, 2024 at 6:48 AM Thomas Weise <t...@apache.org> wrote:

> Thanks for the proposal. As I understand it the idea is to make the status
> of a Flink deployment more accessible to standard k8s tooling, which would
> be a nice improvement and further strengthen the k8s native experience!
>
> Regarding the FLIP document's overall structure: Before diving into the
> implementation details, can we please expand the intro with the
> motivation/rationale for this change? A few examples of the audience that
> would benefit from this change. Examples of tools that would pick up the
> condition and how that would look like (link or screenshot if you have it).
>
> Regarding multiple conditions: +1 for not commingling reconciliation status
> and job status. It would make the resulting condition confusing. I believe
> what the user would expect under "Ready" is the representation of the job
> status. We can then add another separate condition as suggested, however
> can the FLIP document also outline if/how conditions other than "Ready"
> would appear in the generic k8s tooling?
>
> Thanks,
> Thomas
>
>
>
> On Fri, May 31, 2024 at 10:37 AM David Radley <david_rad...@uk.ibm.com>
> wrote:
>
> > Hi Mate and Gyula,
> > Thank you very much for your clarifications; it is clearer for me now. I
> > agree that a reconciliation condition would be useful – maybe reconciled
> > instead of ready for the boolean, so it is very explicit.
> >
> > Your suggestion of a job related readiness condition related to it’s
> > health would be useful; you suggest it be user configurable – this seems
> > closer to a liveliness / readiness probe.
> >
> > Kind regards, David.
> >
> > From: Mate Czagany <czmat...@gmail.com>
> > Date: Thursday, 30 May 2024 at 10:39
> > To: dev@flink.apache.org <dev@flink.apache.org>
> > Cc: morh...@apache.org <morh...@apache.org>
> > Subject: [EXTERNAL] Re: [DISCUSS] FLIP-XXX Add K8S conditions to Flink
> CRD
> > Hi,
> >
> > I would definitely keep this as a FLIP. Not all FLIPs have to be big
> > changes, and this format makes it easier for others to chime in and
> follow.
> >
> > I am not a Kubernetes expert, but my understanding is that we don't have
> to
> > follow any strict convention for the type names in the conditions, e.g.
> > "Ready" or "Error". And as Gyula said it doesn't add too much value in
> the
> > currently proposed way, it might even be confusing for users who have not
> > read this email thread or FLIP because "Ready" might suggest that the job
> > is running and is healthy. So my suggestion is the same as Gyulas, to
> have
> > more explicit type names instead of just "Ready" and "Error". However
> > "ClusterReady" sounds weird in case of FlinkSessionJobs.
> >
> > Regarding appending to the conditions field: if I understand the FLIP
> > correctly, we would allow multiple elements of the same type to exist in
> > the conditions list if the message and reason fields are different. From
> > the Kubernetes documentation it seems like the correct way would be to
> use
> > the "type" field as the map key and merge the fields [1].
> >
> >
> > [1]
> >
> >
> https://github.com/kubernetes/kubernetes/blob/bce55b94cdc3a4592749aa919c591fa7df7453eb/staging/src/k8s.io/apimachinery/pkg/apis/meta/v1/types.go#L1528
> >
> > Best regards,
> > Mate
> >
> > Gyula Fóra <gyula.f...@gmail.com> ezt írta (időpont: 2024. máj. 30., Cs,
> > 10:53):
> >
> > > David,
> > >
> > > The problem is exactly that ResourceLifecycleStates do not correspond
> to
> > > specific Job statuses (JobReady condition) in most cases. Let me give
> > you a
> > > concrete example:
> > >
> > > ResourceLifecycleState.STABLE means that app/job defined in the spec
> has
> > > been successfully deployed and was observed running, and this spec is
> now
> > > considered to be stable (won't be rolled back). Once a resource
> > > (FlinkDeployment) reached STABLE state, it won't change unless the user
> > > changes the spec. At the same time, this doesn't really say anything
> > about
> > > job health/readiness at any given future time. 10 minutes later the job
> > can
> > > go in an unrecoverable failure loop and never reach a running status,
> the
> > > ResourceLifecycleState will remain STABLE.
> > >
> > > This is actually not a problem with the ResourceLifecycleState but more
> > > with the understanding of it. It's called ResourceLifecycleState and
> not
> > > JobState exactly because it refers to the upgrade/rollback/suspend etc
> > > lifecycle of the FlinkDeployment/FlinkSessionJob resource and not the
> > > underlying flink job itself.
> > >
> > > But this is a crucial detail here that we need to consider otherwise
> the
> > > "Ready" condition that we may create will be practically useless.
> > >
> > > This is the reason why @morh...@apache.org <morh...@apache.org> and
> > > I suggest separating this to at least 2 independent conditions. One
> could
> > > be the UpgradeCompleted/ReconciliationCompleted or something along
> these
> > > lines computed based on LifecycleState (as described in your proposal
> but
> > > with a different name). The other should be JobReady which could
> > initially
> > > work based on the JobStatus.state field but ideally would be user
> > > configurable ready condition such as (job running at least 10 minutes,
> > > running and have taken checkpoints etcetc).
> > >
> > > These 2 conditions should be enough to start with and would actually
> > > provide a tangible value to users. We can probably leave out
> ClusterReady
> > > on a second thought.
> > >
> > > Cheers,
> > > Gyula
> > >
> > >
> > > On Wed, May 29, 2024 at 5:16 PM David Radley <david_rad...@uk.ibm.com>
> > > wrote:
> > >
> > > > Hi Gyula,
> > > > Thank you for the quick response and confirmation we need a Flip. I
> am
> > > not
> > > > an expert at K8s, Lajith will answer in more detail. Some questions I
> > had
> > > > anyway:
> > > >
> > > > I assume each of the ResourceLifecycleState do have a corresponding
> > > > jobReady status. You point out some mistakes in the table, for
> example
> > > that
> > > > STABLE should be NotReady; thankyou.  If we put a reason mentioning
> the
> > > > stable state, this would help us understand the jobStatus.
> > > >
> > > > I guess the jobReady is one perspective that we know is useful (with
> > > > corrected  mappings from ResourceLifecycleState and with reasons).
> Can
> > I
> > > > check that the  2 proposed conditions would also be useful
> additions? I
> > > > assume that in your proposal  when jobReady is true, then
> > > UpgradeCompleted
> > > > condition would not be present and ClusterReady would always be
> true? I
> > > > know conditions do not need to be orthogonal, but I wanted to check
> > what
> > > > your thoughts are.
> > > >
> > > >     Kind regards, David.
> > > >
> > > >
> > > >
> > > >
> > > > From: Gyula Fóra <gyula.f...@gmail.com>
> > > > Date: Wednesday, 29 May 2024 at 15:28
> > > > To: dev@flink.apache.org <dev@flink.apache.org>
> > > > Cc: morh...@apache.org <morh...@apache.org>
> > > > Subject: [EXTERNAL] Re: [DISCUSS] FLIP-XXX Add K8S conditions to
> Flink
> > > CRD
> > > > Hi David!
> > > >
> > > > This change definitely warrants a FLIP even if the code change is not
> > > huge,
> > > > there are quite some implications going forward.
> > > >
> > > > Looping in @morh...@apache.org <morh...@apache.org> for this
> > discussion.
> > > >
> > > > I have some questions / suggestions regarding the condition's meaning
> > and
> > > > naming.
> > > >
> > > > In your proposal you have:
> > > >  - Ready (True/False) -> This condition is intended for resources
> which
> > > are
> > > > fully ready and operational
> > > >  - Error (True) -> This condition can be used in scenarios where any
> > > > exception/error during resource reconcile process
> > > >
> > > > The problem with the above is that the implementation does not well
> > > reflect
> > > > this. ResourceLifecycleState STABLE/ROLLED_BACK does not actually
> mean
> > > the
> > > > job is running, it just means that the resource is fully reconciled
> and
> > > it
> > > > will not be rolled back (so the current pending upgrade is
> completed).
> > > This
> > > > is mainly a fault of the ResourceLifecycleState as it doesn't capture
> > the
> > > > job status but one could argue that it was "designed" this way.
> > > >
> > > > I think we should probably have more condition types to capture the
> > > > difference:
> > > >  - JobReady (True/False) -> Flink job is running (Basically job
> status
> > > but
> > > > with transition time)
> > > >  - ClusterReady (True/False) -> Session / Application cluster is
> > deployed
> > > > (Basically JM deployment status but with transition time)
> > > > -  UpgradeCompleted (True/False) -> Similar to what you call Ready
> now
> > > > which should correspond to the STABLE/ROLLED_BACK states and mostly
> > > tracks
> > > > in-progress CR updates
> > > >
> > > > This is my best idea at the moment, not great as it feels a little
> > > > redundant with the current status fields. But maybe thats not a
> problem
> > > or
> > > > a way to eliminate the old fields later?
> > > >
> > > > I am not so sure of the Error status and what this means in practice.
> > Why
> > > > do we want to track the last error in 2 places? It's already in the
> > > status.
> > > >
> > > > What do you think?
> > > > Gyula
> > > >
> > > > On Wed, May 29, 2024 at 3:55 PM David Radley <
> david_rad...@uk.ibm.com>
> > > > wrote:
> > > >
> > > > > Hi,
> > > > > Thanks Lajith for raising this discussion thread under the Flip
> > title.
> > > > >
> > > > > To summarise the concerns from the other discussion thread.
> > > > >
> > > > > “
> > > > > - I echo Gyula that including some examples and further
> explanations
> > > > might
> > > > > ease reader's work. With the current version, the FLIP is a bit
> hard
> > to
> > > > > follow. - Will the usage of Conditions be enabled by default? Or
> will
> > > > there
> > > > > be any disadvantages for Flink users? If Conditions with the same
> > type
> > > > > already exist in the Status Conditions
> > > > >
> > > > > - Do you think we should have clear rules about handling rules for
> > how
> > > > > these Conditions should be managed, especially when multiple
> > Conditions
> > > > of
> > > > > the same type are present? For example, resource has multiple
> causes
> > > for
> > > > > the same condition (e.g., Error due to network and Error due to
> I/O).
> > > > Then,
> > > > > overriding the old condition with the new one is not the best
> > approach
> > > > no?
> > > > > Please correct me if I misunderstood.
> > > > > “
> > > > >
> > > > > I see the Google doc link has been reformatted to match the Flip
> > > > template.
> > > > >
> > > > > To explicitly answer the questions from Jeyhun and Gyula:
> > > > > - “Will the usage of Conditions be enabled by default?” Yes, but
> this
> > > is
> > > > > just making the status content useful, whereas before it was not
> > > useful.
> > > > > - in terms of examples, I am not sure what you would like to see,
> the
> > > > > table Lajith provided shows the status for various
> > > > ResourceLifecycleStates.
> > > > > How the operator gets into these states is the current behaviour.
> The
> > > > > change just shows the appropriate corresponding high level status –
> > > that
> > > > > could be shown on the User Interfaces.
> > > > > - “will there be any disadvantages for Flink users?” None , there
> is
> > > just
> > > > > more information in the status, without this it is more difficult
> to
> > > work
> > > > > out the status of the job.
> > > > > - Multiple conditions question. The status is showing whether the
> job
> > > is
> > > > > ready or not, so as long as the last condition is the one that is
> > > shown -
> > > > > all is as expected. I don’t think this needs rules for precedence
> and
> > > the
> > > > > like.
> > > > > - The condition’s Reason is going to be more specific.
> > > > >
> > > > > Gyula and Jeyhun, is the google doc clear enough for you now? Do
> you
> > > feel
> > > > > you feedback has been addressed? Lajith and I are happy to provide
> > more
> > > > > details.
> > > > >
> > > > > I wonder whether this change is big enough to warrant a Flip, as it
> > is
> > > so
> > > > > small. We could do this in an issue. WDYT?
> > > > >
> > > > > Kind regards, David.
> > > > >
> > > > >
> > > > > From: Lajith Koova <lajith...@gmail.com>
> > > > > Date: Wednesday, 29 May 2024 at 13:41
> > > > > To: dev@flink.apache.org <dev@flink.apache.org>
> > > > > Subject: [EXTERNAL] [DISCUSS] FLIP-XXX Add K8S conditions to Flink
> > CRD
> > > > > Hello ,
> > > > >
> > > > >
> > > > > Discussion thread here:
> > > > > https://lists.apache.org/thread/dvy8w17pyjv68c3t962w49frl9odoz4z
> to
> > > > > discuss a proposal to add Conditions field in the CR status of
> Flink
> > > > > Deployment and FlinkSessionJob.
> > > > >
> > > > >
> > > > > Note : Starting this new thread as discussion thread title has been
> > > > > modified to follow the FLIP process.
> > > > >
> > > > >
> > > > > Thank you.
> > > > >
> > > > > Unless otherwise stated above:
> > > > >
> > > > > IBM United Kingdom Limited
> > > > > Registered in England and Wales with number 741598
> > > > > Registered office: PO Box 41, North Harbour, Portsmouth, Hants. PO6
> > 3AU
> > > > >
> > > >
> > > > Unless otherwise stated above:
> > > >
> > > > IBM United Kingdom Limited
> > > > Registered in England and Wales with number 741598
> > > > Registered office: PO Box 41, North Harbour, Portsmouth, Hants. PO6
> 3AU
> > > >
> > >
> >
> > Unless otherwise stated above:
> >
> > IBM United Kingdom Limited
> > Registered in England and Wales with number 741598
> > Registered office: PO Box 41, North Harbour, Portsmouth, Hants. PO6 3AU
> >
>

Re: [DISCUSS] FLIP-XXX Add K8S conditions to Flink CRD

Reply via email to