Thank you Thomas for the valuable feedback. I have included couple of screenshots in the Proposal and included screenshots of how status will looks like if we have a Condition type other than Ready.
Thanks Lajith K On Mon, Jun 3, 2024 at 6:48 AM Thomas Weise <t...@apache.org> wrote: > Thanks for the proposal. As I understand it the idea is to make the status > of a Flink deployment more accessible to standard k8s tooling, which would > be a nice improvement and further strengthen the k8s native experience! > > Regarding the FLIP document's overall structure: Before diving into the > implementation details, can we please expand the intro with the > motivation/rationale for this change? A few examples of the audience that > would benefit from this change. Examples of tools that would pick up the > condition and how that would look like (link or screenshot if you have it). > > Regarding multiple conditions: +1 for not commingling reconciliation status > and job status. It would make the resulting condition confusing. I believe > what the user would expect under "Ready" is the representation of the job > status. We can then add another separate condition as suggested, however > can the FLIP document also outline if/how conditions other than "Ready" > would appear in the generic k8s tooling? > > Thanks, > Thomas > > > > On Fri, May 31, 2024 at 10:37 AM David Radley <david_rad...@uk.ibm.com> > wrote: > > > Hi Mate and Gyula, > > Thank you very much for your clarifications; it is clearer for me now. I > > agree that a reconciliation condition would be useful – maybe reconciled > > instead of ready for the boolean, so it is very explicit. > > > > Your suggestion of a job related readiness condition related to it’s > > health would be useful; you suggest it be user configurable – this seems > > closer to a liveliness / readiness probe. > > > > Kind regards, David. > > > > From: Mate Czagany <czmat...@gmail.com> > > Date: Thursday, 30 May 2024 at 10:39 > > To: dev@flink.apache.org <dev@flink.apache.org> > > Cc: morh...@apache.org <morh...@apache.org> > > Subject: [EXTERNAL] Re: [DISCUSS] FLIP-XXX Add K8S conditions to Flink > CRD > > Hi, > > > > I would definitely keep this as a FLIP. Not all FLIPs have to be big > > changes, and this format makes it easier for others to chime in and > follow. > > > > I am not a Kubernetes expert, but my understanding is that we don't have > to > > follow any strict convention for the type names in the conditions, e.g. > > "Ready" or "Error". And as Gyula said it doesn't add too much value in > the > > currently proposed way, it might even be confusing for users who have not > > read this email thread or FLIP because "Ready" might suggest that the job > > is running and is healthy. So my suggestion is the same as Gyulas, to > have > > more explicit type names instead of just "Ready" and "Error". However > > "ClusterReady" sounds weird in case of FlinkSessionJobs. > > > > Regarding appending to the conditions field: if I understand the FLIP > > correctly, we would allow multiple elements of the same type to exist in > > the conditions list if the message and reason fields are different. From > > the Kubernetes documentation it seems like the correct way would be to > use > > the "type" field as the map key and merge the fields [1]. > > > > > > [1] > > > > > https://github.com/kubernetes/kubernetes/blob/bce55b94cdc3a4592749aa919c591fa7df7453eb/staging/src/k8s.io/apimachinery/pkg/apis/meta/v1/types.go#L1528 > > > > Best regards, > > Mate > > > > Gyula Fóra <gyula.f...@gmail.com> ezt írta (időpont: 2024. máj. 30., Cs, > > 10:53): > > > > > David, > > > > > > The problem is exactly that ResourceLifecycleStates do not correspond > to > > > specific Job statuses (JobReady condition) in most cases. Let me give > > you a > > > concrete example: > > > > > > ResourceLifecycleState.STABLE means that app/job defined in the spec > has > > > been successfully deployed and was observed running, and this spec is > now > > > considered to be stable (won't be rolled back). Once a resource > > > (FlinkDeployment) reached STABLE state, it won't change unless the user > > > changes the spec. At the same time, this doesn't really say anything > > about > > > job health/readiness at any given future time. 10 minutes later the job > > can > > > go in an unrecoverable failure loop and never reach a running status, > the > > > ResourceLifecycleState will remain STABLE. > > > > > > This is actually not a problem with the ResourceLifecycleState but more > > > with the understanding of it. It's called ResourceLifecycleState and > not > > > JobState exactly because it refers to the upgrade/rollback/suspend etc > > > lifecycle of the FlinkDeployment/FlinkSessionJob resource and not the > > > underlying flink job itself. > > > > > > But this is a crucial detail here that we need to consider otherwise > the > > > "Ready" condition that we may create will be practically useless. > > > > > > This is the reason why @morh...@apache.org <morh...@apache.org> and > > > I suggest separating this to at least 2 independent conditions. One > could > > > be the UpgradeCompleted/ReconciliationCompleted or something along > these > > > lines computed based on LifecycleState (as described in your proposal > but > > > with a different name). The other should be JobReady which could > > initially > > > work based on the JobStatus.state field but ideally would be user > > > configurable ready condition such as (job running at least 10 minutes, > > > running and have taken checkpoints etcetc). > > > > > > These 2 conditions should be enough to start with and would actually > > > provide a tangible value to users. We can probably leave out > ClusterReady > > > on a second thought. > > > > > > Cheers, > > > Gyula > > > > > > > > > On Wed, May 29, 2024 at 5:16 PM David Radley <david_rad...@uk.ibm.com> > > > wrote: > > > > > > > Hi Gyula, > > > > Thank you for the quick response and confirmation we need a Flip. I > am > > > not > > > > an expert at K8s, Lajith will answer in more detail. Some questions I > > had > > > > anyway: > > > > > > > > I assume each of the ResourceLifecycleState do have a corresponding > > > > jobReady status. You point out some mistakes in the table, for > example > > > that > > > > STABLE should be NotReady; thankyou. If we put a reason mentioning > the > > > > stable state, this would help us understand the jobStatus. > > > > > > > > I guess the jobReady is one perspective that we know is useful (with > > > > corrected mappings from ResourceLifecycleState and with reasons). > Can > > I > > > > check that the 2 proposed conditions would also be useful > additions? I > > > > assume that in your proposal when jobReady is true, then > > > UpgradeCompleted > > > > condition would not be present and ClusterReady would always be > true? I > > > > know conditions do not need to be orthogonal, but I wanted to check > > what > > > > your thoughts are. > > > > > > > > Kind regards, David. > > > > > > > > > > > > > > > > > > > > From: Gyula Fóra <gyula.f...@gmail.com> > > > > Date: Wednesday, 29 May 2024 at 15:28 > > > > To: dev@flink.apache.org <dev@flink.apache.org> > > > > Cc: morh...@apache.org <morh...@apache.org> > > > > Subject: [EXTERNAL] Re: [DISCUSS] FLIP-XXX Add K8S conditions to > Flink > > > CRD > > > > Hi David! > > > > > > > > This change definitely warrants a FLIP even if the code change is not > > > huge, > > > > there are quite some implications going forward. > > > > > > > > Looping in @morh...@apache.org <morh...@apache.org> for this > > discussion. > > > > > > > > I have some questions / suggestions regarding the condition's meaning > > and > > > > naming. > > > > > > > > In your proposal you have: > > > > - Ready (True/False) -> This condition is intended for resources > which > > > are > > > > fully ready and operational > > > > - Error (True) -> This condition can be used in scenarios where any > > > > exception/error during resource reconcile process > > > > > > > > The problem with the above is that the implementation does not well > > > reflect > > > > this. ResourceLifecycleState STABLE/ROLLED_BACK does not actually > mean > > > the > > > > job is running, it just means that the resource is fully reconciled > and > > > it > > > > will not be rolled back (so the current pending upgrade is > completed). > > > This > > > > is mainly a fault of the ResourceLifecycleState as it doesn't capture > > the > > > > job status but one could argue that it was "designed" this way. > > > > > > > > I think we should probably have more condition types to capture the > > > > difference: > > > > - JobReady (True/False) -> Flink job is running (Basically job > status > > > but > > > > with transition time) > > > > - ClusterReady (True/False) -> Session / Application cluster is > > deployed > > > > (Basically JM deployment status but with transition time) > > > > - UpgradeCompleted (True/False) -> Similar to what you call Ready > now > > > > which should correspond to the STABLE/ROLLED_BACK states and mostly > > > tracks > > > > in-progress CR updates > > > > > > > > This is my best idea at the moment, not great as it feels a little > > > > redundant with the current status fields. But maybe thats not a > problem > > > or > > > > a way to eliminate the old fields later? > > > > > > > > I am not so sure of the Error status and what this means in practice. > > Why > > > > do we want to track the last error in 2 places? It's already in the > > > status. > > > > > > > > What do you think? > > > > Gyula > > > > > > > > On Wed, May 29, 2024 at 3:55 PM David Radley < > david_rad...@uk.ibm.com> > > > > wrote: > > > > > > > > > Hi, > > > > > Thanks Lajith for raising this discussion thread under the Flip > > title. > > > > > > > > > > To summarise the concerns from the other discussion thread. > > > > > > > > > > “ > > > > > - I echo Gyula that including some examples and further > explanations > > > > might > > > > > ease reader's work. With the current version, the FLIP is a bit > hard > > to > > > > > follow. - Will the usage of Conditions be enabled by default? Or > will > > > > there > > > > > be any disadvantages for Flink users? If Conditions with the same > > type > > > > > already exist in the Status Conditions > > > > > > > > > > - Do you think we should have clear rules about handling rules for > > how > > > > > these Conditions should be managed, especially when multiple > > Conditions > > > > of > > > > > the same type are present? For example, resource has multiple > causes > > > for > > > > > the same condition (e.g., Error due to network and Error due to > I/O). > > > > Then, > > > > > overriding the old condition with the new one is not the best > > approach > > > > no? > > > > > Please correct me if I misunderstood. > > > > > “ > > > > > > > > > > I see the Google doc link has been reformatted to match the Flip > > > > template. > > > > > > > > > > To explicitly answer the questions from Jeyhun and Gyula: > > > > > - “Will the usage of Conditions be enabled by default?” Yes, but > this > > > is > > > > > just making the status content useful, whereas before it was not > > > useful. > > > > > - in terms of examples, I am not sure what you would like to see, > the > > > > > table Lajith provided shows the status for various > > > > ResourceLifecycleStates. > > > > > How the operator gets into these states is the current behaviour. > The > > > > > change just shows the appropriate corresponding high level status – > > > that > > > > > could be shown on the User Interfaces. > > > > > - “will there be any disadvantages for Flink users?” None , there > is > > > just > > > > > more information in the status, without this it is more difficult > to > > > work > > > > > out the status of the job. > > > > > - Multiple conditions question. The status is showing whether the > job > > > is > > > > > ready or not, so as long as the last condition is the one that is > > > shown - > > > > > all is as expected. I don’t think this needs rules for precedence > and > > > the > > > > > like. > > > > > - The condition’s Reason is going to be more specific. > > > > > > > > > > Gyula and Jeyhun, is the google doc clear enough for you now? Do > you > > > feel > > > > > you feedback has been addressed? Lajith and I are happy to provide > > more > > > > > details. > > > > > > > > > > I wonder whether this change is big enough to warrant a Flip, as it > > is > > > so > > > > > small. We could do this in an issue. WDYT? > > > > > > > > > > Kind regards, David. > > > > > > > > > > > > > > > From: Lajith Koova <lajith...@gmail.com> > > > > > Date: Wednesday, 29 May 2024 at 13:41 > > > > > To: dev@flink.apache.org <dev@flink.apache.org> > > > > > Subject: [EXTERNAL] [DISCUSS] FLIP-XXX Add K8S conditions to Flink > > CRD > > > > > Hello , > > > > > > > > > > > > > > > Discussion thread here: > > > > > https://lists.apache.org/thread/dvy8w17pyjv68c3t962w49frl9odoz4z > to > > > > > discuss a proposal to add Conditions field in the CR status of > Flink > > > > > Deployment and FlinkSessionJob. > > > > > > > > > > > > > > > Note : Starting this new thread as discussion thread title has been > > > > > modified to follow the FLIP process. > > > > > > > > > > > > > > > Thank you. > > > > > > > > > > Unless otherwise stated above: > > > > > > > > > > IBM United Kingdom Limited > > > > > Registered in England and Wales with number 741598 > > > > > Registered office: PO Box 41, North Harbour, Portsmouth, Hants. PO6 > > 3AU > > > > > > > > > > > > > Unless otherwise stated above: > > > > > > > > IBM United Kingdom Limited > > > > Registered in England and Wales with number 741598 > > > > Registered office: PO Box 41, North Harbour, Portsmouth, Hants. PO6 > 3AU > > > > > > > > > > > Unless otherwise stated above: > > > > IBM United Kingdom Limited > > Registered in England and Wales with number 741598 > > Registered office: PO Box 41, North Harbour, Portsmouth, Hants. PO6 3AU > > >