Re: Argo CD health check for FlinkDeployment

Xingcan Cui Wed, 16 Nov 2022 07:56:53 -0800

Hi Gyula,

Thanks for the explanation!


The distinction between Flink jobs and FlinkDeployments makes sense! I'll
try to make some changes to Argo CD, and hopefully I can get a review from
you or other Flink-K8s-op contributors then.
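
For concreteness, the mapping from my first question could be sketched
roughly as below. This is Python purely for illustration (the actual Argo CD
check would be a Lua script). The function name `flink_health` and the
messages are made up, and the upper-case state names are an assumption based
on the RUNNING spelling in your reply.

```python
# Illustrative sketch only: a real Argo CD health check is written in Lua.
# `flink_health` and the message strings are hypothetical names.
from typing import Any, Dict

# Assumed upper-case Flink job state names that should count as "Degraded".
DEGRADED_STATES = {"FAILING", "FAILED"}


def flink_health(obj: Dict[str, Any]) -> Dict[str, str]:
    """Map status.jobStatus.state to an Argo CD style health status."""
    # Any level may be missing right after the resource is created.
    status = obj.get("status") or {}
    job_status = status.get("jobStatus") or {}
    state = str(job_status.get("state") or "").upper()

    if state == "RUNNING":
        return {"status": "Healthy", "message": "Job is running"}
    if state in DEGRADED_STATES:
        return {"status": "Degraded", "message": f"Job state is {state}"}
    if not state:
        return {"status": "Progressing", "message": "Waiting for job status"}
    # Every remaining state (restarting, etc.) is treated as "Progressing".
    return {"status": "Progressing", "message": f"Job state is {state}"}
```

Per your point, this reports the health of the Flink job only, not of the
FlinkDeployment as the operator sees it.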

Best,
Xingcan

On Wed, Nov 16, 2022 at 10:40 AM Gyula Fóra <[email protected]> wrote:

> Hi Xingcan!
>
> If you are looking to check the health of the deployed Flink jobs,
> status.jobStatus.state is a good place to start.
> At any given time it should reflect the Flink job status. RUNNING means
> the job is processing data; any other state means it is doing something
> else (restarting, failing, etc.).
>
> The same logic can also be applied to session jobs.
>
> However, I would not really say this is the state of a FlinkDeployment. A
> FlinkDeployment represents more than just a Flink job. Whether the job
> itself is failing or not depends mostly on the job logic.
> The operator cannot fix broken user jobs, so from the operator's
> perspective the FlinkDeployment is healthy as long as we can determine its
> correct status and it is reconciled to the spec that the user requested.
> For more information about this you can check this state diagram:
>
> https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-main/docs/concepts/architecture/#flink-resource-lifecycle
>
> A side note: while it is true that the operator is in active development,
> the CRD (spec, status) has not changed significantly in the couple of
> months since the initial stable release (1.0.0).
> The jobStatus is one part that has not changed at all.
>
> Cheers,
> Gyula
>
> On Wed, Nov 16, 2022 at 4:21 PM Xingcan Cui <[email protected]> wrote:
>
> > Hi all,
> >
> > We are exploring Argo CD to manage `FlinkDeployment` resources but
> > noticed that the health checking for it doesn't work properly.
> >
> > To give you some context, Argo CD uses Lua scripts to check some
> > state-related fields and map them to three status values: "Healthy",
> > "Progressing" and "Degraded". The current implementation
> > <https://github.com/argoproj/argo-cd/pull/9300> uses some legacy fields
> > (e.g., status.reconciliationStatus.success) that were removed
> > <https://github.com/apache/flink-kubernetes-operator/pull/165/files#diff-77c3de65b7bd2db04eeeae370a85cec77f7d7eb22ef801ef11305ede88cb315a>
> > a long time ago. Thus users will always get the "Progressing" status.
> >
> > To fix the issue, we plan to re-implement the health checking logic. We
> > have three questions here.
> >
> > 1. Is it reasonable to simply use "obj.status.jobStatus.state" as the
> > indicator, i.e., map "RUNNING" to "Healthy", map "FAILING" and "FAILED"
> > to "Degraded", and map the remaining states to "Progressing"?
> > 2. I know the Flink-K8s-operator project is still in active development.
> > Given that the health checking logic is coupled with the state fields,
> > I'm curious whether they are stable now.
> > 3. Can we apply the same logic to "FlinkSessionJob"?
> >
> > Thanks,
> > Xingcan
> >
>
