Philippe Gref-Viau created FLINK-36932:
------------------------------------------
Summary: Add resource-level metrics for different status/states to
flink-kubernetes-operator
Key: FLINK-36932
URL: https://issues.apache.org/jira/browse/FLINK-36932
Project: Flink
Issue Type: Improvement
Components: Kubernetes Operator, Runtime / Metrics
Reporter: Philippe Gref-Viau
Operator-specific metrics were introduced as part of FLINK-26953. These metrics
are useful from a high-level reporting point of view (e.g. X many
FlinkDeployments are in state Y across the namespace), but they give no
insight into the states/statuses of _individual_ (i.e. resource-level)
deployments. For example, there's currently no good signal to indicate whether
a particular deployment is in a given lifecycle state.
As part of our daily operational routine, we have found this lack of
resource-level metrics painful, since we cannot create graphs or alerts that
show the names of failing deployments. We can always turn to the metrics emitted
by Flink itself (e.g. the {{<jobStatus>State}} Gauge metric available on the
JobManager), which are "faceted" by the job/deployment name, but in some cases a
problem can occur before the jobs ever get to run and/or before their metrics
even get a chance to be emitted. In addition, those metrics do not cover every
status/state (i.e. lifecycle states).
Furthermore, the current set of metrics emitted for FlinkDeployments includes
namespace-level counts for each Job Manager state, but it does not include
count metrics for each Job status. Again, we can turn to metrics emitted
directly by Flink itself, but we run into the limitations mentioned above.
As such, we propose the following changes:
* Extending all of the existing "counter-based" metrics related to
status/state, so that each status/state also has a resource-level,
"gauge-based" metric that tracks whether each deployment (or the related
sub-resource, i.e. job/job manager) is in a given status/state
* Adding metrics to track the total count of Jobs in each status (by
namespace), and a gauge-based metric for each Job status (by deployment)
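The two proposed changes can be sketched in plain Java as follows. This is only an illustration under stated assumptions, not the code from our fork: the class {{ResourceStatusMetrics}} and its methods are hypothetical, and the gauges are modelled as simple suppliers rather than being registered through Flink's metric groups. Only the job status values are taken from this ticket.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.IntSupplier;
import java.util.function.LongSupplier;

// Hypothetical sketch: tracks the latest observed job status per deployment
// and exposes resource-level (0/1) and namespace-level (count) gauges.
class ResourceStatusMetrics {

    // Job status values as listed in this ticket.
    public enum JobStatus {
        CANCELED, CANCELLING, CREATED, FAILED, FAILING, FINISHED,
        INITIALIZING, RECONCILING, RESTARTING, RUNNING, SUSPENDED
    }

    // Deployment name -> most recently observed job status (one namespace).
    private final Map<String, JobStatus> current = new ConcurrentHashMap<>();

    // Called whenever a deployment's status is observed.
    public void update(String deployment, JobStatus status) {
        current.put(deployment, status);
    }

    // Called when a deployment is deleted, so it no longer contributes.
    public void remove(String deployment) {
        current.remove(deployment);
    }

    // Resource-level gauge: 1 if the deployment is in the status, else 0.
    public IntSupplier inStatusGauge(String deployment, JobStatus status) {
        return () -> current.get(deployment) == status ? 1 : 0;
    }

    // Namespace-level gauge: number of deployments currently in the status.
    public LongSupplier countGauge(JobStatus status) {
        return () -> current.values().stream().filter(s -> s == status).count();
    }

    public static void main(String[] args) {
        ResourceStatusMetrics metrics = new ResourceStatusMetrics();
        metrics.update("deployment-a", JobStatus.RUNNING);
        metrics.update("deployment-b", JobStatus.FAILED);

        IntSupplier failedB = metrics.inStatusGauge("deployment-b", JobStatus.FAILED);
        LongSupplier running = metrics.countGauge(JobStatus.RUNNING);
        System.out.println("deployment-b FAILED gauge: " + failedB.getAsInt());
        System.out.println("RUNNING count: " + running.getAsLong());
    }
}
```

Because each gauge reads the shared map lazily, a graph or alert keyed on the deployment name picks up a status change as soon as the next update is recorded.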
Another way to present the suggested changes is to show what new items would be
added in the "Flink Resource Metrics" table shown on
[this|https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-main/docs/operations/metrics-logging/#flink-resource-metrics]
page:
||Scope||Metrics||Description||Type||
|Resource|FlinkDeployment.JmDeploymentStatus.<Status>.InStatus|For a given Job
Manager deployment status <Status>, return 1 if the Job Manager associated with
the FlinkDeployment is currently in that status, otherwise return 0. <Status>
can take values from: READY, DEPLOYED_NOT_READY, DEPLOYING, MISSING,
ERROR|Gauge|
|Resource|FlinkDeployment.JobStatus.<Status>.InStatus|For a given job status
<Status>, return 1 if the job associated with the FlinkDeployment is currently
in that status, otherwise return 0. <Status> can take values from: CANCELED,
CANCELLING, CREATED, FAILED, FAILING, FINISHED, INITIALIZING, RECONCILING,
RESTARTING, RUNNING, SUSPENDED|Gauge|
|Namespace|FlinkDeployment.JobStatus.<Status>.Count|Number of managed
FlinkDeployment resources per <Status> per namespace. <Status> can take values
from: CANCELED, CANCELLING, CREATED, FAILED, FAILING, FINISHED, INITIALIZING,
RECONCILING, RESTARTING, RUNNING, SUSPENDED|Gauge|
|Resource|FlinkDeployment/FlinkSessionJob.Lifecycle.State.<State>.InState|For a
given lifecycle state <State>, return 1 if the managed resource is currently in
that state, otherwise return 0. <State> can take values from: CREATED,
SUSPENDED, UPGRADING, DEPLOYED, STABLE, ROLLING_BACK, ROLLED_BACK, FAILED|Gauge|
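To make the placeholder notation in the table concrete, the sketch below expands the lifecycle row into the individual gauge names it stands for. It is illustrative only: the class {{ProposedMetricNames}} is hypothetical, and the names are built by plain string concatenation, which is an assumption and not how the operator composes metric identifiers. The state values and the name pattern are the ones from the table.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper that enumerates the proposed lifecycle gauge names.
class ProposedMetricNames {

    // Resource types covered by the lifecycle row of the table.
    static final String[] RESOURCE_TYPES = {"FlinkDeployment", "FlinkSessionJob"};

    // Lifecycle states as listed in the table.
    static final String[] LIFECYCLE_STATES = {
        "CREATED", "SUSPENDED", "UPGRADING", "DEPLOYED",
        "STABLE", "ROLLING_BACK", "ROLLED_BACK", "FAILED"
    };

    // One gauge name per (resource type, lifecycle state) pair.
    static List<String> lifecycleGaugeNames() {
        List<String> names = new ArrayList<>();
        for (String resourceType : RESOURCE_TYPES) {
            for (String state : LIFECYCLE_STATES) {
                names.add(resourceType + ".Lifecycle.State." + state + ".InState");
            }
        }
        return names;
    }

    public static void main(String[] args) {
        lifecycleGaugeNames().forEach(System.out::println);
    }
}
```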
We've actually already implemented these changes in our fork of the
flink-kubernetes-operator codebase, and it's been working pretty well. At this
point, we're interested in merging the changes back into the main branch to
avoid diverging from the releases, share the improvement with the rest of the
community, and get some feedback on our implementation.