Dennis-Mircea Ciupitu created FLINK-39541:
---------------------------------------------
Summary: Improve operator metrics documentation and bundle
additional metric reporters
Key: FLINK-39541
URL: https://issues.apache.org/jira/browse/FLINK-39541
Project: Flink
Issue Type: Improvement
Components: Kubernetes Operator
Affects Versions: kubernetes-operator-1.14.0
Reporter: Dennis-Mircea Ciupitu
Fix For: kubernetes-operator-1.15.0
h1. Summary
The current {{Metrics and Logging}} page of the Flink Kubernetes Operator is
terse and, in several places, ambiguous or incomplete. It mixes
{{FlinkDeployment}} / {{FlinkSessionJob}} / {{FlinkBlueGreenDeployment}}
metrics in a single flat markdown table, does not cover {{FlinkStateSnapshot}}
or autoscaler metrics at all, provides no explanation of how scope formats
translate into reporter output, and does not describe the operator-side
configuration surface for metric reporters. On top of that, the operator image
bundles only a subset of the Flink metric reporter plugins commonly used in
production.
This improvement rewrites the page end-to-end, extends the operator image with
two additional bundled reporters, and clarifies in code (javadoc) the
relationship between operator-scoped and Flink-core metric configuration.
h1. Motivation
* The existing {{Flink Resource Metrics}} markdown table conflates scope
(System / Namespace) with resource type ({{FlinkDeployment}} /
{{FlinkSessionJob}} / {{FlinkBlueGreenDeployment}}) and is hard to read.
Several metric groups emitted by the operator are not listed at all.
* Autoscaler metrics ({{AutoScaler.scalings}}, {{AutoScaler.errors}},
{{AutoScaler.balanced}}, and the per-vertex
{{AutoScaler.jobVertexID.<jobVertexID>.<ScalingMetric>.Current}} / {{.Average}}
gauges) are emitted in code but completely undocumented. Users have no
reference for the placeholder values ({{LAG}}, {{LOAD}},
{{TRUE_PROCESSING_RATE}}, {{TARGET_DATA_RATE}}, etc.) or for which of
them also emit an {{.Average}} gauge.
* {{FlinkStateSnapshot}} checkpoint/savepoint gauges
({{Checkpoint.Count}}, {{Checkpoint.State.<SnapshotState>.Count}},
{{Savepoint.Count}}, {{Savepoint.State.<SnapshotState>.Count}}) are
emitted by the snapshot tracker but not documented.
* Namespace-level {{FlinkMinorVersion.<FlinkMinorVersion>.Count}},
{{ResourceUsage.StateSize}} and
{{FlinkDeployment.JmDeploymentStatus.<Status>.Count}} are either missing or
only mentioned in passing.
* There is no explanation of how scope formats
({{<host>.k8soperator.<namespace>.<name>.system}}, etc.) map to actual
reporter output, and in particular no explanation of the asymmetry between
non-labeling reporters (SLF4J, JMX, Graphite, …) and labeling reporters
(Prometheus, Datadog, InfluxDB, …), even though the latter drop scope-format
literals like {{system}} / {{namespace}} / {{resource}} and surface variables
as labels.
* The {{JOSDK Metrics}} subsection claims metrics are forwarded but does not
link to the upstream reference where users can look the names up.
* The operator image bundles SLF4J, Prometheus, JMX, Graphite, InfluxDB,
Datadog and StatsD plugins, but not Dropwizard or OpenTelemetry, forcing users
to build a custom image for either of these common setups.
* The metric-reporter examples section does not mention that reporter options
in {{spec.flinkConfiguration}} must use the plain {{metrics.reporter.*}} prefix
(consumed by JM/TMs), while the operator JVM uses
{{kubernetes.operator.metrics.reporter.*}}. This is a frequent source of
misconfiguration.
* {{KubernetesOperatorMetricOptions}} only declares operator-specific toggles;
it is not obvious from the code that every Flink {{metrics.*}} key is also
honoured when written with the {{kubernetes.operator.}} prefix (the prefix is
stripped at startup and the remainder is forwarded to Flink's metric registry).
This routing is undocumented.
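
The prefix routing described in the last bullet can be sketched as follows. This is a minimal illustration, not the operator's actual code: the class {{OperatorMetricsPrefixSketch}} and its method are hypothetical, and the sketch assumes that only keys under {{kubernetes.operator.metrics.}} are forwarded.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for illustration only; the operator's real startup code may differ.
final class OperatorMetricsPrefixSketch {
    // Prefix named in this issue.
    static final String OPERATOR_PREFIX = "kubernetes.operator.";
    // Assumption: only metrics.* keys written with the operator prefix are forwarded.
    static final String OPERATOR_METRICS_PREFIX = OPERATOR_PREFIX + "metrics.";

    /** Returns the Flink-style metrics.* view of operator-prefixed keys; other keys are skipped. */
    static Map<String, String> toFlinkMetricsConfig(Map<String, String> operatorConfig) {
        Map<String, String> flinkConfig = new LinkedHashMap<>();
        for (Map.Entry<String, String> entry : operatorConfig.entrySet()) {
            String key = entry.getKey();
            if (key.startsWith(OPERATOR_METRICS_PREFIX)) {
                // Strip "kubernetes.operator." and keep the remainder, e.g. "metrics.reporter...".
                flinkConfig.put(key.substring(OPERATOR_PREFIX.length()), entry.getValue());
            }
        }
        return flinkConfig;
    }

    public static void main(String[] args) {
        Map<String, String> operatorConfig = new LinkedHashMap<>();
        operatorConfig.put(
                "kubernetes.operator.metrics.reporter.slf4j.factory.class",
                "org.apache.flink.metrics.slf4j.Slf4jReporterFactory");
        operatorConfig.put("kubernetes.operator.metrics.reporter.slf4j.interval", "5 MINUTE");
        // Not a metrics key, so it is not part of the forwarded view.
        operatorConfig.put("kubernetes.operator.reconcile.interval", "60 s");

        toFlinkMetricsConfig(operatorConfig)
                .forEach((key, value) -> System.out.println(key + ": " + value));
    }
}
```

The same reporter written into {{spec.flinkConfiguration}} would use the {{metrics.reporter.slf4j.*}} keys directly, with no operator prefix.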
h1. Proposed Change
h2. Documentation
# *New {{Scope}} section*
** Add a typed table of the three scopes (System / Namespace / Resource) with
their configuration option and default scope format.
** Document all scope variables ({{<host>}}, {{<namespace>}}, {{<name>}},
{{<resourcens>}}, {{<resourcename>}}, {{<resourcetype>}}).
** Add a {{How Metric Identifiers Are Built}} subsection explaining the
distinction between scope components (variable substitution in the scope
format) and logical scope (operator metric-group chain: {{k8soperator}},
{{k8soperator.namespace}}, {{k8soperator.namespace.resource}}).
** Explain how non-labeling reporters build the identifier from *scope
components + metric name*, while labeling reporters build it from *logical
scope + metric name* and expose scope variables as labels/tags. Include an info
hint that labeling reporters drop scope-format literals.
** Add a {{Concrete Example}} subsection with Prometheus and SLF4J/JMX renderings
for System, Namespace and Resource scopes, using real metrics
({{Lifecycle.State.<State>.TimeSeconds}}, {{Lifecycle.State.<State>.Count}},
{{AutoScaler.jobVertexID.<jobVertexID>.TRUE_PROCESSING_RATE.Current}}).
# *Rewritten {{Operator Custom Resource Metrics}} table*
** Replace the previous flat markdown table ({{### Flink Resource Metrics}})
with a single Scope / Resource type / Metrics / Description / Type table,
adopting the {{<table class="table table-bordered">}} markup used by the core
Flink documentation (this is a new styling choice, not a retrofit from a
pre-existing HTML table).
** Group rows by scope (System / Namespace / Resource) and, within each scope,
by resource type (FlinkBlueGreenDeployment, FlinkDeployment,
FlinkSessionJob, FlinkStateSnapshot).
** Add previously undocumented rows:
*** {{FlinkBlueGreenDeployment.Failures}},
{{FlinkBlueGreenDeployment.JobStatus.<Status>.Count}}.
*** {{FlinkDeployment.FlinkMinorVersion.<FlinkMinorVersion>.Count}},
{{FlinkDeployment.JmDeploymentStatus.<Status>.Count}},
{{ResourceUsage.Cpu/Memory/StateSize}}.
*** All FlinkStateSnapshot checkpoint/savepoint gauges.
*** All autoscaler resource-scoped rows ({{AutoScaler.scalings}},
{{AutoScaler.errors}}, {{AutoScaler.balanced}},
{{AutoScaler.jobVertexID.<jobVertexID>.<ScalingMetric>.Current}}).
# *New per-topic subsections (narrative + tables), ordered for readability*
** {{FlinkDeployment Version and Resource Usage}} (new): introductory paragraph
on fleet-wide Flink version adoption and capacity / quota monitoring, plus a
bullet list of the gauges involved.
** {{FlinkDeployment / FlinkSessionJob Lifecycle metrics}} (renamed from the
old Lifecycle metrics; clarifies that it covers FlinkDeployment and
FlinkSessionJob only).
** {{FlinkBlueGreenDeployment Lifecycle metrics}} (kept, minor rewording).
** {{FlinkDeployment / FlinkSessionJob JobStatus Tracking}} (new): narrative on how
{{JmDeploymentStatus}} complements lifecycle metrics; placed before the blue-green
JobStatus subsection.
** {{FlinkBlueGreenDeployment JobStatus Tracking}} (kept, minor rewording).
** {{FlinkStateSnapshot State Tracking}} (new): narrative on detecting stuck /
failing snapshot pipelines; placed before Scaling metrics.
** {{Scaling metrics}} (new): narrative on what the autoscaler counters and
per-vertex gauges are useful for, plus a new alphabetically ordered table
listing every valid {{<ScalingMetric>}} value ({{CATCH_UP_DATA_RATE}},
{{EXPECTED_PROCESSING_RATE}}, {{LAG}}, {{LOAD}}, {{MAX_PARALLELISM}},
{{NUM_SOURCE_PARTITIONS}}, {{PARALLELISM}}, {{RECOMMENDED_PARALLELISM}},
{{SCALE_DOWN_RATE_THRESHOLD}}, {{SCALE_UP_RATE_THRESHOLD}},
{{TARGET_DATA_RATE}}, {{TRUE_PROCESSING_RATE}}) with a short description and an
{{.Average emitted?}} column.
# *Kubernetes Client Metrics tables*
** Convert both existing markdown tables (default +
{{http.response.code.groups.enabled}}) to the same {{<table class="table
table-bordered">}} style now used by the new {{Operator Custom Resource
Metrics}} table, keeping row content but sorting metric names alphabetically
within each table for stable navigation.
# *JOSDK Metrics subsection*
** Keep the current short paragraph but add a link to the upstream JOSDK
metrics documentation as the authoritative reference, and note that JOSDK-owned
metrics are subject to the same scope/reporter rules as the rest of the
operator metrics.
# *Metric Reporters section*
** Enumerate the reporter plugins actually bundled in the operator image:
SLF4J, Prometheus, JMX, Graphite, InfluxDB, Datadog, StatsD, Dropwizard,
OpenTelemetry. Mention that any other {{MetricReporterFactory}} can be added by
dropping its plugin jar under {{/opt/flink/plugins/<name>/}} in a custom image.
** New {{Operator-scoped Metric Configuration}} subsection: explain that the
operator accepts the standard Flink {{metrics.\*}} keys under the
{{kubernetes.operator.metrics.\*}} prefix, that the {{kubernetes.operator.}} prefix
is stripped at startup and the remainder forwarded to the operator's Flink
metric registry, and that reporter options therefore follow Flink's schema
verbatim and are not re-declared on the Configuration page.
** New {{Configuring Reporters on a FlinkDeployment}} example: clarify that
reporters for a managed Flink cluster ({{spec.flinkConfiguration}}) use the plain
{{metrics.reporter.\*}} prefix, while
{{kubernetes.operator.metrics.reporter.\*}} is reserved for the operator JVM.
** Rename {{How to Enable Prometheus (Example)}} → {{Prometheus}}, rename {{Set up
Prometheus locally}} → {{Monitoring the Operator with Prometheus}}, and drop the
{{metrics.reporter.prom.interval}} line from the Prometheus example (Prometheus is
pull-based, so a reporting interval does not apply).
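
The identifier rules proposed for the {{Scope}} section (item 1 above) can be sketched roughly as below. This is an assumption-laden illustration rather than Flink's or the operator's implementation: the class {{MetricIdentifierSketch}}, the variable values and the metric name {{ExampleMetric}} are hypothetical, and reporter-specific post-processing (character filtering, name prefixes) is deliberately left out.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of the two identifier styles described in this issue.
final class MetricIdentifierSketch {

    /** Substitutes scope variables such as <host> into a scope format string. */
    static String scopeComponents(String scopeFormat, Map<String, String> variables) {
        String result = scopeFormat;
        for (Map.Entry<String, String> variable : variables.entrySet()) {
            result = result.replace(variable.getKey(), variable.getValue());
        }
        return result;
    }

    /** Non-labeling reporters: scope components + metric name. */
    static String nonLabelingIdentifier(
            String scopeFormat, Map<String, String> variables, String metricName) {
        return scopeComponents(scopeFormat, variables) + "." + metricName;
    }

    /** Labeling reporters: logical scope + metric name; variables travel separately as labels. */
    static String labelingIdentifier(String logicalScope, String metricName) {
        return logicalScope + "." + metricName;
    }

    public static void main(String[] args) {
        // Illustrative values only.
        Map<String, String> variables = new LinkedHashMap<>();
        variables.put("<host>", "op-host");
        variables.put("<namespace>", "default");
        variables.put("<name>", "flink-kubernetes-operator");

        String systemScopeFormat = "<host>.k8soperator.<namespace>.<name>.system";

        System.out.println(nonLabelingIdentifier(systemScopeFormat, variables, "ExampleMetric"));
        System.out.println(labelingIdentifier("k8soperator", "ExampleMetric") + " labels=" + variables);
    }
}
```

Note how the literal {{system}} from the scope format appears only in the non-labeling form, which is the asymmetry the new documentation is meant to call out.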
h2. Image / Packaging
* Extend the {{maven-dependency-plugin}} {{artifactItems}} to copy
*flink-metrics-dropwizard* and *flink-metrics-otel* under
{{${plugins.tmp.dir}/}}, so both plugins end up in {{/opt/flink/plugins/}} of
the operator image. This matches the updated {{Metric Reporters}} documentation.