Dennis-Mircea Ciupitu created FLINK-39541:
---------------------------------------------

             Summary: Improve operator metrics documentation and bundle 
additional metric reporters
                 Key: FLINK-39541
                 URL: https://issues.apache.org/jira/browse/FLINK-39541
             Project: Flink
          Issue Type: Improvement
          Components: Kubernetes Operator
    Affects Versions: kubernetes-operator-1.14.0
            Reporter: Dennis-Mircea Ciupitu
             Fix For: kubernetes-operator-1.15.0


h1. Summary

The current {{Metrics and Logging}} page of the Flink Kubernetes Operator is 
terse and, in several places, ambiguous or incomplete. It mixes 
{{FlinkDeployment}} / {{FlinkSessionJob}} / {{FlinkBlueGreenDeployment}} 
metrics in a single flat markdown table, does not cover {{FlinkStateSnapshot}} 
or autoscaler metrics at all, provides no explanation of how scope formats 
translate into reporter output, and does not describe the operator-side 
configuration surface for metric reporters. On top of that, the operator image 
bundles only a subset of the Flink metric reporter plugins commonly used in 
production.

This improvement rewrites the page end-to-end, extends the operator image with 
two additional bundled reporters, and clarifies in code (javadoc) the 
relationship between operator-scoped and Flink-core metric configuration.
h1. Motivation
 * The existing {{Flink Resource Metrics}} markdown table conflates scope 
(System / Namespace) with resource type ({{FlinkDeployment}} / 
{{FlinkSessionJob}} / {{FlinkBlueGreenDeployment}}) and is hard to read. 
Several metric groups emitted by the operator are not listed at all.
 * Autoscaler metrics ({{AutoScaler.scalings}}, {{AutoScaler.errors}}, 
{{AutoScaler.balanced}}, and the per-vertex 
{{AutoScaler.jobVertexID.<jobVertexID>.<ScalingMetric>.{Current,Average}}} 
gauges) are emitted in code but completely undocumented. Users have no 
reference for the placeholder values ({{LAG}}, {{LOAD}}, 
{{TRUE_PROCESSING_RATE}}, {{TARGET_DATA_RATE}}, etc.) or for which of 
them also emit an {{.Average}} gauge.
 * {{FlinkStateSnapshot}} checkpoint/savepoint gauges 
({{Checkpoint.Count}}, {{Checkpoint.State.<SnapshotState>.Count}}, 
{{Savepoint.Count}}, {{Savepoint.State.<SnapshotState>.Count}}) are 
emitted by the snapshot tracker but not documented.
 * Namespace-level {{FlinkMinorVersion.<FlinkMinorVersion>.Count}}, 
{{ResourceUsage.StateSize}} and 
{{FlinkDeployment.JmDeploymentStatus.<Status>.Count}} are either missing or 
only mentioned in passing.
 * There is no explanation of how scope formats 
({{<host>.k8soperator.<namespace>.<name>.system}}, etc.) map to actual 
reporter output. In particular, the asymmetry between non-labeling reporters 
(SLF4J, JMX, Graphite, …) and labeling reporters (Prometheus, Datadog, 
InfluxDB, …) is not explained, even though the latter drop scope-format 
literals such as {{system}} / {{namespace}} / {{resource}} and surface 
variables as labels.
 * The {{JOSDK Metrics}} subsection claims metrics are forwarded but does not 
link to the upstream reference where users can look the names up.
 * The operator image bundles SLF4J, Prometheus, JMX, Graphite, InfluxDB, 
Datadog and StatsD plugins, but not Dropwizard or OpenTelemetry, forcing users 
to build a custom image for either of these common setups.
 * The metric-reporter examples section does not mention that reporter options 
in {{spec.flinkConfiguration}} must use the plain {{metrics.reporter.*}} prefix 
(consumed by JM/TMs), while the operator JVM uses 
{{kubernetes.operator.metrics.reporter.*}}. This is a frequent source of 
misconfiguration.
 * {{KubernetesOperatorMetricOptions}} only declares operator-specific toggles; 
it is not obvious from the code that every Flink {{metrics.*}} key is also 
honoured when written with the {{kubernetes.operator.}} prefix (the prefix is 
stripped at startup and the remainder is forwarded to Flink's metric registry). 
This routing is undocumented.
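
As a minimal sketch of the routing described in the last bullet, the Python 
snippet below models how {{kubernetes.operator.metrics.*}} keys could map to 
plain Flink {{metrics.*}} keys once the {{kubernetes.operator.}} prefix is 
stripped. The function name {{to_flink_metric_config}} and the sample values 
are hypothetical; this illustrates the described behaviour and is not the 
operator's actual implementation.

```python
# Illustrative model only; names and sample values are assumptions.
OPERATOR_PREFIX = "kubernetes.operator."


def to_flink_metric_config(operator_conf):
    """Return the Flink-style metrics.* view of operator-prefixed keys."""
    flink_conf = {}
    for key, value in operator_conf.items():
        if key.startswith(OPERATOR_PREFIX + "metrics."):
            # Drop only the leading operator prefix; the rest is a plain Flink key.
            flink_conf[key[len(OPERATOR_PREFIX):]] = value
    return flink_conf


operator_conf = {
    "kubernetes.operator.metrics.reporter.slf4j.factory.class":
        "org.apache.flink.metrics.slf4j.Slf4jReporterFactory",
    "kubernetes.operator.metrics.reporter.slf4j.interval": "5 MINUTE",
    # Not a metrics key, so it is not forwarded in this model.
    "kubernetes.operator.reconcile.interval": "60 s",
}

flink_view = to_flink_metric_config(operator_conf)
for key in sorted(flink_view):
    print(key, "=", flink_view[key])
```

In this model, keys written without the prefix (the form used in 
{{spec.flinkConfiguration}} for JM/TMs) are deliberately not picked up by the 
operator-side view.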

h1. Proposed Change
h2. Documentation
 # *New {{Scope}} section*
 ** Add a typed table of the three scopes (System / Namespace / Resource) with 
their configuration option and default scope format.
 ** Document all scope variables ({{<host>}}, {{<namespace>}}, {{<name>}}, 
{{<resourcens>}}, {{<resourcename>}}, {{<resourcetype>}}).
 ** Add a {{How Metric Identifiers Are Built}} subsection explaining the 
distinction between scope components (variable substitution in the scope 
format) and logical scope (operator metric-group chain: {{k8soperator}}, 
{{k8soperator.namespace}}, {{k8soperator.namespace.resource}}).
 ** Explain how non-labeling reporters build the identifier from *scope 
components + metric name*, while labeling reporters build it from *logical 
scope + metric name* and expose scope variables as labels/tags. Include an info 
hint that labeling reporters drop scope-format literals.
 ** Add a {{Concrete Example}} subsection with Prometheus and SLF4J/JMX 
renderings for System, Namespace and Resource scopes, using real metrics 
({{Lifecycle.State.<State>.TimeSeconds}}, {{Lifecycle.State.<State>.Count}}, 
{{AutoScaler.jobVertexID.<jobVertexID>.TRUE_PROCESSING_RATE.Current}}).
 # *Rewritten {{Operator Custom Resource Metrics}} table*
 ** Replace the previous flat markdown table ({{### Flink Resource Metrics}}) 
with a single Scope / Resource type / Metrics / Description / Type table, 
adopting the {{<table class="table table-bordered">}} markup used by the core 
Flink documentation (this is a new styling choice, not a retrofit from a 
pre-existing HTML table).
 ** Group rows by scope (System / Namespace / Resource) and, within each scope, 
by resource type (FlinkBlueGreenDeployment, FlinkDeployment, FlinkSessionJob, 
FlinkStateSnapshot).
 ** Add previously undocumented rows:
 *** {{FlinkBlueGreenDeployment.Failures}}, 
{{FlinkBlueGreenDeployment.JobStatus.<Status>.Count}}.
 *** {{FlinkDeployment.FlinkMinorVersion.<FlinkMinorVersion>.Count}}, 
{{FlinkDeployment.JmDeploymentStatus.<Status>.Count}}, 
{{ResourceUsage.Cpu/Memory/StateSize}}.
 *** All FlinkStateSnapshot checkpoint/savepoint gauges.
 *** All autoscaler resource-scoped rows ({{AutoScaler.scalings}}, 
{{AutoScaler.errors}}, {{AutoScaler.balanced}}, 
{{AutoScaler.jobVertexID.<jobVertexID>.<ScalingMetric>.Current}}).
 # *New per-topic subsections (narrative + tables), ordered for readability*
 ** {{FlinkDeployment Version and Resource Usage}} (new): introductory paragraph 
on fleet-wide Flink version adoption and capacity / quota monitoring, plus a 
bullet list of the gauges involved.
 ** {{FlinkDeployment / FlinkSessionJob Lifecycle metrics}} (renamed from the 
old {{Lifecycle metrics}}; clarifies that it covers FlinkDeployment and 
FlinkSessionJob only).
 ** {{FlinkBlueGreenDeployment Lifecycle metrics}} (kept, minor rewording).
 ** {{FlinkDeployment / FlinkSessionJob JobStatus Tracking}} (new): narrative on 
how {{JmDeploymentStatus}} complements lifecycle metrics; placed before the 
blue-green JobStatus subsection.
 ** {{FlinkBlueGreenDeployment JobStatus Tracking}} (kept, minor rewording).
 ** {{FlinkStateSnapshot State Tracking}} (new): narrative on detecting stuck / 
failing snapshot pipelines; placed before {{Scaling metrics}}.
 ** {{Scaling metrics}} (new): narrative on what the autoscaler counters and 
per-vertex gauges are useful for, plus a new alphabetically ordered table 
listing every valid {{<ScalingMetric>}} value ({{CATCH_UP_DATA_RATE}}, 
{{EXPECTED_PROCESSING_RATE}}, {{LAG}}, {{LOAD}}, {{MAX_PARALLELISM}}, 
{{NUM_SOURCE_PARTITIONS}}, {{PARALLELISM}}, {{RECOMMENDED_PARALLELISM}}, 
{{SCALE_DOWN_RATE_THRESHOLD}}, {{SCALE_UP_RATE_THRESHOLD}}, 
{{TARGET_DATA_RATE}}, {{TRUE_PROCESSING_RATE}}) with a short description and an 
{{.Average emitted?}} column.
 # *Kubernetes Client Metrics tables*
 ** Convert both existing markdown tables (default + 
{{http.response.code.groups.enabled}}) to the same 
{{<table class="table table-bordered">}} style now used by the new 
{{Operator Custom Resource Metrics}} table, keeping row content but sorting 
metric names alphabetically within each table for stable navigation.
 # *JOSDK Metrics subsection*
 ** Keep the current short paragraph but add a link to the upstream JOSDK 
metrics documentation as the authoritative reference, and note that JOSDK-owned 
metrics are subject to the same scope/reporter rules as the rest of the 
operator metrics.
 # *Metric Reporters section*
 ** Enumerate the reporter plugins actually bundled in the operator image: 
SLF4J, Prometheus, JMX, Graphite, InfluxDB, Datadog, StatsD, Dropwizard, 
OpenTelemetry. Mention that any other {{MetricReporterFactory}} can be added by 
dropping its plugin jar under {{/opt/flink/plugins/<name>/}} in a custom image.
 ** New {{Operator-scoped Metric Configuration}} subsection: explain that the 
operator accepts the standard Flink {{metrics.*}} keys under the 
{{kubernetes.operator.metrics.*}} prefix, that the {{kubernetes.operator.}} 
prefix is stripped at startup and the remainder forwarded to the operator's 
Flink metric registry, and that reporter options therefore follow Flink's 
schema verbatim and are not re-declared on the Configuration page.
 ** New {{Configuring Reporters on a FlinkDeployment}} example: clarify that 
reporters for a managed Flink cluster ({{spec.flinkConfiguration}}) use the 
plain {{metrics.reporter.*}} prefix, while 
{{kubernetes.operator.metrics.reporter.*}} is reserved for the operator JVM.
 ** Rename {{How to Enable Prometheus (Example)}} → {{Prometheus}}, rename 
{{Set up Prometheus locally}} → {{Monitoring the Operator with Prometheus}}, 
and drop the {{metrics.reporter.prom.interval}} line from the Prometheus 
example (Prometheus is pull-based).
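
The distinction that the proposed {{How Metric Identifiers Are Built}} 
subsection draws between scope components and logical scope can be sketched 
roughly as follows. This is a simplified Python model under stated 
assumptions: the helper names, the variable values and the metric name 
{{Example.Count}} are hypothetical, and real reporters additionally apply 
their own delimiters, prefixes and character filtering, which are omitted 
here.

```python
# Simplified model; helper names, values and the metric name are assumptions.
SYSTEM_SCOPE_FORMAT = "<host>.k8soperator.<namespace>.<name>.system"
SYSTEM_LOGICAL_SCOPE = "k8soperator"


def substitute(scope_format, variables):
    """Replace each <var> placeholder in a scope format with its value."""
    out = scope_format
    for name, value in variables.items():
        out = out.replace("<" + name + ">", value)
    return out


def non_labeling_identifier(scope_format, variables, metric_name):
    # Scope components (after substitution, literals kept) + metric name.
    return substitute(scope_format, variables) + "." + metric_name


def labeling_identifier(logical_scope, variables, metric_name):
    # Logical scope + metric name; scope variables travel separately as labels,
    # so scope-format literals such as "system" do not appear in the name.
    return logical_scope + "." + metric_name, dict(variables)


variables = {
    "host": "node-1",
    "namespace": "flink-operator",
    "name": "flink-kubernetes-operator",
}

flat = non_labeling_identifier(SYSTEM_SCOPE_FORMAT, variables, "Example.Count")
labeled_name, labels = labeling_identifier(
    SYSTEM_LOGICAL_SCOPE, variables, "Example.Count"
)
print(flat)
print(labeled_name, labels)
```

The first form carries every variable and literal in the identifier itself, 
while the second keeps the name short and leaves filtering by host, namespace 
or name to the label set.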

h2. Image / Packaging
 * Extend the {{maven-dependency-plugin}} {{artifactItems}} to copy 
*flink-metrics-dropwizard* and *flink-metrics-otel* under 
{{${plugins.tmp.dir}/}}, so both plugins end up in {{/opt/flink/plugins/}} of 
the operator image. This matches the updated {{Metric Reporters}} documentation.



