[
https://issues.apache.org/jira/browse/FLINK-39541?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Gyula Fora closed FLINK-39541.
------------------------------
Resolution: Fixed
merged to main 230611a301cf4fe3fd5a0944c9cbf9c04fbfc384
> Improve operator metrics documentation and bundle additional metric reporters
> -----------------------------------------------------------------------------
>
> Key: FLINK-39541
> URL: https://issues.apache.org/jira/browse/FLINK-39541
> Project: Flink
> Issue Type: Improvement
> Components: Kubernetes Operator
> Affects Versions: kubernetes-operator-1.14.0
> Reporter: Dennis-Mircea Ciupitu
> Priority: Major
> Labels: pull-request-available
> Fix For: kubernetes-operator-1.15.0
>
>
> h1. Summary
> The current {{Metrics and Logging}} page of the Flink Kubernetes Operator is
> terse and, in several places, ambiguous or incomplete. It mixes
> {{FlinkDeployment}} / {{FlinkSessionJob}} / {{FlinkBlueGreenDeployment}}
> metrics in a single flat markdown table, does not cover
> {{FlinkStateSnapshot}} or autoscaler metrics at all, provides no explanation
> of how scope formats translate into reporter output, and does not describe
> the operator-side configuration surface for metric reporters. On top of that,
> the operator image bundles only a subset of the Flink metric reporter plugins
> commonly used in production.
> This improvement rewrites the page end-to-end, extends the operator image
> with two additional bundled reporters, and clarifies in code (javadoc) the
> relationship between operator-scoped and Flink-core metric configuration.
> h1. Motivation
> * The existing {{Flink Resource Metrics}} markdown table conflates scope
> (System / Namespace) with resource type ({{FlinkDeployment}} /
> {{FlinkSessionJob}} / {{FlinkBlueGreenDeployment}}) and is hard to read.
> Several metric groups emitted by the operator are not listed at all.
> * Autoscaler metrics ({{AutoScaler.scalings}},
> {{AutoScaler.errors}}, {{AutoScaler.balanced}}, and the per-vertex
> {{AutoScaler.jobVertexID.<jobVertexID>.<ScalingMetric>.Current}} and
> {{.Average}} gauges) are emitted in code but completely undocumented. Users
> have no reference for the placeholder values ({{LAG}}, {{LOAD}},
> {{TRUE_PROCESSING_RATE}}, {{TARGET_DATA_RATE}}, etc.) or for which of
> them also emit an {{.Average}}.
> * {{FlinkStateSnapshot}} checkpoint/savepoint gauges
> ({{Checkpoint.Count}}, {{Checkpoint.State.<SnapshotState>.Count}},
> {{Savepoint.Count}}, {{Savepoint.State.<SnapshotState>.Count}}) are
> emitted by the snapshot tracker but not documented.
> * Namespace-level {{FlinkMinorVersion.<FlinkMinorVersion>.Count}},
> {{ResourceUsage.StateSize}} and
> {{FlinkDeployment.JmDeploymentStatus.<Status>.Count}} are either missing or
> only mentioned in passing.
> * There is no explanation of how scope formats
> ({{<host>.k8soperator.<namespace>.<name>.system}}, etc.) map to actual
> reporter output, and in particular no explanation of the asymmetry between
> non-labeling reporters (SLF4J, JMX, Graphite, …) and labeling reporters
> (Prometheus, Datadog, InfluxDB, …), even though the latter drop scope-format
> literals like system / namespace / resource and surface variables as labels.
> * The {{JOSDK Metrics}} subsection claims metrics are forwarded but does not
> link to the upstream reference where users can look the names up.
> * The operator image bundles the SLF4J, Prometheus, JMX, Graphite, InfluxDB,
> Datadog and StatsD plugins, but not Dropwizard or OpenTelemetry, forcing
> users to build a custom image for either of these common setups.
> * The metric-reporter examples section does not mention that reporter
> options in {{spec.flinkConfiguration}} must use the plain
> {{metrics.reporter.*}} prefix (consumed by the JM/TMs), while the
> operator JVM uses {{kubernetes.operator.metrics.reporter.*}}. This is a
> frequent source of misconfiguration.
> * {{KubernetesOperatorMetricOptions}} only declares operator-specific
> toggles; it is not obvious from the code that every Flink {{metrics.*}} key
> is also honoured when written with the {{kubernetes.operator.}} prefix (the
> prefix is stripped at startup and the remainder is forwarded to Flink's
> metric registry). This routing is undocumented.
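The prefix routing described in the last two bullets can be sketched as follows. This is only an illustrative Python model of the behaviour stated above, not the operator's actual Java code: the function names `operator_metric_config` and `consumer_of` are hypothetical, and the sample keys and values are just example inputs.

```python
OPERATOR_PREFIX = "kubernetes.operator."


def operator_metric_config(operator_conf):
    """Model of the startup step: keep only kubernetes.operator.metrics.* keys
    and strip the kubernetes.operator. prefix, leaving plain metrics.* keys."""
    forwarded = {}
    for key, value in operator_conf.items():
        if key.startswith(OPERATOR_PREFIX + "metrics."):
            forwarded[key[len(OPERATOR_PREFIX):]] = value
    return forwarded


def consumer_of(key):
    """Say which JVM a reporter option is meant for, judged by its prefix."""
    if key.startswith(OPERATOR_PREFIX + "metrics.reporter."):
        return "operator JVM"
    if key.startswith("metrics.reporter."):
        return "JobManager/TaskManagers"
    return "not a reporter option"


# Example operator configuration (values are illustrative assumptions).
operator_conf = {
    "kubernetes.operator.metrics.reporter.slf4j.factory.class":
        "org.apache.flink.metrics.slf4j.Slf4jReporterFactory",
    "kubernetes.operator.metrics.reporter.slf4j.interval": "5 MINUTE",
    "kubernetes.operator.reconcile.interval": "60 s",
}

forwarded = operator_metric_config(operator_conf)
for key in sorted(forwarded):
    print(key, "=", forwarded[key])
```

In this model the non-metric key is left out of the forwarded map, and a plain `metrics.reporter.*` key (as written in `spec.flinkConfiguration`) is attributed to the JobManager/TaskManagers rather than to the operator JVM.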
> h1. Proposed Change
> h2. Documentation
> # *New {{Scope}} section*
> ** Add a typed table of the three scopes (System / Namespace / Resource)
> with their configuration option and default scope format.
> ** Document all scope variables ({{<host>}}, {{<namespace>}},
> {{<name>}}, {{<resourcens>}}, {{<resourcename>}},
> {{<resourcetype>}}).
> ** Add a {{How Metric Identifiers Are Built}} subsection explaining the
> distinction between scope components (variable substitution in the scope
> format) and logical scope (operator metric-group chain: {{k8soperator}},
> {{k8soperator.namespace}}, {{k8soperator.namespace.resource}}).
> ** Explain how non-labeling reporters build the identifier from *scope
> components + metric name*, while labeling reporters build it from *logical
> scope + metric name* and expose scope variables as labels/tags. Include an
> info hint that labeling reporters drop scope-format literals.
> ** Add a {{Concrete Example}} subsection with Prometheus and SLF4J/JMX
> renderings for System, Namespace and Resource scopes, using real metrics
> ({{Lifecycle.State.<State>.TimeSeconds}},
> {{Lifecycle.State.<State>.Count}},
> {{AutoScaler.<jobVertexID>.TRUE_PROCESSING_RATE.Current}}).
> # *Rewritten {{Operator Custom Resource Metrics}} table*
> ** Replace the previous flat markdown table ({{### Flink Resource
> Metrics}}) with a single Scope / Resource type / Metrics / Description /
> Type table, adopting the {{<table class="table table-bordered">}} markup used
> by the core Flink documentation (this is a new styling choice, not a retrofit
> from a pre-existing HTML table).
> ** Group rows by scope (System / Namespace / Resource) and, within each
> scope, by resource type (FlinkBlueGreenDeployment, FlinkDeployment,
> FlinkSessionJob, FlinkStateSnapshot).
> ** Add previously undocumented rows:
> *** {{FlinkBlueGreenDeployment.Failures}},
> {{FlinkBlueGreenDeployment.JobStatus.<Status>.Count}}.
> *** {{FlinkDeployment.FlinkMinorVersion.<FlinkMinorVersion>.Count}},
> {{FlinkDeployment.JmDeploymentStatus.<Status>.Count}},
> {{ResourceUsage.Cpu/Memory/StateSize}}.
> *** All FlinkStateSnapshot checkpoint/savepoint gauges.
> *** All autoscaler resource-scoped rows ({{AutoScaler.scalings}},
> {{AutoScaler.errors}}, {{AutoScaler.balanced}},
> {{AutoScaler.jobVertexID.<jobVertexID>.<ScalingMetric>.Current}}).
> # *New per-topic subsections (narrative + tables), ordered for readability*
> ** {{FlinkDeployment Version and Resource Usage}} (new): introductory
> paragraph on fleet-wide Flink version adoption, capacity / quota monitoring,
> plus bullet list of the involved gauges.
> ** {{FlinkDeployment / FlinkSessionJob Lifecycle metrics}} (renamed from the
> old Lifecycle metrics; clarifies that it covers FlinkDeployment and
> FlinkSessionJob only).
> ** {{FlinkBlueGreenDeployment Lifecycle metrics}} (kept, minor rewording).
> ** {{FlinkDeployment / FlinkSessionJob JobStatus Tracking}} (new): narrative on
> how JmDeploymentStatus complements lifecycle metrics; placed before the
> blue-green JobStatus subsection.
> ** {{FlinkBlueGreenDeployment JobStatus Tracking}} (kept, minor rewording).
> ** {{FlinkStateSnapshot State Tracking}} (new): narrative on detecting stuck
> / failing snapshot pipelines; placed before Scaling metrics.
> ** {{Scaling metrics}} (new): narrative on what the autoscaler counters and
> per-vertex gauges are useful for, plus a new alphabetically-ordered table
> listing every valid {{<ScalingMetric>}} value ({{CATCH_UP_DATA_RATE}},
> {{EXPECTED_PROCESSING_RATE}}, {{LAG}}, {{LOAD}},
> {{MAX_PARALLELISM}}, {{NUM_SOURCE_PARTITIONS}}, {{PARALLELISM}},
> {{RECOMMENDED_PARALLELISM}}, {{SCALE_DOWN_RATE_THRESHOLD}},
> {{SCALE_UP_RATE_THRESHOLD}}, {{TARGET_DATA_RATE}},
> {{TRUE_PROCESSING_RATE}}) with a short description and an {{.Average
> emitted?}} column.
> # *Kubernetes Client Metrics tables*
> ** Convert both existing markdown tables (default +
> {{http.response.code.groups.enabled}}) to the same {{<table class="table
> table-bordered">}} style now used by the new {{Operator Custom Resource
> Metrics}} table, keeping row content but sorting metric names alphabetically
> within each table for stable navigation.
> # *JOSDK Metrics subsection*
> ** Keep the current short paragraph but add a link to the upstream JOSDK
> metrics documentation as the authoritative reference, and note that
> JOSDK-owned metrics are subject to the same scope/reporter rules as the rest
> of the operator metrics.
> # *Metric Reporters section*
> ** Enumerate the reporter plugins actually bundled in the operator image:
> SLF4J, Prometheus, JMX, Graphite, InfluxDB, Datadog, StatsD, Dropwizard,
> OpenTelemetry. Mention that any other {{MetricReporterFactory}} can be added
> by dropping its plugin jar under {{/opt/flink/plugins/<name>/}} in a custom
> image.
> ** New {{Operator-scoped Metric Configuration}} subsection: explain that the
> operator accepts the standard Flink {{metrics.*}} keys under the
> {{kubernetes.operator.metrics.*}} prefix, that the {{kubernetes.operator.}}
> prefix is stripped at startup and the remainder forwarded to the operator's
> Flink metric registry, and that reporter options therefore follow Flink's
> schema verbatim and are not re-declared on the Configuration page.
> ** New {{Configuring Reporters on a FlinkDeployment}} example: clarify that
> reporters for a managed Flink cluster ({{spec.flinkConfiguration}}) use the plain
> {{metrics.reporter.*}} prefix, while
> {{kubernetes.operator.metrics.reporter.*}} is reserved for the operator JVM.
> ** Rename {{How to Enable Prometheus (Example)}} → {{Prometheus}}, rename
> {{Set up Prometheus locally}} → {{Monitoring the Operator with Prometheus}},
> and drop the {{metrics.reporter.prom.interval}} line from the Prometheus
> example (Prometheus is pull-based).
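The difference between non-labeling and labeling reporters described in the Scope item above can be illustrated with a small Python sketch. It is a simplified model under stated assumptions, not Flink's implementation: the helper names are hypothetical, the variable values are invented, pairing `Lifecycle.State.STABLE.Count` with the System scope is for illustration only, and real reporters apply further formatting (prefixes, character replacement) that is not modelled here.

```python
def scope_components(scope_format, variables):
    """Substitute <var> placeholders in a scope format; literals are kept."""
    return [variables.get(part, part) for part in scope_format.split(".")]


def non_labeling_identifier(scope_format, variables, metric_name):
    """Non-labeling reporters: scope components + metric name."""
    return ".".join(scope_components(scope_format, variables) + [metric_name])


def labeling_identifier(logical_scope, variables, metric_name):
    """Labeling reporters: logical scope + metric name, variables as labels."""
    name = logical_scope + "." + metric_name
    labels = {key.strip("<>"): value for key, value in variables.items()}
    return name, labels


# System scope format quoted in the issue; variable values are assumptions.
system_format = "<host>.k8soperator.<namespace>.<name>.system"
variables = {
    "<host>": "host-1",
    "<namespace>": "default",
    "<name>": "flink-kubernetes-operator",
}
metric = "Lifecycle.State.STABLE.Count"

flat = non_labeling_identifier(system_format, variables, metric)
name, labels = labeling_identifier("k8soperator", variables, metric)
print(flat)
print(name, labels)
```

In this model the literal `system` from the scope format appears only in the flat identifier; the labeling variant carries the logical scope in the name and the scope variables as labels, which matches the asymmetry the rewritten page is meant to explain.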
> h2. Image / Packaging
> * Extend the {{maven-dependency-plugin}} {{artifactItems}} to copy
> *flink-metrics-dropwizard* and *flink-metrics-otel* under
> {{${plugins.tmp.dir}/}}, so both plugins end up in
> {{/opt/flink/plugins/}} of the operator image. This matches the updated
> {{Metric Reporters}} documentation.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)