[ 
https://issues.apache.org/jira/browse/FLINK-39541?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gyula Fora closed FLINK-39541.
------------------------------
    Resolution: Fixed

Merged to main: 230611a301cf4fe3fd5a0944c9cbf9c04fbfc384

> Improve operator metrics documentation and bundle additional metric reporters
> -----------------------------------------------------------------------------
>
>                 Key: FLINK-39541
>                 URL: https://issues.apache.org/jira/browse/FLINK-39541
>             Project: Flink
>          Issue Type: Improvement
>          Components: Kubernetes Operator
>    Affects Versions: kubernetes-operator-1.14.0
>            Reporter: Dennis-Mircea Ciupitu
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: kubernetes-operator-1.15.0
>
>
> h1. Summary
> The current {{Metrics and Logging}} page of the Flink Kubernetes Operator is 
> terse and, in several places, ambiguous or incomplete. It mixes 
> {{FlinkDeployment}} / {{FlinkSessionJob}} / {{FlinkBlueGreenDeployment}} 
> metrics in a single flat markdown table, does not cover 
> {{FlinkStateSnapshot}} or autoscaler metrics at all, provides no explanation 
> of how scope formats translate into reporter output, and does not describe 
> the operator-side configuration surface for metric reporters. On top of that, 
> the operator image bundles only a subset of the Flink metric reporter plugins 
> commonly used in production.
> This improvement rewrites the page end-to-end, extends the operator image 
> with two additional bundled reporters, and clarifies in code (javadoc) the 
> relationship between operator-scoped and Flink-core metric configuration.
> h1. Motivation
>  * The existing {{Flink Resource Metrics}} markdown table conflates scope 
> (System / Namespace) with resource type ({{FlinkDeployment}} / 
> {{FlinkSessionJob}} / {{FlinkBlueGreenDeployment}}) and is hard to read. 
> Several metric groups emitted by the operator are not listed at all.
>  * Autoscaler metrics ({{AutoScaler.scalings}}, 
> {{AutoScaler.errors}}, {{AutoScaler.balanced}}, and the per-vertex 
> {{AutoScaler.jobVertexID.<jobVertexID>.<ScalingMetric>.{Current,Average}}} 
> gauges) are emitted in code but completely undocumented. Users have no 
> reference for the placeholder values ({{LAG}}, {{LOAD}}, 
> {{TRUE_PROCESSING_RATE}}, {{TARGET_DATA_RATE}}, etc.) or for which of 
> them also emit an {{.Average}} gauge.
>  * {{FlinkStateSnapshot}} checkpoint/savepoint gauges 
> ({{Checkpoint.Count}}, {{Checkpoint.State.<SnapshotState>.Count}}, 
> {{Savepoint.Count}}, {{Savepoint.State.<SnapshotState>.Count}}) are 
> emitted by the snapshot tracker but not documented.
>  * Namespace-level {{FlinkMinorVersion.<FlinkMinorVersion>.Count}}, 
> {{ResourceUsage.StateSize}} and 
> {{FlinkDeployment.JmDeploymentStatus.<Status>.Count}} are either missing or 
> only mentioned in passing.
>  * There is no explanation of how scope formats 
> ({{<host>.k8soperator.<namespace>.<name>.system}}, etc.) map to actual 
> reporter output, and in particular no explanation of the asymmetry between 
> non-labeling reporters (SLF4J, JMX, Graphite, …) and labeling reporters 
> (Prometheus, Datadog, InfluxDB, …), even though the latter drop scope-format 
> literals like {{system}} / {{namespace}} / {{resource}} and surface variables 
> as labels.
>  * The {{JOSDK Metrics}} subsection claims metrics are forwarded but does not 
> link to the upstream reference where users can look the names up.
>  * The operator image bundles SLF4J, Prometheus, JMX, Graphite, InfluxDB, 
> Datadog and StatsD plugins, but not Dropwizard or OpenTelemetry, forcing 
> users to build a custom image for either of these common setups.
>  * The metric-reporter examples section does not mention that reporter 
> options in {{spec.flinkConfiguration}} must use the plain 
> {{metrics.reporter.*}} prefix (consumed by JM/TMs), while the 
> operator JVM uses {{kubernetes.operator.metrics.reporter.*}}. This is a 
> frequent source of misconfiguration.
>  * {{KubernetesOperatorMetricOptions}} only declares operator-specific 
> toggles; it is not obvious from the code that every Flink {{metrics.*}} key 
> is also honoured when written with the {{kubernetes.operator.}} prefix (the 
> prefix is stripped at startup and the remainder is forwarded to Flink's 
> metric registry). This routing is undocumented.
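
The prefix routing described in the last bullet can be illustrated with a minimal sketch. It assumes a plain `Map<String, String>` in place of Flink's configuration classes; the class and method names (`OperatorMetricPrefixSketch`, `toFlinkMetricConfig`) are hypothetical and are not taken from the operator code base. Only the `kubernetes.operator.` prefix and the `metrics.*` key family come from the issue text.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration only; not the operator's actual implementation.
public class OperatorMetricPrefixSketch {
    static final String OPERATOR_PREFIX = "kubernetes.operator.";
    static final String METRICS_PREFIX = "metrics.";

    // Keeps only keys of the form kubernetes.operator.metrics.* and strips the
    // kubernetes.operator. prefix, so the result uses Flink's own metrics.* keys.
    static Map<String, String> toFlinkMetricConfig(Map<String, String> operatorConf) {
        Map<String, String> out = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : operatorConf.entrySet()) {
            String key = e.getKey();
            if (key.startsWith(OPERATOR_PREFIX + METRICS_PREFIX)) {
                out.put(key.substring(OPERATOR_PREFIX.length()), e.getValue());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, String> conf = new LinkedHashMap<>();
        // Reporter name "prom" and port value are example choices.
        conf.put("kubernetes.operator.metrics.reporter.prom.factory.class",
                "org.apache.flink.metrics.prometheus.PrometheusReporterFactory");
        conf.put("kubernetes.operator.metrics.reporter.prom.port", "9999");
        // A non-metrics operator option; this sketch leaves it out of the result.
        conf.put("kubernetes.operator.reconcile.interval", "60 s");

        toFlinkMetricConfig(conf).forEach((k, v) -> System.out.println(k + ": " + v));
    }
}
```

In this sketch the two reporter keys come out as `metrics.reporter.prom.*`, which is the same form a user would write directly in {{spec.flinkConfiguration}} for a managed cluster.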
> h1. Proposed Change
> h2. Documentation
>  # *New {{Scope}} section*
>  ** Add a typed table of the three scopes (System / Namespace / Resource) 
> with their configuration option and default scope format.
>  ** Document all scope variables ({{<host>}}, {{<namespace>}}, 
> {{<name>}}, {{<resourcens>}}, {{<resourcename>}}, 
> {{<resourcetype>}}).
>  ** Add a {{How Metric Identifiers Are Built}} subsection explaining the 
> distinction between scope components (variable substitution in the scope 
> format) and logical scope (operator metric-group chain: {{k8soperator}}, 
> {{k8soperator.namespace}}, {{k8soperator.namespace.resource}}).
>  ** Explain how non-labeling reporters build the identifier from *scope 
> components + metric name*, while labeling reporters build it from *logical 
> scope + metric name* and expose scope variables as labels/tags. Include an 
> info hint that labeling reporters drop scope-format literals.
>  ** Add a {{Concrete Example}} subsection with Prometheus and SLF4J/JMX 
> renderings for System, Namespace and Resource scopes, using real metrics 
> ({{Lifecycle.State.<State>.TimeSeconds}}, 
> {{Lifecycle.State.<State>.Count}}, 
> {{AutoScaler.<jobVertexID>.TRUE_PROCESSING_RATE.Current}}).
>  # *Rewritten {{Operator Custom Resource Metrics}} table*
>  ** Replace the previous flat markdown table ({{### Flink Resource 
> Metrics}}) with a single Scope / Resource type / Metrics / Description / 
> Type table, adopting the {{<table class="table table-bordered">}} markup used 
> by the core Flink documentation (this is a new styling choice, not a retrofit 
> from a pre-existing HTML table).
>  ** Group rows by scope (System / Namespace / Resource) and, within each 
> scope, by resource type (FlinkBlueGreenDeployment, FlinkDeployment, 
> FlinkSessionJob, FlinkStateSnapshot).
>  ** Add previously undocumented rows:
>  *** {{FlinkBlueGreenDeployment.Failures}}, 
> {{FlinkBlueGreenDeployment.JobStatus.<Status>.Count}}.
>  *** {{FlinkDeployment.FlinkMinorVersion.<FlinkMinorVersion>.Count}}, 
> {{FlinkDeployment.JmDeploymentStatus.<Status>.Count}}, 
> {{ResourceUsage.Cpu/Memory/StateSize}}.
>  *** All FlinkStateSnapshot checkpoint/savepoint gauges.
>  *** All autoscaler resource-scoped rows ({{AutoScaler.scalings}}, 
> {{AutoScaler.errors}}, {{AutoScaler.balanced}}, 
> {{AutoScaler.jobVertexID.<jobVertexID>.<ScalingMetric>.Current}}).
>  # *New per-topic subsections (narrative + tables), ordered for readability*
>  ** {{FlinkDeployment Version and Resource Usage}} (new): introductory 
> paragraph on fleet-wide Flink version adoption, capacity / quota monitoring, 
> plus bullet list of the involved gauges.
>  ** {{FlinkDeployment / FlinkSessionJob Lifecycle metrics}} (renamed from the 
> old Lifecycle metrics; clarifies that it covers FlinkDeployment and 
> FlinkSessionJob only).
>  ** {{FlinkBlueGreenDeployment Lifecycle metrics}} (kept, minor rewording).
>  ** {{FlinkDeployment / FlinkSessionJob JobStatus Tracking}} (new): narrative 
> on how {{JmDeploymentStatus}} complements lifecycle metrics; placed before the 
> blue-green JobStatus subsection.
>  ** {{FlinkBlueGreenDeployment JobStatus Tracking}} (kept, minor rewording).
>  ** {{FlinkStateSnapshot State Tracking}} (new): narrative on detecting stuck 
> / failing snapshot pipelines; placed before Scaling metrics.
>  ** {{Scaling metrics}} (new): narrative on what the autoscaler counters and 
> per-vertex gauges are useful for, plus a new alphabetically ordered table 
> listing every valid {{<ScalingMetric>}} value ({{CATCH_UP_DATA_RATE}}, 
> {{EXPECTED_PROCESSING_RATE}}, {{LAG}}, {{LOAD}}, 
> {{MAX_PARALLELISM}}, {{NUM_SOURCE_PARTITIONS}}, {{PARALLELISM}}, 
> {{RECOMMENDED_PARALLELISM}}, {{SCALE_DOWN_RATE_THRESHOLD}}, 
> {{SCALE_UP_RATE_THRESHOLD}}, {{TARGET_DATA_RATE}}, 
> {{TRUE_PROCESSING_RATE}}) with a short description and an {{.Average 
> emitted?}} column.
>  # *Kubernetes Client Metrics tables*
>  ** Convert both existing markdown tables (default + 
> {{http.response.code.groups.enabled}}) to the same {{<table class="table 
> table-bordered">}} style now used by the new {{Operator Custom Resource 
> Metrics}} table, keeping row content but sorting metric names alphabetically 
> within each table for stable navigation.
>  # *JOSDK Metrics subsection*
>  ** Keep the current short paragraph but add a link to the upstream JOSDK 
> metrics documentation as the authoritative reference, and note that 
> JOSDK-owned metrics are subject to the same scope/reporter rules as the rest 
> of the operator metrics.
>  # *Metric Reporters section*
>  ** Enumerate the reporter plugins actually bundled in the operator image: 
> SLF4J, Prometheus, JMX, Graphite, InfluxDB, Datadog, StatsD, Dropwizard, 
> OpenTelemetry. Mention that any other {{MetricReporterFactory}} can be added 
> by dropping its plugin jar under {{/opt/flink/plugins/<name>/}} in a custom 
> image.
>  ** New {{Operator-scoped Metric Configuration}} subsection: explain that the 
> operator accepts the standard Flink {{metrics.*}} keys under the 
> {{kubernetes.operator.metrics.*}} prefix, that the {{kubernetes.operator.}} 
> prefix is stripped at startup and the remainder forwarded to the operator's 
> Flink metric registry, and that reporter options therefore follow Flink's 
> schema verbatim and are not re-declared on the Configuration page.
>  ** New {{Configuring Reporters on a FlinkDeployment}} example: clarify that 
> reporters for a managed Flink cluster ({{spec.flinkConfiguration}}) use the 
> plain {{metrics.reporter.*}} prefix, while 
> {{kubernetes.operator.metrics.reporter.*}} is reserved for the operator JVM.
>  ** Rename {{How to Enable Prometheus (Example)}} → {{Prometheus}}, rename 
> {{Set up Prometheus locally}} → {{Monitoring the Operator with Prometheus}}, 
> and drop the {{metrics.reporter.prom.interval}} line from the Prometheus 
> example (Prometheus is pull-based).
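
The two identifier-building rules from the {{Scope}} item above can be sketched as follows. This is a simplified illustration under stated assumptions, not Flink's or any reporter's actual code: the class `ScopeIdentifierSketch`, its methods, the `.` delimiter, the variable values and the metric name `SomeMetric` are all hypothetical. Only the system scope format `<host>.k8soperator.<namespace>.<name>.system` and the logical scope `k8soperator` come from the issue text. Real labeling reporters may also apply their own name sanitising, which is not modelled here.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration of the two naming rules; not reporter code.
public class ScopeIdentifierSketch {

    // Non-labeling reporters: substitute the variables into the scope format
    // (literals such as "system" are kept) and append the metric name.
    static String nonLabelingIdentifier(
            String scopeFormat, Map<String, String> variables, String metricName) {
        String scope = scopeFormat;
        for (Map.Entry<String, String> v : variables.entrySet()) {
            scope = scope.replace(v.getKey(), v.getValue());
        }
        return scope + "." + metricName;
    }

    // Labeling reporters: the name comes from the logical scope plus the
    // metric name; the scope variables are reported separately as labels.
    static String labelingName(String logicalScope, String metricName) {
        return logicalScope + "." + metricName;
    }

    public static void main(String[] args) {
        Map<String, String> variables = new LinkedHashMap<>();
        variables.put("<host>", "operator-pod");              // example value
        variables.put("<namespace>", "flink-system");         // example value
        variables.put("<name>", "flink-kubernetes-operator"); // example value

        String format = "<host>.k8soperator.<namespace>.<name>.system";
        System.out.println(nonLabelingIdentifier(format, variables, "SomeMetric"));
        System.out.println(labelingName("k8soperator", "SomeMetric") + " labels=" + variables);
    }
}
```

The sketch shows why the literal `system` appears in the non-labeling identifier but not in the labeling name: it exists only in the scope format, which the labeling path never consults.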
> h2. Image / Packaging
>  * Extend the {{maven-dependency-plugin}} {{artifactItems}} to copy 
> *flink-metrics-dropwizard* and *flink-metrics-otel* under 
> {{${plugins.tmp.dir}/}}, so both plugins end up in 
> {{/opt/flink/plugins/}} of the operator image. This matches the updated 
> {{Metric Reporters}} documentation.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
