[jira] [Updated] (FLINK-39571) Improve Flink Kubernetes Operator documentation coverage

Dennis-Mircea Ciupitu (Jira) Fri, 12 Jun 2026 05:22:18 -0700


     [ https://issues.apache.org/jira/browse/FLINK-39571?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Dennis-Mircea Ciupitu updated FLINK-39571:
------------------------------------------
    Description: 
h2. Motivation

The Flink Kubernetes Operator is production-ready and widely deployed, but its 
documentation has not kept pace, both in coverage and in structure:
 * *Concepts is an internals trap:* The section opens with controller 
deep-dives aimed at contributors instead of answering the evaluator's 
question: "What does the operator bring me, and why should I install it?"
 * *Custom Resource mixes concerns:* CR structure/spec/reference pages are 
interleaved with operator capabilities (job management, snapshots, autoscaler).
 * *Operations is a grab-bag:* Install-time concerns (Helm, RBAC, 
configuration) sit next to day-2 concerns (metrics, upgrades, debugging), while 
Flink core deliberately separates Deployment from Operations.
 * *Coverage gaps:* Of the custom resources, FlinkStateSnapshot is barely 
documented and FlinkBlueGreenDeployment is not documented at all, and 
operational topics critical for production (operator state, event taxonomy, 
high availability, debugging, security) currently require reading the source code.

h2. Goal

Restructure the operator documentation into a layered, navigable tree aligned 
with the Flink core documentation conventions, and fill the coverage gaps as 
part of the same effort. The guiding principles:
 * *Three altitudes per capability:* _Concepts_ pages explain why a capability 
exists and what it buys you (no config keys), the management pages explain how 
to enable and configure it, and _Internals_ pages explain how it works inside. 
Example vertical: Concepts > Autoscaling -> Managing Flink Jobs > Autoscaler -> 
Internals > Autoscaler Flow.
 * *Clear separation* between CR structure (spec/status/reference) and the 
operator functionality applied to the CRs.
 * *Flink core alignment* wherever an analogous concern exists (Deployment vs 
Operations split, Configuration/HA/Security under Deployment, Debugging under 
Operations, Concepts Glossary), with deliberate deviations where the operator 
differs (CR sections instead of Application Development, a kubectl/helm Command 
Cheatsheet instead of a CLI page).
 * *No stubs:* a navigation entry only appears once real content exists behind 
it.
 * *URL stability:* every moved or renamed page keeps its old paths via aliases.
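
The URL stability rule above could be checked mechanically. The following is a minimal Python sketch under stated assumptions: each page declares its aliases as a list, and the function name, the page map, and every path shown are hypothetical and not taken from the operator repository.

```python
# Hypothetical helper: not part of the operator repository or its build.
from typing import Dict, List


def missing_aliases(old_paths: List[str], pages: Dict[str, List[str]]) -> List[str]:
    """Return every old path that no page serves, directly or via an alias.

    `pages` maps a page's current path to the aliases declared for it.
    """
    covered = {alias for aliases in pages.values() for alias in aliases}
    # A page that never moved still serves its old path directly.
    covered.update(pages.keys())
    return sorted(path for path in old_paths if path not in covered)


# Illustrative paths only; the real old and new paths are not given in this issue.
pages = {
    "/docs/deployment/helm/": ["/docs/operations/helm/"],
    "/docs/operations/metrics/": [],
}
old_paths = [
    "/docs/operations/helm/",
    "/docs/operations/metrics/",
    "/docs/operations/configuration/",
]
print(missing_aliases(old_paths, pages))
```

In this made-up data set the configuration page has moved without an alias, so it is the only path reported. A check of this kind could run alongside the docs build, though the issue does not say whether one is planned.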

h2. Target structure
* *Try Flink Kubernetes Operator*
** Quick Start
* *Concepts*
** Overview
** Architecture
** Lifecycle Management
** Zero-Downtime Upgrades
** Autoscaling
** Glossary
* *Custom Resource*
** Overview
** Pod Template
** Ingress
** Status and Lifecycle
** Reference
* *Managing Flink Jobs*
** Job Management
** Snapshot Management
** Blue/Green Deployments
** Autoscaler
** Autotuning
* *Deployment*
** Overview
** Compatibility
** Helm
*** RBAC
*** Cert Manager
** Configuration
** High Availability
** Plugins
** Command Cheatsheet
** Security (gated until written)
* *Operations*
** Operator State
** Metrics
** Logging
** Events
** Debugging
*** Resource Exceptions
*** Application Profiling & Debugging
** Monitoring the Operator
** Upgrading the Operator
* *Development*
** Guide
** Importing the Operator into an IDE (gated until written)
** Roadmap
* *Internals*
** Overview
** Startup Flow
** Reconciliation Flow
** Autoscaler Flow
** Webhook Flow

The sections form four visual groups, mirroring the Flink core navigation: 
learn (Try, Concepts), use the custom resources (Custom Resource, Managing 
Flink Jobs), run the operator (Deployment, Operations), contribute 
(Development, Internals).


This absorbs and supersedes the originally proposed scope: the six new pages 
(state, events, high availability, security, debugging, plus monitoring) all 
exist in the tree above, now in their structurally correct homes, alongside new 
pages for the previously undocumented capabilities (FlinkStateSnapshot via 
Snapshot Management and Status and Lifecycle, FlinkBlueGreenDeployment via 
Zero-Downtime Upgrades and Blue/Green Deployments).

  was:
h1. Summary

The Flink Kubernetes Operator is a mature project used in production by a large 
number of organizations, but its documentation has not kept pace with the 
codebase. Today the docs cover installation ({{helm.md}}), CRD schema 
({{reference.md}}, {{overview.md}}), and metrics/logging well, but stay 
thin on the operational concerns that determine whether an operator can be 
safely run in production: high availability, security posture, the Kubernetes 
ConfigMaps the operator creates and manages, the event taxonomy, day-2 
troubleshooting, and the full surface of the autoscaler. As a result, users 
frequently have to read the source code or rely on community channels for 
answers that should live in the official docs.

The goal of this umbrella is to bring the operator documentation closer to the 
depth and structure of the Flink main documentation itself, where each 
operational concern (state backends, HA, security, deployment, monitoring, 
debugging) has a dedicated, narrative-driven section rather than being 
scattered across pages.
h1. New Pages
 - *State* - Document the Kubernetes ConfigMaps the operator creates and 
manages per Flink resource.

 - *Events* - Document the event taxonomy emitted by the operator (submit, 
recovery, scaling, snapshot, validation, etc.), deduplication semantics, and 
how the events can be consumed.

 - *High Availability* - Document the operator-side leader election, replica 
topology, recovery semantics, limitations, and configuration reference.

 - *Security* - Document the RBAC scoping, operator -> Flink REST mTLS, 
truststore/keystore management, Kerberos auth, webhook TLS, and secrets 
handling for credentials.

 - *Debugging* - Document common failure modes, how to interpret status fields 
and reconciler logs, runbooks for stuck reconciliations, and diagnostic 
configuration toggles.

 - *Production Readiness Checklist* - Single-page checklist consolidating HA, 
security, resource sizing, observability, upgrade strategy, and disaster 
recovery, modeled on similar pages in other mature Kubernetes operators.

h1. Pages to be updated
 - *Configuration* - It does not explain which properties hot-reload vs which 
require an operator restart, documents only a fraction of the operator's actual 
environment variables, lacks guidance on the YAML configuration format and its 
limitations, and groups Leader Election and High Availability with general 
configuration instead of giving them a dedicated page.
 - *Autoscaler* - The existing 358-line page should be expanded to cover more 
internal semantics, scaling cooldowns, exclusion semantics, scaling history 
persistence, and additional advanced features that end users can enable.


> Improve Flink Kubernetes Operator documentation coverage
> --------------------------------------------------------
>
>                 Key: FLINK-39571
>                 URL: https://issues.apache.org/jira/browse/FLINK-39571
>             Project: Flink
>          Issue Type: Improvement
>          Components: Kubernetes Operator
>    Affects Versions: kubernetes-operator-1.14.0
>            Reporter: Dennis-Mircea Ciupitu
>            Assignee: Dennis-Mircea Ciupitu
>            Priority: Major
>             Fix For: kubernetes-operator-1.16.0



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
