[jira] [Assigned] (FLINK-39571) Improve Flink Kubernetes Operator documentation coverage

Gyula Fora (Jira) Wed, 29 Apr 2026 05:30:07 -0700


     [ 
https://issues.apache.org/jira/browse/FLINK-39571?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Gyula Fora reassigned FLINK-39571:
----------------------------------

    Assignee: Dennis-Mircea Ciupitu

> Improve Flink Kubernetes Operator documentation coverage
> --------------------------------------------------------
>
>                 Key: FLINK-39571
>                 URL: https://issues.apache.org/jira/browse/FLINK-39571
>             Project: Flink
>          Issue Type: Improvement
>          Components: Kubernetes Operator
>    Affects Versions: kubernetes-operator-1.14.0
>            Reporter: Dennis-Mircea Ciupitu
>            Assignee: Dennis-Mircea Ciupitu
>            Priority: Major
>
> h1. Summary
> The Flink Kubernetes Operator is a mature project used in production by a 
> large number of organizations, but its documentation has not kept pace with 
> the codebase. Today the docs cover installation ({{helm.md}}), CRD schema 
> ({{reference.md}}, {{overview.md}}), and metrics/logging well, but stay thin 
> on the operational concerns that determine whether an operator can be safely 
> run in production: high availability, security posture, the Kubernetes 
> ConfigMaps the operator creates and manages, the event taxonomy, day-2 
> troubleshooting, and the full surface of the autoscaler. As a result, users 
> frequently have to read the source code or rely on community channels for 
> answers that should live in the official docs.
> The goal of this umbrella is to bring the operator documentation closer to 
> the depth and structure of the Flink main documentation itself, where each 
> operational concern (state backends, HA, security, deployment, monitoring, 
> debugging) has a dedicated, narrative-driven section rather than being 
> scattered across pages.
> h1. New Sections
> - *State*, the Kubernetes ConfigMaps the operator creates and manages per 
> Flink resource: HA ConfigMaps (leader info and last-completed-checkpoint 
> pointers), the autoscaler state ConfigMap, and auxiliary ConfigMaps 
> (flink-conf, pod-template, log4j). For each: naming, ownership, lifecycle 
> across {{last-state}} upgrades and deletes, relevant configuration knobs, and 
> recovery procedures when a ConfigMap is lost or corrupted. State backends and 
> the {{FlinkStateSnapshot}} CR are cross-linked to the upstream Flink docs 
> rather than re-documented.
> - *Events*, the event taxonomy emitted by the operator (submit, recovery, 
> scaling, snapshot, validation, etc.), deduplication semantics, and how to 
> consume them.
> - *High Availability*, operator-side leader election, replica topology, 
> JobManager HA via Kubernetes ConfigMaps, recovery semantics, and 
> configuration reference.
> - *Security*, RBAC scoping, operator -> Flink REST mTLS, truststore/keystore 
> management, Kerberos auth, webhook TLS, and secrets handling for credentials.
> - *Debugging*, common failure modes, how to interpret status fields and 
> reconciler logs, runbooks for stuck reconciliations, and diagnostic 
> configuration toggles.
> - *Production Readiness Checklist*, single-page checklist consolidating HA, 
> security, resource sizing, observability, upgrade strategy, and disaster 
> recovery, modeled on similar pages in other mature Kubernetes operators.
> h1. Updated Sections
> - *Autoscaler*, expand the existing 358-line page to cover more internal 
> semantics, scaling cooldowns, exclusion semantics, scaling history 
> persistence, and additional advanced features that end users can 
> enable.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
