Dennis-Mircea Ciupitu created FLINK-39571:
---------------------------------------------
Summary: Improve Flink Kubernetes Operator documentation coverage
Key: FLINK-39571
URL: https://issues.apache.org/jira/browse/FLINK-39571
Project: Flink
Issue Type: Improvement
Components: Kubernetes Operator
Affects Versions: kubernetes-operator-1.14.0
Reporter: Dennis-Mircea Ciupitu
h1. Summary
The Flink Kubernetes Operator is a mature project used in production by a large
number of organizations, but its documentation has not kept pace with the
codebase. Today the docs cover installation ({{helm.md}}), CRD schema
({{reference.md}}, {{overview.md}}), and metrics/logging well, but stay thin on
the operational concerns that determine whether the operator can be run safely
in production: high availability, security posture, the Kubernetes ConfigMaps
the operator creates and manages, the event taxonomy, day-2 troubleshooting,
and the full surface of the autoscaler. As a result, users frequently have to
read the source code or rely on community channels for answers that should live
in the official docs.
The goal of this umbrella issue is to bring the operator documentation closer
to the depth and structure of the main Flink documentation, where each
operational concern (state backends, HA, security, deployment, monitoring,
debugging) has a dedicated, narrative-driven section rather than being
scattered across pages.
h1. New Sections
- *State*, the Kubernetes ConfigMaps the operator creates and manages per Flink
resource: HA ConfigMaps (leader info and last-completed-checkpoint pointers),
the autoscaler state ConfigMap, and auxiliary ConfigMaps (flink-conf,
pod-template, log4j). For each: naming, ownership, lifecycle across
{{last-state}} upgrades and deletes, relevant configuration knobs, and recovery
procedures when a ConfigMap is lost or corrupted. State backends and the
{{FlinkStateSnapshot}} CR are cross-linked to the upstream Flink docs rather
than re-documented.
- *Events*, the event taxonomy emitted by the operator (submit, recovery,
scaling, snapshot, validation, etc.), deduplication semantics, and how to
consume them.
- *High Availability*, operator-side leader election, replica topology,
JobManager HA via Kubernetes ConfigMaps, recovery semantics, and configuration
reference.
- *Security*, RBAC scoping, operator -> Flink REST mTLS, truststore/keystore
management, Kerberos auth, webhook TLS, and secrets handling for credentials.
- *Debugging*, common failure modes, how to interpret status fields and
reconciler logs, runbooks for stuck reconciliations, and diagnostic
configuration toggles.
- *Production Readiness Checklist*, a single-page checklist consolidating HA,
security, resource sizing, observability, upgrade strategy, and disaster
recovery, modeled on similar pages in other mature Kubernetes operators.
h1. Updated Sections
- *Autoscaler*, expand the existing 358-line page to cover more internal
semantics, scaling cooldowns, exclusion semantics, scaling history persistence,
and additional advanced features that end users can enable.