[ https://issues.apache.org/jira/browse/FLINK-39571?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Gyula Fora reassigned FLINK-39571:
----------------------------------
Assignee: Dennis-Mircea Ciupitu
> Improve Flink Kubernetes Operator documentation coverage
> --------------------------------------------------------
>
> Key: FLINK-39571
> URL: https://issues.apache.org/jira/browse/FLINK-39571
> Project: Flink
> Issue Type: Improvement
> Components: Kubernetes Operator
> Affects Versions: kubernetes-operator-1.14.0
> Reporter: Dennis-Mircea Ciupitu
> Assignee: Dennis-Mircea Ciupitu
> Priority: Major
>
> h1. Summary
> The Flink Kubernetes Operator is a mature project used in production by a
> large number of organizations, but its documentation has not kept pace with
> the codebase. Today the docs cover installation ({{helm.md}}), CRD schema
> ({{reference.md}}, {{overview.md}}), and metrics/logging well, but remain thin
> on the operational concerns that determine whether the operator can be run
> safely in production: high availability, security posture, the Kubernetes
> ConfigMaps the operator creates and manages, the event taxonomy, day-2
> troubleshooting, and the full surface of the autoscaler. As a result, users
> frequently have to read the source code or rely on community channels for
> answers that should live in the official docs.
> The goal of this umbrella is to bring the operator documentation closer to
> the depth and structure of the Flink main documentation itself, where each
> operational concern (state backends, HA, security, deployment, monitoring,
> debugging) has a dedicated, narrative-driven section rather than being
> scattered across pages.
> h1. New Sections
> - *State*, the Kubernetes ConfigMaps the operator creates and manages per
> Flink resource: HA ConfigMaps (leader info and last-completed-checkpoint
> pointers), the autoscaler state ConfigMap, and auxiliary ConfigMaps
> (flink-conf, pod-template, log4j). For each: naming, ownership, lifecycle
> across {{last-state}} upgrades and deletes, relevant configuration knobs, and
> recovery procedures when a ConfigMap is lost or corrupted. State backends and
> the {{FlinkStateSnapshot}} CR are cross-linked to the upstream Flink docs
> rather than re-documented.
> - *Events*, the event taxonomy emitted by the operator (submit, recovery,
> scaling, snapshot, validation, etc.), deduplication semantics, and how to
> consume them.
> - *High Availability*, operator-side leader election, replica topology,
> JobManager HA via Kubernetes ConfigMaps, recovery semantics, and
> configuration reference.
> - *Security*, RBAC scoping, operator -> Flink REST mTLS, truststore/keystore
> management, Kerberos auth, webhook TLS, and secrets handling for credentials.
> - *Debugging*, common failure modes, how to interpret status fields and
> reconciler logs, runbooks for stuck reconciliations, and diagnostic
> configuration toggles.
> - *Production Readiness Checklist*, single-page checklist consolidating HA,
> security, resource sizing, observability, upgrade strategy, and disaster
> recovery, modeled on similar pages in other mature Kubernetes operators.
> h1. Updated Sections
> - *Autoscaler*, expand the existing 358-line page to cover more of the
> internal semantics, scaling cooldowns, exclusion semantics, scaling history
> persistence, and additional advanced features that end users can enable.