palkakrzysiek opened a new pull request, #1106: URL: https://github.com/apache/flink-kubernetes-operator/pull/1106
## What is the purpose of the change

This pull request fixes a deterministic, unrecoverable loop on session clusters whose JobManager Deployment goes missing.

`SessionReconciler.recoverSession()` calls `submitSessionCluster()` with `ctx.getObserveConfig()` but does not first invoke `setOwnerReference()`. Because the `kubernetes.jobmanager.owner.reference` Flink config option is set transiently on each deploy and is never persisted to `spec.flinkConfiguration`, the observe config never carries it. As a result, the JM Deployment recreated by recovery has no `ownerReferences`.

JOSDK's default secondary-to-primary mapper for the JM Deployment uses `ownerReferences` to link a Deployment back to its `FlinkDeployment` primary (registered in `EventSourceUtils#getDeploymentInformerEventSource` via `InformerEventSourceConfiguration.from(Deployment.class, FlinkDeployment.class)` without a custom mapper). Without `ownerReferences`, `getSecondaryResource(Deployment.class)` returns empty on the next observe and the operator sets `jobManagerDeploymentStatus = MISSING`. Recovery then runs again and fails with `409 AlreadyExists`, the catch path in `KubernetesClusterDescriptor` calls `stopAndCleanupCluster()`, which deletes the orphan Deployment, and the cycle repeats indefinitely.

This is the same problem that was fixed for the `deploy()` path in [FLINK-28979](https://issues.apache.org/jira/browse/FLINK-28979) (commit 62eb68c8); the `recoverSession()` path was missed when ownerReferences were originally introduced. The corresponding test `SessionReconcilerTest#testSetOwnerReference` only exercises `deploy()`, so the gap was never caught.

`ApplicationReconciler` is unaffected because its recovery path routes through `resubmitJob → restoreJob → deploy()`, and `ApplicationReconciler#deploy()` already calls `setOwnerReference()`.
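The loop described above can be illustrated with a small, self-contained model. This is only a sketch under simplifying assumptions, not the operator's code: `OwnerReferenceLoopSketch`, `hasSecondaryResource`, `recoveryAttempts`, and the in-memory `cluster` map are hypothetical stand-ins, the owner reference is reduced to a bare UID string, and the `409 AlreadyExists` plus cleanup step is collapsed into a single branch.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class OwnerReferenceLoopSketch {

    // Config key named in the description above.
    static final String OWNER_REF_KEY = "kubernetes.jobmanager.owner.reference";

    // Toy cluster state (assumption): Deployment name -> owner UIDs stored on it.
    static final Map<String, List<String>> cluster = new HashMap<>();

    // Models an owner-reference-based lookup: a Deployment is only visible to
    // the primary if it lists the primary's UID among its owners.
    static boolean hasSecondaryResource(String primaryUid) {
        return cluster.values().stream().anyMatch(owners -> owners.contains(primaryUid));
    }

    // Models cluster submission: creating over an existing Deployment fails
    // (the 409 case), and the cleanup path then deletes the orphan.
    static boolean submitSessionCluster(String name, Map<String, String> conf) {
        if (cluster.containsKey(name)) {
            cluster.remove(name);
            return false;
        }
        List<String> owners = new ArrayList<>();
        String ownerUid = conf.get(OWNER_REF_KEY);
        if (ownerUid != null) {
            owners.add(ownerUid);
        }
        cluster.put(name, owners);
        return true;
    }

    // Runs observe/recover rounds and counts how many recoveries were attempted
    // before the Deployment was observed, or until maxRounds is reached.
    static int recoveryAttempts(String primaryUid, boolean setOwnerReference, int maxRounds) {
        cluster.clear();
        int attempts = 0;
        for (int round = 0; round < maxRounds; round++) {
            if (hasSecondaryResource(primaryUid)) {
                break; // Deployment is linked to the primary, nothing to recover
            }
            // The observe config never carries the owner reference on its own.
            Map<String, String> observeConf = new HashMap<>();
            if (setOwnerReference) {
                observeConf.put(OWNER_REF_KEY, primaryUid); // the step this PR adds
            }
            submitSessionCluster("session-jm", observeConf);
            attempts++;
        }
        return attempts;
    }

    public static void main(String[] args) {
        System.out.println("without owner reference: " + recoveryAttempts("uid-1", false, 10));
        System.out.println("with owner reference: " + recoveryAttempts("uid-1", true, 10));
    }
}
```

In this model, the run without the owner reference uses all 10 rounds (alternating create and delete), while the run with it stops after a single recovery because the recreated Deployment is found on the next observe.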
## Brief change log

- `SessionReconciler#recoverSession` now calls `setOwnerReference()` on the observe config before submitting the session cluster, mirroring the `deploy()` path.
- Added regression test `SessionReconcilerTest#testRecoverSessionSetsOwnerReference` that exercises `reconcileOtherChanges → recoverSession` and asserts the configuration passed to `submitSessionCluster()` carries the expected `JOB_MANAGER_OWNER_REFERENCE` entry.

## Verifying this change

This change adds a regression test and can be verified as follows:

- `SessionReconcilerTest#testRecoverSessionSetsOwnerReference` brings a session cluster to READY, marks the JM `MISSING`, triggers reconcile, captures the configuration handed to `submitSessionCluster()`, and asserts the expected ownerReferences are present.
- Confirmed the test fails on master (no ownerReferences set) and passes with the fix applied.
- Full operator module suite (`mvn -pl flink-kubernetes-operator test`) passes locally: 2134 tests, 0 failures, 0 errors, 0 skipped.

## Does this pull request potentially affect one of the following parts:

- Dependencies (does it add or upgrade a dependency): no
- The public API, i.e., any changes to the `CustomResourceDescriptors`: no
- Core observer or reconciler logic that is regularly executed: yes, but only the session-cluster recovery branch (`SessionReconciler#recoverSession`); behaviour aligns with the existing `deploy()` path.

## Documentation

- Does this pull request introduce a new feature? no
- If yes, how is the feature documented? not applicable
