palkakrzysiek opened a new pull request, #1106:
URL: https://github.com/apache/flink-kubernetes-operator/pull/1106

   ## What is the purpose of the change
   
   This pull request fixes a deterministic, unrecoverable loop on session 
clusters whose JobManager Deployment goes missing.
   
   `SessionReconciler.recoverSession()` calls `submitSessionCluster()` with 
`ctx.getObserveConfig()` but does not first invoke `setOwnerReference()`. 
Because the `kubernetes.jobmanager.owner.reference` Flink config option is set 
transiently per-deploy and is never persisted to `spec.flinkConfiguration`, the 
observe config never carries it. As a result, the JM Deployment recreated by 
recovery has no `ownerReferences`.
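
The effect of a transient, never-persisted option can be illustrated with a minimal sketch. It uses plain maps rather than Flink's `Configuration`, and the helper names (`observeConfig`, `recoverWithoutFix`, `recoverWithFix`) and the option value format are hypothetical stand-ins, not the operator's actual code:

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model: an option set only at deploy time is absent from a config rebuilt from the spec. */
final class OwnerReferenceSketch {
    // Option key as named in this PR description.
    static final String OWNER_REF_KEY = "kubernetes.jobmanager.owner.reference";

    /** Hypothetical stand-in for the observe config: it only sees what is persisted in the spec. */
    static Map<String, String> observeConfig(Map<String, String> specFlinkConfiguration) {
        return new HashMap<>(specFlinkConfiguration);
    }

    /** Hypothetical stand-in for setOwnerReference(); the value format is a placeholder. */
    static void setOwnerReference(Map<String, String> conf, String ownerName) {
        conf.put(OWNER_REF_KEY, "owner=" + ownerName);
    }

    /** Recovery as described for the buggy path: the observe config is submitted unchanged. */
    static Map<String, String> recoverWithoutFix(Map<String, String> spec) {
        return observeConfig(spec);
    }

    /** Recovery with the fix: the owner reference is added before submission. */
    static Map<String, String> recoverWithFix(Map<String, String> spec, String ownerName) {
        Map<String, String> conf = observeConfig(spec);
        setOwnerReference(conf, ownerName);
        return conf;
    }

    public static void main(String[] args) {
        Map<String, String> spec = new HashMap<>();
        spec.put("taskmanager.numberOfTaskSlots", "2");
        System.out.println(recoverWithoutFix(spec).containsKey(OWNER_REF_KEY)); // false
        System.out.println(recoverWithFix(spec, "my-session").containsKey(OWNER_REF_KEY)); // true
    }
}
```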
   
   JOSDK's default secondary-to-primary mapper for the JM Deployment uses 
`ownerReferences` to link a Deployment back to its `FlinkDeployment` primary 
(registered in `EventSourceUtils#getDeploymentInformerEventSource` via 
`InformerEventSourceConfiguration.from(Deployment.class, 
FlinkDeployment.class)` without a custom mapper). Without ownerReferences, 
`getSecondaryResource(Deployment.class)` returns empty on the next observe, so 
the operator sets `jobManagerDeploymentStatus = MISSING`. Recovery then runs 
again and fails with `409 AlreadyExists`. The catch path in 
`KubernetesClusterDescriptor` calls `stopAndCleanupCluster()`, which deletes 
the orphan Deployment, and the cycle repeats indefinitely.
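
The cycle described above can be simulated with a small toy model. This is only a sketch under simplifying assumptions: the in-memory map stands in for the Kubernetes API, the class and method names are hypothetical, and the conflict-then-cleanup step is collapsed into a single delete:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

/** Toy simulation of the observe/recover cycle with and without an owner reference. */
final class RecoveryLoopSketch {
    // Deployment name -> owner; an empty Optional models a Deployment without ownerReferences.
    private final Map<String, Optional<String>> deployments = new HashMap<>();
    private final boolean setOwnerOnRecovery;

    RecoveryLoopSketch(boolean setOwnerOnRecovery) {
        this.setOwnerOnRecovery = setOwnerOnRecovery;
    }

    /** Observe: a Deployment is only visible to the primary if it is linked through its owner. */
    boolean jobManagerVisible(String primary) {
        return deployments.values().stream()
                .anyMatch(owner -> owner.isPresent() && owner.get().equals(primary));
    }

    /** Recover: create the Deployment; on a name conflict, clean up the existing one instead. */
    void recover(String primary, String name) {
        if (deployments.containsKey(name)) {
            // Models the 409 AlreadyExists failure followed by cleanup of the orphan.
            deployments.remove(name);
            return;
        }
        deployments.put(name, setOwnerOnRecovery ? Optional.of(primary) : Optional.empty());
    }

    /** Runs up to maxRounds observe/recover rounds and returns how many recoveries were attempted. */
    static int recoveriesAttempted(boolean setOwnerOnRecovery, int maxRounds) {
        RecoveryLoopSketch cluster = new RecoveryLoopSketch(setOwnerOnRecovery);
        int recoveries = 0;
        for (int round = 0; round < maxRounds; round++) {
            if (cluster.jobManagerVisible("my-session")) {
                break; // The JM Deployment is observed, so no further recovery is needed.
            }
            cluster.recover("my-session", "my-session");
            recoveries++;
        }
        return recoveries;
    }

    public static void main(String[] args) {
        System.out.println(recoveriesAttempted(false, 6)); // 6: never converges within the budget
        System.out.println(recoveriesAttempted(true, 6));  // 1: a single recovery is enough
    }
}
```

Without the owner link, every round either creates an invisible orphan or deletes it, so the loop only stops when the round budget runs out.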
   
   This is the same defect that 
[FLINK-28979](https://issues.apache.org/jira/browse/FLINK-28979) (commit 
62eb68c8) fixed for the `deploy()` path; the `recoverSession()` path was 
missed when ownerReferences were originally introduced. The corresponding test 
`SessionReconcilerTest#testSetOwnerReference` only exercises `deploy()`, so the 
gap was never caught.
   
   `ApplicationReconciler` is unaffected because its recovery path routes 
through `resubmitJob → restoreJob → deploy()`, and 
`ApplicationReconciler#deploy()` already calls `setOwnerReference()`.
   
   ## Brief change log
   
     - `SessionReconciler#recoverSession` now calls `setOwnerReference()` on 
the observe config before submitting the session cluster, mirroring the 
`deploy()` path.
     - Added regression test 
`SessionReconcilerTest#testRecoverSessionSetsOwnerReference` that exercises 
`reconcileOtherChanges → recoverSession` and asserts the configuration passed 
to `submitSessionCluster()` carries the expected `JOB_MANAGER_OWNER_REFERENCE` 
entry.
   
   ## Verifying this change
   
   This change added a regression test and can be verified as follows:
   
     - `SessionReconcilerTest#testRecoverSessionSetsOwnerReference` brings a 
session cluster to READY, marks the JM `MISSING`, triggers reconcile, captures 
the configuration handed to `submitSessionCluster()`, and asserts the expected 
ownerReferences are present.
     - Confirmed the test fails on master (no ownerReferences set) and passes 
with the fix applied.
     - Full operator module suite (`mvn -pl flink-kubernetes-operator test`) 
passes locally — 2134 tests, 0 failures, 0 errors, 0 skipped.
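
The capture-and-assert pattern used by the regression test can be sketched as follows. This is not the actual test code: `FakeFlinkService`, the map-based config, and this simplified `recoverSession` are hypothetical stand-ins for the operator's test utilities:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy version of a test that captures the config handed to submitSessionCluster(). */
final class CapturingSubmitSketch {
    // Option key as named in this PR description.
    static final String OWNER_REF_KEY = "kubernetes.jobmanager.owner.reference";

    /** Hypothetical fake service that records every submitted configuration. */
    static final class FakeFlinkService {
        final List<Map<String, String>> submitted = new ArrayList<>();

        void submitSessionCluster(Map<String, String> conf) {
            submitted.add(new HashMap<>(conf)); // Defensive copy so later mutation cannot hide a bug.
        }
    }

    /** Hypothetical, simplified recovery step: add the owner reference, then submit. */
    static void recoverSession(FakeFlinkService service, Map<String, String> observeConfig, String owner) {
        Map<String, String> conf = new HashMap<>(observeConfig);
        conf.put(OWNER_REF_KEY, owner); // Placeholder value format.
        service.submitSessionCluster(conf);
    }

    public static void main(String[] args) {
        FakeFlinkService service = new FakeFlinkService();
        recoverSession(service, new HashMap<>(), "my-session");

        // The assertion targets what was actually submitted, not the reconciler's local state.
        Map<String, String> captured = service.submitted.get(0);
        if (!"my-session".equals(captured.get(OWNER_REF_KEY))) {
            throw new AssertionError("owner reference missing from submitted config");
        }
        System.out.println("owner reference present");
    }
}
```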
   
   ## Does this pull request potentially affect one of the following parts:
   
     - Dependencies (does it add or upgrade a dependency): no
     - The public API, i.e., are there any changes to the `CustomResourceDescriptors`: 
no
     - Core observer or reconciler logic that is regularly executed: yes — but 
only the session-cluster recovery branch (`SessionReconciler#recoverSession`); 
behaviour aligns with the existing `deploy()` path.
   
   ## Documentation
   
     - Does this pull request introduce a new feature? no
     - If yes, how is the feature documented? not applicable

