[
https://issues.apache.org/jira/browse/FLINK-37766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Kartikey Pant updated FLINK-37766:
----------------------------------
Environment: apache/flink-kubernetes-operator:1.10.0, apache/flink:1.20.1,
minikube version: v1.35.0 (was: Flink Kubernetes Operator Image:
apache/flink-kubernetes-operator:1.10.0
Flink Image: apache/flink:1.20.1
Kubernetes: minikube version: v1.35.0)
> FlinkSessionJob deletion blocked by finalizer when Flink job already
> terminal/missing due to HA desync
> ------------------------------------------------------------------------------------------------------
>
> Key: FLINK-37766
> URL: https://issues.apache.org/jira/browse/FLINK-37766
> Project: Flink
> Issue Type: Bug
> Components: Kubernetes Operator
> Affects Versions: 1.20.1
> Environment: apache/flink-kubernetes-operator:1.10.0,
> apache/flink:1.20.1, minikube version: v1.35.0
> Reporter: Kartikey Pant
> Priority: Major
>
> We've encountered an issue where {{FlinkSessionJob}} custom resources become
> stuck in a {{Terminating}} state when deleted via {{kubectl delete}}.
> This occurs after a desynchronization between the Flink Kubernetes Operator
> and the Flink JobManager, typically initiated by a JobManager restart where
> its High Availability (HA) mechanism fails to recover the state of the
> pre-existing job.
> The sequence of events leading to the problem is as follows:
> # A Flink JobManager pod for an active session cluster restarts.
> # Upon restart, the JobManager's HA recovery fails to load the state of
> previously running jobs. JobManager logs indicate this with messages like:
> {{Retrieved job ids [] from KubernetesStateHandleStore...}}.
> # This creates a desynchronization:
> ** The Flink Operator (via the {{FlinkSessionJob}} CR status) still holds
> information about the original Flink JobID and its last known
> state/savepoint. It attempts to reconcile this job.
> ** The newly started Flink JobManager has no internal record of this
> specific job instance from its HA recovery.
> # The {{FlinkSessionJob}} CR status often remains {{RECONCILING}} as the
> Operator tries to manage a job the current JobManager doesn't recognize from
> its HA state.
> # When {{kubectl delete FlinkSessionJob <job-name>}} is issued, the
> Operator's finalizer ({{flinksessionjobs.flink.apache.org/finalizer}})
> logic is triggered.
> # The Operator attempts to cancel the Flink job via the JobManager's REST
> API using the JobID from the CR status.
> # The Flink JobManager, which either doesn't know the job or has internally
> marked it as {{FAILED}} due to the ongoing reconciliation attempts for a
> desynchronized job, responds with an error to the cancellation request.
> JobManager logs show: {{Job cancellation failed because the job has already
> reached another terminal state (FAILED).}}
> # The Flink Kubernetes Operator's REST client logic or the finalizer's error
> handling does not gracefully process this specific "already FAILED" (or
> potentially "not found") response. An exception occurs within the Operator
> (visible in Operator logs, often involving {{RestClient.parseResponse}} or
> {{CompletableFuture.completeExceptionally}}).
> # Due to this unhandled exception in the finalizer logic, the Operator fails
> to remove its finalizer from the {{FlinkSessionJob}} CR.
> # Consequently, the {{FlinkSessionJob}} CR remains stuck in the
> {{Terminating}} state indefinitely.
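The failure described in the last steps of the sequence above suggests that the finalizer could treat "job already terminal" and "job not found" responses as successful cleanup. The following is a minimal Python sketch of that idea, not the operator's actual code (the operator itself is written in Java). `CancelError`, `is_benign`, `cleanup_session_job`, and the `cancel_job` callback are hypothetical names, and mapping a missing job to HTTP 404 is an assumption; only the message fragment is taken from the JobManager log quoted in this report.

```python
from typing import Callable


class CancelError(Exception):
    """Hypothetical error raised by a REST client when cancellation fails."""

    def __init__(self, status_code: int, message: str) -> None:
        super().__init__(message)
        self.status_code = status_code
        self.message = message


# Message fragment quoted from the JobManager log in this report.
ALREADY_TERMINAL = "already reached another terminal state"


def is_benign(err: CancelError) -> bool:
    """True when the job is gone or finished, so nothing is left to cancel.

    Assumption: a job unknown to the JobManager surfaces as HTTP 404.
    """
    return err.status_code == 404 or ALREADY_TERMINAL in err.message


def cleanup_session_job(job_id: str, cancel_job: Callable[[str], None]) -> bool:
    """Return True when the finalizer may be removed, False to retry later."""
    try:
        cancel_job(job_id)
    except CancelError as err:
        # Already terminal or unknown job: cleanup is effectively complete.
        # Any other error is a real failure, so the finalizer must stay.
        return is_benign(err)
    return True
```

With this shape, only unexpected errors keep the finalizer in place; the two desynchronized cases described in this report would let deletion proceed.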
> The only workaround is to manually edit the {{FlinkSessionJob}} CR and remove
> the finalizer, allowing Kubernetes to complete the deletion.
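The manual workaround can also be scripted. The sketch below builds a JSON merge patch that removes only the operator's finalizer (leaving any others in place) and the matching `kubectl patch` command line. The function names are hypothetical, the resource name `flinksessionjobs.flink.apache.org` is inferred from the finalizer name quoted in this report, and the job name and namespace in the usage note are placeholders.

```python
import json
from typing import Any, Dict, List

# Finalizer name as quoted in this report.
OPERATOR_FINALIZER = "flinksessionjobs.flink.apache.org/finalizer"


def finalizer_removal_patch(metadata: Dict[str, Any]) -> Dict[str, Any]:
    """Build a JSON merge patch that drops only the operator's finalizer."""
    current = metadata.get("finalizers") or []
    remaining = [f for f in current if f != OPERATOR_FINALIZER]
    # A merge patch replaces the whole list, so send every finalizer to keep.
    return {"metadata": {"finalizers": remaining}}


def kubectl_patch_argv(name: str, namespace: str, patch: Dict[str, Any]) -> List[str]:
    """Build the kubectl command line for the patch (it is not executed here)."""
    return [
        "kubectl", "patch", "flinksessionjobs.flink.apache.org", name,
        "--namespace", namespace,
        "--type", "merge",
        "--patch", json.dumps(patch),
    ]
```

The `metadata` argument would come from `kubectl get ... -o json` for the stuck CR; the resulting list can be passed to `subprocess.run(argv, check=True)`, for example with `kubectl_patch_argv("my-job", "flink", patch)`. Note that bypassing the finalizer skips the operator's cleanup, so any job still running on the cluster has to be handled separately.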
>
> *Steps to Reproduce:*
> # Deploy a Flink Session Cluster with HA enabled (e.g., Kubernetes HA).
> # Submit a {{FlinkSessionJob}} to the cluster.
> # Induce a JobManager restart in such a way that its HA metadata for the
> running job is lost or not recoverable (e.g., by temporarily clearing the HA
> storage like ConfigMaps before the JobManager fully recovers, or simulating a
> crash where HA data isn't written).
> # The new JobManager should start without recovering the previous job.
> # The {{FlinkSessionJob}} CR may show {{RECONCILING}} as the Operator tries
> to manage the desynchronized job.
> # Attempt to delete the {{FlinkSessionJob}} CR using {{kubectl delete}}.
> # Observe the Operator logs for exceptions during finalization and the
> {{FlinkSessionJob}} CR getting stuck in the {{Terminating}} state.
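To confirm the final reproduction step without watching the CR by hand, one can check whether it carries a `deletionTimestamp` while the operator's finalizer is still present. The following is a small sketch under stated assumptions: `is_stuck_terminating` is a hypothetical helper, the five-minute grace period is an arbitrary choice, and the input is the CR as a dictionary (for example parsed from `kubectl get ... -o json`).

```python
from datetime import datetime, timedelta, timezone
from typing import Any, Dict

# Finalizer name as quoted in this report.
OPERATOR_FINALIZER = "flinksessionjobs.flink.apache.org/finalizer"


def is_stuck_terminating(
    cr: Dict[str, Any],
    now: datetime,
    grace: timedelta = timedelta(minutes=5),  # arbitrary threshold (assumption)
) -> bool:
    """True if deletion was requested but the operator's finalizer remains."""
    metadata = cr.get("metadata", {})
    deleted_at = metadata.get("deletionTimestamp")
    if deleted_at is None:
        return False  # deletion was never requested
    if OPERATOR_FINALIZER not in (metadata.get("finalizers") or []):
        return False  # finalizer already removed; deletion can complete
    # Kubernetes serializes timestamps in UTC, e.g. 2025-05-08T10:00:00Z.
    deleted = datetime.strptime(deleted_at, "%Y-%m-%dT%H:%M:%SZ")
    deleted = deleted.replace(tzinfo=timezone.utc)
    return now - deleted > grace
```

Calling it with `now=datetime.now(timezone.utc)` after issuing the delete gives a simple pass/fail signal for the reproduction.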