[ 
https://issues.apache.org/jira/browse/FLINK-37766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kartikey Pant updated FLINK-37766:
----------------------------------
    Environment: apache/flink-kubernetes-operator:1.10.0, apache/flink:1.20.1, 
minikube version: v1.35.0  (was: Flink Kubernetes Operator Image: 
apache/flink-kubernetes-operator:1.10.0

Flink Image: apache/flink:1.20.1

Kubernetes: minikube version: v1.35.0)

> FlinkSessionJob deletion blocked by finalizer when Flink job already 
> terminal/missing due to HA desync
> ------------------------------------------------------------------------------------------------------
>
>                 Key: FLINK-37766
>                 URL: https://issues.apache.org/jira/browse/FLINK-37766
>             Project: Flink
>          Issue Type: Bug
>          Components: Kubernetes Operator
>    Affects Versions: 1.20.1
>         Environment: apache/flink-kubernetes-operator:1.10.0, 
> apache/flink:1.20.1, minikube version: v1.35.0
>            Reporter: Kartikey Pant
>            Priority: Major
>
> We've encountered an issue where {{FlinkSessionJob}} custom resources become 
> stuck in a {{Terminating}} state when deleted via {{kubectl delete}}. 
> This occurs after a desynchronization between the Flink Kubernetes Operator 
> and the Flink JobManager, typically initiated by a JobManager restart where 
> its High Availability (HA) mechanism fails to recover the state of the 
> pre-existing job.
> The sequence of events leading to the problem is as follows:
>  # A Flink JobManager pod for an active session cluster restarts.
>  # Upon restart, the JobManager's HA recovery fails to load the state of 
> previously running jobs. JobManager logs indicate this with messages like: 
> {{Retrieved job ids [] from KubernetesStateHandleStore...}}.
>  # This creates a desynchronization:
>  ** The Flink Operator (via the {{FlinkSessionJob}} CR status) still holds 
> information about the original Flink JobID and its last known 
> state/savepoint. It attempts to reconcile this job.
>  ** The newly started Flink JobManager has no internal record of this 
> specific job instance from its HA recovery.
>  # The {{FlinkSessionJob}} CR status often remains {{RECONCILING}} as the 
> Operator tries to manage a job the current JobManager doesn't recognize from 
> its HA state.
>  # When {{kubectl delete FlinkSessionJob <job-name>}} is issued, the 
> Operator's finalizer ({{flinksessionjobs.flink.apache.org/finalizer}}) 
> logic is triggered.
>  # The Operator attempts to cancel the Flink job via the JobManager's REST 
> API using the JobID from the CR status.
>  # The Flink JobManager, which either doesn't know the job or has internally 
> marked it as {{FAILED}} due to the ongoing reconciliation attempts for a 
> desynchronized job, responds with an error to the cancellation request. 
> JobManager logs show: {{Job cancellation failed because the job has already 
> reached another terminal state (FAILED).}}
>  # The Flink Kubernetes Operator's REST client logic or the finalizer's error 
> handling does not gracefully process this specific "already FAILED" (or 
> potentially "not found") response. An exception occurs within the Operator 
> (visible in Operator logs, often involving {{RestClient.parseResponse}} or 
> {{CompletableFuture.completeExceptionally}}).
>  # Due to this unhandled exception in the finalizer logic, the Operator fails 
> to remove its finalizer from the {{FlinkSessionJob}} CR.
>  # Consequently, the {{FlinkSessionJob}} CR remains stuck in the 
> {{Terminating}} state indefinitely.
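
Steps 6 to 9 suggest that the finalizer could treat some cancellation failures as "nothing left to cancel" rather than as errors. The following is a minimal sketch of that classification in Python, not the Operator's actual Java code. `CancelResult` and `safe_to_remove_finalizer` are hypothetical names. Only the terminal-state message comes from the log line quoted above; the 404 check and the "not found" substring are assumptions about how an unknown job would be reported.

```python
from dataclasses import dataclass

# The first marker is taken from the JobManager log line quoted in the report.
# The second is an assumed substring for the "job unknown" case.
BENIGN_MARKERS = (
    "already reached another terminal state",
    "not found",
)


@dataclass
class CancelResult:
    """Hypothetical summary of a cancel request's outcome."""
    http_status: int
    message: str = ""


def safe_to_remove_finalizer(result: CancelResult) -> bool:
    """Return True when cleanup can proceed even though cancel did not succeed."""
    if 200 <= result.http_status < 300:
        return True  # the cancellation request was accepted
    if result.http_status == 404:
        return True  # assumption: the JobManager has no record of the job
    text = result.message.lower()
    return any(marker in text for marker in BENIGN_MARKERS)
```

Under this sketch, any other failure (for example a 500 with an unrelated message) still blocks finalizer removal, so genuine cancellation errors are not silently dropped.
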
> The only workaround is to manually edit the {{FlinkSessionJob}} CR and remove 
> the finalizer, allowing Kubernetes to complete the deletion.
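
As an illustration of the workaround, the sketch below builds a JSON merge patch that drops only the Operator's finalizer and leaves any others in place, then assembles the matching `kubectl patch` arguments. It executes nothing. The function names, the job name, and the namespace are hypothetical; the finalizer string is the one named in the report.

```python
import json

OPERATOR_FINALIZER = "flinksessionjobs.flink.apache.org/finalizer"


def finalizer_removal_patch(current_finalizers):
    """Build a JSON merge patch that drops only the Operator's finalizer."""
    remaining = [f for f in current_finalizers if f != OPERATOR_FINALIZER]
    # A merge patch replaces the whole list, so the survivors are sent back.
    return {"metadata": {"finalizers": remaining}}


def kubectl_patch_args(job_name, namespace, current_finalizers):
    """Return an argv list for `kubectl patch`; the command is not run here."""
    patch = finalizer_removal_patch(current_finalizers)
    return [
        "kubectl", "patch", "flinksessionjob", job_name,
        "-n", namespace,
        "--type=merge",
        "-p", json.dumps(patch),
    ]
```

For example, `kubectl_patch_args("my-job", "flink", [OPERATOR_FINALIZER])` yields a command whose patch body sets `metadata.finalizers` to an empty list. Removing the finalizer by hand skips the Operator's cleanup, so it is only appropriate once the Flink job is known to be gone.
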
>  
> *Steps to Reproduce:*
>  # Deploy a Flink Session Cluster with HA enabled (e.g., Kubernetes HA).
>  # Submit a {{FlinkSessionJob}} to the cluster.
>  # Induce a JobManager restart in such a way that its HA metadata for the 
> running job is lost or not recoverable (e.g., by temporarily clearing the HA 
> storage like ConfigMaps before the JobManager fully recovers, or simulating a 
> crash where HA data isn't written).
>  # The new JobManager should start without recovering the previous job.
>  # The {{FlinkSessionJob}} CR may show {{RECONCILING}} as the Operator tries 
> to manage the desynchronized job.
>  # Attempt to delete the {{FlinkSessionJob}} CR using {{kubectl delete}}.
>  # Observe the Operator logs for exceptions during finalization and the 
> {{FlinkSessionJob}} CR getting stuck in the {{Terminating}} state.
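
To confirm the last reproduction step, one could inspect the CR as JSON (for instance from `kubectl get ... -o json`) and check the two metadata fields Kubernetes uses for finalization: a set `deletionTimestamp` and a finalizer that is still present. The sketch below assumes the resource has already been parsed into a dict. `is_pending_finalization` is a hypothetical helper, and on its own it cannot tell a stuck deletion from one that is still in progress.

```python
OPERATOR_FINALIZER = "flinksessionjobs.flink.apache.org/finalizer"


def is_pending_finalization(resource: dict) -> bool:
    """True if deletion was requested but the Operator's finalizer remains."""
    metadata = resource.get("metadata") or {}
    deletion_requested = metadata.get("deletionTimestamp") is not None
    # `finalizers` may be absent or null once all finalizers are removed.
    finalizers = metadata.get("finalizers") or []
    return deletion_requested and OPERATOR_FINALIZER in finalizers
```

Running the check repeatedly and comparing against `deletionTimestamp` would be one way to decide that the deletion is actually stuck.
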



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
