[jira] [Resolved] (FLINK-39618) FlinkDeployment deletion deadlocks when FlinkSessionJobs are running with default block-on-* options

Dennis-Mircea Ciupitu (Jira) Tue, 19 May 2026 04:17:19 -0700


     [ 
https://issues.apache.org/jira/browse/FLINK-39618?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Dennis-Mircea Ciupitu resolved FLINK-39618.
-------------------------------------------
    Resolution: Fixed

> FlinkDeployment deletion deadlocks when FlinkSessionJobs are running with 
> default block-on-* options
> ----------------------------------------------------------------------------------------------------
>
>                 Key: FLINK-39618
>                 URL: https://issues.apache.org/jira/browse/FLINK-39618
>             Project: Flink
>          Issue Type: Bug
>          Components: Kubernetes Operator
>    Affects Versions: kubernetes-operator-1.14.0
>            Reporter: Dennis-Mircea Ciupitu
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: kubernetes-operator-1.15.0
>
>
> h1. Summary
> {{FlinkDeployment}} deletion deadlocks when {{FlinkSessionJob}}s are running 
> and both {{block-on-session-jobs}} and {{block-on-unmanaged-jobs}} are 
> enabled (the defaults).
> h1. Symptom
> After {{kubectl delete flinkdeployment <name>}} against a session-mode 
> deployment that has {{FlinkSessionJob}} resources attached, the deployment 
> stays stuck indefinitely in {{LIFECYCLE STATE: DELETING}}. To recover, the 
> user must either manually cancel the underlying Flink jobs (e.g. via the JM 
> REST API) or flip a config flag and restart the operator.
> h1. Reproduction
> With the operator's default configuration:
> # Apply a session {{FlinkDeployment}} and one or more {{FlinkSessionJob}} 
> resources targeting it.
> # Wait for the jobs to reach {{RUNNING}}.
> # Run {{kubectl delete flinkdeployment <name>}} (without first deleting the 
> {{FlinkSessionJob}} CRs).
> # Run {{kubectl delete flinksessionjob <names...>}} to satisfy the operator's 
> complaint about "session jobs should be deleted first".
> # Observe: the {{FlinkSessionJob}} CRs are gone from the API server, but the 
> {{FlinkDeployment}} stays in {{DELETING}} forever, and the operator log keeps 
> emitting:
> {noformat}
> Event[Job] | Warning | CLEANUPFAILED | The session cluster has non terminated 
> jobs [<jobIds>] that should be cancelled first
> {noformat}
> h1. Root cause
> {{SessionJobReconciler.cleanupInternal}} (introduced in FLINK-39271, 1.15) 
> skips job cancellation when the session cluster is itself being deleted, on 
> the assumption that the cluster will tear itself down and the jobs will die 
> with it:
> {code:java}
> if (sessionLifecycleState == ResourceLifecycleState.DELETING
>         || sessionLifecycleState == ResourceLifecycleState.DELETED) {
>     LOG.info("Session cluster is being deleted, skipping job cancellation");
>     return DeleteControl.defaultDelete();
> }
> {code}
> That assumption is invalidated by 
> {{kubernetes.operator.session.deletion.block-on-unmanaged-jobs}} (introduced 
> earlier in FLINK-28648, 1.13), which makes 
> {{SessionReconciler.cleanupInternal}} poll the JobManager REST API and refuse 
> to remove the {{FlinkDeployment}} finalizer while any non-terminal Flink jobs 
> exist on the cluster. Because {{SessionJobReconciler}} skipped the cancel, 
> those jobs are still running, so the finalizer is held forever. The 
> session-job CRs are already gone, so there is no controller left that will 
> ever issue the cancel. Both options default to {{true}}, so the deadlock 
> occurs out of the box.
> h1. Workarounds (until fixed)
> * Cancel the jobs directly on the JM:
> {code:bash}
> kubectl port-forward svc/<rest-svc> 8081:8081
> curl -X PATCH "http://localhost:8081/jobs/<jobId>?mode=cancel"
> {code}
> The next reconcile will then release the finalizer.
> * Or set {{kubernetes.operator.session.deletion.block-on-unmanaged-jobs: 
> "false"}} in the operator ConfigMap and restart the operator pod (less 
> surgical, weakens an unrelated safety guard).
> h1. Suggested fix
> Gate the bypass on {{BLOCK_ON_SESSION_JOBS}} so that when the user has opted 
> into strong delete-ordering guarantees the Flink job is still cancelled 
> explicitly. This restores 1.14 behaviour for the deadlocking case while 
> preserving the optimization for users who have opted out.
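
The gating described under "Suggested fix" can be sketched roughly as follows. This is a self-contained illustration, not the actual patch: the class, enum, and method names are hypothetical stand-ins, the boolean parameter stands in for the {{BLOCK_ON_SESSION_JOBS}} option, and only the DELETING/DELETED check is taken from the snippet quoted in the report.

```java
public class SessionJobCleanupSketch {

    /** Simplified local stand-in for the operator's lifecycle states (assumption). */
    public enum LifecycleState { STABLE, DELETING, DELETED }

    /** Hypothetical labels for the two outcomes of session-job cleanup. */
    public enum CleanupAction { SKIP_CANCEL_AND_DELETE, CANCEL_JOB_THEN_DELETE }

    /**
     * Decides whether session-job cleanup may skip cancelling the Flink job.
     * The bypass is only taken when the session cluster is going away AND the
     * user has opted out of blocking on session jobs.
     */
    public static CleanupAction decide(LifecycleState sessionState, boolean blockOnSessionJobs) {
        boolean sessionGoingAway =
                sessionState == LifecycleState.DELETING
                        || sessionState == LifecycleState.DELETED;
        if (sessionGoingAway && !blockOnSessionJobs) {
            // Opted out: rely on cluster teardown to kill the jobs.
            return CleanupAction.SKIP_CANCEL_AND_DELETE;
        }
        // Default path: cancel explicitly so the session cluster's
        // non-terminal-jobs check can eventually pass.
        return CleanupAction.CANCEL_JOB_THEN_DELETE;
    }

    public static void main(String[] args) {
        System.out.println(decide(LifecycleState.DELETING, true));
        System.out.println(decide(LifecycleState.DELETING, false));
        System.out.println(decide(LifecycleState.STABLE, false));
    }
}
```

With the option enabled (the default per the report), a deleting session cluster no longer short-circuits the cancel, so the jobs that the {{block-on-unmanaged-jobs}} check is waiting on do get terminated.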



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
