[
https://issues.apache.org/jira/browse/FLINK-39618?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Dennis-Mircea Ciupitu resolved FLINK-39618.
-------------------------------------------
Resolution: Fixed
> FlinkDeployment deletion deadlocks when FlinkSessionJobs are running with
> default block-on-* options
> ----------------------------------------------------------------------------------------------------
>
> Key: FLINK-39618
> URL: https://issues.apache.org/jira/browse/FLINK-39618
> Project: Flink
> Issue Type: Bug
> Components: Kubernetes Operator
> Affects Versions: kubernetes-operator-1.14.0
> Reporter: Dennis-Mircea Ciupitu
> Priority: Major
> Labels: pull-request-available
> Fix For: kubernetes-operator-1.15.0
>
>
> h1. Summary
> {{FlinkDeployment}} deletion deadlocks when {{FlinkSessionJob}}s are running
> and both {{block-on-session-jobs}} and {{block-on-unmanaged-jobs}} are
> enabled (the defaults).
> h1. Symptom
> After {{kubectl delete flinkdeployment <name>}} against a session-mode
> deployment that has {{FlinkSessionJob}} resources attached, the deployment
> stays in {{LIFECYCLE STATE: DELETING}} indefinitely. To recover, the user
> must either cancel the underlying Flink jobs manually (e.g. via the JM REST
> API) or flip a config flag and restart the operator.
> h1. Reproduction
> With the operator's default configuration:
> # Apply a session {{FlinkDeployment}} and one or more {{FlinkSessionJob}}
> resources targeting it.
> # Wait for the jobs to reach {{RUNNING}}.
> # Run {{kubectl delete flinkdeployment <name>}} (without first deleting the
> {{FlinkSessionJob}} CRs).
> # Run {{kubectl delete flinksessionjob <names...>}} to satisfy the operator's
> complaint about "session jobs should be deleted first".
> # Observe: the {{FlinkSessionJob}} CRs are gone from the API server, but the
> {{FlinkDeployment}} stays in {{DELETING}} forever, and the operator log keeps
> emitting:
> {noformat}
> Event[Job] | Warning | CLEANUPFAILED | The session cluster has non terminated
> jobs [<jobIds>] that should be cancelled first
> {noformat}
> h1. Root cause
> {{SessionJobReconciler.cleanupInternal}} (introduced in FLINK-39271, 1.15)
> takes a "skip cancellation when the cluster is being deleted" bypass on the
> assumption that the cluster will tear itself down and the jobs will die with
> it:
> {code:java}
> if (sessionLifecycleState == ResourceLifecycleState.DELETING
> || sessionLifecycleState == ResourceLifecycleState.DELETED) {
> LOG.info("Session cluster is being deleted, skipping job cancellation");
> return DeleteControl.defaultDelete();
> }
> {code}
> That assumption is invalidated by
> {{kubernetes.operator.session.deletion.block-on-unmanaged-jobs}} (introduced
> earlier in FLINK-28648, 1.13), which makes
> {{SessionReconciler.cleanupInternal}} poll the JobManager REST API and refuse
> to remove the {{FlinkDeployment}} finalizer while any non-terminal Flink jobs
> exist on the cluster. Because {{SessionJobReconciler}} skipped the cancel,
> those jobs are still running, so the finalizer is held forever. The
> session-job CRs are already gone, so there is no controller left that will
> ever issue the cancel. Both options default to {{true}}, so the deadlock
> occurs out of the box.
> h1. Workarounds (until fixed)
> * Cancel the jobs directly on the JM:
> {code:bash}
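> # Terminal 1: forward the JobManager REST port (this command blocks).
> kubectl port-forward svc/<rest-svc> 8081:8081
> # Terminal 2: cancel each non-terminal job by its ID.
> curl -X PATCH "http://localhost:8081/jobs/<jobId>?mode=cancel"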
> {code}
> The next reconcile will then release the finalizer.
> * Or set {{kubernetes.operator.session.deletion.block-on-unmanaged-jobs:
> "false"}} in the operator ConfigMap and restart the operator pod (less
> surgical, weakens an unrelated safety guard).
> h1. Suggested fix
> Gate the bypass on {{BLOCK_ON_SESSION_JOBS}}: take it only when that option
> is disabled. When the user has opted into strong delete-ordering guarantees
> (the default), the Flink job is then still cancelled explicitly. This
> restores 1.14 behaviour for the deadlocking case while preserving the
> optimization for users who have opted out.
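
The gating described under "Suggested fix" can be sketched as a small decision function. This is an illustrative model only, not the operator's code: the class, the two enums, and the {{decide}} method are hypothetical stand-ins, and the boolean parameter stands for the resolved value of {{BLOCK_ON_SESSION_JOBS}}.

```java
// Hypothetical model of the gated bypass; none of these types are the operator's own.
class CleanupGateSketch {

    // Stand-in for the session cluster's lifecycle state (only the values needed here).
    enum LifecycleState { STABLE, DELETING, DELETED }

    // Stand-in for what session job cleanup does with the underlying Flink job.
    enum CleanupAction { CANCEL_JOB, SKIP_CANCELLATION }

    // Skip cancellation only if the cluster is going away AND the user opted out
    // of blocking on session jobs. In every other case, cancel explicitly.
    static CleanupAction decide(LifecycleState sessionState, boolean blockOnSessionJobs) {
        boolean clusterGoingAway =
                sessionState == LifecycleState.DELETING
                        || sessionState == LifecycleState.DELETED;
        if (clusterGoingAway && !blockOnSessionJobs) {
            return CleanupAction.SKIP_CANCELLATION;
        }
        return CleanupAction.CANCEL_JOB;
    }

    public static void main(String[] args) {
        // Print the decision for every combination of state and option value.
        for (LifecycleState state : LifecycleState.values()) {
            for (boolean block : new boolean[] {true, false}) {
                System.out.println(
                        state + ", blockOnSessionJobs=" + block
                                + " -> " + decide(state, block));
            }
        }
    }
}
```

With the option enabled, a cluster in {{DELETING}} still yields an explicit cancel, so no non-terminal jobs remain for the unmanaged-jobs check to block on. With the option disabled, the bypass is kept.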