ASF GitHub Bot updated FLINK-39970:
-----------------------------------
Labels: pull-request-available (was: )
> Kubernetes Operator proceeds with cluster resubmission after Deployment
> deletion wait timeout, causing AlreadyExists / object-is-being-deleted race
> ---------------------------------------------------------------------------
>
> Key: FLINK-39970
> URL: https://issues.apache.org/jira/browse/FLINK-39970
> Project: Flink
> Issue Type: Bug
> Components: Kubernetes Operator
> Affects Versions: 1.14.6
> Reporter: Bowen Li
> Priority: Major
> Labels: pull-request-available
>
> When a Flink job reaches the terminal FAILED state and
> `kubernetes.operator.job.restart.failed=true` is set, the operator deletes the
> existing cluster and resubmits it.
> If foreground deletion of the old JobManager Deployment takes longer than
> `kubernetes.operator.resource.cleanup.timeout`,
> `AbstractFlinkService.deleteBlocking()` catches the non-404
> `KubernetesClientException` thrown by `waitUntilCondition()`, logs it, and
> returns normally.
> The caller then proceeds as if the deletion had completed:
> 1. `deleteClusterDeployment()` updates the status to
> `JobManagerDeploymentStatus.MISSING`.
> 2. The failed-job restart flow calls `resubmitJob()`.
> 3. The operator attempts to create a new Deployment with the same name.
> 4. Kubernetes rejects the request because the old Deployment still exists in
> the Terminating state:
> `AlreadyExists: object is being deleted: deployments.apps "<cluster>" already exists`
> *Expected Behavior*
> A non-404 failure or timeout while waiting for Deployment deletion should
> abort the current reconciliation.
> The operator should not mark the JobManager Deployment as MISSING or attempt
> to resubmit until Kubernetes confirms the old Deployment is gone. A later
> reconciliation can retry deletion/resubmission.
> A 404 should still be treated as a successful deletion.
> *Actual Behavior*
> The deletion wait timeout is logged and swallowed. The same reconciliation
> continues into resubmission while the old Deployment is still being deleted,
> causing `AlreadyExists / object is being deleted`.
> *Impact*
> This causes avoidable restart delays and noisy reconciliation failures. In
> the worst case, it can leave the FlinkDeployment in an error/recovery loop
> that requires manual intervention.
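
The expected behavior described in the issue could be sketched roughly as follows. This is a minimal, self-contained Java illustration under stated assumptions, not the operator's actual code or the proposed fix: `ClientException`, `DeletionNotConfirmedException`, `awaitDeletion`, and `restartFailedJob` are hypothetical names, and neither the real Kubernetes client nor `AbstractFlinkService` is used. It only shows the control flow: a 404 counts as success, while a timeout or any other error aborts the reconciliation before resubmission.

```java
import java.time.Duration;
import java.util.function.BooleanSupplier;

/** Illustrative sketch only; every type here is a hypothetical stand-in. */
class DeletionWaitSketch {

    /** Hypothetical stand-in for a client error that carries an HTTP status code. */
    static class ClientException extends RuntimeException {
        final int code;

        ClientException(int code, String message) {
            super(message);
            this.code = code;
        }
    }

    /** Hypothetical signal: deletion is not confirmed, so the caller must not resubmit. */
    static class DeletionNotConfirmedException extends RuntimeException {
        DeletionNotConfirmedException(String message, Throwable cause) {
            super(message, cause);
        }
    }

    /**
     * Polls isGone until it returns true. A 404 from the probe is treated as
     * "already deleted"; a timeout or any other client error is propagated
     * instead of being logged and swallowed.
     */
    static void awaitDeletion(BooleanSupplier isGone, Duration timeout, Duration poll) {
        long deadline = System.nanoTime() + timeout.toNanos();
        try {
            while (!isGone.getAsBoolean()) {
                if (System.nanoTime() - deadline >= 0) {
                    throw new DeletionNotConfirmedException(
                            "Deployment still present after " + timeout, null);
                }
                Thread.sleep(poll.toMillis());
            }
        } catch (ClientException e) {
            if (e.code == 404) {
                return; // the resource no longer exists, so deletion succeeded
            }
            throw new DeletionNotConfirmedException("Error while waiting for deletion", e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new DeletionNotConfirmedException("Interrupted while waiting", e);
        }
    }

    /**
     * Hypothetical restart flow: resubmission happens only after confirmed
     * deletion; otherwise the current reconciliation ends and a later one retries.
     */
    static String restartFailedJob(BooleanSupplier isGone, Duration timeout) {
        try {
            awaitDeletion(isGone, timeout, Duration.ofMillis(10));
        } catch (DeletionNotConfirmedException e) {
            return "RETRY_LATER"; // do not mark MISSING, do not resubmit
        }
        return "RESUBMITTED"; // safe to mark MISSING and create the new Deployment
    }
}
```

In this sketch, `restartFailedJob(() -> false, Duration.ofMillis(50))` returns `RETRY_LATER` because the simulated Deployment never disappears, whereas a probe that reports the resource as gone (or throws a 404) leads to `RESUBMITTED`.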