Hi all, I'd like to report a bug in the Flink Kubernetes Operator's cluster upgrade path along with a related enhancement idea.

If during a FlinkDeployment upgrade the old cluster deletion times out, the operator proceeds to deploy the new cluster while the old one is still running. This happens because AbstractFlinkService.deleteBlocking() silently swallows KubernetesClientTimeoutException instead of propagating it to the reconciler.

Full details and a draft fix are in:
- Jira: https://issues.apache.org/jira/browse/FLINK-39953
- PR: https://github.com/apache/flink-kubernetes-operator/pull/1138

I'd also like to get the community's opinion on a follow-up enhancement: a force-delete option for when deletion repeatedly times out. The idea is an opt-in config key kubernetes.operator.cluster.force-delete-on-timeout that, as a last resort, removes blocking finalizers and/or issues a DELETE with `gracePeriodSeconds=0` instead of exhausting all JOSDK retries with no progress.

The evident trade-off is that force-deletion can leave dangling resources (orphaned PVs, leaked network attachments), which is why an opt-in approach seems appropriate.

Does the community think this is worth pursuing, or are there other concerns?

Thanks,
Lucas
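
P.S. In case it helps the discussion, here is a minimal, self-contained sketch of the failure mode described above. It is not the operator's code: DeleteTimeoutSketch, DeletionTimeoutException, waitForDeletion and the log strings are all hypothetical names made up for illustration, and the timeout is simulated rather than coming from a real Kubernetes client. It only contrasts a delete step that swallows the timeout with one that lets it reach the caller.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustration only: every name in this class is hypothetical, none of it is operator code. */
public class DeleteTimeoutSketch {

    /** Stand-in for a client-side timeout such as KubernetesClientTimeoutException. */
    static class DeletionTimeoutException extends RuntimeException {
        DeletionTimeoutException(String message) {
            super(message);
        }
    }

    /** Simulates waiting for the old cluster to disappear; in this sketch it always times out. */
    static void waitForDeletion() {
        throw new DeletionTimeoutException("old cluster still present after timeout");
    }

    /** Problematic pattern: the timeout is caught and dropped, so the caller cannot tell it happened. */
    static void deleteSwallowing(List<String> log) {
        try {
            waitForDeletion();
            log.add("old-deleted");
        } catch (DeletionTimeoutException e) {
            log.add("timeout-swallowed");
        }
    }

    /** Alternative pattern: the timeout is allowed to propagate to the caller. */
    static void deletePropagating(List<String> log) {
        waitForDeletion();
        log.add("old-deleted");
    }

    /** Toy "reconciler": delete the old cluster, then deploy the new one unless deletion failed. */
    static List<String> upgrade(boolean propagate) {
        List<String> log = new ArrayList<>();
        try {
            if (propagate) {
                deletePropagating(log);
            } else {
                deleteSwallowing(log);
            }
            log.add("new-deployed");
        } catch (DeletionTimeoutException e) {
            // The deploy step is skipped; a real reconciler would retry later.
            log.add("upgrade-aborted-retry-later");
        }
        return log;
    }

    public static void main(String[] args) {
        System.out.println("swallowing:  " + upgrade(false));
        System.out.println("propagating: " + upgrade(true));
    }
}
```

With the swallowing variant the toy reconciler records "new-deployed" even though the old cluster was never confirmed gone. With the propagating variant it stops before deploying and leaves the retry to the caller.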
