Hi all,

I'd like to report a bug in the Flink Kubernetes Operator's cluster upgrade
path along with a related enhancement idea.

If deletion of the old cluster times out during a FlinkDeployment upgrade,
the operator proceeds to deploy the new cluster while the old one is still
running. This happens because AbstractFlinkService.deleteBlocking()
silently swallows KubernetesClientTimeoutException instead of propagating
it to the reconciler. Full details and a draft fix are in:
- Jira: https://issues.apache.org/jira/browse/FLINK-39953
- PR: https://github.com/apache/flink-kubernetes-operator/pull/1138
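
To illustrate the failure mode, here is a minimal, self-contained sketch. It
is not the operator's code: ClusterDeleter, upgrade() and both
deleteBlocking variants are hypothetical stand-ins, and
java.util.concurrent.TimeoutException stands in for
KubernetesClientTimeoutException so the example runs without any
dependencies. The propagating variant is one possible shape of a fix, not
necessarily what the PR does.

```java
import java.util.concurrent.TimeoutException;
import java.util.function.Consumer;

public class DeleteBlockingSketch {

    /** Hypothetical stand-in for "delete the old cluster and wait until it is gone". */
    interface ClusterDeleter {
        void deleteAndWait() throws TimeoutException;
    }

    /** Behaviour described in the report: the timeout is logged and swallowed. */
    static void deleteBlockingSwallowing(ClusterDeleter deleter) {
        try {
            deleter.deleteAndWait();
        } catch (TimeoutException e) {
            // The caller never learns that the old cluster may still exist.
            System.out.println("WARN: timed out waiting for deletion: " + e.getMessage());
        }
    }

    /** Possible fix: surface the timeout so the caller can abort and retry. */
    static void deleteBlockingPropagating(ClusterDeleter deleter) {
        try {
            deleter.deleteAndWait();
        } catch (TimeoutException e) {
            throw new IllegalStateException("Old cluster deletion timed out", e);
        }
    }

    /**
     * Simplified upgrade step (assumption, not the real reconciler).
     * Returns true if the new cluster would be deployed.
     */
    static boolean upgrade(Consumer<ClusterDeleter> deleteBlocking, ClusterDeleter deleter) {
        try {
            deleteBlocking.accept(deleter);
        } catch (RuntimeException e) {
            return false; // deletion failed: do not deploy, let the reconciler retry
        }
        return true; // deletion reported success: deploy the new cluster
    }

    public static void main(String[] args) {
        ClusterDeleter stuck = () -> {
            throw new TimeoutException("old cluster still terminating");
        };
        // Logs the WARN line, then prints "true": the new cluster is deployed
        // although the old one was never confirmed deleted.
        System.out.println(upgrade(DeleteBlockingSketch::deleteBlockingSwallowing, stuck));
        // Prints "false": the upgrade stops until deletion actually succeeds.
        System.out.println(upgrade(DeleteBlockingSketch::deleteBlockingPropagating, stuck));
    }
}
```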

I'd also like to get the community's opinion on a follow-up enhancement: a
force-delete option for when deletion repeatedly times out. The idea is an
opt-in config key `kubernetes.operator.cluster.force-delete-on-timeout` that,
as a last resort, removes blocking finalizers and/or issues a DELETE with
`gracePeriodSeconds=0` instead of exhausting all JOSDK retries with no
progress.
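
As a rough sketch of how the opt-in could gate the fallback: only the config
key name comes from the proposal above. The DeleteAction enum, the
onDeleteTimeout() helper and the maxGracefulAttempts threshold are
hypothetical names for illustration, and the actual finalizer removal or
zero-grace-period DELETE is deliberately left out.

```java
import java.util.Map;

public class ForceDeleteSketch {

    /** Config key from the proposal; treated as false (disabled) when unset. */
    static final String FORCE_DELETE_KEY =
            "kubernetes.operator.cluster.force-delete-on-timeout";

    /** Hypothetical outcome of a deletion timeout. */
    enum DeleteAction { RETRY_GRACEFUL, FORCE_DELETE }

    /**
     * Decide what to do after a deletion timeout.
     *
     * @param conf                effective operator configuration
     * @param timeoutsSoFar       consecutive deletion timeouts observed so far
     * @param maxGracefulAttempts hypothetical threshold before force-delete is considered
     */
    static DeleteAction onDeleteTimeout(
            Map<String, String> conf, int timeoutsSoFar, int maxGracefulAttempts) {
        boolean forceEnabled =
                Boolean.parseBoolean(conf.getOrDefault(FORCE_DELETE_KEY, "false"));
        // Force-delete is a last resort: it requires both the opt-in flag and
        // repeated graceful attempts that made no progress.
        if (forceEnabled && timeoutsSoFar >= maxGracefulAttempts) {
            return DeleteAction.FORCE_DELETE;
        }
        return DeleteAction.RETRY_GRACEFUL;
    }

    public static void main(String[] args) {
        Map<String, String> optedIn = Map.of(FORCE_DELETE_KEY, "true");
        System.out.println(onDeleteTimeout(Map.of(), 5, 3)); // RETRY_GRACEFUL (not opted in)
        System.out.println(onDeleteTimeout(optedIn, 1, 3));  // RETRY_GRACEFUL (too early)
        System.out.println(onDeleteTimeout(optedIn, 3, 3));  // FORCE_DELETE
    }
}
```

With the key unset, behaviour stays exactly as it is today, which keeps the
risk confined to users who explicitly accept it.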

The evident trade-off is that force-deletion can leave dangling resources
(orphaned PVs, leaked network attachments), which is why an opt-in approach
seems appropriate. Does the community think this is worth pursuing, or are
there other concerns?

Thanks,
Lucas
