[jira] [Created] (FLINK-32774) Reconciliation for autoscaling overrides gets stuck after cancel-with-savepoint

Maximilian Michels (Jira) Mon, 07 Aug 2023 10:36:06 -0700

Maximilian Michels created FLINK-32774:
------------------------------------------

Summary: Reconciliation for autoscaling overrides gets stuck after
cancel-with-savepoint
Key: FLINK-32774
URL: https://issues.apache.org/jira/browse/FLINK-32774
Project: Flink
Issue Type: Bug
Components: Autoscaler, Kubernetes Operator
Affects Versions: kubernetes-operator-1.6.0
Reporter: Maximilian Michels
Assignee: Maximilian Michels

Since https://issues.apache.org/jira/browse/FLINK-32589 the operator no longer
relies on the Flink configuration to store the parallelism overrides. Instead,
it stores them internally in the autoscaler config map. When scaling without
the rescaling API, the operator changes the spec on the fly during
reconciliation and adds the parallelism overrides.
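
To make the "changed on the fly" step concrete, here is a minimal Java sketch of
merging stored overrides into a copy of the user-provided configuration. It is
not the operator's code: the class and method names are hypothetical, and using
the Flink option key pipeline.jobvertex-parallelism-overrides with a
"vertexId:parallelism" encoding as the merge target is an assumption made for
illustration.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

public class OverrideMergeSketch {

    // Assumed target key for the merged overrides (illustration only).
    static final String OVERRIDES_KEY = "pipeline.jobvertex-parallelism-overrides";

    // Returns a copy of the user configuration with the overrides added.
    // The user-provided map itself is never modified, mirroring the idea that
    // the stored spec stays untouched and only the in-memory spec changes.
    static Map<String, String> applyOverrides(
            Map<String, String> userConf, Map<String, Integer> overrides) {
        Map<String, String> effective = new HashMap<>(userConf);
        if (!overrides.isEmpty()) {
            // Sort by vertex id so the encoded value is deterministic.
            String encoded = new TreeMap<>(overrides).entrySet().stream()
                    .map(e -> e.getKey() + ":" + e.getValue())
                    .collect(Collectors.joining(","));
            effective.put(OVERRIDES_KEY, encoded);
        }
        return effective;
    }

    public static void main(String[] args) {
        Map<String, String> userConf = Map.of("parallelism.default", "1");
        Map<String, Integer> overrides = Map.of("vertex-b", 4, "vertex-a", 2);
        Map<String, String> effective = applyOverrides(userConf, overrides);
        // prints "vertex-a:2,vertex-b:4"
        System.out.println(effective.get(OVERRIDES_KEY));
    }
}
```

Because the merge happens only on the in-memory copy, nothing in the stored
resource changes, which matters for the generation check discussed below.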

Unfortunately, this leads to the cluster getting stuck with the job in
FINISHED state after taking a savepoint for upgrade. The operator assumes that
the new cluster got deployed successfully and goes into DEPLOYED state again.

Log flow (from oldest to newest):
# Rescheduling new reconciliation immediately to execute scaling operation.
# Upgrading/Restarting running job, suspending first...
# Job is in running state, ready for upgrade with SAVEPOINT
# Suspending existing deployment.
# Suspending job with savepoint.
# Job successfully suspended with savepoint
# The resource is being upgraded
# Pending upgrade is already deployed, updating status.
# Observing JobManager deployment. Previous status: DEPLOYING
# JobManager deployment port is ready, waiting for the Flink REST API...
# DEPLOYED The resource is deployed/submitted to Kubernetes, but it’s not yet
considered to be stable and might be rolled back in the future

It appears the issue might be in step (8) of the log flow above ("Pending
upgrade is already deployed, updating status."):
[https://github.com/apache/flink-kubernetes-operator/blob/c09671c5c51277c266b8c45d493317d3be1324c0/flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/observer/deployment/AbstractFlinkDeploymentObserver.java#L260]
because the generation id is not changed by a mere parallelism override
change.
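
The suspected failure mode can be illustrated with a minimal Java sketch. It is
not the operator's code: the class GenerationCheckSketch, the Spec type, and
both method names are hypothetical, and the real check at the linked line may
differ. The sketch assumes that the observer treats an upgrade as already
deployed when the generation recorded on the deployment equals the generation
of the spec being reconciled, and that the generation only advances when the
stored spec changes, not when overrides are merged in memory.

```java
import java.util.Map;

public class GenerationCheckSketch {

    // Hypothetical stand-in for a reconciled spec: the resource generation
    // plus the parallelism overrides merged in during reconciliation.
    static final class Spec {
        final long generation;
        final Map<String, Integer> parallelismOverrides;

        Spec(long generation, Map<String, Integer> parallelismOverrides) {
            this.generation = generation;
            this.parallelismOverrides = parallelismOverrides;
        }
    }

    // Assumed check: equal generations are read as "upgrade already deployed".
    static boolean looksAlreadyDeployed(Spec deployed, Spec target) {
        return deployed.generation == target.generation;
    }

    // One conceivable stricter variant (an assumption, not the actual fix):
    // also require the overrides to match.
    static boolean looksAlreadyDeployedWithOverrides(Spec deployed, Spec target) {
        return deployed.generation == target.generation
                && deployed.parallelismOverrides.equals(target.parallelismOverrides);
    }

    public static void main(String[] args) {
        // Same generation, different overrides: a pure autoscaler rescale.
        Spec deployed = new Spec(3L, Map.of("vertex-a", 2));
        Spec target = new Spec(3L, Map.of("vertex-a", 4));

        // prints "true": the pending rescale is mistaken for a finished upgrade
        System.out.println(looksAlreadyDeployed(deployed, target));
        // prints "false": the override change is detected
        System.out.println(looksAlreadyDeployedWithOverrides(deployed, target));
    }
}
```

Under these assumptions, the generation-only check returns true even though the
old job was just suspended with a savepoint, which would match the observed jump
back to DEPLOYED without a new cluster being started.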

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
