Hi Flink Community,

I am currently running the Apache Flink Kubernetes Operator on our Kubernetes
clusters, and my Flink applications are deployed using FlinkDeployment
custom resources (CRs). I am trying to automate rollbacks and am running
into some issues.

I was testing a bad deployment in which the JobManager never becomes
healthy. I simulated it by building a Flink image with a bug in it. The
operator logs show that the JobManager is unhealthy:

2023-10-02 22:14:34,874 o.a.f.k.o.r.d.AbstractFlinkResourceReconciler [INFO
][flink-testing-service/flink-testing-service] UPGRADE change(s) detected
(FlinkDeploymentSpec[job.entryClass=com.robinhood.flink.chaos.StreamingSumByKeyJo]
differs from
FlinkDeploymentSpec[job.entryClass=com.robinhood.flink.chaos.StreamingSumByKeyJob]),
starting reconciliation.
...
2023-10-02 22:15:09,001 o.a.f.k.o.l.AuditUtils [INFO
][flink-testing-service/flink-testing-service] >>> Status | Info |
UPGRADING | The resource is being upgraded
...
2023-10-02 22:17:23,911 o.a.f.k.o.l.AuditUtils [INFO
][flink-testing-service/flink-testing-service] >>> Status | Error |
DEPLOYED |
{"type":"org.apache.flink.kubernetes.operator.exception.DeploymentFailedException","message":"back-off
20s restarting failed container=flink-main-container
pod=flink-testing-service-749dd97c75-4w9ps_flink-testing-service(6db1adb3-4ca4-4924-a8c3-57a417818d85)","additionalMetadata":{"reason":"CrashLoopBackOff"},"throwableList":[]}

...
2023-10-02 22:17:33,576 o.a.f.k.o.o.d.ApplicationObserver
[INFO ][flink-testing-service/flink-testing-service] Observing
JobManager deployment. Previous status: ERROR


Next, I changed the spec of the FlinkDeployment so that it uses a healthy
Flink image. The operator logs show that it detected the spec change:

2023-10-02 22:45:37,445 o.a.f.k.o.l.AuditUtils [INFO
][flink-testing-service/flink-testing-service] >>> Event | Info |
SPECCHANGED | UPGRADE change(s) detected
(FlinkDeploymentSpec[job.entryClass=com.robinhood.flink.chaos.StreamingSumByKeyJob,job.initialSavepointPath=s3a://robinhood-dev-core-flink-states/flink-testing-service/savepoints/savepoint-329a14-2f8264206b1d]
differs from
FlinkDeploymentSpec[job.entryClass=com.robinhood.flink.chaos.StreamingSumByKeyJo,job.initialSavepointPath=s3a://robinhood-dev-core-flink-states/flink-testing-service/savepoints/savepoint-dc1077-134923759e30]),
starting reconciliation.


However, the operator cannot reconcile this spec change, and the
JobManager keeps failing because it is still running the bad Flink image:

2023-10-02 22:45:37,461 o.a.f.k.o.l.AuditUtils [INFO
][flink-testing-service/flink-testing-service] >>> Event | Warning |
UPGRADEFAILED | JobManager deployment is missing and HA data is not
available to make stateful upgrades. It is possible that the job has
finished or terminally failed, or the configmaps have been deleted. Manual
restore required.
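
For context on the automation side, here is a minimal sketch of how a
rollback script might pick these audit events out of the operator log. It
assumes only the ">>> Kind | Level | Reason | message" layout visible in
the AuditUtils lines quoted in this mail; the function names and the regex
are hypothetical and are not part of the operator.

```python
import re

# Layout assumed from the AuditUtils lines quoted in this mail:
#   ... >>> <Kind> | <Level> | <Reason> | <message>
AUDIT_RE = re.compile(
    r">>>\s*(?P<kind>\w+)\s*\|\s*(?P<level>\w+)\s*\|\s*(?P<reason>\w+)\s*\|\s*(?P<message>.*)"
)


def parse_audit_line(line):
    """Return a dict with kind/level/reason/message, or None if no match."""
    # Long log lines may arrive wrapped; collapse all whitespace first.
    flat = " ".join(line.split())
    match = AUDIT_RE.search(flat)
    return match.groupdict() if match else None


def needs_manual_restore(line):
    """True if the line is an UPGRADEFAILED event (hypothetical helper)."""
    parsed = parse_audit_line(line)
    return (
        parsed is not None
        and parsed["kind"] == "Event"
        and parsed["reason"] == "UPGRADEFAILED"
    )
```

A script could call needs_manual_restore() on each audit line and alert
when it returns True. This only detects the failure; it does not recover
from it.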

I can simply delete this FlinkDeployment and redeploy with the healthy
Flink image, but I would like to avoid manual restores if possible. Is it
possible to recover by just changing the FlinkDeployment spec?

Thanks,
Tony

-- 

<http://www.robinhood.com/>

Tony Chen

Software Engineer

Menlo Park, CA

