Re: Rolling back a bad deployment of FlinkDeployment on kubernetes

Tony Chen Thu, 05 Oct 2023 08:28:19 -0700

I tried this out with operator version 1.4 and it didn't work for me. I
noticed that when I was deploying a bad version, the Kubernetes HA metadata
and configmaps were deleted:


2023-10-05 14:52:17,493 o.a.f.k.o.l.AuditUtils
[INFO ][flink-testing-service/flink-testing-service] >>> Event | Info |
SPECCHANGED | UPGRADE change(s) detected
(FlinkDeploymentSpec[job.entryClass=com.robinhood.flink.chaos.StreamingSumByKeyJo,job.initialSavepointPath=s3a://robinhood-prod-flink/flink-testing-service/savepoints/savepoint-b832ef-05b185cb5800]
differs from
FlinkDeploymentSpec[job.entryClass=com.robinhood.flink.chaos.StreamingSumByKeyJob,job.initialSavepointPath=<null>]),
starting reconciliation.
...
2023-10-05 14:52:51,054 o.a.f.k.o.s.AbstractFlinkService
[INFO ][flink-testing-service/flink-testing-service] Cluster shutdown
completed.
2023-10-05 14:52:51,054 o.a.f.k.o.s.AbstractFlinkService
[INFO ][flink-testing-service/flink-testing-service] Deleting
Kubernetes HA metadata
2023-10-05 14:52:51,196 o.a.f.k.o.l.AuditUtils
[INFO ][flink-testing-service/flink-testing-service] >>> Status | Info |
UPGRADING | The resource is being upgraded



Eventually, the rollback fails because the HA metadata is missing:

2023-10-05 14:58:16,119 o.a.f.k.o.r.d.AbstractFlinkResourceReconciler
[WARN ][flink-testing-service/flink-testing-service] Rollback is not possible due
to missing HA metadata



Besides setting kubernetes.operator.deployment.rollback.enabled: true, is
there anything else that I need to configure?
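
For context, a minimal sketch of where this setting lives in the FlinkDeployment spec. The HA storage directory is a hypothetical placeholder and the readiness timeout value is only illustrative. Whether these settings are sufficient on operator 1.4 is exactly the open question; the warning above only indicates that rollback depends on HA metadata being present.

```yaml
apiVersion: flink.apache.org/v1beta1
kind: FlinkDeployment
metadata:
  name: flink-testing-service
  namespace: flink-testing-service
spec:
  flinkConfiguration:
    # Experimental rollback feature (see the docs link in the quoted message below).
    kubernetes.operator.deployment.rollback.enabled: "true"
    # How long the operator waits for a deployment to become stable
    # before rolling back (illustrative value).
    kubernetes.operator.deployment.readiness.timeout: "5 min"
    # Kubernetes HA; the exact key name may differ between Flink versions.
    high-availability: kubernetes
    # Hypothetical placeholder path, not taken from this deployment.
    high-availability.storageDir: s3a://<bucket>/<path>/ha
```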

On Thu, Oct 5, 2023 at 10:35 AM Tony Chen <tony.ch...@robinhood.com> wrote:

> I just saw this experimental feature in the documentation:
> https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-main/docs/custom-resource/job-management/#application-upgrade-rollbacks-experimental
>
> I'm guessing this is the only way to automate rollbacks for now.
>
> On Wed, Oct 4, 2023 at 3:25 PM Tony Chen <tony.ch...@robinhood.com> wrote:
>
>> Hi Flink Community,
>>
>> I am currently running Apache flink-kubernetes-operator on our kubernetes
>> clusters, and I have Flink applications that are deployed using the
>> FlinkDeployment Custom Resources (CR). I am trying to automate the process
>> of rollbacks and I am running into some issues.
>>
>> I was testing out a bad deployment where the jobmanager never becomes
>> healthy. I simulated this bad deployment by creating a Flink image with a
>> bug in it. I see in the operator logs that the jobmanager is unhealthy:
>>
>> 2023-10-02 22:14:34,874 o.a.f.k.o.r.d.AbstractFlinkResourceReconciler
>> [INFO ][flink-testing-service/flink-testing-service] UPGRADE change(s) detected
>> (FlinkDeploymentSpec[job.entryClass=com.robinhood.flink.chaos.StreamingSumByKeyJo]
>> differs from
>> FlinkDeploymentSpec[job.entryClass=com.robinhood.flink.chaos.StreamingSumByKeyJob]),
>> starting reconciliation.
>> ...
>> 2023-10-02 22:15:09,001 o.a.f.k.o.l.AuditUtils
>> [INFO ][flink-testing-service/flink-testing-service] >>> Status | Info |
>> UPGRADING | The resource is being upgraded
>> ...
>> 2023-10-02 22:17:23,911 o.a.f.k.o.l.AuditUtils
>> [INFO ][flink-testing-service/flink-testing-service] >>> Status | Error |
>> DEPLOYED |
>> {"type":"org.apache.flink.kubernetes.operator.exception.DeploymentFailedException","message":"back-off
>> 20s restarting failed container=flink-main-container
>> pod=flink-testing-service-749dd97c75-4w9ps_flink-testing-service(6db1adb3-4ca4-4924-a8c3-57a417818d85)","additionalMetadata":{"reason":"CrashLoopBackOff"},"throwableList":[]}
>>
>> ...
>> 2023-10-02 22:17:33,576 o.a.f.k.o.o.d.ApplicationObserver
>> [INFO ][flink-testing-service/flink-testing-service] Observing
>> JobManager deployment. Previous status: ERROR
>>
>>
>> Next, I change the spec of the FlinkDeployment so that it uses a healthy
>> Flink image. The operator logs show that the spec has changed:
>>
>> 2023-10-02 22:45:37,445 o.a.f.k.o.l.AuditUtils
>> [INFO ][flink-testing-service/flink-testing-service] >>> Event | Info |
>> SPECCHANGED | UPGRADE change(s) detected
>> (FlinkDeploymentSpec[job.entryClass=com.robinhood.flink.chaos.StreamingSumByKeyJob,job.initialSavepointPath=s3a://robinhood-dev-core-flink-states/flink-testing-service/savepoints/savepoint-329a14-2f8264206b1d]
>> differs from
>> FlinkDeploymentSpec[job.entryClass=com.robinhood.flink.chaos.StreamingSumByKeyJo,job.initialSavepointPath=s3a://robinhood-dev-core-flink-states/flink-testing-service/savepoints/savepoint-dc1077-134923759e30]),
>> starting reconciliation.
>>
>>
>> However, the Flink operator cannot reconcile this spec change, and the
>> jobmanager is now permanently failing because it's still running the bad
>> Flink image:
>>
>> 2023-10-02 22:45:37,461 o.a.f.k.o.l.AuditUtils
>> [INFO ][flink-testing-service/flink-testing-service] >>> Event | Warning |
>> UPGRADEFAILED | JobManager deployment is missing and HA data is not
>> available to make stateful upgrades. It is possible that the job has
>> finished or terminally failed, or the configmaps have been deleted. Manual
>> restore required.
>>
>> I can simply delete this FlinkDeployment and redeploy with the healthy
>> Flink image, but I would like to avoid manual restores if possible. Is it
>> possible to recover by just changing the FlinkDeployment spec?
>>
>> Thanks,
>> Tony
>>
>> --
>>
>> <http://www.robinhood.com/>
>>
>> Tony Chen
>>
>> Software Engineer
>>
>> Menlo Park, CA
>>
>> Don't copy, share, or use this email without permission. If you received
>> it by accident, please let us know and then delete it right away.
>>
>
>


