Re: Questions on Restarting a Flink Application from a savepoint or checkpoint

Gyula Fóra Wed, 19 Jul 2023 12:48:26 -0700

Hey Tony,

Please see:
https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-main/docs/custom-resource/job-management/#stateful-and-stateless-application-upgrades


The operator is designed specifically to handle stateful application upgrades
robustly. In general, any spec change that leads to an upgrade is executed
using the latest available checkpoint or savepoint. This is controlled by the
`upgradeMode` setting for jobs: as long as it is set to last-state or
savepoint, you will always get the latest state.
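
As a rough illustration of where `upgradeMode` lives, here is a minimal
FlinkDeployment sketch. The resource name, image, jar path and S3 locations
are placeholders I made up for the example, not values from your setup, and
the linked docs list the prerequisites for each upgrade mode:

```yaml
# Sketch only: name, image, jarURI and bucket paths are hypothetical.
apiVersion: flink.apache.org/v1beta1
kind: FlinkDeployment
metadata:
  name: example-job
spec:
  image: flink:1.17
  flinkVersion: v1_17
  serviceAccount: flink
  flinkConfiguration:
    # Where Flink writes checkpoints and savepoints (placeholder buckets).
    state.checkpoints.dir: s3://example-bucket/checkpoints
    state.savepoints.dir: s3://example-bucket/savepoints
  jobManager:
    resource:
      memory: "2048m"
      cpu: 1
  taskManager:
    resource:
      memory: "2048m"
      cpu: 1
  job:
    jarURI: local:///opt/flink/usrlib/example-job.jar
    parallelism: 2
    # savepoint or last-state: spec changes are upgraded from the latest state.
    # stateless: upgrades start from empty state.
    upgradeMode: savepoint
    state: running
```

With something like this in place, editing the spec (for example changing
`parallelism`) and re-applying the manifest should let the operator carry the
state over without you touching any savepoint path.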

This is somewhat orthogonal to the savepoint trigger / initialSavepointPath
mechanisms. The initialSavepointPath should be used only the first time the
deployment is created, because at that point the operator is not aware of
the latest state. After that, all upgrades use the latest state unless the
upgradeMode is stateless, in which case no state is used.
Savepoint triggering can help you keep backups for failure recovery, but
manual savepoints should not be taken as part of your upgrade flow because
the operator already does this for you.
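
To make the two mechanisms concrete, the `job` section of the spec could look
roughly like the fragment below. The jar path is a placeholder, the savepoint
path is simply the one from your example, and the nonce value is arbitrary:

```yaml
# Fragment of spec.job in a FlinkDeployment; jarURI is hypothetical.
job:
  jarURI: local:///opt/flink/usrlib/example-job.jar
  upgradeMode: savepoint
  state: running
  # Only meaningful when the deployment is first created, since the
  # operator has no record of newer state at that point.
  initialSavepointPath: s3://savepoints/savepoint-10
  # Changing this value asks the operator to take a manual savepoint,
  # e.g. as a backup. It is not needed for regular upgrades.
  savepointTriggerNonce: 1
```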

Cheers,
Gyula

On Wed, Jul 19, 2023 at 8:20 PM Tony Chen <tony.ch...@robinhood.com> wrote:

> Hi Flink Community,
>
> My name is Tony Chen, and I am a software engineer at Robinhood. I have
> some questions on restarting a Flink Application from a savepoint or
> checkpoint.
>
> We currently store our checkpoints and savepoints in S3, and we would like
> to use the Apache Flink Kubernetes Operator to manage our Flink
> applications. I know that there is a field called "initialSavepointPath"
> (doc: https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-main/docs/custom-resource/job-management/#manual-recovery)
> that I can set in my kubernetes manifest so that whenever I want my Flink
> application to start from a particular savepoint, it will start from
> the savepoint directory in this field. However, if I delete this
> FlinkDeployment resource altogether after new savepoints were triggered,
> and then redeploy this FlinkDeployment resource, it looks like I have to
> manually update the "initialSavepointPath" to a newer savepoint directory
> so that the Flink application starts from a newer savepoint.
>
> Is there a way for us to redeploy FlinkDeployment resources so that the
> latest checkpoint or savepoint is used, and without having to update the
> "initialSavepointPath" field? I noticed in my testing that whenever I
> deleted the FlinkDeployment resource and redeployed it, it would either start
> from the savepoint in initialSavepointPath or from checkpoint 1 if
> initialSavepointPath was not set.
>
> For example, let's say I restarted a Flink application at savepoint 10
> with initialSavepointPath set to s3://savepoints/savepoint-10, and then
> later on a savepoint 20 was completed and stored at
> s3://savepoints/savepoint-20. Is there a way for me to delete this
> FlinkDeployment and redeploy it without updating initialSavepointPath?
>
> Thanks,
> Tony
>
> P.S. I'm going through the source code more for Apache Flink Kubernetes
> Operator to understand how the operator starts a Flink job. Some relevant
> code:
>
>    - https://github.com/apache/flink-kubernetes-operator/blob/0c341ebe13645f4e9802cfd780c5b50f59e29363/flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/service/AbstractFlinkService.java#L500
>    - https://github.com/apache/flink-kubernetes-operator/blob/0c341ebe13645f4e9802cfd780c5b50f59e29363/flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/observer/SavepointObserver.java#L204
>
