Re: Questions on Restarting a Flink Application from a savepoint or checkpoint

Gyula Fóra Wed, 19 Jul 2023 22:18:42 -0700

Hey!

Please help us understand why you need to delete and recreate the
FlinkDeployment objects in your ecosystem. Maybe we can help suggest some
alternative to make your life easier :)


Of course every prod ecosystem is unique in its own way and larger
platforms generally have a layer on top of the operator to manage these
special requirements.

In most cases it’s possible to contribute these changes to Flink as long as
they fit the scope / larger development direction of the project. This
would require a FLIP.

But before going there I think it’s worth talking about this
delete/recreate requirement because it sounds a bit strange in the
Kubernetes world. We specifically designed the operator so that
you wouldn’t have to do this if you want the latest state, and so far this
is the first time I have heard this request :)

Cheers
Gyula
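
As a rough sketch of what this looks like in practice, the restart options
named in the quoted reply below (suspend, restartNonce, or a regular upgrade)
can all be applied by patching the existing FlinkDeployment instead of
deleting it. The field names (spec.restartNonce, spec.job.state,
spec.job.upgradeMode) follow the operator's FlinkDeployment spec; the helper
functions, the deployment name "my-job", and the use of kubectl merge patches
are illustrative assumptions, not something prescribed in this thread.

```python
import json
import shlex
import time

# Values accepted by spec.job.upgradeMode in the FlinkDeployment spec.
UPGRADE_MODES = ("savepoint", "last-state", "stateless")


def restart_patch(nonce=None):
    """Merge patch that changes spec.restartNonce to request a restart."""
    if nonce is None:
        # Assumption: a timestamp is a convenient way to get a new value.
        nonce = int(time.time())
    return {"spec": {"restartNonce": nonce}}


def suspend_patch(upgrade_mode="savepoint"):
    """Merge patch that suspends the job while keeping the resource (and its status)."""
    if upgrade_mode not in UPGRADE_MODES:
        raise ValueError(f"unknown upgradeMode: {upgrade_mode!r}")
    return {"spec": {"job": {"state": "suspended", "upgradeMode": upgrade_mode}}}


def resume_patch():
    """Merge patch that sets the job back to running."""
    return {"spec": {"job": {"state": "running"}}}


def kubectl_patch_command(name, patch, namespace="default"):
    """Build (but do not run) a kubectl command that applies the merge patch."""
    return [
        "kubectl", "patch", "flinkdeployment", name,
        "-n", namespace,
        "--type", "merge",
        "-p", json.dumps(patch),
    ]


if __name__ == "__main__":
    # "my-job" is a hypothetical deployment name.
    for patch in (suspend_patch(), resume_patch(), restart_patch()):
        print(shlex.join(kubectl_patch_command("my-job", patch)))
```

Because the FlinkDeployment object is never deleted in any of these cases, its
status survives, which is what lets the operator pick up the latest state
without a manually maintained initialSavepointPath.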

On Thu, 20 Jul 2023 at 00:07, Tony Chen <[email protected]> wrote:

> Hi Gyula,
>
> Got it. Our use case might be unique to our own ecosystem here at
> Robinhood, so I will have to look into creating a service that can search
> for the latest savepoint / checkpoint in S3 and provide that to the
> FlinkDeployment resource.
>
> Will the Flink Community be okay with us adding this feature to the GitHub
> repo eventually? I was going through this guide
> <https://flink.apache.org/how-to-contribute/contribute-code/>, and it
> looks like I need to get consensus first.
>
> Thanks,
> Tony
>
> On Wed, Jul 19, 2023 at 4:33 PM Gyula Fóra <[email protected]> wrote:
>
>> Hi!
>>
>> I don’t understand why you need to delete the deployment to restart. You
>> can suspend, use the restartNonce, or simply upgrade.
>>
>> These should cover most upgrade/restart scenarios. Like with other
>> resources in Kubernetes once you delete them the status is gone, so the
>> FlinkDeployment won’t keep the last state info.
>>
>> To keep the state after deletion you would have to introduce new
>> resources or an external state store. We are not planning to support this
>> as it goes against the standard Kubernetes resource management flow.
>>
>> I think you should look into simply suspending the job for a while or
>> just using a regular upgrade to fit your needs.
>>
>> Cheers
>> Gyula
>>
>> On Wed, 19 Jul 2023 at 22:19, Tony Chen <[email protected]> wrote:
>>
>>> Hi Gyula,
>>>
>>> Thank you for responding so quickly. I went through the page you sent me
>>> a bit more, and I see the following (
>>> https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-release-1.4/docs/custom-resource/job-management/#running-suspending-and-deleting-applications
>>> ):
>>>
>>>> Deleting a deployment will remove all checkpoint and status information.
>>>> Future deployments will start from an empty state unless manually overridden by
>>>> the user.
>>>>
>>>
>>> For our use case, we do delete the deployment and redeploy the Flink
>>> application sometimes in order to restart our Flink applications. We were
>>> wondering if it's possible for the operator to retain checkpoint and status
>>> information even after the deployment gets deleted.
>>>
>>> Thanks,
>>> Tony
>>>
>>> On Wed, Jul 19, 2023 at 3:46 PM Gyula Fóra <[email protected]> wrote:
>>>
>>>> Hey Tony,
>>>>
>>>> Please see:
>>>> https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-main/docs/custom-resource/job-management/#stateful-and-stateless-application-upgrades
>>>>
>>>> The operator is made especially to handle stateful application upgrades
>>>> robustly. In general, any spec change that leads to an
>>>> upgrade will be executed using the latest available checkpoint or
>>>> savepoint. This is controlled by the `upgradeMode` setting for jobs: as
>>>> long as it is set to last-state or savepoint, you will always get the latest
>>>> state.
>>>>
>>>> This is somewhat orthogonal to the savepoint trigger /
>>>> initialSavepointPath mechanisms. The initialSavepointPath should be used
>>>> only the first time the deployment is created, because at that point the
>>>> operator is not aware of the latest state. After that, all upgrades
>>>> use the latest state unless the upgradeMode is stateless, in which case no
>>>> state is used. Savepoint triggering can help you keep backups for failure
>>>> recovery, but it should not be part of your upgrade flow
>>>> because the operator already does this for you.
>>>>
>>>> Cheers,
>>>> Gyula
>>>>
>>>> On Wed, Jul 19, 2023 at 8:20 PM Tony Chen <[email protected]>
>>>> wrote:
>>>>
>>>>> Hi Flink Community,
>>>>>
>>>>> My name is Tony Chen, and I am a software engineer at Robinhood. I
>>>>> have some questions on restarting a Flink Application from a savepoint or
>>>>> checkpoint.
>>>>>
>>>>> We currently store our checkpoints and savepoints in S3, and we would
>>>>> like to use the Apache Flink Kubernetes Operator to manage our Flink
>>>>> applications. I know that there is a field called "initialSavepointPath"
>>>>> (doc:
>>>>> <https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-main/docs/custom-resource/job-management/#manual-recovery>)
>>>>> that I can set in my Kubernetes manifest so that whenever I want my Flink
>>>>> application to start from a particular savepoint, it will start from
>>>>> the savepoint directory in this field. However, if I delete this
>>>>> FlinkDeployment resource altogether after new savepoints were triggered,
>>>>> and then redeploy this FlinkDeployment resource, it looks like I have to
>>>>> manually update the "initialSavepointPath" to a newer savepoint directory
>>>>> so that the Flink application starts from a newer savepoint.
>>>>>
>>>>> Is there a way for us to redeploy FlinkDeployment resources so that
>>>>> the latest checkpoint or savepoint is used, and without having to update
>>>>> the "initialSavepointPath" field? I noticed in my testing that whenever I
>>>>> deleted the FlinkDeployment resource and redeployed, it would either start
>>>>> from the savepoint in initialSavepointPath or from checkpoint 1 if
>>>>> initialSavepointPath was not set.
>>>>>
>>>>> For example, let's say I restarted a Flink application at savepoint 10
>>>>> with initialSavepointPath set to s3://savepoints/savepoint-10, and then
>>>>> later on a savepoint 20 was completed and stored at
>>>>> s3://savepoints/savepoint-20. Is there a way for me to delete this
>>>>> FlinkDeployment and redeploy it without updating initialSavepointPath?
>>>>>
>>>>> Thanks,
>>>>> Tony
>>>>>
>>>>> P.S. I'm going through the source code of the Apache Flink
>>>>> Kubernetes Operator to understand how the operator starts a Flink job.
>>>>> Some relevant code:
>>>>>
>>>>>    - https://github.com/apache/flink-kubernetes-operator/blob/0c341ebe13645f4e9802cfd780c5b50f59e29363/flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/service/AbstractFlinkService.java#L500
>>>>>    - https://github.com/apache/flink-kubernetes-operator/blob/0c341ebe13645f4e9802cfd780c5b50f59e29363/flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/observer/SavepointObserver.java#L204
>>>>>
>>>>>
>>>>> --
>>>>>
>>>>> <http://www.robinhood.com/>
>>>>>
>>>>> Tony Chen
>>>>>
>>>>> Software Engineer
>>>>>
>>>>> Menlo Park, CA
>>>>>
>>>>> Don't copy, share, or use this email without permission. If you
>>>>> received it by accident, please let us know and then delete it right away.
>>>>>
>>>>
>>>
>>>
>>
>
>
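
The external lookup described earlier in the thread (a service that finds the
latest savepoint or checkpoint in S3 and supplies it as initialSavepointPath)
might start from something like the sketch below. It assumes that every
completed savepoint or checkpoint directory contains a _metadata file and that
"latest" means the most recently modified one. The function name, the bucket
name, and the (key, last-modified) input format are hypothetical, and listing
the bucket itself (for example with an S3 client) is left out.

```python
from datetime import datetime, timezone
from typing import Iterable, Optional, Tuple

METADATA_SUFFIX = "/_metadata"


def latest_snapshot_path(
    bucket: str, objects: Iterable[Tuple[str, datetime]]
) -> Optional[str]:
    """Return the s3:// directory of the most recently written _metadata file.

    `objects` is an iterable of (object key, last-modified time) pairs, as one
    might collect from a bucket listing. Directories without a _metadata file
    are ignored because they may belong to an unfinished snapshot.
    """
    newest: Optional[Tuple[str, datetime]] = None
    for key, modified in objects:
        if not key.endswith(METADATA_SUFFIX):
            continue
        if newest is None or modified > newest[1]:
            newest = (key, modified)
    if newest is None:
        return None
    directory = newest[0][: -len(METADATA_SUFFIX)]
    return f"s3://{bucket}/{directory}"


if __name__ == "__main__":
    # Hypothetical listing that mirrors the savepoint-10 / savepoint-20 example.
    listing = [
        ("savepoint-10/_metadata", datetime(2023, 7, 18, tzinfo=timezone.utc)),
        ("savepoint-20/_metadata", datetime(2023, 7, 19, tzinfo=timezone.utc)),
        ("savepoint-30/part-0", datetime(2023, 7, 20, tzinfo=timezone.utc)),
    ]
    print(latest_snapshot_path("savepoints", listing))
    # prints "s3://savepoints/savepoint-20"
```

Note that a lookup like this lives entirely outside the operator; it only
matters when the FlinkDeployment has been deleted and its status is gone.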
