Hey!

Please help us understand why you need to delete and recreate the
FlinkDeployment objects in your ecosystem. Maybe we can help suggest some
alternative to make your life easier :)

Of course every prod ecosystem is unique in its own way and larger
platforms generally have a layer on top of the operator to manage these
special requirements.

In most cases it’s possible to contribute these changes to Flink as long as
they fit the scope / larger development direction of the project. This
would require a FLIP.

But before going there I think it’s worth talking about this
delete/recreate requirement because it sounds a bit strange in the
Kubernetes world. We specifically designed the operator so that you
wouldn’t have to do this to get the latest state, and so far this is the
first time I have heard this request :)
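
For reference, here is a rough sketch of restarting in place instead of
deleting the resource. It assumes the v1beta1 FlinkDeployment fields
(`restartNonce`, `upgradeMode`, `state`). The name and jar path are
placeholders, and required fields such as the image, Flink version and
resources are left out, so please check it against the docs for your
operator version:

```yaml
apiVersion: flink.apache.org/v1beta1
kind: FlinkDeployment
metadata:
  name: example-job                # placeholder name
spec:
  flinkConfiguration:
    # bucket borrowed from the example further down in this thread
    state.savepoints.dir: s3://savepoints
  restartNonce: 2                  # change this number to restart the job in place
  job:
    jarURI: local:///opt/flink/usrlib/example-job.jar   # placeholder path
    upgradeMode: savepoint         # or last-state; stateless would drop the state
    state: running                 # set to suspended to stop, back to running to resume
```

Because the resource is never deleted, its status keeps the latest state
info and there should be no need to touch initialSavepointPath.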

Cheers
Gyula

On Thu, 20 Jul 2023 at 00:07, Tony Chen <tony.ch...@robinhood.com> wrote:

> Hi Gyula,
>
> Got it. Our use case might be unique to our own ecosystem here at
> Robinhood, so I will have to look into creating a service that can search
> for the latest savepoint / checkpoint in S3 and provide that to the
> FlinkDeployment resource.
>
> Will the Flink Community be okay with us adding this feature to the GitHub
> repo eventually? I was going through this guide
> <https://flink.apache.org/how-to-contribute/contribute-code/>, and it
> looks like I need to get consensus first.
>
> Thanks,
> Tony
>
> On Wed, Jul 19, 2023 at 4:33 PM Gyula Fóra <gyula.f...@gmail.com> wrote:
>
>> Hi!
>>
>> I don’t understand why you need to delete the deployment to restart. You
>> can suspend, use the restartNonce or simply upgrade.
>>
>> These should cover most upgrade/restart scenarios. Like with other
>> resources in Kubernetes once you delete them the status is gone, so the
>> FlinkDeployment won’t keep the last state info.
>>
>> To keep the state after deletion you would have to introduce new
>> resources or an external state store. We are not planning to support this
>> as it goes against the standard Kubernetes resource management flow.
>>
>> I think you should look into simply suspending the job for a while or
>> just using a regular upgrade to fit your needs.
>>
>> Cheers
>> Gyula
>>
>> On Wed, 19 Jul 2023 at 22:19, Tony Chen <tony.ch...@robinhood.com> wrote:
>>
>>> Hi Gyula,
>>>
>>> Thank you for responding so quickly. I went through the page you sent me
>>> a bit more, and I see the following (
>>> https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-release-1.4/docs/custom-resource/job-management/#running-suspending-and-deleting-applications
>>> ):
>>>
>>>> Deleting a deployment will remove all checkpoint and status information.
>>>> Future deployments will from an empty state unless manually overridden by
>>>> the user.
>>>>
>>>
>>> For our use case, we do delete the deployment and redeploy the Flink
>>> application sometimes in order to restart our Flink applications. We were
>>> wondering if it's possible for the operator to retain checkpoint and status
>>> information even after the deployment gets deleted.
>>>
>>> Thanks,
>>> Tony
>>>
>>> On Wed, Jul 19, 2023 at 3:46 PM Gyula Fóra <gyula.f...@gmail.com> wrote:
>>>
>>>> Hey Tony,
>>>>
>>>> Please see:
>>>> https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-main/docs/custom-resource/job-management/#stateful-and-stateless-application-upgrades
>>>>
>>>> The operator is made especially to handle stateful application upgrades
>>>> robustly. In general, any spec change that leads to an upgrade will be
>>>> executed using the latest available checkpoint or savepoint. This is
>>>> controlled by the `upgradeMode` setting for jobs: as long as you use
>>>> last-state or savepoint, you will always get the latest state.
>>>>
>>>> This is somewhat orthogonal to the savepoint trigger /
>>>> initialSavepointPath mechanisms. The initialSavepointPath should be used
>>>> only the first time the deployment is created, because at that point the
>>>> operator is not aware of the latest state. After that, all upgrades use
>>>> the latest state unless the upgradeMode is stateless, in which case no
>>>> state is used. Manually triggered savepoints can help you keep backups
>>>> for failure recovery, but they should not be part of your upgrade flow
>>>> because the operator already does this for you.
>>>>
>>>> Cheers,
>>>> Gyula
>>>>
>>>> On Wed, Jul 19, 2023 at 8:20 PM Tony Chen <tony.ch...@robinhood.com>
>>>> wrote:
>>>>
>>>>> Hi Flink Community,
>>>>>
>>>>> My name is Tony Chen, and I am a software engineer at Robinhood. I
>>>>> have some questions on restarting a Flink Application from a savepoint or
>>>>> checkpoint.
>>>>>
>>>>> We currently store our checkpoints and savepoints in S3, and we would
>>>>> like to use the Apache Flink Kubernetes Operator to manage our Flink
>>>>> applications. I know that there is a field called "initialSavepointPath"
>>>>> (doc:
>>>>> <https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-main/docs/custom-resource/job-management/#manual-recovery>)
>>>>> that I can set in my Kubernetes manifest so that whenever I want my Flink
>>>>> application to start from a particular savepoint, it will start from
>>>>> the savepoint directory in this field. However, if I delete this
>>>>> FlinkDeployment resource altogether after new savepoints were triggered,
>>>>> and then redeploy this FlinkDeployment resource, it looks like I have to
>>>>> manually update the "initialSavepointPath" to a newer savepoint directory
>>>>> so that the Flink application starts from a newer savepoint.
>>>>>
>>>>> Is there a way for us to redeploy FlinkDeployment resources so that
>>>>> the latest checkpoint or savepoint is used, without having to update
>>>>> the "initialSavepointPath" field? I noticed in my testing that whenever I
>>>>> deleted the FlinkDeployment resource and redeployed it, it would start
>>>>> either from the savepoint in initialSavepointPath or from checkpoint 1 if
>>>>> initialSavepointPath was not set.
>>>>>
>>>>> For example, let's say I restarted a Flink application at savepoint 10
>>>>> with initialSavepointPath set to s3://savepoints/savepoint-10, and then
>>>>> later on a savepoint 20 was completed and stored at
>>>>> s3://savepoints/savepoint-20. Is there a way for me to delete this
>>>>> FlinkDeployment and redeploy it without updating initialSavepointPath?
>>>>>
>>>>> Thanks,
>>>>> Tony
>>>>>
>>>>> P.S. I'm going through the source code more for Apache Flink
>>>>> Kubernetes Operator to understand how the operator starts a Flink job.
>>>>> Some relevant code:
>>>>>
>>>>>    - https://github.com/apache/flink-kubernetes-operator/blob/0c341ebe13645f4e9802cfd780c5b50f59e29363/flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/service/AbstractFlinkService.java#L500
>>>>>    - https://github.com/apache/flink-kubernetes-operator/blob/0c341ebe13645f4e9802cfd780c5b50f59e29363/flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/observer/SavepointObserver.java#L204
>>>>>
>>>>>
>>>>> --
>>>>>
>>>>> <http://www.robinhood.com/>
>>>>>
>>>>> Tony Chen
>>>>>
>>>>> Software Engineer
>>>>>
>>>>> Menlo Park, CA
>>>>>
>>>>> Don't copy, share, or use this email without permission. If you
>>>>> received it by accident, please let us know and then delete it right away.
>>>>>
>>>>
>>>
>>
>
