Hey! Please help us understand why you need to delete and recreate the FlinkDeployment objects in your ecosystem. Maybe we can suggest an alternative to make your life easier :)
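For concreteness, here is a rough sketch (an illustration, not operator code) of the alternatives mentioned in the quoted messages further down: suspending the job or bumping `restartNonce` instead of deleting the resource. It only builds JSON merge-patch bodies. The field paths `spec.job.state` and `spec.restartNonce` are assumed from the FlinkDeployment spec, and the helper names and the resource name `my-flink-job` are hypothetical.

```python
import json


def suspend_patch() -> dict:
    """Merge patch that suspends the job; the FlinkDeployment is not deleted."""
    return {"spec": {"job": {"state": "suspended"}}}


def resume_patch() -> dict:
    """Merge patch that sets a suspended job back to running."""
    return {"spec": {"job": {"state": "running"}}}


def restart_patch(current_nonce) -> dict:
    """Merge patch that changes restartNonce to request a restart.

    current_nonce may be None if the field was never set (assumption:
    any changed value is enough, so we simply increment).
    """
    return {"spec": {"restartNonce": (current_nonce or 0) + 1}}


def kubectl_command(name: str, patch: dict) -> str:
    """Render a kubectl invocation that would apply the patch."""
    return "kubectl patch flinkdeployment {} --type merge -p '{}'".format(
        name, json.dumps(patch)
    )


if __name__ == "__main__":
    # Hypothetical resource name, for illustration only.
    print(kubectl_command("my-flink-job", suspend_patch()))
    print(kubectl_command("my-flink-job", restart_patch(3)))
```

With `upgradeMode` set to `savepoint` or `last-state`, a resume or restart done this way should pick up the latest state, per the explanation quoted below, because the resource and its status are never removed.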
Of course every prod ecosystem is unique in its own way, and larger platforms generally have a layer on top of the operator to manage these special requirements. In most cases it’s possible to contribute these changes to Flink as long as they fit the scope and larger development direction of the project. This would require a FLIP.

But before going there, I think it’s worth talking about this delete/recreate requirement because it sounds a bit strange in the Kubernetes world. We specifically designed the operator so that you wouldn’t have to do this to get the latest state, and so far this is the first time I have heard this request :)

Cheers
Gyula

On Thu, 20 Jul 2023 at 00:07, Tony Chen <tony.ch...@robinhood.com> wrote:

> Hi Gyula,
>
> Got it. Our use case might be unique to our own ecosystem here at
> Robinhood, so I will have to look into creating a service that can search
> for the latest savepoint / checkpoint in S3 and provide that to the
> FlinkDeployment resource.
>
> Will the Flink Community be okay with us adding this feature to the GitHub
> repo eventually? I was going through this guide
> <https://flink.apache.org/how-to-contribute/contribute-code/>, and it
> looks like I need to get consensus first.
>
> Thanks,
> Tony
>
> On Wed, Jul 19, 2023 at 4:33 PM Gyula Fóra <gyula.f...@gmail.com> wrote:
>
>> Hi!
>>
>> I don’t understand why you need to delete the deployment to restart. You
>> can suspend, use the restartNonce or simply upgrade.
>>
>> These should cover most upgrade/restart scenarios. Like with other
>> resources in Kubernetes once you delete them the status is gone, so the
>> FlinkDeployment won’t keep the last state info.
>>
>> To keep the state after deletion you would have to introduce new
>> resources or an external state store. We are not planning to support this
>> as it goes against the standard Kubernetes resource management flow.
>>
>> I think you should look into simply suspending the job for a while or
>> just use a regular upgrade to fit your needs.
>>
>> Cheers
>> Gyula
>>
>> On Wed, 19 Jul 2023 at 22:19, Tony Chen <tony.ch...@robinhood.com> wrote:
>>
>>> Hi Gyula,
>>>
>>> Thank you for responding so quickly. I went through the page you sent me
>>> a bit more, and I see the following (
>>> https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-release-1.4/docs/custom-resource/job-management/#running-suspending-and-deleting-applications
>>> ):
>>>
>>>> Deleting a deployment will remove all checkpoint and status information.
>>>> Future deployments will start from an empty state unless manually
>>>> overridden by the user.
>>>
>>> For our use case, we do delete the deployment and redeploy the Flink
>>> application sometimes in order to restart our Flink applications. We were
>>> wondering if it's possible for the operator to retain checkpoint and status
>>> information even after the deployment gets deleted.
>>>
>>> Thanks,
>>> Tony
>>>
>>> On Wed, Jul 19, 2023 at 3:46 PM Gyula Fóra <gyula.f...@gmail.com> wrote:
>>>
>>>> Hey Tony,
>>>>
>>>> Please see:
>>>> https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-main/docs/custom-resource/job-management/#stateful-and-stateless-application-upgrades
>>>>
>>>> The operator is made especially to handle stateful application upgrades
>>>> robustly. In general any spec change that you make that will lead to an
>>>> upgrade will be executed using the latest available checkpoint or
>>>> savepoint. This is controlled by the `upgradeMode` setting for jobs; as
>>>> long as you have last-state or savepoint you will always get the latest
>>>> state.
>>>>
>>>> This is somewhat orthogonal to the savepoint trigger /
>>>> initialSavepointPath mechanisms. The initialSavepointPath should be used
>>>> only the first time the deployment is created because at that point the
>>>> operator is not aware of the latest state.
>>>> After that all upgrades always
>>>> use the latest state unless the upgradeMode is stateless, in which case no
>>>> state is used. Savepoint triggering can help you keep backups for failure
>>>> recovery but it should not be executed as part of your upgrade flow
>>>> because the operator already does this for you.
>>>>
>>>> Cheers,
>>>> Gyula
>>>>
>>>> On Wed, Jul 19, 2023 at 8:20 PM Tony Chen <tony.ch...@robinhood.com>
>>>> wrote:
>>>>
>>>>> Hi Flink Community,
>>>>>
>>>>> My name is Tony Chen, and I am a software engineer at Robinhood. I
>>>>> have some questions on restarting a Flink Application from a savepoint or
>>>>> checkpoint.
>>>>>
>>>>> We currently store our checkpoints and savepoints in S3, and we would
>>>>> like to use the Apache Flink Kubernetes Operator to manage our Flink
>>>>> applications. I know that there is a field called "initialSavepointPath" (
>>>>> doc
>>>>> <https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-main/docs/custom-resource/job-management/#manual-recovery>)
>>>>> that I can set in my kubernetes manifest so that whenever I want my Flink
>>>>> application to start from a particular savepoint, it will start from
>>>>> the savepoint directory in this field. However, if I delete this
>>>>> FlinkDeployment resource altogether after new savepoints were triggered,
>>>>> and then redeploy this FlinkDeployment resource, it looks like I have to
>>>>> manually update the "initialSavepointPath" to a newer savepoint directory
>>>>> so that the Flink application starts from a newer savepoint.
>>>>>
>>>>> Is there a way for us to redeploy FlinkDeployment resources so that
>>>>> the latest checkpoint or savepoint is used, and without having to update
>>>>> the "initialSavepointPath" field? I noticed in my testing that whenever I
>>>>> deleted the FlinkDeployment resource and redeployed, it would either start
>>>>> from the savepoint in initialSavepointPath or from checkpoint 1 if
>>>>> initialSavepointPath was not set.
>>>>>
>>>>> For example, let's say I restarted a Flink application at savepoint 10
>>>>> with initialSavepointPath set to s3://savepoints/savepoint-10, and then
>>>>> later on a savepoint 20 was completed and stored at
>>>>> s3://savepoints/savepoint-20. Is there a way for me to delete this
>>>>> FlinkDeployment and redeploy it without updating initialSavepointPath?
>>>>>
>>>>> Thanks,
>>>>> Tony
>>>>>
>>>>> P.S. I'm going through the source code more for Apache Flink
>>>>> Kubernetes Operator to understand how the operator starts a Flink job.
>>>>> Some relevant code:
>>>>>
>>>>> - https://github.com/apache/flink-kubernetes-operator/blob/0c341ebe13645f4e9802cfd780c5b50f59e29363/flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/service/AbstractFlinkService.java#L500
>>>>> - https://github.com/apache/flink-kubernetes-operator/blob/0c341ebe13645f4e9802cfd780c5b50f59e29363/flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/observer/SavepointObserver.java#L204
>>>>>
>>>>> --
>>>>> <http://www.robinhood.com/>
>>>>> Tony Chen
>>>>> Software Engineer
>>>>> Menlo Park, CA
>>>>>
>>>>> Don't copy, share, or use this email without permission. If you
>>>>> received it by accident, please let us know and then delete it right away.
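
The service idea raised in this thread (look up the newest savepoint and supply it as `initialSavepointPath`) could start from something like the sketch below. It assumes the simplified `savepoint-<number>` naming used in the example in the thread, which is not how Flink names savepoint directories by default, so a real version would more likely order candidates by object modification time. Listing S3 is deliberately left out; `latest_savepoint` is a hypothetical helper that receives paths that were already listed.

```python
import re
from typing import Iterable, Optional

# Matches a trailing "savepoint-<number>" path segment, e.g.
# s3://savepoints/savepoint-20 (naming taken from the thread's example).
_SAVEPOINT_RE = re.compile(r"savepoint-(\d+)/?$")


def latest_savepoint(paths: Iterable[str]) -> Optional[str]:
    """Return the path with the highest savepoint number, or None."""
    best_path = None
    best_number = -1
    for path in paths:
        match = _SAVEPOINT_RE.search(path)
        if match is None:
            continue  # ignore anything that is not a numbered savepoint
        number = int(match.group(1))
        if number > best_number:
            best_number = number
            best_path = path
    return best_path


if __name__ == "__main__":
    candidates = [
        "s3://savepoints/savepoint-10",
        "s3://savepoints/savepoint-20",
    ]
    print(latest_savepoint(candidates))
```

The returned path would then be written into `initialSavepointPath` before the FlinkDeployment is recreated. Numbers are compared as integers, so `savepoint-9` does not sort above `savepoint-10`.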