Can’t you simply use JobManager in HA mode? It would pick up where it left off if you don’t provide a Savepoint.
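To be clear, failover recovery from the latest checkpoint only kicks in when high-availability is configured for the job cluster. A minimal sketch of the ZooKeeper-based HA settings in flink-conf.yaml — the quorum hosts, storage bucket, and cluster id below are placeholders, not values from this thread:

```yaml
# flink-conf.yaml -- ZooKeeper-based HA for a job cluster (placeholder values)
high-availability: zookeeper
high-availability.zookeeper.quorum: zk-0:2181,zk-1:2181,zk-2:2181
# durable storage for JobGraphs and checkpoint metadata
high-availability.storageDir: gs://my-flink-bucket/ha/
# isolate this job cluster's metadata from other clusters
high-availability.cluster-id: /my-job-cluster
```

With this in place, a restarted JobManager recovers the running job and its latest completed checkpoint from the HA store rather than re-reading only its start arguments.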
> On Sep 25, 2019, at 6:07 AM, Sean Hester <sean.hes...@bettercloud.com> wrote:
>
> thanks for all replies! i'll definitely take a look at the Flink k8s Operator project.
>
> i'll try to restate the issue to clarify. this issue is specific to starting a job from a savepoint in job-cluster mode. in these cases the Job Manager container is configured to run a single Flink job at start-up, and the savepoint needs to be provided as an argument to the entrypoint. the Flink documentation for this approach is here:
>
> https://github.com/apache/flink/tree/master/flink-container/kubernetes#resuming-from-a-savepoint
>
> the issue is that taking this approach means the job will always start from the savepoint provided as the start argument in the Kubernetes YAML. this includes unplanned restarts of the job manager, but we'd really prefer any unplanned restarts to resume from the most recent checkpoint instead of restarting from the configured savepoint. so in a sense we want the savepoint argument to be transient, only used during the initial deployment, but this runs counter to the design of Kubernetes, which always wants to restore a deployment to the "goal state" defined in the YAML.
>
> i hope this helps. if you want more details please let me know, and thanks again for your time.
>
> On Tue, Sep 24, 2019 at 1:09 PM Hao Sun <ha...@zendesk.com> wrote:
> I think I overlooked it. Good point. I am using Redis to save the path to my savepoint; I might be able to set a TTL to avoid such an issue.
>
> Hao Sun
>
> On Tue, Sep 24, 2019 at 9:54 AM Yuval Itzchakov <yuva...@gmail.com> wrote:
> Hi Hao,
>
> I think he's exactly talking about the use case where the JM/TM restart and come back up from the latest savepoint, which might be stale by that time.
> On Tue, 24 Sep 2019, 19:24 Hao Sun <ha...@zendesk.com> wrote:
> We always make a savepoint before we shut down the job-cluster, so the savepoint is always the latest. When we fix a bug or change the job graph, it can resume well.
> We only use checkpoints for unplanned downtime, e.g. K8s killed the JM/TM, an uncaught exception, etc.
>
> Maybe I do not understand your use case well; I do not see a need to start from a checkpoint after a bug fix.
> From what I know, you can currently use a checkpoint as a savepoint as well.
>
> Hao Sun
>
> On Tue, Sep 24, 2019 at 7:48 AM Yuval Itzchakov <yuva...@gmail.com> wrote:
> AFAIK there's currently nothing implemented to solve this problem, but a possible fix could be implemented on top of https://github.com/lyft/flinkk8soperator, which already has a pretty fancy state machine for rolling upgrades. I'd love to be involved, as this is an issue I've been thinking about as well.
>
> Yuval
>
> On Tue, Sep 24, 2019 at 5:02 PM Sean Hester <sean.hes...@bettercloud.com> wrote:
> hi all--we've run into a gap (knowledge? design? tbd?) for our use cases when deploying Flink jobs to start from savepoints using the job-cluster mode in Kubernetes.
>
> we're running ~15 different jobs, all in job-cluster mode, using a mix of Flink 1.8.1 and 1.9.0, under GKE (Google Kubernetes Engine). these are all long-running streaming jobs, all essentially acting as microservices. we're using Helm charts to configure all of our deployments.
>
> we have a number of use cases where we want to restart jobs from a savepoint to replay recent events, i.e. when we've enhanced the job logic or fixed a bug. but after the deployment we want the job to resume its "long-running" behavior, where any unplanned restarts resume from the latest checkpoint.
> the issue we run into is that any obvious/standard/idiomatic Kubernetes deployment includes the savepoint argument in the configuration. if the Job Manager container(s) have an unplanned restart, when they come back up they will start from the savepoint instead of resuming from the latest checkpoint. everything is working as configured, but that's not exactly what we want. we want the savepoint argument to be transient somehow (only used during the initial deployment), but Kubernetes doesn't really support the concept of transient configuration.
>
> i can see a couple of potential solutions that involve either custom code in the jobs or custom logic in the container (i.e. a custom entrypoint script that records, in a file on a persistent volume or GCS, that the configured savepoint has already been used, and potentially when/why/by which deployment). but these seem like unexpected and hacky solutions. before we head down that road i wanted to ask:
> is this already a solved problem that i've missed?
> is this issue already on the community's radar?
> thanks in advance!
>
> --
> Sean Hester | Senior Staff Software Engineer | m. 404-828-0865
> 3525 Piedmont Rd. NE, Building 6, Suite 500, Atlanta, GA 30305
> http://www.bettercloud.com/
> Altitude 2019 in San Francisco | Sept. 23 - 25
> It’s not just an IT conference, it’s “a complete learning and networking experience”
>
> --
> Best Regards,
> Yuval Itzchakov.
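The "transient savepoint argument" workaround Sean describes — a custom entrypoint script that records on durable storage that the configured savepoint has already been consumed — could be sketched roughly as below. This is a hypothetical sketch, not an established Flink mechanism: the marker-file convention, the `savepoint_args` helper, and the paths are all assumptions, and in a real deployment `MARKER_DIR` would be a persistent volume or GCS mount rather than a temp dir.

```shell
#!/bin/sh
# Hypothetical "transient savepoint" wrapper entrypoint: forward the
# configured savepoint only on the first start; later (unplanned)
# restarts omit it, so checkpoint-based recovery takes over.
# MARKER_DIR would be a persistent volume in a real deployment; it
# defaults to a temp dir here only so the sketch runs standalone.
MARKER_DIR="${MARKER_DIR:-$(mktemp -d)}"

savepoint_args() {
  sp="$1"
  marker="$MARKER_DIR/savepoint-consumed"
  if [ -n "$sp" ] && [ ! -f "$marker" ]; then
    # first start: record when and which savepoint was used...
    printf '%s %s\n' "$(date -u +%FT%TZ)" "$sp" > "$marker"
    # ...then emit the argument to forward to the real entrypoint
    printf '%s %s' '--fromSavepoint' "$sp"
  fi
  # otherwise print nothing: the restart resumes from the latest checkpoint
}

# Demo: same configured savepoint, first start vs. an unplanned restart.
FIRST=$(savepoint_args "gs://my-bucket/savepoints/sp-123")
SECOND=$(savepoint_args "gs://my-bucket/savepoints/sp-123")
echo "first start: $FIRST"
echo "restart: [$SECOND]"
```

The extra arguments would then be spliced into the actual job-cluster launch command; the fragile part, as noted in the thread, is that the marker's lifecycle now has to be managed alongside every intentional redeploy.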