Can’t you simply use JobManager in HA mode? It would pick up where it left off if you don’t provide a Savepoint.
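To be clear, failover recovery from the latest checkpoint only kicks in when high-availability is configured for the job cluster. A minimal sketch of the ZooKeeper-based HA settings in flink-conf.yaml — the quorum hosts, storage bucket, and cluster id below are placeholders, not values from this thread:

```yaml
# flink-conf.yaml -- ZooKeeper-based HA for a job cluster (placeholder values)
high-availability: zookeeper
high-availability.zookeeper.quorum: zk-0:2181,zk-1:2181,zk-2:2181
# durable storage for JobGraphs and checkpoint metadata
high-availability.storageDir: gs://my-flink-bucket/ha/
# isolate this job cluster's metadata from other clusters
high-availability.cluster-id: /my-job-cluster
```

With this in place, a restarted JobManager recovers the running job and its latest completed checkpoint from the HA store rather than re-reading only its start arguments.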
> On Sep 25, 2019, at 6:07 AM, Sean Hester <sean.hes...@bettercloud.com> wrote:
>
> thanks for all replies! i'll definitely take a look at the Flink k8s Operator project.
>
> i'll try to restate the issue to clarify. this issue is specific to starting a job from a savepoint in job-cluster mode. in these cases the Job Manager container is configured to run a single Flink job at start-up, and the savepoint needs to be provided as an argument to the entrypoint. the Flink documentation for this approach is here:
>
> https://github.com/apache/flink/tree/master/flink-container/kubernetes#resuming-from-a-savepoint
>
> the issue is that taking this approach means the job will always start from the savepoint provided as the start argument in the Kubernetes YAML. this includes unplanned restarts of the job manager, but we'd really prefer any unplanned restarts to resume from the most recent checkpoint instead of restarting from the configured savepoint. so in a sense we want the savepoint argument to be transient, only used during the initial deployment, but this runs counter to the design of Kubernetes, which always wants to restore a deployment to the "goal state" defined in the YAML.
>
> i hope this helps. if you want more details please let me know, and thanks again for your time.
>
> On Tue, Sep 24, 2019 at 1:09 PM Hao Sun <ha...@zendesk.com> wrote:
> I think I overlooked it. Good point. I am using Redis to save the path to my savepoint; I might be able to set a TTL to avoid such an issue.
>
> Hao Sun
>
> On Tue, Sep 24, 2019 at 9:54 AM Yuval Itzchakov <yuva...@gmail.com> wrote:
> Hi Hao,
>
> I think he's exactly talking about the use case where the JM/TM restart and come back up from the latest savepoint, which might be stale by that time.
> On Tue, 24 Sep 2019, 19:24 Hao Sun <ha...@zendesk.com> wrote:
> We always make a savepoint before we shut down the job-cluster, so the savepoint is always the latest. When we fix a bug or change the job graph, it can resume well.
> We only use checkpoints for unplanned downtime, e.g. K8s killed the JM/TM, an uncaught exception, etc.
>
> Maybe I do not understand your use case well; I do not see a need to start from a checkpoint after a bug fix.
> From what I know, you can currently use a checkpoint as a savepoint as well.
>
> Hao Sun
>
> On Tue, Sep 24, 2019 at 7:48 AM Yuval Itzchakov <yuva...@gmail.com> wrote:
> AFAIK there's currently nothing implemented to solve this problem, but a possible fix could be implemented on top of https://github.com/lyft/flinkk8soperator, which already has a pretty fancy state machine for rolling upgrades. I'd love to be involved, as this is an issue I've been thinking about as well.
>
> Yuval
>
> On Tue, Sep 24, 2019 at 5:02 PM Sean Hester <sean.hes...@bettercloud.com> wrote:
> hi all--we've run into a gap (knowledge? design? tbd?) for our use cases when deploying Flink jobs to start from savepoints using the job-cluster mode in Kubernetes.
>
> we're running ~15 different jobs, all in job-cluster mode, using a mix of Flink 1.8.1 and 1.9.0, under GKE (Google Kubernetes Engine). these are all long-running streaming jobs, all essentially acting as microservices. we're using Helm charts to configure all of our deployments.
>
> we have a number of use cases where we want to restart jobs from a savepoint to replay recent events, i.e. when we've enhanced the job logic or fixed a bug. but after the deployment we want the job to resume its "long-running" behavior, where any unplanned restarts resume from the latest checkpoint.
> the issue we run into is that any obvious/standard/idiomatic Kubernetes deployment includes the savepoint argument in the configuration. if the Job Manager container(s) have an unplanned restart, when they come back up they will start from the savepoint instead of resuming from the latest checkpoint. everything is working as configured, but that's not exactly what we want. we want the savepoint argument to be transient somehow (only used during the initial deployment), but Kubernetes doesn't really support the concept of transient configuration.
>
> i can see a couple of potential solutions that involve either custom code in the jobs or custom logic in the container (i.e. a custom entrypoint script that records, in a file on a persistent volume or GCS, that the configured savepoint has already been used, and potentially when/why/by which deployment). but these seem like unexpected and hacky solutions. before we head down that road i wanted to ask:
> is this already a solved problem that i've missed?
> is this issue already on the community's radar?
> thanks in advance!
>
> --
> Sean Hester | Senior Staff Software Engineer | m. 404-828-0865
> 3525 Piedmont Rd. NE, Building 6, Suite 500, Atlanta, GA 30305
> http://www.bettercloud.com/
> Altitude 2019 in San Francisco | Sept. 23 - 25
> It’s not just an IT conference, it’s “a complete learning and networking experience”
>
> --
> Best Regards,
> Yuval Itzchakov.
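The "transient savepoint argument" workaround Sean describes — a custom entrypoint script that records on durable storage that the configured savepoint has already been consumed — could be sketched roughly as below. This is a hypothetical sketch, not an established Flink mechanism: the marker-file convention, the `savepoint_args` helper, and the paths are all assumptions, and in a real deployment `MARKER_DIR` would be a persistent volume or GCS mount rather than a temp dir.

```shell
#!/bin/sh
# Hypothetical "transient savepoint" wrapper entrypoint: forward the
# configured savepoint only on the first start; later (unplanned)
# restarts omit it, so checkpoint-based recovery takes over.
# MARKER_DIR would be a persistent volume in a real deployment; it
# defaults to a temp dir here only so the sketch runs standalone.
MARKER_DIR="${MARKER_DIR:-$(mktemp -d)}"

savepoint_args() {
  sp="$1"
  marker="$MARKER_DIR/savepoint-consumed"
  if [ -n "$sp" ] && [ ! -f "$marker" ]; then
    # first start: record when and which savepoint was used...
    printf '%s %s\n' "$(date -u +%FT%TZ)" "$sp" > "$marker"
    # ...then emit the argument to forward to the real entrypoint
    printf '%s %s' '--fromSavepoint' "$sp"
  fi
  # otherwise print nothing: the restart resumes from the latest checkpoint
}

# Demo: same configured savepoint, first start vs. an unplanned restart.
FIRST=$(savepoint_args "gs://my-bucket/savepoints/sp-123")
SECOND=$(savepoint_args "gs://my-bucket/savepoints/sp-123")
echo "first start: $FIRST"
echo "restart: [$SECOND]"
```

The extra arguments would then be spliced into the actual job-cluster launch command; the fragile part, as noted in the thread, is that the marker's lifecycle now has to be managed alongside every intentional redeploy.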