Re: Challenges Deploying Flink With Savepoints On Kubernetes

2020-12-21 Thread vishalovercome
Thanks for your reply! What I have seen is that the job terminates when there's an intermittent loss of connectivity with ZooKeeper. This is in fact the most common reason our jobs are terminating at this point. Worse, the job is unable to restore from a checkpoint during some (not all) of these terminat

Re: Challenges Deploying Flink With Savepoints On Kubernetes

2020-12-17 Thread Till Rohrmann
Flink should try to pick the latest checkpoint and will only use the savepoint if no newer checkpoint could be found. Cheers, Till On Wed, Dec 16, 2020 at 10:13 PM vishalovercome wrote: > I'm not sure if this addresses the original concern. For instance consider > this sequence: > > 1. Job star
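The restore rule Till describes can be sketched roughly as below. This is only an illustration of the stated preference (latest checkpoint first, savepoint as fallback), not Flink's actual implementation. The `Checkpoint` type, the `choose_restore_point` function and the example paths are hypothetical, and the sketch assumes the restarted JobManager can actually see the completed checkpoints.

```python
from typing import NamedTuple, Optional, Sequence


class Checkpoint(NamedTuple):
    """Hypothetical record of a completed checkpoint."""
    checkpoint_id: int  # assumed to increase with every new checkpoint
    path: str


def choose_restore_point(
    checkpoints: Sequence[Checkpoint],
    savepoint_path: Optional[str],
) -> Optional[str]:
    """Prefer the latest completed checkpoint; fall back to the savepoint."""
    if checkpoints:
        # The checkpoint with the highest id is treated as the newest one.
        return max(checkpoints, key=lambda c: c.checkpoint_id).path
    # No checkpoint is known, so the configured savepoint (if any) is used.
    return savepoint_path


# First start: no checkpoints exist yet, so the savepoint is chosen.
print(choose_restore_point([], "gs://bucket/savepoints/sp-1"))

# Restart after a crash: checkpoints exist, so the newest one is chosen
# even though the manifest still names the older savepoint.
recovered = [Checkpoint(3, "gs://bucket/chk-3"), Checkpoint(4, "gs://bucket/chk-4")]
print(choose_restore_point(recovered, "gs://bucket/savepoints/sp-1"))
```

Under this rule the savepoint path baked into the Kubernetes manifest only matters on the very first start, which is the behaviour the question was asking about.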

Re: Challenges Deploying Flink With Savepoints On Kubernetes

2020-12-16 Thread vishalovercome
I'm not sure if this addresses the original concern. For instance, consider this sequence: 1. Job starts from savepoint 2. Job creates a few checkpoints 3. Job manager (just one in Kubernetes) crashes and restarts with the commands specified in the Kubernetes manifest, which has the savepoint path

Re: Challenges Deploying Flink With Savepoints On Kubernetes

2020-06-11 Thread Matt Magsombol
I'm not the original poster, but I'm running into this same issue. What you just described is exactly what I want. I presume you guys are using some variant of this Helm chart https://github.com/docker-flink/examples/tree/master/helm/flink to configure your k8s cluster? I'm also assuming that this cl

Re: Challenges Deploying Flink With Savepoints On Kubernetes

2019-10-14 Thread Vijay Bhaskar
] >>> Caused by: [No route to host] >>> 2019-09-24 17:40:39,006 WARN akka.remote.transport.netty.NettyTransport - Remote connection to [null] failed with java.net.NoRouteToHostException: No route to host >>> >>> On Fri, O

Re: Challenges Deploying Flink With Savepoints On Kubernetes

2019-10-14 Thread Till Rohrmann
11, 2019 at 9:39 AM Yun Tang wrote: >> >>> Hi Hao >>> >>> It seems that I misunderstood the background of usage for your cases. >>> High availability configuration targets for fault tolerance not for general >>> development evolution. If you wa

Re: Challenges Deploying Flink With Savepoints On Kubernetes

2019-10-11 Thread Vijay Bhaskar
ology, just follow >> the general rule to restore from savepoint/checkpoint, do not rely on HA to >> do job migration things. >> >> Best >> Yun Tang >> ---------- >> *From:* Hao Sun >> *Sent:* Friday, October 11, 2019 8:33 >>

Re: Challenges Deploying Flink With Savepoints On Kubernetes

2019-10-11 Thread Vijay Bhaskar
Tang > *Cc:* Vijay Bhaskar ; Yang Wang < > danrtsey...@gmail.com>; Sean Hester ; > Aleksandar Mastilovic ; Yuval Itzchakov < > yuva...@gmail.com>; user > *Subject:* Re: Challenges Deploying Flink With Savepoints On Kubernetes > > Yep I know that option.

Re: Challenges Deploying Flink With Savepoints On Kubernetes

2019-10-10 Thread Yun Tang
loud.com>>; Aleksandar Mastilovic mailto:amastilo...@sightmachine.com>>; Yun Tang mailto:myas...@live.com>>; Hao Sun mailto:ha...@zendesk.com>>; Yuval Itzchakov mailto:yuva...@gmail.com>>; user mailto:user@flink.apache.org>> Subject: Re: Challenges Deploying Flink Wi

Re: Challenges Deploying Flink With Savepoints On Kubernetes

2019-10-10 Thread Hao Sun
ean Hester ; Aleksandar Mastilovic < > amastilo...@sightmachine.com>; Yun Tang ; Hao Sun < > ha...@zendesk.com>; Yuval Itzchakov ; user < > user@flink.apache.org> > *Subject:* Re: Challenges Deploying Flink With Savepoints On Kubernetes > > Thanks Yang. We will

Re: Challenges Deploying Flink With Savepoints On Kubernetes

2019-10-10 Thread Yun Tang
st Yun Tang From: Vijay Bhaskar Sent: Thursday, October 10, 2019 19:24 To: Yang Wang Cc: Sean Hester ; Aleksandar Mastilovic ; Yun Tang ; Hao Sun ; Yuval Itzchakov ; user Subject: Re: Challenges Deploying Flink With Savepoints On Kubernetes Thanks Yang. We will

Re: Challenges Deploying Flink With Savepoints On Kubernetes

2019-10-10 Thread Vijay Bhaskar
ed, i think the job >>>>>>>> could recover both at exceptionally crash and restart by >>>>>>>> expectation. >>>>>>>> >>>>>>>> @Aleksandar Mastilovic , we are also >>>>>>>> ha

Re: Challenges Deploying Flink With Savepoints On Kubernetes

2019-10-10 Thread Yang Wang
>> feature to the community. >>>>>>> >>>>>>> [1]. >>>>>>> https://docs.google.com/document/d/1Z-VdJlPPEQoWT1WLm5woM4y0bFep4FrgdJ9ipQuRv8g/edit >>>>>>> >>>>>>> Best, >>>>>>> Yang

Re: Challenges Deploying Flink With Savepoints On Kubernetes

2019-10-09 Thread Vijay Bhaskar
y0bFep4FrgdJ9ipQuRv8g/edit >>>>>> >>>>>> Best, >>>>>> Yang >>>>>> >>>>>> Aleksandar Mastilovic wrote on Thu, Sep 26, 2019 at >>>>>> 4:11 AM: >>>>>> >>>>>>> Would you gu

Re: Challenges Deploying Flink With Savepoints On Kubernetes

2019-10-08 Thread Yang Wang
okeeper-less HA? I could ask the managers how they feel about >>>>>> open-sourcing the improvement. >>>>>> >>>>>> On Sep 25, 2019, at 11:49 AM, Yun Tang wrote: >>>>>> >>>>>> As Aleksandar said, k8s with HA conf

Re: Challenges Deploying Flink With Savepoints On Kubernetes

2019-09-30 Thread Sean Hester
>> On Sep 25, 2019, at 11:49 AM, Yun Tang wrote: >>>>> >>>>> As Aleksandar said, k8s with HA configuration could solve your >>>>> problem. There already have some discussion about how to implement such HA >>>>> in k8s if we don't hav

Re: Challenges Deploying Flink With Savepoints On Kubernetes

2019-09-26 Thread Vijay Bhaskar
1105 [1] and FLINK-12884 [2]. >>>> Currently, you might only have to choose zookeeper as high-availability >>>> service. >>>> >>>> [1] https://issues.apache.org/jira/browse/FLINK-11105 >>>> [2] https://issues.apache.org/jira/browse/FLINK-12

Re: Challenges Deploying Flink With Savepoints On Kubernetes

2019-09-26 Thread Vijay Bhaskar
ps://issues.apache.org/jira/browse/FLINK-11105 >>> [2] https://issues.apache.org/jira/browse/FLINK-12884 >>> >>> Best >>> Yun Tang >>> -- >>> *From:* Aleksandar Mastilovic >>> *Sent:* Thursday, September 26, 2019

Re: Challenges Deploying Flink With Savepoints On Kubernetes

2019-09-26 Thread Sean Hester
>> Yun Tang >> ---------- >> *From:* Aleksandar Mastilovic >> *Sent:* Thursday, September 26, 2019 1:57 >> *To:* Sean Hester >> *Cc:* Hao Sun ; Yuval Itzchakov ; >> user >> *Subject:* Re: Challenges Deploying Flink With Savepoints On Kuberne

Re: Challenges Deploying Flink With Savepoints On Kubernetes

2019-09-26 Thread Yang Wang
day, September 26, 2019 1:57 > *To:* Sean Hester > *Cc:* Hao Sun ; Yuval Itzchakov ; > user > *Subject:* Re: Challenges Deploying Flink With Savepoints On Kubernetes > > Can’t you simply use JobManager in HA mode? It would pick up where it left > off if you don’t provide a Savepo

Re: Challenges Deploying Flink With Savepoints On Kubernetes

2019-09-25 Thread Aleksandar Mastilovic
ber 26, 2019 1:57 > To: Sean Hester > Cc: Hao Sun ; Yuval Itzchakov ; user > > Subject: Re: Challenges Deploying Flink With Savepoints On Kubernetes > > Can’t you simply use JobManager in HA mode? It would pick up where it left > off if you don’t provide a Savepoint. >

Re: Challenges Deploying Flink With Savepoints On Kubernetes

2019-09-25 Thread Yun Tang
t: Re: Challenges Deploying Flink With Savepoints On Kubernetes Can’t you simply use JobManager in HA mode? It would pick up where it left off if you don’t provide a Savepoint. On Sep 25, 2019, at 6:07 AM, Sean Hester wrote: thanks for all replies! i'

Re: Challenges Deploying Flink With Savepoints On Kubernetes

2019-09-25 Thread Aleksandar Mastilovic
Can’t you simply use JobManager in HA mode? It would pick up where it left off if you don’t provide a Savepoint. > On Sep 25, 2019, at 6:07 AM, Sean Hester wrote: > > thanks for all replies! i'll definitely take a look at the Flink k8s Operator > project. > > i'll try to restate the issue to

Re: Challenges Deploying Flink With Savepoints On Kubernetes

2019-09-25 Thread Vijay Bhaskar
One way to do this is to run a separate cluster job manager program in Kubernetes that actually manages the jobs, so that job control is decoupled. When restarting a job, make sure to follow the steps below: a) First, the job manager takes a savepoint by killing the job and notes d

Re: Challenges Deploying Flink With Savepoints On Kubernetes

2019-09-25 Thread Sean Hester
thanks for all replies! i'll definitely take a look at the Flink k8s Operator project. i'll try to restate the issue to clarify. this issue is specific to starting a job from a savepoint in job-cluster mode. in these cases the Job Manager container is configured to run a single Flink job at start-

Re: Challenges Deploying Flink With Savepoints On Kubernetes

2019-09-24 Thread Hao Sun
I think I overlooked it. Good point. I am using Redis to save the path to my savepoint, so I might be able to set a TTL to avoid this issue. Hao Sun On Tue, Sep 24, 2019 at 9:54 AM Yuval Itzchakov wrote: > Hi Hao, > > I think he's exactly talking about the usecase where the JM/TM restart and > th
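Hao's idea of recording the savepoint path with a TTL can be sketched as follows. This is a minimal illustration under stated assumptions, not Hao's actual setup: `ExpiringStore` is a hypothetical in-memory stand-in for Redis so the example runs without a server, and the key layout, TTL value and path are made up. With a real Redis client the same effect would come from setting the key with an expiry.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class ExpiringStore:
    """Hypothetical in-memory stand-in for a key-value store with per-key TTL."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._data: Dict[str, Tuple[str, float]] = {}

    def set(self, key: str, value: str, ttl_seconds: float) -> None:
        # Remember the value together with the moment it stops being valid.
        self._data[key] = (value, self._clock() + ttl_seconds)

    def get(self, key: str) -> Optional[str]:
        entry = self._data.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:
            # The entry has outlived its TTL, so it is dropped.
            del self._data[key]
            return None
        return value


def savepoint_to_restore(store: ExpiringStore, job_name: str) -> Optional[str]:
    """Return the recorded savepoint path, or None once it has expired."""
    return store.get(f"savepoint:{job_name}")


# A fake clock makes the expiry visible without waiting.
now = [0.0]
store = ExpiringStore(clock=lambda: now[0])
store.set("savepoint:my-job", "gs://bucket/savepoints/sp-42", ttl_seconds=600)
print(savepoint_to_restore(store, "my-job"))  # path is still within its TTL

now[0] = 601.0
print(savepoint_to_restore(store, "my-job"))  # None: the entry has expired
```

Once the entry has expired, a restarting job no longer finds a savepoint path and would have to recover from its checkpoints instead, which is how the TTL would guard against restoring from a stale savepoint.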

Re: Challenges Deploying Flink With Savepoints On Kubernetes

2019-09-24 Thread Yuval Itzchakov
Hi Hao, I think he's talking about exactly the use case where the JM/TM restart and come back up from the latest savepoint, which might be stale by that time. On Tue, 24 Sep 2019, 19:24 Hao Sun, wrote: > We always make a savepoint before we shutdown the job-cluster. So the > savepoint is alw

Re: Challenges Deploying Flink With Savepoints On Kubernetes

2019-09-24 Thread Hao Sun
We always make a savepoint before we shut down the job-cluster, so the savepoint is always the latest. When we fix a bug or change the job graph, it can resume well. We only use checkpoints for unplanned downtime, e.g. K8S killed JM/TM, uncaught exception, etc. Maybe I do not understand your use ca

Re: Challenges Deploying Flink With Savepoints On Kubernetes

2019-09-24 Thread Yuval Itzchakov
AFAIK there's currently nothing implemented to solve this problem, but a possible fix could be implemented on top of https://github.com/lyft/flinkk8soperator, which already has a pretty fancy state machine for rolling upgrades. I'd love to be involved as this is an issue I've been thinking

Challenges Deploying Flink With Savepoints On Kubernetes

2019-09-24 Thread Sean Hester
hi all--we've run into a gap (knowledge? design? tbd?) for our use cases when deploying Flink jobs to start from savepoints using the job-cluster mode in Kubernetes. we're running ~15 different jobs, all in job-cluster mode, using a mix of Flink 1.8.1 and 1.9.0, under GKE (Google Kubernetes Engi