Re: Restart hook and checkpoint

2018-03-23 Thread Fabian Hueske
Yes, that would be great!

Thank you,
Fabian

Re: Restart hook and checkpoint

2018-03-22 Thread Ashish Pokharel
Fabian, that sounds good. Should I recap some bullets in an email and start a new thread then?

Thanks, Ashish

Re: Restart hook and checkpoint

2018-03-22 Thread Fabian Hueske
Hi Ashish, Agreed! I think the right approach would be to gather the requirements and start a discussion on the dev mailing list. Contributors and committers who are more familiar with the checkpointing and recovery internals should discuss a solution that can be integrated and doesn't break with

Re: Restart hook and checkpoint

2018-03-20 Thread Ashish Pokharel
I definitely like the idea of event based checkpointing :) Fabian, I do agree with your point that it is not possible to take a rescue checkpoint consistently. The basis here however is not around the operator that actually failed. It’s to avoid data loss across 100s (probably 1000s of paralle

Re: Restart hook and checkpoint

2018-03-20 Thread Fabian Hueske
Well, that's not that easy to do, because checkpoints must be coordinated and triggered by the JobManager. Also, the checkpointing mechanism with flowing checkpoint barriers (to ensure checkpoint consistency) won't work once a task has failed, because it cannot continue processing and forward barriers. If

Re: Restart hook and checkpoint

2018-03-19 Thread makeyang
Currently there is only a time-based way to trigger a checkpoint. Based on this discussion, I think Flink needs to introduce an event-based way to trigger checkpoints; for example, restarting a task manager should count as such an event.

--
Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble
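
For reference, a minimal sketch of the interval-based trigger that exists today (the 60-second interval is arbitrary); there is currently no hook to request a checkpoint in response to an event such as a TaskManager restart:

    import org.apache.flink.streaming.api.CheckpointingMode;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class TimeBasedCheckpointingSketch {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

            // Checkpoints are triggered purely on a fixed interval (here: every 60s);
            // there is no API to trigger one when, e.g., a TaskManager restarts.
            env.enableCheckpointing(60_000L, CheckpointingMode.EXACTLY_ONCE);

            env.fromElements(1, 2, 3).print();
            env.execute("time-based-checkpointing-sketch");
        }
    }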

Re: Restart hook and checkpoint

2018-03-18 Thread Ashish Pokharel
Thanks Fabian! Yes, that is exactly what we are looking to achieve. I looked at the fine-grained recovery FLIP but am not sure if that will do the trick. Like Fabian mentioned, we haven’t been enabling checkpointing (reasons below). I do understand it might not always be possible to actually take a ch

Re: Restart hook and checkpoint

2018-03-15 Thread Fabian Hueske
If I understand fine-grained recovery correctly, one would still need to take checkpoints. Ashish would like to avoid checkpointing and accept losing the state of the failed task. However, he would like to avoid losing more state than necessary due to the restart of tasks that did not fail. Best,

Re: Restart hook and checkpoint

2018-03-14 Thread Aljoscha Krettek
Hi,

Have you looked into fine-grained recovery?
https://cwiki.apache.org/confluence/display/FLINK/FLIP-1+%3A+Fine+Grained+Recovery+from+Task+Failures

Stefan (cc'ed) might be able to give you som
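
For readers following the FLIP-1 link: the failover strategy is a cluster-level setting (the key jobmanager.execution.failover-strategy, normally placed in flink-conf.yaml), and the strategy names available depend on the Flink version. A hedged local-environment sketch, only to show where the knob lives:

    import org.apache.flink.configuration.Configuration;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class FailoverStrategySketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // "full" restarts the whole job on any task failure (the behaviour discussed
            // in this thread); FLIP-1 adds finer-grained strategies behind this key.
            conf.setString("jobmanager.execution.failover-strategy", "full");

            StreamExecutionEnvironment env =
                    StreamExecutionEnvironment.createLocalEnvironment(2, conf);

            env.fromElements(1, 2, 3).print();
            env.execute("failover-strategy-sketch");
        }
    }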

Re: Restart hook and checkpoint

2018-03-06 Thread Ashish Pokharel
Hi Gordon, the issue really is that we are trying to avoid checkpointing, as the datasets are really heavy and all of the state is really transient in a few of our apps (flushed within a few seconds). So the high volume/velocity and the transient nature of the state make those apps good candidates to just not have ch

Re: Restart hook and checkpoint

2018-03-06 Thread Tzu-Li Tai
Hi Ashish, could you elaborate a bit more on why you think the restart of all operators leads to data loss? When a restart occurs, Flink will restart the job from the latest completed checkpoint. All operator state will be reloaded with the state written in that checkpoint, and the position of the input
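
A minimal sketch of the state-restore behaviour described above, assuming checkpointing is enabled and the function is applied on a keyed stream (names are illustrative): the counter below is included in every checkpoint, so after a restart it resumes from the value in the latest completed checkpoint rather than from zero.

    import org.apache.flink.api.common.functions.RichFlatMapFunction;
    import org.apache.flink.api.common.state.ValueState;
    import org.apache.flink.api.common.state.ValueStateDescriptor;
    import org.apache.flink.configuration.Configuration;
    import org.apache.flink.util.Collector;

    // Keyed state that is written into each checkpoint; on recovery Flink
    // restores it from the latest completed checkpoint before resuming input.
    public class CountPerKey extends RichFlatMapFunction<String, Long> {
        private transient ValueState<Long> count;

        @Override
        public void open(Configuration parameters) {
            count = getRuntimeContext().getState(
                    new ValueStateDescriptor<>("count", Long.class));
        }

        @Override
        public void flatMap(String value, Collector<Long> out) throws Exception {
            Long current = count.value();
            long updated = (current == null ? 0L : current) + 1;
            count.update(updated);
            out.collect(updated);
        }
    }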

Restart hook and checkpoint

2018-03-02 Thread ashish pok
All, it looks like Flink's default behavior is to restart all operators on a single operator error - in my case it is a Kafka Producer timing out. When this happens, I see in the logs that all operators are restarted. This essentially leads to data loss. In my case the volume of data is so high that it
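
A hedged sketch of the setup being described (the retry count and delay are arbitrary): with a standard fixed-delay restart strategy and no checkpointing, a failure in any single task, such as a timing-out Kafka producer, restarts the full job graph, and all in-flight operator state is lost.

    import java.util.concurrent.TimeUnit;

    import org.apache.flink.api.common.restartstrategy.RestartStrategies;
    import org.apache.flink.api.common.time.Time;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class RestartBehaviourSketch {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

            // Restart the job up to 3 times, waiting 10s between attempts.
            // A failure in any one task triggers a restart of the full job graph;
            // without checkpointing, operator state does not survive the restart.
            env.setRestartStrategy(
                    RestartStrategies.fixedDelayRestart(3, Time.of(10, TimeUnit.SECONDS)));

            // ... sources, transformations, sinks ...
            env.fromElements("a", "b").print();
            env.execute("restart-behaviour-sketch");
        }
    }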