Re: Restart hook and checkpoint

2018-03-23 Thread Fabian Hueske
Yes, that would be great!

Thank you,
Fabian

Re: Restart hook and checkpoint

2018-03-22 Thread Ashish Pokharel
Fabian, that sounds good. Should I recap some bullets in an email and start a new thread then?

Thanks, Ashish

Re: Restart hook and checkpoint

2018-03-22 Thread Fabian Hueske
Hi Ashish, Agreed! I think the right approach would be to gather the requirements and start a discussion on the dev mailing list. Contributors and committers who are more familiar with the checkpointing and recovery internals should discuss a solution that can be integrated and doesn't break with

Re: Restart hook and checkpoint

2018-03-20 Thread Ashish Pokharel
I definitely like the idea of event based checkpointing :) Fabian, I do agree with your point that it is not possible to take a rescue checkpoint consistently. The basis here however is not around the operator that actually failed. It’s to avoid data loss across 100s (probably 1000s of paralle

Re: Restart hook and checkpoint

2018-03-20 Thread Fabian Hueske
Well, that's not that easy to do, because checkpoints must be coordinated and triggered by the JobManager. Also, the checkpointing mechanism with flowing checkpoint barriers (to ensure checkpoint consistency) won't work once a task has failed, because it cannot continue processing and forward barriers. If

Re: Restart hook and checkpoint

2018-03-19 Thread makeyang
Currently there is only a time-based way to trigger a checkpoint. Based on this discussion, I think Flink needs to introduce an event-based way to trigger checkpoints; for example, restarting a task manager should count as such an event.

--
Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble
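
For reference, a minimal sketch of the interval-based trigger that exists today (the 60-second interval is arbitrary); there is currently no hook to request a checkpoint in response to an event such as a TaskManager restart:

    import org.apache.flink.streaming.api.CheckpointingMode;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class TimeBasedCheckpointingSketch {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

            // Checkpoints are triggered purely on a fixed interval (here: every 60s);
            // there is no API to trigger one when, e.g., a TaskManager restarts.
            env.enableCheckpointing(60_000L, CheckpointingMode.EXACTLY_ONCE);

            env.fromElements(1, 2, 3).print();
            env.execute("time-based-checkpointing-sketch");
        }
    }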

Re: Restart hook and checkpoint

2018-03-18 Thread Ashish Pokharel
Thanks Fabian! Yes, that is exactly what we are looking to achieve. I looked at the fine-grained recovery FLIP but am not sure if that will do the trick. Like Fabian mentioned, we haven’t been enabling checkpointing (reasons below). I do understand it might not always be possible to actually take a ch

Re: Restart hook and checkpoint

2018-03-15 Thread Fabian Hueske
If I understand fine-grained recovery correctly, one would still need to take checkpoints. Ashish would like to avoid checkpointing and accept losing the state of the failed task. However, he would like to avoid losing more state than necessary due to the restart of tasks that did not fail. Best,

Re: Restart hook and checkpoint

2018-03-14 Thread Aljoscha Krettek
Hi,

Have you looked into fine-grained recovery?
https://cwiki.apache.org/confluence/display/FLINK/FLIP-1+%3A+Fine+Grained+Recovery+from+Task+Failures

Stefan (cc'ed) might be able to give you som
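
For readers following the FLIP-1 link: the failover strategy is a cluster-level setting (the key jobmanager.execution.failover-strategy, normally placed in flink-conf.yaml), and the strategy names available depend on the Flink version. A hedged local-environment sketch, only to show where the knob lives:

    import org.apache.flink.configuration.Configuration;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class FailoverStrategySketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // "full" restarts the whole job on any task failure (the behaviour discussed
            // in this thread); FLIP-1 adds finer-grained strategies behind this key.
            conf.setString("jobmanager.execution.failover-strategy", "full");

            StreamExecutionEnvironment env =
                    StreamExecutionEnvironment.createLocalEnvironment(2, conf);

            env.fromElements(1, 2, 3).print();
            env.execute("failover-strategy-sketch");
        }
    }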

Re: Restart hook and checkpoint

2018-03-06 Thread Ashish Pokharel
Hi Gordon, the issue really is that we are trying to avoid checkpointing, as the datasets are really heavy and all of the state is really transient in a few of our apps (flushed within a few seconds). So the high volume/velocity and the transient nature of the state make those apps good candidates to just not have ch

Re: Restart hook and checkpoint

2018-03-06 Thread Tzu-Li Tai
Hi Ashish, could you elaborate a bit more on why you think the restart of all operators leads to data loss? When a restart occurs, Flink will restart the job from the latest completed checkpoint. All operator state will be reloaded with the state written in that checkpoint, and the position of the input
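
A minimal sketch of the state-restore behaviour described above, assuming checkpointing is enabled and the function is applied on a keyed stream (names are illustrative): the counter below is included in every checkpoint, so after a restart it resumes from the value in the latest completed checkpoint rather than from zero.

    import org.apache.flink.api.common.functions.RichFlatMapFunction;
    import org.apache.flink.api.common.state.ValueState;
    import org.apache.flink.api.common.state.ValueStateDescriptor;
    import org.apache.flink.configuration.Configuration;
    import org.apache.flink.util.Collector;

    // Keyed state that is written into each checkpoint; on recovery Flink
    // restores it from the latest completed checkpoint before resuming input.
    public class CountPerKey extends RichFlatMapFunction<String, Long> {
        private transient ValueState<Long> count;

        @Override
        public void open(Configuration parameters) {
            count = getRuntimeContext().getState(
                    new ValueStateDescriptor<>("count", Long.class));
        }

        @Override
        public void flatMap(String value, Collector<Long> out) throws Exception {
            Long current = count.value();
            long updated = (current == null ? 0L : current) + 1;
            count.update(updated);
            out.collect(updated);
        }
    }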

Restart hook and checkpoint

2018-03-02 Thread ashish pok
All, it looks like Flink's default behavior is to restart all operators on a single operator error - in my case it is a Kafka Producer timing out. When this happens, I see in the logs that all operators are restarted. This essentially leads to data loss. In my case the volume of data is so high that it
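
A hedged sketch of the setup being described (the retry count and delay are arbitrary): with a standard fixed-delay restart strategy and no checkpointing, a failure in any single task, such as a timing-out Kafka producer, restarts the full job graph, and all in-flight operator state is lost.

    import java.util.concurrent.TimeUnit;

    import org.apache.flink.api.common.restartstrategy.RestartStrategies;
    import org.apache.flink.api.common.time.Time;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class RestartBehaviourSketch {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

            // Restart the job up to 3 times, waiting 10s between attempts.
            // A failure in any one task triggers a restart of the full job graph;
            // without checkpointing, operator state does not survive the restart.
            env.setRestartStrategy(
                    RestartStrategies.fixedDelayRestart(3, Time.of(10, TimeUnit.SECONDS)));

            // ... sources, transformations, sinks ...
            env.fromElements("a", "b").print();
            env.execute("restart-behaviour-sketch");
        }
    }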