Re: Failing to recover once checkpoint fails

2018-03-09 Thread Vishal Santoshi
Yes. We have not hit the snag in 1.4.0 (our current version). Again, though, this occurs under sustained downtime on Hadoop, and it has been more stable lately :) On Wed, Mar 7, 2018 at 4:09 PM, Stephan Ewen wrote: > The assumption in your previous mail is correct. > > Just to double check: >

Re: Failing to recover once checkpoint fails

2018-03-07 Thread Stephan Ewen
The assumption in your previous mail is correct. Just to double check: - The initially affected version you were running was 1.3.2, correct? The issue should be fixed in all active branches (1.4, 1.5, 1.6) and additionally in 1.3. Currently released versions with this fix: 1.4.0, 1.4.1. 1.5.0 is

Re: Failing to recover once checkpoint fails

2018-01-25 Thread Vishal Santoshi
To add to this, we are assuming that the default configuration will fail a pipeline if a checkpoint fails, and will hit the recovery loop if and only if the retry limit has not been reached. On Thu, Jan 25, 2018 at 7:00 AM, Vishal Santoshi wrote: > Sorry. > > There are 2 scenarios > > * Idempoten
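
As a side note, a minimal sketch of the configuration being discussed, assuming the 1.3/1.4 DataStream API; the class name and the concrete retry/delay values are illustrative and not taken from the thread. Per the assumption above, a failed checkpoint fails the pipeline, which then triggers the restart strategy, and once its attempts are exhausted the job fails instead of re-entering the recovery loop.

    import org.apache.flink.api.common.restartstrategy.RestartStrategies;
    import org.apache.flink.api.common.time.Time;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    // Illustrative only: how the "retry limit" is typically configured.
    public class RestartConfigSketch {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

            // Checkpoint every 60s; per the assumption in this thread, a failed
            // checkpoint fails the pipeline, which then triggers the restart strategy.
            env.enableCheckpointing(60_000);

            // The retry limit: at most 3 restarts, 10 seconds apart. Once these
            // attempts are exhausted the job goes to FAILED rather than looping.
            env.setRestartStrategy(RestartStrategies.fixedDelayRestart(3, Time.seconds(10)));

            // ... define sources/operators/sinks and call env.execute(...) ...
        }
    }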

Re: Failing to recover once checkpoint fails

2018-01-25 Thread Vishal Santoshi
Sorry. There are 2 scenarios. * Idempotent Sinks use case, where we would want to restore from the latest valid checkpoint. If I understand the code correctly, we try to retrieve all completed checkpoints for all handles in ZK and abort (throw an exception) if there are handles but no correspo

Re: Failing to recover once checkpoint fails

2018-01-24 Thread Aljoscha Krettek
Did you see my second mail? > On 24. Jan 2018, at 12:50, Vishal Santoshi wrote: > > As in, if there are chk handles in zk, there should be no reason to start a new > job ( bad handle, no hdfs connectivity etc ), > yes that sums it up. > > On Wed, Jan 24, 2018 at 5:35 AM, Aljoscha Krettek

Re: Failing to recover once checkpoint fails

2018-01-24 Thread Vishal Santoshi
As in, if there are chk handles in zk, there should be no reason to start a new job (bad handle, no hdfs connectivity, etc.); yes, that sums it up. On Wed, Jan 24, 2018 at 5:35 AM, Aljoscha Krettek wrote: > Wait a sec, I just checked out the code again and it seems we already do > that: https://git

Re: Failing to recover once checkpoint fails

2018-01-24 Thread Aljoscha Krettek
Wait a sec, I just checked out the code again and it seems we already do that: https://github.com/apache/flink/blob/9071e3befb8c279f73c3094c9f6bddc0e7cce9e5/flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/ZooKeeperCompletedCheckpointStore.java#L210

Re: Failing to recover once checkpoint fails

2018-01-24 Thread Aljoscha Krettek
That sounds reasonable: We would keep the first fix, i.e. never delete checkpoints if they're "corrupt", only when they're subsumed. Additionally, we fail the job if there are some checkpoints in ZooKeeper but none of them can be restored to prevent the case where a job starts from scratch even
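
For illustration, a hypothetical sketch of the rule described above; the types, helper names and LOG variable are made up, this is not the actual ZooKeeperCompletedCheckpointStore code. The idea is the one in the mail: never discard a checkpoint just because it cannot be read, and fail recovery when ZooKeeper holds handles but none of them could be restored.

    // Hypothetical sketch only: names and types are illustrative, not Flink's actual API.
    static List<CompletedCheckpoint> recoverOrFail(List<StateHandle> handlesInZooKeeper) throws FlinkException {
        List<CompletedCheckpoint> restored = new ArrayList<>();
        for (StateHandle handle : handlesInZooKeeper) {
            try {
                CompletedCheckpoint cp = retrieveCompletedCheckpoint(handle); // may touch HDFS
                if (cp != null) {
                    restored.add(cp);
                }
            } catch (Exception e) {
                // Do NOT discard the handle: the checkpoint may only be temporarily
                // unreadable (e.g. the NameNode is in safe mode).
                LOG.warn("Could not restore checkpoint from handle " + handle, e);
            }
        }
        if (restored.isEmpty() && !handlesInZooKeeper.isEmpty()) {
            // Checkpoints exist in ZooKeeper but none could be read: fail recovery
            // instead of silently starting the job from scratch.
            throw new FlinkException("Found " + handlesInZooKeeper.size()
                    + " checkpoints in ZooKeeper, but none of them could be restored.");
        }
        return restored;
    }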

Re: Failing to recover once checkpoint fails

2018-01-23 Thread Vishal Santoshi
If we hit the retry limit, abort the job. In our case we will restart from the last SP (we, like any production pipeline, do it n times a day), and that I would think should be OK for most folks? On Tue, Jan 23, 2018 at 11:38 AM, Vishal Santoshi wrote: > Thank you for considering this. If I under

Re: Failing to recover once checkpoint fails

2018-01-23 Thread Vishal Santoshi
Thank you for considering this. If I understand you correctly. * CHK pointer on ZK for a CHK state on hdfs was done successfully. * Some issue restarted the pipeline. * The NN was down unfortunately and flink could not retrieve the CHK state from the CHK pointer on ZK. Before * The CHK pointer

Re: Failing to recover once checkpoint fails

2018-01-23 Thread Aljoscha Krettek
Hi Vishal, I think you might be right. We fixed the problem that checkpoints were dropped via https://issues.apache.org/jira/browse/FLINK-7783. However, we still have the problem that if the DFS is not up at all, then it will look as if the job

Re: Failing to recover once checkpoint fails

2018-01-23 Thread Fabian Hueske
Sorry for the late reply. I created FLINK-8487 [1] to track this problem. @Vishal, can you have a look and check if I forgot some details? I logged the issue for Flink 1.3.2, is that correct? Please add more information if you think it is relevant. Thanks, Fabian [1] https://issues.apache.org/j

Re: Failing to recover once checkpoint fails

2018-01-18 Thread Vishal Santoshi
Or this one https://issues.apache.org/jira/browse/FLINK-4815 On Thu, Jan 18, 2018 at 4:13 PM, Vishal Santoshi wrote: > ping. > > This happened again on production and it seems reasonable to abort > when a checkpoint is not found rather than behave as if it is a brand new > pipeline. > > On

Re: Failing to recover once checkpoint fails

2018-01-18 Thread Vishal Santoshi
ping. This happened again in production, and it seems reasonable to abort when a checkpoint is not found rather than behave as if it is a brand new pipeline. On Tue, Jan 16, 2018 at 9:33 AM, Vishal Santoshi wrote: > Folks sorry for being late on this. Can somebody with the knowledge of > th

Re: Failing to recover once checkpoint fails

2018-01-16 Thread Vishal Santoshi
Folks, sorry for being late on this. Can somebody with knowledge of this code base create a JIRA issue for the above? We have seen this more than once in production. On Mon, Oct 9, 2017 at 10:21 AM, Aljoscha Krettek wrote: > Hi Vishal, > > Some relevant Jira issues for you are: > > - https

Re: Failing to recover once checkpoint fails

2017-10-09 Thread Aljoscha Krettek
Hi Vishal, Some relevant Jira issues for you are: - https://issues.apache.org/jira/browse/FLINK-4808: Allow skipping failed checkpoints - https://issues.apache.org/jira/browse/FLINK-4815:

Re: Failing to recover once checkpoint fails

2017-10-09 Thread Fabian Hueske
Hi Vishal, it would be great if you could create a JIRA ticket with Blocker priority. Please add all relevant information of your detailed analysis, add a link to this email thread (see [1] for the web archive of the mailing list), and post the id of the JIRA issue here. Thanks for looking into t

Re: Failing to recover once checkpoint fails

2017-10-06 Thread Vishal Santoshi
Thank you for confirming. I think this is a critical bug. In essence, any checkpoint store (hdfs/S3/File) will lose state if it is unavailable at resume. This becomes all the more painful with your confirming that "failed checkpoints kill the job", b'coz essentially it means that if remote

Re: Failing to recover once checkpoint fails

2017-10-05 Thread Aljoscha Krettek
Hi Vishal, I think you're right! And thanks for looking into this so deeply. With your last mail you're basically saying that the checkpoint could not be restored because your HDFS was temporarily down. If Flink had not deleted that checkpoint, it might have been possible to restore it at a late

Re: Failing to recover once checkpoint fails

2017-10-05 Thread Vishal Santoshi
I think this is the offending piece. There is a catch-all Exception, which IMHO should distinguish a recoverable exception from an unrecoverable one. try { completedCheckpoint = retrieveCompletedCheckpoint(checkpointStateHandle); if (completedCheckpoint != null) { completedCheckpoints.add(completed
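
For what it's worth, a hedged sketch of the distinction being asked for here, reusing the method names from the snippet above but otherwise hypothetical (FlinkException just stands in for whatever the caller treats as a recoverable recovery failure): propagate failures that look transient, such as the DFS being unreachable, instead of swallowing every Exception and dropping the checkpoint.

    try {
        completedCheckpoint = retrieveCompletedCheckpoint(checkpointStateHandle);
        if (completedCheckpoint != null) {
            completedCheckpoints.add(completedCheckpoint);
        }
    } catch (IOException e) {
        // Recoverable: the checkpoint store (HDFS NameNode, S3, ...) is temporarily
        // unavailable. Rethrow so this recovery attempt fails and can be retried
        // later, instead of discarding the checkpoint pointer.
        throw new FlinkException("Checkpoint store temporarily unavailable", e);
    } catch (Exception e) {
        // Genuinely unrecoverable handle (e.g. corrupt metadata): skip only this checkpoint.
        LOG.warn("Skipping unreadable checkpoint " + checkpointStateHandle, e);
    }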

Re: Failing to recover once checkpoint fails

2017-10-05 Thread Vishal Santoshi
So this is the issue; please tell us if this is wrong. ZK had some state (backed by hdfs) that referred to a checkpoint (the exact last checkpoint that succeeded before the NN screwed us). When the JM tried to recreate the state and, b'coz the NN was down, failed to retrieve the CHK hand

Re: Failing to recover once checkpoint fails

2017-10-05 Thread Vishal Santoshi
Also note that the zookeeper recovery (sadly on the same hdfs cluster) also showed the same behavior. It had the pointers to the chk point (I think that is what it does, keeps metadata of where the checkpoint is, etc.). It too decided to keep the recovery file from the failed state. -rw-

Re: Failing to recover once checkpoint fails

2017-10-05 Thread Vishal Santoshi
Another thing I noted was this: drwxr-xr-x - root hadoop 0 2017-10-04 13:54 /flink-checkpoints/prod/c4af8dfa864e2f9a51764de9f0725b39/chk-44286 drwxr-xr-x - root hadoop 0 2017-10-05 09:15 /flink-checkpoints/prod/c4af8dfa864e2f9a51764de9f0725b39/chk-45428 Generally what

Re: Failing to recover once checkpoint fails

2017-10-05 Thread Vishal Santoshi
Hello Fabian, First of all, congratulations on this fabulous framework. I have worked with GDF, and though GDF has some natural pluses, Flink's state management is far more advanced. With Kafka as a source it negates issues GDF has (GDF integration with pub/sub is organic and th

Re: Failing to recover once checkpoint fails

2017-10-05 Thread Fabian Hueske
Hi Vishal, window operators are always stateful because the operator needs to remember previously received events (WindowFunction) or intermediate results (ReduceFunction). Given the program you described, a checkpoint should include the Kafka consumer offset and the state of the window operator.

Re: Failing to recover once checkpoint fails

2017-10-04 Thread Vishal Santoshi
To add to it, my pipeline is a simple keyBy(0) .timeWindow(Time.of(window_size, TimeUnit.MINUTES)) .allowedLateness(Time.of(late_by, TimeUnit.SECONDS)) .reduce(new ReduceFunction(), new WindowFunction()) On Wed, Oct 4, 2017 at 8:19 PM, Vishal Santoshi wrote: > Hello fol
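
Filled out for context, a minimal runnable version of the pipeline sketched above, assuming the 1.3/1.4 DataStream API; the element type, window size, lateness values, trivial reduce/window logic, and the bounded source are placeholders (the real job reads from Kafka).

    import java.util.concurrent.TimeUnit;

    import org.apache.flink.api.common.functions.ReduceFunction;
    import org.apache.flink.api.java.tuple.Tuple;
    import org.apache.flink.api.java.tuple.Tuple2;
    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
    import org.apache.flink.streaming.api.functions.windowing.WindowFunction;
    import org.apache.flink.streaming.api.windowing.time.Time;
    import org.apache.flink.streaming.api.windowing.windows.TimeWindow;
    import org.apache.flink.util.Collector;

    // Illustrative reconstruction of the pipeline described above; not the actual job.
    public class WindowPipelineSketch {

        public static void main(String[] args) throws Exception {
            final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
            env.enableCheckpointing(60_000);

            final long windowSize = 5;  // placeholder for window_size
            final long lateBy = 30;     // placeholder for late_by

            // The real job reads from Kafka; a fixed-element source keeps this sketch runnable.
            DataStream<Tuple2<String, Long>> source =
                    env.fromElements(Tuple2.of("key-a", 1L), Tuple2.of("key-a", 2L), Tuple2.of("key-b", 3L));

            source
                .keyBy(0)
                .timeWindow(Time.of(windowSize, TimeUnit.MINUTES))
                .allowedLateness(Time.of(lateBy, TimeUnit.SECONDS))
                .reduce(
                    new ReduceFunction<Tuple2<String, Long>>() {
                        @Override
                        public Tuple2<String, Long> reduce(Tuple2<String, Long> a, Tuple2<String, Long> b) {
                            return Tuple2.of(a.f0, a.f1 + b.f1);  // pre-aggregate per key
                        }
                    },
                    new WindowFunction<Tuple2<String, Long>, Tuple2<String, Long>, Tuple, TimeWindow>() {
                        @Override
                        public void apply(Tuple key, TimeWindow window,
                                          Iterable<Tuple2<String, Long>> reduced,
                                          Collector<Tuple2<String, Long>> out) {
                            for (Tuple2<String, Long> value : reduced) {
                                out.collect(value);  // emit the aggregated result per window
                            }
                        }
                    })
                .print();

            env.execute("window-pipeline-sketch");
        }
    }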

Failing to recover once checkpoint fails

2017-10-04 Thread Vishal Santoshi
Hello folks, As far as I know, a checkpoint failure should be ignored and retried with potentially larger state. I had this situation: * hdfs went into safe mode b'coz of Name Node issues * an exception was thrown: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): Ope