Will try the setting out. I do not want to push it, but the exception could be much more descriptive :)
Thanks much

On Tue, Jul 10, 2018 at 7:48 AM, Till Rohrmann <trohrm...@apache.org> wrote:

> Whether a Flink task should fail in case of a checkpoint error or not can
> be configured via the CheckpointConfig, which you can access via the
> StreamExecutionEnvironment. You have to call
> `CheckpointConfig#setFailOnCheckpointingErrors(false)` to deactivate the
> default behaviour where the task always fails in case of a checkpoint
> error.
>
> Cheers,
> Till
>
> On Tue, Jul 10, 2018 at 10:50 AM Vishal Santoshi <vishal.santo...@gmail.com> wrote:
>
>> That makes sense; what does not make sense is that the pipeline
>> restarted. I would have imagined that an aborted checkpoint would not
>> abort the pipeline.
>>
>> On Tue, Jul 10, 2018 at 3:16 AM, Till Rohrmann <trohrm...@apache.org> wrote:
>>
>>> Hi Vishal,
>>>
>>> it looks as if the flushing of the checkpoint data to HDFS failed due to
>>> some expired lease on the checkpoint file. Therefore, Flink aborted the
>>> checkpoint `chk-125` and removed it. This is the normal behaviour if
>>> Flink cannot complete a checkpoint. As you can see, afterwards the
>>> checkpoints are again successful.
>>>
>>> Cheers,
>>> Till
>>>
>>> On Mon, Jul 9, 2018 at 7:15 PM Vishal Santoshi <vishal.santo...@gmail.com> wrote:
>>>
>>>> drwxr-xr-x   - root hadoop   0 2018-07-09 12:33 /flink/kpi_unique/392d0436e53f3ef5e494ba3cc63428bf/chk-123
>>>> drwxr-xr-x   - root hadoop   0 2018-07-09 12:35 /flink/kpi_unique/392d0436e53f3ef5e494ba3cc63428bf/chk-124
>>>> drwxr-xr-x   - root hadoop   0 2018-07-09 12:51 /flink/kpi_unique/392d0436e53f3ef5e494ba3cc63428bf/chk-126
>>>> drwxr-xr-x   - root hadoop   0 2018-07-09 12:53 /flink/kpi_unique/392d0436e53f3ef5e494ba3cc63428bf/chk-127
>>>> drwxr-xr-x   - root hadoop   0 2018-07-09 12:55 /flink/kpi_unique/392d0436e53f3ef5e494ba3cc63428bf/chk-128
>>>>
>>>> Note the missing chk-125.
>>>>
>>>> So I see the above checkpoints for a job. At 2018-07-09 12:38:43 this
>>>> exception was thrown: chk-125 is missing from HDFS, and the job
>>>> complains about it.
>>>>
>>>> Caused by: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException):
>>>> No lease on /flink/kpi_unique/392d0436e53f3ef5e494ba3cc63428bf/chk-125/e9d6886c-e693-4827-97bc-dd3fd526b64e
>>>> (inode 1987098987): File does not exist. Holder
>>>> DFSClient_NONMAPREDUCE_1527557459_11240 does not have any open files.
>>>>
>>>> At about the same time:
>>>>
>>>> ID: 125   Failure Time: 12:38:23   Cause: Checkpoint expired before completing.
>>>>
>>>> Is this some race condition? A checkpoint had to be taken, and that was
>>>> chk-125; it took longer than the configured time (1 minute). It aborted
>>>> the pipeline. Should it have? It actually did not even create chk-125,
>>>> but then refers to it and aborts the pipeline.
>>>>
>>>> This is the full exception:
>>>>
>>>> AsynchronousException{java.lang.Exception: Could not materialize
>>>> checkpoint 125 for operator 360 minute interval -> 360 minutes to
>>>> TimeSeries.Entry.2 (5/6).}
>>>>     at org.apache.flink.streaming.runtime.tasks.StreamTask$AsyncCheckpointExceptionHandler.tryHandleCheckpointException(StreamTask.java:1154)
>>>>     at org.apache.flink.streaming.runtime.tasks.StreamTask$AsyncCheckpointRunnable.handleExecutionException(StreamTask.java:948)
>>>>     at org.apache.flink.streaming.runtime.tasks.StreamTask$AsyncCheckpointRunnable.run(StreamTask.java:885)
>>>>     at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>>>>     at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>>>>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>>>>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>>>>     at java.lang.Thread.run(Thread.java:748)
>>>> Caused by: java.lang.Exception: Could not materialize checkpoint 125 for
>>>> operator 360 minute interval -> 360 minutes to TimeSeries.Entry.2 (5/6).
>>>>     at org.apache.flink.streaming.runtime.tasks.StreamTask$AsyncCheckpointRunnable.handleExecutionException(StreamTask.java:943)
>>>>     ... 6 more
>>>> Caused by: java.util.concurrent.ExecutionException: java.io.IOException:
>>>> Could not flush and close the file system output stream to
>>>> hdfs://nn-crunchy.bf2.tumblr.net:8020/flink/kpi_unique/392d0436e53f3ef5e494ba3cc63428bf/chk-125/e9d6886c-e693-4827-97bc-dd3fd526b64e
>>>> in order to obtain the stream state handle
>>>>     at java.util.concurrent.FutureTask.report(FutureTask.java:122)
>>>>     at java.util.concurrent.FutureTask.get(FutureTask.java:192)
>>>>     at org.apache.flink.util.FutureUtil.runIfNotDoneAndGet(FutureUtil.java:53)
>>>>     at org.apache.flink.streaming.api.operators.OperatorSnapshotFinalizer.<init>(OperatorSnapshotFinalizer.java:47)
>>>>     at org.apache.flink.streaming.runtime.tasks.StreamTask$AsyncCheckpointRunnable.run(StreamTask.java:854)
>>>>     ... 5 more
>>>> Caused by: java.io.IOException: Could not flush and close the file system
>>>> output stream to
>>>> hdfs://nn-crunchy.bf2.tumblr.net:8020/flink/kpi_unique/392d0436e53f3ef5e494ba3cc63428bf/chk-125/e9d6886c-e693-4827-97bc-dd3fd526b64e
>>>> in order to obtain the stream state handle
>>>>     at org.apache.flink.runtime.state.filesystem.FsCheckpointStreamFactory$FsCheckpointStateOutputStream.closeAndGetHandle(FsCheckpointStreamFactory.java:325)
>>>>     at org.apache.flink.runtime.state.CheckpointStreamWithResultProvider$PrimaryStreamOnly.closeAndFinalizeCheckpointStreamResult(CheckpointStreamWithResultProvider.java:77)
>>>>     at org.apache.flink.runtime.state.heap.HeapKeyedStateBackend$HeapSnapshotStrategy$1.performOperation(HeapKeyedStateBackend.java:705)
>>>>     at org.apache.flink.runtime.state.heap.HeapKeyedStateBackend$HeapSnapshotStrategy$1.performOperation(HeapKeyedStateBackend.java:641)
>>>>     at org.apache.flink.runtime.io.async.AbstractAsyncCallableWithResources.call(AbstractAsyncCallableWithResources.java:75)
>>>>     at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>>>>     at org.apache.flink.util.FutureUtil.runIfNotDoneAndGet(FutureUtil.java:50)
>>>> Caused by: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException):
>>>> No lease on /flink/kpi_unique/392d0436e53f3ef5e494ba3cc63428bf/chk-125/e9d6886c-e693-4827-97bc-dd3fd526b64e
>>>> (inode 1987098987): File does not exist. Holder
>>>> DFSClient_NONMAPREDUCE_1527557459_11240 does not have any open files.
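
For anyone landing on this thread later, here is a minimal sketch of the two
knobs discussed above, assuming the Flink 1.5-era DataStream API. Only
CheckpointConfig#setFailOnCheckpointingErrors(false) and the one-minute
timeout come from the thread; the checkpoint interval, the raised ten-minute
timeout value, and the job name are illustrative placeholders.

    import org.apache.flink.streaming.api.environment.CheckpointConfig;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class CheckpointErrorHandlingSketch {

        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env =
                    StreamExecutionEnvironment.getExecutionEnvironment();

            // Illustrative checkpoint interval; not taken from the thread.
            env.enableCheckpointing(120_000L);

            CheckpointConfig config = env.getCheckpointConfig();

            // The setting Till describes: deactivate the default behaviour
            // where the task fails (restarting the pipeline) whenever a
            // checkpoint cannot be completed.
            config.setFailOnCheckpointingErrors(false);

            // The thread implies a one-minute checkpoint timeout, after
            // which a slow checkpoint is declared "expired before
            // completing" and aborted. Raising it (ten minutes here,
            // purely illustrative) gives slow HDFS flushes more headroom.
            config.setCheckpointTimeout(600_000L);

            // ... sources, operators, and sinks would be defined here ...

            env.execute("checkpoint-error-handling-sketch");
        }
    }

With setFailOnCheckpointingErrors(false), a failed or expired checkpoint such
as chk-125 should simply be declined and cleaned up while the job keeps
running, and a later checkpoint (chk-126 in the listing above) can still
succeed.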