Re: Checkpoints periodically fail with hdfs as the state backend - Could not flush and close the file system output stream

2019-05-22 Thread PedroMrChaves
Unfortunately the audit logs for HDFS were not enabled. We will enable them and post the results when the problem happens again. Nonetheless, we don't have any other process using Hadoop besides Flink. - Best Regards, Pedro Chaves -- Sent from: http://apache-flink-user-mailing-list-archive.2336
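For reference, a minimal sketch of how the NameNode audit log is usually switched on in the stock Hadoop log4j.properties; the RFAAUDIT appender name and log location below are assumptions and may differ per distribution:

    # conf/log4j.properties on the NameNode (restart required)
    hdfs.audit.logger=INFO,RFAAUDIT
    log4j.logger.org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit=${hdfs.audit.logger}
    log4j.additivity.org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit=false
    log4j.appender.RFAAUDIT=org.apache.log4j.RollingFileAppender
    log4j.appender.RFAAUDIT.File=${hadoop.log.dir}/hdfs-audit.log
    log4j.appender.RFAAUDIT.layout=org.apache.log4j.PatternLayout
    log4j.appender.RFAAUDIT.layout.ConversionPattern=%d{ISO8601} %p %c{2}: %m%n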

Re: Checkpoints periodically fail with hdfs as the state backend - Could not flush and close the file system output stream

2019-05-21 Thread Congxian Qiu
Hi Pedro, From the previously given log, I found that checkpoint 65912 had expired, which then raised the IOException. When a checkpoint expires, the checkpoint dir will be deleted (CheckpointCoordinator#549 on the release-1.6 branch), but the unfinished task will still write to the previous files,
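If expiry is indeed the trigger, one option is to give checkpoints more headroom before they time out. A minimal sketch against the Flink 1.6-era DataStream API; the interval, timeout, and pause values are illustrative assumptions, not a recommendation:

    import org.apache.flink.streaming.api.environment.CheckpointConfig;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class CheckpointTuning {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
            env.enableCheckpointing(60_000);               // trigger a checkpoint every 60 s
            CheckpointConfig cfg = env.getCheckpointConfig();
            cfg.setCheckpointTimeout(10 * 60_000);         // give slow checkpoints 10 min before they expire
            cfg.setMinPauseBetweenCheckpoints(30_000);     // breathing room between consecutive checkpoints
            cfg.setMaxConcurrentCheckpoints(1);            // never have two checkpoints writing to HDFS at once
            cfg.setFailOnCheckpointingErrors(false);       // a failed checkpoint no longer fails the whole job

            env.fromElements(1, 2, 3).print();             // stand-in for the real job graph
            env.execute("checkpoint-tuning sketch");
        }
    }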

Re: Checkpoints periodically fail with hdfs as the state backend - Could not flush and close the file system output stream

2019-05-21 Thread PedroMrChaves
The issue happened again. /AsynchronousException{java.lang.Exception: Could not materialize checkpoint 47400 for operator ENRICH (1/4).} at org.apache.flink.streaming.runtime.tasks.StreamTask$AsyncCheckpointExceptionHandler.tryHandleCheckpointException(StreamTask.java:1153) at org.

Re: Checkpoints periodically fail with hdfs as the state backend - Could not flush and close the file system output stream

2019-05-16 Thread PedroMrChaves
Hello Andrey, The audit log doesn't have anything that would point to the file being deleted. The only thing worth mentioning is the following line: /2019-05-15 10:01:39,082 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: BLOCK* blk_1248714854_174974084 is COMMITTED but not COMPLETE(numNodes=

Re: Checkpoints periodically fail with hdfs as the state backend - Could not flush and close the file system output stream

2019-05-16 Thread PedroMrChaves
Hello, Thanks for the help. I've attached the logs. Our cluster has 2 job managers (HA) and 4 task managers. logs.tgz Regards, Pedro - Best Regards, Pedro Chaves -- Sent from: http://apache-flink

Re: Checkpoints periodically fail with hdfs as the state backend - Could not flush and close the file system output stream

2019-05-16 Thread Congxian Qiu
Hi Pedro, Could you please share the audit log for the file `/flink/data/checkpoints/76f7b4f5c679e8f2d822c9c3c73faf5d/chk-65912/68776faf-b687-403b-ba0c-17419f8684dc`? It seems this file did not exist, which caused the problem (maybe the file was created and then deleted for some reason). Best, Congxian Andrey Zagrebin
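To help rule HDFS in or out, a minimal sketch, assuming a stock Hadoop client configuration on the classpath, of checking whether that checkpoint file is still present; the namenode URI is a placeholder:

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class CheckChkFile {
        public static void main(String[] args) throws Exception {
            // Placeholder namenode address; the path is the one quoted above.
            FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), new Configuration());
            Path p = new Path("/flink/data/checkpoints/76f7b4f5c679e8f2d822c9c3c73faf5d/chk-65912/"
                    + "68776faf-b687-403b-ba0c-17419f8684dc");
            if (fs.exists(p)) {
                FileStatus st = fs.getFileStatus(p);
                System.out.println("File exists, length=" + st.getLen());
            } else {
                System.out.println("File is gone: " + p);
            }
        }
    }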

Re: Checkpoints periodically fail with hdfs as the state backend - Could not flush and close the file system output stream

2019-05-16 Thread Andrey Zagrebin
Hi, could you also post the job master logs, and ideally the full task manager logs? This failure can be caused by some other previous failure. Best, Andrey On Wed, May 15, 2019 at 2:48 PM PedroMrChaves wrote: > Hello, > > Every once in a while our checkpoints fail with the following exception: > > /A

Checkpoints periodically fail with hdfs as the state backend - Could not flush and close the file system output stream

2019-05-15 Thread PedroMrChaves
Hello, Every once in a while our checkpoints fail with the following exception: /AsynchronousException{java.lang.Exception: Could not materialize checkpoint 65912 for operator AGGREGATION-FILTER (2/2).} at org.apache.flink.streaming.runtime.tasks.StreamTask$AsyncCheckpointExceptionHandler
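For context, a minimal sketch of the kind of setup the subject line describes, checkpointing to HDFS with the filesystem state backend; the checkpoint path and interval are assumptions, not the poster's actual configuration:

    import org.apache.flink.runtime.state.filesystem.FsStateBackend;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class HdfsCheckpointedJob {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
            // Keep checkpoint state in HDFS; the path mirrors the directory seen in the audit-log discussion.
            env.setStateBackend(new FsStateBackend("hdfs:///flink/data/checkpoints"));
            env.enableCheckpointing(60_000); // illustrative interval

            env.fromElements(1, 2, 3).print(); // stand-in for the real ENRICH / AGGREGATION-FILTER pipeline
            env.execute("checkpointed job");
        }
    }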