Hi,

   I am using FileIO to continuously watch a folder for new files to
process. The problem is that when Flink starts reading a 200 MB file (around
3M elements) and a checkpoint is triggered at the same time, the checkpoint
never finishes until the WHOLE file has been processed.

Minimal example:
https://github.com/seznam/beam/blob/simunek/failingCheckpoint/examples/java/src/main/java/org/apache/beam/examples/CheckpointFailingExample.java

My theory of what could be wrong, from my understanding: the CheckpointMark
in this case starts from Create.ofProvider and is then propagated to the
downstream operators, where it ends up (in the queue) behind all the splits.
That means every split has to be read before the operator can checkpoint
successfully. The problem gets even bigger with multiple files: then we have
to wait until all files are processed before a checkpoint can succeed.
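The queueing effect I describe above can be sketched with a toy simulation (plain Java, not Beam or Flink code; the class and method names are made up for illustration): all split elements sit in the operator's input queue ahead of the checkpoint barrier, so the barrier is only dequeued, and the checkpoint only completes, after every element has been processed.

```java
import java.util.ArrayDeque;
import java.util.Queue;

public class BarrierBehindSplits {

    // Returns how many elements are processed before the barrier is reached.
    static int elementsProcessedBeforeCheckpoint(int splitElements) {
        Queue<String> inputQueue = new ArrayDeque<>();
        // All split elements are enqueued ahead of the checkpoint barrier.
        for (int i = 0; i < splitElements; i++) {
            inputQueue.add("element-" + i);
        }
        inputQueue.add("BARRIER");

        int processed = 0;
        while (!inputQueue.isEmpty()) {
            String item = inputQueue.poll();
            if (item.equals("BARRIER")) {
                // The checkpoint can only complete here, after all elements.
                break;
            }
            processed++; // simulate processing one element
        }
        return processed;
    }

    public static void main(String[] args) {
        // With 3M elements queued ahead of the barrier, the checkpoint
        // waits for all of them.
        System.out.println(elementsProcessedBeforeCheckpoint(3_000_000));
    }
}
```

So if my mental model is right, the checkpoint duration grows linearly with the number of unprocessed elements queued ahead of the barrier.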

1. Are my assumptions correct?
2. Is there some way to improve the behavior of SplittableDoFn (or the
subsequent reading from BoundedSource) on Flink so that the checkpoint
barrier propagates better?
 
For now my fix is reading smaller files (30 MB) one by one, but it's not
very future-proof.

Versions:
Beam 2.17
Flink 1.9

Please correct my poor understanding of checkpointing with Beam and Flink;
it would be wonderful if you had some advice on what to improve or where
to look.
