I read through this thread and didn't see any resolution to the slow
checkpoint issue (just that someone resolved their backpressure issue).

We are experiencing the same problem: 
- When there is no backpressure, checkpoints take less than 100ms
- When there is high backpressure, checkpoints take anywhere from 5 minutes
to 25 minutes.

This is preventing us from using the checkpointing feature at all, since
periodic backpressure is unavoidable.

We are experiencing this when running on Flink 1.4.0.
We are retaining only a single checkpoint, and the size of the retained
checkpoint is less than 250KB, so there's not a lot of state.
The relevant settings are:
   state.backend: jobmanager
   state.backend.async: true
   state.backend.fs.checkpointdir: hdfs://checkpoints
   state.checkpoints.num-retained: 1
   max concurrent checkpoints: 1
   checkpointing mode: AT_LEAST_ONCE

One other data point: if I rewrite the job to allow chaining all steps (i.e.
same parallelism on all steps, so they fit in 1 task slot), the checkpoints
are still slow under backpressure, but are an order of magnitude faster --
they take about 60 seconds rather than 15 minutes.
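
My working guess, which is an assumption on my part and not something
confirmed in this thread, is that checkpoint barriers travel in-band with
the records, so under backpressure each barrier has to wait behind whatever
is already queued between tasks. Chaining would then help because it removes
most of those queues. A toy model of that idea follows; the function name,
the queue depths and the drain rate are all hypothetical, picked only to land
on the same scale as the timings above, not measured from our job:

```python
def barrier_delay_seconds(queued_records_per_hop, drain_rate_per_second):
    """Toy estimate of how long a barrier waits behind queued records.

    Assumes (hypothetically) that the barrier must wait for every record
    queued ahead of it at each hop, and that all hops drain at the same rate.
    """
    if drain_rate_per_second <= 0:
        raise ValueError("drain rate must be positive")
    return sum(queued / drain_rate_per_second for queued in queued_records_per_hop)


# Hypothetical numbers, for illustration only.
# Unchained job: three exchanges between tasks, each with a deep backlog.
unchained = barrier_delay_seconds([30_000, 30_000, 30_000], drain_rate_per_second=100)

# Fully chained job: one much shallower queue left in the path.
chained = barrier_delay_seconds([6_000], drain_rate_per_second=100)

print(f"unchained: {unchained / 60:.0f} min, chained: {chained:.0f} s")
```

If this picture is roughly right, checkpoint duration under backpressure
would be governed by how much data sits ahead of the barrier and how fast
the slowest step drains it, not by the state size, which would match the
fact that our 250KB of state still takes minutes to checkpoint.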



--
Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/
