I read through this thread and didn't see any resolution to the slow checkpoint issue (just that someone resolved their backpressure issue).
We are experiencing the same problem:

- When there is no backpressure, checkpoints take less than 100 ms.
- When there is high backpressure, checkpoints take anywhere from 5 to 25 minutes.

This is preventing us from using the checkpointing feature at all, since periodic backpressure is unavoidable.

We are seeing this on Flink 1.4.0. We retain only a single checkpoint, and the retained checkpoint is smaller than 250 KB, so there is not a lot of state.

Our settings:

    state.backend: jobmanager
    state.backend.async: true
    state.backend.fs.checkpointdir: hdfs://checkpoints
    state.checkpoints.num-retained: 1
    max concurrent checkpoints: 1
    checkpointing mode: AT_LEAST_ONCE

One other data point: if I rewrite the job to allow chaining all steps (i.e. same parallelism on all steps, so they fit in 1 task slot), the checkpoints are still slow under backpressure, but they are an order of magnitude faster. They take about 60 seconds rather than 15 minutes.
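
One possible way to reason about the chaining observation is sketched below. It is a back-of-envelope model, not a description of Flink internals. It assumes that a checkpoint barrier travels in-band with the records and therefore has to wait behind whatever is already queued at each network exchange between non-chained tasks. The class name, method name, and all numbers are hypothetical and are not measurements from our job.

```java
public class BarrierDelaySketch {

    /**
     * Rough estimate of how long a checkpoint barrier needs to reach the sink.
     * Assumption (not taken from Flink's code): at every network exchange the
     * barrier waits until the records queued ahead of it have been drained.
     *
     * @param exchanges                number of network exchanges between tasks
     * @param queuedRecordsPerExchange records buffered ahead of the barrier at each exchange
     * @param drainRatePerSecond       records per second the backpressured consumer can process
     */
    static double barrierDelaySeconds(int exchanges,
                                      long queuedRecordsPerExchange,
                                      double drainRatePerSecond) {
        if (exchanges < 0 || queuedRecordsPerExchange < 0 || drainRatePerSecond <= 0) {
            throw new IllegalArgumentException(
                "exchanges and queue size must be non-negative, drain rate must be positive");
        }
        // Time to drain one queue, multiplied by the number of queues the barrier crosses.
        return exchanges * (queuedRecordsPerExchange / drainRatePerSecond);
    }

    public static void main(String[] args) {
        long queued = 1_000;   // hypothetical records buffered per exchange
        double rate = 10.0;    // hypothetical records/second under backpressure

        System.out.printf("3 exchanges: %.0f s%n", barrierDelaySeconds(3, queued, rate));
        System.out.printf("1 exchange:  %.0f s%n", barrierDelaySeconds(1, queued, rate));
    }
}
```

Under this model the delay grows with the number of exchanges and with the amount of buffered data, independent of state size, which would at least be consistent with small checkpoints being slow and with fewer task boundaries being faster. It does not explain why a fully chained job is still slow, so it is only a partial picture.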