Hi Robin,

First of all, did you get the state size from the web UI? If so, the state size shown there is the incremental checkpoint size, not the actual full size [1]. Assuming you have only one RocksDB instance per slot, the incremental checkpoint size for each RocksDB instance would be about 2011 MB (220 GB spread over 28 TMs x 4 slots = 112 slots), which is somewhat large for an incremental checkpoint.

If every operator only uploads about 2011 MB to S3, an end-to-end duration of 58 minutes is far too long. Could you please check the async phase in the checkpoint details of all tasks? The async time reflects the S3 write performance. My guess is that the async time is not particularly large; the most common reason for a long end-to-end duration is that an operator receives the barrier late. It would help if you could share the checkpoint details page from the UI for further investigation.
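If it is more convenient, you could also pull the same numbers from the REST API rather than reading them off the web UI. Below is a minimal sketch in Java; the JobManager address, job id and checkpoint id are placeholders you would need to fill in. The JSON response contains the checkpoint statistics per task, which should include the timing breakdown you would otherwise see in the UI.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.URL;

    // Fetch the checkpoint details for one checkpoint from Flink's monitoring
    // REST API. Host, job id and checkpoint id are placeholders, not real values.
    public class CheckpointDetails {
        public static void main(String[] args) throws Exception {
            String jobManager = "http://localhost:8081";  // assumed JobManager REST address
            String jobId = "<job-id>";                    // placeholder
            String checkpointId = "<checkpoint-id>";      // placeholder

            URL url = new URL(jobManager + "/jobs/" + jobId
                    + "/checkpoints/details/" + checkpointId);
            try (BufferedReader in = new BufferedReader(new InputStreamReader(url.openStream()))) {
                String line;
                while ((line = in.readLine()) != null) {
                    System.out.println(line);  // raw JSON with per-task checkpoint statistics
                }
            }
        }
    }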
[1] https://issues.apache.org/jira/browse/FLINK-13390

Best
Yun Tang

________________________________
From: Robin Cassan <robin.cas...@contentsquare.com>
Sent: Wednesday, April 15, 2020 18:35
To: user <user@flink.apache.org>
Subject: Quick survey on checkpointing performance

Hi all,

We are currently experiencing long checkpointing times on S3 and are wondering how abnormal this is compared to other loads and setups. Could some of you share a few stats from your running architectures so we can compare? Here are our stats:

Architecture: 28 TMs on Kubernetes, 4 slots per TM, local NVMe SSDs (r5d.2xlarge instances), RocksDB state backend, incremental checkpoints on Amazon S3 (without entropy injection), checkpoint interval of 1 hour
Typical state size for one checkpoint: 220 GB
Checkpointing duration (end to end): 58 minutes

We are surprised that it takes so long to send 220 GB to S3. We observe no backpressure in our job, and the checkpointing duration is roughly the same for each subtask. We'd love to know whether this duration is normal or not, so thanks a lot for your answers!

Cheers,
Robin
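For reference, the checkpointing setup Robin describes above roughly corresponds to a job configuration like the following sketch. The S3 bucket path and the checkpoint timeout are illustrative assumptions, not the actual values used in his job.

    import org.apache.flink.contrib.streaming.state.RocksDBStateBackend;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    // Sketch of the setup described above: RocksDB state backend with incremental
    // checkpoints written to S3, triggered once per hour. Bucket path and timeout
    // are assumptions for illustration only.
    public class CheckpointingSetup {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

            // RocksDB backend; the second argument "true" enables incremental checkpoints
            env.setStateBackend(new RocksDBStateBackend("s3://my-bucket/flink-checkpoints", true));

            // trigger a checkpoint every hour (3_600_000 ms)
            env.enableCheckpointing(3_600_000L);

            // allow long-running checkpoints before they are considered failed (assumed value)
            env.getCheckpointConfig().setCheckpointTimeout(3_600_000L);

            // ... job graph definition and env.execute() would follow here
        }
    }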