Hi Robin

First of all, did you get the state size from the web UI? If so, that number is 
the incremental checkpoint size, not the actual full size [1]. Assuming you have 
only one RocksDB instance per slot, the incremental checkpoint size for each 
RocksDB instance comes out to roughly 2011 MB (220 GB spread across 28 TMs x 4 
slots = 112 slots), which is quite large for an incremental checkpoint.

If each operator only uploads about 2011 MB to S3, then an overall end-to-end 
time of 58 minutes is far too long. Could you please check the async phase in 
the checkpoint details of all tasks? The async duration reflects the S3 write 
performance. I suspect your async time is not actually that large; the most 
common cause of a long end-to-end duration is operators receiving the checkpoint 
barrier late. Could you share the checkpoint details page from the web UI for 
further investigation?
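
If it is easier to share raw numbers than screenshots, the same per-checkpoint 
statistics can also be pulled from Flink's monitoring REST API. Below is a 
minimal sketch for dumping them, assuming Java 11+, a JobManager reachable at 
localhost:8081, and a placeholder job id; adjust both to your setup.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class CheckpointStatsDump {
    public static void main(String[] args) throws Exception {
        // Placeholders: point these at your JobManager REST address and job id.
        String restAddress = "http://localhost:8081";
        String jobId = "<job-id>";

        HttpClient client = HttpClient.newHttpClient();

        // Checkpointing statistics for the job (history of sizes and durations).
        String url = restAddress + "/jobs/" + jobId + "/checkpoints";
        HttpRequest request = HttpRequest.newBuilder().uri(URI.create(url)).build();
        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body());

        // Per-task details (including the async duration) are available under
        // /jobs/<job-id>/checkpoints/details/<checkpoint-id> once you have a
        // checkpoint id from the response above.
    }
}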


[1] https://issues.apache.org/jira/browse/FLINK-13390

Best
Yun Tang
________________________________
From: Robin Cassan <robin.cas...@contentsquare.com>
Sent: Wednesday, April 15, 2020 18:35
To: user <user@flink.apache.org>
Subject: Quick survey on checkpointing performance

Hi all,

We are currently experiencing long checkpointing times on S3 and are wondering 
how abnormal they are compared to other workloads and setups. Could some of you 
share a few stats from your running architecture so we can compare?

Here are our stats:

Architecture: 28 TMs on Kubernetes, 4 slots per TM, local NVMe SSDs (r5d.2xlarge 
instances), RocksDB state backend, incremental checkpoints on Amazon S3 
(without entropy injection), checkpoint interval of 1 hour (a rough 
configuration sketch follows the stats below)

Typical state size for one checkpoint: 220 GB

Checkpointing duration (End to End): 58 minutes
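
In code, this setup corresponds roughly to the sketch below; the bucket name is 
a placeholder and this is illustrative rather than our exact job code.

import org.apache.flink.contrib.streaming.state.RocksDBStateBackend;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointSetup {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // RocksDB state backend with incremental checkpoints enabled (second argument).
        // "s3://my-bucket/checkpoints" is a placeholder; no entropy injection is configured.
        env.setStateBackend(new RocksDBStateBackend("s3://my-bucket/checkpoints", true));

        // Checkpoint every hour.
        env.enableCheckpointing(60 * 60 * 1000L);

        // ... rest of the job definition ...
        env.execute("checkpoint-setup-example");
    }
}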

We are surprised that it takes so long to send 220 GB to S3: we observe no 
backpressure in our job, and the checkpointing duration is more or less the same 
for each subtask. We'd love to know whether this is a normal duration or not, so 
thanks a lot for your answers!

Cheers,
Robin
