Hello there,
I am trying to find a solution to a problem we are having in our Flink
setup, related to very slow recovery from a savepoint. I searched the
mailing list and found a somewhat similar problem, but the bottleneck there
was disk usage, which I am not seeing in our case. Here is a description of
our setup:
* Flink 1.11.3
* Running on top of Kubernetes on dedicated hardware.
* The Flink job consists of 4 TaskManagers running on separate Kubernetes
pods, along with a JobManager running on its own pod.
* We use RocksDB state backend with incremental checkpointing.
* The size of the savepoint I am trying to recover from is around 35 GB.
* The file system that RocksDB checkpoints and savepoints are written to
is S3, or more precisely an S3 emulation (MinIO); we are not subject to
any EBS burst credits or similar limits. The relevant configuration is
sketched below.
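
For reference, the relevant parts of our configuration look roughly like the
sketch below; the bucket, endpoint and credentials are placeholders, not our
real values:

    # state backend and checkpointing (placeholder paths, endpoint, credentials)
    state.backend: rocksdb
    state.backend.incremental: true
    state.checkpoints.dir: s3://flink/checkpoints
    state.savepoints.dir: s3://flink/savepoints
    # MinIO endpoint for the S3 filesystem plugin
    s3.endpoint: http://minio:9000
    s3.path.style.access: true
    s3.access-key: <access-key>
    s3.secret-key: <secret-key>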

It takes around 5 hours for the Flink job to become operational and start
consuming new records. During that time I am not seeing any heavy resource
usage on any of the TaskManager pods. I am attaching a screenshot of the
resource usage of one of the TaskManager pods.
<http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/file/t2957/Flink-pod-start.png>

In this graph the job was started at around 14:00. There is a huge spike
shortly after that, and then not much happens. This goes on for around 5
hours, after which the job starts, but it still runs quite slowly. What
would be the best way to profile where the bottleneck is? I have checked
the network connectivity, and I am able to download the whole savepoint
manually in a few minutes. It seems like Flink is very slow to rebuild its
internal state, yet the CPU is barely utilized. I would be grateful for any
suggestions on how to proceed with this investigation.
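
One thing I have been considering, though I am not sure it is the right
approach, is taking repeated thread dumps of a TaskManager JVM while the
restore is running, to see where the restoring tasks spend their time. A
rough sketch (taskmanager-0 is just an example pod name, and I am assuming
the JVM runs as PID 1 and that jstack is available in the image):

    # dump the TaskManager's threads every 30 seconds during the restore
    while true; do
      kubectl exec taskmanager-0 -- jstack 1 >> tm-threaddumps.txt
      sleep 30
    done

Would that be a reasonable way to narrow this down, or is there a better
way to profile the restore path?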

Regards,
Yordan
