Re: Very slow recovery from Savepoint

Robert Metzger Tue, 02 Feb 2021 10:22:40 -0800

Hey Yordan,

Have you checked the log files from the processes in that cluster?
The JobManager log should give you hints about issues with the coordination
/ scheduling of the job. Could it be something unexpected, for example that
your job could not start because there were not enough TaskManagers available?
The TaskManager logs could also give you hints about potential retries etc.


What you could also do is manually sample the TaskManagers (you can access
thread dumps via the web UI) to see what they are doing.

Hope this helps!

On Thu, Jan 28, 2021 at 5:42 PM Yordan Pavlov <y.d.pav...@gmail.com> wrote:

> Hello there,
> I am trying to find the solution for a problem we are having in our Flink
> setup related to very slow recovery from a Savepoint. I have searched the
> mailing list and found a somewhat similar problem where the bottleneck was
> HD usage, but I am not seeing this in our case. Here is a description of
> our setup:
> * Flink 1.11.3
> * Running on top of Kubernetes on dedicated hardware.
> * The Flink job consists of 4 TaskManagers running on separate Kubernetes
> pods, along with a JobManager also running on a separate pod.
> * We use RocksDB state backend with incremental checkpointing.
> * The size of the savepoint I try to recover is around 35 GB
> * The file system that RocksDB uses is S3, or more precisely an S3
> emulation (MinIO), so we are not subject to EBS burst credits and so on.
>
> The time it takes for the Flink job to be operational and start consuming
> new records is around 5 hours. During that time I am not seeing any heavy
> resource usage on any of the TaskManager pods. I am attaching a
> screenshot of the resources of one of the TaskManager pods.
> <http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/file/t2957/Flink-pod-start.png>
>
> In this graph the job was started at around 14:00. There is a huge spike
> shortly after that, and then not much happens. This goes on for around
> 5 hours, after which the job starts, but it again works quite slowly.
> What would be a good way to profile where the bottleneck is? I have
> checked my network connectivity, and I am able to download the whole
> savepoint manually in several minutes. It seems like Flink is very slow
> to build its internal state, but then again the CPU is not being
> utilized. I would be grateful for any suggestions on how to proceed
> with this investigation.
>
> Regards,
> Yordan
>
