Great to hear that you were able to resolve the issue!

On Thu, Feb 4, 2021 at 5:12 PM Yordan Pavlov <[email protected]> wrote:
> Thank you for your tips Robert,
> I think I narrowed down the problem to having slow hard disks. Once
> the memory runs out, RocksDB starts spilling to the disk and the
> performance degrades greatly. I moved the jobs to SSD disks and the
> performance has been better.
>
> Best regards!
>
> On Tue, 2 Feb 2021 at 20:22, Robert Metzger <[email protected]> wrote:
> >
> > Hey Yordan,
> >
> > have you checked the log files from the processes in that cluster?
> > The JobManager log should give you hints about issues with the
> > coordination / scheduling of the job. Could it be something
> > unexpected, like your job could not start because there were not
> > enough TaskManagers available?
> > The TaskManager logs could also give you hints about potential
> > retries etc.
> >
> > What you could also do is manually sample the TaskManagers (you can
> > access thread dumps via the web UI) to see what they are doing.
> >
> > Hope this helps!
> >
> > On Thu, Jan 28, 2021 at 5:42 PM Yordan Pavlov <[email protected]>
> > wrote:
> >>
> >> Hello there,
> >> I am trying to find the solution for a problem we are having in our
> >> Flink setup related to very slow recovery from a savepoint. I have
> >> searched in the mailing list and found a somewhat similar problem;
> >> the bottleneck there was the HD usage, but I am not seeing this in
> >> our case. Here is a description of our setup:
> >> * Flink 1.11.3
> >> * Running on top of Kubernetes on dedicated hardware.
> >> * The Flink job consists of 4 TaskManagers running on separate
> >> Kubernetes pods, along with a JobManager also running on a
> >> separate pod.
> >> * We use the RocksDB state backend with incremental checkpointing.
> >> * The size of the savepoint I try to recover is around 35 GB.
> >> * The file system that RocksDB uses is S3, or more precisely an S3
> >> emulation (Minio), so we are not subject to any EBS burst credits
> >> and so on.
> >>
> >> The time it takes for the Flink job to become operational and start
> >> consuming new records is around 5 hours. During that time I am not
> >> seeing any heavy resource usage on any of the TaskManager pods. I
> >> am attaching a screenshot of the resources of one of the
> >> TaskManager pods.
> >> <http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/file/t2957/Flink-pod-start.png>
> >>
> >> In this graph the job was started at around 14:00. There is a huge
> >> spike shortly after that, and then not much happens. This goes on
> >> for around 5 hours, after which the job starts, but again works
> >> quite slowly. What would be the way to profile where the bottleneck
> >> is? I have checked my network connectivity, and I am able to
> >> download the whole savepoint manually within several minutes. It
> >> seems like Flink is very slow to build its internal state, but then
> >> again the CPU is not being utilized. I would be grateful for any
> >> suggestions on how to proceed with this investigation.
> >>
> >> Regards,
> >> Yordan
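
For readers hitting the same symptom: the fix above amounts to pointing
RocksDB's local working directory at fast (SSD-backed) storage. Below is a
minimal sketch using the Flink 1.11 Java API; the S3 URI and the local path
are placeholders for your own setup, not values from this thread:

```java
import org.apache.flink.contrib.streaming.state.RocksDBStateBackend;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class SsdStateBackendExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();

        // Checkpoints/savepoints go to S3 (Minio in the setup above);
        // 'true' enables incremental checkpointing, as used in the thread.
        // The URI is a placeholder.
        RocksDBStateBackend backend =
                new RocksDBStateBackend("s3://flink/checkpoints", true);

        // Keep RocksDB's working files on a fast local (SSD) disk; this is
        // where spilling goes once memtables / block cache fill up. The
        // path is a placeholder for an SSD-backed volume mount.
        backend.setDbStoragePath("/mnt/ssd/rocksdb");

        env.setStateBackend(backend);
        // ... build and execute the job ...
    }
}
```

The same local directory can also be set cluster-wide via
`state.backend.rocksdb.localdir` in flink-conf.yaml instead of in code.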

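Robert's sampling tip can also be applied without the web UI, e.g. by
running `jstack <pid>` inside the TaskManager pod. For illustration only,
here is a rough, Flink-agnostic sketch of the same idea in plain Java
(the sample count and interval are arbitrary choices); note it dumps the
stacks of the JVM it runs in, whereas sampling a separate TaskManager JVM
would use jstack/jcmd:

```java
import java.util.Map;

// Minimal thread-sampling helper: periodically dumps the stack of every
// thread in the current JVM, so repeated samples show where time is spent.
public class ThreadSampler {
    public static void main(String[] args) throws InterruptedException {
        for (int i = 0; i < 5; i++) { // take 5 samples
            Map<Thread, StackTraceElement[]> dumps = Thread.getAllStackTraces();
            for (Map.Entry<Thread, StackTraceElement[]> e : dumps.entrySet()) {
                System.out.println("Thread: " + e.getKey().getName());
                for (StackTraceElement frame : e.getValue()) {
                    System.out.println("    at " + frame);
                }
            }
            Thread.sleep(10_000); // 10 seconds between samples
        }
    }
}
```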