Great to hear that you were able to resolve the issue!

On Thu, Feb 4, 2021 at 5:12 PM Yordan Pavlov <[email protected]> wrote:
> Thank you for your tips Robert,
> I think I narrowed down the problem to having slow hard disks. Once
> the memory runs out, RocksDB starts spilling to the disk and the
> performance degrades greatly. I moved the jobs to SSD disks and the
> performance has been better.
>
> Best regards!
>
> On Tue, 2 Feb 2021 at 20:22, Robert Metzger <[email protected]> wrote:
> >
> > Hey Yordan,
> >
> > have you checked the log files from the processes in that cluster?
> > The JobManager log should give you hints about issues with the
> > coordination / scheduling of the job. Could it be something
> > unexpected, like your job could not start because there were not
> > enough TaskManagers available?
> > The TaskManager logs could also give you hints about potential
> > retries etc.
> >
> > What you could also do is manually sample the TaskManagers (you can
> > access thread dumps via the web UI) to see what they are doing.
> >
> > Hope this helps!
> >
> > On Thu, Jan 28, 2021 at 5:42 PM Yordan Pavlov <[email protected]>
> > wrote:
> >>
> >> Hello there,
> >> I am trying to find the solution for a problem we are having in our
> >> Flink setup related to very slow recovery from a savepoint. I have
> >> searched in the mailing list and found a somewhat similar problem;
> >> the bottleneck there was the HD usage, but I am not seeing this in
> >> our case. Here is a description of our setup:
> >> * Flink 1.11.3
> >> * Running on top of Kubernetes on dedicated hardware.
> >> * The Flink job consists of 4 TaskManagers running on separate
> >> Kubernetes pods, along with a JobManager also running on a
> >> separate pod.
> >> * We use the RocksDB state backend with incremental checkpointing.
> >> * The size of the savepoint I try to recover is around 35 GB.
> >> * The file system that RocksDB uses is S3, or more precisely an S3
> >> emulation (Minio), so we are not subject to any EBS burst credits
> >> and so on.
> >>
> >> The time it takes for the Flink job to become operational and start
> >> consuming new records is around 5 hours. During that time I am not
> >> seeing any heavy resource usage on any of the TaskManager pods. I
> >> am attaching a screenshot of the resources of one of the
> >> TaskManager pods.
> >> <http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/file/t2957/Flink-pod-start.png>
> >>
> >> In this graph the job was started at around 14:00. There is a huge
> >> spike shortly after that, and then not much happens. This goes on
> >> for around 5 hours, after which the job starts, but again works
> >> quite slowly. What would be the way to profile where the bottleneck
> >> is? I have checked my network connectivity, and I am able to
> >> download the whole savepoint manually within several minutes. It
> >> seems like Flink is very slow to build its internal state, but then
> >> again the CPU is not being utilized. I would be grateful for any
> >> suggestions on how to proceed with this investigation.
> >>
> >> Regards,
> >> Yordan
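
For readers hitting the same symptom: the fix above amounts to pointing
RocksDB's local working directory at fast (SSD-backed) storage. Below is a
minimal sketch using the Flink 1.11 Java API; the S3 URI and the local path
are placeholders for your own setup, not values from this thread:

```java
import org.apache.flink.contrib.streaming.state.RocksDBStateBackend;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class SsdStateBackendExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();

        // Checkpoints/savepoints go to S3 (Minio in the setup above);
        // 'true' enables incremental checkpointing, as used in the thread.
        // The URI is a placeholder.
        RocksDBStateBackend backend =
                new RocksDBStateBackend("s3://flink/checkpoints", true);

        // Keep RocksDB's working files on a fast local (SSD) disk; this is
        // where spilling goes once memtables / block cache fill up. The
        // path is a placeholder for an SSD-backed volume mount.
        backend.setDbStoragePath("/mnt/ssd/rocksdb");

        env.setStateBackend(backend);
        // ... build and execute the job ...
    }
}
```

The same local directory can also be set cluster-wide via
`state.backend.rocksdb.localdir` in flink-conf.yaml instead of in code.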

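Robert's sampling tip can also be applied without the web UI, e.g. by
running `jstack <pid>` inside the TaskManager pod. For illustration only,
here is a rough, Flink-agnostic sketch of the same idea in plain Java
(the sample count and interval are arbitrary choices); note it dumps the
stacks of the JVM it runs in, whereas sampling a separate TaskManager JVM
would use jstack/jcmd:

```java
import java.util.Map;

// Minimal thread-sampling helper: periodically dumps the stack of every
// thread in the current JVM, so repeated samples show where time is spent.
public class ThreadSampler {
    public static void main(String[] args) throws InterruptedException {
        for (int i = 0; i < 5; i++) { // take 5 samples
            Map<Thread, StackTraceElement[]> dumps = Thread.getAllStackTraces();
            for (Map.Entry<Thread, StackTraceElement[]> e : dumps.entrySet()) {
                System.out.println("Thread: " + e.getKey().getName());
                for (StackTraceElement frame : e.getValue()) {
                    System.out.println("    at " + frame);
                }
            }
            Thread.sleep(10_000); // 10 seconds between samples
        }
    }
}
```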