Hello

We have two computing clusters: a larger one with old machines running
CentOS 7, and a new test cluster consisting of three 512-thread AMD EPYC
systems running AlmaLinux 9 (kernel 5.14.0-570.42.2). Both clusters use
CephFS 17.2.5 as their shared file system.

Everything is fine when metadata-intensive jobs are spread out among the
compute nodes of the older cluster (no more than one or two dozen such
jobs per system). However, when a large number of metadata-intensive jobs
run on our big 512-thread systems, CephFS performance essentially
collapses (at least a 60x slowdown compared to the old cluster), with
processes spending most of their time in the kernel instead of doing
their work.

Under load, the EPYC compute nodes have plenty of spare memory; the NICs
are barely utilized; average CPU utilization is 50-80%, almost all of it
in system time. Our CephFS metadata pool is on SSDs. It looks like local
lock contention is the bottleneck rather than the MDSes (which are
busy-ish, but not obviously overloaded). I have attached two flamegraphs
illustrating the situation.

* slowpath.svg shows what happens when a large number (200-500) of
processes frequently call the stat system call on the same files.
* osq_lock.svg shows what happens when, I think, a lot of processes are
either listing a directory or accessing files in a directory containing
about 1000 files. A rough sketch of both access patterns follows this
list.
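For context, the two patterns boil down to roughly the following Python
sketch. The mount point, file names, and process/iteration counts are
placeholders, not our actual job scripts, and it assumes the target files
already exist on the CephFS mount:

    #!/usr/bin/env python3
    # Rough sketch of the two problematic access patterns
    # (hypothetical paths and counts, not our real jobs).
    import os
    import multiprocessing as mp

    CEPHFS_DIR = "/mnt/cephfs/testdir"   # hypothetical CephFS directory
    N_PROCS = 300                        # 200-500 in our case
    N_CALLS = 100_000

    def stat_storm(_):
        # slowpath.svg scenario: many processes stat()ing the same file
        target = os.path.join(CEPHFS_DIR, "shared_input.dat")
        for _ in range(N_CALLS):
            os.stat(target)

    def listdir_storm(_):
        # osq_lock.svg scenario: many processes listing and touching a
        # directory that holds ~1000 files
        for _ in range(N_CALLS // 100):
            for name in os.listdir(CEPHFS_DIR):
                os.stat(os.path.join(CEPHFS_DIR, name))

    if __name__ == "__main__":
        with mp.Pool(N_PROCS) as pool:
            pool.map(stat_storm, range(N_PROCS))
            pool.map(listdir_storm, range(N_PROCS))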

We have also observed the MDSes frequently evicting the big EPYC machines,
but I haven't figured out which workloads trigger that yet.

Does anybody have any advice on how to improve cephfs performance/stability
on large systems?



Thanks in advance

Vlad
