Hello,

We have two computing clusters: a larger one with old machines running CentOS 7, and a new test cluster consisting of three 512-thread AMD EPYC systems running AlmaLinux 9 (kernel 5.14.0-570.42.2). Both clusters use CephFS 17.2.5 as their shared file system.
Everything is fine when metadata-intensive jobs are spread out among the compute nodes of the older cluster (no more than one or two dozen such jobs per system). However, when a large number of metadata-intensive jobs run on our big 512-thread systems, Ceph performance essentially collapses (at least a 60x slowdown compared to the old cluster), with processes spending most of their time in the kernel instead of doing their work. Under load, the EPYC compute nodes have plenty of spare memory, the NICs are barely utilized, and average CPU utilization is 50-80%, almost all of it in sys. Our CephFS metadata pool is on SSDs. It looks like local lock contention is the bottleneck rather than the MDSes (which are busy-ish, but not obviously overloaded).

I have attached two flamegraphs illustrating the situation:

* slowpath.svg shows what happens when a lot (200-500) of processes are frequently calling the stat system call on the same files (a rough reproducer sketch for this case is appended below the signature).
* osq_lock.svg shows what happens when, I think, a lot of processes are either listing a directory or accessing files in a directory with about 1000 files.

We have also observed the MDSes frequently evicting the big EPYC machines, but I haven't figured out yet which workloads trigger that.

Does anybody have any advice on how to improve CephFS performance/stability on large systems?

Thanks in advance,
Vlad
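P.S. In case it helps anyone reproduce the slowpath.svg case, here is a rough sketch of that kind of workload: many processes stat()ing the same small set of files in a tight loop on a CephFS mount. The mount path, file count, and process count below are placeholders for illustration, not our actual job parameters.

#!/usr/bin/env python3
# Rough reproducer sketch: many processes hammering stat() on the same
# files on a CephFS mount. Paths and counts are placeholders.
import os
import multiprocessing
import time

MOUNT = "/mnt/cephfs/stat-test"      # placeholder CephFS directory
FILES = [os.path.join(MOUNT, f"file{i}") for i in range(16)]
NPROC = 400                          # roughly the 200-500 range mentioned above
DURATION = 60                        # seconds to run

def worker(stop_at):
    # Each worker stats the same files in a loop until time runs out.
    n = 0
    while time.time() < stop_at:
        for path in FILES:
            os.stat(path)
            n += 1
    return n

if __name__ == "__main__":
    # Create the target files once so every worker hits the same inodes.
    os.makedirs(MOUNT, exist_ok=True)
    for path in FILES:
        open(path, "a").close()

    stop_at = time.time() + DURATION
    with multiprocessing.Pool(NPROC) as pool:
        counts = pool.map(worker, [stop_at] * NPROC)
    print(f"total stat() calls: {sum(counts)} ({sum(counts) / DURATION:.0f}/s)")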
