Hello Devin,

I struggled a lot with MDS tuning on an all-NVMe-SSD, CephFS-only cluster.
To give you an idea of the hardware and sizing: we have 3 MON nodes (128 cores / 128 GB RAM) and 7 OSD nodes (48 cores / 256 GB RAM), each with 8x 7.68 TB NVMe SSDs (2 OSDs per device).

Obviously, MDS tuning and CephFS usage in general depend heavily on the workload and I/O patterns. To provide some context, we have ~150 nodes of Alps (https://www.cscs.ch/computers/alps) and ~20 nodes of a Kubernetes cluster (running on VMs) mounting CephFS and doing I/O in predictable paths and directory structures. The filesystem is used as a scratch filesystem for jobs; each job creates a number of directories in a known path, which makes directory pinning easier.

With a single MDS, the filesystem was not able to cope with the load, and using this very useful dashboard (https://grafana.com/grafana/dashboards/9340-ceph-cephfs/) I was able to watch the MDS accumulate log segments and finally crash. One crash even led to filesystem corruption that I was not able to fix, and I had to nuke the whole filesystem.

That said, after upgrading to Squid I decided to scale up to 6 active MDS daemons and to configure ephemeral pinning on the parent directory of the jobs (a minimal sketch of that setup follows after the settings below). This resulted in a decent spread of load across the MDS servers and good overall behavior of the filesystem: average reply latencies in the order of a few ms, with some spikes up to ~100 ms under heavy load, no more log segment accumulation, and stable operations.

This is our current MDS configuration:

mds_cache_memory_limit 68719476736
mds_cache_mid 0.700000
mds_cache_reservation 0.050000
mds_cache_trim_decay_rate 0.800000
mds_cache_trim_threshold 524288
mds_cap_revoke_eviction_timeout 300.000000
mds_health_cache_threshold 1.500000
mds_log_max_segments 256
mds_max_caps_per_client 50000
mds_recall_global_max_decay_threshold 131072
mds_recall_max_caps 30000
mds_recall_max_decay_rate 1.500000
mds_recall_max_decay_threshold 131072
mds_recall_warning_decay_rate 60.000000
mds_recall_warning_threshold 262144
mds_beacon_grace 120.000000
mon_mds_skip_sanity true
mds_client_delegate_inos_pct 0
mds_session_blocklist_on_evict false
mds_session_blocklist_on_timeout false
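In case you want to reproduce the scale-up and pinning steps, here is a minimal sketch; the filesystem name "cephfs" and the jobs parent directory "/scratch/jobs" are hypothetical placeholders, and I am assuming distributed ephemeral pinning here:

    # Run 6 active MDS ranks instead of 1
    ceph fs set cephfs max_mds 6

    # Distributed ephemeral pinning on the jobs parent directory (run on a
    # client mount): each immediate child directory is hashed to an MDS rank,
    # spreading the per-job subtrees across all active ranks
    setfattr -n ceph.dir.pin.distributed -v 1 /scratch/jobs

    # The MDS settings above go into the cluster config database, e.g.:
    ceph config set mds mds_cache_memory_limit 68719476736
    ceph config set mds mds_recall_max_caps 30000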
I don't consider myself an expert, but I hope this can be helpful.

Best Regards,

ELIA OGGIAN
SYSTEM ENGINEER

CSCS
Centro Svizzero di Calcolo Scientifico
Swiss National Supercomputing Centre
[email protected]
www.cscs.ch

From: Devin A. Bougie via ceph-users <[email protected]>
Date: Thursday, 15 January 2026 at 23:37
To: ceph-users <[email protected]>
Subject: [ceph-users] MDS tuning for large production cluster

We have a 19.2.3 cluster managed by cephadm, with five management nodes and 21 OSD nodes. We have roughly 300 Linux clients across several subnets doing CephFS kernel mounts and running a wide range of applications. Each management node has 760 GB of memory. We're currently using a single active MDS daemon, but have experimented with multiple MDS daemons and directory pinning.

Each OSD for the CephFS data pool uses HDD (SATA) drives, with the DB and WAL on NVMe devices internal to the storage node. The CephFS metadata pool is on NVMe drives internal to the management nodes.

After a lot of testing in attempts to quiet down persistent MDS_CLIENT_LATE_RELEASE ("clients failing to respond to capability release") and MDS_CLIENT_RECALL ("clients failing to respond to cache pressure") warnings, we've ended up with the following settings. This seems to be working well. We still get periodic MDS_CLIENT_RECALL warnings, but not nearly as many as we were seeing, and they clear relatively quickly.

I'd greatly appreciate any suggestions for further improvements, or any concerns anyone has with these.

Many thanks,
Devin

———
cephfs session_timeout 120
mds mds_cache_memory_limit 549755813888
mds mds_cache_mid 0.700000
mds mds_cache_reservation 0.100000
mds mds_cache_trim_decay_rate 0.900000
mds mds_cache_trim_threshold 524288
mds mds_cap_revoke_eviction_timeout 30.000000
mds mds_health_cache_threshold 2.000000
mds mds_log_max_segments 256
mds mds_max_caps_per_client 5000000
mds mds_recall_global_max_decay_threshold 1048576
mds mds_recall_max_caps 5000
mds mds_recall_max_decay_rate 1.500000
mds mds_recall_max_decay_threshold 524288
mds mds_recall_warning_decay_rate 120.000000
mds mds_recall_warning_threshold 262144
———
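One note on applying the settings quoted above: session_timeout is a per-filesystem setting, while the mds_* entries are daemon options, so they go through two different commands. A minimal sketch, assuming the filesystem is named "cephfs":

    # Per-filesystem setting
    ceph fs set cephfs session_timeout 120

    # MDS daemon options, applied via the cluster config database
    ceph config set mds mds_cache_memory_limit 549755813888
    ceph config set mds mds_recall_max_caps 5000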
_______________________________________________
ceph-users mailing list -- [email protected]
To unsubscribe send an email to [email protected]