Hello Devin,

I struggled a lot with MDS tuning on an all-NVMe-SSD, CephFS-only cluster.

To give you an idea of the HW and sizing: we have 3 MON nodes (128 cores / 
128GB RAM) and 7 OSD nodes (48 cores / 256GB RAM), each with 8x 7.68TB NVMe 
SSDs (2 OSDs per device).

Obviously, MDS tuning and CephFS usage in general depend heavily on the 
workload and I/O patterns hitting the filesystem.

To provide some context, we have ~150 nodes of Alps 
(https://www.cscs.ch/computers/alps) and ~20 nodes of a Kubernetes cluster 
(running on VMs) mounting CephFS and doing I/O in predictable paths and 
directory structures. The filesystem is used as a scratch file system for 
jobs; each job creates a number of directories under a known path, which makes 
directory pinning easier.
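
For reference, explicitly pinning a job directory to a specific MDS rank is 
just an extended attribute on the directory; the mount point and path below 
are placeholders for illustration, not our actual layout:

  # Pin this subtree to MDS rank 1; a value of -1 removes the pin
  # and hands the subtree back to the balancer.
  setfattr -n ceph.dir.pin -v 1 /mnt/cephfs/scratch/job-12345

  # Check which rank the directory is pinned to.
  getfattr -n ceph.dir.pin /mnt/cephfs/scratch/job-12345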

A single MDS was not able to cope with the load, and using this very useful 
dashboard (https://grafana.com/grafana/dashboards/9340-ceph-cephfs/) I could 
watch the MDS accumulate log segments and eventually crash. One crash even 
led to FS corruption that I was not able to repair, so I had to nuke the whole 
filesystem.

That said, after upgrading to Squid I decided to scale up to 6 active MDS 
daemons and to configure ephemeral pinning on the parent directory of the 
jobs. This resulted in a decent spread of load across the MDS servers and good 
overall behavior of the filesystem: average reply latencies on the order of a 
few ms with spikes up to ~100ms under heavy load, no more log segment 
accumulation, and stable operation.
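
In case it helps anyone reproducing this, the scale-up and ephemeral pinning 
boil down to two commands; the filesystem name and the jobs parent directory 
below are placeholders, not our real values:

  # Allow 6 active MDS ranks on the filesystem.
  ceph fs set cephfs max_mds 6

  # Enable distributed ephemeral pinning on the parent directory of the job
  # directories, so each immediate child subtree is hashed to an MDS rank.
  setfattr -n ceph.dir.pin.distributed -v 1 /mnt/cephfs/scratch

With distributed ephemeral pins there is no per-job pin to manage; the MDS 
hashes each child directory to a rank automatically.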

This is our current MDS configuration:

mds_cache_memory_limit                      68719476736
mds_cache_mid                               0.700000
mds_cache_reservation                       0.050000
mds_cache_trim_decay_rate                   0.800000
mds_cache_trim_threshold                    524288
mds_cap_revoke_eviction_timeout             300.000000
mds_health_cache_threshold                  1.500000
mds_log_max_segments                        256
mds_max_caps_per_client                     50000
mds_recall_global_max_decay_threshold       131072
mds_recall_max_caps                         30000
mds_recall_max_decay_rate                   1.500000
mds_recall_max_decay_threshold              131072
mds_recall_warning_decay_rate               60.000000
mds_recall_warning_threshold                262144

mds_beacon_grace                            120.000000
mon_mds_skip_sanity                         true
mds_client_delegate_inos_pct                0
mds_session_blocklist_on_evict              false
mds_session_blocklist_on_timeout            false
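
For anyone wanting to replicate these, they can be set through the central 
config store; a minimal sketch of the pattern, using a few of the values from 
the list above:

  # Set an option for all MDS daemons in the cluster.
  ceph config set mds mds_cache_memory_limit 68719476736
  ceph config set mds mds_recall_max_caps 30000
  ceph config set mds mds_session_blocklist_on_evict false

  # Verify what is actually in effect.
  ceph config get mds mds_cache_memory_limit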


I don’t consider myself an expert, but I hope this can be helpful.

Best Regards,


ELIA OGGIAN
SYSTEM ENGINEER

CSCS
Centro Svizzero di Calcolo Scientifico
Swiss National Supercomputing Centre
[email protected]
www.cscs.ch



From: Devin A. Bougie via ceph-users <[email protected]>
Date: Thursday, 15 January 2026 at 23:37
To: ceph-users <[email protected]>
Subject: [ceph-users] MDS tuning for large production cluster

We have a 19.2.3 cluster managed by cephadm with five management nodes and 21 
OSD nodes.  We have roughly 300 linux clients across several subnets doing 
cephfs kernel mounts running a wide range of applications.

Each management node has 760GB of memory.  We’re currently using a single 
active MDS daemon, but have experimented with multiple MDS daemons and 
directory pinning.

Each OSD for the cephfs data pool uses hdd (SATA) drives, with the DB and WAL 
on nvme's internal to the storage node.  The cephfs metadata pool is on nvme 
drives internal to the management nodes.

After a lot of testing in attempts to quiet down persistent 
MDS_CLIENT_LATE_RELEASE "clients failing to respond to capability release” and 
MDS_CLIENT_RECALL "clients failing to respond to cache pressure” warnings, 
we’ve ended up with the following settings.  This seems to be working well.  We 
still get periodic MDS_CLIENT_RECALL warnings, but not nearly as many as we 
were seeing and they clear relatively quickly.

I’d greatly appreciate any suggestions for further improvements, or any 
concerns anyone has with these.

Many thanks,
Devin

———
cephfs  session_timeout                         120
mds     mds_cache_memory_limit                  549755813888
mds     mds_cache_mid                           0.700000
mds     mds_cache_reservation                   0.100000
mds     mds_cache_trim_decay_rate               0.900000
mds     mds_cache_trim_threshold                524288
mds     mds_cap_revoke_eviction_timeout         30.000000
mds     mds_health_cache_threshold              2.000000
mds     mds_log_max_segments                    256
mds     mds_max_caps_per_client                 5000000
mds     mds_recall_global_max_decay_threshold   1048576
mds     mds_recall_max_caps                     5000
mds     mds_recall_max_decay_rate               1.500000
mds     mds_recall_max_decay_threshold          524288
mds     mds_recall_warning_decay_rate           120.000000
mds     mds_recall_warning_threshold            262144
———
_______________________________________________
ceph-users mailing list -- [email protected]
To unsubscribe send an email to [email protected]