Hi,

thanks for the update - indeed, now the IBM recommendation makes more
sense to me. For example, with 100+ file shares, giving each share its
own volume would require 200+ pools and more than 100 MDS daemons. That
might be overkill, especially if some of the file shares are not
frequently used.

Three (final?) questions which I think are largely independent of the
specific use case:
- Just to understand it correctly: you mention that two MDS are
  required per fs. Is that really required? I would think that standby
  MDS do not need to be tied to a specific fs volume; they can simply
  wait and take over for whichever active MDS fails. So for, say, 100
  volumes you might only need to create e.g. 10 standby MDS (the exact
  number and placement would have to match the failure domain),
  correct? (See the first sketch after this list for what I have in
  mind.)

- When it comes to fs failures, e.g. during rolling upgrades (with the
  orchestrator, where max_mds is temporarily set to 1, as you mention
  below) or other issues, wouldn't it be good advice not to put all
  file shares as subvolumes into one single CephFS [single point of
  failure], but rather to go for at least a hybrid setup with a few
  volumes (each containing the subvolumes for a group of shares), so
  that not everything is lost if one fs fails? Also, rolling upgrades
  (with the orchestrator) might be easier to complete successfully if
  each volume is not too large (so that max_mds=1 per volume is doable
  without running into severe bottlenecks)? (See the second sketch
  below.)

- Finally, when using one CephFS with subvolumes, would it make sense
  to still create a data pool per subvolume in order to limit data
  loss in case of a failure? The IBM recommendation to aim for a
  single fs seems to blindly assume that data on CephFS is safe. (See
  the third sketch below.)
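To make the first question concrete, here is roughly the setup I have
in mind (just a sketch - the service id "sharedmds", the host names
and the fs name "projects-a" are made up, and I am not sure how well
the orchestrator supports a shared MDS service instead of one service
per file system):

  # deploy a shared pool of MDS daemons rather than one service per fs
  ceph orch apply mds sharedmds --placement="10 host1 host2 host3"

  # per file system, only declare how many standbys should be around;
  # as far as I understand this mainly drives the "insufficient
  # standby" health warning, while the failover itself can be served
  # by any standby that is not bound to another fs
  ceph fs set projects-a standby_count_wanted 1

Or does cephadm effectively bind its MDS daemons to one file system
(via mds_join_fs), so that a shared standby pool is not really an
option with the orchestrator?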
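For the second question, the hybrid layout would look roughly like
this (again only a sketch; volume, share and mount point names are
invented):

  # a handful of volumes instead of one per share or one for everything
  ceph fs volume create projects-a
  ceph fs volume create projects-b

  # shares become subvolumes inside those volumes
  ceph fs subvolume create projects-a share01
  ceph fs subvolume create projects-a share02
  ceph fs subvolume create projects-b share03

  # inside a volume the load can still be spread across MDS ranks by
  # pinning a subvolume's directory, as you described
  ceph fs set projects-a max_mds 2
  ceph fs subvolume getpath projects-a share02
  # pin the path printed above (volume assumed mounted at /mnt/projects-a)
  setfattr -n ceph.dir.pin -v 1 /mnt/projects-a/<path printed above>

That way an fs failure (or a painful max_mds=1 phase during an
upgrade) would only affect the shares in one volume instead of all of
them.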
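And for the third question, a per-share data pool would be something
like this (sketch only; pool, fs and share names are examples, and an
EC data pool would need the usual extra settings like
allow_ec_overwrites):

  # dedicated data pool for one share, added to the existing fs
  ceph osd pool create cephfs.projects.share01.data
  ceph fs add_data_pool projects cephfs.projects.share01.data

  # place the subvolume on that pool when creating it
  ceph fs subvolume create projects share01 \
      --pool_layout cephfs.projects.share01.data

I realize this partly brings back the "many pools" issue from above,
which is why I am unsure whether it is worth it.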
Thanks a lot,

Sophonet


"Eugen Block" [email protected] – 24. September 2025 10:50

> Multiple filesystems (or volumes) can be the right choice, it really
> depends. But you need to be aware that for each CephFS you need (at
> least) two pools, plus one standby daemon for each active daemon.
> While for a single FS (multi-active) it could be sufficient to have
> one or two standby daemons in total, because they automatically take
> over the failed rank. As an example: if you have 8 filesystems, that
> means you need at least 16 pools (maybe more if you want to use EC),
> which can be limited by the number of OSDs you have available. Then
> you also need 16 MDS daemons (one active, one standby for each FS).
> In a single-FS scenario with 8 active MDS daemons it could be
> sufficient to have 9 or 10 daemons in total, and you need fewer
> pools.
>
> If your setup is rather static and you don't have to create a new FS
> every other week, and the total number of filesystems stays the
> same, it might be the better approach for you.
>
> So I can't really recommend anything, you'll need to figure out
> which scenario you need to cover. But Ceph is quite flexible, so you
> can just start at one point and then develop from there.
>
> Quoting Sophonet <[email protected]>:
>
> > Hi,
> >
> > thanks for the information - it seems that with pinning of
> > subvolumes/directories you can distribute the load to different
> > MDS. But in that case, what would be the difference to setting up
> > different top-level volumes and attaching them to different MDS?
> > What I am not clear about is whether setting up one fs volume and
> > pinning subvolumes to different MDS is basically equivalent to
> > using multiple fs volumes and attaching them to different MDS.
> > Quotas/auth caps etc. can be set for volumes as well as subvolumes.
> >
> > The only recommendation I have found on [0] says
> >
> > "...it is recommended to consolidate file system workloads onto a
> > single CephFS file system, when possible. Consolidate the
> > workloads to avoid over-allocating resources to MDS servers that
> > can be underutilized."
> >
> > Is there a workload difference when using multiple fs volumes vs.
> > a single one with subvolumes? Intuitively I would think that
> > multiple fs volumes might provide some more error resilience in
> > case of failures - in which case only one fs (of several) would
> > fail instead of the whole cluster (if there is just a single
> > volume and subvolumes are used).
> >
> > Any insights? Thanks,
> >
> > Sophonet
> >
> > [0]
> > https://www.ibm.com/docs/en/storage-ceph/8.1.0?topic=systems-cephfs-volumes-subvolumes-subvolume-groups
> >
> >> On 23.09.2025 at 15:45, Eugen Block <[email protected]> wrote:
> >>
> >> Hi,
> >>
> >> with multiple active MDS daemons you can use pinning. This allows
> >> you to pin specific directories (or subvolumes) to a specific
> >> rank to spread the load. You can find the relevant docs here [0].
> >>
> >> Note that during an upgrade, max_mds is reduced to 1
> >> (automatically if you use the orchestrator), which can have a
> >> significant impact because all the load previously spread across
> >> multiple daemons is now shuffled onto a single node. This can
> >> crash a file system, just so you're aware.
> >>
> >> So there are several options: two or three "fat" MDS nodes in
> >> active/standby mode which can handle all the load. Or more "fat"
> >> nodes which could handle all the load during an upgrade,
> >> spreading the load again after the upgrade is finished. Or
> >> multiple "not so fat" nodes to spread the workload, but with a
> >> higher risk of an issue during an upgrade.
> >>
> >> Regards,
> >> Eugen
> >>
> >> [0] https://docs.ceph.com/en/latest/cephfs/multimds/
> >> [1]
> >> https://docs.ceph.com/en/latest/cephfs/upgrading/#upgrading-the-mds-cluster
> >>
> >> Quoting Sophonet <[email protected]>:
> >>
> >>> Hi list,
> >>>
> >>> for multiple project-level file shares (with individual access
> >>> rights) I am planning to use CephFS.
> >>>
> >>> Technically this can be implemented either with multiple
> >>> top-level CephFS filesystems or with a single CephFS in the
> >>> cluster and subvolumes.
> >>>
> >>> What is the preferred choice? I have not found any guidance in
> >>> http://docs.ceph.com. The only location that suggests using
> >>> subvolumes is
> >>> https://www.ibm.com/docs/en/storage-ceph/8.1.0?topic=systems-cephfs-volumes-subvolumes-subvolume-groups.
> >>> However, how can I avoid that only one MDS is responsible for
> >>> serving all subvolumes? Is there some current literature (books
> >>> or web docs) that contains recommendations and examples? A
> >>> couple of Ceph-related books are available in well-known online
> >>> book stores, but many of them are rather old (6 years or even
> >>> more).
> >>>
> >>> Thanks a lot,
> >>>
> >>> Sophonet

_______________________________________________
ceph-users mailing list -- [email protected]
To unsubscribe send an email to [email protected]
