[ceph-users] Re: Provide more documentation for MDS performance tuning on large file systems
> > My current settings are:
> >
> > mds advanced mds_beacon_grace 15.00
>
> This should be a global setting. It is used by the mons and mdss.

True. I might as well remove it completely; it's an artefact of earlier experiments.

> > mds basic mds_cache_memory_limit 4294967296
> > mds advanced mds_cache_trim_threshold 393216
> > global advanced mds_export_ephemeral_distributed true
> > mds advanced mds_recall_global_max_decay_threshold 393216
> > mds advanced mds_recall_max_caps 3
> > mds advanced mds_recall_max_decay_threshold 98304
> > mds advanced mds_recall_warning_threshold 196608
> > global advanced mon_compact_on_start true
> >
> > I haven't had any noticeable slowdowns or crashes in a while with 3
> > active MDS and 3 hot standbys.
>
> Thanks for sharing the settings that worked for you.

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
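[Editor's sketch, not part of the original thread: the settings above can be persisted with `ceph config set` instead of being injected at runtime. mds_beacon_grace is placed under `global` per Patrick's note; mds_recall_max_caps is left out because its value looks truncated in the archive. Review the printed commands before piping them to sh on a real cluster.]

```shell
#!/bin/sh
# Emit the "ceph config set" commands for the settings quoted above so they
# can be reviewed before applying (pipe the output to sh to apply them).
emit() { printf 'ceph config set %s %s %s\n' "$1" "$2" "$3"; }

emit global mds_beacon_grace                      15
emit mds    mds_cache_memory_limit                4294967296
emit mds    mds_cache_trim_threshold              393216
emit global mds_export_ephemeral_distributed      true
emit mds    mds_recall_global_max_decay_threshold 393216
emit mds    mds_recall_max_decay_threshold        98304
emit mds    mds_recall_warning_threshold          196608
emit global mon_compact_on_start                  true
```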
[ceph-users] Re: Provide more documentation for MDS performance tuning on large file systems
On Tue, Dec 15, 2020 at 12:50 AM Janek Bevendorff wrote:
>
> My current settings are:
>
> mds advanced mds_beacon_grace 15.00

This should be a global setting. It is used by the mons and mdss.

> mds basic mds_cache_memory_limit 4294967296
> mds advanced mds_cache_trim_threshold 393216
> global advanced mds_export_ephemeral_distributed true
> mds advanced mds_recall_global_max_decay_threshold 393216
> mds advanced mds_recall_max_caps 3
> mds advanced mds_recall_max_decay_threshold 98304
> mds advanced mds_recall_warning_threshold 196608
> global advanced mon_compact_on_start true
>
> I haven't had any noticeable slowdowns or crashes in a while with 3
> active MDS and 3 hot standbys.

Thanks for sharing the settings that worked for you.

--
Patrick Donnelly, Ph.D.
He / Him / His
Principal Software Engineer
Red Hat Sunnyvale, CA
GPG: 19F28A586F808C2402351B93C3301A3E258DD79D
[ceph-users] Re: Provide more documentation for MDS performance tuning on large file systems
My current settings are:

mds advanced mds_beacon_grace 15.00
mds basic mds_cache_memory_limit 4294967296
mds advanced mds_cache_trim_threshold 393216
global advanced mds_export_ephemeral_distributed true
mds advanced mds_recall_global_max_decay_threshold 393216
mds advanced mds_recall_max_caps 3
mds advanced mds_recall_max_decay_threshold 98304
mds advanced mds_recall_warning_threshold 196608
global advanced mon_compact_on_start true

I haven't had any noticeable slowdowns or crashes in a while with 3 active MDS and 3 hot standbys.

On 14/12/2020 22:33, Patrick Donnelly wrote:
> On Mon, Dec 7, 2020 at 12:06 PM Patrick Donnelly wrote:
>> Hi Dan & Janek,
>>
>> On Sat, Dec 5, 2020 at 6:26 AM Dan van der Ster wrote:
>>> My understanding is that the recall thresholds (see my list below)
>>> should be scaled proportionally. OTOH, I haven't played with the
>>> decay rates (and don't know if there's any significant value to
>>> tuning those).
>>
>> I haven't gone through this thread yet but I want to note for those
>> reading that we do now have documentation (thanks for the frequent
>> pokes Janek!) for the recall configurations:
>>
>> https://docs.ceph.com/en/latest/cephfs/cache-configuration/#mds-recall
>>
>> Please let us know if it's missing information or if something could
>> be more clear.
>
> I also now have a PR open for updating the defaults based on these and
> other discussions: https://github.com/ceph/ceph/pull/38574
>
> Feedback welcome.
[ceph-users] Re: Provide more documentation for MDS performance tuning on large file systems
On Mon, Dec 7, 2020 at 12:06 PM Patrick Donnelly wrote:
>
> Hi Dan & Janek,
>
> On Sat, Dec 5, 2020 at 6:26 AM Dan van der Ster wrote:
> > My understanding is that the recall thresholds (see my list below)
> > should be scaled proportionally. OTOH, I haven't played with the decay
> > rates (and don't know if there's any significant value to tuning
> > those).
>
> I haven't gone through this thread yet but I want to note for those
> reading that we do now have documentation (thanks for the frequent
> pokes Janek!) for the recall configurations:
>
> https://docs.ceph.com/en/latest/cephfs/cache-configuration/#mds-recall
>
> Please let us know if it's missing information or if something could
> be more clear.

I also now have a PR open for updating the defaults based on these and other discussions: https://github.com/ceph/ceph/pull/38574

Feedback welcome.
[ceph-users] Re: Provide more documentation for MDS performance tuning on large file systems
Hi Patrick,

> I haven't gone through this thread yet but I want to note for those
> reading that we do now have documentation (thanks for the frequent
> pokes Janek!) for the recall configurations:
>
> https://docs.ceph.com/en/latest/cephfs/cache-configuration/#mds-recall
>
> Please let us know if it's missing information or if something could
> be more clear.

The documentation has helped a great deal already, and I've been playing around with these settings quite a bit recently. What's missing, obviously, are recommended settings for individual scenarios (at least ballparks). But that is hard to come by without experimenting first (I wouldn't call our deployment massive, but it is very likely significantly above average, and I don't know what scale the developers are usually testing at). As I mentioned in the other thread, I am testing Dan's recommendations at the moment and will refine them for our purposes. The effects of individual tweaks are hard to assess without dedicated benchmarks (although "MDS not hanging up" is already somewhat of a benchmark :-)).
[ceph-users] Re: Provide more documentation for MDS performance tuning on large file systems
Hi Dan & Janek,

On Sat, Dec 5, 2020 at 6:26 AM Dan van der Ster wrote:
> My understanding is that the recall thresholds (see my list below)
> should be scaled proportionally. OTOH, I haven't played with the decay
> rates (and don't know if there's any significant value to tuning
> those).

I haven't gone through this thread yet but I want to note for those reading that we do now have documentation (thanks for the frequent pokes Janek!) for the recall configurations:

https://docs.ceph.com/en/latest/cephfs/cache-configuration/#mds-recall

Please let us know if it's missing information or if something could be more clear.
[ceph-users] Re: Provide more documentation for MDS performance tuning on large file systems
Never mind; when I enable it on a busier directory, I do see new ephemeral pins popping up, just not on the directories I set it on originally. Let's see how that holds up.

On 07/12/2020 13:04, Janek Bevendorff wrote:
> Thanks. I tried playing around a bit with mds_export_ephemeral_distributed
> just now, because it's pretty much the same thing that your script does
> manually. Unfortunately, it seems to have no effect. I pinned all
> top-level directories to mds.0 and then enabled ceph.dir.pin.distributed
> for a few sub trees. Despite mds_export_ephemeral_distributed being set
> to true, all work is done by mds.0 now and I also don't see any
> additional pins in "ceph tell mds.\* get subtrees". Any ideas why that
> might be?
>
> On 07/12/2020 10:49, Dan van der Ster wrote:
>> On Mon, Dec 7, 2020 at 10:39 AM Janek Bevendorff wrote:
>>>> What exactly do you set to 64k? We used to set
>>>> mds_max_caps_per_client to 5, but once we started using the tuned
>>>> caps recall config, we reverted that back to the default 1M without
>>>> issue.
>>>
>>> mds_max_caps_per_client. As I mentioned, some clients hit this limit
>>> regularly and they aren't entirely idle. I will keep tuning the
>>> recall settings, though.
>>>
>>>> This 15k caps client I mentioned is not related to the max caps per
>>>> client config. In recent nautilus, the MDS will proactively recall
>>>> caps from idle clients -- so a client with even just a few caps like
>>>> this can provoke the caps recall warnings (if it is buggy, like in
>>>> this case). The client doesn't cause any real problems, just the
>>>> annoying warnings.
>>>
>>> We only see the warnings during normal operation. I remember having
>>> massive issues with early Nautilus releases, but thanks to more
>>> aggressive recall behaviour in newer releases, that is fixed. Back
>>> then it was virtually impossible to keep the MDS within the bounds of
>>> its memory limit. Nowadays, the warnings only appear when the MDS is
>>> really stressed. In that situation, the whole FS performance is
>>> already degraded massively and MDSs are likely to fail and run into
>>> the rejoin loop.
>>>
>>>> Multi-active + pinning definitely increases the overall MD
>>>> throughput (once you can get the relevant inodes cached), because as
>>>> you know the MDS is single threaded and CPU bound at the limit. We
>>>> could get something like 4-5k handle_client_requests out of a single
>>>> MDS, and that really does scale horizontally as you add MDSs (and
>>>> pin).
>>>
>>> Okay, I will definitely re-evaluate options for pinning individual
>>> directories; perhaps a small script can do it.
>>
>> There is a new ephemeral pinning option in the latest releases, but we
>> didn't try it yet.
>>
>> Here's our script -- it assumes the parent dir is pinned to zero or
>> that bal is disabled:
>> https://github.com/cernceph/ceph-scripts/blob/master/tools/cephfs/cephfs-bal-shard
>>
>> Too many pins can cause problems -- we have something like 700 pins at
>> the moment and it's fine, though.
>>
>> Cheers, Dan
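[Editor's sketch, not from the thread: what "enabling it on a directory" looks like in practice. The mountpoint and directory name are assumptions; `ceph.dir.pin.distributed` is the documented CephFS vxattr, and the feature must first be enabled on the MDSs, e.g. `ceph config set mds mds_export_ephemeral_distributed true`.]

```shell
#!/bin/sh
# Distributed ephemeral pinning sketch: build the setfattr command for a
# directory under an assumed CephFS mountpoint and print it for review
# (pipe the output to sh to apply it on a real mount).
MNT=${MNT:-/mnt/cephfs}

pin_distributed() {
    # Immediate children of the given directory get hashed across MDS ranks.
    echo "setfattr -n ceph.dir.pin.distributed -v 1 $MNT/$1"
}

pin_distributed home
```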
[ceph-users] Re: Provide more documentation for MDS performance tuning on large file systems
Thanks. I tried playing around a bit with mds_export_ephemeral_distributed just now, because it's pretty much the same thing that your script does manually. Unfortunately, it seems to have no effect. I pinned all top-level directories to mds.0 and then enabled ceph.dir.pin.distributed for a few sub trees. Despite mds_export_ephemeral_distributed being set to true, all work is done by mds.0 now and I also don't see any additional pins in "ceph tell mds.\* get subtrees". Any ideas why that might be?

On 07/12/2020 10:49, Dan van der Ster wrote:
> On Mon, Dec 7, 2020 at 10:39 AM Janek Bevendorff wrote:
>>> What exactly do you set to 64k? We used to set mds_max_caps_per_client
>>> to 5, but once we started using the tuned caps recall config, we
>>> reverted that back to the default 1M without issue.
>>
>> mds_max_caps_per_client. As I mentioned, some clients hit this limit
>> regularly and they aren't entirely idle. I will keep tuning the recall
>> settings, though.
>>
>>> This 15k caps client I mentioned is not related to the max caps per
>>> client config. In recent nautilus, the MDS will proactively recall
>>> caps from idle clients -- so a client with even just a few caps like
>>> this can provoke the caps recall warnings (if it is buggy, like in
>>> this case). The client doesn't cause any real problems, just the
>>> annoying warnings.
>>
>> We only see the warnings during normal operation. I remember having
>> massive issues with early Nautilus releases, but thanks to more
>> aggressive recall behaviour in newer releases, that is fixed. Back then
>> it was virtually impossible to keep the MDS within the bounds of its
>> memory limit. Nowadays, the warnings only appear when the MDS is really
>> stressed. In that situation, the whole FS performance is already
>> degraded massively and MDSs are likely to fail and run into the rejoin
>> loop.
>>
>>> Multi-active + pinning definitely increases the overall MD throughput
>>> (once you can get the relevant inodes cached), because as you know the
>>> MDS is single threaded and CPU bound at the limit. We could get
>>> something like 4-5k handle_client_requests out of a single MDS, and
>>> that really does scale horizontally as you add MDSs (and pin).
>>
>> Okay, I will definitely re-evaluate options for pinning individual
>> directories; perhaps a small script can do it.
>
> There is a new ephemeral pinning option in the latest releases, but we
> didn't try it yet.
>
> Here's our script -- it assumes the parent dir is pinned to zero or that
> bal is disabled:
> https://github.com/cernceph/ceph-scripts/blob/master/tools/cephfs/cephfs-bal-shard
>
> Too many pins can cause problems -- we have something like 700 pins at
> the moment and it's fine, though.
>
> Cheers, Dan
[ceph-users] Re: Provide more documentation for MDS performance tuning on large file systems
On Mon, Dec 7, 2020 at 10:39 AM Janek Bevendorff wrote:
>
>> What exactly do you set to 64k? We used to set mds_max_caps_per_client
>> to 5, but once we started using the tuned caps recall config, we
>> reverted that back to the default 1M without issue.
>
> mds_max_caps_per_client. As I mentioned, some clients hit this limit
> regularly and they aren't entirely idle. I will keep tuning the recall
> settings, though.
>
>> This 15k caps client I mentioned is not related to the max caps per
>> client config. In recent nautilus, the MDS will proactively recall
>> caps from idle clients -- so a client with even just a few caps like
>> this can provoke the caps recall warnings (if it is buggy, like in
>> this case). The client doesn't cause any real problems, just the
>> annoying warnings.
>
> We only see the warnings during normal operation. I remember having
> massive issues with early Nautilus releases, but thanks to more
> aggressive recall behaviour in newer releases, that is fixed. Back then
> it was virtually impossible to keep the MDS within the bounds of its
> memory limit. Nowadays, the warnings only appear when the MDS is really
> stressed. In that situation, the whole FS performance is already
> degraded massively and MDSs are likely to fail and run into the rejoin
> loop.
>
>> Multi-active + pinning definitely increases the overall MD throughput
>> (once you can get the relevant inodes cached), because as you know the
>> MDS is single threaded and CPU bound at the limit. We could get
>> something like 4-5k handle_client_requests out of a single MDS, and
>> that really does scale horizontally as you add MDSs (and pin).
>
> Okay, I will definitely re-evaluate options for pinning individual
> directories, perhaps a small script can do it.

There is a new ephemeral pinning option in the latest releases, but we didn't try it yet.

Here's our script -- it assumes the parent dir is pinned to zero or that bal is disabled:
https://github.com/cernceph/ceph-scripts/blob/master/tools/cephfs/cephfs-bal-shard

Too many pins can cause problems -- we have something like 700 pins at the moment and it's fine, though.

Cheers, Dan
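[Editor's sketch, not from the thread: a stripped-down version of the same idea as the linked cephfs-bal-shard script. The mountpoint and rank count are assumptions; `ceph.dir.pin` is the documented static-pinning vxattr.]

```shell
#!/bin/sh
# Round-robin static pinning sketch: assign each top-level directory under
# a CephFS mountpoint to one of RANKS MDS ranks via the ceph.dir.pin
# vxattr. Prints the commands for review; pipe the output to sh to apply.
MNT=${MNT:-/mnt/cephfs}
RANKS=${RANKS:-3}

pin_plan() {
    [ -d "$1" ] || return 0   # nothing to do if the mount isn't there
    i=0
    for d in "$1"/*/; do
        echo "setfattr -n ceph.dir.pin -v $((i % RANKS)) $d"
        i=$((i + 1))
    done
}

pin_plan "$MNT"
```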
[ceph-users] Re: Provide more documentation for MDS performance tuning on large file systems
> What exactly do you set to 64k? We used to set mds_max_caps_per_client
> to 5, but once we started using the tuned caps recall config, we
> reverted that back to the default 1M without issue.

mds_max_caps_per_client. As I mentioned, some clients hit this limit regularly and they aren't entirely idle. I will keep tuning the recall settings, though.

> This 15k caps client I mentioned is not related to the max caps per
> client config. In recent nautilus, the MDS will proactively recall
> caps from idle clients -- so a client with even just a few caps like
> this can provoke the caps recall warnings (if it is buggy, like in
> this case). The client doesn't cause any real problems, just the
> annoying warnings.

We only see the warnings during normal operation. I remember having massive issues with early Nautilus releases, but thanks to more aggressive recall behaviour in newer releases, that is fixed. Back then it was virtually impossible to keep the MDS within the bounds of its memory limit. Nowadays, the warnings only appear when the MDS is really stressed. In that situation, the whole FS performance is already degraded massively and MDSs are likely to fail and run into the rejoin loop.

> Multi-active + pinning definitely increases the overall MD throughput
> (once you can get the relevant inodes cached), because as you know the
> MDS is single threaded and CPU bound at the limit. We could get
> something like 4-5k handle_client_requests out of a single MDS, and
> that really does scale horizontally as you add MDSs (and pin).

Okay, I will definitely re-evaluate options for pinning individual directories; perhaps a small script can do it.
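[Editor's sketch, not from the thread: to find the clients holding tens of thousands of caps in the first place, the MDS session list can be inspected. `ceph tell mds.<rank> session ls` prints a JSON array in which `id`, `num_caps` and `client_metadata` are real fields; the use of jq and the hostname fallback are assumptions of this sketch.]

```shell
#!/bin/sh
# List the top cap-holding CephFS clients on one MDS rank. Requires jq and
# an admin keyring on a live cluster.
FILTER='sort_by(-.num_caps)[:10][] | "\(.num_caps)\tclient.\(.id)\t\(.client_metadata.hostname // "?")"'

top_caps() {
    # Sort the session list by descending cap count and show the top 10.
    ceph tell mds."$1" session ls | jq -r "$FILTER"
}

# Usage on a live cluster: top_caps 0
```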
[ceph-users] Re: Provide more documentation for MDS performance tuning on large file systems
On Mon, Dec 7, 2020 at 9:42 AM Janek Bevendorff wrote:
>
> Thanks, Dan!
>
> I have played with many thresholds, including the decay rates. It is
> indeed very difficult to assess their effects, since our workloads
> differ widely depending on what people are working on at the moment. I
> would need to develop a proper benchmarking suite to simulate the
> different heavy workloads we have.
>
>> We currently run with all those options scaled up 6x the defaults, and
>> we almost never have caps recall warnings these days, with a couple
>> thousand cephfs clients.
>
> Under normal operation, we don't either. We had issues in the past with
> Ganesha and still do sometimes, but that's a bug in Ganesha and we don't
> really use it for anything but legacy clients anyway.

+1, we've seen that Ganesha issue; it simply won't release caps, ever, even with the latest fixes in this area.

> Usually, recall works flawlessly, unless some client suddenly starts
> doing crazy shit. We have just a few clients who regularly keep tens of
> thousands of caps open, and had I not limited the number, it would be
> hundreds of thousands. Recalling them without threatening stability is
> not trivial, and at the least it degrades the performance for everybody
> else. Any pointers here to better handling this situation are greatly
> appreciated. I will definitely try your config recommendations.
>
>> 2. A user running VSCodium, keeping 15k caps open.. the opportunistic
>> caps recall eventually starts recalling those but the (el7 kernel)
>> client won't release them. Stopping Codium seems to be the only way to
>> release.
>
> As I said, 15k is not much for us. The limits right now are 64k per
> client and a few hit that limit quite regularly. One of those clients is

What exactly do you set to 64k? We used to set mds_max_caps_per_client to 5, but once we started using the tuned caps recall config, we reverted that back to the default 1M without issue.

This 15k caps client I mentioned is not related to the max caps per client config. In recent nautilus, the MDS will proactively recall caps from idle clients -- so a client with even just a few caps like this can provoke the caps recall warnings (if it is buggy, like in this case). The client doesn't cause any real problems, just the annoying warnings. So what I'm looking for now is a way to disable proactive recalling if the num caps is below some threshold -- min_caps_per_client might do this but I haven't tested yet.

> our VPN gateway, which, technically, is not a single client, but to the
> CephFS it looks like one due to source NAT. This is certainly something
> I want to tune further, so that clients are routed directly via their
> private IP instead of being NAT'ed. The other ones are our GPU deep
> learning servers (just three of them, but they can generate astounding
> numbers of iops) and the 135-node Hadoop cluster (which is hard to
> sustain for any single machine, so we prefer to use the S3 here).
>
>> Otherwise, 4GB is normally sufficient in our env for
>> mds_cache_memory_limit (3 active MDSs), however this is highly
>> workload dependent. If several clients are actively taking 100s of
>> thousands of caps, then the 4GB MDS needs to be ultra busy recalling
>> caps and latency increases. We saw this live a couple weeks ago: a few
>> users started doing intensive rsyncs, and some other users noticed an
>> MD latency increase; it was fixed immediately just by increasing the
>> mem limit to 8GB.
>
> So you too have 3 active MDSs? Are you using directory pinning? We have
> a very deep and unbalanced directory structure, so I cannot really pin
> any top-level directory without skewing the load massively. From my
> experience, three MDSs without explicit pinning aren't much better or
> even worse than one. But perhaps you have different observations?

Yes, 3 active today, and lots of pinning thanks to our flat hierarchy. User dirs are pinned to one of three ranks randomly, as are the Manila shares. MD balancer = on creates a disaster in our env -- too much ping-pong of dirs between the MDSs, too much metadata IO needed to keep up, not to mention "nice export" bugs in the past that forced us to disable the balancer to begin with. We used to have 10 active MDSs, but that is such a pain during upgrades that we're now trying with just three. Next upgrade we'll probably leave it at one for a while to see if that suffices.

Multi-active + pinning definitely increases the overall MD throughput (once you can get the relevant inodes cached), because as you know the MDS is single threaded and CPU bound at the limit. We could get something like 4-5k handle_client_requests out of a single MDS, and that really does scale horizontally as you add MDSs (and pin).

Cheers, Dan

>> I agree some sort of tuning best practices should all be documented
>> somehow, even though it's complex and rather delicate.
>
> Indeed!
>
> Janek
[ceph-users] Re: Provide more documentation for MDS performance tuning on large file systems
Thanks, Dan!

I have played with many thresholds, including the decay rates. It is indeed very difficult to assess their effects, since our workloads differ widely depending on what people are working on at the moment. I would need to develop a proper benchmarking suite to simulate the different heavy workloads we have.

> We currently run with all those options scaled up 6x the defaults, and
> we almost never have caps recall warnings these days, with a couple
> thousand cephfs clients.

Under normal operation, we don't either. We had issues in the past with Ganesha and still do sometimes, but that's a bug in Ganesha and we don't really use it for anything but legacy clients anyway. Usually, recall works flawlessly, unless some client suddenly starts doing crazy shit. We have just a few clients who regularly keep tens of thousands of caps open, and had I not limited the number, it would be hundreds of thousands. Recalling them without threatening stability is not trivial, and at the least it degrades the performance for everybody else. Any pointers here to better handling this situation are greatly appreciated. I will definitely try your config recommendations.

> 2. A user running VSCodium, keeping 15k caps open.. the opportunistic
> caps recall eventually starts recalling those but the (el7 kernel)
> client won't release them. Stopping Codium seems to be the only way to
> release.

As I said, 15k is not much for us. The limits right now are 64k per client and a few hit that limit quite regularly. One of those clients is our VPN gateway, which, technically, is not a single client, but to the CephFS it looks like one due to source NAT. This is certainly something I want to tune further, so that clients are routed directly via their private IP instead of being NAT'ed. The other ones are our GPU deep learning servers (just three of them, but they can generate astounding numbers of iops) and the 135-node Hadoop cluster (which is hard to sustain for any single machine, so we prefer to use the S3 here).

> Otherwise, 4GB is normally sufficient in our env for
> mds_cache_memory_limit (3 active MDSs), however this is highly
> workload dependent. If several clients are actively taking 100s of
> thousands of caps, then the 4GB MDS needs to be ultra busy recalling
> caps and latency increases. We saw this live a couple weeks ago: a few
> users started doing intensive rsyncs, and some other users noticed an
> MD latency increase; it was fixed immediately just by increasing the
> mem limit to 8GB.

So you too have 3 active MDSs? Are you using directory pinning? We have a very deep and unbalanced directory structure, so I cannot really pin any top-level directory without skewing the load massively. From my experience, three MDSs without explicit pinning aren't much better or even worse than one. But perhaps you have different observations?

> I agree some sort of tuning best practices should all be documented
> somehow, even though it's complex and rather delicate.

Indeed!

Janek
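[Editor's sketch, not from the thread: the live fix in the quoted 4GB-to-8GB anecdote amounts to one runtime config change. The option name is the real one; whether to persist it with `ceph config set` or inject it transiently is a deployment choice.]

```shell
#!/bin/sh
# Compute an 8 GiB mds_cache_memory_limit and print the command to apply
# it, as in the anecdote quoted above. Pipe the output to sh to apply it
# on a real cluster.
GiB=$((1024 * 1024 * 1024))
NEW_LIMIT=$((8 * GiB))
echo "ceph config set mds mds_cache_memory_limit $NEW_LIMIT"
```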
[ceph-users] Re: Provide more documentation for MDS performance tuning on large file systems
Hi Janek,

My understanding is that the recall thresholds (see my list below) should be scaled proportionally. OTOH, I haven't played with the decay rates (and don't know if there's any significant value to tuning those).

We have a recall tuning script that we use to deploy different factors whenever there are caps recall issues:

    X=$1
    echo "Scaling MDS Recall by ${X}x"
    ceph tell mds.* injectargs -- \
        --mds_recall_max_decay_threshold $((X*16*1024)) \
        --mds_recall_max_caps $((X*5000)) \
        --mds_recall_global_max_decay_threshold $((X*64*1024)) \
        --mds_recall_warning_threshold $((X*32*1024)) \
        --mds_cache_trim_threshold $((X*64*1024))

We currently run with all those options scaled up 6x the defaults, and we almost never have caps recall warnings these days, with a couple thousand cephfs clients.

In the past month I've seen 2 different cases of a client not releasing caps even with these options:

1. A user had ceph-fuse mounted /cephfs/ on top of a 2nd ceph-fuse /cephfs. The outer (i.e. lower) mountpoint/process had several thousand caps that could never be released until the user cleaned up their mounts.

2. A user running VSCodium, keeping 15k caps open.. the opportunistic caps recall eventually starts recalling those but the (el7 kernel) client won't release them. Stopping Codium seems to be the only way to release.

Otherwise, 4GB is normally sufficient in our env for mds_cache_memory_limit (3 active MDSs), however this is highly workload dependent. If several clients are actively taking 100s of thousands of caps, then the 4GB MDS needs to be ultra busy recalling caps and latency increases. We saw this live a couple weeks ago: a few users started doing intensive rsyncs, and some other users noticed an MD latency increase; it was fixed immediately just by increasing the mem limit to 8GB.

I agree some sort of tuning best practices should all be documented somehow, even though it's complex and rather delicate.
--
Dan

On Sat, Jan 25, 2020 at 5:54 PM Janek Bevendorff wrote:
>
> Hello,
>
> Over the last week I have tried optimising the performance of our MDS
> nodes for the large amount of files and concurrent clients we have. It
> turns out that despite various stability fixes in recent releases, the
> default configuration still doesn't appear to be optimal for keeping the
> cache size under control and avoiding intermittent I/O blocks.
>
> Unfortunately, it is very hard to tweak the configuration to something
> that works, because the tuning parameters needed are largely
> undocumented or only described in very technical terms in the source
> code, making them quite unapproachable for administrators not familiar
> with all the CephFS internals. I would therefore like to ask if it were
> possible to document the "advanced" MDS settings more clearly as to what
> they do and in what direction they have to be tuned for more or less
> aggressive cap recall, for instance (sometimes it is not clear if a
> threshold is a min or a max threshold).
>
> I am in the very (un)fortunate situation to have folders with several
> 100K direct sub folders or files (and one extreme case with almost 7
> million dentries), which is a pretty good benchmark for measuring cap
> growth while performing operations on them. For the time being, I came
> up with this configuration, which seems to work for me, but is still far
> from optimal:
>
> mds basic    mds_cache_memory_limit     10737418240
> mds advanced mds_cache_trim_threshold   131072
> mds advanced mds_max_caps_per_client    50
> mds advanced mds_recall_max_caps        17408
> mds advanced mds_recall_max_decay_rate  2.00
>
> The parameters I am least sure about---because I understand the least
> how they actually work---are mds_cache_trim_threshold and
> mds_recall_max_decay_rate. Despite reading the description in
> src/common/options.cc, I understand only half of what they're doing and
> I am also not quite sure in which direction to tune them for optimal
> results.
>
> Another point where I am struggling is the correct configuration of
> mds_recall_max_caps. The default of 5K doesn't work too well for me, but
> values above 20K also don't seem to be a good choice. While high values
> result in fewer blocked ops and better performance without destabilising
> the MDS, they also lead to slow but unbounded cache growth, which seems
> counter-intuitive. 17K was the maximum I could go. Higher values work
> for most use cases, but when listing very large folders with millions of
> dentries, the MDS cache size slowly starts to exceed the limit after a
> few hours, since the MDSs are failing to keep clients below
> mds_max_caps_per_client despite not showing any "failing to respond to
> cache pressure" warnings.
>
> With the configuration above, I do not have cache size issues any more,
> but it comes at the cost of performance and slow/blocked ops. A few
> hints as to how I could optimise my