[ceph-users] Re: Provide more documentation for MDS performance tuning on large file systems

2020-12-15 Thread Janek Bevendorff




>> My current settings are:
>>
>> mds   advanced  mds_beacon_grace 15.00
>
> This should be a global setting. It is used by the mons and mdss.

True. I might as well remove it completely, it's an artefact of earlier
experiments.

>> mds   basic mds_cache_memory_limit 4294967296
>> mds   advanced  mds_cache_trim_threshold 393216
>> global advanced  mds_export_ephemeral_distributed true
>> mds   advanced  mds_recall_global_max_decay_threshold 393216
>> mds   advanced  mds_recall_max_caps 3
>> mds   advanced  mds_recall_max_decay_threshold 98304
>> mds   advanced  mds_recall_warning_threshold 196608
>> global advanced  mon_compact_on_start true
>>
>> I haven't had any noticeable slow downs or crashes in a while with 3
>> active MDS and 3 hot standbys.
>
> Thanks for sharing the settings that worked for you.




[ceph-users] Re: Provide more documentation for MDS performance tuning on large file systems

2020-12-15 Thread Patrick Donnelly
On Tue, Dec 15, 2020 at 12:50 AM Janek Bevendorff
 wrote:
>
> My current settings are:
>
> mds   advanced  mds_beacon_grace 15.00

This should be a global setting. It is used by the mons and mdss.
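
In practice that means moving the option, roughly (the value is left as a
placeholder here):

ceph config rm mds mds_beacon_grace
ceph config set global mds_beacon_grace <value>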

> mds   basic mds_cache_memory_limit 4294967296
> mds   advanced  mds_cache_trim_threshold 393216
> global advanced  mds_export_ephemeral_distributed true
> mds   advanced  mds_recall_global_max_decay_threshold 393216
> mds   advanced  mds_recall_max_caps 3
> mds   advanced  mds_recall_max_decay_threshold 98304
> mds   advanced  mds_recall_warning_threshold 196608
> global advanced  mon_compact_on_start true
>
> I haven't had any noticeable slow downs or crashes in a while with 3
> active MDS and 3 hot standbys.

Thanks for sharing the settings that worked for you.

-- 
Patrick Donnelly, Ph.D.
He / Him / His
Principal Software Engineer
Red Hat Sunnyvale, CA
GPG: 19F28A586F808C2402351B93C3301A3E258DD79D


[ceph-users] Re: Provide more documentation for MDS performance tuning on large file systems

2020-12-15 Thread Janek Bevendorff

My current settings are:

mds   advanced  mds_beacon_grace 15.00
mds   basic mds_cache_memory_limit 4294967296
mds   advanced  mds_cache_trim_threshold 393216
global advanced  mds_export_ephemeral_distributed true
mds   advanced  mds_recall_global_max_decay_threshold 393216
mds   advanced  mds_recall_max_caps 3
mds   advanced  mds_recall_max_decay_threshold 98304
mds   advanced  mds_recall_warning_threshold 196608
global advanced  mon_compact_on_start true

I haven't had any noticeable slow downs or crashes in a while with 3 
active MDS and 3 hot standbys.
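
For anyone wanting to reproduce this, the same options can be applied through
the centralized config store; a minimal sketch using a few of the values listed
above (adjust for your own cluster):

ceph config set mds mds_cache_memory_limit 4294967296
ceph config set mds mds_cache_trim_threshold 393216
ceph config set mds mds_recall_global_max_decay_threshold 393216
ceph config set mds mds_recall_max_decay_threshold 98304
ceph config set mds mds_recall_warning_threshold 196608
ceph config set global mds_export_ephemeral_distributed true
ceph config set global mon_compact_on_start true

# Confirm what a running MDS actually ended up with
ceph config show mds.<name> | grep -E 'recall|cache'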



On 14/12/2020 22:33, Patrick Donnelly wrote:

On Mon, Dec 7, 2020 at 12:06 PM Patrick Donnelly  wrote:

Hi Dan & Janek,

On Sat, Dec 5, 2020 at 6:26 AM Dan van der Ster  wrote:

My understanding is that the recall thresholds (see my list below)
should be scaled proportionally. OTOH, I haven't played with the decay
rates (and don't know if there's any significant value to tuning
those).

I haven't gone through this thread yet but I want to note for those
reading that we do now have documentation (thanks for the frequent
pokes Janek!) for the recall configurations:

https://docs.ceph.com/en/latest/cephfs/cache-configuration/#mds-recall

Please let us know if it's missing information or if something could
be more clear.

I also now have a PR open for updating the defaults based on these and
other discussions: https://github.com/ceph/ceph/pull/38574

Feedback welcome.




[ceph-users] Re: Provide more documentation for MDS performance tuning on large file systems

2020-12-14 Thread Patrick Donnelly
On Mon, Dec 7, 2020 at 12:06 PM Patrick Donnelly  wrote:
>
> Hi Dan & Janek,
>
> On Sat, Dec 5, 2020 at 6:26 AM Dan van der Ster  wrote:
> > My understanding is that the recall thresholds (see my list below)
> > should be scaled proportionally. OTOH, I haven't played with the decay
> > rates (and don't know if there's any significant value to tuning
> > those).
>
> I haven't gone through this thread yet but I want to note for those
> reading that we do now have documentation (thanks for the frequent
> pokes Janek!) for the recall configurations:
>
> https://docs.ceph.com/en/latest/cephfs/cache-configuration/#mds-recall
>
> Please let us know if it's missing information or if something could
> be more clear.

I also now have a PR open for updating the defaults based on these and
other discussions: https://github.com/ceph/ceph/pull/38574

Feedback welcome.

-- 
Patrick Donnelly, Ph.D.
He / Him / His
Principal Software Engineer
Red Hat Sunnyvale, CA
GPG: 19F28A586F808C2402351B93C3301A3E258DD79D


[ceph-users] Re: Provide more documentation for MDS performance tuning on large file systems

2020-12-07 Thread Janek Bevendorff

Hi Patrick,

I haven't gone through this thread yet but I want to note for those
reading that we do now have documentation (thanks for the frequent
pokes Janek!) for the recall configurations:

https://docs.ceph.com/en/latest/cephfs/cache-configuration/#mds-recall

Please let us know if it's missing information or if something could
be more clear.

The documentation has helped a great deal already and I've been playing
around with these settings quite a bit recently. What's missing, obviously, are
recommended settings for individual scenarios (at least ballparks). But 
that is hard to come by without experimenting first (I wouldn't call our 
deployment massive, but very likely significantly above average and I 
don't know what scale the developers are usually testing at). As I 
mentioned in the other thread, I am testing Dan's recommendations at the 
moment and will refine them for our purposes. The effects of individual 
tweaks are hard to assess without dedicated benchmarks (although "MDS 
not hanging up" is already somewhat of a benchmark :-)).



[ceph-users] Re: Provide more documentation for MDS performance tuning on large file systems

2020-12-07 Thread Patrick Donnelly
Hi Dan & Janek,

On Sat, Dec 5, 2020 at 6:26 AM Dan van der Ster  wrote:
> My understanding is that the recall thresholds (see my list below)
> should be scaled proportionally. OTOH, I haven't played with the decay
> rates (and don't know if there's any significant value to tuning
> those).

I haven't gone through this thread yet but I want to note for those
reading that we do now have documentation (thanks for the frequent
pokes Janek!) for the recall configurations:

https://docs.ceph.com/en/latest/cephfs/cache-configuration/#mds-recall

Please let us know if it's missing information or if something could
be more clear.

-- 
Patrick Donnelly, Ph.D.
He / Him / His
Principal Software Engineer
Red Hat Sunnyvale, CA
GPG: 19F28A586F808C2402351B93C3301A3E258DD79D


[ceph-users] Re: Provide more documentation for MDS performance tuning on large file systems

2020-12-07 Thread Janek Bevendorff
Never mind: when I enable it on a busier directory, I do see new
ephemeral pins popping up, just not on the directories I set it on
originally. Let's see how that holds up.
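
For reference, the mechanics involved look roughly like this (mount point and
paths are made up):

# Explicitly pin a directory tree to rank 0
setfattr -n ceph.dir.pin -v 0 /mnt/cephfs/top-level-dir

# Distribute the immediate children of a directory across the active ranks
setfattr -n ceph.dir.pin.distributed -v 1 /mnt/cephfs/busy-dir

# Ephemeral pins should then show up in the subtree listing
ceph tell mds.\* get subtrees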


On 07/12/2020 13:04, Janek Bevendorff wrote:
Thanks. I tried playing around a bit with 
mds_export_ephemeral_distributed just now, because it's pretty much 
the same thing that your script does manually. Unfortunately, it seems 
to have no effect.


I pinned all top-level directories to mds.0 and then enabled 
ceph.dir.pin.distributed for a few sub trees. Despite 
mds_export_ephemeral_distributed being set to true, all work is done 
by mds.0 now and I also don't see any additional pins in ceph tell 
mds.\* get subtrees.


Any ideas why that might be?


On 07/12/2020 10:49, Dan van der Ster wrote:

On Mon, Dec 7, 2020 at 10:39 AM Janek Bevendorff
 wrote:



What exactly do you set to 64k?
We used to set mds_max_caps_per_client to 5, but once we started
using the tuned caps recall config, we reverted that back to the
default 1M without issue.

mds_max_caps_per_client. As I mentioned, some clients hit this limit
regularly and they aren't entirely idle. I will keep tuning the recall
settings, though.


This 15k caps client I mentioned is not related to the max caps per
client config. In recent nautilus, the MDS will proactively recall
caps from idle clients -- so a client with even just a few caps like
this can provoke the caps recall warnings (if it is buggy, like in
this case). The client doesn't cause any real problems, just the
annoying warnings.

We only see the warnings during normal operation. I remember having
massive issues with early Nautilus releases, but thanks to more
aggressive recall behaviour in newer releases, that is fixed. Back then
it was virtually impossible to keep the MDS within the bounds of its
memory limit. Nowadays, the warnings only appear when the MDS is really
stressed. In that situation, the whole FS performance is already
degraded massively and MDSs are likely to fail and run into the 
rejoin loop.



Multi-active + pinning definitely increases the overall MD throughput
(once you can get the relevant inodes cached), because as you know the
MDS is single threaded and CPU bound at the limit.
We could get something like 4-5k handle_client_requests out of a
single MDS, and that really does scale horizontally as you add MDSs
(and pin).

Okay, I will definitely re-evaluate options for pinning individual
directories, perhaps a small script can do it.

There is a new ephemeral pinning option in the latest releases,
but we haven't tried it yet.
Here's our script -- it assumes the parent dir is pinned to zero or
that bal is disabled:

https://github.com/cernceph/ceph-scripts/blob/master/tools/cephfs/cephfs-bal-shard 



Too many pins can cause problems -- we have something like 700 pins at
the moment and it's fine, though.

Cheers, Dan






[ceph-users] Re: Provide more documentation for MDS performance tuning on large file systems

2020-12-07 Thread Janek Bevendorff
Thanks. I tried playing around a bit with 
mds_export_ephemeral_distributed just now, because it's pretty much the 
same thing that your script does manually. Unfortunately, it seems to 
have no effect.


I pinned all top-level directories to mds.0 and then enabled 
ceph.dir.pin.distributed for a few sub trees. Despite 
mds_export_ephemeral_distributed being set to true, all work is done by 
mds.0 now and I also don't see any additional pins in ceph tell mds.\* 
get subtrees.


Any ideas why that might be?


On 07/12/2020 10:49, Dan van der Ster wrote:

On Mon, Dec 7, 2020 at 10:39 AM Janek Bevendorff
 wrote:



What exactly do you set to 64k?
We used to set mds_max_caps_per_client to 5, but once we started
using the tuned caps recall config, we reverted that back to the
default 1M without issue.

mds_max_caps_per_client. As I mentioned, some clients hit this limit
regularly and they aren't entirely idle. I will keep tuning the recall
settings, though.


This 15k caps client I mentioned is not related to the max caps per
client config. In recent nautilus, the MDS will proactively recall
caps from idle clients -- so a client with even just a few caps like
this can provoke the caps recall warnings (if it is buggy, like in
this case). The client doesn't cause any real problems, just the
annoying warnings.

We only see the warnings during normal operation. I remember having
massive issues with early Nautilus releases, but thanks to more
aggressive recall behaviour in newer releases, that is fixed. Back then
it was virtually impossible to keep the MDS within the bounds of its
memory limit. Nowadays, the warnings only appear when the MDS is really
stressed. In that situation, the whole FS performance is already
degraded massively and MDSs are likely to fail and run into the rejoin loop.


Multi-active + pinning definitely increases the overall MD throughput
(once you can get the relevant inodes cached), because as you know the
MDS is single threaded and CPU bound at the limit.
We could get something like 4-5k handle_client_requests out of a
single MDS, and that really does scale horizontally as you add MDSs
(and pin).

Okay, I will definitely re-evaluate options for pinning individual
directories, perhaps a small script can do it.

There is a new ephemeral pinning option in the latest releases,
but we haven't tried it yet.
Here's our script -- it assumes the parent dir is pinned to zero or
that bal is disabled:

https://github.com/cernceph/ceph-scripts/blob/master/tools/cephfs/cephfs-bal-shard

Too many pins can cause problems -- we have something like 700 pins at
the moment and it's fine, though.

Cheers, Dan






[ceph-users] Re: Provide more documentation for MDS performance tuning on large file systems

2020-12-07 Thread Dan van der Ster
On Mon, Dec 7, 2020 at 10:39 AM Janek Bevendorff
 wrote:
>
>
> > What exactly do you set to 64k?
> > We used to set mds_max_caps_per_client to 5, but once we started
> > using the tuned caps recall config, we reverted that back to the
> > default 1M without issue.
>
> mds_max_caps_per_client. As I mentioned, some clients hit this limit
> regularly and they aren't entirely idle. I will keep tuning the recall
> settings, though.
>
> > This 15k caps client I mentioned is not related to the max caps per
> > client config. In recent nautilus, the MDS will proactively recall
> > caps from idle clients -- so a client with even just a few caps like
> > this can provoke the caps recall warnings (if it is buggy, like in
> > this case). The client doesn't cause any real problems, just the
> > annoying warnings.
>
> We only see the warnings during normal operation. I remember having
> massive issues with early Nautilus releases, but thanks to more
> aggressive recall behaviour in newer releases, that is fixed. Back then
> it was virtually impossible to keep the MDS within the bounds of its
> memory limit. Nowadays, the warnings only appear when the MDS is really
> stressed. In that situation, the whole FS performance is already
> degraded massively and MDSs are likely to fail and run into the rejoin loop.
>
> > Multi-active + pinning definitely increases the overall MD throughput
> > (once you can get the relevant inodes cached), because as you know the
> > MDS is single threaded and CPU bound at the limit.
> > We could get something like 4-5k handle_client_requests out of a
> > single MDS, and that really does scale horizontally as you add MDSs
> > (and pin).
>
> Okay, I will definitely re-evaluate options for pinning individual
> directories, perhaps a small script can do it.

There is a new ephemeral pinning option in the latest releases,
but we haven't tried it yet.
Here's our script -- it assumes the parent dir is pinned to zero or
that bal is disabled:

https://github.com/cernceph/ceph-scripts/blob/master/tools/cephfs/cephfs-bal-shard

Too many pins can cause problems -- we have something like 700 pins at
the moment and it's fine, though.
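
To get a feel for how many subtrees each rank is carrying:

# Number of subtrees held by rank 0 (repeat per rank, or use mds.\*)
ceph tell mds.0 get subtrees | jq length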

Cheers, Dan



>


[ceph-users] Re: Provide more documentation for MDS performance tuning on large file systems

2020-12-07 Thread Janek Bevendorff




What exactly do you set to 64k?
We used to set mds_max_caps_per_client to 5, but once we started
using the tuned caps recall config, we reverted that back to the
default 1M without issue.


mds_max_caps_per_client. As I mentioned, some clients hit this limit 
regularly and they aren't entirely idle. I will keep tuning the recall 
settings, though.
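
A quick way to see which clients are actually holding those caps (a sketch; the
session listing has included per-client cap counts since at least Nautilus):

# Top cap holders as seen by rank 0; repeat for the other ranks
ceph tell mds.0 session ls | jq 'sort_by(-.num_caps) | .[0:5][] | {id, num_caps}'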



This 15k caps client I mentioned is not related to the max caps per
client config. In recent nautilus, the MDS will proactively recall
caps from idle clients -- so a client with even just a few caps like
this can provoke the caps recall warnings (if it is buggy, like in
this case). The client doesn't cause any real problems, just the
annoying warnings.


We only see the warnings during normal operation. I remember having 
massive issues with early Nautilus releases, but thanks to more 
aggressive recall behaviour in newer releases, that is fixed. Back then 
it was virtually impossible to keep the MDS within the bounds of its 
memory limit. Nowadays, the warnings only appear when the MDS is really 
stressed. In that situation, the whole FS performance is already 
degraded massively and MDSs are likely to fail and run into the rejoin loop.



Multi-active + pinning definitely increases the overall MD throughput
(once you can get the relevant inodes cached), because as you know the
MDS is single threaded and CPU bound at the limit.
We could get something like 4-5k handle_client_requests out of a
single MDS, and that really does scale horizontally as you add MDSs
(and pin).


Okay, I will definitely re-evaluate options for pinning individual 
directories, perhaps a small script can do it.



[ceph-users] Re: Provide more documentation for MDS performance tuning on large file systems

2020-12-07 Thread Dan van der Ster
On Mon, Dec 7, 2020 at 9:42 AM Janek Bevendorff
 wrote:
>
> Thanks, Dan!
>
> I have played with many thresholds, including the decay rates. It is
> indeed very difficult to assess their effects, since our workloads
> differ widely depending on what people are working on at the moment. I
> would need to develop a proper benchmarking suite to simulate the
> different heavy workloads we have.
>
> > We currently run with all those options scaled up 6x the defaults, and
> > we almost never have caps recall warnings these days, with a couple
> > thousand cephfs clients.
>
> Under normal operation, we don't either. We had issues in the past with
> Ganesha and still do sometimes, but that's a bug in Ganesha and we don't
> really use it for anything but legacy clients anyway. Usually, recall

+1 we've seen that ganesha issue; it simply won't release caps ever,
even with the latest fixes in this area.

> works flawlessly, unless some client suddenly starts doing crazy shit.
> We have just a few clients who regularly keep tens of thousands of caps
> open and had I not limited the number, it would be hundreds of
> thousands. Recalling them without threatening stability is not trivial
> and at the least it degrades the performance for everybody else. Any
> pointers here to better handling this situation are greatly appreciated.
> I will definitely try your config recommendations.
>
> > 2. A user running VSCodium, keeping 15k caps open.. the opportunistic
> > caps recall eventually starts recalling those but the (el7 kernel)
> > client won't release them. Stopping Codium seems to be the only way to
> > release.
>
> As I said, 15k is not much for us. The limits right now are 64k per
> client and a few hit that limit quite regularly. One of those clients is

What exactly do you set to 64k?
We used to set mds_max_caps_per_client to 5, but once we started
using the tuned caps recall config, we reverted that back to the
default 1M without issue.

This 15k caps client I mentioned is not related to the max caps per
client config. In recent nautilus, the MDS will proactively recall
caps from idle clients -- so a client with even just a few caps like
this can provoke the caps recall warnings (if it is buggy, like in
this case). The client doesn't cause any real problems, just the
annoying warnings.

So what I'm looking for now is a way to disable proactively recalling
if the num caps is below some threshold -- `min_caps_per_client` might
do this but I haven't tested yet.
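
The full option name is mds_min_caps_per_client; untested as noted above, but
trying it would just be (the threshold value here is arbitrary):

ceph config set mds mds_min_caps_per_client 8192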

> our VPN gateway, which, technically, is not a single client, but to the
> CephFS it looks like one due to source NAT. This is certainly something
> I want to tune further, so that clients are routed directly via their
> private IP instead of being NAT'ed. The other ones are our GPU deep
> learning servers (just three of them, but they can generate astounding
> numbers of iops) and the 135-node Hadoop cluster (which is hard to
> sustain for any single machine, so we prefer to use the S3 here).
>
> > Otherwise, 4GB is normally sufficient in our env for
> > mds_cache_memory_limit (3 active MDSs), however this is highly
> > workload dependent. If several clients are actively taking 100s of
> > thousands of caps, then the 4GB MDS needs to be ultra busy recalling
> > caps and latency increases. We saw this live a couple weeks ago: a few
> > users started doing intensive rsyncs, and some other users noticed an
> > MD latency increase; it was fixed immediately just by increasing the
> > mem limit to 8GB.
>
> So you too have 3 active MDSs? Are you using directory pinning? We have
> a very deep and unbalanced directory structure, so I cannot really pin
> any top-level directory without skewing the load massively. From my
> experience, three MDSs without explicit pinning aren't much better or
> even worse than one. But perhaps you have different observations?

Yes 3 active today, and lots of pinning thanks to our flat hierarchy.
User dirs are pinned to one of three randomly, as are the manila
shares.
MD balancer = on creates a disaster in our env -- too much ping pong
of dirs between the MDSs, too much metadata IO needed to keep up, not
to mention "nice export" bugs in the past that forced us to disable
the balancer to begin with.
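
That kind of static sharding boils down to something like the sketch below
(paths hypothetical; the cephfs-bal-shard script linked earlier is the real
thing):

# Pin each user directory to one of the three active ranks at random
for d in /mnt/cephfs/user/*/; do
    setfattr -n ceph.dir.pin -v $((RANDOM % 3)) "$d"
done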

We used to have 10 active MDSs, but that is such a pain during
upgrades that we're now trying with just three. Next upgrade we'll
probably leave it at one for a while to see if that suffices.

Multi-active + pinning definitely increases the overall MD throughput
(once you can get the relevant inodes cached), because as you know the
MDS is single threaded and CPU bound at the limit.
We could get something like 4-5k handle_client_requests out of a
single MDS, and that really does scale horizontally as you add MDSs
(and pin).
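
The per-rank request rate is easy to watch while experimenting with this, e.g.:

# Shows per-rank req/s plus cached dentries/inodes and client counts
ceph fs status
# or continuously:
watch -n 2 ceph fs status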

Cheers, Dan

>
>
> > I agree some sort of tuning best practises should all be documented
> > somehow, even though it's complex and rather delicate.
>
> Indeed!
>
>
> Janek
>

[ceph-users] Re: Provide more documentation for MDS performance tuning on large file systems

2020-12-07 Thread Janek Bevendorff

Thanks, Dan!

I have played with many thresholds, including the decay rates. It is 
indeed very difficult to assess their effects, since our workloads 
differ widely depending on what people are working on at the moment. I 
would need to develop a proper benchmarking suite to simulate the 
different heavy workloads we have.



We currently run with all those options scaled up 6x the defaults, and
we almost never have caps recall warnings these days, with a couple
thousand cephfs clients.


Under normal operation, we don't either. We had issues in the past with 
Ganesha and still do sometimes, but that's a bug in Ganesha and we don't 
really use it for anything but legacy clients anyway. Usually, recall
works flawlessly, unless some client suddenly starts doing crazy shit. 
We have just a few clients who regularly keep tens of thousands of caps 
open and had I not limited the number, it would be hundreds of 
thousands. Recalling them without threatening stability is not trivial 
and at the least it degrades the performance for everybody else. Any 
pointers here to better handling this situation are greatly appreciated. 
I will definitely try your config recommendations.



2. A user running VSCodium, keeping 15k caps open.. the opportunistic
caps recall eventually starts recalling those but the (el7 kernel)
client won't release them. Stopping Codium seems to be the only way to
release.


As I said, 15k is not much for us. The limits right now are 64k per 
client and a few hit that limit quite regularly. One of those clients is 
our VPN gateway, which, technically, is not a single client, but to the 
CephFS it looks like one due to source NAT. This is certainly something 
I want to tune further, so that clients are routed directly via their 
private IP instead of being NAT'ed. The other ones are our GPU deep 
learning servers (just three of them, but they can generate astounding 
numbers of iops) and the 135-node Hadoop cluster (which is hard to 
sustain for any single machine, so we prefer to use the S3 here).



Otherwise, 4GB is normally sufficient in our env for
mds_cache_memory_limit (3 active MDSs), however this is highly
workload dependent. If several clients are actively taking 100s of
thousands of caps, then the 4GB MDS needs to be ultra busy recalling
caps and latency increases. We saw this live a couple weeks ago: a few
users started doing intensive rsyncs, and some other users noticed an
MD latency increase; it was fixed immediately just by increasing the
mem limit to 8GB.


So you too have 3 active MDSs? Are you using directory pinning? We have 
a very deep and unbalanced directory structure, so I cannot really pin 
any top-level directory without skewing the load massively. From my 
experience, three MDSs without explicit pinning aren't much better or 
even worse than one. But perhaps you have different observations?




I agree some sort of tuning best practises should all be documented
somehow, even though it's complex and rather delicate.


Indeed!


Janek


[ceph-users] Re: Provide more documentation for MDS performance tuning on large file systems

2020-12-05 Thread Dan van der Ster
Hi Janek,

My understanding is that the recall thresholds (see my list below)
should be scaled proportionally. OTOH, I haven't played with the decay
rates (and don't know if there's any significant value to tuning
those).

We have a recall tuning script that we use to deploy different factors
whenever there are caps recall issues:

#!/bin/bash
# Scale the MDS caps-recall thresholds to X times the defaults,
# e.g. run with "6" as the argument for the 6x factors mentioned below.
X=$1
echo "Scaling MDS recall by ${X}x"
ceph tell 'mds.*' injectargs -- \
  --mds_recall_max_decay_threshold $((X*16*1024)) \
  --mds_recall_max_caps $((X*5000)) \
  --mds_recall_global_max_decay_threshold $((X*64*1024)) \
  --mds_recall_warning_threshold $((X*32*1024)) \
  --mds_cache_trim_threshold $((X*64*1024))

We currently run with all those options scaled up 6x the defaults, and
we almost never have caps recall warnings these days, with a couple
thousand cephfs clients.

In the past month I've seen 2 different cases of a client not
releasing caps even with these options:
1. A user had ceph-fuse mounted /cephfs/ on top of a 2nd ceph-fuse
/cephfs. The outer (i.e. lower) mountpoint/process had several thousand
caps that could never be released until the user cleaned up their
mounts.
2. A user running VSCodium, keeping 15k caps open.. the opportunistic
caps recall eventually starts recalling those but the (el7 kernel)
client won't release them. Stopping Codium seems to be the only way to
release.

Otherwise, 4GB is normally sufficient in our env for
mds_cache_memory_limit (3 active MDSs), however this is highly
workload dependent. If several clients are actively taking 100s of
thousands of caps, then the 4GB MDS needs to be ultra busy recalling
caps and latency increases. We saw this live a couple weeks ago: a few
users started doing intensive rsyncs, and some other users noticed an
MD latency increase; it was fixed immediately just by increasing the
mem limit to 8GB.

I agree some sort of tuning best practises should all be documented
somehow, even though it's complex and rather delicate.

-- Dan


On Sat, Jan 25, 2020 at 5:54 PM Janek Bevendorff
 wrote:
>
> Hello,
>
> Over the last week I have tried optimising the performance of our MDS
> nodes for the large amount of files and concurrent clients we have. It
> turns out that despite various stability fixes in recent releases, the
> default configuration still doesn't appear to be optimal for keeping the
> cache size under control and avoiding intermittent I/O blocks.
>
> Unfortunately, it is very hard to tweak the configuration to something
> that works, because the tuning parameters needed are largely
> undocumented or only described in very technical terms in the source
> code making them quite unapproachable for administrators not familiar
> with all the CephFS internals. I would therefore like to ask if it were
> possible to document the "advanced" MDS settings more clearly as to what
> they do and in what direction they have to be tuned for more or less
> aggressive cap recall, for instance (sometimes it is not clear if a
> threshold is a min or a max threshold).
>
> I am in the very (un)fortunate situation to have folders with
> several 100K direct subfolders or files (and one extreme case with
> almost 7 million dentries), which is a pretty good benchmark for
> measuring cap growth while performing operations on them. For the time
> being, I came up with this configuration, which seems to work for me,
> but is still far from optimal:
>
> mds basic    mds_cache_memory_limit 10737418240
> mds advanced mds_cache_trim_threshold 131072
> mds advanced mds_max_caps_per_client 50
> mds advanced mds_recall_max_caps 17408
> mds advanced mds_recall_max_decay_rate   2.00
>
> The parameters I am least sure about---because I understand the least
> how they actually work---are mds_cache_trim_threshold and
> mds_recall_max_decay_rate. Despite reading the description in
> src/common/options.cc, I understand only half of what they're doing and
> I am also not quite sure in which direction to tune them for optimal
> results.
>
> Another point where I am struggling is the correct configuration of
> mds_recall_max_caps. The default of 5K doesn't work too well for me, but
> values above 20K also don't seem to be a good choice. While high values
> result in fewer blocked ops and better performance without destabilising
> the MDS, they also lead to slow but unbounded cache growth, which seems
> counter-intuitive. 17K was the maximum I could go. Higher values work
> for most use cases, but when listing very large folders with millions of
> dentries, the MDS cache size slowly starts to exceed the limit after a
> few hours, since the MDSs are failing to keep clients below
> mds_max_caps_per_client despite not showing any "failing to respond to
> cache pressure" warnings.
>
> With the configuration above, I do not have cache size issues any more,
> but it comes at the cost of performance and slow/blocked ops. A few
> hints as to how I could optimise my