[ceph-users] What is client request_load_avg? Troubleshooting MDS issues on Luminous

2022-08-13 Thread Chris Smart
Hi all, I have recently inherited a 10 node Ceph cluster running Luminous (12.2.12) which is running specifically for CephFS (and I don't know much about MDS) with only one active MDS server (two standby). It's not a great cluster IMO, the cephfs_data pool is on high density nodes with high capaci

[ceph-users] Re: What is client request_load_avg? Troubleshooting MDS issues on Luminous

2022-08-15 Thread Chris Smart
On Tue, 2022-08-16 at 13:21 +1000, distro...@gmail.com wrote: > > I'm not quite sure of the relationship of operations between MDS and > OSD data. The MDS gets written to nvme pool and clients access data > directly on OSD nodes, but do MDS operations also need to wait for > OSDs > to perform oper

[ceph-users] Re: What is client request_load_avg? Troubleshooting MDS issues on Luminous

2022-08-15 Thread Chris Smart
problematic clients, maybe some combination of num_caps, ops, load, etc... I still think that would be useful to know, even if the bottlenecks in my cluster can be discovered and remedied... Cheers, -c > Best regards, > = > Frank Schilder > AIT Risø Campus > Bygning 1

[ceph-users] Re: What is client request_load_avg? Troubleshooting MDS issues on Luminous

2022-08-16 Thread Chris Smart
On Tue, 2022-08-16 at 07:50 +0000, Eugen Block wrote: > Hi, > > > However, the ceph-mds process is pretty much constantly over 100% > > CPU > > and often over 200%. Given it's a single process, right? It makes > > me > > think that some operations are too slow or some task is pegging the > > CPU >

[ceph-users] Re: What is client request_load_avg? Troubleshooting MDS issues on Luminous

2022-08-16 Thread Chris Smart
On Tue, 2022-08-16 at 10:52 +0000, Frank Schilder wrote: > Hi Chris, > > I would strongly advise not to use multi-MDS with 5000 clients on > luminous. I enabled it on mimic with ca. 1750 clients and it was > extremely dependent on luck if it converged to a stable distribution > of dirfrags or ende

[ceph-users] Re: What is client request_load_avg? Troubleshooting MDS issues on Luminous

2022-08-17 Thread Chris Smart
On Mon, 2022-08-15 at 08:33 +0000, Eugen Block wrote: > Hi, > > do you see high disk utilization on the OSD nodes? How is the load > on > the active MDS? How much RAM is configured for the MDS > (mds_cache_memory_limit)? > You can list all MDS sessions with 'ceph daemon mds. session > ls' >
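
The session listing mentioned in the message above is JSON, so ranking clients by a field such as num_caps or request_load_avg can be scripted. What follows is only a rough sketch: the field names (id, num_caps, request_load_avg, client_metadata.hostname) are assumptions about what the session listing contains on this release, the sample records are made up, and rank_sessions is a hypothetical helper, not part of Ceph.

```python
import json

def rank_sessions(sessions, key="num_caps", top=5):
    """Return (id, hostname, value) for the sessions with the largest `key`.

    `sessions` is the parsed JSON list from the MDS session listing.
    Sessions that lack `key` are treated as 0 (an assumption for this sketch).
    """
    ranked = sorted(sessions, key=lambda s: s.get(key, 0), reverse=True)
    return [
        (s.get("id"), s.get("client_metadata", {}).get("hostname", "?"), s.get(key, 0))
        for s in ranked[:top]
    ]

# Made-up sample data shaped like the assumed session listing output.
SAMPLE_SESSIONS = json.loads("""
[
  {"id": 101, "num_caps": 1200,   "request_load_avg": 3,
   "client_metadata": {"hostname": "node-a"}},
  {"id": 102, "num_caps": 950000, "request_load_avg": 40,
   "client_metadata": {"hostname": "node-b"}},
  {"id": 103, "num_caps": 50,     "request_load_avg": 870,
   "client_metadata": {"hostname": "node-c"}}
]
""")

if __name__ == "__main__":
    for field in ("num_caps", "request_load_avg"):
        print(f"top sessions by {field}:")
        for sid, host, value in rank_sessions(SAMPLE_SESSIONS, key=field, top=2):
            print(f"  client.{sid} ({host}): {value}")
```

In practice the list would come from saving the session listing to a file and loading it with json.load; sorting by more than one field gives a quick cross-check on which clients stand out.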

[ceph-users] Re: What is client request_load_avg? Troubleshooting MDS issues on Luminous

2022-08-17 Thread Chris Smart
ash pool that I decided to migrate a 1PB file > system over to the new format. This reduces the meta data IO load on > the HDD pool significantly and even speeds up some operations that > only operate on meta-data. > Fortunately, it's just replication so I think I avoid this issue at lea

[ceph-users] Re: What is client request_load_avg? Troubleshooting MDS issues on Luminous

2022-08-17 Thread Chris Smart
On Wed, 2022-08-17 at 17:10 +1000, Chris Smart wrote: > Looking at the MDS ops in flight, the majority are journal_and_reply: > > $ sudo ceph daemon mds.$(hostname) dump_ops_in_flight |grep > 'flag_point' |sort |uniq -c > 28 "flag_poin
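
The grep, sort and uniq pipeline quoted above can also be done on the parsed JSON, which avoids depending on how the output is line-wrapped. This is a minimal sketch, assuming the dump has an "ops" list whose entries carry type_data.flag_point; the sample data is invented for illustration and count_flag_points is a hypothetical helper.

```python
import json
from collections import Counter

def count_flag_points(dump):
    """Count in-flight ops per flag_point in a parsed dump_ops_in_flight result.

    Ops without a flag_point are counted under "unknown".
    """
    return Counter(
        op.get("type_data", {}).get("flag_point", "unknown")
        for op in dump.get("ops", [])
    )

# Invented sample shaped like the assumed dump_ops_in_flight output.
SAMPLE_DUMP = json.loads("""
{
  "ops": [
    {"description": "client_request(...)", "age": 0.5,
     "type_data": {"flag_point": "submit entry: journal_and_reply"}},
    {"description": "client_request(...)", "age": 12.0,
     "type_data": {"flag_point": "failed to rdlock, waiting"}},
    {"description": "client_request(...)", "age": 0.2,
     "type_data": {"flag_point": "submit entry: journal_and_reply"}}
  ],
  "num_ops": 3
}
""")

if __name__ == "__main__":
    # Same layout as `sort | uniq -c`: count first, then the value.
    for flag_point, count in count_flag_points(SAMPLE_DUMP).most_common():
        print(f"{count:7d} {flag_point}")
```

To use it against a live MDS, the command output would be saved to a file (or read from stdin) and passed through json.load before calling count_flag_points.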

[ceph-users] Re: What is client request_load_avg? Troubleshooting MDS issues on Luminous

2022-08-17 Thread Chris Smart
On Wed, 2022-08-17 at 21:43 +1000, Chris Smart wrote: > > O, is "journal_and_reply" actually the very last event in a > successful operation?...[1] No wonder so many are the last event... > :facepalm: > > OK, well assuming that, then I can probably look out for

[ceph-users] Re: What is client request_load_avg? Troubleshooting MDS issues on Luminous

2022-08-20 Thread Chris Smart
want more performance, they should allow you to buy 4TB or 6TB > NLSAS drives instead of the big SATA ones. You are also a bit low on > memory. 256GB of RAM for 60 OSDs is not much. I operate 70+ OSD-nodes > with 512GB and the OSDs sometimes use up to 80% of that. You might > run int

[ceph-users] Re: What is client request_load_avg? Troubleshooting MDS issues on Luminous

2022-08-20 Thread Chris Smart
On Fri, 2022-08-19 at 15:48 +0200, Stefan Kooman wrote: > On 8/19/22 15:04, Frank Schilder wrote: > > Hi Chris, > > > > looks like your e-mail stampede is over :) I will cherry-pick some > > questions to answer, other things either follow or you will figure > > it out with the docs and trial-and-e

[ceph-users] Re: What is client request_load_avg? Troubleshooting MDS issues on Luminous

2022-08-22 Thread Chris Smart
On Mon, 2022-08-22 at 16:42 +0200, Stefan Kooman wrote: > On 8/21/22 04:41, Chris Smart wrote: > > > OK, so basically sounds like I should stick with filestore, upgrade > > the > > cluster to Pacific to inherit the newer settings, then do the > > conversion to blu

[ceph-users] Re: What is client request_load_avg? Troubleshooting MDS issues on Luminous

2022-08-22 Thread Chris Smart
On Mon, 2022-08-22 at 11:13 +0000, Frank Schilder wrote: > Hi Chris. > > > Interestingly, when duration gets long and performance gets bad ... > > This observation is likely due to MDS and client cache. My experience > with ceph's cache implementations is that, well, they seem not that > great. I

[ceph-users] Re: What is client request_load_avg? Troubleshooting MDS issues on Luminous

2022-08-31 Thread Chris Smart
On Wed, 2022-08-24 at 15:45 +0200, Stefan Kooman wrote: > On 8/21/22 04:31, Chris Smart wrote: > > > > > > Now I'm trying to understand the cause of delay in obtaining locks. > > > > I'd like to confirm first that when I see mds ops waiting to ge