[ceph-users] Adding new OSDs - also adding PGs?
Hi All, I'm going to be adding a bunch of OSDs to our cephfs cluster shortly (increasing the total size by 50%). We're on reef, and will be deploying using the cephadm method, and the OSDs are exactly the same size and disk type as the current ones. So, after adding the new OSDs, my understanding is that ceph will begin rebalancing the data. I will also probably want to increase my PGs to accommodate the new OSDs being added. My question is basically: should I wait for the rebalance to finish before increasing my PG count, which would then kick off another rebalance for the new PGs? Or should I increase the PG count as soon as the rebalance starts after adding the new OSDs, so that it creates the new PGs and rebalances onto the new OSDs at the same time? Thanks for any guidance! -erich ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
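For reference, a rough sketch of how the PG count is usually checked and bumped once new OSDs are in. The pool name cephfs_data is taken from later messages in this archive; the target of 2048 is only an example, and clusters running the pg_autoscaler can also just leave it to adjust pg_num on its own:

# check the current pg_num and what the autoscaler would suggest
ceph osd pool get cephfs_data pg_num
ceph osd pool autoscale-status
# raise pg_num on the data pool (example value; pick a power of two sized for the new OSD count)
ceph osd pool set cephfs_data pg_num 2048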
[ceph-users] Re: [EXTERN] Re: cache pressure?
I still saw client cache pressure messages, although I think it did in general help a bit. What I additionally just did (like 5 minutes ago) was to reduce "mds_recall_max_caps" from 30,000 to 10,000 after looking at this post: https://www.spinics.net/lists/ceph-users/msg73188.html I will try further reducing mds_recall_max_caps if the pressure messages keep coming up. After reducing it to 10,000, a few client cache pressure warnings cleared, but I don't know yet if that was the reason they cleared or if it was just luck. If I see it stay clear then I'll call it solved. -erich On 5/7/24 6:55 AM, Dietmar Rieder wrote: On 4/26/24 23:51, Erich Weiler wrote: As Dietmar said, VS Code may cause this. Quite funny to read, actually, because we've been dealing with this issue for over a year, and yesterday was the very first time Ceph complained about a client and we saw VS Code's remote stuff running. Coincidence. I'm holding my breath that the vscode issue is the one affecting us - I got my users to tweak their vscode configs and the problem seemed to go away, but I guess I won't consider it 'solved' until a few days pass without it coming back... :) I wonder if the vscode configs solved your issues, or if you still see the cache pressure messages? Dietmar ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
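For reference, a minimal sketch of how that kind of change is usually applied and verified cluster-wide. The 10,000 value simply mirrors the message above, the daemon name is the one that appears elsewhere in this archive, and whether this helps will depend on the workload:

# lower how many caps the MDS asks a client to release per recall event
ceph config set mds mds_recall_max_caps 10000
# confirm the value the running daemon actually picked up
ceph config show mds.slugfs.pr-md-01.xdtppo mds_recall_max_caps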
[ceph-users] Re: 'ceph fs status' no longer works?
Excellent! Restarting all the MDS daemons fixed it. Thank you. This kinda feels like a bug. -erich On 5/2/24 12:44 PM, Bandelow, Gunnar wrote: Hi Erich, I'm not sure about this specific error message, but "ceph fs status" sometimes failed for me at the end of last year/beginning of this year. Restarting ALL mon, mgr AND mds fixed it at the time. Best regards, Gunnar === Gunnar Bandelow (dipl. phys.) Universitätsrechenzentrum (URZ) Universität Greifswald Felix-Hausdorff-Straße 18 17489 Greifswald Germany --- Original Message --- *Subject: *[ceph-users] Re: 'ceph fs status' no longer works? *From: *"Erich Weiler" <wei...@soe.ucsc.edu> *To: *"Eugen Block" <ebl...@nde.ag>, ceph-users@ceph.io *Date: *02-05-2024 21:05 Hi Eugen, Thanks for the tip! I just ran: ceph orch daemon restart mgr.pr-md-01.jemmdf (my specific mgr instance) And it restarted my primary mgr daemon, and in the process failed over to my standby mgr daemon on another server. That went smoothly. Unfortunately, I still cannot get 'ceph fs status' to work (on any node)... # ceph fs status Error EINVAL: Traceback (most recent call last): File "/usr/share/ceph/mgr/mgr_module.py", line 1811, in _handle_command return CLICommand.COMMANDS[cmd['prefix']].call(self, cmd, inbuf) File "/usr/share/ceph/mgr/mgr_module.py", line 474, in call return self.func(mgr, **kwargs) File "/usr/share/ceph/mgr/status/module.py", line 109, in handle_fs_status assert metadata AssertionError -erich On 5/2/24 11:07 AM, Eugen Block wrote: > Yep, seen this a couple of times during upgrades. I'll have to check my > notes if I wrote anything down for that. But try a mgr failover first, > that could help. > > Quoting Erich Weiler <wei...@soe.ucsc.edu>: > >> Hi All, >> >> For a while now I've been using 'ceph fs status' to show current MDS >> active servers, filesystem status, etc. I recently took down my MDS >> servers and added RAM to them (one by one, so the filesystem stayed >> online). After doing that with my four MDS servers (I had two active >> and two standby), all looks OK, 'ceph -s' reports HEALTH_OK. But when >> I do 'ceph fs status' now, I get this: >> >> # ceph fs status >> Error EINVAL: Traceback (most recent call last): >> File "/usr/share/ceph/mgr/mgr_module.py", line 1811, in _handle_command >> return CLICommand.COMMANDS[cmd['prefix']].call(self, cmd, inbuf) >> File "/usr/share/ceph/mgr/mgr_module.py", line 474, in call >> return self.func(mgr, **kwargs) >> File "/usr/share/ceph/mgr/status/module.py", line 109, in >> handle_fs_status >> assert metadata >> AssertionError >> >> This is on ceph 18.2.1 reef. This is very odd - can anyone think of a >> reason why 'ceph fs status' would stop working after taking each of >> the servers down for maintenance? >> >> The filesystem is online and working just fine however. This ceph >> instance is deployed via the cephadm method on RHEL 9.3, so >> everything is containerized in podman.
>> >> Thanks again, >> erich ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
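For completeness, a sketch of how the restarts Gunnar describes can be done on a cephadm-managed cluster. The service and daemon names below follow the naming used elsewhere in this thread and should be adjusted to the actual deployment:

# fail over the active mgr, then restart the MDS service for the filesystem
ceph mgr fail
ceph orch restart mds.slugfs
# or restart individual daemons one at a time, as was done for the mgr above
ceph orch daemon restart mds.slugfs.pr-md-01.xdtppo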
[ceph-users] Re: 'ceph fs status' no longer works?
Hi Eugen, Thanks for the tip! I just ran: ceph orch daemon restart mgr.pr-md-01.jemmdf (my specific mgr instance) And it restarted my primary mgr daemon, and in the process failed over to my standby mgr daemon on another server. That went smoothly. Unfortunately, I still cannot get 'ceph fs status' to work (on any node)... # ceph fs status Error EINVAL: Traceback (most recent call last): File "/usr/share/ceph/mgr/mgr_module.py", line 1811, in _handle_command return CLICommand.COMMANDS[cmd['prefix']].call(self, cmd, inbuf) File "/usr/share/ceph/mgr/mgr_module.py", line 474, in call return self.func(mgr, **kwargs) File "/usr/share/ceph/mgr/status/module.py", line 109, in handle_fs_status assert metadata AssertionError -erich On 5/2/24 11:07 AM, Eugen Block wrote: Yep, seen this a couple of times during upgrades. I’ll have to check my notes if I wrote anything down for that. But try a mgr failover first, that could help. Zitat von Erich Weiler : Hi All, For a while now I've been using 'ceph fs status' to show current MDS active servers, filesystem status, etc. I recently took down my MDS servers and added RAM to them (one by one, so the filesystem stayed online). After doing that with my four MDS servers (I had two active and two standby), all looks OK, 'ceph -s' reports HEALTH_OK. But when I do 'ceph fs status' now, I get this: # ceph fs status Error EINVAL: Traceback (most recent call last): File "/usr/share/ceph/mgr/mgr_module.py", line 1811, in _handle_command return CLICommand.COMMANDS[cmd['prefix']].call(self, cmd, inbuf) File "/usr/share/ceph/mgr/mgr_module.py", line 474, in call return self.func(mgr, **kwargs) File "/usr/share/ceph/mgr/status/module.py", line 109, in handle_fs_status assert metadata AssertionError This is on ceph 18.2.1 reef. This is very odd - can anyone think of a reason why 'ceph fs status' would stop working after taking each of the servers down for maintenance? The filesystem is online and working just fine however. This ceph instance is deployed via the cephadm method on RHEL 9.3, so the everything is containerized in podman. Thanks again, erich ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] 'ceph fs status' no longer works?
Hi All, For a while now I've been using 'ceph fs status' to show current MDS active servers, filesystem status, etc. I recently took down my MDS servers and added RAM to them (one by one, so the filesystem stayed online). After doing that with my four MDS servers (I had two active and two standby), all looks OK, 'ceph -s' reports HEALTH_OK. But when I do 'ceph fs status' now, I get this: # ceph fs status Error EINVAL: Traceback (most recent call last): File "/usr/share/ceph/mgr/mgr_module.py", line 1811, in _handle_command return CLICommand.COMMANDS[cmd['prefix']].call(self, cmd, inbuf) File "/usr/share/ceph/mgr/mgr_module.py", line 474, in call return self.func(mgr, **kwargs) File "/usr/share/ceph/mgr/status/module.py", line 109, in handle_fs_status assert metadata AssertionError This is on ceph 18.2.1 reef. This is very odd - can anyone think of a reason why 'ceph fs status' would stop working after taking each of the servers down for maintenance? The filesystem is online and working just fine however. This ceph instance is deployed via the cephadm method on RHEL 9.3, so everything is containerized in podman. Thanks again, erich ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
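A hedged diagnostic sketch for this failure mode: the assertion at status/module.py line 109 suggests the mgr has no stored metadata for one of the MDS daemons, so checking the MDS metadata the cluster has on record can confirm that before resorting to restarts (the daemon name below is the one from this thread):

# list the metadata the mon/mgr hold for all MDS daemons; empty or missing entries would match the AssertionError
ceph mds metadata
ceph mds metadata slugfs.pr-md-01.xdtppo
# a mgr failover is a low-impact first step if entries are missing
ceph mgr fail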
[ceph-users] Re: MDS Behind on Trimming...
Hi Xiubo, Is there any way to possibly get a PR development release we could upgrade to, in order to test and see if the lock order bug per Bug #62123 could be the answer? Although I'm not sure that bug has been fixed yet? -erich On 4/21/24 9:39 PM, Xiubo Li wrote: Hi Erich, I raised one tracker for this https://tracker.ceph.com/issues/65607. Currently I haven't figured out what was holding the 'dn->lock' in the 'lookup' request or somewhere else, since there are no debug logs. Hopefully we can get the debug logs, so we can push it further. Thanks - Xiubo On 4/19/24 23:55, Erich Weiler wrote: Hi Xiubo, Nevermind I was wrong, most of the blocked ops were 12 hours old. Ug. I restarted the MDS daemon to clear them. I just reset to having one active MDS instead of two, let's see if that makes a difference. I am beginning to think it may be impossible to catch the logs that matter here. I feel like sometimes the blocked ops are just waiting because of load and sometimes they are waiting because they are stuck. But, it's really hard to tell which, without waiting a while. But, I can't wait while having debug turned on because my root disks (which are 150 GB large) fill up with debug logs in 20 minutes. So it almost seems that unless I could somehow store many TB of debug logs we won't be able to catch this. Let's see how having one MDS helps. Or maybe I actually need like 4 MDSs because the load is too high for only one or two. I don't know. Or maybe it's the lock issue you've been working on. I guess I can test the lock order fix when it's available to test. -erich On 4/19/24 7:26 AM, Erich Weiler wrote: So I woke up this morning and checked the blocked_ops again, there were 150 of them. But the age of each ranged from 500 to 4300 seconds. So it seems as if they are eventually being processed. I wonder if we are thinking about this in the wrong way? Maybe I should be *adding* MDS daemons because my current ones are overloaded? Can a single server hold multiple MDS daemons? Right now I have three physical servers each with one MDS daemon on it. I can still try reducing to one. And I'll keep an eye on blocked ops to see if any get to a very old age (and are thus wedged). -erich On 4/18/24 8:55 PM, Xiubo Li wrote: Okay, please try setting only one active mds. On 4/19/24 11:54, Erich Weiler wrote: We have 2 active MDS daemons and one standby. On 4/18/24 8:52 PM, Xiubo Li wrote: BTW, how many active mds are you using? On 4/19/24 10:55, Erich Weiler wrote: OK, I'm sure I caught it in the right order this time, the logs should definitely show when the blocked/slow requests start. Check out these logs and dumps: http://hgwdev.gi.ucsc.edu/~weiler/ It's a 762 MB tarball but it uncompresses to 16 GB. -erich On 4/18/24 6:57 PM, Xiubo Li wrote: Okay, could you try this with 18.2.0? I just doubt it was introduced by: commit e610179a6a59c463eb3d85e87152ed3268c808ff Author: Patrick Donnelly Date: Mon Jul 17 16:10:59 2023 -0400 mds: drop locks and retry when lock set changes An optimization was added to avoid an unnecessary gather on the inode filelock when the client can safely get the file size without also getting issued the requested caps. However, if a retry of getattr is necessary, this conditional inclusion of the inode filelock can cause lock-order violations resulting in deadlock. So, if we've already acquired some of the inode's locks then we must drop locks and retry. 
Fixes: https://tracker.ceph.com/issues/62052 Fixes: c822b3e2573578c288d170d1031672b74e02dced Signed-off-by: Patrick Donnelly (cherry picked from commit b5719ac32fe6431131842d62ffaf7101c03e9bac) On 4/19/24 09:54, Erich Weiler wrote: I'm on 18.2.1. I think I may have gotten the timing off on the logs and dumps so I'll try again. Just really hard to capture because I need to kind of be looking at it in real time to capture it. Hang on, lemme see if I can get another capture... -erich On 4/18/24 6:35 PM, Xiubo Li wrote: BTW, which ceph version are you using? On 4/12/24 04:22, Erich Weiler wrote: BTW - it just happened again, I upped the debugging settings as you instructed and got more dumps (then returned the debug settings to normal). Attached are the new dumps. Thanks again, erich On 4/9/24 9:00 PM, Xiubo Li wrote: On 4/10/24 11:48, Erich Weiler wrote: Does that mean it could be the locker order bug (https://tracker.ceph.com/issues/62123) as Xiubo suggested? I have raised one PR to fix the lock order issue, if possible please try it to see if it resolves this issue. Thank you! Yeah, this issue is happening every couple days now. It just happened again today and I got more MDS dumps. If it would help, let me know and I can send them!
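For anyone trying to reproduce this, a sketch of how the blocked/slow MDS requests discussed above are usually inspected. The commands use the op-tracker admin interface and the daemon name from this thread; if 'ceph tell' is not wired up for these commands on a given build, the same calls can be made with 'ceph daemon' from inside the MDS container:

# requests currently stuck in the MDS op tracker, with their age and current state
ceph tell mds.slugfs.pr-md-01.xdtppo dump_blocked_ops
# all in-flight requests, for comparison
ceph tell mds.slugfs.pr-md-01.xdtppo dump_ops_in_flight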
[ceph-users] Re: [EXTERN] cache pressure?
Actually should I be excluding my whole cephfs filesystem? Like, if I mount it as /cephfs, should my stanza look something like: { "files.watcherExclude": { "**/.git/objects/**": true, "**/.git/subtree-cache/**": true, "**/node_modules/*/**": true, "**/.cache/**": true, "**/.conda/**": true, "**/.local/**": true, "**/.nextflow/**": true, "**/work/**": true, "**/cephfs/**": true } } On 4/27/24 12:24 AM, Dietmar Rieder wrote: Hi Erich, hope it helps. Let us know. Dietmar On April 26, 2024 15:52:06 CEST, Erich Weiler wrote: Hi Dietmar, We do in fact have a bunch of users running vscode on our HPC head node as well (in addition to a few of our general purpose interactive compute servers). I'll suggest they make the mods you referenced! Thanks for the tip. cheers, erich On 4/24/24 12:58 PM, Dietmar Rieder wrote: Hi Erich, in our case the "client failing to respond to cache pressure" situation is/was often caused by users who have vscode connecting via ssh to our HPC head node. vscode makes heavy use of file watchers and we have seen users with > 400k watchers. All these watched files must be held in the MDS cache and if you have multiple users at the same time running vscode it gets problematic. Unfortunately there is no global setting - at least none that we are aware of - for vscode to exclude certain files or directories from being watched. We asked the users to configure their vscode (Remote Settings -> Watcher Exclude) as follows: { "files.watcherExclude": { "**/.git/objects/**": true, "**/.git/subtree-cache/**": true, "**/node_modules/*/**": true, "**/.cache/**": true, "**/.conda/**": true, "**/.local/**": true, "**/.nextflow/**": true, "**/work/**": true } } ~/.vscode-server/data/Machine/settings.json To monitor and find processes with watchers you may use inotify-info <https://github.com/mikesart/inotify-info> HTH Dietmar On 4/23/24 15:47, Erich Weiler wrote: So I'm trying to figure out ways to reduce the number of warnings I'm getting and I'm thinking about the one "client failing to respond to cache pressure". Is there maybe a way to tell a client (or all clients) to reduce the amount of cache it uses or to release caches quickly? Like, all the time? I know the linux kernel (and maybe ceph) likes to cache everything for a while, and rightfully so, but I suspect in my use case it may be more efficient to more quickly purge the cache or to in general just cache way less overall...? We have many thousands of threads all doing different things that are hitting our filesystem, so I suspect the caching isn't really doing me much good anyway due to the churn, and probably is causing more problems than it is helping... -erich ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: cache pressure?
As Dietmar said, VS Code may cause this. Quite funny to read, actually, because we've been dealing with this issue for over a year, and yesterday was the very first time Ceph complained about a client and we saw VS Code's remote stuff running. Coincidence. I'm holding my breath that the vscode issue is the one affecting us - I got my users to tweak their vscode configs and the problem seemed to go away, but I guess I won't consider it 'solved' until a few days pass without it coming back... :) ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: [EXTERN] cache pressure?
Hi Dietmar, We do in fact have a bunch of users running vscode on our HPC head node as well (in addition to a few of our general purpose interactive compute servers). I'll suggest they make the mods you referenced! Thanks for the tip. cheers, erich On 4/24/24 12:58 PM, Dietmar Rieder wrote: Hi Erich, in our case the "client failing to respond to cache pressure" situation is/was often caused by users how have vscode connecting via ssh to our HPC head node. vscode makes heavy use of file watchers and we have seen users with > 400k watchers. All these watched files must be held in the MDS cache and if you have multiple users at the same time running vscode it gets problematic. Unfortunately there is no global setting - at least none that we are aware of - for vscode to exclude certain files or directories from being watched. We asked the users to configure their vscode (Remote Settings -> Watcher Exclude) as follows: { "files.watcherExclude": { "**/.git/objects/**": true, "**/.git/subtree-cache/**": true, "**/node_modules/*/**": true, "**/.cache/**": true, "**/.conda/**": true, "**/.local/**": true, "**/.nextflow/**": true, "**/work/**": true } } ~/.vscode-server/data/Machine/settings.json To monitor and find processes with watcher you may use inotify-info <https://github.com/mikesart/inotify-info> HTH Dietmar On 4/23/24 15:47, Erich Weiler wrote: So I'm trying to figure out ways to reduce the number of warnings I'm getting and I'm thinking about the one "client failing to respond to cache pressure". Is there maybe a way to tell a client (or all clients) to reduce the amount of cache it uses or to release caches quickly? Like, all the time? I know the linux kernel (and maybe ceph) likes to cache everything for a while, and rightfully so, but I suspect in my use case it may be more efficient to more quickly purge the cache or to in general just cache way less overall...? We have many thousands of threads all doing different things that are hitting our filesystem, so I suspect the caching isn't really doing me much good anyway due to the churn, and probably is causing more problems than it helping... -erich ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] cache pressure?
So I'm trying to figure out ways to reduce the number of warnings I'm getting and I'm thinking about the one "client failing to respond to cache pressure". Is there maybe a way to tell a client (or all clients) to reduce the amount of cache it uses or to release caches quickly? Like, all the time? I know the linux kernel (and maybe ceph) likes to cache everything for a while, and rightfully so, but I suspect in my use case it may be more efficient to more quickly purge the cache or to in general just cache way less overall...? We have many thousands of threads all doing different things that are hitting our filesystem, so I suspect the caching isn't really doing me much good anyway due to the churn, and probably is causing more problems than it is helping... -erich ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
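There is no single "cache less on the client" switch as far as this archive shows, but a sketch of the MDS-side knobs that bound how much client state builds up looks like this; the values are examples only, not recommendations:

# cap how many caps a single client may hold before the MDS starts recalling them
ceph config set mds mds_max_caps_per_client 100000
# how many caps the MDS asks a client to release per recall event
ceph config set mds mds_recall_max_caps 30000

The kernel cephfs client is also said to support a caps_max mount option to bound caps per mount, but that is worth verifying against the documentation for the kernel version in use.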
[ceph-users] Re: Stuck in replay?
I was able to start another MDS daemon on another node that had 512GB RAM, and then the active MDS eventually migrated there, and went through the replay (which consumed about 100 GB of RAM), and then things recovered. Phew. I guess I need significantly more RAM in my MDS servers... I had no idea the MDS daemon could require that much RAM. -erich On 4/22/24 11:41 AM, Erich Weiler wrote: possibly but it would be pretty time consuming and difficult... Is it maybe a RAM issue since my MDS RAM is filling up? Should maybe I bring up another MDS on another server with huge amount of RAM and move the MDS there in hopes it will have enough RAM to complete the replay? On 4/22/24 11:37 AM, Sake Ceph wrote: Just a question: is it possible to block or disable all clients? Just to prevent load on the system. Kind regards, Sake Op 22-04-2024 20:33 CEST schreef Erich Weiler : I also see this from 'ceph health detail': # ceph health detail HEALTH_WARN 1 filesystem is degraded; 1 MDSs report oversized cache; 1 MDSs behind on trimming [WRN] FS_DEGRADED: 1 filesystem is degraded fs slugfs is degraded [WRN] MDS_CACHE_OVERSIZED: 1 MDSs report oversized cache mds.slugfs.pr-md-01.xdtppo(mds.0): MDS cache is too large (19GB/8GB); 0 inodes in use by clients, 0 stray files [WRN] MDS_TRIM: 1 MDSs behind on trimming mds.slugfs.pr-md-01.xdtppo(mds.0): Behind on trimming (127084/250) max_segments: 250, num_segments: 127084 MDS cache too large? The mds process is taking up 22GB right now and starting to swap my server, so maybe it somehow is too large On 4/22/24 11:17 AM, Erich Weiler wrote: Hi All, We have a somewhat serious situation where we have a cephfs filesystem (18.2.1), and 2 active MDSs (one standby). ThI tried to restart one of the active daemons to unstick a bunch of blocked requests, and the standby went into 'replay' for a very long time, then RAM on that MDS server filled up, and it just stayed there for a while then eventually appeared to give up and switched to the standby, but the cycle started again. So I restarted that MDS, and now I'm in a situation where I see this: # ceph fs status slugfs - 29 clients == RANK STATE MDS ACTIVITY DNS INOS DIRS CAPS 0 replay slugfs.pr-md-01.xdtppo 3958k 57.1k 12.2k 0 1 resolve slugfs.pr-md-02.sbblqq 0 3 1 0 POOL TYPE USED AVAIL cephfs_metadata metadata 997G 2948G cephfs_md_and_data data 0 87.6T cephfs_data data 773T 175T STANDBY MDS slugfs.pr-md-03.mclckv MDS version: ceph version 18.2.1 (7fe91d5d5842e04be3b4f514d6dd990c54b29c76) reef (stable) It just stays there indefinitely. All my clients are hung. I tried restarting all MDS daemons and they just went back to this state after coming back up. Is there any way I can somehow escape this state of indefinite replay/resolve? Thanks so much! I'm kinda nervous since none of my clients have filesystem access at the moment... cheers, erich ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
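For reference, the cache target that produced the 19GB/8GB warning earlier in this thread is adjustable; a sketch of raising it to something closer to the available RAM (64 GiB here is only an example, and actual MDS RSS can run well above this target):

# raise the MDS cache memory target; the value is in bytes
ceph config set mds mds_cache_memory_limit 68719476736
# check what a specific daemon is using
ceph config show mds.slugfs.pr-md-01.xdtppo mds_cache_memory_limit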
[ceph-users] Re: Stuck in replay?
possibly but it would be pretty time consuming and difficult... Is it maybe a RAM issue since my MDS RAM is filling up? Should maybe I bring up another MDS on another server with huge amount of RAM and move the MDS there in hopes it will have enough RAM to complete the replay? On 4/22/24 11:37 AM, Sake Ceph wrote: Just a question: is it possible to block or disable all clients? Just to prevent load on the system. Kind regards, Sake Op 22-04-2024 20:33 CEST schreef Erich Weiler : I also see this from 'ceph health detail': # ceph health detail HEALTH_WARN 1 filesystem is degraded; 1 MDSs report oversized cache; 1 MDSs behind on trimming [WRN] FS_DEGRADED: 1 filesystem is degraded fs slugfs is degraded [WRN] MDS_CACHE_OVERSIZED: 1 MDSs report oversized cache mds.slugfs.pr-md-01.xdtppo(mds.0): MDS cache is too large (19GB/8GB); 0 inodes in use by clients, 0 stray files [WRN] MDS_TRIM: 1 MDSs behind on trimming mds.slugfs.pr-md-01.xdtppo(mds.0): Behind on trimming (127084/250) max_segments: 250, num_segments: 127084 MDS cache too large? The mds process is taking up 22GB right now and starting to swap my server, so maybe it somehow is too large On 4/22/24 11:17 AM, Erich Weiler wrote: Hi All, We have a somewhat serious situation where we have a cephfs filesystem (18.2.1), and 2 active MDSs (one standby). ThI tried to restart one of the active daemons to unstick a bunch of blocked requests, and the standby went into 'replay' for a very long time, then RAM on that MDS server filled up, and it just stayed there for a while then eventually appeared to give up and switched to the standby, but the cycle started again. So I restarted that MDS, and now I'm in a situation where I see this: # ceph fs status slugfs - 29 clients == RANK STATE MDS ACTIVITY DNS INOS DIRS CAPS 0 replay slugfs.pr-md-01.xdtppo 3958k 57.1k 12.2k 0 1 resolve slugfs.pr-md-02.sbblqq 0 3 1 0 POOL TYPE USED AVAIL cephfs_metadata metadata 997G 2948G cephfs_md_and_data data 0 87.6T cephfs_data data 773T 175T STANDBY MDS slugfs.pr-md-03.mclckv MDS version: ceph version 18.2.1 (7fe91d5d5842e04be3b4f514d6dd990c54b29c76) reef (stable) It just stays there indefinitely. All my clients are hung. I tried restarting all MDS daemons and they just went back to this state after coming back up. Is there any way I can somehow escape this state of indefinite replay/resolve? Thanks so much! I'm kinda nervous since none of my clients have filesystem access at the moment... cheers, erich ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: Stuck in replay?
I also see this from 'ceph health detail': # ceph health detail HEALTH_WARN 1 filesystem is degraded; 1 MDSs report oversized cache; 1 MDSs behind on trimming [WRN] FS_DEGRADED: 1 filesystem is degraded fs slugfs is degraded [WRN] MDS_CACHE_OVERSIZED: 1 MDSs report oversized cache mds.slugfs.pr-md-01.xdtppo(mds.0): MDS cache is too large (19GB/8GB); 0 inodes in use by clients, 0 stray files [WRN] MDS_TRIM: 1 MDSs behind on trimming mds.slugfs.pr-md-01.xdtppo(mds.0): Behind on trimming (127084/250) max_segments: 250, num_segments: 127084 MDS cache too large? The mds process is taking up 22GB right now and starting to swap my server, so maybe it somehow is too large On 4/22/24 11:17 AM, Erich Weiler wrote: Hi All, We have a somewhat serious situation where we have a cephfs filesystem (18.2.1), and 2 active MDSs (one standby). ThI tried to restart one of the active daemons to unstick a bunch of blocked requests, and the standby went into 'replay' for a very long time, then RAM on that MDS server filled up, and it just stayed there for a while then eventually appeared to give up and switched to the standby, but the cycle started again. So I restarted that MDS, and now I'm in a situation where I see this: # ceph fs status slugfs - 29 clients == RANK STATE MDS ACTIVITY DNS INOS DIRS CAPS 0 replay slugfs.pr-md-01.xdtppo 3958k 57.1k 12.2k 0 1 resolve slugfs.pr-md-02.sbblqq 0 3 1 0 POOL TYPE USED AVAIL cephfs_metadata metadata 997G 2948G cephfs_md_and_data data 0 87.6T cephfs_data data 773T 175T STANDBY MDS slugfs.pr-md-03.mclckv MDS version: ceph version 18.2.1 (7fe91d5d5842e04be3b4f514d6dd990c54b29c76) reef (stable) It just stays there indefinitely. All my clients are hung. I tried restarting all MDS daemons and they just went back to this state after coming back up. Is there any way I can somehow escape this state of indefinite replay/resolve? Thanks so much! I'm kinda nervous since none of my clients have filesystem access at the moment... cheers, erich ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Stuck in replay?
Hi All, We have a somewhat serious situation where we have a cephfs filesystem (18.2.1), and 2 active MDSs (one standby). I tried to restart one of the active daemons to unstick a bunch of blocked requests, and the standby went into 'replay' for a very long time, then RAM on that MDS server filled up, and it just stayed there for a while then eventually appeared to give up and switched to the standby, but the cycle started again. So I restarted that MDS, and now I'm in a situation where I see this: # ceph fs status slugfs - 29 clients == RANK STATE MDS ACTIVITY DNS INOS DIRS CAPS 0 replay slugfs.pr-md-01.xdtppo 3958k 57.1k 12.2k 0 1 resolve slugfs.pr-md-02.sbblqq 0 3 1 0 POOL TYPE USED AVAIL cephfs_metadata metadata 997G 2948G cephfs_md_and_data data 0 87.6T cephfs_data data 773T 175T STANDBY MDS slugfs.pr-md-03.mclckv MDS version: ceph version 18.2.1 (7fe91d5d5842e04be3b4f514d6dd990c54b29c76) reef (stable) It just stays there indefinitely. All my clients are hung. I tried restarting all MDS daemons and they just went back to this state after coming back up. Is there any way I can somehow escape this state of indefinite replay/resolve? Thanks so much! I'm kinda nervous since none of my clients have filesystem access at the moment... cheers, erich ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: MDS Behind on Trimming...
Hi Xiubo, Nevermind I was wrong, most the blocked ops were 12 hours old. Ug. I restarted the MDS daemon to clear them. I just reset to having one active MDS instead of two, let's see if that makes a difference. I am beginning to think it may be impossible to catch the logs that matter here. I feel like sometimes the blocked ops are just waiting because of load and sometimes they are waiting because they are stuck. But, it's really hard to tell which, without waiting a while. But, I can't wait while having debug turned on because my root disks (which are 150 GB large) fill up with debug logs in 20 minutes. So it almost seems that unless I could somehow store many TB of debug logs we won't be able to catch this. Let's see how having one MDS helps. Or maybe I actually need like 4 MDSs because the load is too high for only one or two. I don't know. Or maybe it's the lock issue you've been working on. I guess I can test the lock order fix when it's available to test. -erich On 4/19/24 7:26 AM, Erich Weiler wrote: So I woke up this morning and checked the blocked_ops again, there were 150 of them. But the age of each ranged from 500 to 4300 seconds. So it seems as if they are eventually being processed. I wonder if we are thinking about this in the wrong way? Maybe I should be *adding* MDS daemons because my current ones are overloaded? Can a single server hold multiple MDS daemons? Right now I have three physical servers each with one MDS daemon on it. I can still try reducing to one. And I'll keep an eye on blocked ops to see if any get to a very old age (and are thus wedged). -erich On 4/18/24 8:55 PM, Xiubo Li wrote: Okay, please try it to set only one active mds. On 4/19/24 11:54, Erich Weiler wrote: We have 2 active MDS daemons and one standby. On 4/18/24 8:52 PM, Xiubo Li wrote: BTW, how man active mds you are using ? On 4/19/24 10:55, Erich Weiler wrote: OK, I'm sure I caught it in the right order this time, the logs should definitely show when the blocked/slow requests start. Check out these logs and dumps: http://hgwdev.gi.ucsc.edu/~weiler/ It's a 762 MB tarball but it uncompresses to 16 GB. -erichll On 4/18/24 6:57 PM, Xiubo Li wrote: Okay, could you try this with 18.2.0 ? I just double it was introduce by: commit e610179a6a59c463eb3d85e87152ed3268c808ff Author: Patrick Donnelly Date: Mon Jul 17 16:10:59 2023 -0400 mds: drop locks and retry when lock set changes An optimization was added to avoid an unnecessary gather on the inode filelock when the client can safely get the file size without also getting issued the requested caps. However, if a retry of getattr is necessary, this conditional inclusion of the inode filelock can cause lock-order violations resulting in deadlock. So, if we've already acquired some of the inode's locks then we must drop locks and retry. Fixes: https://tracker.ceph.com/issues/62052 Fixes: c822b3e2573578c288d170d1031672b74e02dced Signed-off-by: Patrick Donnelly (cherry picked from commit b5719ac32fe6431131842d62ffaf7101c03e9bac) On 4/19/24 09:54, Erich Weiler wrote: I'm on 18.2.1. I think I may have gotten the timing off on the logs and dumps so I'll try again. Just really hard to capture because I need to kind of be looking at it in real time to capture it. Hang on, lemme see if I can get another capture... -erich On 4/18/24 6:35 PM, Xiubo Li wrote: BTW, which ceph version you are using ? 
On 4/12/24 04:22, Erich Weiler wrote: BTW - it just happened again, I upped the debugging settings as you instructed and got more dumps (then returned the debug settings to normal). Attached are the new dumps. Thanks again, erich On 4/9/24 9:00 PM, Xiubo Li wrote: On 4/10/24 11:48, Erich Weiler wrote: Does that mean it could be the locker order bug (https://tracker.ceph.com/issues/62123) as Xiubo suggested? I have raised one PR to fix the lock order issue, if possible please try it to see if it resolves this issue. Thank you! Yeah, this issue is happening every couple days now. It just happened again today and I got more MDS dumps. If it would help, let me know and I can send them! Once this happens, it would be better if you could enable the mds debug logs: debug mds = 20 debug ms = 1 And then provide the debug logs together with the MDS dumps. I assume if this fix is approved and backported it will then appear in like 18.2.3 or something? Yeah, it will be backported after being well tested. - Xiubo Thanks again, erich ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
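A sketch of how the debug settings Xiubo asks for can be toggled at runtime and reverted, which keeps the window of heavy log growth short; the revert values shown are the usual defaults, and naming follows the rest of this thread:

# turn up MDS debugging only while reproducing the problem
ceph config set mds debug_mds 20
ceph config set mds debug_ms 1
# ...reproduce and collect logs, then put the levels back
ceph config set mds debug_mds 1/5
ceph config set mds debug_ms 0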
[ceph-users] Re: Question about PR merge
Have you already shared information about this issue? Please do if not. I am working with Xiubo Li and providing debugging information - in progress! I was wondering if it would be included in 18.2.3 which I *think* should be released soon? Is there any way of knowing if that is true? This PR is primarily a debugging tool. It will not make 18.2.3 as it's not even merged to main yet. Ah, OK. I hope some solution can be had soon for this item if Xiubo figures it out - it's requiring constant attention to keep my filesystem from hanging, or restarting MDS daemons multiple times a day to "unstick" the filesystem on random cluster nodes. We think it's due to lock contention/deadlocking. Possibly it's not affecting others as much as me... We have an HPC cluster hammering the filesystem (18.2.1) and the MDS daemons seem to be reporting lock issues pretty frequently while nodes and processes fight to get file and directory locks and (we think) deadlock. I'll keep working with Xiubo. -erich ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Question about PR merge
Hello, We are tracking PR #56805: https://github.com/ceph/ceph/pull/56805 And the resolution of this item would potentially fix a pervasive and ongoing issue that needs daily attention in our cephfs cluster. I was wondering if it would be included in 18.2.3 which I *think* should be released soon? Is there any way of knowing if that is true? Thanks again, erich ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] How to make config changes stick for MDS?
Hi All, I'm having a crazy time getting config items to stick on my MDS daemons. I'm running Reef 18.2.1 on RHEL 9 and the daemons are running in podman, I used cephadm to deploy the daemons. I can adjust the config items in runtime, like so: ceph tell mds.slugfs.pr-md-01.xdtppo config set mds_bal_interval -1 But for the life of me I cannot get that to stick when I restart the MDS daemon. I've tried adding this to /etc/ceph/ceph.conf in the host server: [mds] mds_bal_interval = -1 But that doesn't get picked up on daemon restart. I also added the same config segment to /etc/ceph/ceph.conf *inside* the container, no dice, still doesn't stick. I even tried adding it to /var/lib/ceph//config/ceph.conf and it *still* doesn't stick across daemon restarts. Does anyone know how I can get MDS config items to stick across daemon reboots when the daemon is running in podman under RHEL? Thanks much! -erich ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
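One approach that generally survives restarts with cephadm/podman deployments is the cluster's centralized config database rather than ceph.conf on the host or inside the container; a sketch using the option from the post above:

# store the option in the mon config database; daemons read it at startup
ceph config set mds mds_bal_interval -1
# verify what is stored and what a running daemon actually uses
ceph config get mds mds_bal_interval
ceph config show mds.slugfs.pr-md-01.xdtppo mds_bal_interval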
[ceph-users] Re: MDS Behind on Trimming...
Or... Maybe the fix will first appear in the "centos-ceph-reef-test" repo that I see? Is that how RedHat usually does it? On 4/11/24 10:30, Erich Weiler wrote: I guess we are specifically using the "centos-ceph-reef" repository, and it looks like the latest version in that repo is 18.2.2-1.el9s. Will this fix appear in 18.2.2-2.el9s or something like that? I don't know how often the release cycle updates the repos...? On 4/11/24 09:40, Erich Weiler wrote: I have raised one PR to fix the lock order issue, if possible please have a try to see could it resolve this issue. That's great! When do you think that will be available? Thank you! Yeah, this issue is happening every couple days now. It just happened again today and I got more MDS dumps. If it would help, let me know and I can send them! Once this happen if you could enable the mds debug logs will be better: debug mds = 20 debug ms = 1 And then provide the debug logs together with the MDS dumps. OK next time I see it I'll do that. -erich ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: MDS Behind on Trimming...
I guess we are specifically using the "centos-ceph-reef" repository, and it looks like the latest version in that repo is 18.2.2-1.el9s. Will this fix appear in 18.2.2-2.el9s or something like that? I don't know how often the release cycle updates the repos...? On 4/11/24 09:40, Erich Weiler wrote: I have raised one PR to fix the lock order issue, if possible please have a try to see could it resolve this issue. That's great! When do you think that will be available? Thank you! Yeah, this issue is happening every couple days now. It just happened again today and I got more MDS dumps. If it would help, let me know and I can send them! Once this happen if you could enable the mds debug logs will be better: debug mds = 20 debug ms = 1 And then provide the debug logs together with the MDS dumps. OK next time I see it I'll do that. -erich ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: MDS Behind on Trimming...
I have raised one PR to fix the lock order issue, if possible please have a try to see could it resolve this issue. That's great! When do you think that will be available? Thank you! Yeah, this issue is happening every couple days now. It just happened again today and I got more MDS dumps. If it would help, let me know and I can send them! Once this happen if you could enable the mds debug logs will be better: debug mds = 20 debug ms = 1 And then provide the debug logs together with the MDS dumps. OK next time I see it I'll do that. -erich ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: MDS Behind on Trimming...
Does that mean it could be the locker order bug (https://tracker.ceph.com/issues/62123) as Xiubo suggested? I have raised one PR to fix the lock order issue, if possible please try it to see if it resolves this issue. Thank you! Yeah, this issue is happening every couple days now. It just happened again today and I got more MDS dumps. If it would help, let me know and I can send them! I assume if this fix is approved and backported it will then appear in like 18.2.3 or something? Thanks again, erich ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: MDS Behind on Trimming...
Ah, I see. Yes, we are already running version 18.2.1 on the server side (we just installed this cluster a few weeks ago from scratch). So I guess if the fix has already been backported to that version, then we still have a problem. Dos that mean it could be the locker order bug (https://tracker.ceph.com/issues/62123) as Xiubo suggested? Thanks again, Erich > On Apr 7, 2024, at 9:00 PM, Alexander E. Patrakov wrote: > > Hi Erich, > >> On Mon, Apr 8, 2024 at 11:51 AM Erich Weiler wrote: >> >> Hi Xiubo, >> >>> Thanks for your logs, and it should be the same issue with >>> https://tracker.ceph.com/issues/62052, could you try to test with this >>> fix again ? >> >> This sounds good - but I'm not clear on what I should do? I see a patch >> in that tracker page, is that what you are referring to? If so, how >> would I apply such a patch? Or is there simply a binary update I can >> apply somehow to the MDS server software? > > The backport of this patch (https://github.com/ceph/ceph/pull/53241) > was merged on October 18, 2023, and Ceph 18.2.1 was released on > December 18, 2023. Therefore, if you are running Ceph 18.2.1 on the > server side, you already have the fix. If you are already running > version 18.2.1 or 18.2.2 (to which you should upgrade anyway), please > complain, as the purported fix is then ineffective. > >> >> Thanks for helping! >> >> -erich >> >>> Please let me know if you still could see this bug then it should be the >>> locker order bug as https://tracker.ceph.com/issues/62123. >>> >>> Thanks >>> >>> - Xiubo >>> >>> >>> On 3/28/24 04:03, Erich Weiler wrote: >>>> Hi All, >>>> >>>> I've been battling this for a while and I'm not sure where to go from >>>> here. I have a Ceph health warning as such: >>>> >>>> # ceph -s >>>> cluster: >>>>id: 58bde08a-d7ed-11ee-9098-506b4b4da440 >>>>health: HEALTH_WARN >>>>1 MDSs report slow requests >>>>1 MDSs behind on trimming >>>> >>>> services: >>>>mon: 5 daemons, quorum >>>> pr-md-01,pr-md-02,pr-store-01,pr-store-02,pr-md-03 (age 5d) >>>>mgr: pr-md-01.jemmdf(active, since 3w), standbys: pr-md-02.emffhz >>>>mds: 1/1 daemons up, 2 standby >>>>osd: 46 osds: 46 up (since 9h), 46 in (since 2w) >>>> >>>> data: >>>>volumes: 1/1 healthy >>>>pools: 4 pools, 1313 pgs >>>>objects: 260.72M objects, 466 TiB >>>>usage: 704 TiB used, 424 TiB / 1.1 PiB avail >>>>pgs: 1306 active+clean >>>> 4active+clean+scrubbing+deep >>>> 3active+clean+scrubbing >>>> >>>> io: >>>>client: 123 MiB/s rd, 75 MiB/s wr, 109 op/s rd, 1.40k op/s wr >>>> >>>> And the specifics are: >>>> >>>> # ceph health detail >>>> HEALTH_WARN 1 MDSs report slow requests; 1 MDSs behind on trimming >>>> [WRN] MDS_SLOW_REQUEST: 1 MDSs report slow requests >>>>mds.slugfs.pr-md-01.xdtppo(mds.0): 99 slow requests are blocked > >>>> 30 secs >>>> [WRN] MDS_TRIM: 1 MDSs behind on trimming >>>>mds.slugfs.pr-md-01.xdtppo(mds.0): Behind on trimming (13884/250) >>>> max_segments: 250, num_segments: 13884 >>>> >>>> That "num_segments" number slowly keeps increasing. I suspect I just >>>> need to tell the MDS servers to trim faster but after hours of >>>> googling around I just can't figure out the best way to do it. The >>>> best I could come up with was to decrease "mds_cache_trim_decay_rate" >>>> from 1.0 to .8 (to start), based on this page: >>>> >>>> https://www.suse.com/support/kb/doc/?id=19740 >>>> >>>> But it doesn't seem to help, maybe I should decrease it further? I am >>>> guessing this must be a common issue...? I am running Reef on the MDS >>>> servers, but most clients are on Quincy. 
>>>> >>>> Thanks for any advice! >>>> >>>> cheers, >>>> erich >>>> ___ >>>> ceph-users mailing list -- ceph-users@ceph.io >>>> To unsubscribe send an email to ceph-users-le...@ceph.io >>>> >>> >> ___ >> ceph-users mailing list -- ceph-users@ceph.io >> To unsubscribe send an email to ceph-users-le...@ceph.io > > > > -- > Alexander E. Patrakov ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: MDS Behind on Trimming...
Hi Xiubo, Thanks for your logs, and it should be the same issue with https://tracker.ceph.com/issues/62052, could you try to test with this fix again ? This sounds good - but I'm not clear on what I should do? I see a patch in that tracker page, is that what you are referring to? If so, how would I apply such a patch? Or is there simply a binary update I can apply somehow to the MDS server software? Thanks for helping! -erich Please let me know if you still could see this bug then it should be the locker order bug as https://tracker.ceph.com/issues/62123. Thanks - Xiubo On 3/28/24 04:03, Erich Weiler wrote: Hi All, I've been battling this for a while and I'm not sure where to go from here. I have a Ceph health warning as such: # ceph -s cluster: id: 58bde08a-d7ed-11ee-9098-506b4b4da440 health: HEALTH_WARN 1 MDSs report slow requests 1 MDSs behind on trimming services: mon: 5 daemons, quorum pr-md-01,pr-md-02,pr-store-01,pr-store-02,pr-md-03 (age 5d) mgr: pr-md-01.jemmdf(active, since 3w), standbys: pr-md-02.emffhz mds: 1/1 daemons up, 2 standby osd: 46 osds: 46 up (since 9h), 46 in (since 2w) data: volumes: 1/1 healthy pools: 4 pools, 1313 pgs objects: 260.72M objects, 466 TiB usage: 704 TiB used, 424 TiB / 1.1 PiB avail pgs: 1306 active+clean 4 active+clean+scrubbing+deep 3 active+clean+scrubbing io: client: 123 MiB/s rd, 75 MiB/s wr, 109 op/s rd, 1.40k op/s wr And the specifics are: # ceph health detail HEALTH_WARN 1 MDSs report slow requests; 1 MDSs behind on trimming [WRN] MDS_SLOW_REQUEST: 1 MDSs report slow requests mds.slugfs.pr-md-01.xdtppo(mds.0): 99 slow requests are blocked > 30 secs [WRN] MDS_TRIM: 1 MDSs behind on trimming mds.slugfs.pr-md-01.xdtppo(mds.0): Behind on trimming (13884/250) max_segments: 250, num_segments: 13884 That "num_segments" number slowly keeps increasing. I suspect I just need to tell the MDS servers to trim faster but after hours of googling around I just can't figure out the best way to do it. The best I could come up with was to decrease "mds_cache_trim_decay_rate" from 1.0 to .8 (to start), based on this page: https://www.suse.com/support/kb/doc/?id=19740 But it doesn't seem to help, maybe I should decrease it further? I am guessing this must be a common issue...? I am running Reef on the MDS servers, but most clients are on Quincy. Thanks for any advice! cheers, erich ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
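For reference while this is being debugged, a sketch of how the trim-related options mentioned in the quoted message are usually adjusted at runtime; the values are examples rather than recommendations, and they treat the symptom rather than the underlying bug:

# allow more journal segments before the warning, and let cache trimming run more aggressively
ceph config set mds mds_log_max_segments 256
ceph config set mds mds_cache_trim_decay_rate 0.8
# watch whether num_segments starts coming down
ceph health detail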
[ceph-users] Multiple MDS Daemon needed?
Hi All, We have a slurm cluster with 25 clients, each with 256 cores, each mounting a cephfs filesystem as their main storage target. The workload can be heavy at times. We have two active MDS daemons and one standby. A lot of the time everything is healthy but we sometimes get warnings about MDS daemons being slow on requests, behind on trimming, etc. I realize there may be a bug in play, but also, I was wondering if we simply didn't have enough MDS daemons to handle the load. Is there a way to know if adding an MDS daemon would help? We could add a third active MDS if needed. But I don't want to start adding a bunch of MDS's if that won't help. The OSD servers seem fine. It's mainly the MDS instances that are complaining. We are running reef 18.2.1. For reference, when things look healthy: # ceph fs status slugfs slugfs - 34 clients == RANK STATE MDS ACTIVITY DNS INOS DIRS CAPS 0 active slugfs.pr-md-03.mclckv Reqs: 273 /s 2759k 2636k 362k 1079k 1 active slugfs.pr-md-01.xdtppo Reqs: 194 /s 868k 674k 67.3k 351k POOL TYPE USED AVAIL cephfs_metadata metadata 127G 3281G cephfs_md_and_data data 0 98.3T cephfs_data data 740T 196T STANDBY MDS slugfs.pr-md-02.sbblqq MDS version: ceph version 18.2.1 (7fe91d5d5842e04be3b4f514d6dd990c54b29c76) reef (stable) # ceph -s cluster: id: 58bde08a-d7ed-11ee-9098-506b4b4da440 health: HEALTH_OK services: mon: 5 daemons, quorum pr-md-01,pr-md-02,pr-store-01,pr-store-02,pr-md-03 (age 5d) mgr: pr-md-01.jemmdf(active, since 5w), standbys: pr-md-02.emffhz mds: 2/2 daemons up, 1 standby osd: 46 osds: 46 up (since 8d), 46 in (since 4w) data: volumes: 1/1 healthy pools: 4 pools, 1313 pgs objects: 271.17M objects, 493 TiB usage: 744 TiB used, 384 TiB / 1.1 PiB avail pgs: 1307 active+clean 4 active+clean+scrubbing 2 active+clean+scrubbing+deep io: client: 39 MiB/s rd, 108 MiB/s wr, 1.96k op/s rd, 54 op/s wr But when things are in "warning" mode, it looks like this: # ceph -s cluster: id: 58bde08a-d7ed-11ee-9098-506b4b4da440 health: HEALTH_WARN 1 filesystem is degraded 1 clients failing to advance oldest client/flush tid 1 MDSs report slow requests 1 MDSs behind on trimming services: mon: 5 daemons, quorum pr-md-01,pr-md-02,pr-store-01,pr-store-02,pr-md-03 (age 5d) mgr: pr-md-01.jemmdf(active, since 5w), standbys: pr-md-02.emffhz mds: 2/2 daemons up, 1 standby osd: 46 osds: 46 up (since 8d), 46 in (since 4w) data: volumes: 1/1 healthy pools: 4 pools, 1313 pgs objects: 271.28M objects, 494 TiB usage: 746 TiB used, 382 TiB / 1.1 PiB avail pgs: 1307 active+clean 5 active+clean+scrubbing 1 active+clean+scrubbing+deep io: client: 55 MiB/s rd, 2.6 MiB/s wr, 15 op/s rd, 46 op/s wr And this: # ceph health detail HEALTH_WARN 2 clients failing to advance oldest client/flush tid; 2 MDSs report slow requests; 1 MDSs behind on trimming [WRN] MDS_CLIENT_OLDEST_TID: 2 clients failing to advance oldest client/flush tid mds.slugfs.pr-md-01.xdtppo(mds.0): Client phoenix-06.prism failing to advance its oldest client/flush tid. client_id: 125780 mds.slugfs.pr-md-02.sbblqq(mds.1): Client phoenix-00.prism failing to advance its oldest client/flush tid. client_id: 99385 [WRN] MDS_SLOW_REQUEST: 2 MDSs report slow requests mds.slugfs.pr-md-01.xdtppo(mds.0): 4 slow requests are blocked > 30 secs mds.slugfs.pr-md-02.sbblqq(mds.1): 67 slow requests are blocked > 30 secs [WRN] MDS_TRIM: 1 MDSs behind on trimming mds.slugfs.pr-md-02.sbblqq(mds.1): Behind on trimming (109410/250) max_segments: 250, num_segments: 109410 The "cure" is to restart the active MDS daemons, one at a time. 
Then everything becomes healthy again, for a time. We also have the following MDS config items in play: mds_cache_memory_limit = 8589934592 mds_cache_trim_decay_rate = .6 mds_log_max_segments = 250 Thanks for any pointers! cheers, erich ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
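If the decision is to try scaling out, a sketch of how an active MDS rank is added or removed; the filesystem name is the one from this thread, and cephadm needs enough MDS daemons deployed to still leave a standby:

# go from 2 to 3 active ranks; a standby gets promoted
ceph fs set slugfs max_mds 3
# make sure enough MDS daemons exist (e.g. 4 = 3 active + 1 standby)
ceph orch apply mds slugfs --placement=4
# drop back to 2 active ranks if it does not help
ceph fs set slugfs max_mds 2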
[ceph-users] Re: MDS Behind on Trimming...
Could there be an issue with the fact that the servers (MDS, MGR, MON, OSD) are running reef and all the clients are running quincy? I can easily enough get the new reef repo in for all our clients (Ubuntu 22.04) and upgrade the clients to reef if that might help..? On 3/28/24 3:05 PM, Erich Weiler wrote: I asked the user and they said no, no rsync involved. Although I rsync'd 500TB into this filesystem in the beginning without incident, so hopefully it's not a big deal here. I'm asking the user what their workflow does to try and pin this down. Are there any other known reason why a slow request would start on a certain inode, then block a bunch of cache segments behind it, until the MDS is restarted? Once I restart the MDS daemon that is slow, it shows the cache segments transfer to the other MDS server and very quickly drop to zero, then everything is healthy again, the stuck directory in question responds again and all is well. Then a few hours later it started happening again (not always the same directory). I hope I'm not experiencing a bug, but I can't see what would be causing this... On 3/28/24 2:37 PM, Alexander E. Patrakov wrote: Hello Erich, Does the workload, by any chance, involve rsync? It is unfortunately well-known for triggering such issues. A workaround is to export the directory via NFS and run rsync against the NFS mount instead of directly against CephFS. On Fri, Mar 29, 2024 at 4:58 AM Erich Weiler wrote: MDS logs show: Mar 28 13:42:29 pr-md-02.prism ceph-mds[1464328]: log_channel(cluster) log [WRN] : 16 slow requests, 0 included below; oldest blocked for > 3676.400077 secs Mar 28 13:42:30 pr-md-02.prism ceph-mds[1464328]: mds.slugfs.pr-md-02.sbblqq Updating MDS map to version 22775 from mon.3 Mar 28 13:42:34 pr-md-02.prism ceph-mds[1464328]: log_channel(cluster) log [WRN] : 320 slow requests, 5 included below; oldest blocked for > 3681.400104 secs Mar 28 13:42:34 pr-md-02.prism ceph-mds[1464328]: log_channel(cluster) log [WRN] : slow request 3668.805732 seconds old, received at 2024-03-28T19:41:25.772531+: client_request(client.99375:574268 getattr AsXsFs #0x1000c097307 2024-03-28T19:41:25.770954+ caller_uid=30150, caller_gid=600{600,608,999,}) currently joining batch getattr Mar 28 13:42:34 pr-md-02.prism ceph-mds[1464328]: log_channel(cluster) log [WRN] : slow request 3667.883853 seconds old, received at 2024-03-28T19:41:26.694410+: client_request(client.99390:374844 getattr AsXsFs #0x1000c097307 2024-03-28T19:41:26.696172+ caller_uid=30150, caller_gid=600{600,608,999,}) currently joining batch getattr Mar 28 13:42:34 pr-md-02.prism ceph-mds[1464328]: log_channel(cluster) log [WRN] : slow request 3663.724571 seconds old, received at 2024-03-28T19:41:30.853692+: client_request(client.99390:375258 getattr AsXsFs #0x1000c097307 2024-03-28T19:41:30.852166+ caller_uid=30150, caller_gid=600{600,608,999,}) currently joining batch getattr Mar 28 13:42:34 pr-md-02.prism ceph-mds[1464328]: log_channel(cluster) log [WRN] : slow request 3681.399582 seconds old, received at 2024-03-28T19:41:13.178681+: client_request(client.99385:11712080 getattr AsXsFs #0x1000c097307 2024-03-28T19:41:13.178764+ caller_uid=30150, caller_gid=600{600,608,999,}) currently failed to rdlock, waiting Mar 28 13:42:34 pr-md-02.prism ceph-mds[1464328]: log_channel(cluster) log [WRN] : slow request 3680.508972 seconds old, received at 2024-03-28T19:41:14.069291+: client_request(client.99385:11712556 getattr AsXsFs #0x1000c097307 2024-03-28T19:41:14.070764+ caller_uid=30150, caller_gid=600{600,608,999,}) 
currently joining batch getattr The client IDs map to several of our cluster nodes but the inode reference always refers to the same directory in these recent logs: /private/groups/shapirolab/brock/r2/cactus_coord That directory does not respond to an 'ls', but other directories directly above it do just fine. Maybe it's a bad cache item on the MDS? # ceph health detail HEALTH_WARN 2 clients failing to advance oldest client/flush tid; 1 MDSs report slow requests; 1 MDSs behind on trimming [WRN] MDS_CLIENT_OLDEST_TID: 2 clients failing to advance oldest client/flush tid mds.slugfs.pr-md-02.sbblqq(mds.0): Client mustard failing to advance its oldest client/flush tid. client_id: 101305 mds.slugfs.pr-md-01.xdtppo(mds.1): Client failing to advance its oldest client/flush tid. client_id: 101305 [WRN] MDS_SLOW_REQUEST: 1 MDSs report slow requests mds.slugfs.pr-md-02.sbblqq(mds.0): 201 slow requests are blocked > 30 secs [WRN] MDS_TRIM: 1 MDSs behind on trimming mds.slugfs.pr-md-02.sbblqq(mds.0): Behind on trimming (4786/250) max_segments: 250, num_segments: 4786 I think that this is somehow causing the "slow requests", on the nodes listed in the logs, as that directory in inaccessible. And maybe the 'behind on trimming' part is also related, as it can
[ceph-users] Re: MDS Behind on Trimming...
I asked the user and they said no, no rsync involved. Although I rsync'd 500TB into this filesystem in the beginning without incident, so hopefully it's not a big deal here. I'm asking the user what their workflow does to try and pin this down. Are there any other known reason why a slow request would start on a certain inode, then block a bunch of cache segments behind it, until the MDS is restarted? Once I restart the MDS daemon that is slow, it shows the cache segments transfer to the other MDS server and very quickly drop to zero, then everything is healthy again, the stuck directory in question responds again and all is well. Then a few hours later it started happening again (not always the same directory). I hope I'm not experiencing a bug, but I can't see what would be causing this... On 3/28/24 2:37 PM, Alexander E. Patrakov wrote: Hello Erich, Does the workload, by any chance, involve rsync? It is unfortunately well-known for triggering such issues. A workaround is to export the directory via NFS and run rsync against the NFS mount instead of directly against CephFS. On Fri, Mar 29, 2024 at 4:58 AM Erich Weiler wrote: MDS logs show: Mar 28 13:42:29 pr-md-02.prism ceph-mds[1464328]: log_channel(cluster) log [WRN] : 16 slow requests, 0 included below; oldest blocked for > 3676.400077 secs Mar 28 13:42:30 pr-md-02.prism ceph-mds[1464328]: mds.slugfs.pr-md-02.sbblqq Updating MDS map to version 22775 from mon.3 Mar 28 13:42:34 pr-md-02.prism ceph-mds[1464328]: log_channel(cluster) log [WRN] : 320 slow requests, 5 included below; oldest blocked for > 3681.400104 secs Mar 28 13:42:34 pr-md-02.prism ceph-mds[1464328]: log_channel(cluster) log [WRN] : slow request 3668.805732 seconds old, received at 2024-03-28T19:41:25.772531+: client_request(client.99375:574268 getattr AsXsFs #0x1000c097307 2024-03-28T19:41:25.770954+ caller_uid=30150, caller_gid=600{600,608,999,}) currently joining batch getattr Mar 28 13:42:34 pr-md-02.prism ceph-mds[1464328]: log_channel(cluster) log [WRN] : slow request 3667.883853 seconds old, received at 2024-03-28T19:41:26.694410+: client_request(client.99390:374844 getattr AsXsFs #0x1000c097307 2024-03-28T19:41:26.696172+ caller_uid=30150, caller_gid=600{600,608,999,}) currently joining batch getattr Mar 28 13:42:34 pr-md-02.prism ceph-mds[1464328]: log_channel(cluster) log [WRN] : slow request 3663.724571 seconds old, received at 2024-03-28T19:41:30.853692+: client_request(client.99390:375258 getattr AsXsFs #0x1000c097307 2024-03-28T19:41:30.852166+ caller_uid=30150, caller_gid=600{600,608,999,}) currently joining batch getattr Mar 28 13:42:34 pr-md-02.prism ceph-mds[1464328]: log_channel(cluster) log [WRN] : slow request 3681.399582 seconds old, received at 2024-03-28T19:41:13.178681+: client_request(client.99385:11712080 getattr AsXsFs #0x1000c097307 2024-03-28T19:41:13.178764+ caller_uid=30150, caller_gid=600{600,608,999,}) currently failed to rdlock, waiting Mar 28 13:42:34 pr-md-02.prism ceph-mds[1464328]: log_channel(cluster) log [WRN] : slow request 3680.508972 seconds old, received at 2024-03-28T19:41:14.069291+: client_request(client.99385:11712556 getattr AsXsFs #0x1000c097307 2024-03-28T19:41:14.070764+ caller_uid=30150, caller_gid=600{600,608,999,}) currently joining batch getattr The client IDs map to several of our cluster nodes but the inode reference always refers to the same directory in these recent logs: /private/groups/shapirolab/brock/r2/cactus_coord That directory does not respond to an 'ls', but other directories directly above it do just 
fine. Maybe it's a bad cache item on the MDS? # ceph health detail HEALTH_WARN 2 clients failing to advance oldest client/flush tid; 1 MDSs report slow requests; 1 MDSs behind on trimming [WRN] MDS_CLIENT_OLDEST_TID: 2 clients failing to advance oldest client/flush tid mds.slugfs.pr-md-02.sbblqq(mds.0): Client mustard failing to advance its oldest client/flush tid. client_id: 101305 mds.slugfs.pr-md-01.xdtppo(mds.1): Client failing to advance its oldest client/flush tid. client_id: 101305 [WRN] MDS_SLOW_REQUEST: 1 MDSs report slow requests mds.slugfs.pr-md-02.sbblqq(mds.0): 201 slow requests are blocked > 30 secs [WRN] MDS_TRIM: 1 MDSs behind on trimming mds.slugfs.pr-md-02.sbblqq(mds.0): Behind on trimming (4786/250) max_segments: 250, num_segments: 4786 I think this is somehow causing the "slow requests" on the nodes listed in the logs, as that directory is inaccessible. And maybe the 'behind on trimming' part is also related, as it can't trim past that inode or something? If I restart the MDS daemon this will clear (I've done it before). But it just comes back. Often somewhere in the same directory /private/groups/shapirolab/brock/...[something]. -erich On 3/28/24 10:11 AM, Erich Weiler wrote: Here are some of the MDS logs:
[ceph-users] Re: MDS Behind on Trimming...
MDS logs show: Mar 28 13:42:29 pr-md-02.prism ceph-mds[1464328]: log_channel(cluster) log [WRN] : 16 slow requests, 0 included below; oldest blocked for > 3676.400077 secs Mar 28 13:42:30 pr-md-02.prism ceph-mds[1464328]: mds.slugfs.pr-md-02.sbblqq Updating MDS map to version 22775 from mon.3 Mar 28 13:42:34 pr-md-02.prism ceph-mds[1464328]: log_channel(cluster) log [WRN] : 320 slow requests, 5 included below; oldest blocked for > 3681.400104 secs Mar 28 13:42:34 pr-md-02.prism ceph-mds[1464328]: log_channel(cluster) log [WRN] : slow request 3668.805732 seconds old, received at 2024-03-28T19:41:25.772531+: client_request(client.99375:574268 getattr AsXsFs #0x1000c097307 2024-03-28T19:41:25.770954+ caller_uid=30150, caller_gid=600{600,608,999,}) currently joining batch getattr Mar 28 13:42:34 pr-md-02.prism ceph-mds[1464328]: log_channel(cluster) log [WRN] : slow request 3667.883853 seconds old, received at 2024-03-28T19:41:26.694410+: client_request(client.99390:374844 getattr AsXsFs #0x1000c097307 2024-03-28T19:41:26.696172+ caller_uid=30150, caller_gid=600{600,608,999,}) currently joining batch getattr Mar 28 13:42:34 pr-md-02.prism ceph-mds[1464328]: log_channel(cluster) log [WRN] : slow request 3663.724571 seconds old, received at 2024-03-28T19:41:30.853692+: client_request(client.99390:375258 getattr AsXsFs #0x1000c097307 2024-03-28T19:41:30.852166+ caller_uid=30150, caller_gid=600{600,608,999,}) currently joining batch getattr Mar 28 13:42:34 pr-md-02.prism ceph-mds[1464328]: log_channel(cluster) log [WRN] : slow request 3681.399582 seconds old, received at 2024-03-28T19:41:13.178681+: client_request(client.99385:11712080 getattr AsXsFs #0x1000c097307 2024-03-28T19:41:13.178764+ caller_uid=30150, caller_gid=600{600,608,999,}) currently failed to rdlock, waiting Mar 28 13:42:34 pr-md-02.prism ceph-mds[1464328]: log_channel(cluster) log [WRN] : slow request 3680.508972 seconds old, received at 2024-03-28T19:41:14.069291+: client_request(client.99385:11712556 getattr AsXsFs #0x1000c097307 2024-03-28T19:41:14.070764+ caller_uid=30150, caller_gid=600{600,608,999,}) currently joining batch getattr The client IDs map to several of our cluster nodes but the inode reference always refers to the same directory in these recent logs: /private/groups/shapirolab/brock/r2/cactus_coord That directory does not respond to an 'ls', but other directories directly above it do just fine. Maybe it's a bad cache item on the MDS? # ceph health detail HEALTH_WARN 2 clients failing to advance oldest client/flush tid; 1 MDSs report slow requests; 1 MDSs behind on trimming [WRN] MDS_CLIENT_OLDEST_TID: 2 clients failing to advance oldest client/flush tid mds.slugfs.pr-md-02.sbblqq(mds.0): Client mustard failing to advance its oldest client/flush tid. client_id: 101305 mds.slugfs.pr-md-01.xdtppo(mds.1): Client failing to advance its oldest client/flush tid. client_id: 101305 [WRN] MDS_SLOW_REQUEST: 1 MDSs report slow requests mds.slugfs.pr-md-02.sbblqq(mds.0): 201 slow requests are blocked > 30 secs [WRN] MDS_TRIM: 1 MDSs behind on trimming mds.slugfs.pr-md-02.sbblqq(mds.0): Behind on trimming (4786/250) max_segments: 250, num_segments: 4786 I think that this is somehow causing the "slow requests", on the nodes listed in the logs, as that directory in inaccessible. And maybe the 'behind on trimming' part is also related, as it can't trim past that inode or something? If I restart the MDS daemon this will clear (I've done it before). But it just comes back. 
Often somewhere in the same directory /private/groups/shapirolab/brock/...[something]. -erich On 3/28/24 10:11 AM, Erich Weiler wrote: Here are some of the MDS logs: Mar 27 11:58:25 pr-md-01.prism ceph-mds[1296468]: log_channel(cluster) log [WRN] : slow request 511.703289 seconds old, received at 2024-03-27T18:49:53.623192+: client_request(client.99375:459393 getattr AsXsFs #0x100081b9ceb 2024-03-27T18:49:53.620806+ caller_uid=30150, caller_gid=600{600,608,999,}) currently joining batch getattr Mar 27 11:58:25 pr-md-01.prism ceph-mds[1296468]: log_channel(cluster) log [WRN] : slow request 690.189459 seconds old, received at 2024-03-27T18:46:55.137022+: client_request(client.99445:4189994 getattr AsXsFs #0x100081b9ceb 2024-03-27T18:46:55.134857+ caller_uid=30150, caller_gid=600{600,608,999,}) currently joining batch getattr Mar 27 11:58:25 pr-md-01.prism ceph-mds[1296468]: log_channel(cluster) log [WRN] : slow request 686.308604 seconds old, received at 2024-03-27T18:46:59.017876+: client_request(client.99445:4190508 getattr AsXsFs #0x100081b9ceb 2024-03-27T18:46:59.018864+ caller_uid=30150, caller_gid=600{600,608,999,}) currently joining batch getattr Mar 27 11:58:25 pr-md-01.prism ceph-mds[1296468]: log_channel(cluster) log [WRN] : slow request 686.156943 sec
[ceph-users] Re: MDS Behind on Trimming...
Wow those are extremely useful commands. Next time this happens I'll be sure to use them. A quick test shows they work just great! cheers, erich On 3/28/24 11:16 AM, Alexander E. Patrakov wrote: Hi Erich, Here is how to map the client ID to some extra info: ceph tell mds.0 client ls id=99445 Here is how to map inode ID to the path: ceph tell mds.0 dump inode 0x100081b9ceb | jq -r .path On Fri, Mar 29, 2024 at 1:12 AM Erich Weiler wrote: Here are some of the MDS logs: Mar 27 11:58:25 pr-md-01.prism ceph-mds[1296468]: log_channel(cluster) log [WRN] : slow request 511.703289 seconds old, received at 2024-03-27T18:49:53.623192+: client_request(client.99375:459393 getattr AsXsFs #0x100081b9ceb 2024-03-27T18:49:53.620806+ caller_uid=30150, caller_gid=600{600,608,999,}) currently joining batch getattr Mar 27 11:58:25 pr-md-01.prism ceph-mds[1296468]: log_channel(cluster) log [WRN] : slow request 690.189459 seconds old, received at 2024-03-27T18:46:55.137022+: client_request(client.99445:4189994 getattr AsXsFs #0x100081b9ceb 2024-03-27T18:46:55.134857+ caller_uid=30150, caller_gid=600{600,608,999,}) currently joining batch getattr Mar 27 11:58:25 pr-md-01.prism ceph-mds[1296468]: log_channel(cluster) log [WRN] : slow request 686.308604 seconds old, received at 2024-03-27T18:46:59.017876+: client_request(client.99445:4190508 getattr AsXsFs #0x100081b9ceb 2024-03-27T18:46:59.018864+ caller_uid=30150, caller_gid=600{600,608,999,}) currently joining batch getattr Mar 27 11:58:25 pr-md-01.prism ceph-mds[1296468]: log_channel(cluster) log [WRN] : slow request 686.156943 seconds old, received at 2024-03-27T18:46:59.169537+: client_request(client.99400:591887 getattr AsXsFs #0x100081b9ceb 2024-03-27T18:46:59.170644+ caller_uid=30150, caller_gid=600{600,608,999,}) currently joining batch getattr Mar 27 11:58:26 pr-md-01.prism ceph-mds[1296468]: mds.slugfs.pr-md-01.xdtppo Updating MDS map to version 16631 from mon.0 Mar 27 11:58:30 pr-md-01.prism ceph-mds[1296468]: log_channel(cluster) log [WRN] : 16 slow requests, 0 included below; oldest blocked for > 699.385743 secs Mar 27 11:58:34 pr-md-01.prism ceph-mds[1296468]: mds.slugfs.pr-md-01.xdtppo Updating MDS map to version 16632 from mon.0 Mar 27 11:58:35 pr-md-01.prism ceph-mds[1296468]: log_channel(cluster) log [WRN] : 16 slow requests, 0 included below; oldest blocked for > 704.385896 secs Mar 27 11:58:38 pr-md-01.prism ceph-mds[1296468]: mds.slugfs.pr-md-01.xdtppo Updating MDS map to version 16633 from mon.0 Mar 27 11:58:40 pr-md-01.prism ceph-mds[1296468]: log_channel(cluster) log [WRN] : 16 slow requests, 0 included below; oldest blocked for > 709.385979 secs Mar 27 11:58:42 pr-md-01.prism ceph-mds[1296468]: mds.slugfs.pr-md-01.xdtppo Updating MDS map to version 16634 from mon.0 Mar 27 11:58:45 pr-md-01.prism ceph-mds[1296468]: log_channel(cluster) log [WRN] : 78 slow requests, 5 included below; oldest blocked for > 714.386040 secs Mar 27 11:58:45 pr-md-01.prism ceph-mds[1296468]: log_channel(cluster) log [WRN] : slow request 710.189838 seconds old, received at 2024-03-27T18:46:55.137022+: client_request(client.99445:4189994 getattr AsXsFs #0x100081b9ceb 2024-03-27T18:46:55.134857+ caller_uid=30150, caller_gid=600{600,608,999,}) currently joining batch getattr Mar 27 11:58:45 pr-md-01.prism ceph-mds[1296468]: log_channel(cluster) log [WRN] : slow request 706.308983 seconds old, received at 2024-03-27T18:46:59.017876+: client_request(client.99445:4190508 getattr AsXsFs #0x100081b9ceb 2024-03-27T18:46:59.018864+ caller_uid=30150, 
caller_gid=600{600,608,999,}) currently joining batch getattr Mar 27 11:58:45 pr-md-01.prism ceph-mds[1296468]: log_channel(cluster) log [WRN] : slow request 706.157322 seconds old, received at 2024-03-27T18:46:59.169537+: client_request(client.99400:591887 getattr AsXsFs #0x100081b9ceb 2024-03-27T18:46:59.170644+ caller_uid=30150, caller_gid=600{600,608,999,}) currently joining batch getattr Mar 27 11:58:45 pr-md-01.prism ceph-mds[1296468]: log_channel(cluster) log [WRN] : slow request 706.086751 seconds old, received at 2024-03-27T18:46:59.240108+: client_request(client.99400:591894 getattr AsXsFs #0x100081b9ceb 2024-03-27T18:46:59.242644+ caller_uid=30150, caller_gid=600{600,608,999,}) currently joining batch getattr Mar 27 11:58:45 pr-md-01.prism ceph-mds[1296468]: log_channel(cluster) log [WRN] : slow request 705.196030 seconds old, received at 2024-03-27T18:47:00.130829+: client_request(client.99400:591985 getattr AsXsFs #0x100081b9ceb 2024-03-27T18:47:00.130641+ caller_uid=30150, caller_gid=600{600,608,999,}) currently joining batch getattr Mar 27 11:58:45 pr-md-01.prism ceph-mds[1296468]: mds.slugfs.pr-md-01.xdtppo Updating MDS map to version 16635 from mon.0 Mar 27 11:58:50 pr-md-01.prism ceph-mds[1296468]: log_channel(cluster) log [WRN] : 16 slow requests, 0 included below; oldest blocked for > 719.386116 secs Mar 27 11:58:53
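For completeness, a quick sketch of applying the two commands from Alexander's reply above to the values in these log lines (client.99445 and inode #0x100081b9ceb), assuming rank 0 is the MDS reporting the slow requests:

# ceph tell mds.0 client ls id=99445
# ceph tell mds.0 dump inode 0x100081b9ceb | jq -r .path

The first should print the session metadata (hostname, IP, mount point) for client 99445, and the second the filesystem path behind the inode that all those getattrs are queued on.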
[ceph-users] Re: MDS Behind on Trimming...
11:59:00 pr-md-01.prism ceph-mds[1296468]: log_channel(cluster) log [WRN] : 16 slow requests, 0 included below; oldest blocked for > 729.386333 secs Mar 27 11:59:02 pr-md-01.prism ceph-mds[1296468]: mds.slugfs.pr-md-01.xdtppo Updating MDS map to version 16638 from mon.0 Mar 27 11:59:05 pr-md-01.prism ceph-mds[1296468]: log_channel(cluster) log [WRN] : 53 slow requests, 5 included below; oldest blocked for > 734.386400 secs Mar 27 11:59:05 pr-md-01.prism ceph-mds[1296468]: log_channel(cluster) log [WRN] : slow request 730.190197 seconds old, received at 2024-03-27T18:46:55.137022+: client_request(client.99445:4189994 getattr AsXsFs #0x100081b9ceb 2024-03-27T18:46:55.134857+ caller_uid=30150, caller_gid=600{600,608,999,}) currently joining batch getattr Can we tell which client the slow requests are coming from? It says stuff like "client.99445:4189994" but I don't know how to map that to a client... Thanks for the response! -erich On 3/27/24 21:28, Xiubo Li wrote: On 3/28/24 04:03, Erich Weiler wrote: Hi All, I've been battling this for a while and I'm not sure where to go from here. I have a Ceph health warning as such: # ceph -s cluster: id: 58bde08a-d7ed-11ee-9098-506b4b4da440 health: HEALTH_WARN 1 MDSs report slow requests There had slow requests. I just suspect the behind on trimming was caused by this. Could you share the logs about the slow requests ? What are they ? Thanks 1 MDSs behind on trimming services: mon: 5 daemons, quorum pr-md-01,pr-md-02,pr-store-01,pr-store-02,pr-md-03 (age 5d) mgr: pr-md-01.jemmdf(active, since 3w), standbys: pr-md-02.emffhz mds: 1/1 daemons up, 2 standby osd: 46 osds: 46 up (since 9h), 46 in (since 2w) data: volumes: 1/1 healthy pools: 4 pools, 1313 pgs objects: 260.72M objects, 466 TiB usage: 704 TiB used, 424 TiB / 1.1 PiB avail pgs: 1306 active+clean 4 active+clean+scrubbing+deep 3 active+clean+scrubbing io: client: 123 MiB/s rd, 75 MiB/s wr, 109 op/s rd, 1.40k op/s wr And the specifics are: # ceph health detail HEALTH_WARN 1 MDSs report slow requests; 1 MDSs behind on trimming [WRN] MDS_SLOW_REQUEST: 1 MDSs report slow requests mds.slugfs.pr-md-01.xdtppo(mds.0): 99 slow requests are blocked > 30 secs [WRN] MDS_TRIM: 1 MDSs behind on trimming mds.slugfs.pr-md-01.xdtppo(mds.0): Behind on trimming (13884/250) max_segments: 250, num_segments: 13884 That "num_segments" number slowly keeps increasing. I suspect I just need to tell the MDS servers to trim faster but after hours of googling around I just can't figure out the best way to do it. The best I could come up with was to decrease "mds_cache_trim_decay_rate" from 1.0 to .8 (to start), based on this page: https://www.suse.com/support/kb/doc/?id=19740 But it doesn't seem to help, maybe I should decrease it further? I am guessing this must be a common issue...? I am running Reef on the MDS servers, but most clients are on Quincy. Thanks for any advice! cheers, erich ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
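Another way to see what those slow requests are stuck on, as a hedged sketch (I believe the MDS admin-socket commands are reachable through "ceph tell" on recent releases; if not, the same commands should work with "ceph daemon mds.<name>" on the MDS host):

# ceph tell mds.slugfs.pr-md-01.xdtppo dump_blocked_ops
# ceph tell mds.slugfs.pr-md-01.xdtppo dump_ops_in_flight

Each op in the output includes the client.<id> that issued it and its current state (e.g. "failed to rdlock, waiting"), which is the same information as the log warnings but in one JSON dump.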
[ceph-users] MDS Behind on Trimming...
Hi All, I've been battling this for a while and I'm not sure where to go from here. I have a Ceph health warning as such: # ceph -s cluster: id: 58bde08a-d7ed-11ee-9098-506b4b4da440 health: HEALTH_WARN 1 MDSs report slow requests 1 MDSs behind on trimming services: mon: 5 daemons, quorum pr-md-01,pr-md-02,pr-store-01,pr-store-02,pr-md-03 (age 5d) mgr: pr-md-01.jemmdf(active, since 3w), standbys: pr-md-02.emffhz mds: 1/1 daemons up, 2 standby osd: 46 osds: 46 up (since 9h), 46 in (since 2w) data: volumes: 1/1 healthy pools: 4 pools, 1313 pgs objects: 260.72M objects, 466 TiB usage: 704 TiB used, 424 TiB / 1.1 PiB avail pgs: 1306 active+clean 4 active+clean+scrubbing+deep 3 active+clean+scrubbing io: client: 123 MiB/s rd, 75 MiB/s wr, 109 op/s rd, 1.40k op/s wr And the specifics are: # ceph health detail HEALTH_WARN 1 MDSs report slow requests; 1 MDSs behind on trimming [WRN] MDS_SLOW_REQUEST: 1 MDSs report slow requests mds.slugfs.pr-md-01.xdtppo(mds.0): 99 slow requests are blocked > 30 secs [WRN] MDS_TRIM: 1 MDSs behind on trimming mds.slugfs.pr-md-01.xdtppo(mds.0): Behind on trimming (13884/250) max_segments: 250, num_segments: 13884 That "num_segments" number slowly keeps increasing. I suspect I just need to tell the MDS servers to trim faster but after hours of googling around I just can't figure out the best way to do it. The best I could come up with was to decrease "mds_cache_trim_decay_rate" from 1.0 to .8 (to start), based on this page: https://www.suse.com/support/kb/doc/?id=19740 But it doesn't seem to help, maybe I should decrease it further? I am guessing this must be a common issue...? I am running Reef on the MDS servers, but most clients are on Quincy. Thanks for any advice! cheers, erich ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
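In case it helps anyone searching later, a rough sketch of applying that change cluster-wide through the central config (the value here is just the one from this thread, not a recommendation):

# ceph config set mds mds_cache_trim_decay_rate 0.8
# ceph config get mds mds_cache_trim_decay_rate

If memory serves there is also an mds_cache_trim_threshold option that limits how much the MDS will trim per tick, which can matter once num_segments has already grown into the thousands.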
[ceph-users] CephFS filesystem mount tanks on some nodes?
Hi All, We have a CephFS filesystem where we are running Reef on the servers (OSD/MDS/MGR/MON) and Quincy on the clients. Every once in a while, one of the clients will stop allowing access to my CephFS filesystem, the error being "permission denied" while try to access the filesystem on that node. The fix is to force unmount the filesystem and remount it, then it's fine again. Any idea how I can prevent this? I see this in the client node logs: Mar 25 11:34:46 phoenix-07 kernel: [50508.354036] ? __touch_cap+0x24/0xd0 [ceph] Mar 25 11:34:46 phoenix-07 kernel: [50508.359650] ? __touch_cap+0x24/0xd0 [ceph] Mar 25 11:34:46 phoenix-07 kernel: [50508.367657] ? __touch_cap+0x24/0xd0 [ceph] Mar 25 11:36:46 phoenix-07 kernel: [50629.189000] ? __touch_cap+0x24/0xd0 [ceph] Mar 25 11:36:46 phoenix-07 kernel: [50629.192579] ? __touch_cap+0x24/0xd0 [ceph] Mar 25 11:36:46 phoenix-07 kernel: [50629.196103] ? __touch_cap+0x24/0xd0 [ceph] Mar 25 11:38:47 phoenix-07 kernel: [50750.024268] ? __touch_cap+0x24/0xd0 [ceph] Mar 25 11:38:47 phoenix-07 kernel: [50750.031520] ? __touch_cap+0x24/0xd0 [ceph] Mar 25 11:38:47 phoenix-07 kernel: [50750.038594] ? __touch_cap+0x24/0xd0 [ceph] Mar 25 11:40:48 phoenix-07 kernel: [50870.853281] ? __touch_cap+0x24/0xd0 [ceph] Mar 25 22:55:38 phoenix-07 kernel: [91360.583032] libceph: mds0 (1)10.50.1.75:6801 socket closed (con state OPEN) Mar 25 22:55:38 phoenix-07 kernel: [91360.667914] libceph: mds0 (1)10.50.1.75:6801 session reset Mar 25 22:55:38 phoenix-07 kernel: [91360.667923] ceph: mds0 closed our session Mar 25 22:55:38 phoenix-07 kernel: [91360.667925] ceph: mds0 reconnect start Mar 25 22:55:52 phoenix-07 kernel: [91374.541614] ceph: mds0 reconnect denied Mar 25 22:55:52 phoenix-07 kernel: [91374.541726] ceph: dropping dirty+flushing Fw state for ea96c18f 1099683115069 Mar 25 22:55:52 phoenix-07 kernel: [91374.541732] ceph: dropping dirty+flushing Fw state for ce495f00 1099687100635 Mar 25 22:55:52 phoenix-07 kernel: [91374.541737] ceph: dropping dirty+flushing Fw state for 73ebb190 1099687100636 Mar 25 22:55:52 phoenix-07 kernel: [91374.541744] ceph: dropping dirty+flushing Fw state for 91337e6a 1099687100637 Mar 25 22:55:52 phoenix-07 kernel: [91374.541746] ceph: dropping dirty+flushing Fw state for 9075ecd8 1099687100634 Mar 25 22:55:52 phoenix-07 kernel: [91374.541751] ceph: dropping dirty+flushing Fw state for d1d4c51f 1099687100633 Mar 25 22:55:52 phoenix-07 kernel: [91374.541781] ceph: dropping dirty+flushing Fw state for 63dec1e4 1099687100632 Mar 25 22:55:52 phoenix-07 kernel: [91374.541793] ceph: dropping dirty+flushing Fw state for 8b3124db 1099687100638 Mar 25 22:55:52 phoenix-07 kernel: [91374.541796] ceph: dropping dirty+flushing Fw state for d9e76d8b 1099687100471 Mar 25 22:55:52 phoenix-07 kernel: [91374.541798] ceph: dropping dirty+flushing Fw state for b57da610 1099685041085 Mar 25 22:55:52 phoenix-07 kernel: [91374.542235] libceph: mds0 (1)10.50.1.75:6801 socket closed (con state V1_CONNECT_MSG) Mar 25 22:55:52 phoenix-07 kernel: [91374.791652] ceph: mds0 rejected session Mar 25 23:01:51 phoenix-07 kernel: [91733.308806] ceph: get_quota_realm: ino (1.fffe) null i_snap_realm Mar 25 23:01:56 phoenix-07 kernel: [91738.182127] ceph: check_quota_exceeded: ino (1000a1cb4a8.fffe) null i_snap_realm Mar 25 23:01:56 phoenix-07 kernel: [91738.188225] ceph: check_quota_exceeded: ino (1000a1cb4a8.fffe) null i_snap_realm Mar 25 23:01:56 phoenix-07 kernel: [91738.233658] ceph: check_quota_exceeded: ino (1000a1cb4aa.fffe) null i_snap_realm Mar 25 23:25:52 
phoenix-07 kernel: [93174.787630] libceph: mds0 (1)10.50.1.75:6801 socket closed (con state OPEN) Mar 25 23:39:45 phoenix-07 kernel: [94007.751879] ceph: get_quota_realm: ino (1.fffe) null i_snap_realm Mar 26 00:03:28 phoenix-07 kernel: [95430.158646] ceph: get_quota_realm: ino (1.fffe) null i_snap_realm Mar 26 00:39:45 phoenix-07 kernel: [97607.685421] ceph: get_quota_realm: ino (1.fffe) null i_snap_realm Mar 26 00:43:34 phoenix-07 kernel: [97836.681145] ceph: check_quota_exceeded: ino (1000a306503.fffe) null i_snap_realm Mar 26 00:43:34 phoenix-07 kernel: [97836.686797] ceph: check_quota_exceeded: ino (1000a306503.fffe) null i_snap_realm Mar 26 00:43:34 phoenix-07 kernel: [97836.729046] ceph: check_quota_exceeded: ino (1000a306505.fffe) null i_snap_realm Mar 26 00:49:39 phoenix-07 kernel: [98201.302564] ceph: check_quota_exceeded: ino (1000a75677d.fffe) null i_snap_realm Mar 26 00:49:39 phoenix-07 kernel: [98201.305676] ceph: check_quota_exceeded: ino (1000a75677d.fffe) null i_snap_realm Mar 26 00:49:39 phoenix-07 kernel: [98201.347267] ceph: check_quota_exceeded: ino (1000a755fe3.fffe) null i_snap_realm Mar 26 01:04:49 pho
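The "mds0 reconnect denied" / "mds0 rejected session" lines in those client logs usually mean the MDS evicted the client, and an evicted kernel client is blocklisted by default, which then shows up as "permission denied" on the mount until it is remounted. A rough sketch of what one might check and do, assuming the mount point is /mnt/cephfs and there is an fstab entry for it:

# ceph osd blocklist ls
# umount -f /mnt/cephfs
# mount /mnt/cephfs

Newer kernels also accept a recover_session=clean mount option that lets the kernel client drop its stale state and rejoin automatically after being blocklisted; whether that is acceptable depends on how much you care about dirty data lost on the evicted node.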
[ceph-users] Re: Clients failing to advance oldest client?
Thank you! The OSD/mon/mgr/MDS servers are on 18.2.1, and the clients are mostly 17.2.6. -erich On 3/25/24 11:57 PM, Dhairya Parmar wrote: I think this bug has already been worked on in https://tracker.ceph.com/issues/63364 <https://tracker.ceph.com/issues/63364>, can you tell which version you're on? -- *Dhairya Parmar* Associate Software Engineer, CephFS IBM, Inc. On Tue, Mar 26, 2024 at 2:32 AM Erich Weiler <mailto:wei...@soe.ucsc.edu>> wrote: Hi Y'all, I'm seeing this warning via 'ceph -s' (this is on Reef): # ceph -s cluster: id: 58bde08a-d7ed-11ee-9098-506b4b4da440 health: HEALTH_WARN 3 clients failing to advance oldest client/flush tid 1 MDSs report slow requests 1 MDSs behind on trimming services: mon: 5 daemons, quorum pr-md-01,pr-md-02,pr-store-01,pr-store-02,pr-md-03 (age 3d) mgr: pr-md-01.jemmdf(active, since 3w), standbys: pr-md-02.emffhz mds: 1/1 daemons up, 1 standby osd: 46 osds: 46 up (since 3d), 46 in (since 2w) data: volumes: 1/1 healthy pools: 4 pools, 1313 pgs objects: 258.13M objects, 454 TiB usage: 688 TiB used, 441 TiB / 1.1 PiB avail pgs: 1303 active+clean 8 active+clean+scrubbing 2 active+clean+scrubbing+deep io: client: 131 MiB/s rd, 111 MiB/s wr, 41 op/s rd, 613 op/s wr I googled around and looked at the docs and it seems like this isn't a critical problem, but I couldn't find a clear path to resolution. Does anyone have any advice on what I can do to resolve the health issues up top? My CephFS filesystem is incredibly busy so I have a feeling that has some impact here, but not 100% sure... Thanks as always for the help! cheers, erich ___ ceph-users mailing list -- ceph-users@ceph.io <mailto:ceph-users@ceph.io> To unsubscribe send an email to ceph-users-le...@ceph.io <mailto:ceph-users-le...@ceph.io> ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
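A quick way to double-check what the cluster and the connected clients actually report, beyond the package versions on each box (a hedged sketch):

# ceph versions
# ceph features

"ceph versions" lists the running daemon versions, and "ceph features" groups the connected client sessions by release/feature set, which is handy for spotting old kernel clients that the node's package version doesn't reveal.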
[ceph-users] Re: Clients failing to advance oldest client?
Ok! Thank you. Is there a way to tell which client is slow? > On Mar 25, 2024, at 9:06 PM, David Yang wrote: > > It is recommended to disconnect the client first and then observe > whether the cluster's slow requests recover. > > Erich Weiler wrote on Tue, Mar 26, 2024 at 05:02: >> >> Hi Y'all, >> >> I'm seeing this warning via 'ceph -s' (this is on Reef): >> >> # ceph -s >> cluster: >> id: 58bde08a-d7ed-11ee-9098-506b4b4da440 >> health: HEALTH_WARN >> 3 clients failing to advance oldest client/flush tid >> 1 MDSs report slow requests >> 1 MDSs behind on trimming >> >> services: >> mon: 5 daemons, quorum >> pr-md-01,pr-md-02,pr-store-01,pr-store-02,pr-md-03 (age 3d) >> mgr: pr-md-01.jemmdf(active, since 3w), standbys: pr-md-02.emffhz >> mds: 1/1 daemons up, 1 standby >> osd: 46 osds: 46 up (since 3d), 46 in (since 2w) >> >> data: >> volumes: 1/1 healthy >> pools: 4 pools, 1313 pgs >> objects: 258.13M objects, 454 TiB >> usage: 688 TiB used, 441 TiB / 1.1 PiB avail >> pgs: 1303 active+clean >> 8 active+clean+scrubbing >> 2 active+clean+scrubbing+deep >> >> io: >> client: 131 MiB/s rd, 111 MiB/s wr, 41 op/s rd, 613 op/s wr >> >> I googled around and looked at the docs and it seems like this isn't a >> critical problem, but I couldn't find a clear path to resolution. Does >> anyone have any advice on what I can do to resolve the health issues up top? >> >> My CephFS filesystem is incredibly busy so I have a feeling that has >> some impact here, but not 100% sure... >> >> Thanks as always for the help! >> >> cheers, >> erich >> ___ >> ceph-users mailing list -- ceph-users@ceph.io >> To unsubscribe send an email to ceph-users-le...@ceph.io ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
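In case it's useful, a minimal sketch of how one might pin down which clients are behind: "ceph health detail" names the client_id for each "failing to advance oldest client/flush tid" warning, and the MDS can map that id to a host (the id below is just a placeholder):

# ceph health detail
# ceph tell mds.0 client ls id=<client_id>

The client ls output includes the session's client_metadata (hostname, mount point, kernel/ceph version), which is usually enough to decide whether to chase the workload on that node or just evict and remount it.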
[ceph-users] Clients failing to advance oldest client?
Hi Y'all, I'm seeing this warning via 'ceph -s' (this is on Reef): # ceph -s cluster: id: 58bde08a-d7ed-11ee-9098-506b4b4da440 health: HEALTH_WARN 3 clients failing to advance oldest client/flush tid 1 MDSs report slow requests 1 MDSs behind on trimming services: mon: 5 daemons, quorum pr-md-01,pr-md-02,pr-store-01,pr-store-02,pr-md-03 (age 3d) mgr: pr-md-01.jemmdf(active, since 3w), standbys: pr-md-02.emffhz mds: 1/1 daemons up, 1 standby osd: 46 osds: 46 up (since 3d), 46 in (since 2w) data: volumes: 1/1 healthy pools: 4 pools, 1313 pgs objects: 258.13M objects, 454 TiB usage: 688 TiB used, 441 TiB / 1.1 PiB avail pgs: 1303 active+clean 8 active+clean+scrubbing 2 active+clean+scrubbing+deep io: client: 131 MiB/s rd, 111 MiB/s wr, 41 op/s rd, 613 op/s wr I googled around and looked at the docs and it seems like this isn't a critical problem, but I couldn't find a clear path to resolution. Does anyone have any advice on what I can do to resolve the health issues up top? My CephFS filesystem is incredibly busy so I have a feeling that has some impact here, but not 100% sure... Thanks as always for the help! cheers, erich ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Question about erasure coding on cephfs
Hi Y'all, We have a new ceph cluster online that looks like this: md-01 : monitor, manager, mds md-02 : monitor, manager, mds md-03 : monitor, manager store-01 : twenty 30TB NVMe OSDs store-02 : twenty 30TB NVMe OSDs The cephfs storage is using erasure coding at 4:2. The crush domain is set to "osd". (I know that's not optimal but let me get to that in a minute.) We currently have a single regular NFS server (nfs-01) with the same storage as the OSD servers above (twenty 30TB NVMe disks). We want to wipe the NFS server and integrate it into the above ceph cluster as "store-03". When we do that, we would then have three OSD servers. We would then switch the crush domain to "host". My question is this: Given that we have 4:2 erasure coding, would the data rebalance evenly across the three OSD servers after we add store-03 such that if a single OSD server went down, the other two would be enough to keep the system online? Like, with 4:2 erasure coding, would 2 shards go on store-01, then 2 shards on store-02, and then 2 shards on store-03? Is that how it would work? Thanks for any insight! -erich ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
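A rough sketch of the pieces involved, with made-up names, in case it helps frame the question. With crush-failure-domain=host and only three hosts, the rule Ceph generates for a 4+2 pool wants six distinct hosts, so you would need a CRUSH rule that picks three hosts and then two OSDs within each:

# ceph osd erasure-code-profile set ec42 k=4 m=2 crush-failure-domain=host

rule ec42_3hosts_2osds {
    id 99
    type erasure
    step set_chooseleaf_tries 5
    step set_choose_tries 100
    step take default
    step choose indep 3 type host
    step chooseleaf indep 2 type osd
    step emit
}

With that layout each host holds two of the six shards, so losing one host loses exactly two shards, which k=4/m=2 can survive in terms of data durability. One caveat: EC pools default to min_size = k+1 = 5, so with only four shards left the PGs would pause I/O until the host comes back (or min_size is lowered to 4, which carries its own risk), meaning "stays online" depends on that setting as much as on the shard placement.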