[ceph-users] Adding new OSDs - also adding PGs?

2024-06-04 Thread Erich Weiler

Hi All,

I'm going to be adding a bunch of OSDs to our cephfs cluster shortly 
(increasing the total size by 50%).  We're on reef, and will be 
deploying using the cephadm method, and the OSDs are exactly the same 
size and disk type as the current ones.


So, after adding the new OSDs, my understanding is that ceph will begin 
rebalancing the data.  I will also probably want to increase my PGs to 
accommodate the new OSDs being added.  My question is basically: should 
I wait for the rebalance to finish before increasing my PG count, which 
would kick off another rebalance action for the new PGs?  Or should I 
increase the PG count as soon as the rebalance action starts after 
adding the new OSDs, so that it creates the new PGs and rebalances onto 
the new OSDs at the same time?
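
For reference, the PG change itself would be something along these lines 
(a sketch only - cephfs_data is our data pool, but the target pg_num of 
2048 is just a placeholder I haven't actually sized yet, and this assumes 
the pg_autoscaler isn't already managing the pool):

ceph osd pool get cephfs_data pg_num
ceph osd pool get cephfs_data pg_autoscale_mode
ceph osd pool set cephfs_data pg_num 2048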


Thanks for any guidance!

-erich
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: [EXTERN] Re: cache pressure?

2024-05-07 Thread Erich Weiler
I still saw client cache pressure messages, although I think it did in 
general help a bit.  What I additionally just did (like 5 minutes ago), 
was reduce "mds_recall_max_caps" from 30,000 to 10,000 after looking at 
this post:


https://www.spinics.net/lists/ceph-users/msg73188.html

I will try further reducing mds_recall_max_caps if the pressure 
messages keep coming up.  After reducing it to 10,000, a few client cache 
pressure warnings cleared, but I don't know yet whether that was the 
reason they cleared or just luck.  If it stays clear then I'll call 
it solved.
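
(For the record, a rough sketch of that change - I'm assuming the 
centralized config database here; a runtime-only change via 
'ceph tell mds.* config set ...' would also work but wouldn't survive a 
daemon restart:)

ceph config set mds mds_recall_max_caps 10000
ceph config show mds.slugfs.pr-md-01.xdtppo | grep mds_recall_max_caps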


-erich

On 5/7/24 6:55 AM, Dietmar Rieder wrote:

On 4/26/24 23:51, Erich Weiler wrote:
As Dietmar said, VS Code may cause this. Quite funny to read, 
actually, because we've been dealing with this issue for over a year, 
and yesterday was the very first time Ceph complained about a client 
and we saw VS Code's remote stuff running. Coincidence.


I'm holding my breath that the vscode issue is the one affecting us - 
I got my users to tweak their vscode configs and the problem seemed to 
go away, but I guess I won't consider it 'solved' until a few days 
pass without it coming back...  :)


I wonder if the vscode configs solved your issues, or if you still see 
the cache pressure messages?


Dietmar

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: 'ceph fs status' no longer works?

2024-05-02 Thread Erich Weiler

Excellent!  Restarting all the MDS daemons fixed it.  Thank you.

This kinda feels like a bug.
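
(For the archives, the restart itself was roughly the following, since 
this is a cephadm-managed cluster - the service name mds.slugfs is 
specific to our deployment:)

ceph orch ps --daemon-type mds
ceph orch restart mds.slugfs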

-erich

On 5/2/24 12:44 PM, Bandelow, Gunnar wrote:

Hi Erich,

I'm not sure about this specific error message, but "ceph fs status" 
sometimes failed for me at the end of last year/beginning of this year.


Restarting ALL mon, mgr AND mds fixed it at the time.

Best regards,
Gunnar


===

Gunnar Bandelow (dipl. phys.)

Universitätsrechenzentrum (URZ)
Universität Greifswald
Felix-Hausdorff-Straße 18
17489 Greifswald
Germany


--- Original Message ---
*Subject: *[ceph-users] Re: 'ceph fs status' no longer works?
*From: *"Erich Weiler" <wei...@soe.ucsc.edu>
*To: *"Eugen Block" <ebl...@nde.ag>, ceph-users@ceph.io

*Date: *02-05-2024 21:05



Hi Eugen,

Thanks for the tip!  I just ran:

ceph orch daemon restart mgr.pr-md-01.jemmdf

(my specific mgr instance)

And it restarted my primary mgr daemon, and in the process failed over
to my standby mgr daemon on another server.  That went smoothly.

Unfortunately, I still cannot get 'ceph fs status' to work (on any
node)...

# ceph fs status
Error EINVAL: Traceback (most recent call last):
    File "/usr/share/ceph/mgr/mgr_module.py", line 1811, in
_handle_command
  return CLICommand.COMMANDS[cmd['prefix']].call(self, cmd, inbuf)
    File "/usr/share/ceph/mgr/mgr_module.py", line 474, in call
  return self.func(mgr, **kwargs)
    File "/usr/share/ceph/mgr/status/module.py", line 109, in
handle_fs_status
  assert metadata
AssertionError

-erich

On 5/2/24 11:07 AM, Eugen Block wrote:
 > Yep, seen this a couple of times during upgrades. I’ll have to check my
 > notes if I wrote anything down for that. But try a mgr failover first,
 > that could help.
 >
 > Quoting Erich Weiler <wei...@soe.ucsc.edu>:
 >
 >> Hi All,
 >>
 >> For a while now I've been using 'ceph fs status' to show current MDS
 >> active servers, filesystem status, etc.  I recently took down my MDS
 >> servers and added RAM to them (one by one, so the filesystem stayed
 >> online).  After doing that with my four MDS servers (I had two active
 >> and two standby), all looks OK, 'ceph -s' reports HEALTH_OK.  But when
 >> I do 'ceph fs status' now, I get this:
 >>
 >> # ceph fs status
 >> Error EINVAL: Traceback (most recent call last):
 >>   File "/usr/share/ceph/mgr/mgr_module.py", line 1811, in _handle_command
 >>     return CLICommand.COMMANDS[cmd['prefix']].call(self, cmd, inbuf)
 >>   File "/usr/share/ceph/mgr/mgr_module.py", line 474, in call
 >>     return self.func(mgr, **kwargs)
 >>   File "/usr/share/ceph/mgr/status/module.py", line 109, in handle_fs_status
 >>     assert metadata
 >> AssertionError
 >>
 >> This is on ceph 18.2.1 reef.  This is very odd - can anyone think of a
 >> reason why 'ceph fs status' would stop working after taking each of
 >> the servers down for maintenance?
 >>
 >> The filesystem is online and working just fine however.  This ceph
 >> instance is deployed via the cephadm method on RHEL 9.3, so
 >> everything is containerized in podman.
 >>
 >> Thanks again,
 >> erich
 >> ___
 >> ceph-users mailing list -- ceph-users@ceph.io
 >> To unsubscribe send an email to ceph-users-le...@ceph.io
 >
 >
 > ___
 > ceph-users mailing list -- ceph-users@ceph.io
 > To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: 'ceph fs status' no longer works?

2024-05-02 Thread Erich Weiler

Hi Eugen,

Thanks for the tip!  I just ran:

ceph orch daemon restart mgr.pr-md-01.jemmdf

(my specific mgr instance)

And it restarted my primary mgr daemon, and in the process failed over 
to my standby mgr daemon on another server.  That went smoothly.


Unfortunately, I still cannot get 'ceph fs status' to work (on any node)...

# ceph fs status
Error EINVAL: Traceback (most recent call last):
  File "/usr/share/ceph/mgr/mgr_module.py", line 1811, in _handle_command
return CLICommand.COMMANDS[cmd['prefix']].call(self, cmd, inbuf)
  File "/usr/share/ceph/mgr/mgr_module.py", line 474, in call
return self.func(mgr, **kwargs)
  File "/usr/share/ceph/mgr/status/module.py", line 109, in 
handle_fs_status

assert metadata
AssertionError

-erich

On 5/2/24 11:07 AM, Eugen Block wrote:
Yep, seen this a couple of times during upgrades. I’ll have to check my 
notes if I wrote anything down for that. But try a mgr failover first, 
that could help.


Quoting Erich Weiler :


Hi All,

For a while now I've been using 'ceph fs status' to show current MDS 
active servers, filesystem status, etc.  I recently took down my MDS 
servers and added RAM to them (one by one, so the filesystem stayed 
online).  After doing that with my four MDS servers (I had two active 
and two standby), all looks OK, 'ceph -s' reports HEALTH_OK.  But when 
I do 'ceph fs status' now, I get this:


# ceph fs status
Error EINVAL: Traceback (most recent call last):
  File "/usr/share/ceph/mgr/mgr_module.py", line 1811, in _handle_command
    return CLICommand.COMMANDS[cmd['prefix']].call(self, cmd, inbuf)
  File "/usr/share/ceph/mgr/mgr_module.py", line 474, in call
    return self.func(mgr, **kwargs)
  File "/usr/share/ceph/mgr/status/module.py", line 109, in 
handle_fs_status

    assert metadata
AssertionError

This is on ceph 18.2.1 reef.  This is very odd - can anyone think of a 
reason why 'ceph fs status' would stop working after taking each of 
the servers down for maintenance?


The filesystem is online and working just fine however.  This ceph 
instance is deployed via the cephadm method on RHEL 9.3, so 
everything is containerized in podman.


Thanks again,
erich
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] 'ceph fs status' no longer works?

2024-05-02 Thread Erich Weiler

Hi All,

For a while now I've been using 'ceph fs status' to show current MDS 
active servers, filesystem status, etc.  I recently took down my MDS 
servers and added RAM to them (one by one, so the filesystem stayed 
online).  After doing that with my four MDS servers (I had two active 
and two standby), all looks OK, 'ceph -s' reports HEALTH_OK.  But when I 
do 'ceph fs status' now, I get this:


# ceph fs status
Error EINVAL: Traceback (most recent call last):
  File "/usr/share/ceph/mgr/mgr_module.py", line 1811, in _handle_command
return CLICommand.COMMANDS[cmd['prefix']].call(self, cmd, inbuf)
  File "/usr/share/ceph/mgr/mgr_module.py", line 474, in call
return self.func(mgr, **kwargs)
  File "/usr/share/ceph/mgr/status/module.py", line 109, in 
handle_fs_status

assert metadata
AssertionError

This is on ceph 18.2.1 reef.  This is very odd - can anyone think of a 
reason why 'ceph fs status' would stop working after taking each of the 
servers down for maintenance?


The filesystem is online and working just fine however.  This ceph 
instance is deployed via the cephadm method on RHEL 9.3, so 
everything is containerized in podman.


Thanks again,
erich
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: MDS Behind on Trimming...

2024-04-29 Thread Erich Weiler

Hi Xiubo,

Is there any way to possibly get a PR development release we could 
upgrade to, in order to test and see if the lock order bug per Bug 
#62123 could be the answer?  Although I'm not sure that bug has been 
fixed yet?


-erich

On 4/21/24 9:39 PM, Xiubo Li wrote:

Hi Erich,

I raised one tracker for this https://tracker.ceph.com/issues/65607.

Currently I haven't figured out what was holding the 'dn->lock' in the 
'lookup' request or somewhere else, since there is no debug log.


Hopefully we can get the debug logs, so we can push this further.

Thanks

- Xiubo

On 4/19/24 23:55, Erich Weiler wrote:

Hi Xiubo,

Never mind, I was wrong - most of the blocked ops were 12 hours old.  Ugh.

I restarted the MDS daemon to clear them.

I just reset to having one active MDS instead of two, let's see if 
that makes a difference.


I am beginning to think it may be impossible to catch the logs that 
matter here.  I feel like sometimes the blocked ops are just waiting 
because of load and sometimes they are waiting because they are stuck. 
But, it's really hard to tell which, without waiting a while.  But, I 
can't wait while having debug turned on because my root disks (which 
are 150 GB large) fill up with debug logs in 20 minutes.  So it almost 
seems that unless I could somehow store many TB of debug logs we won't 
be able to catch this.


Let's see how having one MDS helps.  Or maybe I actually need like 4 
MDSs because the load is too high for only one or two.  I don't know. 
Or maybe it's the lock issue you've been working on.  I guess I can 
test the lock order fix when it's available to test.


-erich

On 4/19/24 7:26 AM, Erich Weiler wrote:
So I woke up this morning and checked the blocked_ops again, there 
were 150 of them.  But the age of each ranged from 500 to 4300 
seconds.  So it seems as if they are eventually being processed.


I wonder if we are thinking about this in the wrong way?  Maybe I 
should be *adding* MDS daemons because my current ones are overloaded?


Can a single server hold multiple MDS daemons?  Right now I have 
three physical servers each with one MDS daemon on it.


I can still try reducing to one.  And I'll keep an eye on blocked ops 
to see if any get to a very old age (and are thus wedged).


-erich

On 4/18/24 8:55 PM, Xiubo Li wrote:

Okay, please try it to set only one active mds.


On 4/19/24 11:54, Erich Weiler wrote:

We have 2 active MDS daemons and one standby.

On 4/18/24 8:52 PM, Xiubo Li wrote:

BTW, how many active MDS are you using?


On 4/19/24 10:55, Erich Weiler wrote:
OK, I'm sure I caught it in the right order this time, the logs 
should definitely show when the blocked/slow requests start. 
Check out these logs and dumps:


http://hgwdev.gi.ucsc.edu/~weiler/

It's a 762 MB tarball but it uncompresses to 16 GB.

-erich


On 4/18/24 6:57 PM, Xiubo Li wrote:

Okay, could you try this with 18.2.0?

I suspect it was introduced by:

commit e610179a6a59c463eb3d85e87152ed3268c808ff
Author: Patrick Donnelly 
Date:   Mon Jul 17 16:10:59 2023 -0400

    mds: drop locks and retry when lock set changes

    An optimization was added to avoid an unnecessary gather on the inode
    filelock when the client can safely get the file size without also
    getting issued the requested caps. However, if a retry of getattr
    is necessary, this conditional inclusion of the inode filelock
    can cause lock-order violations resulting in deadlock.

    So, if we've already acquired some of the inode's locks then we must
    drop locks and retry.

    Fixes: https://tracker.ceph.com/issues/62052
    Fixes: c822b3e2573578c288d170d1031672b74e02dced
    Signed-off-by: Patrick Donnelly 
    (cherry picked from commit b5719ac32fe6431131842d62ffaf7101c03e9bac)



On 4/19/24 09:54, Erich Weiler wrote:
I'm on 18.2.1.  I think I may have gotten the timing off on the 
logs and dumps so I'll try again.  Just really hard to capture 
because I need to kind of be looking at it in real time to 
capture it. Hang on, lemme see if I can get another capture...


-erich

On 4/18/24 6:35 PM, Xiubo Li wrote:


BTW, which ceph version you are using ?



On 4/12/24 04:22, Erich Weiler wrote:
BTW - it just happened again, I upped the debugging settings 
as you instructed and got more dumps (then returned the debug 
settings to normal).


Attached are the new dumps.

Thanks again,
erich

On 4/9/24 9:00 PM, Xiubo Li wrote:


On 4/10/24 11:48, Erich Weiler wrote:
Does that mean it could be the lock order bug 
(https://tracker.ceph.com/issues/62123) as Xiubo suggested?


I have raised one PR to fix the lock order issue; if 
possible please give it a try to see whether it resolves this 
issue.


Thank you!  Yeah, this issue is happening every couple days 
now. It just happened again today and I got more MDS dumps. 
If it would help, let me know and I can send them!



[ceph-users] Re: [EXTERN] cache pressure?

2024-04-27 Thread Erich Weiler
Actually, should I be excluding my whole cephfs filesystem?  Like, if I 
mount it as /cephfs, should my stanza look something like:


{
   "files.watcherExclude": {
      "**/.git/objects/**": true,
      "**/.git/subtree-cache/**": true,
      "**/node_modules/*/**": true,
      "**/.cache/**": true,
      "**/.conda/**": true,
      "**/.local/**": true,
      "**/.nextflow/**": true,
      "**/work/**": true,
      "**/cephfs/**": true
   }
}

On 4/27/24 12:24 AM, Dietmar Rieder wrote:

Hi Erich,

hope it helps. Let us know.

Dietmar


Am 26. April 2024 15:52:06 MESZ schrieb Erich Weiler :

Hi Dietmar,

We do in fact have a bunch of users running vscode on our HPC head
node as well (in addition to a few of our general purpose
interactive compute servers). I'll suggest they make the mods you
referenced! Thanks for the tip.

cheers,
erich

On 4/24/24 12:58 PM, Dietmar Rieder wrote:

Hi Erich,

in our case the "client failing to respond to cache pressure"
situation is/was often caused by users who have vscode
connecting via ssh to our HPC head node. vscode makes heavy use
of file watchers and we have seen users with > 400k watchers.
All these watched files must be held in the MDS cache and if you
have multiple users at the same time running vscode it gets
problematic.

Unfortunately there is no global setting - at least none that we
are aware of - for vscode to exclude certain files or
directories from being watched. We asked the users to configure
their vscode (Remote Settings -> Watcher Exclude) as follows:

{
   "files.watcherExclude": {
      "**/.git/objects/**": true,
      "**/.git/subtree-cache/**": true,
      "**/node_modules/*/**": true,
      "**/.cache/**": true,
      "**/.conda/**": true,
      "**/.local/**": true,
      "**/.nextflow/**": true,
      "**/work/**": true
   }
}

~/.vscode-server/data/Machine/settings.json

To monitor and find processes with watchers you may use inotify-info
<https://github.com/mikesart/inotify-info>

HTH
   Dietmar

On 4/23/24 15:47, Erich Weiler wrote:

So I'm trying to figure out ways to reduce the number of
warnings I'm getting and I'm thinking about the one "client
failing to respond to cache pressure".

Is there maybe a way to tell a client (or all clients) to
reduce the amount of cache it uses or to release caches
quickly?  Like, all the time?

I know the linux kernel (and maybe ceph) likes to cache
everything for a while, and rightfully so, but I suspect in
my use case it may be more efficient to more quickly purge
the cache or to in general just cache way less overall...?

We have many thousands of threads all doing different things
that are hitting our filesystem, so I suspect the caching
isn't really doing me much good anyway due to the churn, and
probably is causing more problems than it helping...

-erich


ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: cache pressure?

2024-04-26 Thread Erich Weiler
As Dietmar said, VS Code may cause this. Quite funny to read, actually, 
because we've been dealing with this issue for over a year, and 
yesterday was the very first time Ceph complained about a client and we 
saw VS Code's remote stuff running. Coincidence.


I'm holding my breath that the vscode issue is the one affecting us - I 
got my users to tweak their vscode configs and the problem seemed to go 
away, but I guess I won't consider it 'solved' until a few days pass 
without it coming back...  :)

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: [EXTERN] cache pressure?

2024-04-26 Thread Erich Weiler

Hi Dietmar,

We do in fact have a bunch of users running vscode on our HPC head node 
as well (in addition to a few of our general purpose interactive compute 
servers).  I'll suggest they make the mods you referenced!  Thanks for 
the tip.


cheers,
erich

On 4/24/24 12:58 PM, Dietmar Rieder wrote:

Hi Erich,

in our case the "client failing to respond to cache pressure" situation 
is/was often caused by users who have vscode connecting via ssh to our 
HPC head node. vscode makes heavy use of file watchers and we have seen 
users with > 400k watchers. All these watched files must be held in the 
MDS cache and if you have multiple users at the same time running vscode 
it gets problematic.


Unfortunately there is no global setting - at least none that we are 
aware of - for vscode to exclude certain files or directories from being 
watched. We asked the users to configure their vscode (Remote Settings 
-> Watcher Exclude) as follows:


{
   "files.watcherExclude": {
      "**/.git/objects/**": true,
      "**/.git/subtree-cache/**": true,
      "**/node_modules/*/**": true,
      "**/.cache/**": true,
      "**/.conda/**": true,
      "**/.local/**": true,
      "**/.nextflow/**": true,
      "**/work/**": true
   }
}

~/.vscode-server/data/Machine/settings.json

To monitor and find processes with watchers you may use inotify-info
<https://github.com/mikesart/inotify-info>
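
(Roughly how we run it - a sketch that assumes the binary has been built 
per the project README and is on PATH; it may need root to see other 
users' processes:)

git clone https://github.com/mikesart/inotify-info
cd inotify-info && make
sudo inotify-info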

HTH
   Dietmar

On 4/23/24 15:47, Erich Weiler wrote:
So I'm trying to figure out ways to reduce the number of warnings I'm 
getting and I'm thinking about the one "client failing to respond to 
cache pressure".


Is there maybe a way to tell a client (or all clients) to reduce the 
amount of cache it uses or to release caches quickly?  Like, all the 
time?


I know the linux kernel (and maybe ceph) likes to cache everything for 
a while, and rightfully so, but I suspect in my use case it may be 
more efficient to more quickly purge the cache or to in general just 
cache way less overall...?


We have many thousands of threads all doing different things that are 
hitting our filesystem, so I suspect the caching isn't really doing me 
much good anyway due to the churn, and probably is causing more 
problems than it helping...


-erich
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io




___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] cache pressure?

2024-04-23 Thread Erich Weiler
So I'm trying to figure out ways to reduce the number of warnings I'm 
getting and I'm thinking about the one "client failing to respond to 
cache pressure".


Is there maybe a way to tell a client (or all clients) to reduce the 
amount of cache it uses or to release caches quickly?  Like, all the time?


I know the linux kernel (and maybe ceph) likes to cache everything for a 
while, and rightfully so, but I suspect in my use case it may be more 
efficient to more quickly purge the cache or to in general just cache 
way less overall...?


We have many thousands of threads all doing different things that are 
hitting our filesystem, so I suspect the caching isn't really doing me 
much good anyway due to the churn, and probably is causing more problems 
than it helping...


-erich
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Stuck in replay?

2024-04-22 Thread Erich Weiler
I was able to start another MDS daemon on another node that had 512GB 
RAM, and then the active MDS eventually migrated there, and went through 
the replay (which consumed about 100 GB of RAM), and then things 
recovered.  Phew.  I guess I need significantly more RAM in my MDS 
servers...  I had no idea the MDS daemon could require that much RAM.
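
(For anyone hitting the same thing: since this is a cephadm deployment, 
adding the extra MDS daemon was roughly a placement change like the one 
below - "big-ram-host" is a placeholder hostname, and the exact placement 
list is illustrative:)

ceph orch apply mds slugfs --placement="4 pr-md-01 pr-md-02 pr-md-03 big-ram-host"
ceph fs status slugfs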


-erich

On 4/22/24 11:41 AM, Erich Weiler wrote:

Possibly, but it would be pretty time consuming and difficult...

Is it maybe a RAM issue since my MDS RAM is filling up?  Should maybe I 
bring up another MDS on another server with huge amount of RAM and move 
the MDS there in hopes it will have enough RAM to complete the replay?


On 4/22/24 11:37 AM, Sake Ceph wrote:
Just a question: is it possible to block or disable all clients? Just 
to prevent load on the system.


Kind regards,
Sake

On 22-04-2024 20:33 CEST Erich Weiler wrote:

I also see this from 'ceph health detail':

# ceph health detail
HEALTH_WARN 1 filesystem is degraded; 1 MDSs report oversized cache; 1
MDSs behind on trimming
[WRN] FS_DEGRADED: 1 filesystem is degraded
  fs slugfs is degraded
[WRN] MDS_CACHE_OVERSIZED: 1 MDSs report oversized cache
  mds.slugfs.pr-md-01.xdtppo(mds.0): MDS cache is too large
(19GB/8GB); 0 inodes in use by clients, 0 stray files
[WRN] MDS_TRIM: 1 MDSs behind on trimming
  mds.slugfs.pr-md-01.xdtppo(mds.0): Behind on trimming (127084/250)
max_segments: 250, num_segments: 127084

MDS cache too large?  The mds process is taking up 22GB right now and
starting to swap my server, so maybe it somehow is too large

On 4/22/24 11:17 AM, Erich Weiler wrote:

Hi All,

We have a somewhat serious situation where we have a cephfs filesystem
(18.2.1), and 2 active MDSs (one standby).  I tried to restart one of
the active daemons to unstick a bunch of blocked requests, and the
standby went into 'replay' for a very long time, then RAM on that MDS
server filled up, and it just stayed there for a while then eventually
appeared to give up and switched to the standby, but the cycle started
again.  So I restarted that MDS, and now I'm in a situation where I see
this:

# ceph fs status
slugfs - 29 clients
==
RANK   STATE             MDS             ACTIVITY   DNS    INOS   DIRS   CAPS
 0     replay   slugfs.pr-md-01.xdtppo             3958k  57.1k  12.2k      0
 1     resolve  slugfs.pr-md-02.sbblqq                 0      3      1      0
        POOL           TYPE     USED  AVAIL
  cephfs_metadata    metadata    997G  2948G
cephfs_md_and_data     data        0  87.6T
    cephfs_data        data      773T   175T
   STANDBY MDS
slugfs.pr-md-03.mclckv
MDS version: ceph version 18.2.1
(7fe91d5d5842e04be3b4f514d6dd990c54b29c76) reef (stable)

It just stays there indefinitely.  All my clients are hung.  I tried
restarting all MDS daemons and they just went back to this state after
coming back up.

Is there any way I can somehow escape this state of indefinite
replay/resolve?

Thanks so much!  I'm kinda nervous since none of my clients have
filesystem access at the moment...

cheers,
erich

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Stuck in replay?

2024-04-22 Thread Erich Weiler

Possibly, but it would be pretty time consuming and difficult...

Is it maybe a RAM issue since my MDS RAM is filling up?  Should maybe I 
bring up another MDS on another server with huge amount of RAM and move 
the MDS there in hopes it will have enough RAM to complete the replay?


On 4/22/24 11:37 AM, Sake Ceph wrote:

Just a question: is it possible to block or disable all clients? Just to 
prevent load on the system.

Kind regards,
Sake

On 22-04-2024 20:33 CEST Erich Weiler wrote:

I also see this from 'ceph health detail':


# ceph health detail
HEALTH_WARN 1 filesystem is degraded; 1 MDSs report oversized cache; 1
MDSs behind on trimming
[WRN] FS_DEGRADED: 1 filesystem is degraded
  fs slugfs is degraded
[WRN] MDS_CACHE_OVERSIZED: 1 MDSs report oversized cache
  mds.slugfs.pr-md-01.xdtppo(mds.0): MDS cache is too large
(19GB/8GB); 0 inodes in use by clients, 0 stray files
[WRN] MDS_TRIM: 1 MDSs behind on trimming
  mds.slugfs.pr-md-01.xdtppo(mds.0): Behind on trimming (127084/250)
max_segments: 250, num_segments: 127084

MDS cache too large?  The mds process is taking up 22GB right now and
starting to swap my server, so maybe it somehow is too large

On 4/22/24 11:17 AM, Erich Weiler wrote:

Hi All,

We have a somewhat serious situation where we have a cephfs filesystem
(18.2.1), and 2 active MDSs (one standby).  I tried to restart one of
the active daemons to unstick a bunch of blocked requests, and the
standby went into 'replay' for a very long time, then RAM on that MDS
server filled up, and it just stayed there for a while then eventually
appeared to give up and switched to the standby, but the cycle started
again.  So I restarted that MDS, and now I'm in a situation where I see
this:

# ceph fs status
slugfs - 29 clients
==
RANK   STATE             MDS             ACTIVITY   DNS    INOS   DIRS   CAPS
 0     replay   slugfs.pr-md-01.xdtppo             3958k  57.1k  12.2k      0
 1     resolve  slugfs.pr-md-02.sbblqq                 0      3      1      0
        POOL           TYPE     USED  AVAIL
  cephfs_metadata    metadata    997G  2948G
cephfs_md_and_data     data        0  87.6T
    cephfs_data        data      773T   175T
   STANDBY MDS
slugfs.pr-md-03.mclckv
MDS version: ceph version 18.2.1
(7fe91d5d5842e04be3b4f514d6dd990c54b29c76) reef (stable)

It just stays there indefinitely.  All my clients are hung.  I tried
restarting all MDS daemons and they just went back to this state after
coming back up.

Is there any way I can somehow escape this state of indefinite
replay/resolve?

Thanks so much!  I'm kinda nervous since none of my clients have
filesystem access at the moment...

cheers,
erich

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Stuck in replay?

2024-04-22 Thread Erich Weiler

I also see this from 'ceph health detail':

# ceph health detail
HEALTH_WARN 1 filesystem is degraded; 1 MDSs report oversized cache; 1 
MDSs behind on trimming

[WRN] FS_DEGRADED: 1 filesystem is degraded
fs slugfs is degraded
[WRN] MDS_CACHE_OVERSIZED: 1 MDSs report oversized cache
mds.slugfs.pr-md-01.xdtppo(mds.0): MDS cache is too large 
(19GB/8GB); 0 inodes in use by clients, 0 stray files

[WRN] MDS_TRIM: 1 MDSs behind on trimming
mds.slugfs.pr-md-01.xdtppo(mds.0): Behind on trimming (127084/250) 
max_segments: 250, num_segments: 127084


MDS cache too large?  The mds process is taking up 22GB right now and 
starting to swap my server, so maybe it somehow is too large
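
(A quick way to see what the daemon itself thinks its cache usage is - a 
sketch, assuming the 'cache status' command is available via ceph tell on 
this release:)

ceph tell mds.slugfs.pr-md-01.xdtppo cache status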


On 4/22/24 11:17 AM, Erich Weiler wrote:

Hi All,

We have a somewhat serious situation where we have a cephfs filesystem 
(18.2.1), and 2 active MDSs (one standby).  I tried to restart one of 
the active daemons to unstick a bunch of blocked requests, and the 
standby went into 'replay' for a very long time, then RAM on that MDS 
server filled up, and it just stayed there for a while then eventually 
appeared to give up and switched to the standby, but the cycle started 
again.  So I restarted that MDS, and now I'm in a situation where I see 
this:


# ceph fs status
slugfs - 29 clients
==
RANK   STATE             MDS             ACTIVITY   DNS    INOS   DIRS   CAPS
 0     replay   slugfs.pr-md-01.xdtppo             3958k  57.1k  12.2k      0
 1     resolve  slugfs.pr-md-02.sbblqq                 0      3      1      0
        POOL           TYPE     USED  AVAIL
  cephfs_metadata    metadata    997G  2948G
cephfs_md_and_data     data        0  87.6T
    cephfs_data        data      773T   175T
  STANDBY MDS
slugfs.pr-md-03.mclckv
MDS version: ceph version 18.2.1 
(7fe91d5d5842e04be3b4f514d6dd990c54b29c76) reef (stable)


It just stays there indefinitely.  All my clients are hung.  I tried 
restarting all MDS daemons and they just went back to this state after 
coming back up.


Is there any way I can somehow escape this state of indefinite 
replay/resolve?


Thanks so much!  I'm kinda nervous since none of my clients have 
filesystem access at the moment...


cheers,
erich

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Stuck in replay?

2024-04-22 Thread Erich Weiler

Hi All,

We have a somewhat serious situation where we have a cephfs filesystem 
(18.2.1), and 2 active MDSs (one standby).  I tried to restart one of 
the active daemons to unstick a bunch of blocked requests, and the 
standby went into 'replay' for a very long time, then RAM on that MDS 
server filled up, and it just stayed there for a while then eventually 
appeared to give up and switched to the standby, but the cycle started 
again.  So I restarted that MDS, and now I'm in a situation where I see 
this:


# ceph fs status
slugfs - 29 clients
==
RANK   STATE             MDS             ACTIVITY   DNS    INOS   DIRS   CAPS
 0     replay   slugfs.pr-md-01.xdtppo             3958k  57.1k  12.2k      0
 1     resolve  slugfs.pr-md-02.sbblqq                 0      3      1      0
        POOL           TYPE     USED  AVAIL
  cephfs_metadata    metadata    997G  2948G
cephfs_md_and_data     data        0  87.6T
    cephfs_data        data      773T   175T
 STANDBY MDS
slugfs.pr-md-03.mclckv
MDS version: ceph version 18.2.1 
(7fe91d5d5842e04be3b4f514d6dd990c54b29c76) reef (stable)


It just stays there indefinitely.  All my clients are hung.  I tried 
restarting all MDS daemons and they just went back to this state after 
coming back up.


Is there any way I can somehow escape this state of indefinite 
replay/resolve?


Thanks so much!  I'm kinda nervous since none of my clients have 
filesystem access at the moment...


cheers,
erich
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: MDS Behind on Trimming...

2024-04-19 Thread Erich Weiler

Hi Xiubo,

Never mind, I was wrong - most of the blocked ops were 12 hours old.  Ugh.

I restarted the MDS daemon to clear them.
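
(Before restarting, this is roughly how I've been inspecting them - a 
sketch, using the daemon name shown by 'ceph orch ps':)

ceph tell mds.slugfs.pr-md-01.xdtppo dump_blocked_ops
ceph tell mds.slugfs.pr-md-01.xdtppo ops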

I just reset to having one active MDS instead of two, let's see if that 
makes a difference.


I am beginning to think it may be impossible to catch the logs that 
matter here.  I feel like sometimes the blocked ops are just waiting 
because of load and sometimes they are waiting because they are stuck. 
But, it's really hard to tell which, without waiting a while.  But, I 
can't wait while having debug turned on because my root disks (which are 
150 GB large) fill up with debug logs in 20 minutes.  So it almost seems 
that unless I could somehow store many TB of debug logs we won't be able 
to catch this.


Let's see how having one MDS helps.  Or maybe I actually need like 4 
MDSs because the load is too high for only one or two.  I don't know. 
Or maybe it's the lock issue you've been working on.  I guess I can test 
the lock order fix when it's available to test.


-erich

On 4/19/24 7:26 AM, Erich Weiler wrote:
So I woke up this morning and checked the blocked_ops again, there were 
150 of them.  But the age of each ranged from 500 to 4300 seconds.  So 
it seems as if they are eventually being processed.


I wonder if we are thinking about this in the wrong way?  Maybe I should 
be *adding* MDS daemons because my current ones are overloaded?


Can a single server hold multiple MDS daemons?  Right now I have three 
physical servers each with one MDS daemon on it.


I can still try reducing to one.  And I'll keep an eye on blocked ops to 
see if any get to a very old age (and are thus wedged).


-erich

On 4/18/24 8:55 PM, Xiubo Li wrote:

Okay, please try it to set only one active mds.


On 4/19/24 11:54, Erich Weiler wrote:

We have 2 active MDS daemons and one standby.

On 4/18/24 8:52 PM, Xiubo Li wrote:

BTW, how many active MDS are you using?


On 4/19/24 10:55, Erich Weiler wrote:
OK, I'm sure I caught it in the right order this time, the logs 
should definitely show when the blocked/slow requests start.  Check 
out these logs and dumps:


http://hgwdev.gi.ucsc.edu/~weiler/

It's a 762 MB tarball but it uncompresses to 16 GB.

-erich


On 4/18/24 6:57 PM, Xiubo Li wrote:

Okay, could you try this with 18.2.0?

I suspect it was introduced by:

commit e610179a6a59c463eb3d85e87152ed3268c808ff
Author: Patrick Donnelly 
Date:   Mon Jul 17 16:10:59 2023 -0400

    mds: drop locks and retry when lock set changes

    An optimization was added to avoid an unnecessary gather on the inode
    filelock when the client can safely get the file size without also
    getting issued the requested caps. However, if a retry of getattr
    is necessary, this conditional inclusion of the inode filelock
    can cause lock-order violations resulting in deadlock.

    So, if we've already acquired some of the inode's locks then we must
    drop locks and retry.

    Fixes: https://tracker.ceph.com/issues/62052
    Fixes: c822b3e2573578c288d170d1031672b74e02dced
    Signed-off-by: Patrick Donnelly 
    (cherry picked from commit b5719ac32fe6431131842d62ffaf7101c03e9bac)



On 4/19/24 09:54, Erich Weiler wrote:
I'm on 18.2.1.  I think I may have gotten the timing off on the 
logs and dumps so I'll try again.  Just really hard to capture 
because I need to kind of be looking at it in real time to 
capture it. Hang on, lemme see if I can get another capture...


-erich

On 4/18/24 6:35 PM, Xiubo Li wrote:


BTW, which ceph version you are using ?



On 4/12/24 04:22, Erich Weiler wrote:
BTW - it just happened again, I upped the debugging settings as 
you instructed and got more dumps (then returned the debug 
settings to normal).


Attached are the new dumps.

Thanks again,
erich

On 4/9/24 9:00 PM, Xiubo Li wrote:


On 4/10/24 11:48, Erich Weiler wrote:
Does that mean it could be the lock order bug 
(https://tracker.ceph.com/issues/62123) as Xiubo suggested?


I have raised one PR to fix the lock order issue; if 
possible please give it a try to see whether it resolves this issue.


Thank you!  Yeah, this issue is happening every couple days 
now. It just happened again today and I got more MDS dumps. 
If it would help, let me know and I can send them!


Once this happens, it would be better if you could enable the MDS debug 
logs:


debug mds = 20

debug ms = 1

And then provide the debug logs together with the MDS dumps.


I assume if this fix is approved and backported it will then 
appear in like 18.2.3 or something?



Yeah, it will be backported after being well tested.

- Xiubo


Thanks again,
erich


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Question about PR merge

2024-04-17 Thread Erich Weiler

Have you already shared information about this issue? Please do if not.


I am working with Xiubo Li and providing debugging information - in 
progress!



I was
wondering if it would be included in 18.2.3 which I *think* should be
released soon?  Is there any way of knowing if that is true?


This PR is primarily a debugging tool. It will not make 18.2.3 as it's
not even merged to main yet.


Ah, OK.  I hope some solution can be had soon for this item if Xiubo 
figures it out - it's requiring constant attention to keep my filesystem 
from hanging, or restarting MDS daemons multiple times a day to 
"unstick" the filesystem on random cluster nodes.  We think it's due to 
lock contention/deadlocking.


Possibly it's not affecting others as much as me...  We have an HPC 
cluster hammering the filesystem (18.2.1) and the MDS daemons seems to 
be reporting lock issues pretty frequently while nodes and processes 
fighting to get file and directory locks, and deadlocking (we think).


I'll keep working with Xiubo.

-erich
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Question about PR merge

2024-04-17 Thread Erich Weiler

Hello,

We are tracking PR #56805:

https://github.com/ceph/ceph/pull/56805

And the resolution of this item would potentially fix a pervasive and 
ongoing issue that needs daily attention in our cephfs cluster.  I was 
wondering if it would be included in 18.2.3 which I *think* should be 
released soon?  Is there any way of knowing if that is true?


Thanks again,
erich
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] How to make config changes stick for MDS?

2024-04-16 Thread Erich Weiler

Hi All,

I'm having a crazy time getting config items to stick on my MDS daemons. 
 I'm running Reef 18.2.1 on RHEL 9 and the daemons are running in 
podman, I used cephadm to deploy the daemons.


I can adjust the config items in runtime, like so:

ceph tell mds.slugfs.pr-md-01.xdtppo config set mds_bal_interval -1

But for the life of me I cannot get that to stick when I restart the MDS 
daemon.


I've tried adding this to /etc/ceph/ceph.conf in the host server:

[mds]
mds_bal_interval = -1

But that doesn't get picked up on daemon restart.  I also added the same 
config segment to /etc/ceph/ceph.conf *inside* the container, no dice, 
still doesn't stick.  I even tried adding it to 
/var/lib/ceph//config/ceph.conf and it *still* doesn't stick 
across daemon restarts.
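
(One avenue I haven't fully ruled out is the centralized config database 
rather than ceph.conf - something like the lines below - though I haven't 
yet confirmed whether that survives an MDS restart in this setup:)

ceph config set mds mds_bal_interval -1
ceph config get mds mds_bal_interval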


Does anyone know how I can get MDS config items to stick across daemon 
reboots when the daemon is running in podman under RHEL?


Thanks much!

-erich
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: MDS Behind on Trimming...

2024-04-11 Thread Erich Weiler
Or...  Maybe the fix will first appear in the "centos-ceph-reef-test" 
repo that I see?  Is that how RedHat usually does it?


On 4/11/24 10:30, Erich Weiler wrote:
I guess we are specifically using the "centos-ceph-reef" repository, and 
it looks like the latest version in that repo is 18.2.2-1.el9s.  Will 
this fix appear in 18.2.2-2.el9s or something like that?  I don't know 
how often the release cycle updates the repos...?


On 4/11/24 09:40, Erich Weiler wrote:
I have raised one PR to fix the lock order issue; if possible 
please give it a try to see whether it resolves this issue.


That's great!  When do you think that will be available?

Thank you!  Yeah, this issue is happening every couple days now. It 
just happened again today and I got more MDS dumps.  If it would 
help, let me know and I can send them!



Once this happens, it would be better if you could enable the MDS debug logs:

debug mds = 20

debug ms = 1

And then provide the debug logs together with the MDS dumps.


OK next time I see it I'll do that.

-erich

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: MDS Behind on Trimming...

2024-04-11 Thread Erich Weiler
I guess we are specifically using the "centos-ceph-reef" repository, and 
it looks like the latest version in that repo is 18.2.2-1.el9s.  Will 
this fix appear in 18.2.2-2.el9s or something like that?  I don't know 
how often the release cycle updates the repos...?


On 4/11/24 09:40, Erich Weiler wrote:
I have raised one PR to fix the lock order issue; if possible please 
give it a try to see whether it resolves this issue.


That's great!  When do you think that will be available?

Thank you!  Yeah, this issue is happening every couple days now. It 
just happened again today and I got more MDS dumps.  If it would 
help, let me know and I can send them!



Once this happens, it would be better if you could enable the MDS debug logs:

debug mds = 20

debug ms = 1

And then provide the debug logs together with the MDS dumps.


OK next time I see it I'll do that.

-erich

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: MDS Behind on Trimming...

2024-04-11 Thread Erich Weiler
I have raised one PR to fix the lock order issue; if possible please 
give it a try to see whether it resolves this issue.


That's great!  When do you think that will be available?

Thank you!  Yeah, this issue is happening every couple days now. It 
just happened again today and I got more MDS dumps.  If it would help, 
let me know and I can send them!



Once this happens, it would be better if you could enable the MDS debug logs:

debug mds = 20

debug ms = 1

And then provide the debug logs together with the MDS dumps.


OK next time I see it I'll do that.
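
(My plan is to flip those on at runtime and revert as soon as I have a 
capture, roughly like this - a sketch using the centralized config 
database:)

ceph config set mds debug_mds 20
ceph config set mds debug_ms 1
# ...reproduce, collect the logs and dumps, then revert:
ceph config rm mds debug_mds
ceph config rm mds debug_ms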

-erich
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: MDS Behind on Trimming...

2024-04-09 Thread Erich Weiler
Does that mean it could be the lock order bug 
(https://tracker.ceph.com/issues/62123) as Xiubo suggested?


I have raised one PR to fix the lock order issue; if possible please 
give it a try to see whether it resolves this issue.


Thank you!  Yeah, this issue is happening every couple days now.  It 
just happened again today and I got more MDS dumps.  If it would help, 
let me know and I can send them!


I assume if this fix is approved and backported it will then appear in 
like 18.2.3 or something?


Thanks again,
erich
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: MDS Behind on Trimming...

2024-04-07 Thread Erich Weiler
Ah, I see.  Yes, we are already running version 18.2.1 on the server side (we 
just installed this cluster a few weeks ago from scratch).  So I guess if the 
fix has already been backported to that version, then we still have a problem.

Does that mean it could be the lock order bug 
(https://tracker.ceph.com/issues/62123) as Xiubo suggested?

Thanks again,
Erich

> On Apr 7, 2024, at 9:00 PM, Alexander E. Patrakov  wrote:
> 
> Hi Erich,
> 
>> On Mon, Apr 8, 2024 at 11:51 AM Erich Weiler  wrote:
>> 
>> Hi Xiubo,
>> 
>>> Thanks for your logs, and it should be the same issue with
>>> https://tracker.ceph.com/issues/62052, could you try to test with this
>>> fix again ?
>> 
>> This sounds good - but I'm not clear on what I should do?  I see a patch
>> in that tracker page, is that what you are referring to?  If so, how
>> would I apply such a patch?  Or is there simply a binary update I can
>> apply somehow to the MDS server software?
> 
> The backport of this patch (https://github.com/ceph/ceph/pull/53241)
> was merged on October 18, 2023, and Ceph 18.2.1 was released on
> December 18, 2023. Therefore, if you are running Ceph 18.2.1 on the
> server side, you already have the fix. If you are already running
> version 18.2.1 or 18.2.2 (to which you should upgrade anyway), please
> complain, as the purported fix is then ineffective.
> 
>> 
>> Thanks for helping!
>> 
>> -erich
>> 
>>> Please let me know if you still could see this bug then it should be the
>>> locker order bug as https://tracker.ceph.com/issues/62123.
>>> 
>>> Thanks
>>> 
>>> - Xiubo
>>> 
>>> 
>>> On 3/28/24 04:03, Erich Weiler wrote:
>>>> Hi All,
>>>> 
>>>> I've been battling this for a while and I'm not sure where to go from
>>>> here.  I have a Ceph health warning as such:
>>>> 
>>>> # ceph -s
>>>>  cluster:
>>>>id: 58bde08a-d7ed-11ee-9098-506b4b4da440
>>>>health: HEALTH_WARN
>>>>1 MDSs report slow requests
>>>>1 MDSs behind on trimming
>>>> 
>>>>  services:
>>>>mon: 5 daemons, quorum
>>>> pr-md-01,pr-md-02,pr-store-01,pr-store-02,pr-md-03 (age 5d)
>>>>mgr: pr-md-01.jemmdf(active, since 3w), standbys: pr-md-02.emffhz
>>>>mds: 1/1 daemons up, 2 standby
>>>>osd: 46 osds: 46 up (since 9h), 46 in (since 2w)
>>>> 
>>>>  data:
>>>>volumes: 1/1 healthy
>>>>pools:   4 pools, 1313 pgs
>>>>objects: 260.72M objects, 466 TiB
>>>>usage:   704 TiB used, 424 TiB / 1.1 PiB avail
>>>>pgs: 1306 active+clean
>>>> 4active+clean+scrubbing+deep
>>>> 3active+clean+scrubbing
>>>> 
>>>>  io:
>>>>client:   123 MiB/s rd, 75 MiB/s wr, 109 op/s rd, 1.40k op/s wr
>>>> 
>>>> And the specifics are:
>>>> 
>>>> # ceph health detail
>>>> HEALTH_WARN 1 MDSs report slow requests; 1 MDSs behind on trimming
>>>> [WRN] MDS_SLOW_REQUEST: 1 MDSs report slow requests
>>>>mds.slugfs.pr-md-01.xdtppo(mds.0): 99 slow requests are blocked >
>>>> 30 secs
>>>> [WRN] MDS_TRIM: 1 MDSs behind on trimming
>>>>mds.slugfs.pr-md-01.xdtppo(mds.0): Behind on trimming (13884/250)
>>>> max_segments: 250, num_segments: 13884
>>>> 
>>>> That "num_segments" number slowly keeps increasing.  I suspect I just
>>>> need to tell the MDS servers to trim faster but after hours of
>>>> googling around I just can't figure out the best way to do it. The
>>>> best I could come up with was to decrease "mds_cache_trim_decay_rate"
>>>> from 1.0 to .8 (to start), based on this page:
>>>> 
>>>> https://www.suse.com/support/kb/doc/?id=19740
>>>> 
>>>> But it doesn't seem to help, maybe I should decrease it further? I am
>>>> guessing this must be a common issue...?  I am running Reef on the MDS
>>>> servers, but most clients are on Quincy.
>>>> 
>>>> Thanks for any advice!
>>>> 
>>>> cheers,
>>>> erich
>>>> ___
>>>> ceph-users mailing list -- ceph-users@ceph.io
>>>> To unsubscribe send an email to ceph-users-le...@ceph.io
>>>> 
>>> 
>> ___
>> ceph-users mailing list -- ceph-users@ceph.io
>> To unsubscribe send an email to ceph-users-le...@ceph.io
> 
> 
> 
> --
> Alexander E. Patrakov
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: MDS Behind on Trimming...

2024-04-07 Thread Erich Weiler

Hi Xiubo,

Thanks for your logs, and it should be the same issue with 
https://tracker.ceph.com/issues/62052, could you try to test with this 
fix again ?


This sounds good - but I'm not clear on what I should do?  I see a patch 
in that tracker page, is that what you are referring to?  If so, how 
would I apply such a patch?  Or is there simply a binary update I can 
apply somehow to the MDS server software?


Thanks for helping!

-erich

Please let me know if you still see this bug; then it should be the 
lock order bug as in https://tracker.ceph.com/issues/62123.


Thanks

- Xiubo


On 3/28/24 04:03, Erich Weiler wrote:

Hi All,

I've been battling this for a while and I'm not sure where to go from 
here.  I have a Ceph health warning as such:


# ceph -s
  cluster:
    id: 58bde08a-d7ed-11ee-9098-506b4b4da440
    health: HEALTH_WARN
    1 MDSs report slow requests
    1 MDSs behind on trimming

  services:
    mon: 5 daemons, quorum 
pr-md-01,pr-md-02,pr-store-01,pr-store-02,pr-md-03 (age 5d)

    mgr: pr-md-01.jemmdf(active, since 3w), standbys: pr-md-02.emffhz
    mds: 1/1 daemons up, 2 standby
    osd: 46 osds: 46 up (since 9h), 46 in (since 2w)

  data:
    volumes: 1/1 healthy
    pools:   4 pools, 1313 pgs
    objects: 260.72M objects, 466 TiB
    usage:   704 TiB used, 424 TiB / 1.1 PiB avail
    pgs: 1306 active+clean
 4    active+clean+scrubbing+deep
 3    active+clean+scrubbing

  io:
    client:   123 MiB/s rd, 75 MiB/s wr, 109 op/s rd, 1.40k op/s wr

And the specifics are:

# ceph health detail
HEALTH_WARN 1 MDSs report slow requests; 1 MDSs behind on trimming
[WRN] MDS_SLOW_REQUEST: 1 MDSs report slow requests
    mds.slugfs.pr-md-01.xdtppo(mds.0): 99 slow requests are blocked > 
30 secs

[WRN] MDS_TRIM: 1 MDSs behind on trimming
    mds.slugfs.pr-md-01.xdtppo(mds.0): Behind on trimming (13884/250) 
max_segments: 250, num_segments: 13884


That "num_segments" number slowly keeps increasing.  I suspect I just 
need to tell the MDS servers to trim faster but after hours of 
googling around I just can't figure out the best way to do it. The 
best I could come up with was to decrease "mds_cache_trim_decay_rate" 
from 1.0 to .8 (to start), based on this page:


https://www.suse.com/support/kb/doc/?id=19740

But it doesn't seem to help, maybe I should decrease it further? I am 
guessing this must be a common issue...?  I am running Reef on the MDS 
servers, but most clients are on Quincy.


Thanks for any advice!

cheers,
erich
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io




___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Multiple MDS Daemon needed?

2024-04-07 Thread Erich Weiler

Hi All,

We have a slurm cluster with 25 clients, each with 256 cores, each 
mounting a cephfs filesystem as their main storage target.  The workload 
can be heavy at times.


We have two active MDS daemons and one standby.  A lot of the time 
everything is healthy but we sometimes get warnings about MDS daemons 
being slow on requests, behind on trimming, etc.  I realize there may be 
a bug in play, but also, I was wondering if we simply didn't have enough 
MDS daemons to handle the load.  Is there a way to know if adding an MDS 
daemon would help?  We could add a third active MDS if needed.  But I 
don't want to start adding a bunch of MDS's if that won't help.
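
(If it comes to that, my understanding is that promoting the standby to a 
third active rank is just a matter of raising max_mds - a sketch, 
assuming the standby can be spared:)

ceph fs set slugfs max_mds 3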


The OSD servers seem fine.  It's mainly the MDS instances that are 
complaining.


We are running reef 18.2.1.

For reference, when things look healthy:

# ceph fs status slugfs
slugfs - 34 clients
==
RANK  STATE            MDS             ACTIVITY      DNS    INOS   DIRS   CAPS
 0    active  slugfs.pr-md-03.mclckv  Reqs:  273 /s  2759k  2636k   362k  1079k
 1    active  slugfs.pr-md-01.xdtppo  Reqs:  194 /s   868k   674k  67.3k   351k

        POOL           TYPE     USED  AVAIL
  cephfs_metadata    metadata    127G  3281G
cephfs_md_and_data     data        0  98.3T
    cephfs_data        data      740T   196T
 STANDBY MDS
slugfs.pr-md-02.sbblqq
MDS version: ceph version 18.2.1 
(7fe91d5d5842e04be3b4f514d6dd990c54b29c76) reef (stable)


# ceph -s
  cluster:
id: 58bde08a-d7ed-11ee-9098-506b4b4da440
health: HEALTH_OK

  services:
mon: 5 daemons, quorum 
pr-md-01,pr-md-02,pr-store-01,pr-store-02,pr-md-03 (age 5d)

mgr: pr-md-01.jemmdf(active, since 5w), standbys: pr-md-02.emffhz
mds: 2/2 daemons up, 1 standby
osd: 46 osds: 46 up (since 8d), 46 in (since 4w)

  data:
volumes: 1/1 healthy
pools:   4 pools, 1313 pgs
objects: 271.17M objects, 493 TiB
usage:   744 TiB used, 384 TiB / 1.1 PiB avail
pgs: 1307 active+clean
 4active+clean+scrubbing
 2active+clean+scrubbing+deep

  io:
client:   39 MiB/s rd, 108 MiB/s wr, 1.96k op/s rd, 54 op/s wr




But when things are in "warning" mode, it looks like this:

# ceph -s
  cluster:
id: 58bde08a-d7ed-11ee-9098-506b4b4da440
health: HEALTH_WARN
1 filesystem is degraded
1 clients failing to advance oldest client/flush tid
1 MDSs report slow requests
1 MDSs behind on trimming

  services:
mon: 5 daemons, quorum 
pr-md-01,pr-md-02,pr-store-01,pr-store-02,pr-md-03 (age 5d)

mgr: pr-md-01.jemmdf(active, since 5w), standbys: pr-md-02.emffhz
mds: 2/2 daemons up, 1 standby
osd: 46 osds: 46 up (since 8d), 46 in (since 4w)

  data:
volumes: 1/1 healthy
pools:   4 pools, 1313 pgs
objects: 271.28M objects, 494 TiB
usage:   746 TiB used, 382 TiB / 1.1 PiB avail
pgs: 1307 active+clean
 5active+clean+scrubbing
 1active+clean+scrubbing+deep

  io:
client:   55 MiB/s rd, 2.6 MiB/s wr, 15 op/s rd, 46 op/s wr

And this:

# ceph health detail
HEALTH_WARN 2 clients failing to advance oldest client/flush tid; 2 MDSs 
report slow requests; 1 MDSs behind on trimming
[WRN] MDS_CLIENT_OLDEST_TID: 2 clients failing to advance oldest 
client/flush tid
mds.slugfs.pr-md-01.xdtppo(mds.0): Client phoenix-06.prism failing 
to advance its oldest client/flush tid.  client_id: 125780
mds.slugfs.pr-md-02.sbblqq(mds.1): Client phoenix-00.prism failing 
to advance its oldest client/flush tid.  client_id: 99385

[WRN] MDS_SLOW_REQUEST: 2 MDSs report slow requests
mds.slugfs.pr-md-01.xdtppo(mds.0): 4 slow requests are blocked > 30 
secs
mds.slugfs.pr-md-02.sbblqq(mds.1): 67 slow requests are blocked > 
30 secs

[WRN] MDS_TRIM: 1 MDSs behind on trimming
mds.slugfs.pr-md-02.sbblqq(mds.1): Behind on trimming (109410/250) 
max_segments: 250, num_segments: 109410


The "cure" is the restart the active MDS daemons, one at a time.  Then 
everything becomes healthy again, for a time.


We also have the following MDS config items in play:

mds_cache_memory_limit = 8589934592
mds_cache_trim_decay_rate = .6
mds_log_max_segments = 250
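
(For reference, one way to apply those cluster-wide is the centralized 
config database, roughly as below - a sketch, not necessarily how we 
actually set them:)

ceph config set mds mds_cache_memory_limit 8589934592
ceph config set mds mds_cache_trim_decay_rate 0.6
ceph config set mds mds_log_max_segments 250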

Thanks for any pointers!

cheers,
erich
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: MDS Behind on Trimming...

2024-03-28 Thread Erich Weiler
Could there be an issue with the fact that the servers (MDS, MGR, MON, 
OSD) are running reef and all the clients are running quincy?


I can easily enough get the new reef repo in for all our clients (Ubuntu 
22.04) and upgrade the clients to reef if that might help..?


On 3/28/24 3:05 PM, Erich Weiler wrote:
I asked the user and they said no, no rsync involved.  Although I 
rsync'd 500TB into this filesystem in the beginning without incident, so 
hopefully it's not a big deal here.


I'm asking the user what their workflow does to try and pin this down.

Are there any other known reason why a slow request would start on a 
certain inode, then block a bunch of cache segments behind it, until the 
MDS is restarted?


Once I restart the MDS daemon that is slow, it shows the cache segments 
transfer to the other MDS server and very quickly drop to zero, then 
everything is healthy again, the stuck directory in question responds 
again and all is well.  Then a few hours later it started happening 
again (not always the same directory).


I hope I'm not experiencing a bug, but I can't see what would be causing 
this...


On 3/28/24 2:37 PM, Alexander E. Patrakov wrote:

Hello Erich,

Does the workload, by any chance, involve rsync? It is unfortunately
well-known for triggering such issues. A workaround is to export the
directory via NFS and run rsync against the NFS mount instead of
directly against CephFS.

On Fri, Mar 29, 2024 at 4:58 AM Erich Weiler  wrote:


MDS logs show:

Mar 28 13:42:29 pr-md-02.prism ceph-mds[1464328]: log_channel(cluster)
log [WRN] : 16 slow requests, 0 included below; oldest blocked for >
3676.400077 secs
Mar 28 13:42:30 pr-md-02.prism ceph-mds[1464328]:
mds.slugfs.pr-md-02.sbblqq Updating MDS map to version 22775 from mon.3
Mar 28 13:42:34 pr-md-02.prism ceph-mds[1464328]: log_channel(cluster)
log [WRN] : 320 slow requests, 5 included below; oldest blocked for >
3681.400104 secs
Mar 28 13:42:34 pr-md-02.prism ceph-mds[1464328]: log_channel(cluster)
log [WRN] : slow request 3668.805732 seconds old, received at
2024-03-28T19:41:25.772531+: client_request(client.99375:574268
getattr AsXsFs #0x1000c097307 2024-03-28T19:41:25.770954+
caller_uid=30150, caller_gid=600{600,608,999,}) currently joining batch
getattr
Mar 28 13:42:34 pr-md-02.prism ceph-mds[1464328]: log_channel(cluster)
log [WRN] : slow request 3667.883853 seconds old, received at
2024-03-28T19:41:26.694410+: client_request(client.99390:374844
getattr AsXsFs #0x1000c097307 2024-03-28T19:41:26.696172+
caller_uid=30150, caller_gid=600{600,608,999,}) currently joining batch
getattr
Mar 28 13:42:34 pr-md-02.prism ceph-mds[1464328]: log_channel(cluster)
log [WRN] : slow request 3663.724571 seconds old, received at
2024-03-28T19:41:30.853692+: client_request(client.99390:375258
getattr AsXsFs #0x1000c097307 2024-03-28T19:41:30.852166+
caller_uid=30150, caller_gid=600{600,608,999,}) currently joining batch
getattr
Mar 28 13:42:34 pr-md-02.prism ceph-mds[1464328]: log_channel(cluster)
log [WRN] : slow request 3681.399582 seconds old, received at
2024-03-28T19:41:13.178681+: client_request(client.99385:11712080
getattr AsXsFs #0x1000c097307 2024-03-28T19:41:13.178764+
caller_uid=30150, caller_gid=600{600,608,999,}) currently failed to
rdlock, waiting
Mar 28 13:42:34 pr-md-02.prism ceph-mds[1464328]: log_channel(cluster)
log [WRN] : slow request 3680.508972 seconds old, received at
2024-03-28T19:41:14.069291+: client_request(client.99385:11712556
getattr AsXsFs #0x1000c097307 2024-03-28T19:41:14.070764+
caller_uid=30150, caller_gid=600{600,608,999,}) currently joining batch
getattr

The client IDs map to several of our cluster nodes but the inode
reference always refers to the same directory in these recent logs:

/private/groups/shapirolab/brock/r2/cactus_coord

That directory does not respond to an 'ls', but other directories
directly above it do just fine.  Maybe it's a bad cache item on the MDS?

# ceph health detail
HEALTH_WARN 2 clients failing to advance oldest client/flush tid; 1 MDSs
report slow requests; 1 MDSs behind on trimming
[WRN] MDS_CLIENT_OLDEST_TID: 2 clients failing to advance oldest
client/flush tid
  mds.slugfs.pr-md-02.sbblqq(mds.0): Client mustard failing to
advance its oldest client/flush tid.  client_id: 101305
  mds.slugfs.pr-md-01.xdtppo(mds.1): Client  failing to advance its
oldest client/flush tid.  client_id: 101305
[WRN] MDS_SLOW_REQUEST: 1 MDSs report slow requests
  mds.slugfs.pr-md-02.sbblqq(mds.0): 201 slow requests are blocked >
30 secs
[WRN] MDS_TRIM: 1 MDSs behind on trimming
  mds.slugfs.pr-md-02.sbblqq(mds.0): Behind on trimming (4786/250)
max_segments: 250, num_segments: 4786

I think that this is somehow causing the "slow requests", on the nodes 
listed in the logs, as that directory is inaccessible.  And maybe the 
'behind on trimming' part is also related, as it can't

[ceph-users] Re: MDS Behind on Trimming...

2024-03-28 Thread Erich Weiler
I asked the user and they said no, no rsync involved.  Although I 
rsync'd 500TB into this filesystem in the beginning without incident, so 
hopefully it's not a big deal here.


I'm asking the user what their workflow does to try and pin this down.

Are there any other known reasons why a slow request would start on a 
certain inode, then block a bunch of cache segments behind it, until the 
MDS is restarted?


Once I restart the MDS daemon that is slow, it shows the cache segments 
transfer to the other MDS server and very quickly drop to zero, then 
everything is healthy again, the stuck directory in question responds 
again and all is well.  Then a few hours later it started happening 
again (not always the same directory).


I hope I'm not experiencing a bug, but I can't see what would be causing 
this...


On 3/28/24 2:37 PM, Alexander E. Patrakov wrote:

Hello Erich,

Does the workload, by any chance, involve rsync? It is unfortunately
well-known for triggering such issues. A workaround is to export the
directory via NFS and run rsync against the NFS mount instead of
directly against CephFS.

On Fri, Mar 29, 2024 at 4:58 AM Erich Weiler  wrote:


MDS logs show:

Mar 28 13:42:29 pr-md-02.prism ceph-mds[1464328]: log_channel(cluster)
log [WRN] : 16 slow requests, 0 included below; oldest blocked for >
3676.400077 secs
Mar 28 13:42:30 pr-md-02.prism ceph-mds[1464328]:
mds.slugfs.pr-md-02.sbblqq Updating MDS map to version 22775 from mon.3
Mar 28 13:42:34 pr-md-02.prism ceph-mds[1464328]: log_channel(cluster)
log [WRN] : 320 slow requests, 5 included below; oldest blocked for >
3681.400104 secs
Mar 28 13:42:34 pr-md-02.prism ceph-mds[1464328]: log_channel(cluster)
log [WRN] : slow request 3668.805732 seconds old, received at
2024-03-28T19:41:25.772531+: client_request(client.99375:574268
getattr AsXsFs #0x1000c097307 2024-03-28T19:41:25.770954+
caller_uid=30150, caller_gid=600{600,608,999,}) currently joining batch
getattr
Mar 28 13:42:34 pr-md-02.prism ceph-mds[1464328]: log_channel(cluster)
log [WRN] : slow request 3667.883853 seconds old, received at
2024-03-28T19:41:26.694410+: client_request(client.99390:374844
getattr AsXsFs #0x1000c097307 2024-03-28T19:41:26.696172+
caller_uid=30150, caller_gid=600{600,608,999,}) currently joining batch
getattr
Mar 28 13:42:34 pr-md-02.prism ceph-mds[1464328]: log_channel(cluster)
log [WRN] : slow request 3663.724571 seconds old, received at
2024-03-28T19:41:30.853692+: client_request(client.99390:375258
getattr AsXsFs #0x1000c097307 2024-03-28T19:41:30.852166+
caller_uid=30150, caller_gid=600{600,608,999,}) currently joining batch
getattr
Mar 28 13:42:34 pr-md-02.prism ceph-mds[1464328]: log_channel(cluster)
log [WRN] : slow request 3681.399582 seconds old, received at
2024-03-28T19:41:13.178681+: client_request(client.99385:11712080
getattr AsXsFs #0x1000c097307 2024-03-28T19:41:13.178764+
caller_uid=30150, caller_gid=600{600,608,999,}) currently failed to
rdlock, waiting
Mar 28 13:42:34 pr-md-02.prism ceph-mds[1464328]: log_channel(cluster)
log [WRN] : slow request 3680.508972 seconds old, received at
2024-03-28T19:41:14.069291+: client_request(client.99385:11712556
getattr AsXsFs #0x1000c097307 2024-03-28T19:41:14.070764+
caller_uid=30150, caller_gid=600{600,608,999,}) currently joining batch
getattr

The client IDs map to several of our cluster nodes but the inode
reference always refers to the same directory in these recent logs:

/private/groups/shapirolab/brock/r2/cactus_coord

That directory does not respond to an 'ls', but other directories
directly above it do just fine.  Maybe it's a bad cache item on the MDS?

# ceph health detail
HEALTH_WARN 2 clients failing to advance oldest client/flush tid; 1 MDSs
report slow requests; 1 MDSs behind on trimming
[WRN] MDS_CLIENT_OLDEST_TID: 2 clients failing to advance oldest
client/flush tid
  mds.slugfs.pr-md-02.sbblqq(mds.0): Client mustard failing to
advance its oldest client/flush tid.  client_id: 101305
  mds.slugfs.pr-md-01.xdtppo(mds.1): Client  failing to advance its
oldest client/flush tid.  client_id: 101305
[WRN] MDS_SLOW_REQUEST: 1 MDSs report slow requests
  mds.slugfs.pr-md-02.sbblqq(mds.0): 201 slow requests are blocked >
30 secs
[WRN] MDS_TRIM: 1 MDSs behind on trimming
  mds.slugfs.pr-md-02.sbblqq(mds.0): Behind on trimming (4786/250)
max_segments: 250, num_segments: 4786

I think that this is somehow causing the "slow requests", on the nodes
listed in the logs, as that directory is inaccessible.  And maybe the
'behind on trimming' part is also related, as it can't trim past that
inode or something?

If I restart the MDS daemon this will clear (I've done it before).  But
it just comes back.  Often somewhere in the same directory
/private/groups/shapirolab/brock/...[something].

-erich

On 3/28/24 10:11 AM, Erich Weiler wrote:

Here are some of the MDS logs:


[ceph-users] Re: MDS Behind on Trimming...

2024-03-28 Thread Erich Weiler

MDS logs show:

Mar 28 13:42:29 pr-md-02.prism ceph-mds[1464328]: log_channel(cluster) 
log [WRN] : 16 slow requests, 0 included below; oldest blocked for > 
3676.400077 secs
Mar 28 13:42:30 pr-md-02.prism ceph-mds[1464328]: 
mds.slugfs.pr-md-02.sbblqq Updating MDS map to version 22775 from mon.3
Mar 28 13:42:34 pr-md-02.prism ceph-mds[1464328]: log_channel(cluster) 
log [WRN] : 320 slow requests, 5 included below; oldest blocked for > 
3681.400104 secs
Mar 28 13:42:34 pr-md-02.prism ceph-mds[1464328]: log_channel(cluster) 
log [WRN] : slow request 3668.805732 seconds old, received at 
2024-03-28T19:41:25.772531+: client_request(client.99375:574268 
getattr AsXsFs #0x1000c097307 2024-03-28T19:41:25.770954+ 
caller_uid=30150, caller_gid=600{600,608,999,}) currently joining batch 
getattr
Mar 28 13:42:34 pr-md-02.prism ceph-mds[1464328]: log_channel(cluster) 
log [WRN] : slow request 3667.883853 seconds old, received at 
2024-03-28T19:41:26.694410+: client_request(client.99390:374844 
getattr AsXsFs #0x1000c097307 2024-03-28T19:41:26.696172+ 
caller_uid=30150, caller_gid=600{600,608,999,}) currently joining batch 
getattr
Mar 28 13:42:34 pr-md-02.prism ceph-mds[1464328]: log_channel(cluster) 
log [WRN] : slow request 3663.724571 seconds old, received at 
2024-03-28T19:41:30.853692+: client_request(client.99390:375258 
getattr AsXsFs #0x1000c097307 2024-03-28T19:41:30.852166+ 
caller_uid=30150, caller_gid=600{600,608,999,}) currently joining batch 
getattr
Mar 28 13:42:34 pr-md-02.prism ceph-mds[1464328]: log_channel(cluster) 
log [WRN] : slow request 3681.399582 seconds old, received at 
2024-03-28T19:41:13.178681+: client_request(client.99385:11712080 
getattr AsXsFs #0x1000c097307 2024-03-28T19:41:13.178764+ 
caller_uid=30150, caller_gid=600{600,608,999,}) currently failed to 
rdlock, waiting
Mar 28 13:42:34 pr-md-02.prism ceph-mds[1464328]: log_channel(cluster) 
log [WRN] : slow request 3680.508972 seconds old, received at 
2024-03-28T19:41:14.069291+: client_request(client.99385:11712556 
getattr AsXsFs #0x1000c097307 2024-03-28T19:41:14.070764+ 
caller_uid=30150, caller_gid=600{600,608,999,}) currently joining batch 
getattr


The client IDs map to several of our cluster nodes but the inode 
reference always refers to the same directory in these recent logs:


/private/groups/shapirolab/brock/r2/cactus_coord

That directory does not respond to an 'ls', but other directories 
directly above it do just fine.  Maybe it's a bad cache item on the MDS?


# ceph health detail
HEALTH_WARN 2 clients failing to advance oldest client/flush tid; 1 MDSs 
report slow requests; 1 MDSs behind on trimming
[WRN] MDS_CLIENT_OLDEST_TID: 2 clients failing to advance oldest 
client/flush tid
mds.slugfs.pr-md-02.sbblqq(mds.0): Client mustard failing to 
advance its oldest client/flush tid.  client_id: 101305
mds.slugfs.pr-md-01.xdtppo(mds.1): Client  failing to advance its 
oldest client/flush tid.  client_id: 101305

[WRN] MDS_SLOW_REQUEST: 1 MDSs report slow requests
mds.slugfs.pr-md-02.sbblqq(mds.0): 201 slow requests are blocked > 
30 secs

[WRN] MDS_TRIM: 1 MDSs behind on trimming
mds.slugfs.pr-md-02.sbblqq(mds.0): Behind on trimming (4786/250) 
max_segments: 250, num_segments: 4786


I think that this is somehow causing the "slow requests", on the nodes 
listed in the logs, as that directory is inaccessible.  And maybe the 
'behind on trimming' part is also related, as it can't trim past that 
inode or something?


If I restart the MDS daemon this will clear (I've done it before).  But 
it just comes back.  Often somewhere in the same directory 
/private/groups/shapirolab/brock/...[something].


-erich

On 3/28/24 10:11 AM, Erich Weiler wrote:

Here are some of the MDS logs:

Mar 27 11:58:25 pr-md-01.prism ceph-mds[1296468]: log_channel(cluster) 
log [WRN] : slow request 511.703289 seconds old, received at 
2024-03-27T18:49:53.623192+: client_request(client.99375:459393 
getattr AsXsFs #0x100081b9ceb 2024-03-27T18:49:53.620806+ 
caller_uid=30150, caller_gid=600{600,608,999,}) currently joining batch 
getattr
Mar 27 11:58:25 pr-md-01.prism ceph-mds[1296468]: log_channel(cluster) 
log [WRN] : slow request 690.189459 seconds old, received at 
2024-03-27T18:46:55.137022+: client_request(client.99445:4189994 
getattr AsXsFs #0x100081b9ceb 2024-03-27T18:46:55.134857+ 
caller_uid=30150, caller_gid=600{600,608,999,}) currently joining batch 
getattr
Mar 27 11:58:25 pr-md-01.prism ceph-mds[1296468]: log_channel(cluster) 
log [WRN] : slow request 686.308604 seconds old, received at 
2024-03-27T18:46:59.017876+: client_request(client.99445:4190508 
getattr AsXsFs #0x100081b9ceb 2024-03-27T18:46:59.018864+ 
caller_uid=30150, caller_gid=600{600,608,999,}) currently joining batch 
getattr
Mar 27 11:58:25 pr-md-01.prism ceph-mds[1296468]: log_channel(cluster) 
log [WRN] : slow request 686.156943 sec

[ceph-users] Re: MDS Behind on Trimming...

2024-03-28 Thread Erich Weiler
Wow those are extremely useful commands.  Next time this happens I'll be 
sure to use them.  A quick test shows they work just great!


cheers,
erich

On 3/28/24 11:16 AM, Alexander E. Patrakov wrote:

Hi Erich,

Here is how to map the client ID to some extra info:

ceph tell mds.0 client ls id=99445

Here is how to map inode ID to the path:

ceph tell mds.0 dump inode 0x100081b9ceb | jq -r .path
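
(In case it's useful, that 'client ls' output is JSON, so the hostname 
and mount root for the session can be pulled out directly; field names 
are roughly what Reef reports and may differ slightly by client type:)

ceph tell mds.0 client ls id=99445 | jq -r '.[0].client_metadata | "\(.hostname)  \(.root // "-")  \(.kernel_version // .ceph_version // "-")"'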

On Fri, Mar 29, 2024 at 1:12 AM Erich Weiler  wrote:


Here are some of the MDS logs:

Mar 27 11:58:25 pr-md-01.prism ceph-mds[1296468]: log_channel(cluster)
log [WRN] : slow request 511.703289 seconds old, received at
2024-03-27T18:49:53.623192+: client_request(client.99375:459393
getattr AsXsFs #0x100081b9ceb 2024-03-27T18:49:53.620806+
caller_uid=30150, caller_gid=600{600,608,999,}) currently joining batch
getattr
Mar 27 11:58:25 pr-md-01.prism ceph-mds[1296468]: log_channel(cluster)
log [WRN] : slow request 690.189459 seconds old, received at
2024-03-27T18:46:55.137022+: client_request(client.99445:4189994
getattr AsXsFs #0x100081b9ceb 2024-03-27T18:46:55.134857+
caller_uid=30150, caller_gid=600{600,608,999,}) currently joining batch
getattr
Mar 27 11:58:25 pr-md-01.prism ceph-mds[1296468]: log_channel(cluster)
log [WRN] : slow request 686.308604 seconds old, received at
2024-03-27T18:46:59.017876+: client_request(client.99445:4190508
getattr AsXsFs #0x100081b9ceb 2024-03-27T18:46:59.018864+
caller_uid=30150, caller_gid=600{600,608,999,}) currently joining batch
getattr
Mar 27 11:58:25 pr-md-01.prism ceph-mds[1296468]: log_channel(cluster)
log [WRN] : slow request 686.156943 seconds old, received at
2024-03-27T18:46:59.169537+: client_request(client.99400:591887
getattr AsXsFs #0x100081b9ceb 2024-03-27T18:46:59.170644+
caller_uid=30150, caller_gid=600{600,608,999,}) currently joining batch
getattr
Mar 27 11:58:26 pr-md-01.prism ceph-mds[1296468]:
mds.slugfs.pr-md-01.xdtppo Updating MDS map to version 16631 from mon.0
Mar 27 11:58:30 pr-md-01.prism ceph-mds[1296468]: log_channel(cluster)
log [WRN] : 16 slow requests, 0 included below; oldest blocked for >
699.385743 secs
Mar 27 11:58:34 pr-md-01.prism ceph-mds[1296468]:
mds.slugfs.pr-md-01.xdtppo Updating MDS map to version 16632 from mon.0
Mar 27 11:58:35 pr-md-01.prism ceph-mds[1296468]: log_channel(cluster)
log [WRN] : 16 slow requests, 0 included below; oldest blocked for >
704.385896 secs
Mar 27 11:58:38 pr-md-01.prism ceph-mds[1296468]:
mds.slugfs.pr-md-01.xdtppo Updating MDS map to version 16633 from mon.0
Mar 27 11:58:40 pr-md-01.prism ceph-mds[1296468]: log_channel(cluster)
log [WRN] : 16 slow requests, 0 included below; oldest blocked for >
709.385979 secs
Mar 27 11:58:42 pr-md-01.prism ceph-mds[1296468]:
mds.slugfs.pr-md-01.xdtppo Updating MDS map to version 16634 from mon.0
Mar 27 11:58:45 pr-md-01.prism ceph-mds[1296468]: log_channel(cluster)
log [WRN] : 78 slow requests, 5 included below; oldest blocked for >
714.386040 secs
Mar 27 11:58:45 pr-md-01.prism ceph-mds[1296468]: log_channel(cluster)
log [WRN] : slow request 710.189838 seconds old, received at
2024-03-27T18:46:55.137022+: client_request(client.99445:4189994
getattr AsXsFs #0x100081b9ceb 2024-03-27T18:46:55.134857+
caller_uid=30150, caller_gid=600{600,608,999,}) currently joining batch
getattr
Mar 27 11:58:45 pr-md-01.prism ceph-mds[1296468]: log_channel(cluster)
log [WRN] : slow request 706.308983 seconds old, received at
2024-03-27T18:46:59.017876+: client_request(client.99445:4190508
getattr AsXsFs #0x100081b9ceb 2024-03-27T18:46:59.018864+
caller_uid=30150, caller_gid=600{600,608,999,}) currently joining batch
getattr
Mar 27 11:58:45 pr-md-01.prism ceph-mds[1296468]: log_channel(cluster)
log [WRN] : slow request 706.157322 seconds old, received at
2024-03-27T18:46:59.169537+: client_request(client.99400:591887
getattr AsXsFs #0x100081b9ceb 2024-03-27T18:46:59.170644+
caller_uid=30150, caller_gid=600{600,608,999,}) currently joining batch
getattr
Mar 27 11:58:45 pr-md-01.prism ceph-mds[1296468]: log_channel(cluster)
log [WRN] : slow request 706.086751 seconds old, received at
2024-03-27T18:46:59.240108+: client_request(client.99400:591894
getattr AsXsFs #0x100081b9ceb 2024-03-27T18:46:59.242644+
caller_uid=30150, caller_gid=600{600,608,999,}) currently joining batch
getattr
Mar 27 11:58:45 pr-md-01.prism ceph-mds[1296468]: log_channel(cluster)
log [WRN] : slow request 705.196030 seconds old, received at
2024-03-27T18:47:00.130829+: client_request(client.99400:591985
getattr AsXsFs #0x100081b9ceb 2024-03-27T18:47:00.130641+
caller_uid=30150, caller_gid=600{600,608,999,}) currently joining batch
getattr
Mar 27 11:58:45 pr-md-01.prism ceph-mds[1296468]:
mds.slugfs.pr-md-01.xdtppo Updating MDS map to version 16635 from mon.0
Mar 27 11:58:50 pr-md-01.prism ceph-mds[1296468]: log_channel(cluster)
log [WRN] : 16 slow requests, 0 included below; oldest blocked for >
719.386116 secs
Mar 27 11:58:53

[ceph-users] Re: MDS Behind on Trimming...

2024-03-28 Thread Erich Weiler
 11:59:00 pr-md-01.prism ceph-mds[1296468]: log_channel(cluster) 
log [WRN] : 16 slow requests, 0 included below; oldest blocked for > 
729.386333 secs
Mar 27 11:59:02 pr-md-01.prism ceph-mds[1296468]: 
mds.slugfs.pr-md-01.xdtppo Updating MDS map to version 16638 from mon.0
Mar 27 11:59:05 pr-md-01.prism ceph-mds[1296468]: log_channel(cluster) 
log [WRN] : 53 slow requests, 5 included below; oldest blocked for > 
734.386400 secs
Mar 27 11:59:05 pr-md-01.prism ceph-mds[1296468]: log_channel(cluster) 
log [WRN] : slow request 730.190197 seconds old, received at 
2024-03-27T18:46:55.137022+: client_request(client.99445:4189994 
getattr AsXsFs #0x100081b9ceb 2024-03-27T18:46:55.134857+ 
caller_uid=30150, caller_gid=600{600,608,999,}) currently joining batch 
getattr


Can we tell which client the slow requests are coming from?  It says 
stuff like "client.99445:4189994" but I don't know how to map that to a 
client...
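
(The client.NNNNN in those log lines is the session id, so it can be 
matched against the MDS session list; a sketch, assuming jq is 
installed on the admin host:)

# session id -> hostname for every client attached to this MDS
ceph tell mds.slugfs.pr-md-01.xdtppo session ls | jq -r '.[] | "\(.id)  \(.client_metadata.hostname // "-")"'

# or just look at the requests that are currently blocked
ceph tell mds.slugfs.pr-md-01.xdtppo dump_blocked_ops | jq -r '.ops[].description'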


Thanks for the response!

-erich

On 3/27/24 21:28, Xiubo Li wrote:


On 3/28/24 04:03, Erich Weiler wrote:

Hi All,

I've been battling this for a while and I'm not sure where to go from 
here.  I have a Ceph health warning as such:


# ceph -s
  cluster:
    id: 58bde08a-d7ed-11ee-9098-506b4b4da440
    health: HEALTH_WARN
    1 MDSs report slow requests


There were slow requests. I just suspect the behind on trimming was 
caused by this.


Could you share the logs about the slow requests ? What are they ?

Thanks



1 MDSs behind on trimming

  services:
    mon: 5 daemons, quorum 
pr-md-01,pr-md-02,pr-store-01,pr-store-02,pr-md-03 (age 5d)

    mgr: pr-md-01.jemmdf(active, since 3w), standbys: pr-md-02.emffhz
    mds: 1/1 daemons up, 2 standby
    osd: 46 osds: 46 up (since 9h), 46 in (since 2w)

  data:
    volumes: 1/1 healthy
    pools:   4 pools, 1313 pgs
    objects: 260.72M objects, 466 TiB
    usage:   704 TiB used, 424 TiB / 1.1 PiB avail
    pgs: 1306 active+clean
 4    active+clean+scrubbing+deep
 3    active+clean+scrubbing

  io:
    client:   123 MiB/s rd, 75 MiB/s wr, 109 op/s rd, 1.40k op/s wr

And the specifics are:

# ceph health detail
HEALTH_WARN 1 MDSs report slow requests; 1 MDSs behind on trimming
[WRN] MDS_SLOW_REQUEST: 1 MDSs report slow requests
    mds.slugfs.pr-md-01.xdtppo(mds.0): 99 slow requests are blocked > 
30 secs

[WRN] MDS_TRIM: 1 MDSs behind on trimming
    mds.slugfs.pr-md-01.xdtppo(mds.0): Behind on trimming (13884/250) 
max_segments: 250, num_segments: 13884


That "num_segments" number slowly keeps increasing.  I suspect I just 
need to tell the MDS servers to trim faster but after hours of 
googling around I just can't figure out the best way to do it. The 
best I could come up with was to decrease "mds_cache_trim_decay_rate" 
from 1.0 to .8 (to start), based on this page:


https://www.suse.com/support/kb/doc/?id=19740

But it doesn't seem to help, maybe I should decrease it further? I am 
guessing this must be a common issue...?  I am running Reef on the MDS 
servers, but most clients are on Quincy.


Thanks for any advice!

cheers,
erich
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io




___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] MDS Behind on Trimming...

2024-03-27 Thread Erich Weiler

Hi All,

I've been battling this for a while and I'm not sure where to go from 
here.  I have a Ceph health warning as such:


# ceph -s
  cluster:
id: 58bde08a-d7ed-11ee-9098-506b4b4da440
health: HEALTH_WARN
1 MDSs report slow requests
1 MDSs behind on trimming

  services:
mon: 5 daemons, quorum 
pr-md-01,pr-md-02,pr-store-01,pr-store-02,pr-md-03 (age 5d)

mgr: pr-md-01.jemmdf(active, since 3w), standbys: pr-md-02.emffhz
mds: 1/1 daemons up, 2 standby
osd: 46 osds: 46 up (since 9h), 46 in (since 2w)

  data:
volumes: 1/1 healthy
pools:   4 pools, 1313 pgs
objects: 260.72M objects, 466 TiB
usage:   704 TiB used, 424 TiB / 1.1 PiB avail
pgs: 1306 active+clean
 4    active+clean+scrubbing+deep
 3    active+clean+scrubbing

  io:
client:   123 MiB/s rd, 75 MiB/s wr, 109 op/s rd, 1.40k op/s wr

And the specifics are:

# ceph health detail
HEALTH_WARN 1 MDSs report slow requests; 1 MDSs behind on trimming
[WRN] MDS_SLOW_REQUEST: 1 MDSs report slow requests
mds.slugfs.pr-md-01.xdtppo(mds.0): 99 slow requests are blocked > 
30 secs

[WRN] MDS_TRIM: 1 MDSs behind on trimming
mds.slugfs.pr-md-01.xdtppo(mds.0): Behind on trimming (13884/250) 
max_segments: 250, num_segments: 13884


That "num_segments" number slowly keeps increasing.  I suspect I just 
need to tell the MDS servers to trim faster but after hours of googling 
around I just can't figure out the best way to do it.  The best I could 
come up with was to decrease "mds_cache_trim_decay_rate" from 1.0 to .8 
(to start), based on this page:


https://www.suse.com/support/kb/doc/?id=19740

But it doesn't seem to help, maybe I should decrease it further?  I am 
guessing this must be a common issue...?  I am running Reef on the MDS 
servers, but most clients are on Quincy.
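
(For reference, those knobs live in the cluster config database; a 
sketch of the sort of adjustment being described here, with values that 
are purely illustrative rather than recommendations:)

# allow more aggressive cache trimming (lower decay rate, higher per-tick threshold)
ceph config set mds mds_cache_trim_decay_rate 0.8
ceph config set mds mds_cache_trim_threshold 393216

# then watch whether num_segments actually starts falling
ceph tell mds.slugfs.pr-md-01.xdtppo perf dump | jq '.mds_log'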


Thanks for any advice!

cheers,
erich
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] CephFS filesystem mount tanks on some nodes?

2024-03-26 Thread Erich Weiler

Hi All,

We have a CephFS filesystem where we are running Reef on the servers 
(OSD/MDS/MGR/MON) and Quincy on the clients.


Every once in a while, one of the clients will stop allowing access to 
my CephFS filesystem, the error being "permission denied" while trying to 
access the filesystem on that node.  The fix is to force unmount the 
filesystem and remount it, then it's fine again.  Any idea how I can 
prevent this?
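
(The remount itself is nothing special; a sketch, assuming a kernel 
mount at /cephfs that is listed in fstab. If the client was actually 
evicted by the MDS it may also be blocklisted for a while, which is 
worth checking before blaming the remount:)

# force/lazy unmount the wedged mount, then remount from fstab
umount -f -l /cephfs
mount -a -t ceph

# from an admin host: see whether the client's address is currently blocklisted
ceph osd blocklist ls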


I see this in the client node logs:

Mar 25 11:34:46 phoenix-07 kernel: [50508.354036]  ? 
__touch_cap+0x24/0xd0 [ceph]
Mar 25 11:34:46 phoenix-07 kernel: [50508.359650]  ? 
__touch_cap+0x24/0xd0 [ceph]
Mar 25 11:34:46 phoenix-07 kernel: [50508.367657]  ? 
__touch_cap+0x24/0xd0 [ceph]
Mar 25 11:36:46 phoenix-07 kernel: [50629.189000]  ? 
__touch_cap+0x24/0xd0 [ceph]
Mar 25 11:36:46 phoenix-07 kernel: [50629.192579]  ? 
__touch_cap+0x24/0xd0 [ceph]
Mar 25 11:36:46 phoenix-07 kernel: [50629.196103]  ? 
__touch_cap+0x24/0xd0 [ceph]
Mar 25 11:38:47 phoenix-07 kernel: [50750.024268]  ? 
__touch_cap+0x24/0xd0 [ceph]
Mar 25 11:38:47 phoenix-07 kernel: [50750.031520]  ? 
__touch_cap+0x24/0xd0 [ceph]
Mar 25 11:38:47 phoenix-07 kernel: [50750.038594]  ? 
__touch_cap+0x24/0xd0 [ceph]
Mar 25 11:40:48 phoenix-07 kernel: [50870.853281]  ? 
__touch_cap+0x24/0xd0 [ceph]
Mar 25 22:55:38 phoenix-07 kernel: [91360.583032] libceph: mds0 
(1)10.50.1.75:6801 socket closed (con state OPEN)
Mar 25 22:55:38 phoenix-07 kernel: [91360.667914] libceph: mds0 
(1)10.50.1.75:6801 session reset
Mar 25 22:55:38 phoenix-07 kernel: [91360.667923] ceph: mds0 closed our 
session

Mar 25 22:55:38 phoenix-07 kernel: [91360.667925] ceph: mds0 reconnect start
Mar 25 22:55:52 phoenix-07 kernel: [91374.541614] ceph: mds0 reconnect 
denied
Mar 25 22:55:52 phoenix-07 kernel: [91374.541726] ceph:  dropping 
dirty+flushing Fw state for ea96c18f 1099683115069
Mar 25 22:55:52 phoenix-07 kernel: [91374.541732] ceph:  dropping 
dirty+flushing Fw state for ce495f00 1099687100635
Mar 25 22:55:52 phoenix-07 kernel: [91374.541737] ceph:  dropping 
dirty+flushing Fw state for 73ebb190 1099687100636
Mar 25 22:55:52 phoenix-07 kernel: [91374.541744] ceph:  dropping 
dirty+flushing Fw state for 91337e6a 1099687100637
Mar 25 22:55:52 phoenix-07 kernel: [91374.541746] ceph:  dropping 
dirty+flushing Fw state for 9075ecd8 1099687100634
Mar 25 22:55:52 phoenix-07 kernel: [91374.541751] ceph:  dropping 
dirty+flushing Fw state for d1d4c51f 1099687100633
Mar 25 22:55:52 phoenix-07 kernel: [91374.541781] ceph:  dropping 
dirty+flushing Fw state for 63dec1e4 1099687100632
Mar 25 22:55:52 phoenix-07 kernel: [91374.541793] ceph:  dropping 
dirty+flushing Fw state for 8b3124db 1099687100638
Mar 25 22:55:52 phoenix-07 kernel: [91374.541796] ceph:  dropping 
dirty+flushing Fw state for d9e76d8b 1099687100471
Mar 25 22:55:52 phoenix-07 kernel: [91374.541798] ceph:  dropping 
dirty+flushing Fw state for b57da610 1099685041085
Mar 25 22:55:52 phoenix-07 kernel: [91374.542235] libceph: mds0 
(1)10.50.1.75:6801 socket closed (con state V1_CONNECT_MSG)
Mar 25 22:55:52 phoenix-07 kernel: [91374.791652] ceph: mds0 rejected 
session
Mar 25 23:01:51 phoenix-07 kernel: [91733.308806] ceph: get_quota_realm: 
ino (1.fffe) null i_snap_realm
Mar 25 23:01:56 phoenix-07 kernel: [91738.182127] ceph: 
check_quota_exceeded: ino (1000a1cb4a8.fffe) null i_snap_realm
Mar 25 23:01:56 phoenix-07 kernel: [91738.188225] ceph: 
check_quota_exceeded: ino (1000a1cb4a8.fffe) null i_snap_realm
Mar 25 23:01:56 phoenix-07 kernel: [91738.233658] ceph: 
check_quota_exceeded: ino (1000a1cb4aa.fffe) null i_snap_realm
Mar 25 23:25:52 phoenix-07 kernel: [93174.787630] libceph: mds0 
(1)10.50.1.75:6801 socket closed (con state OPEN)
Mar 25 23:39:45 phoenix-07 kernel: [94007.751879] ceph: get_quota_realm: 
ino (1.fffe) null i_snap_realm
Mar 26 00:03:28 phoenix-07 kernel: [95430.158646] ceph: get_quota_realm: 
ino (1.fffe) null i_snap_realm
Mar 26 00:39:45 phoenix-07 kernel: [97607.685421] ceph: get_quota_realm: 
ino (1.fffe) null i_snap_realm
Mar 26 00:43:34 phoenix-07 kernel: [97836.681145] ceph: 
check_quota_exceeded: ino (1000a306503.fffe) null i_snap_realm
Mar 26 00:43:34 phoenix-07 kernel: [97836.686797] ceph: 
check_quota_exceeded: ino (1000a306503.fffe) null i_snap_realm
Mar 26 00:43:34 phoenix-07 kernel: [97836.729046] ceph: 
check_quota_exceeded: ino (1000a306505.fffe) null i_snap_realm
Mar 26 00:49:39 phoenix-07 kernel: [98201.302564] ceph: 
check_quota_exceeded: ino (1000a75677d.fffe) null i_snap_realm
Mar 26 00:49:39 phoenix-07 kernel: [98201.305676] ceph: 
check_quota_exceeded: ino (1000a75677d.fffe) null i_snap_realm
Mar 26 00:49:39 phoenix-07 kernel: [98201.347267] ceph: 
check_quota_exceeded: ino (1000a755fe3.fffe) null i_snap_realm
Mar 26 01:04:49 pho

[ceph-users] Re: Clients failing to advance oldest client?

2024-03-26 Thread Erich Weiler
Thank you!  The OSD/mon/mgr/MDS servers are on 18.2.1, and the clients 
are mostly 17.2.6.
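
(For what it's worth, the cluster can report both sides of that 
directly:)

# versions of the running mon/mgr/osd/mds daemons
ceph versions

# release/feature level of the connected clients, including kernel clients
ceph features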


-erich

On 3/25/24 11:57 PM, Dhairya Parmar wrote:
I think this bug has already been worked on in 
https://tracker.ceph.com/issues/63364, can you tell which version 
you're on?


--
Dhairya Parmar

Associate Software Engineer, CephFS

IBM, Inc.



On Tue, Mar 26, 2024 at 2:32 AM Erich Weiler <wei...@soe.ucsc.edu> wrote:


Hi Y'all,

I'm seeing this warning via 'ceph -s' (this is on Reef):

# ceph -s
    cluster:
      id:     58bde08a-d7ed-11ee-9098-506b4b4da440
      health: HEALTH_WARN
              3 clients failing to advance oldest client/flush tid
              1 MDSs report slow requests
              1 MDSs behind on trimming

    services:
      mon: 5 daemons, quorum
pr-md-01,pr-md-02,pr-store-01,pr-store-02,pr-md-03 (age 3d)
      mgr: pr-md-01.jemmdf(active, since 3w), standbys: pr-md-02.emffhz
      mds: 1/1 daemons up, 1 standby
      osd: 46 osds: 46 up (since 3d), 46 in (since 2w)

    data:
      volumes: 1/1 healthy
      pools:   4 pools, 1313 pgs
      objects: 258.13M objects, 454 TiB
      usage:   688 TiB used, 441 TiB / 1.1 PiB avail
      pgs:     1303 active+clean
               8    active+clean+scrubbing
               2    active+clean+scrubbing+deep

    io:
      client:   131 MiB/s rd, 111 MiB/s wr, 41 op/s rd, 613 op/s wr

I googled around and looked at the docs and it seems like this isn't a
critical problem, but I couldn't find a clear path to resolution.  Does
anyone have any advice on what I can do to resolve the health issues
up top?

My CephFS filesystem is incredibly busy so I have a feeling that has
some impact here, but not 100% sure...

Thanks as always for the help!

cheers,
erich
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Clients failing to advance oldest client?

2024-03-25 Thread Erich Weiler
Ok! Thank you. Is there a way to tell which client is slow?
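
(One way to narrow it down, assuming the client_id shows up in 'ceph 
health detail' as it usually does for this warning; the id below is a 
placeholder:)

# the warning names a client_id; map it to a hostname via the MDS session list
ceph health detail
ceph tell mds.0 client ls id=<client_id> | jq -r '.[0].client_metadata.hostname'

# and see how old the in-flight requests are, oldest first
ceph tell mds.0 ops | jq -r '.ops[] | "\(.age)  \(.description)"' | sort -rn | head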

> On Mar 25, 2024, at 9:06 PM, David Yang  wrote:
> 
> It is recommended to disconnect the client first and then observe
> whether the cluster's slow requests recover.
> 
> Erich Weiler  wrote on Tue, Mar 26, 2024 at 05:02:
>> 
>> Hi Y'all,
>> 
>> I'm seeing this warning via 'ceph -s' (this is on Reef):
>> 
>> # ceph -s
>>   cluster:
>> id: 58bde08a-d7ed-11ee-9098-506b4b4da440
>> health: HEALTH_WARN
>> 3 clients failing to advance oldest client/flush tid
>> 1 MDSs report slow requests
>> 1 MDSs behind on trimming
>> 
>>   services:
>> mon: 5 daemons, quorum
>> pr-md-01,pr-md-02,pr-store-01,pr-store-02,pr-md-03 (age 3d)
>> mgr: pr-md-01.jemmdf(active, since 3w), standbys: pr-md-02.emffhz
>> mds: 1/1 daemons up, 1 standby
>> osd: 46 osds: 46 up (since 3d), 46 in (since 2w)
>> 
>>   data:
>> volumes: 1/1 healthy
>> pools:   4 pools, 1313 pgs
>> objects: 258.13M objects, 454 TiB
>> usage:   688 TiB used, 441 TiB / 1.1 PiB avail
>> pgs: 1303 active+clean
>>  8    active+clean+scrubbing
>>  2    active+clean+scrubbing+deep
>> 
>>   io:
>> client:   131 MiB/s rd, 111 MiB/s wr, 41 op/s rd, 613 op/s wr
>> 
>> I googled around and looked at the docs and it seems like this isn't a
>> critical problem, but I couldn't find a clear path to resolution.  Does
>> anyone have any advice on what I can do to resolve the health issues up top?
>> 
>> My CephFS filesystem is incredibly busy so I have a feeling that has
>> some impact here, but not 100% sure...
>> 
>> Thanks as always for the help!
>> 
>> cheers,
>> erich
>> ___
>> ceph-users mailing list -- ceph-users@ceph.io
>> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Clients failing to advance oldest client?

2024-03-25 Thread Erich Weiler

Hi Y'all,

I'm seeing this warning via 'ceph -s' (this is on Reef):

# ceph -s
  cluster:
id: 58bde08a-d7ed-11ee-9098-506b4b4da440
health: HEALTH_WARN
3 clients failing to advance oldest client/flush tid
1 MDSs report slow requests
1 MDSs behind on trimming

  services:
mon: 5 daemons, quorum 
pr-md-01,pr-md-02,pr-store-01,pr-store-02,pr-md-03 (age 3d)

mgr: pr-md-01.jemmdf(active, since 3w), standbys: pr-md-02.emffhz
mds: 1/1 daemons up, 1 standby
osd: 46 osds: 46 up (since 3d), 46 in (since 2w)

  data:
volumes: 1/1 healthy
pools:   4 pools, 1313 pgs
objects: 258.13M objects, 454 TiB
usage:   688 TiB used, 441 TiB / 1.1 PiB avail
pgs: 1303 active+clean
 8    active+clean+scrubbing
 2    active+clean+scrubbing+deep

  io:
client:   131 MiB/s rd, 111 MiB/s wr, 41 op/s rd, 613 op/s wr

I googled around and looked at the docs and it seems like this isn't a 
critical problem, but I couldn't find a clear path to resolution.  Does 
anyone have any advice on what I can do to resolve the health issues up top?


My CephFS filesystem is incredibly busy so I have a feeling that has 
some impact here, but not 100% sure...


Thanks as always for the help!

cheers,
erich
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Question about erasure coding on cephfs

2024-03-02 Thread Erich Weiler

Hi Y'all,

We have a new ceph cluster online that looks like this:

md-01 : monitor, manager, mds
md-02 : monitor, manager, mds
md-03 : monitor, manager
store-01 : twenty 30TB NVMe OSDs
store-02 : twenty 30TB NVMe OSDs

The cephfs storage is using erasure coding at 4:2.  The crush domain is 
set to "osd".


(I know that's not optimal but let me get to that in a minute)

We currently have a regular single NFS server (nfs-01) with the same 
storage as the OSD servers above (twenty 30TB NVME disks).  We want to 
wipe the NFS server and integrate it into the above ceph cluster as 
"store-03".  When we do that, we would then have three OSD servers.  We 
would then switch the crush domain to "host".


My question is this:  Given that we have 4:2 erasure coding, would the 
data rebalance evenly across the three OSD servers after we add store-03 
such that if a single OSD server went down, the other two would be 
enough to keep the system online?  Like, with 4:2 erasure coding, would 
2 shards go on store-01, then 2 shards on store-02, and then 2 shards on 
store-03?  Is that how I understand it?
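
(Not an answer to the balancing question itself, but for context: with 
crush-failure-domain=host a 4+2 profile wants six hosts, one shard per 
host, so landing exactly two shards on each of three hosts takes a 
custom CRUSH rule that first picks 3 hosts and then 2 OSDs inside each. 
A sketch of what that could look like, assuming the default root and 
editing the map with crushtool; illustrative only, not tested against 
this cluster:)

# decompile the CRUSH map, add a rule, recompile and inject it
ceph osd getcrushmap -o crush.bin
crushtool -d crush.bin -o crush.txt

# append something like this to crush.txt (pick an unused id):
#
#   rule ec42-two-per-host {
#       id 42
#       type erasure
#       step set_chooseleaf_tries 5
#       step set_choose_tries 100
#       step take default
#       step choose indep 3 type host
#       step chooseleaf indep 2 type osd
#       step emit
#   }

crushtool -c crush.txt -o crush.new
ceph osd setcrushmap -i crush.new

# point the EC data pool at the new rule
ceph osd pool set <ec-data-pool> crush_rule ec42-two-per-host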


Thanks for any insight!

-erich
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io