[ceph-users] Re: excluding from host_pattern

2023-01-27 Thread Nizamudeen A
Hi, I am not sure about cephadm, but if you were to use the ceph-dashboard, in its host creation form you can enter a pattern like ceph[01-19], which should add ceph01...ceph19. Regards, Nizam On Fri, Jan 27, 2023, 23:52 E Taka <0eta...@gmail.com> wrote: > Thanks, Ulrich, but: > > # ceph orch host ls

[ceph-users] Re: excluding from host_pattern

2023-01-27 Thread E Taka
Thanks, Ulrich, but: # ceph orch host ls --host_pattern="^ceph(0[1-9])|(1[0-9])$" 0 hosts in cluster whose hostname matched "^ceph(0[1-9])|(1[0-9])$" Bash patterns are not accepted (I tried numerous other combinations). But, as I said, not really a problem - just wondering what the regex
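The likely culprit here is regex grouping rather than bash vs. regex syntax: in ^ceph(0[1-9])|(1[0-9])$ the alternation splits the whole expression, so it matches names starting with ceph0[1-9] or names ending in 1[0-9]. A minimal sketch, testable locally with plain grep -E and assuming host_pattern is treated as an ordinary regular expression (as the original post suspects):

  for h in ceph00 ceph01 ceph09 ceph10 ceph19 ceph20; do
      # keep the alternation inside one group so the anchors cover the whole name
      echo "$h" | grep -Eq '^ceph(0[1-9]|1[0-9])$' && echo "$h matches"
  done
  # matches ceph01, ceph09, ceph10 and ceph19, but not ceph00 or ceph20
  ceph orch host ls --host_pattern='^ceph(0[1-9]|1[0-9])$'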

[ceph-users] Re: excluding from host_pattern

2023-01-27 Thread Ulrich Klein
I use something like "^ceph(0[1-9])|(1[0-9])$", but in a script that checks a parameter for a "correct" ceph node name, like in: wantNum=$1 if [[ $wantNum =~ ^ceph(0[2-9]|1[0-9])$ ]] ; then wantNum=${BASH_REMATCH[1]} which gives me the number if it is in the range 02-19. Dunno, if
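Filled out as a self-contained script, the check from the excerpt might look like this (a minimal sketch; the echo output and error handling are additions):

  #!/bin/bash
  # accept node names ceph02..ceph19 and extract the two-digit number
  wantNum=$1
  if [[ $wantNum =~ ^ceph(0[2-9]|1[0-9])$ ]]; then
      wantNum=${BASH_REMATCH[1]}            # e.g. "07" for ceph07
      echo "node number: $wantNum"
  else
      echo "not a valid ceph node name: $wantNum" >&2
      exit 1
  fi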

[ceph-users] Re: Very slow snaptrim operations blocking client I/O

2023-01-27 Thread Victor Rodriguez
On 1/27/23 17:44, Josh Baergen wrote: This might be due to tombstone accumulation in rocksdb. You can try to issue a compact to all of your OSDs and see if that helps (ceph tell osd.XXX compact). I usually prefer to do this one host at a time just in case it causes issues, though on a

[ceph-users] Re: Very slow snaptrim operations blocking client I/O

2023-01-27 Thread Victor Rodriguez
FWIW, the snapshot was in pool cephVMs01_comp, which does use compression. How is your pg distribution on your osd devices? Looks like the PGs are not perfectly balanced, but it doesn't seem to be too bad: ceph osd df tree ID  CLASS  WEIGHT  REWEIGHT  SIZE  RAW USE  DATA  OMAP  META

[ceph-users] excluding from host_pattern

2023-01-27 Thread E Taka
Hi, I wonder if it is possible to define a host pattern which includes the host names ceph01…ceph19 but no other hosts, especially not ceph00. That means this pattern is wrong: ceph[01][0-9], since it includes ceph00. Not really a problem, but it seems that the "host_pattern" is a regex

[ceph-users] Re: Very slow snaptrim operations blocking client I/O

2023-01-27 Thread Josh Baergen
This might be due to tombstone accumulation in rocksdb. You can try to issue a compact to all of your OSDs and see if that helps (ceph tell osd.XXX compact). I usually prefer to do this one host at a time just in case it causes issues, though on a reasonably fast RBD cluster you can often get away
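One way to script the "one host at a time" approach is to feed ceph tell from ceph osd ls-tree, which lists the OSD ids under a host's CRUSH bucket. A sketch with a placeholder host name:

  HOST=ceph05                               # placeholder host name
  for osd in $(ceph osd ls-tree "$HOST"); do
      echo "compacting osd.$osd"
      ceph tell osd."$osd" compact          # can take a while on large OSDs
  done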

[ceph-users] Re: Very slow snaptrim operations blocking client I/O

2023-01-27 Thread Szabo, Istvan (Agoda)
How is your pg distribution on your osd devices? Do you have enough assigned pgs? Istvan Szabo Staff Infrastructure Engineer --- Agoda Services Co., Ltd. e: istvan.sz...@agoda.com

[ceph-users] Re: Very slow snaptrim operations blocking client I/O

2023-01-27 Thread Victor Rodriguez
Ah yes, checked that too. Monitors and OSDs report via ceph config show-with-defaults that bluefs_buffered_io is set to true as the default setting (it isn't overridden somewhere). On 1/27/23 17:15, Wesley Dillingham wrote: I hit this issue once on a nautilus cluster and changed the OSD

[ceph-users] Re: Very slow snaptrim operations blocking client I/O

2023-01-27 Thread Wesley Dillingham
I hit this issue once on a nautilus cluster and changed the OSD parameter bluefs_buffered_io = true (it was set at false). I believe the default of this parameter was switched from false to true in release 14.2.20; however, perhaps you could still check what your osds are configured with in regard to
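Two ways to check the effective value, one against the monitor's configuration database and one against the running daemon (osd.0 is a placeholder; the second command needs the admin socket on that OSD's host):

  # configured value for all OSDs (monitor config database)
  ceph config get osd bluefs_buffered_io
  # value the running daemon is actually using
  ceph daemon osd.0 config get bluefs_buffered_io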

[ceph-users] Re: Ceph rbd clients surrender exclusive lock in critical situation

2023-01-27 Thread Frank Schilder
Hi Ilya, thanks for that information. It sounds like one can use exclusive locks in the hook script to resolve race conditions. I will have a look. Not sure if it will help reduce the number of states (shared log tags) though. Best regards, = Frank Schilder AIT Risø Campus
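For the hook-script idea, the rbd CLI's advisory locks (which are separate from the exclusive-lock image feature discussed in this thread) can act as a simple mutex. A minimal sketch with placeholder pool, image and lock names:

  # try to take an exclusive advisory lock before touching the image
  if rbd lock add vmpool/vm-disk-1 startup-guard 2>/dev/null; then
      echo "lock acquired, safe to proceed"
  else
      echo "image is already locked by:" >&2
      rbd lock ls vmpool/vm-disk-1 >&2     # shows the holder; release with 'rbd lock rm'
      exit 1
  fi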

[ceph-users] Re: Ceph rbd clients surrender exclusive lock in critical situation

2023-01-27 Thread Ilya Dryomov
On Fri, Jan 27, 2023 at 4:09 PM Frank Schilder wrote: > > Hi Ilya, > > yes, it has race conditions. However, it seems to address the specific case > that is causing us headaches. > > About possible improvements. I tried to understand the documentation about > rbd image locks, but probably

[ceph-users] Re: Ceph rbd clients surrender exclusive lock in critical situation

2023-01-27 Thread Frank Schilder
Hi Ilya, yes, it has race conditions. However, it seems to address the specific case that is causing us headaches. About possible improvements. I tried to understand the documentation about rbd image locks, but probably failed. I don't understand what the difference between an exclusive lock

[ceph-users] Re: OSDs fail to start after stopping them with ceph osd stop command

2023-01-27 Thread Stefan Hanreich
Seems like I accidentally only replied directly to Eugen, so here is my answer in case anyone encounters the same problem: We were able to reproduce this issue; it is related to OSDs not catching up to the current epoch of the OSD map. For the first few OSDs, restarting twice worked well,
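One way to see whether an OSD has caught up is to compare the cluster's current osdmap epoch with the range of maps the daemon holds (a sketch; osd.0 is a placeholder and the second command needs the admin socket on that host):

  # current cluster osdmap epoch
  ceph osd dump -f json | jq .epoch
  # oldest and newest osdmap epochs this daemon has processed
  ceph daemon osd.0 status | jq '{oldest_map, newest_map}'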

[ceph-users] Re: Ceph rbd clients surrender exclusive lock in critical situation

2023-01-27 Thread Ilya Dryomov
On Fri, Jan 27, 2023 at 11:21 AM Frank Schilder wrote: > > Hi Mark, > > thanks a lot! This seems to address the issue we observe, at least to a large > degree. > > I believe we had 2 VMs running after a failed live-migration as well and in > this case it doesn't seem like it will help. Maybe

[ceph-users] Very slow snaptrim operations blocking client I/O

2023-01-27 Thread Victor Rodriguez
Hello, Asking for help with an issue. Maybe someone has a clue about what's going on. Using ceph 15.2.17 on Proxmox 7.3. A big VM had a snapshot and I removed it. A bit later, nearly half of the PGs of the pool entered snaptrim and snaptrim_wait state, as expected. The problem is that such
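A quick way to watch how many PGs are still trimming (a sketch; the state is the second field of the pgs_brief dump):

  # count PGs currently in snaptrim or snaptrim_wait
  ceph pg dump pgs_brief 2>/dev/null | awk '$2 ~ /snaptrim/ {n++} END {print n+0, "PGs trimming"}'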

[ceph-users] PSA: Potential problems in a recent kernel?

2023-01-27 Thread Matthew Booth
My ceph cluster became unstable yesterday after zincati (CoreOS's auto-updater) updated one of my nodes from 37.20221225.3.0 to 37.20230110.3.1(*). The symptom was slow ops in my cephfs mds, which started as soon as the OSDs on this node became in and up. Excluding the OSDs on this node worked
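"Excluding the OSDs on this node" can be scripted by marking them out so data rebalances away (a sketch; the node name is a placeholder and ceph osd ls-tree lists the OSD ids under that host's CRUSH bucket):

  NODE=node3                                # placeholder CoreOS node name
  for osd in $(ceph osd ls-tree "$NODE"); do
      ceph osd out "$osd"                   # PGs remap away from this node
  done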

[ceph-users] Re: Ceph rbd clients surrender exclusive lock in critical situation

2023-01-27 Thread Frank Schilder
Hi Mark, thanks a lot! This seems to address the issue we observe, at least to a large degree. I believe we had 2 VMs running after a failed live-migration as well, and in this case it doesn't seem like it will help. Maybe it's possible to add a bit of logic for this case as well (similar to

[ceph-users] Re: OSDs fail to start after stopping them with ceph osd stop command

2023-01-27 Thread Eugen Block
Hi, what ceph version is this cluster running on? I tried the procedure you describe in a test cluster with 16.2.9 (cephadm) and all OSDs came up, although I had to start the containers twice (manually). Regards, Eugen Zitat von Stefan Hanreich : We encountered the following problems