[ceph-users] Re: Very slow snaptrim operations blocking client I/O

2023-02-20 Thread Boris Behrens
Hi, we've encountered the same issue after upgrading to octopus on one of our rbd clusters, and now it reappears after the autoscaler lowered the PGs from 8k to 2k for the RBD pool. What we've done in the past: - recreate all OSDs after our 2nd incident with slow OPS in a single week after the ceph
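For reference, freezing the autoscaler so a PG merge does not kick off another snaptrim storm is a one-liner; a minimal sketch, assuming the pool is simply named "rbd" (the name is illustrative only):

    # stop the autoscaler from changing pg_num for this pool
    ceph osd pool set rbd pg_autoscale_mode off
    # check the current and target PG counts for the pool
    ceph osd pool get rbd pg_num
    ceph osd pool ls detail | grep rbd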

[ceph-users] Re: Very slow snaptrim operations blocking client I/O

2023-02-15 Thread Victor Rodriguez
An update on this for the record: To fully solve this I've had to destroy each OSD and create them again, one by one. I could have done it one host at a time but I've preferred to be on the safe side just in case something else went wrong. The values for num_pgmeta_omap (which I don't know
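A rough sketch of that per-OSD recreate cycle, assuming LVM-backed OSDs; the OSD id 12 and device /dev/sdX are placeholders, and the exact ceph-volume invocation depends on the deployment:

    # drain the OSD and wait until it is safe to remove
    ceph osd out 12
    while ! ceph osd safe-to-destroy 12; do sleep 60; done
    # destroy it, wipe the device, and recreate it with the same id
    ceph osd destroy 12 --yes-i-really-mean-it
    ceph-volume lvm zap /dev/sdX --destroy
    ceph-volume lvm create --osd-id 12 --data /dev/sdX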

[ceph-users] Re: Very slow snaptrim operations blocking client I/O

2023-01-30 Thread Victor Rodriguez
On 1/30/23 15:15, Ana Aviles wrote: Hi, Josh already suggested, but I will one more time. We had similar behaviour upgrading from Nautilus to Pacific. In our case compacting the OSDs did the trick. Thanks for chiming in! Unfortunately, in my case neither an online compaction (ceph tell
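Besides the online compaction mentioned here (ceph tell osd.X compact), an offline compaction can be attempted with the OSD stopped; a sketch for a systemd-managed, non-cephadm deployment, with OSD id 12 as a placeholder:

    # the OSD must be stopped first
    systemctl stop ceph-osd@12
    # compact the OSD's RocksDB directly on disk
    ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-12 compact
    systemctl start ceph-osd@12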

[ceph-users] Re: Very slow snaptrim operations blocking client I/O

2023-01-30 Thread Ana Aviles
Victor Rodriguez Sent: 29 January 2023 22:40:46 To: ceph-users@ceph.io Subject: [ceph-users] Re: Very slow snaptrim operations blocking client I/O Looks like this is going to take a few days. I hope to manage the available performance for VMs with osd_snap_trim_sleep_ssd. I'm wondering if afte

[ceph-users] Re: Very slow snaptrim operations blocking client I/O

2023-01-30 Thread Frank Schilder
relevant, I could copy pieces I have into here. > > Best regards, > = > Frank Schilder > AIT Risø Campus > Bygning 109, rum S14 > > > From: Victor Rodriguez > Sent: 29 January 2023 22:40:46 > To: ceph-users@ceph.io > Subj

[ceph-users] Re: Very slow snaptrim operations blocking client I/O

2023-01-30 Thread Victor Rodriguez
Frank Schilder AIT Risø Campus Bygning 109, rum S14 From: Victor Rodriguez Sent: 29 January 2023 22:40:46 To: ceph-users@ceph.io Subject: [ceph-users] Re: Very slow snaptrim operations blocking client I/O Looks like this is going to take a few days. I hope to

[ceph-users] Re: Very slow snaptrim operations blocking client I/O

2023-01-30 Thread Frank Schilder
[ceph-users] Re: Very slow snaptrim operations blocking client I/O Looks like this is going to take a few days. I hope to manage the available performance for VMs with osd_snap_trim_sleep_ssd. I'm wondering if, after that long snaptrim process you went through, your cluster was stable again and

[ceph-users] Re: Very slow snaptrim operations blocking client I/O

2023-01-29 Thread Victor Rodriguez
Looks like this is going to take a few days. I hope to manage the available performance for VMs with osd_snap_trim_sleep_ssd. I'm wondering if, after that long snaptrim process you went through, your cluster was stable again and snapshots/snaptrims did work properly? On 1/29/23 16:01,
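A minimal sketch of throttling snaptrim that way through the config database; the 1-second value is purely illustrative, higher values are gentler on client I/O but make trimming take longer:

    # add a pause between snap trim operations on SSD-backed OSDs
    ceph config set osd osd_snap_trim_sleep_ssd 1
    # confirm the value an OSD is actually running with
    ceph config show osd.0 osd_snap_trim_sleep_ssd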

[ceph-users] Re: Very slow snaptrim operations blocking client I/O

2023-01-29 Thread Matt Vandermeulen
I should have explicitly stated that during the recovery, it was still quite bumpy for customers. Some snaptrims were very quick, some took what felt like a really long time. This was however a cluster with a very large number of volumes and a long, long history of snapshots. I'm not sure

[ceph-users] Re: Very slow snaptrim operations blocking client I/O

2023-01-28 Thread Victor Rodriguez
On 1/29/23 00:50, Matt Vandermeulen wrote: I've observed a similar horror when upgrading a cluster from Luminous to Nautilus, which had the same effect of an overwhelming amount of snaptrim making the cluster unusable. In our case, we held its hand by setting all OSDs to have zero max

[ceph-users] Re: Very slow snaptrim operations blocking client I/O

2023-01-28 Thread Matt Vandermeulen
I've observed a similar horror when upgrading a cluster from Luminous to Nautilus, which had the same effect of an overwhelming amount of snaptrim making the cluster unusable. In our case, we held its hand by setting all OSDs to have zero max trimming PGs, unsetting nosnaptrim, and then
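A minimal sketch of that hand-holding approach, assuming the config-database interface; the values are illustrative and would be raised step by step while watching client latency:

    # stop per-PG trimming work before re-enabling snaptrim cluster-wide
    ceph config set osd osd_max_trimming_pgs 0
    ceph osd unset nosnaptrim
    # then raise the limit gradually (the default is 2)
    ceph config set osd osd_max_trimming_pgs 1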

[ceph-users] Re: Very slow snaptrim operations blocking client I/O

2023-01-28 Thread Victor Rodriguez
After some investigation this is what I'm seeing: - OSD processes get stuck at 100% CPU or more if I ceph osd unset nosnaptrim. They stay at 100% CPU even if I ceph osd set nosnaptrim. They stayed like that for at least 26 hours. Some quick benchmarks don't show a reduction of the
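One way to tell whether those OSDs are still making snaptrim progress is to watch the per-PG snap trim queues; a sketch, assuming the SNAPTRIMQ_LEN column present in recent releases:

    # list PGs currently in a snaptrim or snaptrim_wait state
    ceph pg dump pgs 2>/dev/null | grep snaptrim
    # the SNAPTRIMQ_LEN column shows how much trimming each PG still has queued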

[ceph-users] Re: Very slow snaptrim operations blocking client I/O

2023-01-27 Thread Victor Rodriguez
On 1/27/23 17:44, Josh Baergen wrote: This might be due to tombstone accumulation in rocksdb. You can try to issue a compact to all of your OSDs and see if that helps (ceph tell osd.XXX compact). I usually prefer to do this one host at a time just in case it causes issues, though on a

[ceph-users] Re: Very slow snaptrim operations blocking client I/O

2023-01-27 Thread Victor Rodriguez
FWIW, the snapshot was in pool cephVMs01_comp, which does use compression. How is your pg distribution on your osd devices? Looks like the PGs are not perfectly balanced, but it doesn't seem to be too bad: ceph osd df tree ID  CLASS  WEIGHT    REWEIGHT  SIZE RAW USE  DATA OMAP META   

[ceph-users] Re: Very slow snaptrim operations blocking client I/O

2023-01-27 Thread Josh Baergen
This might be due to tombstone accumulation in rocksdb. You can try to issue a compact to all of your OSDs and see if that helps (ceph tell osd.XXX compact). I usually prefer to do this one host at a time just in case it causes issues, though on a reasonably fast RBD cluster you can often get away
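A sketch of doing that one host at a time; the host name is a placeholder and the OSD ids are taken from the CRUSH tree:

    # compact every OSD on one host, then move on to the next host
    for id in $(ceph osd ls-tree HOSTNAME); do
        ceph tell osd.$id compact
    done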

[ceph-users] Re: Very slow snaptrim operations blocking client I/O

2023-01-27 Thread Szabo, Istvan (Agoda)
How is your pg distribution on your osd devices? Do you have enough assigned pgs? Istvan Szabo Staff Infrastructure Engineer --- Agoda Services Co., Ltd. e: istvan.sz...@agoda.com

[ceph-users] Re: Very slow snaptrim operations blocking client I/O

2023-01-27 Thread Victor Rodriguez
Ah yes, checked that too. Monitors and OSDs report with ceph config show-with-defaults that bluefs_buffered_io is set to true as the default setting (it isn't overridden somewhere). On 1/27/23 17:15, Wesley Dillingham wrote: I hit this issue once on a nautilus cluster and changed the OSD

[ceph-users] Re: Very slow snaptrim operations blocking client I/O

2023-01-27 Thread Wesley Dillingham
I hit this issue once on a nautilus cluster and changed the OSD parameter bluefs_buffered_io = true (it was set to false). I believe the default of this parameter was switched from false to true in release 14.2.20; however, perhaps you could still check what your osds are configured with in regard to
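For reference, a sketch of checking the effective value and overriding it if needed; osd.0 stands in for any OSD id, and depending on the release an OSD restart may be needed for the change to take effect:

    # what the running OSD is actually using
    ceph config show-with-defaults osd.0 | grep bluefs_buffered_io
    # override it for all OSDs if it is still false
    ceph config set osd bluefs_buffered_io true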