[ceph-users] Re: Very slow snaptrim operations blocking client I/O

2023-02-20 Thread Boris Behrens
Hi,
we've encountered the same issue after upgrading to Octopus on one of our
RBD clusters, and now it reappears after the autoscaler lowered the PGs from
8k to 2k for the RBD pool.

What we've done in the past:
- recreated all OSDs after our 2nd incident with slow OPS in a single week
after the Ceph upgrade (early September)
- upgraded the OS from CentOS 7 to Ubuntu Focal after the third incident
(December)
- offline compacted all OSDs a week ago, because we had some (~500) very old
snapshots lying around and hoped that snaptrim would work faster when the
rocksdb is freshly compacted.

After the first incident we had roughly three months of smooth sailing, and
now, roughly another three months later, we are experiencing these slow OPS
again. This time it might be because we have some OSDs (2TB SSD) with very few
PGs (~30) and some OSDs (8TB SSD) with a lot of PGs (~120).
I will try to compact all OSDs and check whether it stops again, but I think I
need to bump the pool back up to 4k PGs, because the problem started again when
the autoscaler lowered the PGs.
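
Concretely, something like the following should take the pool back to 4k PGs
and keep the autoscaler from shrinking it again (the pool name here is just a
placeholder):

ceph osd pool set rbd pg_autoscale_mode warn   # or "off": stop the autoscaler from changing pg_num
ceph osd pool set rbd pg_num 4096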

And from our data (Prometheus), the apply latency goes up to 9 seconds and
it mostly hits the 8TB disks.

I am currently running "time ceph daemon osd.1
calc_objectstore_db_histogram" for all OSDs and get very mixed values, but
none of the runs completes in under a minute.
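
For reference, the loop I'm using for that is roughly the following (assuming
the OSD admin sockets sit in /var/run/ceph/ on each host):

for sock in /var/run/ceph/ceph-osd.*.asok; do
    echo "== $sock =="
    # wall-clock time of the histogram calculation, plus the pgmeta omap counter
    time ceph daemon "$sock" calc_objectstore_db_histogram | grep num_pgmeta_omap
done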


On Wed, Feb 15, 2023 at 16:42, Victor Rodriguez <vrodrig...@soltecsis.com> wrote:

> An update on this for the record:
>
> To fully solve this I've had to destroy each OSD and create them again,
> one by one. I could have done it one host at a time but I've preferred
> to be on the safest side just in case something else went wrong.
>
> The values for num_pgmeta_omap (which I don't know what it is, yet) on
> the new OSDs were similar to other clusters (I've seen from 30 to
> 70 approx.), so I believe the characteristics of the data in the
> cluster do not determine how big or small num_pgmeta_omap should be.
>
> One thing I've noticed is that bad or damaged OSDs (i.e. those
> showing high CPU usage and poor performance doing the trim operation)
> took much more time to calculate their histogram, even if their
> num_pgmeta_omap was low:
>
> (bad OSD):
> # time ceph daemon osd.1 calc_objectstore_db_histogram | grep
> "num_pgmeta_omap"
>  "num_pgmeta_omap": 673208,
>
> real    1m14,549s
> user    0m0,075s
> sys     0m0,025s
>
> (good new OSD):
> #  time ceph daemon osd.1 calc_objectstore_db_histogram | grep
> "num_pgmeta_omap"
>  "num_pgmeta_omap": 434298,
>
> real    0m18,022s
> user    0m0,078s
> sys     0m0,023s
>
>
> Maybe it's worth checking that histogram from time to time as a way to
> measure the OSD "health"?
>
> Again, thanks everyone.
>
>
>
> On 1/30/23 18:18, Victor Rodriguez wrote:
> >
> > On 1/30/23 15:15, Ana Aviles wrote:
> >> Hi,
> >>
> >> Josh already suggested, but I will one more time. We had similar
> >> behaviour upgrading from Nautilus to Pacific. In our case compacting
> >> the OSDs did the trick.
> >
> > Thanks for chiming in! Unfortunately, in my case neither an online
> > compaction (ceph tell osd.ID compact) nor an offline repair
> > (ceph-bluestore-tool repair --path /var/lib/ceph/osd/OSD_ID)
> > helps. Compactions do seem to compact some amount. I think the OSD log
> > dumps information about the size of rocksdb. It went from this:
> >
> >   L0    0/0     0.00 KB   0.0   0.0   0.0   0.0   3.9   3.9   0.0   1.0   0.0  62.2   64.81   61.59    89   0.728     0      0
> >   L1    3/0   132.84 MB   0.5   7.0   3.9   3.1   5.0   2.0   0.0   1.3  63.8  46.1  112.11  108.52    23   4.874   56M  7276K
> >   L2   12/0   690.99 MB   0.8   6.5   1.8   4.7   5.6   0.9   0.1   3.2  21.4  18.5  310.78  307.14    28  11.099  165M  3077K
> >   L3   54/0     3.37 GB   0.1   0.9   0.3   0.6   0.5  -0.1   0.0   1.6  35.9  20.2   24.84   24.49     4   6.210   24M    15M
> >  Sum   69/0     4.17 GB   0.0  14.4   6.0   8.3  15.1   6.7   0.1   3.8  28.7  30.1  512.54  501.74   144   3.559  246M    26M
> >  Int    0/0     0.00 KB   0.0   0.8   0.3   0.5   0.6   0.1   0.0  14.1  27.5  20.7   31.13   30.73     4   7.783   18M  4086K
> >
> > To this:
> >
> >   L0    2/0    72.42 MB   0.5   0.0   0.0   0.0   0.1   0.1   0.0   1.0   0.0  63.2    1.14    0.84     2   0.572     0      0
> >   L3   48/0     3.10 GB   0.1   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0    0.00    0.00     0   0.000     0      0
> >  Sum   50/0     3.17 GB   0.0   0.0   0.0   0.0   0.1   0.1   0.0   1.0   0.0  63.2    1.14    0.84     2   0.572     0      0

[ceph-users] Re: Very slow snaptrim operations blocking client I/O

2023-02-15 Thread Victor Rodriguez

An update on this for the record:

To fully solve this I've had to destroy each OSD and create them again,
one by one. I could have done it one host at a time but I've preferred 
to be on the safest side just in case something else went wrong.
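
For the record, the per-OSD cycle was roughly the one below (a sketch, not the
exact commands; $ID and $DEV stand for the OSD id and its data device, and I
waited for HEALTH_OK before moving to the next OSD):

ID=1           # hypothetical OSD id
DEV=/dev/sdX   # hypothetical data device for that OSD

ceph osd out $ID
# wait until the OSD can be removed without reducing data safety
while ! ceph osd safe-to-destroy $ID; do sleep 60; done
systemctl stop ceph-osd@$ID
ceph osd destroy $ID --yes-i-really-mean-it
ceph-volume lvm zap --destroy --osd-id $ID
ceph-volume lvm create --osd-id $ID --data $DEV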


The values for num_pgmeta_omap (which I don't know what it is, yet) on
the new OSDs were similar to other clusters (I've seen from 30 to
70 approx.), so I believe the characteristics of the data in the
cluster do not determine how big or small num_pgmeta_omap should be.


One thing I've noticed is that bad or damaged OSDs (i.e. those
showing high CPU usage and poor performance doing the trim operation)
took much more time to calculate their histogram, even if their
num_pgmeta_omap was low:


(bad OSD):
# time ceph daemon osd.1 calc_objectstore_db_histogram | grep 
"num_pgmeta_omap"

    "num_pgmeta_omap": 673208,

real    1m14,549s
user    0m0,075s
sys    0m0,025s

(good new OSD):
#  time ceph daemon osd.1 calc_objectstore_db_histogram | grep 
"num_pgmeta_omap"

    "num_pgmeta_omap": 434298,

real    0m18,022s
user    0m0,078s
sys    0m0,023s


Maybe it's worth checking that histogram from time to time as a way to
measure the OSD "health"?


Again, thanks everyone.



On 1/30/23 18:18, Victor Rodriguez wrote:


On 1/30/23 15:15, Ana Aviles wrote:

Hi,

Josh already suggested, but I will one more time. We had similar 
behaviour upgrading from Nautilus to Pacific. In our case compacting 
the OSDs did the trick.


Thanks for chiming in! Unfortunately, in my case neither an online
compaction (ceph tell osd.ID compact) nor an offline repair
(ceph-bluestore-tool repair --path /var/lib/ceph/osd/OSD_ID)
helps. Compactions do seem to compact some amount. I think the OSD log
dumps information about the size of rocksdb. It went from this:


 

  L0    0/0     0.00 KB   0.0   0.0   0.0   0.0   3.9   3.9   0.0   1.0   0.0  62.2   64.81   61.59    89   0.728     0      0
  L1    3/0   132.84 MB   0.5   7.0   3.9   3.1   5.0   2.0   0.0   1.3  63.8  46.1  112.11  108.52    23   4.874   56M  7276K
  L2   12/0   690.99 MB   0.8   6.5   1.8   4.7   5.6   0.9   0.1   3.2  21.4  18.5  310.78  307.14    28  11.099  165M  3077K
  L3   54/0     3.37 GB   0.1   0.9   0.3   0.6   0.5  -0.1   0.0   1.6  35.9  20.2   24.84   24.49     4   6.210   24M    15M
 Sum   69/0     4.17 GB   0.0  14.4   6.0   8.3  15.1   6.7   0.1   3.8  28.7  30.1  512.54  501.74   144   3.559  246M    26M
 Int    0/0     0.00 KB   0.0   0.8   0.3   0.5   0.6   0.1   0.0  14.1  27.5  20.7   31.13   30.73     4   7.783   18M  4086K


To this:

 

  L0    2/0    72.42 MB   0.5   0.0   0.0   0.0   0.1   0.1   0.0   1.0   0.0  63.2    1.14    0.84     2   0.572     0      0
  L3   48/0     3.10 GB   0.1   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0    0.00    0.00     0   0.000     0      0
 Sum   50/0     3.17 GB   0.0   0.0   0.0   0.0   0.1   0.1   0.0   1.0   0.0  63.2    1.14    0.84     2   0.572     0      0
 Int    0/0     0.00 KB   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0    0.00    0.00     0   0.000     0      0


Still, it feels "too big" compared to some other OSDs in other
similarly sized clusters, making me think that there's some kind of
"garbage" making the trim go crazy.



For us there was no performance impact running the compaction (ceph 
osd daemon osd.0 compact) although we run them in batches and not all 
at once on all OSDs just in case. Also, no need to restart OSDs for 
this operation.


Yes, compacting had no perceived impact on client performance, just 
some higher CPU usage for the OSD process.



Does anyone know by any chance the meaning of "num_pgmeta_omap" in the
ceph daemon osd.ID calc_objectstore_db_histogram output? As I
mentioned, the OSDs in this cluster have very different values in that
field but all other clusters have much more similar values:


osd.0:    "num_pgmeta_omap": 17526766,
osd.1:    "num_pgmeta_omap": 2653379,
osd.2:    "num_pgmeta_omap": 12358703,
osd.3:    "num_pgmeta_omap": 6404975,
osd.6:    "num_pgmeta_omap": 19845318,
osd.7:    "num_pgmeta_omap": 6043083,
osd.12:   "num_pgmeta_omap": 18666776,
osd.13:    "num_pgmeta_omap": 615846,
osd.14:    "num_pgmeta_omap": 13190188,

Thanks a lot!




--
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Very slow snaptrim operations blocking client I/O

2023-01-30 Thread Victor Rodriguez


On 1/30/23 15:15, Ana Aviles wrote:

Hi,

Josh already suggested, but I will one more time. We had similar 
behaviour upgrading from Nautilus to Pacific. In our case compacting 
the OSDs did the trick.


Thanks for chiming in! Unfortunately, in my case neither an online
compaction (ceph tell osd.ID compact) nor an offline repair
(ceph-bluestore-tool repair --path /var/lib/ceph/osd/OSD_ID) helps.
Compactions do seem to compact some amount. I think the OSD log dumps
information about the size of rocksdb. It went from this:



  L0    0/0     0.00 KB   0.0   0.0   0.0   0.0   3.9   3.9   0.0   1.0   0.0  62.2   64.81   61.59    89   0.728     0      0
  L1    3/0   132.84 MB   0.5   7.0   3.9   3.1   5.0   2.0   0.0   1.3  63.8  46.1  112.11  108.52    23   4.874   56M  7276K
  L2   12/0   690.99 MB   0.8   6.5   1.8   4.7   5.6   0.9   0.1   3.2  21.4  18.5  310.78  307.14    28  11.099  165M  3077K
  L3   54/0     3.37 GB   0.1   0.9   0.3   0.6   0.5  -0.1   0.0   1.6  35.9  20.2   24.84   24.49     4   6.210   24M    15M
 Sum   69/0     4.17 GB   0.0  14.4   6.0   8.3  15.1   6.7   0.1   3.8  28.7  30.1  512.54  501.74   144   3.559  246M    26M
 Int    0/0     0.00 KB   0.0   0.8   0.3   0.5   0.6   0.1   0.0  14.1  27.5  20.7   31.13   30.73     4   7.783   18M  4086K


To this:


  L0    2/0    72.42 MB   0.5   0.0   0.0   0.0   0.1   0.1   0.0   1.0   0.0  63.2    1.14    0.84     2   0.572     0      0
  L3   48/0     3.10 GB   0.1   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0    0.00    0.00     0   0.000     0      0
 Sum   50/0     3.17 GB   0.0   0.0   0.0   0.0   0.1   0.1   0.0   1.0   0.0  63.2    1.14    0.84     2   0.572     0      0
 Int    0/0     0.00 KB   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0    0.00    0.00     0   0.000     0      0


Still, it feels "too big" compared to some other OSDs in other similarly
sized clusters, making me think that there's some kind of "garbage"
making the trim go crazy.



For us there was no performance impact running the compaction (ceph 
osd daemon osd.0 compact) although we run them in batches and not all 
at once on all OSDs just in case. Also, no need to restart OSDs for 
this operation.


Yes, compacting had no perceived impact on client performance, just some 
higher CPU usage for the OSD process.



Does anyone know by any chance the meaning of "num_pgmeta_omap" in the ceph
daemon osd.ID calc_objectstore_db_histogram output? As I mentioned, the
OSDs in this cluster have very different values in that field but all
other clusters have much more similar values:


osd.0:    "num_pgmeta_omap": 17526766,
osd.1:    "num_pgmeta_omap": 2653379,
osd.2:    "num_pgmeta_omap": 12358703,
osd.3:    "num_pgmeta_omap": 6404975,
osd.6:    "num_pgmeta_omap": 19845318,
osd.7:    "num_pgmeta_omap": 6043083,
osd.12:   "num_pgmeta_omap": 18666776,
osd.13:    "num_pgmeta_omap": 615846,
osd.14:    "num_pgmeta_omap": 13190188,
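
In case it's useful, this is how I'm collecting those values per host
(assuming the admin sockets are under /var/run/ceph/):

for sock in /var/run/ceph/ceph-osd.*.asok; do
    echo -n "$(basename "$sock" .asok): "
    ceph daemon "$sock" calc_objectstore_db_histogram | grep num_pgmeta_omap
done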

Thanks a lot!



--
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Very slow snaptrim operations blocking client I/O

2023-01-30 Thread Ana Aviles

Hi,

Josh already suggested it, but I will say it one more time. We had similar 
behaviour upgrading from Nautilus to Pacific. In our case compacting the 
OSDs did the trick.


For us there was no performance impact running the compaction (ceph 
daemon osd.0 compact) although we run them in batches and not all at 
once on all OSDs just in case. Also, no need to restart OSDs for this 
operation.
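
For example (hypothetical OSD ids; one small batch, check client latency, then
continue with the next batch):

for osd in 0 1 2; do
    ceph tell osd.$osd compact
done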


https://www.spinics.net/lists/ceph-users/msg74774.html

Regards,

Ana

On 30-01-2023 10:23, Victor Rodriguez wrote:

I'm reading that thread on spinics. Thanks for pointing that out.

The cluster was upgraded from Nautilus to Octopus first week of 
December. No previous version has been used in this cluster. I just 
don't remember if we used snapshots after the upgrade. 
bluestore_fsck_quick_fix_on_mount is at default of "true". Did not 
notice anything unusual during or after the upgrade.


Is there any way to check if the OMAP conversion is done for an OSD? 
Maybe it tries to do the conversion every time I restart an OSD and 
fails? (given they take nearly a minute to start). I still have 13 PGs 
pending snaptrims and client I/O is severely affected even while doing 
one OSD at a time and osd_snap_trim_sleep_ssd=5 :\



On 1/30/23 09:23, Frank Schilder wrote:

Hi Victor,

out of curiosity, did you upgrade the cluster recently to octopus? We 
and others observed this behaviour when following one out of two 
routes to upgrade OSDs. There was a thread "Octopus OSDs extremely 
slow during upgrade from mimic", which seems to have been lost with 
the recent mail list outage. If it is relevant, I could copy pieces I 
have into here.


Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Victor Rodriguez 
Sent: 29 January 2023 22:40:46
To: ceph-users@ceph.io
Subject: [ceph-users] Re: Very slow snaptrim operations blocking 
client I/O


Looks like this is going to take a few days. I hope to manage the
available performance for VMs with osd_snap_trim_sleep_ssd.

I'm wondering if after that long snaptrim process you went through, was
your cluster was stable again and snapshots/snaptrims did work properly?


On 1/29/23 16:01, Matt Vandermeulen wrote:

I should have explicitly stated that during the recovery, it was still
quite bumpy for customers.  Some snaptrims were very quick, some took
what felt like a really long time.  This was however a cluster with a
very large number of volumes and a long, long history of snapshots.
I'm not sure what the difference will be from our case versus a single
large volume with a big snapshot.



On 2023-01-28 20:45, Victor Rodriguez wrote:

On 1/29/23 00:50, Matt Vandermeulen wrote:

I've observed a similar horror when upgrading a cluster from
Luminous to Nautilus, which had the same effect of an overwhelming
amount of snaptrim making the cluster unusable.

In our case, we held its hand by setting all OSDs to have zero max
trimming PGs, unsetting nosnaptrim, and then slowly enabling
snaptrim a few OSDs at a time.  It was painful to babysit but it
allowed the cluster to catch up without falling over.


That's an interesting approach! Thanks!

On preliminary tests seems that just running snaptrim on a single PG
of a single OSD still makes the cluster barely usable. I have to
increase osd_snap_trim_sleep_ssd to ~1 so the cluster remains usable
by getting a third of its performance. After a while, a few PG got
trimmed and feels like some of them are harder to trim than others,
as some need a higher osd_snap_trim_sleep_ssd value to let the
cluster perform.

I don't know how long this is going to take... Maybe recreating the
OSD's and dealing with the rebalance is a better option?

There's something ugly going on here... I would really like to put my
finger on it.



On 2023-01-28 19:43, Victor Rodriguez wrote:

After some investigation this is what I'm seeing:

- OSD processes get stuck at least at 100% CPU if I ceph osd unset
nosnaptrim. They keep at 100% CPU even if I ceph osd set
nosnaptrim. They stayed like that for at least 26 hours. Some quick
benchmarks don't show a reduction of the performance of the cluster.

- Restarting a OSD lowers it's CPU usage to typical levels, as
expected, but it also usually sets some other OSD in a different
host to typical levels.

- All OSDs in this cluster take quite a bit to start: between 35 to
70 seconds depending on the OSD. Clearly much longer than any other
OSD in any of my clusters.

- I believe that the size of the rocksdb database is dumped in the
OSD log when an automatic compact operation is triggered. The "sum"
sizes of these OSD range between 2.5 and 5.1 GB. Thats way bigger
that those in any other cluster I have.

- ceph daemon osd.* calc_objectstore_db_histogram is giving values
for num_pgmeta_omap (I don't know what it is) way bigger than those
on any other of my cl

[ceph-users] Re: Very slow snaptrim operations blocking client I/O

2023-01-30 Thread Frank Schilder
Glad you found it, it's quite a long one. We could never confirm that a manual 
compaction after an upgrade with quick fix on mount = true would solve the 
problem with snap trim. You might want to try off-line fixing using

ceph-bluestore-tool repair --path /var/lib/ceph/osd/OSD_ID
maybe: ceph-bluestore-tool quick-fix --path /var/lib/ceph/osd/OSD_ID

and possibly off-line OSD compaction using ceph-kvstore-tool. Maybe this 
completes the conversion? As far as I know, running an off-line compaction does 
more than an on-line compaction. This may or may not make the difference. Having 
the flag quick fix on mount = true after conversion has no effect on startup.

These commands are taken from two other long threads that seem at least partly 
present: "OSD crashes during upgrade mimic->octopus" and "LVM osds loose 
connection to disk" where we had to fight with OSDs crashing during the omap 
conversion process.

To run the bluestore/kvstore tool commands, you need to stop the OSD daemon but 
activate the OSD with ceph-volume lvm activate. You need to have everything 
mounted under /var/lib/ceph/osd/OSD_ID while the OSD daemon is down.
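
A rough sketch of that sequence for a single OSD (treat it as an outline rather
than exact commands; $ID is the OSD id and the fsid comes from ceph-volume lvm list):

ID=1                                   # hypothetical OSD id
systemctl stop ceph-osd@$ID
# re-create /var/lib/ceph/osd/ceph-$ID without starting the daemon
FSID=$(ceph-volume lvm list $ID | awk '/osd fsid/ {print $3}')
ceph-volume lvm activate --no-systemd $ID $FSID
ceph-bluestore-tool repair --path /var/lib/ceph/osd/ceph-$ID
ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-$ID compact
systemctl start ceph-osd@$ID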

If this doesn't restore performance during snaptrim I see only a re-creation of 
all OSDs as a way out.

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Victor Rodriguez 
Sent: 30 January 2023 10:23:15
To: ceph-users@ceph.io
Subject: [ceph-users] Re: Very slow snaptrim operations blocking client I/O

I'm reading that thread on spinics. Thanks for pointing that out.

The cluster was upgraded from Nautilus to Octopus first week of
December. No previous version has been used in this cluster. I just
don't remember if we used snapshots after the upgrade.
bluestore_fsck_quick_fix_on_mount is at default of "true". Did not
notice anything unusual during or after the upgrade.

Is there any way to check if the OMAP conversion is done for an OSD?
Maybe it tries to do the conversion every time I restart an OSD and
fails? (given they take nearly a minute to start). I still have 13 PGs
pending snaptrims and client I/O is severely affected even while doing
one OSD at a time and osd_snap_trim_sleep_ssd=5 :\


On 1/30/23 09:23, Frank Schilder wrote:
> Hi Victor,
>
> out of curiosity, did you upgrade the cluster recently to octopus? We and 
> others observed this behaviour when following one out of two routes to 
> upgrade OSDs. There was a thread "Octopus OSDs extremely slow during upgrade 
> from mimic", which seems to have been lost with the recent mail list outage. 
> If it is relevant, I could copy pieces I have into here.
>
> Best regards,
> =
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> ____
> From: Victor Rodriguez 
> Sent: 29 January 2023 22:40:46
> To: ceph-users@ceph.io
> Subject: [ceph-users] Re: Very slow snaptrim operations blocking client I/O
>
> Looks like this is going to take a few days. I hope to manage the
> available performance for VMs with osd_snap_trim_sleep_ssd.
>
> I'm wondering if after that long snaptrim process you went through, was
> your cluster was stable again and snapshots/snaptrims did work properly?
>
>
> On 1/29/23 16:01, Matt Vandermeulen wrote:
>> I should have explicitly stated that during the recovery, it was still
>> quite bumpy for customers.  Some snaptrims were very quick, some took
>> what felt like a really long time.  This was however a cluster with a
>> very large number of volumes and a long, long history of snapshots.
>> I'm not sure what the difference will be from our case versus a single
>> large volume with a big snapshot.
>>
>>
>>
>> On 2023-01-28 20:45, Victor Rodriguez wrote:
>>> On 1/29/23 00:50, Matt Vandermeulen wrote:
>>>> I've observed a similar horror when upgrading a cluster from
>>>> Luminous to Nautilus, which had the same effect of an overwhelming
>>>> amount of snaptrim making the cluster unusable.
>>>>
>>>> In our case, we held its hand by setting all OSDs to have zero max
>>>> trimming PGs, unsetting nosnaptrim, and then slowly enabling
>>>> snaptrim a few OSDs at a time.  It was painful to babysit but it
>>>> allowed the cluster to catch up without falling over.
>>>
>>> That's an interesting approach! Thanks!
>>>
>>> On preliminary tests seems that just running snaptrim on a single PG
>>> of a single OSD still makes the cluster barely usable. I have to
>>> increase osd_snap_trim_sleep_ssd to ~1 so the cluster remains usable
>>> by getting a third of its performance. After a while, a few PG got

[ceph-users] Re: Very slow snaptrim operations blocking client I/O

2023-01-30 Thread Victor Rodriguez

I'm reading that thread on spinics. Thanks for pointing that out.

The cluster was upgraded from Nautilus to Octopus in the first week of 
December. No previous version has been used in this cluster. I just 
don't remember if we used snapshots after the upgrade. 
bluestore_fsck_quick_fix_on_mount is at its default of "true". Did not 
notice anything unusual during or after the upgrade.


Is there any way to check if the OMAP conversion is done for an OSD? 
Maybe it tries to do the conversion every time I restart an OSD and 
fails? (given they take nearly a minute to start). I still have 13 PGs 
pending snaptrims and client I/O is severely affected even while doing 
one OSD at a time and osd_snap_trim_sleep_ssd=5 :\



On 1/30/23 09:23, Frank Schilder wrote:

Hi Victor,

out of curiosity, did you upgrade the cluster recently to octopus? We and others observed 
this behaviour when following one out of two routes to upgrade OSDs. There was a thread 
"Octopus OSDs extremely slow during upgrade from mimic", which seems to have 
been lost with the recent mail list outage. If it is relevant, I could copy pieces I have 
into here.

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Victor Rodriguez 
Sent: 29 January 2023 22:40:46
To: ceph-users@ceph.io
Subject: [ceph-users] Re: Very slow snaptrim operations blocking client I/O

Looks like this is going to take a few days. I hope to manage the
available performance for VMs with osd_snap_trim_sleep_ssd.

I'm wondering if after that long snaptrim process you went through, was
your cluster was stable again and snapshots/snaptrims did work properly?


On 1/29/23 16:01, Matt Vandermeulen wrote:

I should have explicitly stated that during the recovery, it was still
quite bumpy for customers.  Some snaptrims were very quick, some took
what felt like a really long time.  This was however a cluster with a
very large number of volumes and a long, long history of snapshots.
I'm not sure what the difference will be from our case versus a single
large volume with a big snapshot.



On 2023-01-28 20:45, Victor Rodriguez wrote:

On 1/29/23 00:50, Matt Vandermeulen wrote:

I've observed a similar horror when upgrading a cluster from
Luminous to Nautilus, which had the same effect of an overwhelming
amount of snaptrim making the cluster unusable.

In our case, we held its hand by setting all OSDs to have zero max
trimming PGs, unsetting nosnaptrim, and then slowly enabling
snaptrim a few OSDs at a time.  It was painful to babysit but it
allowed the cluster to catch up without falling over.


That's an interesting approach! Thanks!

On preliminary tests seems that just running snaptrim on a single PG
of a single OSD still makes the cluster barely usable. I have to
increase osd_snap_trim_sleep_ssd to ~1 so the cluster remains usable
by getting a third of its performance. After a while, a few PG got
trimmed and feels like some of them are harder to trim than others,
as some need a higher osd_snap_trim_sleep_ssd value to let the
cluster perform.

I don't know how long this is going to take... Maybe recreating the
OSD's and dealing with the rebalance is a better option?

There's something ugly going on here... I would really like to put my
finger on it.



On 2023-01-28 19:43, Victor Rodriguez wrote:

After some investigation this is what I'm seeing:

- OSD processes get stuck at least at 100% CPU if I ceph osd unset
nosnaptrim. They keep at 100% CPU even if I ceph osd set
nosnaptrim. They stayed like that for at least 26 hours. Some quick
benchmarks don't show a reduction of the performance of the cluster.

- Restarting a OSD lowers it's CPU usage to typical levels, as
expected, but it also usually sets some other OSD in a different
host to typical levels.

- All OSDs in this cluster take quite a bit to start: between 35 to
70 seconds depending on the OSD. Clearly much longer than any other
OSD in any of my clusters.

- I believe that the size of the rocksdb database is dumped in the
OSD log when an automatic compact operation is triggered. The "sum"
sizes of these OSD range between 2.5 and 5.1 GB. Thats way bigger
that those in any other cluster I have.

- ceph daemon osd.* calc_objectstore_db_histogram is giving values
for num_pgmeta_omap (I don't know what it is) way bigger than those
on any other of my clusters for some OSD. Also, values are not
similar among the OSD which hold the same PGs.

osd.0:"num_pgmeta_omap": 17526766,
osd.1:"num_pgmeta_omap": 2653379,
osd.2:"num_pgmeta_omap": 12358703,
osd.3:"num_pgmeta_omap": 6404975,
osd.6:"num_pgmeta_omap": 19845318,
osd.7:"num_pgmeta_omap": 6043083,
osd.12:   "num_pgmeta_omap": 18666776,
osd.13:"num_pgmeta_omap": 615846,
osd.14:"num_pgmeta_omap"

[ceph-users] Re: Very slow snaptrim operations blocking client I/O

2023-01-30 Thread Frank Schilder
Hi Victor,

out of curiosity, did you upgrade the cluster recently to octopus? We and 
others observed this behaviour when following one out of two routes to upgrade 
OSDs. There was a thread "Octopus OSDs extremely slow during upgrade from 
mimic", which seems to have been lost with the recent mail list outage. If it 
is relevant, I could copy pieces I have into here.

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Victor Rodriguez 
Sent: 29 January 2023 22:40:46
To: ceph-users@ceph.io
Subject: [ceph-users] Re: Very slow snaptrim operations blocking client I/O

Looks like this is going to take a few days. I hope to manage the
available performance for VMs with osd_snap_trim_sleep_ssd.

I'm wondering if after that long snaptrim process you went through, was
your cluster was stable again and snapshots/snaptrims did work properly?


On 1/29/23 16:01, Matt Vandermeulen wrote:
> I should have explicitly stated that during the recovery, it was still
> quite bumpy for customers.  Some snaptrims were very quick, some took
> what felt like a really long time.  This was however a cluster with a
> very large number of volumes and a long, long history of snapshots.
> I'm not sure what the difference will be from our case versus a single
> large volume with a big snapshot.
>
>
>
> On 2023-01-28 20:45, Victor Rodriguez wrote:
>> On 1/29/23 00:50, Matt Vandermeulen wrote:
>>> I've observed a similar horror when upgrading a cluster from
>>> Luminous to Nautilus, which had the same effect of an overwhelming
>>> amount of snaptrim making the cluster unusable.
>>>
>>> In our case, we held its hand by setting all OSDs to have zero max
>>> trimming PGs, unsetting nosnaptrim, and then slowly enabling
>>> snaptrim a few OSDs at a time.  It was painful to babysit but it
>>> allowed the cluster to catch up without falling over.
>>
>>
>> That's an interesting approach! Thanks!
>>
>> On preliminary tests seems that just running snaptrim on a single PG
>> of a single OSD still makes the cluster barely usable. I have to
>> increase osd_snap_trim_sleep_ssd to ~1 so the cluster remains usable
>> by getting a third of its performance. After a while, a few PG got
>> trimmed and feels like some of them are harder to trim than others,
>> as some need a higher osd_snap_trim_sleep_ssd value to let the
>> cluster perform.
>>
>> I don't know how long this is going to take... Maybe recreating the
>> OSD's and dealing with the rebalance is a better option?
>>
>> There's something ugly going on here... I would really like to put my
>> finger on it.
>>
>>
>>> On 2023-01-28 19:43, Victor Rodriguez wrote:
>>>> After some investigation this is what I'm seeing:
>>>>
>>>> - OSD processes get stuck at least at 100% CPU if I ceph osd unset
>>>> nosnaptrim. They keep at 100% CPU even if I ceph osd set
>>>> nosnaptrim. They stayed like that for at least 26 hours. Some quick
>>>> benchmarks don't show a reduction of the performance of the cluster.
>>>>
>>>> - Restarting a OSD lowers it's CPU usage to typical levels, as
>>>> expected, but it also usually sets some other OSD in a different
>>>> host to typical levels.
>>>>
>>>> - All OSDs in this cluster take quite a bit to start: between 35 to
>>>> 70 seconds depending on the OSD. Clearly much longer than any other
>>>> OSD in any of my clusters.
>>>>
>>>> - I believe that the size of the rocksdb database is dumped in the
>>>> OSD log when an automatic compact operation is triggered. The "sum"
>>>> sizes of these OSD range between 2.5 and 5.1 GB. Thats way bigger
>>>> that those in any other cluster I have.
>>>>
>>>> - ceph daemon osd.* calc_objectstore_db_histogram is giving values
>>>> for num_pgmeta_omap (I don't know what it is) way bigger than those
>>>> on any other of my clusters for some OSD. Also, values are not
>>>> similar among the OSD which hold the same PGs.
>>>>
>>>> osd.0:"num_pgmeta_omap": 17526766,
>>>> osd.1:"num_pgmeta_omap": 2653379,
>>>> osd.2:"num_pgmeta_omap": 12358703,
>>>> osd.3:"num_pgmeta_omap": 6404975,
>>>> osd.6:"num_pgmeta_omap": 19845318,
>>>> osd.7:"num_pgmeta_omap": 6043083,
>>>> os

[ceph-users] Re: Very slow snaptrim operations blocking client I/O

2023-01-29 Thread Victor Rodriguez


Looks like this is going to take a few days. I hope to manage the 
available performance for VMs with osd_snap_trim_sleep_ssd.
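
Concretely I mean something along these lines (runtime injection; the sleep
value is just whatever keeps the VMs responsive):

ceph tell 'osd.*' injectargs '--osd_snap_trim_sleep_ssd 5'
ceph tell 'osd.*' injectargs '--osd_max_trimming_pgs 1'
# persistent across OSD restarts, if needed:
# ceph config set osd osd_snap_trim_sleep_ssd 5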


I'm wondering, after that long snaptrim process you went through, was 
your cluster stable again and did snapshots/snaptrims work properly?



On 1/29/23 16:01, Matt Vandermeulen wrote:
I should have explicitly stated that during the recovery, it was still 
quite bumpy for customers.  Some snaptrims were very quick, some took 
what felt like a really long time.  This was however a cluster with a 
very large number of volumes and a long, long history of snapshots.  
I'm not sure what the difference will be from our case versus a single 
large volume with a big snapshot.




On 2023-01-28 20:45, Victor Rodriguez wrote:

On 1/29/23 00:50, Matt Vandermeulen wrote:
I've observed a similar horror when upgrading a cluster from 
Luminous to Nautilus, which had the same effect of an overwhelming 
amount of snaptrim making the cluster unusable.


In our case, we held its hand by setting all OSDs to have zero max 
trimming PGs, unsetting nosnaptrim, and then slowly enabling 
snaptrim a few OSDs at a time.  It was painful to babysit but it 
allowed the cluster to catch up without falling over.



That's an interesting approach! Thanks!

On preliminary tests seems that just running snaptrim on a single PG 
of a single OSD still makes the cluster barely usable. I have to 
increase osd_snap_trim_sleep_ssd to ~1 so the cluster remains usable 
by getting a third of its performance. After a while, a few PG got 
trimmed and feels like some of them are harder to trim than others, 
as some need a higher osd_snap_trim_sleep_ssd value to let the 
cluster perform.


I don't know how long this is going to take... Maybe recreating the 
OSD's and dealing with the rebalance is a better option?


There's something ugly going on here... I would really like to put my 
finger on it.




On 2023-01-28 19:43, Victor Rodriguez wrote:

After some investigation this is what I'm seeing:

- OSD processes get stuck at least at 100% CPU if I ceph osd unset 
nosnaptrim. They keep at 100% CPU even if I ceph osd set 
nosnaptrim. They stayed like that for at least 26 hours. Some quick 
benchmarks don't show a reduction of the performance of the cluster.


- Restarting a OSD lowers it's CPU usage to typical levels, as 
expected, but it also usually sets some other OSD in a different 
host to typical levels.


- All OSDs in this cluster take quite a bit to start: between 35 to 
70 seconds depending on the OSD. Clearly much longer than any other 
OSD in any of my clusters.


- I believe that the size of the rocksdb database is dumped in the 
OSD log when an automatic compact operation is triggered. The "sum" 
sizes of these OSD range between 2.5 and 5.1 GB. Thats way bigger 
that those in any other cluster I have.


- ceph daemon osd.* calc_objectstore_db_histogram is giving values 
for num_pgmeta_omap (I don't know what it is) way bigger than those 
on any other of my clusters for some OSD. Also, values are not 
similar among the OSD which hold the same PGs.


osd.0:    "num_pgmeta_omap": 17526766,
osd.1:    "num_pgmeta_omap": 2653379,
osd.2:    "num_pgmeta_omap": 12358703,
osd.3:    "num_pgmeta_omap": 6404975,
osd.6:    "num_pgmeta_omap": 19845318,
osd.7:    "num_pgmeta_omap": 6043083,
osd.12:   "num_pgmeta_omap": 18666776,
osd.13:    "num_pgmeta_omap": 615846,
osd.14:    "num_pgmeta_omap": 13190188,

- Compacting the OSD barely reduces rocksdb size and does not 
reduce num_pgmeta_omap at all.


- This is the only cluster I have were there are some RBD images 
that I mount directly from some clients, that is, they are not 
disks for QEMU/Proxmox VMs. Maybe I have something misconfigured 
related to this?  This cluster is at least two and half years old 
an never had this issue with snaptrims.


Thanks in advance!




--
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Very slow snaptrim operations blocking client I/O

2023-01-29 Thread Matt Vandermeulen
I should have explicitly stated that during the recovery, it was still 
quite bumpy for customers.  Some snaptrims were very quick, some took 
what felt like a really long time.  This was however a cluster with a 
very large number of volumes and a long, long history of snapshots.  I'm 
not sure what the difference will be from our case versus a single large 
volume with a big snapshot.




On 2023-01-28 20:45, Victor Rodriguez wrote:

On 1/29/23 00:50, Matt Vandermeulen wrote:
I've observed a similar horror when upgrading a cluster from Luminous 
to Nautilus, which had the same effect of an overwhelming amount of 
snaptrim making the cluster unusable.


In our case, we held its hand by setting all OSDs to have zero max 
trimming PGs, unsetting nosnaptrim, and then slowly enabling snaptrim 
a few OSDs at a time.  It was painful to babysit but it allowed the 
cluster to catch up without falling over.



That's an interesting approach! Thanks!

On preliminary tests seems that just running snaptrim on a single PG of 
a single OSD still makes the cluster barely usable. I have to increase 
osd_snap_trim_sleep_ssd to ~1 so the cluster remains usable by getting 
a third of its performance. After a while, a few PG got trimmed and 
feels like some of them are harder to trim than others, as some need a 
higher osd_snap_trim_sleep_ssd value to let the cluster perform.


I don't know how long this is going to take... Maybe recreating the 
OSD's and dealing with the rebalance is a better option?


There's something ugly going on here... I would really like to put my 
finger on it.




On 2023-01-28 19:43, Victor Rodriguez wrote:

After some investigation this is what I'm seeing:

- OSD processes get stuck at least at 100% CPU if I ceph osd unset 
nosnaptrim. They keep at 100% CPU even if I ceph osd set nosnaptrim. 
They stayed like that for at least 26 hours. Some quick benchmarks 
don't show a reduction of the performance of the cluster.


- Restarting a OSD lowers it's CPU usage to typical levels, as 
expected, but it also usually sets some other OSD in a different host 
to typical levels.


- All OSDs in this cluster take quite a bit to start: between 35 to 
70 seconds depending on the OSD. Clearly much longer than any other 
OSD in any of my clusters.


- I believe that the size of the rocksdb database is dumped in the 
OSD log when an automatic compact operation is triggered. The "sum" 
sizes of these OSD range between 2.5 and 5.1 GB. Thats way bigger 
that those in any other cluster I have.


- ceph daemon osd.* calc_objectstore_db_histogram is giving values 
for num_pgmeta_omap (I don't know what it is) way bigger than those 
on any other of my clusters for some OSD. Also, values are not 
similar among the OSD which hold the same PGs.


osd.0:    "num_pgmeta_omap": 17526766,
osd.1:    "num_pgmeta_omap": 2653379,
osd.2:    "num_pgmeta_omap": 12358703,
osd.3:    "num_pgmeta_omap": 6404975,
osd.6:    "num_pgmeta_omap": 19845318,
osd.7:    "num_pgmeta_omap": 6043083,
osd.12:   "num_pgmeta_omap": 18666776,
osd.13:    "num_pgmeta_omap": 615846,
osd.14:    "num_pgmeta_omap": 13190188,

- Compacting the OSD barely reduces rocksdb size and does not reduce 
num_pgmeta_omap at all.


- This is the only cluster I have were there are some RBD images that 
I mount directly from some clients, that is, they are not disks for 
QEMU/Proxmox VMs. Maybe I have something misconfigured related to 
this?  This cluster is at least two and half years old an never had 
this issue with snaptrims.


Thanks in advance!

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Very slow snaptrim operations blocking client I/O

2023-01-28 Thread Victor Rodriguez

On 1/29/23 00:50, Matt Vandermeulen wrote:
I've observed a similar horror when upgrading a cluster from Luminous 
to Nautilus, which had the same effect of an overwhelming amount of 
snaptrim making the cluster unusable.


In our case, we held its hand by setting all OSDs to have zero max 
trimming PGs, unsetting nosnaptrim, and then slowly enabling snaptrim 
a few OSDs at a time.  It was painful to babysit but it allowed the 
cluster to catch up without falling over.



That's an interesting approach! Thanks!

On preliminary tests it seems that just running snaptrim on a single PG of 
a single OSD still makes the cluster barely usable. I have to increase 
osd_snap_trim_sleep_ssd to ~1 so the cluster remains usable, at about a 
third of its performance. After a while, a few PGs got trimmed and it feels 
like some of them are harder to trim than others, as some need a higher 
osd_snap_trim_sleep_ssd value to let the cluster perform.


I don't know how long this is going to take... Maybe recreating the 
OSDs and dealing with the rebalance is a better option?


There's something ugly going on here... I would really like to put my 
finger on it.




On 2023-01-28 19:43, Victor Rodriguez wrote:

After some investigation this is what I'm seeing:

- OSD processes get stuck at least at 100% CPU if I ceph osd unset 
nosnaptrim. They keep at 100% CPU even if I ceph osd set nosnaptrim. 
They stayed like that for at least 26 hours. Some quick benchmarks 
don't show a reduction of the performance of the cluster.


- Restarting a OSD lowers it's CPU usage to typical levels, as 
expected, but it also usually sets some other OSD in a different host 
to typical levels.


- All OSDs in this cluster take quite a bit to start: between 35 to 
70 seconds depending on the OSD. Clearly much longer than any other 
OSD in any of my clusters.


- I believe that the size of the rocksdb database is dumped in the 
OSD log when an automatic compact operation is triggered. The "sum" 
sizes of these OSD range between 2.5 and 5.1 GB. Thats way bigger 
that those in any other cluster I have.


- ceph daemon osd.* calc_objectstore_db_histogram is giving values 
for num_pgmeta_omap (I don't know what it is) way bigger than those 
on any other of my clusters for some OSD. Also, values are not 
similar among the OSD which hold the same PGs.


osd.0:    "num_pgmeta_omap": 17526766,
osd.1:    "num_pgmeta_omap": 2653379,
osd.2:    "num_pgmeta_omap": 12358703,
osd.3:    "num_pgmeta_omap": 6404975,
osd.6:    "num_pgmeta_omap": 19845318,
osd.7:    "num_pgmeta_omap": 6043083,
osd.12:   "num_pgmeta_omap": 18666776,
osd.13:    "num_pgmeta_omap": 615846,
osd.14:    "num_pgmeta_omap": 13190188,

- Compacting the OSD barely reduces rocksdb size and does not reduce 
num_pgmeta_omap at all.


- This is the only cluster I have were there are some RBD images that 
I mount directly from some clients, that is, they are not disks for 
QEMU/Proxmox VMs. Maybe I have something misconfigured related to 
this?  This cluster is at least two and half years old an never had 
this issue with snaptrims.


Thanks in advance!

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Very slow snaptrim operations blocking client I/O

2023-01-28 Thread Matt Vandermeulen
I've observed a similar horror when upgrading a cluster from Luminous to 
Nautilus, which had the same effect of an overwhelming amount of 
snaptrim making the cluster unusable.


In our case, we held its hand by setting all OSDs to have zero max 
trimming PGs, unsetting nosnaptrim, and then slowly enabling snaptrim a 
few OSDs at a time.  It was painful to babysit but it allowed the 
cluster to catch up without falling over.
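
In rough commands, the dance looked something like this (a sketch from memory, 
with hypothetical OSD ids):

# stop all trimming first, then clear the flag
ceph tell 'osd.*' injectargs '--osd_max_trimming_pgs 0'
ceph osd unset nosnaptrim

# then re-enable trimming on a few OSDs at a time and watch client latency
for osd in 0 1 2; do
    ceph tell osd.$osd injectargs '--osd_max_trimming_pgs 1'
done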




On 2023-01-28 19:43, Victor Rodriguez wrote:

After some investigation this is what I'm seeing:

- OSD processes get stuck at least at 100% CPU if I ceph osd unset 
nosnaptrim. They keep at 100% CPU even if I ceph osd set nosnaptrim. 
They stayed like that for at least 26 hours. Some quick benchmarks 
don't show a reduction of the performance of the cluster.


- Restarting a OSD lowers it's CPU usage to typical levels, as 
expected, but it also usually sets some other OSD in a different host 
to typical levels.


- All OSDs in this cluster take quite a bit to start: between 35 to 70 
seconds depending on the OSD. Clearly much longer than any other OSD in 
any of my clusters.


- I believe that the size of the rocksdb database is dumped in the OSD 
log when an automatic compact operation is triggered. The "sum" sizes 
of these OSD range between 2.5 and 5.1 GB. Thats way bigger that those 
in any other cluster I have.


- ceph daemon osd.* calc_objectstore_db_histogram is giving values for 
num_pgmeta_omap (I don't know what it is) way bigger than those on any 
other of my clusters for some OSD. Also, values are not similar among 
the OSD which hold the same PGs.


osd.0:    "num_pgmeta_omap": 17526766,
osd.1:    "num_pgmeta_omap": 2653379,
osd.2:    "num_pgmeta_omap": 12358703,
osd.3:    "num_pgmeta_omap": 6404975,
osd.6:    "num_pgmeta_omap": 19845318,
osd.7:    "num_pgmeta_omap": 6043083,
osd.12:   "num_pgmeta_omap": 18666776,
osd.13:    "num_pgmeta_omap": 615846,
osd.14:    "num_pgmeta_omap": 13190188,

- Compacting the OSD barely reduces rocksdb size and does not reduce 
num_pgmeta_omap at all.


- This is the only cluster I have were there are some RBD images that I 
mount directly from some clients, that is, they are not disks for 
QEMU/Proxmox VMs. Maybe I have something misconfigured related to 
this?  This cluster is at least two and half years old an never had 
this issue with snaptrims.


Thanks in advance!


On 1/27/23 17:29, Victor Rodriguez wrote:
Ah yes, checked that too. Monitors and OSD's report with ceph config 
show-with-defaults that bluefs_buffered_io is set to true as default 
setting (it isn't overriden somewere).



On 1/27/23 17:15, Wesley Dillingham wrote:
I hit this issue once on a nautilus cluster and changed the OSD 
parameter bluefs_buffered_io = true (was set at false). I believe the 
default of this parameter was switched from false to true in release 
14.2.20, however, perhaps you could still check what your osds are 
configured with in regard to this config item.


Respectfully,

*Wes Dillingham*
w...@wesdillingham.com
LinkedIn 


On Fri, Jan 27, 2023 at 8:52 AM Victor Rodriguez 
 wrote:


    Hello,

    Asking for help with an issue. Maybe someone has a clue about 
what's

    going on.

    Using ceph 15.2.17 on Proxmox 7.3. A big VM had a snapshot and I
    removed
    it. A bit later, nearly half of the PGs of the pool entered
    snaptrim and
    snaptrim_wait state, as expected. The problem is that such 
operations

    ran extremely slow and client I/O was nearly nothing, so all VMs
    in the
    cluster got stuck as they could not I/O to the storage. Taking 
and

    removing big snapshots is a normal operation that we do often and
    this
    is the first time I see this issue in any of my clusters.

    Disks are all Samsung PM1733 and network is 25G. It gives us
    plenty of
    performance for the use case and never had an issue with the 
hardware.


    Both disk I/O and network I/O was very low. Still, client I/O
    seemed to
    get queued forever. Disabling snaptrim (ceph osd set nosnaptrim)
    stops
    any active snaptrim operation and client I/O resumes back to 
normal.

    Enabling snaptrim again makes client I/O to almost halt again.

    I've been playing with some settings:

    ceph tell 'osd.*' injectargs '--osd-max-trimming-pgs 1'
    ceph tell 'osd.*' injectargs '--osd-snap-trim-sleep 30'
    ceph tell 'osd.*' injectargs '--osd-snap-trim-sleep-ssd 30'
    ceph tell 'osd.*' injectargs '--osd-pg-max-concurrent-snap-trims 
1'


    None really seemed to help. Also tried restarting OSD services.

    This cluster was upgraded from 14.2.x to 15.2.17 a couple of
    months. Is
    there any setting that must be changed which may cause this 
problem?


    I have scheduled a maintenance window, what should I look for to
    diagnose this problem?

    Any help is very appreciated. Thanks in advance.

    Victor



[ceph-users] Re: Very slow snaptrim operations blocking client I/O

2023-01-28 Thread Victor Rodriguez

After some investigation this is what I'm seeing:

- OSD processes get stuck at least at 100% CPU if I ceph osd unset 
nosnaptrim. They keep at 100% CPU even if I ceph osd set nosnaptrim. 
They stayed like that for at least 26 hours. Some quick benchmarks don't 
show a reduction of the performance of the cluster.


- Restarting an OSD lowers its CPU usage to typical levels, as expected, 
but it also usually brings some other OSD in a different host down to typical 
levels.


- All OSDs in this cluster take quite a while to start: between 35 and 70 
seconds depending on the OSD. Clearly much longer than any other OSD in 
any of my clusters.


- I believe that the size of the rocksdb database is dumped in the OSD 
log when an automatic compaction is triggered. The "Sum" sizes of 
these OSDs range between 2.5 and 5.1 GB. That's way bigger than those in 
any other cluster I have (see the grep sketch after this list).


- ceph daemon osd.* calc_objectstore_db_histogram is giving values for 
num_pgmeta_omap (I don't know what it is) way bigger than those on any 
other of my clusters for some OSDs. Also, values are not similar among 
the OSDs which hold the same PGs.


osd.0:    "num_pgmeta_omap": 17526766,
osd.1:    "num_pgmeta_omap": 2653379,
osd.2:    "num_pgmeta_omap": 12358703,
osd.3:    "num_pgmeta_omap": 6404975,
osd.6:    "num_pgmeta_omap": 19845318,
osd.7:    "num_pgmeta_omap": 6043083,
osd.12:   "num_pgmeta_omap": 18666776,
osd.13:    "num_pgmeta_omap": 615846,
osd.14:    "num_pgmeta_omap": 13190188,

- Compacting the OSD barely reduces rocksdb size and does not reduce 
num_pgmeta_omap at all.


- This is the only cluster I have where there are some RBD images that I 
mount directly from some clients, that is, they are not disks for 
QEMU/Proxmox VMs. Maybe I have something misconfigured related to this?  
This cluster is at least two and a half years old and never had this issue 
with snaptrims.
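
(The grep sketch mentioned above, assuming the default log location, just pulls
the latest "Sum" rows of the compaction stats out of the OSD logs:)

grep -h ' Sum ' /var/log/ceph/ceph-osd.*.log | tail -n 20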


Thanks in advance!


On 1/27/23 17:29, Victor Rodriguez wrote:
Ah yes, checked that too. Monitors and OSD's report with ceph config 
show-with-defaults that bluefs_buffered_io is set to true as default 
setting (it isn't overriden somewere).



On 1/27/23 17:15, Wesley Dillingham wrote:
I hit this issue once on a nautilus cluster and changed the OSD 
parameter bluefs_buffered_io = true (was set at false). I believe the 
default of this parameter was switched from false to true in release 
14.2.20, however, perhaps you could still check what your osds are 
configured with in regard to this config item.


Respectfully,

*Wes Dillingham*
w...@wesdillingham.com
LinkedIn 


On Fri, Jan 27, 2023 at 8:52 AM Victor Rodriguez 
 wrote:


    Hello,

    Asking for help with an issue. Maybe someone has a clue about what's
    going on.

    Using ceph 15.2.17 on Proxmox 7.3. A big VM had a snapshot and I
    removed
    it. A bit later, nearly half of the PGs of the pool entered
    snaptrim and
    snaptrim_wait state, as expected. The problem is that such 
operations

    ran extremely slow and client I/O was nearly nothing, so all VMs
    in the
    cluster got stuck as they could not I/O to the storage. Taking and
    removing big snapshots is a normal operation that we do often and
    this
    is the first time I see this issue in any of my clusters.

    Disks are all Samsung PM1733 and network is 25G. It gives us
    plenty of
    performance for the use case and never had an issue with the 
hardware.


    Both disk I/O and network I/O was very low. Still, client I/O
    seemed to
    get queued forever. Disabling snaptrim (ceph osd set nosnaptrim)
    stops
    any active snaptrim operation and client I/O resumes back to normal.
    Enabling snaptrim again makes client I/O to almost halt again.

    I've been playing with some settings:

    ceph tell 'osd.*' injectargs '--osd-max-trimming-pgs 1'
    ceph tell 'osd.*' injectargs '--osd-snap-trim-sleep 30'
    ceph tell 'osd.*' injectargs '--osd-snap-trim-sleep-ssd 30'
    ceph tell 'osd.*' injectargs '--osd-pg-max-concurrent-snap-trims 1'

    None really seemed to help. Also tried restarting OSD services.

    This cluster was upgraded from 14.2.x to 15.2.17 a couple of
    months. Is
    there any setting that must be changed which may cause this problem?

    I have scheduled a maintenance window, what should I look for to
    diagnose this problem?

    Any help is very appreciated. Thanks in advance.

    Victor




--
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Very slow snaptrim operations blocking client I/O

2023-01-27 Thread Victor Rodriguez



On 1/27/23 17:44, Josh Baergen wrote:

This might be due to tombstone accumulation in rocksdb. You can try to
issue a compact to all of your OSDs and see if that helps (ceph tell
osd.XXX compact). I usually prefer to do this one host at a time just
in case it causes issues, though on a reasonably fast RBD cluster you
can often get away with compacting everything at once.

Josh


Is there a way to get the current size or count of the tombstone list? Or 
maybe the size of the whole rocksdb?


I would like to:

- Compare such sizes with other similar clusters
- Compare the sizes before and after the compact operation.

I don't want to risk any more downtime until the scheduled maintenance 
window I have tomorrow, so I can't run the compaction now.
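
(For the overall rocksdb size I'm thinking of just reading the bluefs perf
counters, something like the line below, assuming jq is available; I don't know
of a direct tombstone counter:)

ceph daemon osd.1 perf dump | jq '.bluefs | {db_total_bytes, db_used_bytes}'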






On Fri, Jan 27, 2023 at 6:52 AM Victor Rodriguez
 wrote:

Hello,

Asking for help with an issue. Maybe someone has a clue about what's
going on.

Using ceph 15.2.17 on Proxmox 7.3. A big VM had a snapshot and I removed
it. A bit later, nearly half of the PGs of the pool entered snaptrim and
snaptrim_wait state, as expected. The problem is that such operations
ran extremely slow and client I/O was nearly nothing, so all VMs in the
cluster got stuck as they could not I/O to the storage. Taking and
removing big snapshots is a normal operation that we do often and this
is the first time I see this issue in any of my clusters.

Disks are all Samsung PM1733 and network is 25G. It gives us plenty of
performance for the use case and never had an issue with the hardware.

Both disk I/O and network I/O was very low. Still, client I/O seemed to
get queued forever. Disabling snaptrim (ceph osd set nosnaptrim) stops
any active snaptrim operation and client I/O resumes back to normal.
Enabling snaptrim again makes client I/O to almost halt again.

I've been playing with some settings:

ceph tell 'osd.*' injectargs '--osd-max-trimming-pgs 1'
ceph tell 'osd.*' injectargs '--osd-snap-trim-sleep 30'
ceph tell 'osd.*' injectargs '--osd-snap-trim-sleep-ssd 30'
ceph tell 'osd.*' injectargs '--osd-pg-max-concurrent-snap-trims 1'

None really seemed to help. Also tried restarting OSD services.

This cluster was upgraded from 14.2.x to 15.2.17 a couple of months ago. Is
there any setting that must be changed which may cause this problem?

I have scheduled a maintenance window, what should I look for to
diagnose this problem?

Any help is very appreciated. Thanks in advance.

Victor


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Very slow snaptrim operations blocking client I/O

2023-01-27 Thread Victor Rodriguez

FWIW, the snapshot was in pool cephVMs01_comp, which does use compression.


How is your pg distribution on your osd devices? 


Looks like the PGs are not perfectly balanced, but it doesn't seem to be 
too bad:


ceph osd df tree
ID  CLASS  WEIGHT    REWEIGHT  SIZE     RAW USE  DATA     OMAP     META     AVAIL    %USE   VAR   PGS  STATUS  TYPE NAME
-1         13.10057         -   13 TiB  7.7 TiB  7.6 TiB  150 MiB   47 GiB  5.4 TiB  58.41  1.00    -          root default
-3          4.36659         -  4.4 TiB  2.6 TiB  2.5 TiB   50 MiB   16 GiB  1.8 TiB  58.43  1.00    -          host maigmo01
 0    ssd   1.74660   1.0     1.7 TiB  971 GiB  966 GiB   15 MiB  5.8 GiB  817 GiB  54.31  0.93  123      up  osd.0
 1    ssd   1.31000   1.0     1.3 TiB  794 GiB  790 GiB   17 MiB  4.2 GiB  547 GiB  59.23  1.01   99      up  osd.1
 2    ssd   1.31000   1.0     1.3 TiB  847 GiB  841 GiB   18 MiB  6.0 GiB  495 GiB  63.12  1.08   99      up  osd.2
-5          4.36659         -  4.4 TiB  2.6 TiB  2.5 TiB   59 MiB   16 GiB  1.8 TiB  58.42  1.00    -          host maigmo02
 3    ssd   1.31000   1.0     1.3 TiB  714 GiB  710 GiB   24 MiB  4.1 GiB  627 GiB  53.23  0.91   92      up  osd.3
 6    ssd   1.74660   1.0     1.7 TiB  1.1 TiB  1.1 TiB   20 MiB  7.1 GiB  645 GiB  63.94  1.09  137      up  osd.6
 7    ssd   1.31000   1.0     1.3 TiB  755 GiB  750 GiB   15 MiB  4.4 GiB  587 GiB  56.26  0.96   92      up  osd.7
-2          4.36739         -  4.4 TiB  2.6 TiB  2.5 TiB   41 MiB   15 GiB  1.8 TiB  58.39  1.00    -          host maigmo04
12    ssd   1.74660   1.0     1.7 TiB  1.1 TiB  1.1 TiB   16 MiB  6.5 GiB  700 GiB  60.83  1.04  133      up  osd.12
13    ssd   1.31039   1.0     1.3 TiB  634 GiB  631 GiB   10 MiB  2.8 GiB  708 GiB  47.24  0.81   83      up  osd.13
14    ssd   1.31039   1.0     1.3 TiB  890 GiB  884 GiB   14 MiB  5.6 GiB  452 GiB  66.29  1.13  105      up  osd.14
                       TOTAL   13 TiB  7.7 TiB  7.6 TiB  150 MiB   47 GiB  5.4 TiB  58.41

MIN/MAX VAR: 0.81/1.13  STDDEV: 5.72

This cluster creates data at a slow rate, maybe around 300GB a year. 
Maybe it's time for a reweight...
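
(If I do, probably with the built-in helper, dry-run first:)

ceph osd test-reweight-by-utilization   # report what would change
ceph osd reweight-by-utilization        # apply, default 120% threshold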



Do you have enough assigned pgs?


Autoscaler is enabled and it believes that the pools have the right amount 
of PGs:


ceph osd pool autoscale-status
POOL                   SIZE    TARGET SIZE  RATE  RAW CAPACITY  RATIO   TARGET RATIO  EFFECTIVE RATIO  BIAS  PG_NUM  NEW PG_NUM  AUTOSCALE
cephVMs01              19                   3.0   13414G        0.                                     1.0   32                  on
cephFS01_metadata      167.2M               3.0   13414G        0.                                     4.0   32                  on
cephFS01_data          0                    3.0   13414G        0.                                     1.0   32                  on
cephDATA01             742.0G               3.0   13414G        0.1659                                 1.0   64                  on
cephMYSQL01            357.7G               3.0   13414G        0.0800                                 1.0   32                  on
device_health_metrics  249.8k               3.0   13414G        0.                                     1.0   1                   on
cephVMs01_comp         1790G                3.0   13414G        0.4004




Istvan Szabo
Staff Infrastructure Engineer
---
Agoda Services Co., Ltd.
e: istvan.sz...@agoda.com
---

On 2023. Jan 27., at 23:30, Victor Rodriguez 
 wrote:





Ah yes, checked that too. Monitors and OSDs report with ceph config
show-with-defaults that bluefs_buffered_io is set to true as the default
setting (it isn't overridden anywhere).


On 1/27/23 17:15, Wesley Dillingham wrote:

I hit this issue once on a nautilus cluster and changed the OSD
parameter bluefs_buffered_io = true (was set at false). I believe the
default of this parameter was switched from false to true in release
14.2.20, however, perhaps you could still check what your osds are
configured with in regard to this config item.

Respectfully,

*Wes Dillingham*
w...@wesdillingham.com
LinkedIn 


On Fri, Jan 27, 2023 at 8:52 AM Victor Rodriguez
 wrote:

   Hello,

   Asking for help with an issue. Maybe someone has a clue about what's
   going on.

   Using ceph 15.2.17 on Proxmox 7.3. A big VM had a snapshot and I
   removed
   it. A bit later, nearly half of the PGs of the pool entered
   snaptrim and
   snaptrim_wait state, as expected. The problem is that such operations
   ran extremely slow and client I/O was nearly nothing, so all VMs
   in the
   cluster got stuck as they could not I/O to the storage. Taking and
   removing big snapshots is a normal operation that we do often and
   this
   is the first time I see this issue in any of my clusters.

[ceph-users] Re: Very slow snaptrim operations blocking client I/O

2023-01-27 Thread Josh Baergen
This might be due to tombstone accumulation in rocksdb. You can try to
issue a compact to all of your OSDs and see if that helps (ceph tell
osd.XXX compact). I usually prefer to do this one host at a time just
in case it causes issues, though on a reasonably fast RBD cluster you
can often get away with compacting everything at once.
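
A minimal sketch of the host-at-a-time variant, assuming the CRUSH host
bucket is named "ceph01" (adjust to your own host names):

for id in $(ceph osd ls-tree ceph01); do
    ceph tell osd.$id compact
done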

Josh

On Fri, Jan 27, 2023 at 6:52 AM Victor Rodriguez
 wrote:
>
> Hello,
>
> Asking for help with an issue. Maybe someone has a clue about what's
> going on.
>
> Using ceph 15.2.17 on Proxmox 7.3. A big VM had a snapshot and I removed
> it. A bit later, nearly half of the PGs of the pool entered snaptrim and
> snaptrim_wait state, as expected. The problem is that such operations
> ran extremely slow and client I/O was nearly nothing, so all VMs in the
> cluster got stuck as they could not I/O to the storage. Taking and
> removing big snapshots is a normal operation that we do often and this
> is the first time I see this issue in any of my clusters.
>
> Disks are all Samsung PM1733 and network is 25G. It gives us plenty of
> performance for the use case and never had an issue with the hardware.
>
> Both disk I/O and network I/O was very low. Still, client I/O seemed to
> get queued forever. Disabling snaptrim (ceph osd set nosnaptrim) stops
> any active snaptrim operation and client I/O resumes back to normal.
> Enabling snaptrim again makes client I/O almost halt again.
>
> I've been playing with some settings:
>
> ceph tell 'osd.*' injectargs '--osd-max-trimming-pgs 1'
> ceph tell 'osd.*' injectargs '--osd-snap-trim-sleep 30'
> ceph tell 'osd.*' injectargs '--osd-snap-trim-sleep-ssd 30'
> ceph tell 'osd.*' injectargs '--osd-pg-max-concurrent-snap-trims 1'
>
> None really seemed to help. Also tried restarting OSD services.
>
> This cluster was upgraded from 14.2.x to 15.2.17 a couple of months ago. Is
> there any setting that must be changed which may cause this problem?
>
> I have scheduled a maintenance window, what should I look for to
> diagnose this problem?
>
> Any help is very appreciated. Thanks in advance.
>
> Victor
>
>
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Very slow snaptrim operations blocking client I/O

2023-01-27 Thread Szabo, Istvan (Agoda)
How is your pg distribution on your osd devices? Do you have enough assigned 
pgs?

Istvan Szabo
Staff Infrastructure Engineer
---
Agoda Services Co., Ltd.
e: istvan.sz...@agoda.com
---

On 2023. Jan 27., at 23:30, Victor Rodriguez  wrote:



Ah yes, checked that too. Monitors and OSDs report with ceph config
show-with-defaults that bluefs_buffered_io is set to true as the default
setting (it isn't overridden anywhere).


On 1/27/23 17:15, Wesley Dillingham wrote:
I hit this issue once on a nautilus cluster and changed the OSD
parameter bluefs_buffered_io = true (was set at false). I believe the
default of this parameter was switched from false to true in release
14.2.20, however, perhaps you could still check what your osds are
configured with in regard to this config item.

Respectfully,

*Wes Dillingham*
w...@wesdillingham.com
LinkedIn 


On Fri, Jan 27, 2023 at 8:52 AM Victor Rodriguez
 wrote:

   Hello,

   Asking for help with an issue. Maybe someone has a clue about what's
   going on.

   Using ceph 15.2.17 on Proxmox 7.3. A big VM had a snapshot and I
   removed
   it. A bit later, nearly half of the PGs of the pool entered
   snaptrim and
   snaptrim_wait state, as expected. The problem is that such operations
   ran extremely slow and client I/O was nearly nothing, so all VMs
   in the
   cluster got stuck as they could not I/O to the storage. Taking and
   removing big snapshots is a normal operation that we do often and
   this
   is the first time I see this issue in any of my clusters.

   Disks are all Samsung PM1733 and network is 25G. It gives us
   plenty of
   performance for the use case and never had an issue with the hardware.

   Both disk I/O and network I/O was very low. Still, client I/O
   seemed to
   get queued forever. Disabling snaptrim (ceph osd set nosnaptrim)
   stops
   any active snaptrim operation and client I/O resumes back to normal.
    Enabling snaptrim again makes client I/O almost halt again.

   I've been playing with some settings:

   ceph tell 'osd.*' injectargs '--osd-max-trimming-pgs 1'
   ceph tell 'osd.*' injectargs '--osd-snap-trim-sleep 30'
   ceph tell 'osd.*' injectargs '--osd-snap-trim-sleep-ssd 30'
   ceph tell 'osd.*' injectargs '--osd-pg-max-concurrent-snap-trims 1'

   None really seemed to help. Also tried restarting OSD services.

    This cluster was upgraded from 14.2.x to 15.2.17 a couple of
    months ago. Is
   there any setting that must be changed which may cause this problem?

   I have scheduled a maintenance window, what should I look for to
   diagnose this problem?

   Any help is very appreciated. Thanks in advance.

   Victor


   ___
   ceph-users mailing list -- ceph-users@ceph.io
   To unsubscribe send an email to ceph-users-le...@ceph.io

--
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Very slow snaptrim operations blocking client I/O

2023-01-27 Thread Victor Rodriguez
Ah yes, checked that too. Monitors and OSDs report with ceph config 
show-with-defaults that bluefs_buffered_io is set to true as the default 
setting (it isn't overridden anywhere).



On 1/27/23 17:15, Wesley Dillingham wrote:
I hit this issue once on a nautilus cluster and changed the OSD 
parameter bluefs_buffered_io = true (was set at false). I believe the 
default of this parameter was switched from false to true in release 
14.2.20, however, perhaps you could still check what your osds are 
configured with in regard to this config item.


Respectfully,

*Wes Dillingham*
w...@wesdillingham.com
LinkedIn 


On Fri, Jan 27, 2023 at 8:52 AM Victor Rodriguez 
 wrote:


Hello,

Asking for help with an issue. Maybe someone has a clue about what's
going on.

Using ceph 15.2.17 on Proxmox 7.3. A big VM had a snapshot and I
removed
it. A bit later, nearly half of the PGs of the pool entered
snaptrim and
snaptrim_wait state, as expected. The problem is that such operations
ran extremely slow and client I/O was nearly nothing, so all VMs
in the
cluster got stuck as they could not I/O to the storage. Taking and
removing big snapshots is a normal operation that we do often and
this
is the first time I see this issue in any of my clusters.

Disks are all Samsung PM1733 and network is 25G. It gives us
plenty of
performance for the use case and never had an issue with the hardware.

Both disk I/O and network I/O was very low. Still, client I/O
seemed to
get queued forever. Disabling snaptrim (ceph osd set nosnaptrim)
stops
any active snaptrim operation and client I/O resumes back to normal.
Enabling snaptrim again makes client I/O almost halt again.

I've been playing with some settings:

ceph tell 'osd.*' injectargs '--osd-max-trimming-pgs 1'
ceph tell 'osd.*' injectargs '--osd-snap-trim-sleep 30'
ceph tell 'osd.*' injectargs '--osd-snap-trim-sleep-ssd 30'
ceph tell 'osd.*' injectargs '--osd-pg-max-concurrent-snap-trims 1'

None really seemed to help. Also tried restarting OSD services.

This cluster was upgraded from 14.2.x to 15.2.17 a couple of
months ago. Is
there any setting that must be changed which may cause this problem?

I have scheduled a maintenance window, what should I look for to
diagnose this problem?

Any help is very appreciated. Thanks in advance.

Victor


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


--
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Very slow snaptrim operations blocking client I/O

2023-01-27 Thread Wesley Dillingham
I hit this issue once on a nautilus cluster and changed the OSD
parameter bluefs_buffered_io
= true (was set at false). I believe the default of this parameter was
switched from false to true in release 14.2.20, however, perhaps you could
still check what your osds are configured with in regard to this config
item.
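
A quick way to check (and, if needed, persist) it; osd.0 is just an example,
and depending on the release the OSDs may need a restart to pick it up:

ceph config show osd.0 bluefs_buffered_io
ceph config set osd bluefs_buffered_io true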

Respectfully,

*Wes Dillingham*
w...@wesdillingham.com
LinkedIn 


On Fri, Jan 27, 2023 at 8:52 AM Victor Rodriguez 
wrote:

> Hello,
>
> Asking for help with an issue. Maybe someone has a clue about what's
> going on.
>
> Using ceph 15.2.17 on Proxmox 7.3. A big VM had a snapshot and I removed
> it. A bit later, nearly half of the PGs of the pool entered snaptrim and
> snaptrim_wait state, as expected. The problem is that such operations
> ran extremely slow and client I/O was nearly nothing, so all VMs in the
> cluster got stuck as they could not I/O to the storage. Taking and
> removing big snapshots is a normal operation that we do often and this
> is the first time I see this issue in any of my clusters.
>
> Disks are all Samsung PM1733 and network is 25G. It gives us plenty of
> performance for the use case and never had an issue with the hardware.
>
> Both disk I/O and network I/O was very low. Still, client I/O seemed to
> get queued forever. Disabling snaptrim (ceph osd set nosnaptrim) stops
> any active snaptrim operation and client I/O resumes back to normal.
> Enabling snaptrim again makes client I/O almost halt again.
>
> I've been playing with some settings:
>
> ceph tell 'osd.*' injectargs '--osd-max-trimming-pgs 1'
> ceph tell 'osd.*' injectargs '--osd-snap-trim-sleep 30'
> ceph tell 'osd.*' injectargs '--osd-snap-trim-sleep-ssd 30'
> ceph tell 'osd.*' injectargs '--osd-pg-max-concurrent-snap-trims 1'
>
> None really seemed to help. Also tried restarting OSD services.
>
> This cluster was upgraded from 14.2.x to 15.2.17 a couple of months ago. Is
> there any setting that must be changed which may cause this problem?
>
> I have scheduled a maintenance window, what should I look for to
> diagnose this problem?
>
> Any help is very appreciated. Thanks in advance.
>
> Victor
>
>
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io