Hi,

On 11/10/21 16:14, Christoph Adomeit wrote:
> But the cluster seemed to slowly "eat" storage space. So yesterday I decided to
> add 3 more NVMes, 1 for each node. The second I added the first NVMe as a Ceph OSD,
> the cluster started crashing. I had high load on all OSDs and the OSDs were dying again and
> again until I set nodown, noout, noscrub, nodeep-scrub and removed the new OSD. Then the
> cluster recovered but had slow IO and lots of snaptrim and snaptrim_wait processes.

You may have hit this issue: https://tracker.ceph.com/issues/52026. AFAIU there could be some untrimmed snapshots (visible in the snaptrimq_len column of `ceph pg dump pgs`) which are only trimmed once the PG is repeered. We experienced that during testing, but the root cause is not fully understood (at least not by me).
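
In case it helps, something along these lines lets you check the queue and repeer a single affected PG (just a sketch; <pgid> is a placeholder, and IIRC `ceph pg repeer` is only available on fairly recent releases):

  ceph pg dump pgs | less -S    # look at the SNAPTRIMQ_LEN column
  ceph pg repeer <pgid>         # force one affected PG to repeer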

Maybe adding your new OSDs made the snaptrim state appear on various PGs, which apparently affected your cluster.

> I made this smoother by setting --osd_snap_trim_sleep=3.0

> Overnight the snaptrim_wait PGs became 0 and I had 15% more free space in the
> Ceph cluster. But during the day the snaptrim_waits kept increasing.

> I then set osd_snap_trim_sleep to 0.0 again and most VMs had extremely high
> iowait or crashed.

> Now I did a ceph osd set nosnaptrim and the cluster is flying again. Iowait is 0
> on all VMs, but the count of snaptrim_wait is slowly increasing.

> How can I get the snaptrims running fast without affecting Ceph IO performance?
> My theory is that until yesterday the snaptrims were, for some reason, not running,
> and therefore the cluster was "eating" storage space. After the crash yesterday
> and the restart, the snaptrims started.

On our test cluster we actually decreased `osd_snap_trim_sleep` to 0.1s instead of the default 2s for hybrid OSDs, because the snaptrim backlog we had would otherwise have lasted a few weeks, IIRC. We didn't notice any slowdowns, HDDs crashing or anything like that (but this cluster doesn't run any real production workloads, so we may have overlooked something).

In your case the default should come from `osd_snap_trim_sleep_ssd`, which is 0, so maybe on SSD/NVMe OSDs the snaptrim does affect performance (with the default settings at least)... Therefore, you may want to set `osd_snap_trim_sleep` to something other than 0. The 0.1s sleep worked smoothly in our tests, but such a low value was only needed because I was stress-testing snapshots and there were many, many objects waiting for snaptrim. You could probably increase it for safety; any value between 0.1s and 3s (which you have already tested!) is probably fine.
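
For example, something like this (just a sketch; the 0.5s value is arbitrary, pick anything in that 0.1s-3s range):

  # persist the setting for all OSDs
  ceph config set osd osd_snap_trim_sleep 0.5

  # or inject it directly into the running daemons
  ceph tell 'osd.*' injectargs '--osd_snap_trim_sleep 0.5'

  # and once you're happy with the impact, let trimming resume
  ceph osd unset nosnaptrim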

Cheers,

--
Arthur Outhenin-Chalandre
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
