[ceph-users] Re: Rocksdb compaction and OSD timeout

J-P Methot Thu, 07 Sep 2023 10:29:44 -0700

Hi,

By this point, we're 95% sure that, contrary to our previous beliefs, it's an issue with changes to the bluestore_allocator and not the compaction process. That said, I will keep this email in mind as we will want to test optimizations to compaction on our test environment.


On 9/7/23 12:32, Mark Nelson wrote:

Hello,
There are two things that might help you here. One is to try the new "rocksdb_cf_compaction_on_deletion" feature that I added in Reef and we backported to Pacific in 16.2.13. So far this appears to be a huge win for avoiding tombstone accumulation during iteration which is often the issue with threadpool timeouts due to rocksdb. Manual compaction can help, but if you are hitting a case where there's concurrent iteration and deletions with no writes, tombstones will accumulate quickly with no compactions taking place and you'll eventually end up back in the same place. The default sliding window and trigger settings are fairly conservative to avoid excessive compaction, so it may require some tuning to hit the right sweet spot on your cluster. I know of at least one site that's using this feature with more aggressive settings than default and had an extremely positive impact on their cluster.
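To make the sliding window and trigger idea concrete, here is a minimal sketch of the rule as described above: compaction is requested once some run of `window` consecutive entries contains at least `trigger` deletions. The function name, the boolean entry representation, and the tiny numbers in the usage note are illustrative assumptions; this is not the RocksDB or Ceph implementation, and the real default values are not reproduced here.

```python
from collections import deque


def first_trigger(entries, window, trigger):
    """Return the index of the entry at which a sliding window of `window`
    consecutive entries first holds at least `trigger` deletions, or None.

    `entries` is an iterable of booleans: True means the entry is a
    deletion (tombstone), False means it is a regular key/value entry.
    """
    recent = deque()   # the last `window` entries seen
    deletions = 0      # number of deletions currently inside the window
    for index, is_delete in enumerate(entries):
        recent.append(bool(is_delete))
        deletions += bool(is_delete)
        if len(recent) > window:
            # Slide the window forward by dropping the oldest entry.
            deletions -= recent.popleft()
        if deletions >= trigger:
            return index
    return None
```

With a window of 4 and a trigger of 3, a burst of three deletions fires immediately, while deletions interleaved one-for-one with writes never do. Lowering the trigger to 2 makes the interleaved case fire as well, which is the kind of "more aggressive" tuning mentioned above.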
The other thing that can help improve compaction performance in general is enabling lz4 compression in RocksDB. I plan to make this the default behavior in Squid assuming we don't run into any issues in testing. There are several sites that are using this now in production and the benefits have been dramatic relative to the costs. We're seeing significantly faster compactions and about 2.2x lower space requirement for the DB (RGW workload). There may be a slight CPU cost and read/index listing performance impact, but even with testing on NVMe clusters this was quite low (maybe a couple of percent).
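As a rough illustration of what enabling lz4 amounts to, the sketch below rewrites a comma-separated RocksDB option string so that its `compression` key is set to `kLZ4Compression`. The helper name and the sample option string are hypothetical. It assumes the result would then be applied through BlueStore's RocksDB options setting; check the documentation for your release before changing anything on a production cluster.

```python
def set_rocksdb_option(options, key, value):
    """Return `options` (a comma-separated list of key=value pairs) with
    `key` set to `value`, replacing an existing entry or appending one."""
    updated = []
    replaced = False
    for pair in options.split(","):
        pair = pair.strip()
        if not pair:
            continue  # skip empty fragments such as a trailing comma
        name, _, _old_value = pair.partition("=")
        if name.strip() == key:
            updated.append(f"{key}={value}")
            replaced = True
        else:
            updated.append(pair)
    if not replaced:
        updated.append(f"{key}={value}")
    return ",".join(updated)


# Illustrative option string only, not a recommended configuration.
sample = "compression=kNoCompression,max_write_buffer_number=4"
lz4_options = set_rocksdb_option(sample, "compression", "kLZ4Compression")
```

Replacing the key in place rather than appending a second `compression=` entry avoids relying on how duplicate keys are resolved.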
Mark


On 9/7/23 10:21, J-P Methot wrote:
Hi,
Since my post, we've been speaking with a member of the Ceph dev team. He did, at first, believe it was an issue linked to the common performance degradation after huge delete operations. So we did do offline compactions on all our OSDs. It fixed nothing and we are going through the logs to try and figure this out.
To answer your question, no, the OSD doesn't restart after it logs the timeout. It manages to get back online by itself, at the cost of sluggish performance for the cluster and high iowait on VMs.
We mostly run RBD workloads.
Running deep scrubs or not doesn't appear to change anything. Deactivating scrubs altogether did not impact performance in any way.
Furthermore, I'll stress that this is only happening since we upgraded to the latest Pacific, yesterday.
On 9/7/23 10:49, Stefan Kooman wrote:
On 07-09-2023 09:05, J-P Methot wrote:
Hi,
We're running latest Pacific on our production cluster and we've been seeing the dreaded 'OSD::osd_op_tp thread 0x7f346aa64700' had timed out after 15.000000954s' error. We have reasons to believe this happens each time the RocksDB compaction process is launched on an OSD. My question is, does the cluster detecting that an OSD has timed out interrupt the compaction process? This seems to be what's happening, but it's not immediately obvious. We are currently facing an infinite loop of random OSDs timing out and if the compaction process is interrupted without finishing, it may explain that.
Does the OSD also restart after it logged the timeouts?
You might want to perform an offline compaction every $timeperiod to fix any potential RocksDB degradation. That's what we do. What kind of workload do you run (i.e. RBD, CephFS, RGW)?
Do you also see these timeouts occur during deep-scrubs?
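When correlating these timeouts with compactions or scrubs, it can help to pull the thread pool name, thread id and timeout out of each log line. The sketch below does that for messages shaped like the one quoted above. The regular expression is inferred from that single example, so treat the pattern and the function name as assumptions rather than a complete description of the OSD log format.

```python
import re

# Pattern inferred from the single message quoted in this thread.
TIMEOUT_RE = re.compile(
    r"'(?P<pool>[\w:]+) thread (?P<thread>0x[0-9a-f]+)' "
    r"had timed out after (?P<seconds>\d+(?:\.\d+)?)s"
)


def parse_timeout(line):
    """Return (pool, thread_id, seconds) for a thread timeout message,
    or None if the line does not contain one."""
    match = TIMEOUT_RE.search(line)
    if match is None:
        return None
    return match["pool"], match["thread"], float(match["seconds"])
```

Counting matches per OSD log over time, next to the timestamps of compaction or scrub events, is one way to check whether the timeouts really line up with the suspected cause.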

Gr. Stefan

--
Jean-Philippe Méthot
Senior Openstack system administrator
Administrateur système Openstack sénior
PlanetHoster inc.
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
