On 11/9/22 6:03 AM, Eshcar Hillel wrote:
Hi Mark,

Thanks for posting these blogs. They are very interesting to read.
Maybe you have an answer to a question I asked in the dev list:

We ran the fio benchmark against a 3-node Ceph cluster with 96 OSDs; objects are 4 KB. We used the gdbpmp profiler <https://github.com/markhpc/gdbpmp> to analyze thread performance. We discovered that the bstore_kv_sync thread is always busy, while all 16 tp_osd_tp threads are idle most of the time (waiting on a condition variable or a lock). Given that the 3 RocksDB column families are sharded, and sharding is configurable, why not run multiple (3) bstore_kv_sync threads? They would rarely conflict with each other. This has the potential to remove the RocksDB bottleneck and increase IOPS.

Can you explain this design choice?


You are absolutely correct that the bstore_kv_sync thread can often be a bottleneck during 4K random writes.  Typically it's not so bad that the tp_osd_tp threads are mostly blocked, though (feel free to send me a copy of the trace; I'd be interested in seeing it).  Years ago I advocated for the same approach you are suggesting here.  The fear at the time was that the changes inside bluestore would be too disruptive, whereas the column family sharding approach could be (and was) mostly contained to the KeyValueDB glue code.  Column family sharding has been a win in that it helps us avoid really deep LSM hierarchies in RocksDB: we tend to see faster compaction times and are more likely to keep full levels on the fast device.  Sadly, it doesn't really help with improving metadata throughput and may even introduce a small amount of overhead during the WAL flush process.  FWIW, slow bstore_kv_sync is one of the reasons that people will sometimes run multiple OSDs on one NVMe drive (sometimes it's faster, sometimes it's not).


Maybe a year ago I tried to map out the changes that I thought would be necessary to shard across KeyValueDBs inside bluestore itself.  It didn't look impossible, but it would require quite a bit of work (and a bit of finesse to restructure the data path).  There's a legitimate question of whether it's worth making those kinds of changes to bluestore at this point or investing in crimson and seastore instead.  We ended up deciding not to pursue the changes back then.  I think if we changed our minds, it would most likely go into some kind of experimental bluestore v2 project (along with other things like hierarchical storage) so we don't screw up the existing code base.



------------------------------------------------------------------------
*From:* Mark Nelson <mnel...@redhat.com>
*Sent:* Tuesday, November 8, 2022 10:20 PM
*To:* ceph-users@ceph.io <ceph-users@ceph.io>
*Subject:* [ceph-users] Recent ceph.io Performance Blog Posts

Hi Folks,

I thought I would mention that I've released a couple of performance
articles on the Ceph blog recently that might be of interest to people:

 1. https://ceph.io/en/news/blog/2022/rocksdb-tuning-deep-dive/
 2. https://ceph.io/en/news/blog/2022/qemu-kvm-tuning/
 3. https://ceph.io/en/news/blog/2022/ceph-osd-cpu-scaling/

The first covers RocksDB tuning: how we arrived at our defaults, an
analysis of some common settings that have been floating around on the
mailing list, and potential new settings that we are considering making
default in the future.

The second covers how to tune QEMU/KVM with librbd to achieve high
single-client performance on a small (30 OSD) NVMe-backed cluster. This
article also covers the performance impact of enabling 128-bit AES
over-the-wire encryption.

The third covers per-OSD CPU/Core scaling and the kind of IOPS/core and
IOPS/NVMe numbers that are achievable both on a single OSD and on a
larger (60 OSD) NVMe cluster. In this case there are enough clients and
a high enough per-client iodepth to saturate the OSD(s).

I hope these are helpful or at least interesting!

Thanks,
Mark

_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io