For the multi- vs. single-OSD per flash drive decision, the following test
might be useful:

We found dramatic improvements using multiple OSDs per flash drive with Octopus
*if* the bottleneck is the kv_sync_thread. Apparently, each OSD has only one,
and when saturated this thread effectively serializes otherwise asynchronous IO.
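
A quick way to check whether that thread is the limiting factor is to watch the
bluestore "kv" latency counters on the OSD's admin socket while the cluster is
under load. A minimal sketch of my own (not a Ceph tool), assuming the ceph CLI
is available on the OSD host and osd.0 runs locally; counter names differ
between releases, so it simply filters on "kv":

    # Dump the bluestore perf counters of a local OSD and print the kv-related
    # ones. Hypothetical example: adjust OSD_ID to a daemon on this host.
    import json
    import subprocess

    OSD_ID = "osd.0"

    raw = subprocess.check_output(["ceph", "daemon", OSD_ID, "perf", "dump"])
    perf = json.loads(raw)

    for name, value in perf.get("bluestore", {}).items():
        if "kv" in name:
            print(name, value)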

There was a dev discussion about having more kv_sync_threads per OSD daemon by
splitting up the RocksDB instance across PGs, but I don't know if this ever
materialized.

My guess is that for good NVMe drives a single kv_sync_thread may be able to
saturate the device, in which case there is no advantage to having more OSDs
per device. On less capable drives (SATA/SAS flash), multi-OSD deployments are
usually better, because the on-disk controller requires concurrency to saturate
the drive. It's not possible to saturate typical SAS/SATA SSDs with iodepth=1.

With good NVMe drives I have seen fio tests with direct IO saturate the drive
with 4K random IO at iodepth=1. You need enough PCIe lanes per drive for that,
and I could imagine that in this case 1 OSD per drive is sufficient. For such
drives, storage access quickly becomes CPU bound, so some benchmarking that
takes all system properties into account is required. If you are already CPU
bound (too many NVMe drives per core; many standard servers with 24+ NVMe
drives have that property), there is no point in adding extra CPU load with
more OSD daemons.
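
If you want to check this yourself, here is a minimal sketch of my own (with a
hypothetical device path): it runs fio with direct 4K random reads at
iodepth=1 and iodepth=16 and compares IOPS. If iodepth=1 already gets close to
the iodepth=16 number, a single OSD per drive is probably enough; if not, the
drive needs concurrency to be saturated. Assumes fio with libaio on Linux;
repeat with randwrite on a scratch device if you care about the write path.

    # Compare fio 4K random-read IOPS at queue depth 1 vs. 16 on one device.
    # DEV is a hypothetical placeholder; reads only, but adjust with care.
    import json
    import subprocess

    DEV = "/dev/nvme0n1"

    def fio_iops(iodepth):
        out = subprocess.check_output([
            "fio", "--name=probe", "--filename=" + DEV, "--direct=1",
            "--rw=randread", "--bs=4k", "--iodepth=%d" % iodepth,
            "--ioengine=libaio", "--runtime=30", "--time_based",
            "--output-format=json",
        ])
        return json.loads(out)["jobs"][0]["read"]["iops"]

    qd1, qd16 = fio_iops(1), fio_iops(16)
    print("iodepth=1: %.0f IOPS, iodepth=16: %.0f IOPS, ratio %.1fx"
          % (qd1, qd16, qd16 / qd1))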

Don't just look at single disks, look at the whole system.

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Bailey Allison <balli...@45drives.com>
Sent: Thursday, January 18, 2024 12:36 AM
To: ceph-users@ceph.io
Subject: [ceph-users] Re: Performance impact of Heterogeneous environment

+1 to this, great article and great research. Something we've been keeping a 
very close eye on ourselves.

Overall we've mostly settled on the old "keep it simple, stupid" methodology
with good results, especially as the benefits have become smaller the more
recent your Ceph version. We have been running a single OSD per NVMe, but as
always everything is workload dependent and there is sometimes a need for
doubling up 😊

Regards,

Bailey


> -----Original Message-----
> From: Maged Mokhtar <mmokh...@petasan.org>
> Sent: January 17, 2024 4:59 PM
> To: Mark Nelson <mark.nel...@clyso.com>; ceph-users@ceph.io
> Subject: [ceph-users] Re: Performance impact of Heterogeneous
> environment
>
> Very informative article you did Mark.
>
> IMHO if you find yourself with a very high per-OSD core count, it may be logical
> to just pack/add more NVMes per host; you'd be getting the best price per
> performance and capacity.
>
> /Maged
>
>
> On 17/01/2024 22:00, Mark Nelson wrote:
> > It's a little tricky.  In the upstream lab we don't strictly see an
> > IOPS or average latency advantage with heavy parallelism by running
> > multiple OSDs per NVMe drive until per-OSD core counts get very high.
> > There does seem to be a fairly consistent tail latency advantage even
> > at moderately low core counts however.  Results are here:
> >
> > https://ceph.io/en/news/blog/2023/reef-osds-per-nvme/
> >
> > Specifically for jitter, there is probably an advantage to using 2
> > cores per OSD unless you are very CPU starved, but how much that
> > actually helps in practice for a typical production workload is
> > questionable imho.  You do pay some overhead for running 2 OSDs per
> > NVMe as well.
> >
> >
> > Mark
> >
> >
> > On 1/17/24 12:24, Anthony D'Atri wrote:
> >> Conventional wisdom is that with recent Ceph releases there is no
> >> longer a clear advantage to this.
> >>
> >>> On Jan 17, 2024, at 11:56, Peter Sabaini <pe...@sabaini.at> wrote:
> >>>
> >>> One thing that I've heard people do but haven't done personally with
> >>> fast NVMes (not familiar with the IronWolf so not sure if they
> >>> qualify) is partition them up so that they run more than one OSD
> >>> (say 2 to 4) on a single NVMe to better utilize the NVMe bandwidth.
> >>> See
> >> https://ceph.com/community/bluestore-default-vs-tuned-performance-comparison/