[ceph-users] Re: Performance impact of Heterogeneous environment
On 17.01.24 11:13, Tino Todino wrote:
> Hi folks.
>
> I had a quick search but found nothing concrete on this so thought I would ask.
>
> We currently have a 4 host CEPH cluster with an NVMe pool (1 OSD per host) and an HDD Pool (1 OSD per host). Both OSD's use a separate NVMe for DB/WAL. These machines are identical (Homogenous) and are Ryzen 7 5800X machines with 64GB DDR3200 RAM. The NVMe's are 1TB Seagate Ironwolfs and the HDD's are 16TB Seagate IronWolfs.
>
> We are wanting to add more nodes mainly for capacity and resilience reasons. We have an old 3 node cluster of Dell R740 servers that could be added to this CEPH cluster. Instead of DDR4, they use DDR3 (although 1.5TB each!!). and instead of Ryzen 7 5800X CPUs they use old Intel Xeon CPU E5-4657L v2 (96 cores at 2.4Ghz).
>
> What would be the performance impact of adding these three nodes with the same OSD layout (i.e 1NVMe OSD and 1 HDD OSD per host with 1x NVMe DB/WAL NVMe)
> Would we get overall better performance or worse? Can weighting be used to mitigate performance penalties and if so is this easy to configure?

What will happen is that Ceph will distribute PGs across your cluster uniformly by default, so some requests will be answered by PGs on the Ryzen nodes and others by PGs on the Xeons. Presumably the Xeons will be slower because of their lower clock speed. The net effect will be jitter in completion latencies; whether that is acceptable depends on your workloads.

Note that latencies are sensitive to clock speed. The huge amount of RAM on the Xeons won't help much for OSDs, and as your cluster is small the MONs/MGRs won't need that much either. Check out https://docs.ceph.com/en/reef/start/hardware-recommendations/ and see the section on memory for tuning recommendations.

You can influence distribution by weighting, but this will only influence the rate of jitter, not its magnitude (I mean, if you weight the Xeons down to 0 the jitter will cease, but that's not very useful :-)).
One thing that I've heard people do but haven't done personally with fast NVMes (not familiar with the IronWolf so not sure if they qualify) is partition them up so that they run more than one OSD (say 2 to 4) on a single NVMe to better utilize the NVMe bandwidth. See https://ceph.com/community/bluestore-default-vs-tuned-performance-comparison/

> On performance, I would deem it Ok for our use case currently (VM disks), as we are running on 10Gbe network (with dedicated NICs for public and cluster network).
>
> Many thanks in advance
>
> Tino

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
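A rough way to see why weighting changes how often slow requests occur but not how slow they are: the Python sketch below assumes the 4 Ryzen and 3 Xeon hosts from the question, made-up per-request latencies of 1 ms and 2 ms, and a hypothetical helper `latency_mix`. It ignores replication and treats the relative weight as the only thing steering requests, so read it as an illustration of the argument rather than a model of CRUSH.

```python
def latency_mix(weight_slow, n_fast=4, n_slow=3,
                lat_fast_ms=1.0, lat_slow_ms=2.0):
    """Return (share of requests on slow hosts, mean latency, worst latency).

    All numbers are illustrative assumptions, not measurements.
    Fast hosts have weight 1.0; slow hosts have weight `weight_slow`.
    """
    total_weight = n_fast * 1.0 + n_slow * weight_slow
    share_slow = (n_slow * weight_slow) / total_weight
    mean_ms = (1 - share_slow) * lat_fast_ms + share_slow * lat_slow_ms
    # The slowest request is as slow as ever while any data sits on slow hosts.
    worst_ms = lat_slow_ms if share_slow > 0 else lat_fast_ms
    return share_slow, mean_ms, worst_ms


if __name__ == "__main__":
    for w in (1.0, 0.5, 0.1, 0.0):
        share, mean_ms, worst_ms = latency_mix(w)
        print(f"weight={w:.1f}  slow share={share:.2f}  "
              f"mean={mean_ms:.2f} ms  worst={worst_ms:.1f} ms")
```

Lowering the weight shrinks the share of slow requests and the mean, but the worst-case latency only changes once the weight reaches 0, at which point the slow hosts hold no data at all.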
[ceph-users] Re: Performance impact of Heterogeneous environment
Conventional wisdom is that with recent Ceph releases there is no longer a clear advantage to this.

> On Jan 17, 2024, at 11:56, Peter Sabaini wrote:
>
> One thing that I've heard people do but haven't done personally with fast NVMes (not familiar with the IronWolf so not sure if they qualify) is partition them up so that they run more than one OSD (say 2 to 4) on a single NVMe to better utilize the NVMe bandwidth. See https://ceph.com/community/bluestore-default-vs-tuned-performance-comparison/
[ceph-users] Re: Performance impact of Heterogeneous environment
It's a little tricky. In the upstream lab we don't strictly see an IOPS or average latency advantage with heavy parallelism by running multiple OSDs per NVMe drive until per-OSD core counts get very high. There does seem to be a fairly consistent tail latency advantage even at moderately low core counts, however. Results are here:

https://ceph.io/en/news/blog/2023/reef-osds-per-nvme/

Specifically for jitter, there is probably an advantage to using 2 cores per OSD unless you are very CPU starved, but how much that actually helps in practice for a typical production workload is questionable imho. You do pay some overhead for running 2 OSDs per NVMe as well.

Mark

On 1/17/24 12:24, Anthony D'Atri wrote:
> Conventional wisdom is that with recent Ceph releases there is no longer a clear advantage to this.

--
Best Regards,
Mark Nelson
Head of Research and Development
Clyso GmbH
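Since the trade-off above is phrased in terms of per-OSD core counts, it can help to work out that number for a given host before deciding. The sketch below is only arithmetic; the helper name `threads_per_osd`, the example host (32 CPU threads, 4 NVMe drives) and the 2 threads reserved for everything else are all assumptions for illustration.

```python
def threads_per_osd(cpu_threads, nvme_drives, osds_per_nvme,
                    reserved_threads=2):
    """CPU threads available to each OSD on one host.

    `reserved_threads` is an assumed allowance for the OS and other daemons.
    """
    usable = max(cpu_threads - reserved_threads, 0)
    return usable / (nvme_drives * osds_per_nvme)


if __name__ == "__main__":
    # Hypothetical host: 32 CPU threads, 4 NVMe drives.
    for osds in (1, 2, 4):
        share = threads_per_osd(32, 4, osds)
        print(f"{osds} OSD(s) per NVMe -> {share:.2f} threads per OSD")
```

Splitting a drive into more OSDs halves the CPU share of each one, so whether it helps depends on how that share compares with the core counts in the linked results and on the extra overhead of the additional daemons.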
[ceph-users] Re: Performance impact of Heterogeneous environment
Very informative article you did Mark.

IMHO if you find yourself with a very high per-OSD core count, it may be logical to just pack/add more NVMes per host; you'd be getting the best price per performance and capacity.

/Maged

On 17/01/2024 22:00, Mark Nelson wrote:
> It's a little tricky. In the upstream lab we don't strictly see an IOPS or average latency advantage with heavy parallelism by running multiple OSDs per NVMe drive until per-OSD core counts get very high. There does seem to be a fairly consistent tail latency advantage even at moderately low core counts, however. Results are here:
>
> https://ceph.io/en/news/blog/2023/reef-osds-per-nvme/
>
> Specifically for jitter, there is probably an advantage to using 2 cores per OSD unless you are very CPU starved, but how much that actually helps in practice for a typical production workload is questionable imho. You do pay some overhead for running 2 OSDs per NVMe as well.
>
> Mark
[ceph-users] Re: Performance impact of Heterogeneous environment
+1 to this, great article and great research. Something we've been keeping a very close eye on ourselves.

Overall we've mostly settled on the old keep-it-simple-stupid methodology with good results, especially as the benefits have shrunk with more recent Ceph versions, and have been rocking a single OSD per NVMe. But as always everything is workload dependent, and there is sometimes a need for doubling up 😊

Regards,

Bailey

> -----Original Message-----
> From: Maged Mokhtar
> Sent: January 17, 2024 4:59 PM
> To: Mark Nelson ; ceph-users@ceph.io
> Subject: [ceph-users] Re: Performance impact of Heterogeneous environment
>
> Very informative article you did Mark.
>
> IMHO if you find yourself with very high per-OSD core count, it may be logical to just pack/add more nvmes per host, you'd be getting the best price per performance and capacity.
>
> /Maged
[ceph-users] Re: Performance impact of Heterogeneous environment
Thanks kindly Maged/Bailey! As always it's a bit of a moving target. New hardware comes out that reveals bottlenecks in our code. Doubling up the OSDs sometimes improves things. We figure out how to make the OSDs faster and the old assumptions stop being correct. Even newer hardware comes out, etc etc.

Mark

On 1/17/24 17:36, Bailey Allison wrote:
> +1 to this, great article and great research. Something we've been keeping a very close eye on ourselves.
>
> Overall we've mostly settled on the old keep it simple stupid methodology with good results. Especially as the benefits have gotten less beneficial the more recent your ceph version, and have been rocking with single OSD/NVMe, but as always everything is workload dependant and there is sometimes a need for doubling up 😊
>
> Regards,
>
> Bailey
[ceph-users] Re: Performance impact of Heterogeneous environment
For the multi- vs. single-OSD per flash drive decision, the following might be useful:

We found dramatic improvements using multiple OSDs per flash drive with Octopus *if* the bottleneck is the kv_sync_thread. Apparently, each OSD has only one, and when it is saturated this thread effectively serializes otherwise async IO. There was a dev discussion about having more kv_sync_threads per OSD daemon by splitting up the RocksDBs for PGs, but I don't know if this ever materialized.

My guess is that for good NVMe drives it is possible that a single kv_sync_thread can saturate the device, and there will be no advantage in having more OSDs per device. On not so good drives (SATA/SAS flash) multi-OSD deployments usually are better, because the on-disk controller requires concurrency to saturate the drive. It's not possible to saturate the usual SAS/SATA SSDs with iodepth=1.

With good NVMe drives I have seen fio tests with direct IO saturate the drive with 4K random IO and iodepth=1. You need enough PCIe lanes per drive for that, and I could imagine that here 1 OSD per drive is sufficient. For such drives, storage access quickly becomes CPU bound, so some benchmarking that takes all system properties into account is required. If you are already CPU bound (too many NVMe drives per core; many standard servers with 24+ NVMe drives have that property) there is no point adding extra CPU load with more OSD daemons.

Don't just look at single disks, look at the whole system.

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

From: Bailey Allison
Sent: Thursday, January 18, 2024 12:36 AM
To: ceph-users@ceph.io
Subject: [ceph-users] Re: Performance impact of Heterogeneous environment

+1 to this, great article and great research. Something we've been keeping a very close eye on ourselves.

Overall we've mostly settled on the old keep it simple stupid methodology with good results. Especially as the benefits have gotten less beneficial the more recent your ceph version, and have been rocking with single OSD/NVMe, but as always everything is workload dependant and there is sometimes a need for doubling up 😊

Regards,

Bailey
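The iodepth=1 argument above can be put into numbers with Little's law (throughput = requests in flight / latency per request). The Python sketch below uses invented figures for a SATA SSD (100 µs per 4K IO, 90,000 IOPS rated) and two hypothetical helpers, `iops_at_depth` and `iodepth_to_saturate`. It also assumes latency stays constant as queue depth grows, which real drives only approximate.

```python
import math


def iops_at_depth(iodepth, latency_us):
    """IOPS reachable with `iodepth` requests in flight (Little's law)."""
    return iodepth * 1_000_000 / latency_us


def iodepth_to_saturate(max_iops, latency_us):
    """Smallest queue depth that could reach `max_iops` at this latency."""
    return math.ceil(max_iops * latency_us / 1_000_000)


if __name__ == "__main__":
    # Assumed example drive: 100 us per 4K IO, rated for 90,000 IOPS.
    latency_us, max_iops = 100, 90_000
    print(f"iodepth=1 reaches about {iops_at_depth(1, latency_us):.0f} IOPS")
    print(f"queue depth needed to saturate: "
          f"{iodepth_to_saturate(max_iops, latency_us)}")
```

With these assumed numbers a single in-flight request leaves most of the drive idle, which is the situation where several OSDs per device (and therefore several kv_sync_threads issuing IO) can help. Actual values should come from a direct-IO fio run against the drive in question.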
[ceph-users] Re: Performance impact of Heterogeneous environment
On 1/18/24 03:40, Frank Schilder wrote:
> For multi- vs. single-OSD per flash drive decision the following test might be useful:
>
> We found dramatic improvements using multiple OSDs per flash drive with octopus *if* the bottleneck is the kv_sync_thread. Apparently, each OSD has only one and this thread is effectively sequentializing otherwise async IO if saturated. There was a dev discussion about having more kv_sync_threads per OSD daemon by splitting up rocks-dbs for PGs, but I don't know if this ever materialized.

I advocated for it at one point, but Sage was pretty concerned about how much we'd be disturbing Bluestore's write path. Instead, Adam ended up implementing the column family sharding inside RocksDB, which got us some (but not all) of the benefit.

A lot of the work that has gone into refactoring the RocksDB settings in Reef has been to help mitigate some of the overhead in the kv sync thread. The gist of it is that we are trying to balance two things: keeping the memtables large enough that short-lived items like pg log entries don't regularly leak into the DB, while simultaneously keeping the memtables as small as possible to reduce the number of comparisons RocksDB needs to do to keep them in sorted order during inserts (which is a significant overhead in the kv sync thread during heavy small random writes). This is also part of the reason that Igor was experimenting with implementing a native Bluestore WAL rather than relying on the one in RocksDB.

> My guess is that for good NVMe drives it is possible that a single kv_sync_thread can saturate the device and there will be no advantage of having more OSDs/device. On not so good drives (SATA/SAS flash) multi-OSD deployments usually are better, because the on-disk controller requires concurrency to saturate the drive. Its not possible to saturate usual SAS-/SATA- SSDs with iodepth=1.

Oddly enough, we do see some effect on small random reads. The way that the async msgr / shards / threads interact doesn't scale well past about 14-16 CPU threads (and increasing the shard/thread counts has complicated effects; it may not always help). If you look at that 1 vs 2 OSDs per NVMe article on the ceph.io page, you'll see that once you hit 14 CPU threads the 2 OSD/NVMe configuration keeps scaling but the single OSD configuration tops out.

> With good NVME drives I have seen fio-tests with direct IO saturate the drive with 4K random IO and iodepth=1. You need enough PCI-lanes per drive for that and I could imagine that here 1 OSD/drive is sufficient. For such drives, storage access quickly becomes CPU bound, so some benchmarking taking all system properties into account is required. If you are already CPU bound (too many NVMe drives per core, many standard servers with 24+ NVMe drives have that property) there is no point adding extra CPU load with more OSD daemons.
>
> Don't just look at single disks, look at the whole system.
>
> Best regards,
> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14

From: Bailey Allison
Sent: Thursday, January 18, 2024 12:36 AM
To: ceph-users@ceph.io
Subject: [ceph-users] Re: Performance impact of Heterogeneous environment

+1 to this, great article and great research. Something we've been keeping a very close eye on ourselves.

Overall we've mostly settled on the old keep it simple stupid methodology with good results. Especially as the benefits have gotten less beneficial the more recent your ceph version, and have been rocking with single OSD/NVMe, but as always everything is workload dependant and there is sometimes a need for doubling up 😊

Regards,

Bailey

-----Original Message-----
From: Maged Mokhtar
Sent: January 17, 2024 4:59 PM
To: Mark Nelson ; ceph-users@ceph.io
Subject: [ceph-users] Re: Performance impact of Heterogeneous environment

Very informative article you did Mark.
IMHO if you find yourself with very high per-OSD core count, it may be logical to just pack/add more nvmes per host, you'd be getting the best price per performance and capacity. /Maged On 17/01/2024 22:00, Mark Nelson wrote: It's a little tricky. In the upstream lab we don't strictly see an IOPS or average latency advantage with heavy parallelism by running muliple OSDs per NVMe drive until per-OSD core counts get very high. There does seem to be a fairly consistent tail latency advantage even at moderately low core counts however. Results are here: https://ceph.io/en/news/blog/2023/reef-osds-per-nvme/ Specifically for jitter, there is probably an advantage to using 2 cores per OSD unless you are very CPU starved, but how much that actually helps in practice for a typical production workload is questionable imho. You do pay some overhead for running 2 OSDs per NVMe as well. Mark On 1/17/24 12:24, Anthony D'Atri wrote: Conventional wisdom is that with recen