>Do you have any data on the reliability of QLC NVMe drives? 

They were my job for a year, so yes, I do.  The published specs are accurate.  
A QLC drive built from the same NAND as a TLC drive will have more capacity, 
but less endurance.  Depending on the model, you may wish to enable 
`bluestore_use_optimal_io_size_for_min_alloc_size` when creating your OSDs.  
The Intel / Solidigm P5316, for example, has a 64KB IU size, so performance and 
endurance will benefit from aligning the OSD `min_alloc_size` to that value.  
Note that this is baked in at OSD creation; you cannot change it on a given OSD 
after the fact, but you can redeploy the OSD and let it recover.
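
Something like this, as a minimal sketch (a recent release is assumed, and 
option names can shift between versions, so check your docs before relying on 
it):

    # Set these *before* creating the OSDs; min_alloc_size is baked in at mkfs.
    ceph config set osd bluestore_use_optimal_io_size_for_min_alloc_size true

    # Or pin the value explicitly for a 64KB-IU drive such as the P5316:
    ceph config set osd bluestore_min_alloc_size_ssd 65536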

Other SKUs have 8KB or 16KB IU sizes, and some have 4KB, which requires no 
special `min_alloc_size`.  QLC is a good fit for workloads where writes tend to 
be sequential, reasonably large on average, and infrequent.  I know of 
successful QLC RGW clusters that see 0.01 DWPD.  Yes, that decimal point is in 
the correct place; quick arithmetic below.  Millions of 1KB files overwritten 
once an hour aren't a good workload for QLC.  Backups, archives, even something 
like an OpenStack Glance pool are good fits.  I'm about to trial QLC as 
Prometheus LTS as well.
Read-mostly workloads are good fits, as the read performance is in the ballpark 
of TLC.  Write performance is still going to be way better than any HDD, and 
you aren't stuck with legacy SATA slots.  You also don't have to buy or manage 
a fussy HBA.
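
For scale, here's the back-of-the-envelope math on that 0.01 DWPD figure, 
using a hypothetical 30.72TB drive (pick your own capacity; the point is how 
small the number is):

    # 0.01 drive-writes-per-day on a 30.72 TB (decimal) drive:
    awk 'BEGIN { printf "%.0f GB of host writes per day\n", 30.72 * 1000 * 0.01 }'
    # prints ~307 GB/day, which is a tiny fraction of what such a drive can absorb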

> How old is your deep archive cluster, how many NVMes does it have, and how 
> many did you have to replace?

I don't personally have one at the moment.

Even with TLC, endurance is, dare I say, overrated.  99% of enterprise SSDs 
never burn more than 15% of their rated endurance.  SSDs from at least some 
manufacturers have a timed workload feature in firmware that will estimate 
drive lifetime when presented with a real-world workload -- this is based on 
observed P/E (program/erase) cycles.
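
As an illustration (vendor-specific, so treat the details as approximate): on 
some Intel / Solidigm SATA DC-series SSDs those counters surface as SMART 
attributes 226-228, and I believe the NVMe SKUs expose the equivalent through 
the vendor's own tooling, e.g. the endurance analyzer in Solidigm's storage 
tool:

    # Timed-workload counters on some Intel / Solidigm SATA DC-series drives
    # (226 = media wear during the window, 227 = host-read %, 228 = minutes).
    # /dev/sda is just an example path.
    smartctl -A /dev/sda | grep -E '22[678]'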

Pretty much any SSD will report lifetime used or remaining, so whether it's 
TLC, QLC, or even MLC or SLC, you should collect those metrics in your 
time-series DB and watch both for drives nearing EOL and for their burn rates.
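
For NVMe the relevant counters are right in the standard SMART / Health log; a 
quick sketch (device path is just an example, and any exporter that scrapes 
these fields works just as well):

    # "percentage_used" is lifetime consumed; data_units_written feeds the burn rate.
    nvme smart-log /dev/nvme0n1 | grep -Ei 'percentage_used|data_units_written'

    # smartctl reports the same fields for NVMe devices:
    smartctl -a /dev/nvme0n1 | grep -i 'percentage used'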

> 
> On Sun, Apr 21, 2024 at 11:06 PM Anthony D'Atri <anthony.da...@gmail.com> 
> wrote:
>> 
>> A deep archive cluster benefits from NVMe too.  You can use QLC up to 60TB 
>> in size; 32 of those in one RU makes for a cluster that doesn’t take up the 
>> whole DC.
>> 
>>> On Apr 21, 2024, at 5:42 AM, Darren Soothill <darren.sooth...@croit.io> 
>>> wrote:
>>> 
>>> Hi Niklaus,
>>> 
>>> Lots of questions here, but let me try to get through some of them.
>>> 
>>> Personally, unless a cluster is for deep archive, I would never suggest 
>>> configuring or deploying a cluster without RocksDB and WAL on NVMe.
>>> There are a number of benefits to this in terms of performance and 
>>> recovery.  Small writes go to the NVMe first before being written to the 
>>> HDD, and it makes many recovery operations far more efficient.
>>> 
>>> As to how much faster it makes things, that very much depends on the type 
>>> of workload you have on the system.  Lots of small writes will see a 
>>> significant difference; very large writes, not as much.
>>> Things like compactions of the RocksDB database are a lot faster, as they 
>>> are now running from NVMe and not from the HDD.
>>> 
>>> We normally work with up to a 1:12 ratio, so 1 NVMe for every 12 HDDs.  
>>> This assumes the NVMes being used are good mixed-use enterprise NVMes with 
>>> power loss protection.
>>> 
>>> As to failures, yes, a failure of the NVMe would mean the loss of 12 OSDs, 
>>> but this is no worse than a failure of an entire node.  This is something 
>>> Ceph is designed to handle.
>>> 
>>> I certainly wouldn’t be thinking about putting the NVMes into RAID sets, as 
>>> that will degrade their performance just when you are trying to get better 
>>> performance.
>>> 
>>> 
>>> 
>>> Darren Soothill
>>> 
>>> 
>>> Looking for help with your Ceph cluster? Contact us at https://croit.io/
>>> 
>>> croit GmbH, Freseniusstr. 31h, 81247 Munich
>>> CEO: Martin Verges - VAT-ID: DE310638492
>>> Com. register: Amtsgericht Munich HRB 231263
>>> Web: https://croit.io/ | YouTube: https://goo.gl/PGE1Bx
>>> 
>>> 
>>> 
>>> 
> 
> 
> 
> -- 
> Alexander E. Patrakov
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
