> Just my 2 cents, please wait for a second opinion. I wouldn't want to give
> incorrect/incomplete advice.
You must be new to the interwebs ;)

> It's best practice to present Ceph the raw block devices on which you're
> going to deploy your OSDs.

I think the OP was asking about the boot / OS volume, not OSDs.

> No (active) RAID controller in between, just the raw device. You can use a
> RAID controller

See my history of posts about those things. Nothing but trouble and expense.
Sometimes a pure-NVMe system without any HBA costs *less* than one with a
RAID HBA. Almost nobody monitors RAID HBAs properly, or keeps the firmware
updated.

> but configure it so that the OS sees the RAW device, not a block device the
> raid controller itself presents to the OS. I think i't called IT-mode or
> hbamode.

IT (Initiator Target) HBAs are passthrough all the time; IR (Integrated RAID)
HBAs are capable of virtual drives. Sometimes one sees people flashing IT
firmware onto IR HBAs for better performance, though this practice is
fraught. These days IR HBAs can usually pass through non-VD drives so the OS
sees them directly, which makes for improved observability and the ability to
do wacky things like update firmware on them. In the past one would have to
wrap every drive in a degenerate VD just for the OS to see it. (A quick way
to sanity-check passthrough is sketched below.)

> Please someone correct me if I'm wrong but AFAIK, it doesn't really matter on
> which block devices you're running your OS, including where the Ceph binaries
> are located and/or /var/lib/ceph. Just make sure it's reliable storage and
> sufficiently fast like you would spec your regular servers.

Sharing boot with data is in general not a great strategy. Boot drives are
not designed for an OSD duty cycle, and I've seen them cause Ceph daemon
impact when too much other crap is piled onto them. Separate boot drives also
make it a lot easier to reimage the OS without losing OSDs (rough sketch of
that below as well).

> Also make sure to choose suitable block devices for OSDs! HDDs are definitely
> possible and used in Ceph clusters but result in a relatively slow cluster
> unless you've got a massive amount of them.

We lose money with every sale but make up for it in volume!

> Eg, if you want to use it for archival purposes, and performance doesn't
> really matter, it might work.

At scale, including things like whole DC racks, coarse-IU QLC can actually
cost less than HDDs, and it's less risky because it doesn't take a month or
more to recover from a failure.

> For SSDs, definitely go enterprise class with PLP! Don't bargain on PLP and
> don't go for consumer class SSDs. Its a big big no-no you definitely want to
> avoid!

Absolutely. I know of a cluster with the CephFS metadata pool on Sabrents.
Ticking bomb.

> And in case you skipped the paragraph above: Don't compromise on PLP and go
> for Enterprise class SSDs, pretty please 😉🙂.
>
> And while thinking about SSDs, our Ceph support partner once told me there
> are certain SSDs that are known to cause problems. I don't know the specifics
> of it, but once you're speccing your cluster, I think it's a good idea to
> post your proposal here or in the Slack channel for review. Who knows,
> someone might possibly chime in with an actual experience with your hardware.

We have to be careful when discussing specific companies, but most SSD
failures are firmware issues and can be addressed in situ with an upgrade.
I've been through this myself.
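For the generic NVMe case, this is roughly what a field update looks like
with nvme-cli. Treat it as a sketch: the filename is made up, slot and
action values vary by drive, a reset or reboot is usually needed, and many
vendors ship their own updater that does all of this for you.

  # Hypothetical firmware image; get the file and procedure from the vendor.
  nvme list                                        # identify the device
  nvme fw-download /dev/nvme0 --fw=vendor_fw.bin   # stage the image
  nvme fw-commit /dev/nvme0 --slot=1 --action=1    # commit to slot 1; usually
                                                   # activates on next reset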
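Back to the IT/IR point above: a quick, non-authoritative way to check that
the OS is really seeing raw drives rather than controller-fabricated virtual
disks (device names below are just examples).

  # Whole disks with the model/serial the kernel sees; passthrough drives
  # report their real model strings, while VDs typically report the
  # controller's product string instead.
  lsblk -d -o NAME,MODEL,SERIAL,SIZE,ROTA

  # SMART is reachable directly on passthrough drives; behind a VD you would
  # need vendor passthrough such as smartctl -d megaraid,N.
  smartctl -i /dev/sda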
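And on reimaging without losing OSDs: a minimal sketch, assuming LVM-based
OSDs and a package-based (non-cephadm) install, with ceph.conf and the
keyrings restored to the freshly installed OS first.

  # OSD metadata lives in LVM tags on the data devices, so after the OS
  # reinstall the existing OSDs can usually be re-adopted, not rebuilt:
  ceph-volume lvm list             # OSDs discovered from LVM tags
  ceph-volume lvm activate --all   # recreate tmpfs mounts + systemd units

On cephadm deployments the orchestrator redeploys the OSD containers once the
host is back in the cluster; IIRC there is also ceph cephadm osd activate
<host> for exactly this, but verify on your release.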
> Tentacle has just been released, while it's a stable release you might not
> want to go Tentacle just yet until there's at least one point release.
>
> There's Squid, but you might want to be careful and disable
> bluestore_elastic_shared_blobs before you start deploying OSDs especially if
> you're looking to use Erasure Coded pools rather than replicated pools.

… or deploy on the latest Reef then upgrade. (If you do go Squid, flipping
that option is sketched below.)

> I discussed about some practices for ceph node hardware configuration and
> specifically about system disk on non-RAID drive (e.g. internal m2).

It's a strategy.

> Failure domain is host or rack, nvme cluster.
>
> Is it OK or is it one of possible and "adopted" way to deploy ceph nodes
> with system disk without RAID1 (hw or sw)? Do you have some experience with
> such nodes?

Sure, if you like. I would shy away from doing that on smaller clusters,
where the loss of a whole failure domain means, say, 5% or more of cluster
capacity. If you have 250 nodes in your cluster, then losing one isn't nearly
so big a deal. HBA RAID is IMHO mostly an artifact of the days of a graphical
desktop OS mistaken for a server OS. I would much rather use MD; ymmv. (A
minimal MD example for the boot device is below.)
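Re the Squid caveat above, if you do go that route: a minimal sketch of
disabling the option cluster-wide, assuming the option name quoted above
matches your release.

  ceph config set osd bluestore_elastic_shared_blobs false
  ceph config get osd bluestore_elastic_shared_blobs   # verify it stuck
  # Per the advice above this matters at OSD-creation time, so set it
  # before deploying any OSDs.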
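And since I brought up MD: a minimal sketch of mirroring just the boot/OS
device with mdadm, with made-up partition names; most distro installers can
set this up for you at install time anyway.

  # RAID1 across the OS partitions of two small boot drives.
  mdadm --create /dev/md0 --level=1 --raid-devices=2 \
        /dev/nvme0n1p2 /dev/nvme1n1p2
  cat /proc/mdstat            # watch the initial resync
  mkfs.xfs /dev/md0           # then install or migrate the OS onto md0
  # Note: an EFI system partition needs special handling (metadata 1.0,
  # or simply keep a copy on each drive).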
_______________________________________________
ceph-users mailing list -- [email protected]
To unsubscribe send an email to [email protected]