[ceph-users] Re: recommendation for barebones server with 8-12 direct attach NVMe?

2024-01-17 Thread Anthony D'Atri



> Also in our favour is that the users of the cluster we are currently 
> intending for this have established a practice of storing large objects.

That definitely is in your favor.

> but it remains to be seen how 60x 22TB behaves in practice.

Be sure you don't get SMR drives.

>  and it's hard for it to rebalance.


^ This.

> What is QLC?

QLC SSDs store 33% more data per cell than TLC: 4 bits per cell vs 3, which means 16 voltage levels vs 8.
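
For the arithmetic behind that claim, a back-of-the-envelope sketch assuming ideal scaling on the same NAND:

# TLC vs QLC: bits per cell, voltage levels, and capacity gain (ideal scaling).
tlc_bits, qlc_bits = 3, 4
print(2 ** tlc_bits, "vs", 2 ** qlc_bits, "voltage levels per cell")  # 8 vs 16
print(f"{(qlc_bits / tlc_bits - 1) * 100:.0f}% more data per cell")   # ~33%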

> Fascinating to hear about destroy-redeploy being safer than a simple 
> restart-recover!

This was on Luminous; that dynamic may be different now, esp. with Nautilus async 
recovery.  

> Agreed. I guess I wanted to add the data point that these kinds of clusters 
> can and do make full sense in certain contexts, and push a little away from 
> "friends don't let friends use HDDs" dogma.

Understood.  Some deployments aren't squeezed for DC space -- today.  But since 
many HDD deployments are using LFF chassis, the form factor and interface 
limitations down the road still complicate expansion and SSD utilization.

> For now, we limit individual cloud volumes to 300 IOPs, doubled for those who 
> need it.

I'm curious how many clients / volumes you have vs. the number of HDD OSDs, and 
whether you're using replication or EC.  If you have relatively few clients per 
HDD, that would definitely improve the dynamic.
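
To make that arithmetic concrete, a hedged sketch with made-up counts and a rough IOPS-per-spindle figure; none of these numbers are from the cluster in question:

# Hypothetical back-of-the-envelope: per-volume IOPS budget on an HDD pool.
# All numbers below are illustrative assumptions.
hdd_osds     = 600    # number of HDD OSDs
iops_per_hdd = 150    # rough random-IOPS figure for a 7.2k RPM spinner
replica_size = 3      # replicated pool; EC changes the write math
volumes      = 2000   # client RBD volumes

cluster_write_iops = hdd_osds * iops_per_hdd / replica_size  # each write lands on 'size' OSDs
per_volume_budget  = cluster_write_iops / volumes
print(f"~{cluster_write_iops:.0f} aggregate write IOPS, "
      f"~{per_volume_budget:.0f} per volume if all are busy at once")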



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: recommendation for barebones server with 8-12 direct attach NVMe?

2024-01-17 Thread Gregory Orange

On 16/1/24 11:39, Anthony D'Atri wrote:
by “RBD for cloud”, do you mean VM / container general-purpose volumes 
on which a filesystem is usually built?  Or large archive / backup 
volumes that are read and written sequentially without much concern for 
latency or throughput?


General purpose volumes for cloud instance filesystems. Performance is 
not high, but requirements are a moving target, and it performs better 
than it used to, so decision makers and users are satisfied. If more 
targeted requirements start to arise, of course architecture and costs 
will change.



How many of those ultra-dense chassis in a cluster?  Are all 60 off a 
single HBA?


When we deploy prod RGW there, it may be 10-20 in a cluster. Yes, there is 
a single HBA with four miniSAS ports per head node, one HBA for each 
chassis.



I’ve experienced RGW clusters built from 4x 90 slot ultra-dense chassis, 
each of which had 2x server trays, so effectively 2x 45 slot chassis 
bound together.  The bucket pool was EC 3,2 or 4,2.  The motherboard was 
…. odd, as a certain chassis vendor had a thing for them at a certain point 
in time.  With only 12 DIMM slots each, they were chronically short on 
RAM and the single HBA was a bottleneck.  Performance was acceptable for 
the use-case …. at first.  As the cluster filled up and got busier, that 
was no longer the case.  And these were 8TB capped drives.  Not all 
slots were filled, at least initially.


The index pool was on separate 1U servers with SATA SSDs.


This sounds similar to our plans, albeit with denser nodes and an NVMe 
index pool. Also in our favour is that the users of the cluster we are 
currently intending for this have established a practice of storing 
large objects.



There were hotspots, usually relatively small objects that clients 
hammered on.  A single OSD restarting and recovering would tank the API; 
we found it better to destroy and redeploy it.   Expanding faster than 
data was coming in was a challenge, as we had to throttle the heck out 
of the backfill to avoid rampant slow requests and API impact.


QLC with a larger number of OSD node failure domains was a net win in 
that RAS was dramatically increased, and expensive engineer-hours 
weren’t soaked up fighting performance and availability issues.


Thank you, this is helpful information. We haven't had that kind of 
performance concern with our RGW on 24x 14TB nodes, but it remains to be 
seen how 60x 22TB behaves in practice. Rebalancing is a big 
consideration, particularly if we have a whole node failure. We are 
currently contemplating a PG split, and even more IO, since the growing 
data volume and subsequent node additions have left us with a low PG/OSD 
ratio and it's hard for it to rebalance.
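
That ratio is easy to sanity-check. A sketch with placeholder numbers; the usual rule of thumb is on the order of ~100 PGs per OSD:

# Placeholder numbers: effective PGs per OSD for an EC pool.
pg_num     = 2048   # PGs in the pool
ec_k, ec_m = 4, 2   # EC profile; for a replicated pool use 'size' instead of k+m
num_osds   = 720    # OSDs backing the pool

pgs_per_osd = pg_num * (ec_k + ec_m) / num_osds
print(f"~{pgs_per_osd:.0f} PGs per OSD")  # well below ~100 suggests a split is due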


What is QLC?

Fascinating to hear about destroy-redeploy being safer than a simple 
restart-recover!




ymmv


Agreed. I guess I wanted to add the data point that these kinds of 
clusters can and do make full sense in certain contexts, and push a 
little away from "friends don't let friends use HDDs" dogma.



If spinners work for your 
purposes and you don’t need IOPs or the ability to provision SSDs down 
the road, more power to you.


I expect our road to be long, and SSD usage will grow as the capital 
dollars, performance and TCO metrics change over time. For now, we limit 
individual cloud volumes to 300 IOPs, doubled for those who need it.
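
For anyone curious how such a cap is commonly applied: one option is librbd QoS (Nautilus or later), though it can equally be done at the hypervisor layer. A sketch with a hypothetical pool/image name:

# Hedged sketch: cap a single RBD image at ~300 IOPS via librbd QoS (Nautilus+).
# The pool/image name is a placeholder; throttling may instead be done in
# libvirt/QEMU or the cloud layer.
import subprocess

pool_image = "cloud-volumes/volume-1234"   # hypothetical image
subprocess.run(["rbd", "config", "image", "set", pool_image,
                "rbd_qos_iops_limit", "300"], check=True)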

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: recommendation for barebones server with 8-12 direct attach NVMe?

2024-01-16 Thread Drew Weaver
>Groovy.  Channel drives are IMHO a pain, though in the case of certain 
>manufacturers it can be the only way to get firmware updates.  Channel drives 
>often only have a 3 year warranty, vs 5 for generic drives.

Yes, we have run into this with Kioxia as far as being able to find new 
firmware. Which MFG tends to be the most responsible in this regard, in your 
view? Not looking for a vendor rec or anything, just asking specifically about 
this one particular issue.

Thanks!
-Drew

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: recommendation for barebones server with 8-12 direct attach NVMe?

2024-01-16 Thread Anthony D'Atri

> 
> NVMe SSDs shouldn’t cost significantly more than SATA SSDs.  Hint:  certain 
> tier-one chassis manufacturers mark both the fsck up.  You can get a better 
> warranty and pricing by buying drives from a VAR.
> 
>  We stopped buying “Vendor FW” drives a long time ago.

Groovy.  Channel drives are IMHO a pain, though in the case of certain 
manufacturers it can be the only way to get firmware updates.  Channel drives 
often only have a 3 year warranty, vs 5 for generic drives.


> Although when the PowerEdge R750 originally came out they removed the ability 
> for the DRAC to monitor the endurance of the non-vendor SSDs to penalize us, 
> it took about 6 months of arguing to get them to put that back in.

I've seen a bug on R440s with certain drives around this as well, where a drive 
was falsely reported as EOL.  It's a much better idea to monitor yourself than 
to trust iDRAC or any other BMC to do this.  


> It’s a trap!  Which is to say, that the $/GB really isn’t far away, and in 
> fact once you step back to TCO from the unit economics of the drive in 
> isolation, the HDDs often turn out to be *more* expensive.
> 
>  I suppose depending on what DWPD/endurance you are assuming on 
> the SSDs but also in my very specific case we have PBs of HDDs in inventory 
> so that costs us…no additional money.

Fair enough; my remarks naturally are with respect to net new acquisitions.  
OpEx of HDDs is still higher though.


> My comment on there being more economical NVMe disks available was simply 
> that if we are all changing over to NVMe but we don’t right now need to be 
> able to move 7GB/s per drive

It's not just about performance; it's about drives that will still be available, 
if at all, 5 years from now.  

> it would be cool to just stop buying anything with SATA in it and then just 
> change out the drives later.  Which was kind of the vibe with SATA when SSDs 
> were first introduced. Everyone disagrees with me on this point but it 
> doesn’t really make sense that you have to choose between SATA or NVME on a 
> system with a backplane.

There are "universal" backplanes that will accept both, but of course you pay 
more and still need an HBA, even if it's built into the motherboard.


> 
> But yes I see all of your points as far as if I was trying to build a Ceph 
> cluster as primary storage and had a budget for this project. That would 
> indeed change everything about my algebra.
> 
> Thanks for your time and consideration I appreciate it.
> 
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: recommendation for barebones server with 8-12 direct attach NVMe?

2024-01-16 Thread Drew Weaver
By HBA I suspect you mean a non-RAID HBA?

Yes, something like the HBA355

NVMe SSDs shouldn’t cost significantly more than SATA SSDs.  Hint:  certain 
tier-one chassis manufacturers mark both the fsck up.  You can get a better 
warranty and pricing by buying drives from a VAR.

  We stopped buying “Vendor FW” drives a long time ago. Although 
when the PowerEdge R750 originally came out they removed the ability for the 
DRAC to monitor the endurance of the non-vendor SSDs to penalize us, it took 
about 6 months of arguing to get them to put that back in.

It’s a trap!  Which is to say, that the $/GB really isn’t far away, and in fact 
once you step back to TCO from the unit economics of the drive in isolation, 
the HDDs often turn out to be *more* expensive.

  I suppose depending on what DWPD/endurance you are assuming on 
the SSDs but also in my very specific case we have PBs of HDDs in inventory so 
that costs us…no additional money. My comment on there being more economical 
NVMe disks available was simply that if we are all changing over to NVMe but we 
don’t right now need to be able to move 7GB/s per drive it would be cool to 
just stop buying anything with SATA in it and then just change out the drives 
later.  Which was kind of the vibe with SATA when SSDs were first introduced. 
Everyone disagrees with me on this point but it doesn’t really make sense that 
you have to choose between SATA or NVME on a system with a backplane.

But yes I see all of your points as far as if I was trying to build a Ceph 
cluster as primary storage and had a budget for this project. That would indeed 
change everything about my algebra.

Thanks for your time and consideration I appreciate it.

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: recommendation for barebones server with 8-12 direct attach NVMe?

2024-01-15 Thread Anthony D'Atri
by “RBD for cloud”, do you mean VM / container general-purpose volumes on 
which a filesystem is usually built?  Or large archive / backup volumes that 
are read and written sequentially without much concern for latency or 
throughput?

How many of those ultra-dense chassis in a cluster?  Are all 60 off a single 
HBA?

I’ve experienced RGW clusters built from 4x 90 slot ultra-dense chassis, each 
of which had 2x server trays, so effectively 2x 45 slot chassis bound together. 
 The bucket pool was EC 3,2 or 4,2.  The motherboard was …. odd, as a certain 
chassis vendor had a thing for them at a certain point in time.  With only 12 DIMM 
slots each, they were chronically short on RAM and the single HBA was a 
bottleneck.  Performance was acceptable for the use-case …. at first.  As the 
cluster filled up and got busier, that was no longer the case.  And these were 
8TB capped drives.  Not all slots were filled, at least initially.

The index pool was on separate 1U servers with SATA SSDs.

There were hotspots, usually relatively small objects that clients hammered on. 
 A single OSD restarting and recovering would tank the API; we found it better 
to destroy and redeploy it.   Expanding faster than data was coming in was a 
challenge, as we had to throttle the heck out of the backfill to avoid rampant 
slow requests and API impact.
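
For readers from the archives: the throttling described above is usually done with the standard OSD backfill/recovery knobs. A minimal sketch, assuming a release with the centralized config store (Mimic or later); the values are illustrative, not necessarily what was used here:

# Hedged sketch: dial back backfill/recovery to protect client I/O.
# Option names are standard Ceph OSD settings; values are illustrative only.
import subprocess

throttle = {
    "osd_max_backfills": "1",         # concurrent backfills per OSD
    "osd_recovery_max_active": "1",   # concurrent recovery ops per OSD
    "osd_recovery_sleep_hdd": "0.2",  # sleep between recovery ops on HDD OSDs
}

for opt, val in throttle.items():
    # 'ceph config set osd ...' applies to the osd section cluster-wide (Mimic+)
    subprocess.run(["ceph", "config", "set", "osd", opt, val], check=True)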

QLC with a larger number of OSD node failure domains was a net win in that RAS 
was dramatically increased, and expensive engineer-hours weren’t soaked up 
fighting performance and availability issues.  

ymmv, especially if one’s organization has unreasonably restrictive purchasing 
policies, row after row of empty DC racks, etc.  I’ve suffered LFF spinners — 
just 3 / 4 TB — misused for  OpenStack Cinder and Glance.  Filestore with 
(wince) colocated journals * with 3R pools — EC for RBD was not yet a thing, 
else we would have been forced to make it even worse.  The stated goal of the 
person who specked the hardware was for every instance to have the performance 
of its own 5400 RPM HDD.  Three fallacies there:  1) that anyone would consider 
that acceptable 2) that it would be sustainable during heavy usage or 
backfill/recovery and especially 3) that 450 / 3 = 2000.  It was just 
miserable.  I suspect that your use-case is different.  If spinners work for 
your purposes and you don’t need IOPs or the ability to provision SSDs down the 
road, more power to you.
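
Spelling out fallacy 3 with rough spinner numbers (the per-drive IOPS figure is an assumption):

# Rough illustration: 450 HDDs behind 3x replication cannot give 2000 instances
# each "their own" 5400 RPM drive's worth of IOPS.
hdds, replication, instances = 450, 3, 2000
iops_per_hdd = 75   # ballpark random IOPS for a slow spinner (assumption)

usable_write_iops = hdds * iops_per_hdd / replication
print(f"~{usable_write_iops:.0f} cluster write IOPS, "
      f"~{usable_write_iops / instances:.1f} per instance")  # nowhere near a dedicated spindle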




* Which tickled a certain HDD mfg’s design flaws in a manner that substantially 
risked data availability and durability, in turn directly costing the 
organization substantial user dissatisfaction and hundreds of thousands of 
dollars.

> 
> These kinds of statements make me at least ask questions. Dozens of 14TB HDDs 
> have worked reasonably well for us for four years of RBD for cloud, and 
> hundreds of 16TB HDDs have satisfied our requirements for two years of RGW 
> operations, such that we are deploying 22TB HDDs in the next batch. It 
> remains to be seen how well 60 disk SAS-attached JBOD chassis work, but we 
> believe we have an effective use case.

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: recommendation for barebones server with 8-12 direct attach NVMe?

2024-01-15 Thread Gregory Orange

On 12/1/24 22:32, Drew Weaver wrote:

So we were going to replace a Ceph cluster with some hardware we had
laying around using SATA HBAs but I was told that the only right way to
build Ceph in 2023 is with direct attach NVMe.


These kinds of statements make me at least ask questions. Dozens of 14TB 
HDDs have worked reasonably well for us for four years of RBD for cloud, 
and hundreds of 16TB HDDs have satisfied our requirements for two years 
of RGW operations, such that we are deploying 22TB HDDs in the next 
batch. It remains to be seen how well 60 disk SAS-attached JBOD chassis 
work, but we believe we have an effective use case.

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: recommendation for barebones server with 8-12 direct attach NVMe?

2024-01-15 Thread Anthony D'Atri


> 
> Now that you say it's just backups/archival, QLC might be excessive for
> you (or a great fit if the backups are churned often).

PLC isn’t out yet, though, and probably won’t have a conventional block 
interface.

> USD70/TB is the best public large-NVME pricing I'm aware of presently; for QLC
> 30TB drives. Smaller capacity drives do get down to USD50/TB.
> 2.5" SATA spinning disk is USD20-30/TB.

2.5” spinners top out at 5TB last I checked, and a certain chassis vendor only 
resells half that capacity.

But as I’ve written, *drive* unit economics are myopic.  We don’t run 
palletloads of drives, we run *servers* with drive bays, admin overhead, switch 
ports, etc., that take up RUs, eat amps, and fart out watts.

> PCIe bandwidth: this goes for NVME as well as SATA/SAS.
> I won't name the vendor, but I saw a weird NVME server with 50+ drive
> slots.  Each drive slot was x4 lane width but had a number of PCIe
> expanders in the path from the motherboard, so if you were trying to max 
> it out, simultaneously using all the drives, each drive only got 
> ~1.7x usable PCIe4.0 lanes.

I’ve seen a 2U server with … 102 IIRC E1.L bays, but it was only Gen3.

> Compare that to the Supermicro servers I suggested: The AMD variants use
> a H13SSF motherboard, which provides 64x PCIe5.0 lanes, split into 32x
> E3.S drive slots, and each drive slot has 4x PCIe 4.0, no
> over-subscription.

Having the lanes and filling them are two different things though.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: recommendation for barebones server with 8-12 direct attach NVMe?

2024-01-15 Thread Peter Grandi
>> So we were going to replace a Ceph cluster with some hardware we had
>> laying around using SATA HBAs but I was told that the only right way
>> to build Ceph in 2023 is with direct attach NVMe.

My impression are somewhat different:

* Nowadays it is rather more difficult to find 2.5in SAS or SATA
  "Enterprise" SSDs than most NVMe types. NVMe as a host bus
  also has much greater bandwidth than SAS or SATA, but Ceph is
  mostly about IOPS rather than single-device bandwidth. So in
  general, willingly or not, one has got to move to NVMe.

* Ceph was designed (and most people have forgotten it) for many
  small-capacity 1-OSD cheap servers, and lots of them, but
  unfortunately it is not easy to find small cheap "enterprise"
  SSD servers. In part because many people rather unwisely use
  capacity per server-price as a figure of merit, most NVMe
  servers have many slots, which means either RAID-ing devices
  into a small number of large OSDs, which goes against all Ceph
  stands for, or running many OSD daemons on one system, which
  works-ish but is not ideal.

>> Does anyone have any recommendation for a 1U barebones server
>> (we just drop in ram disks and cpus) with 8-10 2.5" NVMe bays
>> that are direct attached to the motherboard without a bridge
>> or HBA for Ceph specifically?

> If you're buying new, Supermicro would be my first choice for
> vendor based on experience.
> https://www.supermicro.com/en/products/nvme

Indeed, SuperMicro does them fairly well, and there are also
GigaByte and Tyan I think; I have not yet seen Intel-based models.

> You said 2.5" bays, which makes me think you have existing
> drives. There are models to fit that, but if you're also
> considering new drives, you can get further density in E1/E3

BTW "NVMe" is a bus specification (something not too different
from SCSI-over-PCIe), and there are several different physical
specifications, like 2.5in U.2 (SFF-8639), 2.5in U.3
(SFF-TA-1001), and various types of EDSFF (SFF-TA-1006,7,8). U.3
is still difficult to find but its connector supports SATA, SAS
and NVMe U.2; I have not yet seen EDSFF boxes actually available
retail without enormous delivery times, I guess the big internet
companies buy all the available production.

https://nvmexpress.org/wp-content/uploads/Session-4-NVMe-Form-Factors-Developer-Day-SSD-Form-Factors-v8.pdf
https://media.kingston.com/kingston/content/ktc-content-nvme-general-ssd-form-factors-graph-en-3.jpg
https://media.kingston.com/kingston/pdf/ktc-article-understanding-ssd-technology-en.pdf
https://www.snia.org/sites/default/files/SSSI/OCP%20EDSFF%20JM%20Hands.pdf

> The only caveat is that you will absolutely want to put a
> better NIC in these systems, because 2x10G is easy to saturate
> with a pile of NVME.

That's one reason why Ceph was designed for many small 1-OSD
servers (ideally distributed across several racks) :-). Note: to
maximize chances of many-to-many traffic instead of many-to-one.
Anyhow Ceph again is all about lots of IOPS more than
bandwidth, but if you need bandwidth, nowadays many 10Gb NICs
support 25Gb/s too, and 40Gb/s and 100Gb/s are no longer that
expensive (but the cables are horrible).
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: recommendation for barebones server with 8-12 direct attach NVMe?

2024-01-15 Thread Robin H. Johnson
On Mon, Jan 15, 2024 at 03:21:11PM +, Drew Weaver wrote:
> Oh, well what I was going to do was just use SATA HBAs on PowerEdge R740s 
> because we don't really care about performance as this is just used as a copy 
> point for backups/archival but the current Ceph cluster we have [Which is 
> based on HDDs attached to Dell RAID controllers with each disk in RAID-0 and 
> works just fine for us] is on EL7 and that is going to be EOL soon. So I 
> thought it might be better on the new cluster to use HBAs instead of having 
> the OSDs just be single disk RAID-0 volumes because I am pretty sure that's 
> the least good scenario whether or not it has been working for us for like 8 
> years now.
> 
> So I asked on the list for recommendations and also read on the website and 
> it really sounds like the only "right way" to run Ceph is by directly 
> attaching disks to a motherboard. I had thought that HBAs were okay before 
> but I am probably confusing that with ZFS/BSD or some other equally 
> hyperspecific requirement. The other note was about how using NVMe seems to 
> be the only right way now too.
> 
> I would've rather just stuck to SATA but I figured if I was going to have to 
> buy all new servers that direct attach the SATA ports right off the 
> motherboards to a backplane I may as well do it with NVMe (even though the 
> price of the media will be a lot higher).
> 
> It would be cool if someone made NVMe drives that were cost competitive and 
> had similar performance to hard drives (meaning, not super expensive but not 
> lightning fast either) because the $/GB on datacenter NVMe drives like 
> Kioxia, etc is still pretty far away from what it is for HDDs (obviously).

I think as a collective, the mailing list didn't do enough to ask about
your use case for the Ceph cluster earlier in the thread.

Now that you say it's just backups/archival, QLC might be excessive for
you (or a great fit if the backups are churned often).

USD70/TB is the best public large-NVME pricing I'm aware of presently; for QLC
30TB drives. Smaller capacity drives do get down to USD50/TB.
2.5" SATA spinning disk is USD20-30/TB.
All of those are much higher than the USD15-20/TB for 3.5" spinning disk
made for 24/7 operation.

Maybe it would also help as a community to explain "why" on the
perceptions of "right way".

It's a tradeoff in what you're doing: you don't want to
bottleneck/saturate critical parts of the system.

PCIe bandwidth: this goes for NVME as well as SATA/SAS.
I won't name the vendor, but I saw a weird NVME server with 50+ drive
slots.  Each drive slot was x4 lane width but had a number of PCIe
expanders in the path from the motherboard, so if you were trying to max
it out, simultaneously using all the drives, each drive only got
~1.7x usable PCIe4.0 lanes.

Compare that to the Supermicro servers I suggested: The AMD variants use
a H13SSF motherboard, which provides 64x PCIe5.0 lanes, split into 32x
E3.S drive slots, and each drive slot has 4x PCIe 4.0, no
over-subscription.

On that same Supermicro system, how do you get the data out? There are
two PCIe 5.0 x16 slots for your network cards, so you only need to
saturate at most HALF the drives to saturate the network.

Taking this back to the SATA/SAS servers: if you had a 16-port HBA,
with only PCIe 2.0 x8, theoretical max 4GB/sec. Say you filled it with
Samsung QVO drives, and efficiently used them for 560MB/sec.
The drives can collectively get almost 9GB/sec.
=> probably worthwhile to buy a better HBA.
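
The arithmetic behind both examples, as a sketch; the per-lane figures are rounded, and the drive/lane counts for the oversubscribed chassis are guesses chosen to reproduce the ~1.7 figure:

# Rounded per-lane throughput approximations.
PCIE2_GB_PER_LANE = 0.5   # ~500 MB/s usable per PCIe 2.0 lane
PCIE4_GB_PER_LANE = 2.0   # ~2 GB/s usable per PCIe 4.0 lane

# 16-port HBA on PCIe 2.0 x8 vs 16 SATA SSDs at ~560 MB/s each
hba_uplink   = 8 * PCIE2_GB_PER_LANE   # ~4 GB/s
drive_demand = 16 * 0.56               # ~9 GB/s
print(f"HBA uplink ~{hba_uplink:.1f} GB/s vs drive demand ~{drive_demand:.1f} GB/s")

# Oversubscribed NVMe chassis: 50+ x4 slots funneled through too few host lanes
drives, host_lanes = 52, 88            # illustrative guesses
print(f"~{host_lanes / drives:.1f} effective PCIe 4.0 lanes per drive, "
      f"~{host_lanes / drives * PCIE4_GB_PER_LANE:.1f} GB/s per drive")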

On the HBA side, some of the controllers, in any RAID mode (including
single-disk RAID0), cannot handle saturating every port at the same
time: the little CPU is just doing too much work. Those same controllers
in a passthrough/IT mode are fine because the CPU doesn't do work
anymore.

This turned out more rambling than I intended, but how can we capture
the 'why' of the recommendations into something usable by the community,
and have everybody be able to read that (esp. for those that don't want
to engage on a mailing list).

-- 
Robin Hugh Johnson
Gentoo Linux: Dev, Infra Lead, Foundation President & Treasurer
E-Mail   : robb...@gentoo.org
GnuPG FP : 11ACBA4F 4778E3F6 E4EDF38E B27B944E 34884E85
GnuPG FP : 7D0B3CEB E9B85B1F 825BCECF EE05E6F6 A48F6136


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: recommendation for barebones server with 8-12 direct attach NVMe?

2024-01-15 Thread Anthony D'Atri


> Oh, well what I was going to do was just use SATA HBAs on PowerEdge R740s 
> because we don't really care about performance

That is important context.

> as this is just used as a copy point for backups/archival but the current 
> Ceph cluster we have [Which is based on HDDs attached to Dell RAID 
> controllers with each disk in RAID-0 and works just fine for us]

The H330?  You can set passthrough / JBOD / HBA personality and avoid the RAID0 
dance.

> is on EL7 and that is going to be EOL soon. So I thought it might be better 
> on the new cluster to use HBAs instead of having the OSDs just be single disk 
> RAID-0 volumes because I am pretty sure that's the least good scenario 
> whether or not it has been working for us for like 8 years now.

See above.

> So I asked on the list for recommendations and also read on the website and 
> it really sounds like the only "right way" to run Ceph is by directly 
> attaching disks to a motherboard

That isn’t quite what I meant.

If one is specking out *new* hardware:

* HDDs are a false economy
* SATA / SAS SSDs hobble performance for little or no cost savings over NVMe
* RAID HBAs are fussy and a waste of money in 2023


>  I had thought that HBAs were okay before

By HBA I suspect you mean a non-RAID HBA?

> but I am probably confusing that with ZFS/BSD or some other equally 
> hyperspecific requirement.

ZFS indeed prefers as little as possible between it and the drives.  The 
benefits for Ceph are not identical but very congruent.

> The other note was about how using NVMe seems to be the only right way now 
> too.

If we predicate that HDDs are a dead end, then that leaves us with SAS/SATA SSD 
vs NVMe SSD.

SAS is all but dead, and carries a price penalty.
SATA SSDs are steadily declining in the market.  5-10 years from now I suspect 
that no more than one manufacturer of enterprise-class SATA SSDs will remain.  
The future is PCIe. SATA SSDs don’t save any money over NVMe SSDs, and 
additionally require some sort of HBA, be it an add-in card or on the 
motherboard.  SATA and NVMe SSDs use the same NAND, just with a different 
interface.


> I would've rather just stuck to SATA but I figured if I was going to have to 
> buy all new servers that direct attach the SATA ports right off the 
> motherboards to a backplane

On-board SATA chips may be relatively weak but I don’t know much about current 
implementations.

> I may as well do it with NVMe (even though the price of the media will be a 
> lot higher).

NVMe SSDs shouldn’t cost significantly more than SATA SSDs.  Hint:  certain 
tier-one chassis manufacturers mark both the fsck up.  You can get a better 
warranty and pricing by buying drives from a VAR.

> It would be cool if someone made NVMe drives that were cost competitive and 
> had similar performance to hard drives (meaning, not super expensive but not 
> lightning fast either) because the $/GB on datacenter NVMe drives like 
> Kioxia, etc is still pretty far away from what it is for HDDs (obviously).

It’s a trap!  Which is to say, that the $/GB really isn’t far away, and in fact 
once you step back to TCO from the unit economics of the drive in isolation, 
the HDDs often turn out to be *more* expensive.

Pore through this:  https://www.snia.org/forums/cmsi/programs/TCOcalc

* $/IOPS are higher for any HDD compared to NAND
* HDDs are available up to what, 22TB these days?  With the same tired SATA 
interface as when they were 2TB.  That’s rather a bottleneck.  We see HDD 
clusters limiting themselves to 8-10TB HDDs all the time; in fact AIUI RHCS 
stipulates no larger than 10TB.  Feed that into the equation and the TCO 
changes a bunch
* HDDs not only hobble steady-state performance, but under duress — expansion, 
component failure, etc., the impact to client operations will be higher and 
recovery to desired redundancy will be much longer.  I’ve seen a cluster — 
especially when using EC — take *4 weeks* to weight an 8TB HDD OSD up or down.  
Consider the operational cost and risk of that.  The SNIA calc has a 
performance multiplier that accounts for this.
* A SATA chassis is stuck with SATA, 5-10 years from now that will be 
increasingly limiting, especially if you go with LFF drives
* RUs cost money.  A 1U LFF server can hold what, at most 88TB raw when using 
HDDs?  With 60TB SSDs (*) one can fit 600TB of raw space into the same RU.






* If they meet your needs
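
The rack-unit arithmetic from that last bullet, sketched with assumed chassis layouts:

# Raw capacity per RU; chassis layouts below are assumptions for illustration.
lff_bays_1u, hdd_tb = 4, 22    # typical 1U LFF server with 22TB HDDs -> 88TB raw
ssd_bays_1u, ssd_tb = 10, 60   # assumed 1U NVMe server with 60TB SSDs -> 600TB raw

hdd_ru, ssd_ru = lff_bays_1u * hdd_tb, ssd_bays_1u * ssd_tb
print(f"HDD: {hdd_ru} TB/RU, SSD: {ssd_ru} TB/RU, ratio {ssd_ru / hdd_ru:.1f}x")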



> 
> Anyway thanks.
> -Drew
> 
> 
> 
> 
> 
> -Original Message-
> From: Robin H. Johnson  
> Sent: Sunday, January 14, 2024 5:00 PM
> To: ceph-users@ceph.io
> Subject: [ceph-users] Re: recommendation for barebones server with 8-12 
> direct attach NVMe?
> 
> On Fri, Jan 12, 2024 at 02:32:12PM +, Drew Weaver wrote:
>> Hello,
>> 
>> So we were going to replace a Ceph cluster with some hardware we had 

[ceph-users] Re: recommendation for barebones server with 8-12 direct attach NVMe?

2024-01-15 Thread Drew Weaver
Oh, well what I was going to do was just use SATA HBAs on PowerEdge R740s 
because we don't really care about performance as this is just used as a copy 
point for backups/archival but the current Ceph cluster we have [Which is based 
on HDDs attached to Dell RAID controllers with each disk in RAID-0 and works 
just fine for us] is on EL7 and that is going to be EOL soon. So I thought it 
might be better on the new cluster to use HBAs instead of having the OSDs just 
be single disk RAID-0 volumes because I am pretty sure that's the least good 
scenario whether or not it has been working for us for like 8 years now.

So I asked on the list for recommendations and also read on the website and it 
really sounds like the only "right way" to run Ceph is by directly attaching 
disks to a motherboard. I had thought that HBAs were okay before but I am 
probably confusing that with ZFS/BSD or some other equally hyperspecific 
requirement. The other note was about how using NVMe seems to be the only right 
way now too.

I would've rather just stuck to SATA but I figured if I was going to have to 
buy all new servers that direct attach the SATA ports right off the 
motherboards to a backplane I may as well do it with NVMe (even though the 
price of the media will be a lot higher).

It would be cool if someone made NVMe drives that were cost competitive and had 
similar performance to hard drives (meaning, not super expensive but not 
lightning fast either) because the $/GB on datacenter NVMe drives like Kioxia, 
etc is still pretty far away from what it is for HDDs (obviously).

Anyway thanks.
-Drew





-Original Message-
From: Robin H. Johnson  
Sent: Sunday, January 14, 2024 5:00 PM
To: ceph-users@ceph.io
Subject: [ceph-users] Re: recommendation for barebones server with 8-12 direct 
attach NVMe?

On Fri, Jan 12, 2024 at 02:32:12PM +, Drew Weaver wrote:
> Hello,
> 
> So we were going to replace a Ceph cluster with some hardware we had 
> laying around using SATA HBAs but I was told that the only right way 
> to build Ceph in 2023 is with direct attach NVMe.
> 
> Does anyone have any recommendation for a 1U barebones server (we just 
> drop in ram disks and cpus) with 8-10 2.5" NVMe bays that are direct 
> attached to the motherboard without a bridge or HBA for Ceph 
> specifically?
If you're buying new, Supermicro would be my first choice for vendor based on 
experience.
https://www.supermicro.com/en/products/nvme

You said 2.5" bays, which makes me think you have existing drives.
There are models to fit that, but if you're also considering new drives, you 
can get further density in E1/E3

The only caveat is that you will absolutely want to put a better NIC in these 
systems, because 2x10G is easy to saturate with a pile of NVME.

--
Robin Hugh Johnson
Gentoo Linux: Dev, Infra Lead, Foundation President & Treasurer
E-Mail   : robb...@gentoo.org
GnuPG FP : 11ACBA4F 4778E3F6 E4EDF38E B27B944E 34884E85 GnuPG FP : 7D0B3CEB 
E9B85B1F 825BCECF EE05E6F6 A48F6136
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: recommendation for barebones server with 8-12 direct attach NVMe?

2024-01-14 Thread Anthony D'Atri

Agreed, though today either limits one’s choices of manufacturer.

> There are models to fit that, but if you're also considering new drives,
> you can get further density in E1/E3

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: recommendation for barebones server with 8-12 direct attach NVMe?

2024-01-14 Thread Robin H. Johnson
On Fri, Jan 12, 2024 at 02:32:12PM +, Drew Weaver wrote:
> Hello,
> 
> So we were going to replace a Ceph cluster with some hardware we had
> laying around using SATA HBAs but I was told that the only right way
> to build Ceph in 2023 is with direct attach NVMe.
> 
> Does anyone have any recommendation for a 1U barebones server (we just
> drop in ram disks and cpus) with 8-10 2.5" NVMe bays that are direct
> attached to the motherboard without a bridge or HBA for Ceph
> specifically?
If you're buying new, Supermicro would be my first choice for vendor
based on experience.
https://www.supermicro.com/en/products/nvme

You said 2.5" bays, which makes me think you have existing drives.
There are models to fit that, but if you're also considering new drives,
you can get further density in E1/E3

The only caveat is that you will absolutely want to put a better NIC in
these systems, because 2x10G is easy to saturate with a pile of NVME.

-- 
Robin Hugh Johnson
Gentoo Linux: Dev, Infra Lead, Foundation President & Treasurer
E-Mail   : robb...@gentoo.org
GnuPG FP : 11ACBA4F 4778E3F6 E4EDF38E B27B944E 34884E85
GnuPG FP : 7D0B3CEB E9B85B1F 825BCECF EE05E6F6 A48F6136


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: recommendation for barebones server with 8-12 direct attach NVMe?

2024-01-13 Thread Mike O'Connor

On 14/1/2024 1:57 pm, Anthony D'Atri wrote:

The OP is asking about new servers I think.
I was looking at his statement below relating to using hardware laying 
around, just putting out there some options which worked for us.
  
So we were going to replace a Ceph cluster with some hardware we had

laying around using SATA HBAs but I was told that the only right way to
build Ceph in 2023 is with direct attach NVMe.


Cheers

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: recommendation for barebones server with 8-12 direct attach NVMe?

2024-01-13 Thread Anthony D'Atri
The OP is asking about new servers I think.  

> On Jan 13, 2024, at 9:36 PM, Mike O'Connor  wrote:
> 
> Because it's almost impossible to purchase the equipment required to convert 
> old drive bays to u.2 etc.
> 
> The M.2's we purchased are enterprise class.
> 
> Mike
> 
> 
>> On 14/1/2024 12:53 pm, Anthony D'Atri wrote:
>> Why use such a card and M.2 drives that I suspect aren’t enterprise-class? 
>> Instead of U.2, E1.s, or E3.s ?
>> 
 On Jan 13, 2024, at 5:10 AM, Mike O'Connor  wrote:
>>> 
>>> On 13/1/2024 1:02 am, Drew Weaver wrote:
 Hello,
 
 So we were going to replace a Ceph cluster with some hardware we had 
 laying around using SATA HBAs but I was told that the only right way to 
 build Ceph in 2023 is with direct attach NVMe.
 
 Does anyone have any recommendation for a 1U barebones server (we just 
 drop in ram disks and cpus) with 8-10 2.5" NVMe bays that are direct 
 attached to the motherboard without a bridge or HBA for Ceph specifically?
 
 Thanks,
 -Drew
 
 ___
 ceph-users mailing list -- ceph-users@ceph.io
 To unsubscribe send an email to ceph-users-le...@ceph.io
>>> Hi
>>> 
>>> You need to use a PCIe card with a PCIe switch; cards with 4 x M.2 NVMe are 
>>> cheap enough, around USD $180, from AliExpress.
>>> 
>>> There are companies with cards which have many more m.2 ports but the cost 
>>> goes up greatly.
>>> 
>>> We just built a 3x1RU G9 HP cluster with 4 x 2TB M.2 NVMe using dual 40G 
>>> Ethernet ports and dual 10G Ethernet and a second-hand Arista 16 port 40G 
>>> switch.
>>> 
>>> It works really well.
>>> 
>>> Cheers
>>> 
>>> Mike
>>> ___
>>> ceph-users mailing list -- ceph-users@ceph.io
>>> To unsubscribe send an email to ceph-users-le...@ceph.io
> 
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: recommendation for barebones server with 8-12 direct attach NVMe?

2024-01-13 Thread Mike O'Connor
Because it's almost impossible to purchase the equipment required to 
convert old drive bays to u.2 etc.


The M.2's we purchased are enterprise class.

Mike


On 14/1/2024 12:53 pm, Anthony D'Atri wrote:

Why use such a card and M.2 drives that I suspect aren’t enterprise-class? 
Instead of U.2, E1.s, or E3.s ?


On Jan 13, 2024, at 5:10 AM, Mike O'Connor  wrote:

On 13/1/2024 1:02 am, Drew Weaver wrote:

Hello,

So we were going to replace a Ceph cluster with some hardware we had laying 
around using SATA HBAs but I was told that the only right way to build Ceph in 
2023 is with direct attach NVMe.

Does anyone have any recommendation for a 1U barebones server (we just drop in ram 
disks and cpus) with 8-10 2.5" NVMe bays that are direct attached to the 
motherboard without a bridge or HBA for Ceph specifically?

Thanks,
-Drew

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

Hi

You need to use a PCIe card with a PCIe switch; cards with 4 x M.2 NVMe are cheap 
enough, around USD $180, from AliExpress.

There are companies with cards which have many more m.2 ports but the cost goes 
up greatly.

We just built a 3x1RU G9 HP cluster with 4 x 2TB M.2 NVMe using dual 40G 
Ethernet ports and dual 10G Ethernet and a second-hand Arista 16 port 40G switch.

It works really well.

Cheers

Mike
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: recommendation for barebones server with 8-12 direct attach NVMe?

2024-01-13 Thread Anthony D'Atri
Why use such a card and M.2 drives that I suspect aren’t enterprise-class? 
Instead of U.2, E1.s, or E3.s ?

> On Jan 13, 2024, at 5:10 AM, Mike O'Connor  wrote:
> 
> On 13/1/2024 1:02 am, Drew Weaver wrote:
>> Hello,
>> 
>> So we were going to replace a Ceph cluster with some hardware we had laying 
>> around using SATA HBAs but I was told that the only right way to build Ceph 
>> in 2023 is with direct attach NVMe.
>> 
>> Does anyone have any recommendation for a 1U barebones server (we just drop 
>> in ram disks and cpus) with 8-10 2.5" NVMe bays that are direct attached to 
>> the motherboard without a bridge or HBA for Ceph specifically?
>> 
>> Thanks,
>> -Drew
>> 
>> ___
>> ceph-users mailing list -- ceph-users@ceph.io
>> To unsubscribe send an email to ceph-users-le...@ceph.io
> 
> Hi
> 
> You need to use a PCIe card with a PCIe switch; cards with 4 x M.2 NVMe are 
> cheap enough, around USD $180, from AliExpress.
> 
> There are companies with cards which have many more m.2 ports but the cost 
> goes up greatly.
> 
> We just built a 3x1RU G9 HP cluster with 4 x 2TB M.2 NVMe using dual 40G 
> Ethernet ports and dual 10G Ethernet and a second-hand Arista 16 port 40G 
> switch.
> 
> It works really well.
> 
> Cheers
> 
> Mike
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: recommendation for barebones server with 8-12 direct attach NVMe?

2024-01-13 Thread Mike O'Connor

On 13/1/2024 1:02 am, Drew Weaver wrote:

Hello,

So we were going to replace a Ceph cluster with some hardware we had laying 
around using SATA HBAs but I was told that the only right way to build Ceph in 
2023 is with direct attach NVMe.

Does anyone have any recommendation for a 1U barebones server (we just drop in ram 
disks and cpus) with 8-10 2.5" NVMe bays that are direct attached to the 
motherboard without a bridge or HBA for Ceph specifically?

Thanks,
-Drew

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


Hi

You need to use a PCIe card with a PCIe switch; cards with 4 x M.2 NVMe 
are cheap enough, around USD $180, from AliExpress.


There are companies with cards which have many more m.2 ports but the 
cost goes up greatly.


We just built a 3x1RU G9 HP cluster with 4 x 2TB M.2 NVMe using dual 40G 
Ethernet ports and dual 10G Ethernet and a second-hand Arista 16 port 40G 
switch.


It works really well.

Cheers

Mike
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io