[ceph-users] Re: recommendation for barebones server with 8-12 direct attach NVMe?
> Also in our favour is that the users of the cluster we are currently
> intending for this have established a practice of storing large objects.

That definitely is in your favor.

> but it remains to be seen how 60x 22TB behaves in practice.

Be sure you don't get SMR drives.

> and it's hard for it to rebalance.

^ This.

> What is OLC?

QLC. QLC SSDs store 33% more data than TLC: 4 bits per cell vs 3.

> Fascinating to hear about destroy-redeploy being safer than a simple
> restart-recover!

This was Luminous; that dynamic may be different now, especially with Nautilus async recovery.

> Agreed. I guess I wanted to add the data point that these kinds of clusters
> can and do make full sense in certain contexts, and push a little away from
> "friends don't let friends use HDDs" dogma.

Understood. Some deployments aren't squeezed for DC space -- today. But since many HDD deployments use LFF chassis, the form-factor and interface limitations down the road still complicate expansion and SSD utilization.

> For now, we limit individual cloud volumes to 300 IOPs, doubled for those who
> need it.

I'm curious how many clients / volumes you have vs. the number of HDD OSDs, and whether you're using replication or EC. If you have relatively few clients per HDD, that would definitely improve the dynamic.

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
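As a quick sanity check on the QLC vs TLC numbers: NAND stores N bits per cell by distinguishing 2^N voltage levels, so the capacity gain comes from the bit count, not the level count. A tiny illustrative sketch:

```python
# NAND flash stores N bits per cell by distinguishing 2**N voltage levels.
# TLC = 3 bits/cell, QLC = 4 bits/cell; with the same cell count, capacity
# scales with bits per cell.

def nand_levels(bits_per_cell: int) -> int:
    return 2 ** bits_per_cell

tlc_bits, qlc_bits = 3, 4
capacity_gain = qlc_bits / tlc_bits - 1

print(f"TLC: {nand_levels(tlc_bits)} voltage levels per cell")   # 8
print(f"QLC: {nand_levels(qlc_bits)} voltage levels per cell")   # 16
print(f"QLC capacity gain over TLC: {capacity_gain:.0%}")        # 33%
```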
[ceph-users] Re: recommendation for barebones server with 8-12 direct attach NVMe?
On 16/1/24 11:39, Anthony D'Atri wrote:

> by “RBD for cloud”, do you mean VM / container general-purpose volumes on
> which a filesystem is usually built? Or large archive / backup volumes that
> are read and written sequentially without much concern for latency or
> throughput?

General-purpose volumes for cloud instance filesystems. Performance is not high, but requirements are a moving target, and it performs better than it used to, so decision makers and users are satisfied. If more targeted requirements start to arise, of course architecture and costs will change.

> How many of those ultra-dense chassis in a cluster? Are all 60 off a single
> HBA?

When we deploy prod RGW there it may be 10-20 in a cluster. Yes, there is a single 4-port miniSAS HBA per head node, and one of those for each chassis.

> I’ve experienced RGW clusters built from 4x 90-slot ultra-dense chassis, each
> of which had 2x server trays, so effectively 2x 45-slot chassis bound
> together. The bucket pool was EC 3,2 or 4,2. The motherboard was …. odd, as a
> certain chassis vendor had a thing for at a certain point in time. With only
> 12 DIMM slots each, they were chronically short on RAM and the single HBA was
> a bottleneck. Performance was acceptable for the use-case …. at first. As the
> cluster filled up and got busier, that was no longer the case. And these were
> 8TB capped drives. Not all slots were filled, at least initially. The index
> pool was on separate 1U servers with SATA SSDs.

This sounds similar to our plans, albeit with denser nodes and an NVMe index pool. Also in our favour is that the users of the cluster we are currently intending for this have established a practice of storing large objects.

> There were hotspots, usually relatively small objects that clients hammered
> on. A single OSD restarting and recovering would tank the API; we found it
> better to destroy and redeploy it. Expanding faster than data was coming in
> was a challenge, as we had to throttle the heck out of the backfill to avoid
> rampant slow requests and API impact. QLC with a larger number of OSD node
> failure domains was a net win in that RAS was dramatically increased, and
> expensive engineer-hours weren’t soaked up fighting performance and
> availability issues.

Thank you, this is helpful information. We haven't had that kind of performance concern with our RGW on 24x 14TB nodes, but it remains to be seen how 60x 22TB behaves in practice. Rebalancing is a big consideration, particularly if we have a whole-node failure. We are currently contemplating a PG split, and even more IO, since the growing data volume and subsequent node additions have left us with a low PG/OSD ratio and it's hard for it to rebalance.

What is OLC?

Fascinating to hear about destroy-redeploy being safer than a simple restart-recover!

> ymmv

Agreed. I guess I wanted to add the data point that these kinds of clusters can and do make full sense in certain contexts, and push a little away from "friends don't let friends use HDDs" dogma.

> If spinners work for your purposes and you don’t need IOPs or the ability to
> provision SSDs down the road, more power to you.

I expect our road to be long, and SSD usage will grow as the capital dollars, performance, and TCO metrics change over time. For now, we limit individual cloud volumes to 300 IOPs, doubled for those who need it.
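The low PG/OSD ratio mentioned above comes from a simple back-of-envelope calculation; upstream guidance typically targets on the order of 100 PG replicas per OSD. A hedged sketch with illustrative numbers (not the poster's actual pool layout):

```python
# Back-of-envelope PG-per-OSD ratio, the quantity that gets diluted when
# nodes are added without a PG split. Numbers below are illustrative,
# not taken from the thread.

def pg_replicas_per_osd(pools, num_osds):
    """pools: list of (pg_num, effective_size) pairs, where effective_size
    is the replica count for replication or k+m for erasure coding."""
    total_pg_replicas = sum(pg_num * size for pg_num, size in pools)
    return total_pg_replicas / num_osds

# e.g. one RGW bucket pool with 2048 PGs at EC 4+2 (effective size 6),
# spread over 360 OSDs (six hypothetical 60-bay JBODs):
ratio = pg_replicas_per_osd([(2048, 6)], 360)
print(f"{ratio:.1f} PG replicas per OSD")
# ~34 -- well under the usual ~100 target, which is when a PG split
# (e.g. pg_num 2048 -> 8192) starts to look necessary.
```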
[ceph-users] Re: recommendation for barebones server with 8-12 direct attach NVMe?
> Groovy. Channel drives are IMHO a pain, though in the case of certain
> manufacturers it can be the only way to get firmware updates. Channel drives
> often only have a 3-year warranty, vs 5 for generic drives.

Yes, we have run into this with Kioxia as far as being able to find new firmware. Which manufacturer tends to be the most responsible in this regard, in your view? Not looking for a vendor rec or anything, just specifically for this one particular issue.

Thanks!
-Drew
[ceph-users] Re: recommendation for barebones server with 8-12 direct attach NVMe?
>> NVMe SSDs shouldn’t cost significantly more than SATA SSDs. Hint: certain
>> tier-one chassis manufacturers mark both the fsck up. You can get a better
>> warranty and pricing by buying drives from a VAR.
>
> We stopped buying “Vendor FW” drives a long time ago.

Groovy. Channel drives are IMHO a pain, though in the case of certain manufacturers it can be the only way to get firmware updates. Channel drives often only have a 3-year warranty, vs 5 for generic drives.

> Although when the PowerEdge R750 originally came out they removed the ability
> for the DRAC to monitor the endurance of the non-vendor SSDs to penalize us;
> it took about 6 months of arguing to get them to put that back in.

I've seen a bug on R440s with certain drives around this as well, where a drive was falsely reported as EOL. It's a much better idea to monitor yourself than to trust iDRAC or any other BMC to do this.

>> It’s a trap! Which is to say that the $/GB really isn’t far away, and in
>> fact once you step back to TCO from the unit economics of the drive in
>> isolation, the HDDs often turn out to be *more* expensive.
>
> I suppose depending on what DWPD/endurance you are assuming on the SSDs, but
> also in my very specific case we have PBs of HDDs in inventory, so that costs
> us…no additional money.

Fair enough; my remarks naturally are with respect to net-new acquisitions. OpEx of HDDs is still higher, though.

> My comment on there being more economical NVMe disks available was simply
> that if we are all changing over to NVMe but we don’t right now need to be
> able to move 7GB/s per drive

It's not just about performance; it's about drives that will be available, if any, 5 years from now.

> it would be cool to just stop buying anything with SATA in it and then just
> change out the drives later. Which was kind of the vibe with SATA when SSDs
> were first introduced. Everyone disagrees with me on this point but it
> doesn’t really make sense that you have to choose between SATA or NVMe on a
> system with a backplane.

There are "universal" backplanes that will accept both, but of course you pay more and still need an HBA, even if it's built into the motherboard.

> But yes, I see all of your points as far as if I was trying to build a Ceph
> cluster as primary storage and had a budget for this project. That would
> indeed change everything about my algebra.
>
> Thanks for your time and consideration. I appreciate it.
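On the "monitor endurance yourself rather than trusting the BMC" point: NVMe drives expose a "Percentage Used" field in their SMART/Health log, which `smartctl -A` prints. A minimal sketch of parsing it yourself; the sample output below is illustrative (in practice you would capture real output, e.g. via `subprocess.run(["smartctl", "-A", "/dev/nvme0"])`):

```python
# Parse the NVMe 'Percentage Used' endurance field out of smartctl output,
# instead of trusting iDRAC / a BMC to interpret it for you.
# SAMPLE_SMARTCTL_OUTPUT is illustrative text, not from a real drive.
import re

SAMPLE_SMARTCTL_OUTPUT = """\
=== START OF SMART DATA SECTION ===
SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x00
Temperature:                        35 Celsius
Available Spare:                    100%
Percentage Used:                    3%
Data Units Written:                 12,345,678 [6.32 TB]
"""

def percentage_used(smartctl_text: str) -> int:
    """Return the drive's consumed-endurance percentage (can exceed 100)."""
    m = re.search(r"Percentage Used:\s+(\d+)%", smartctl_text)
    if m is None:
        raise ValueError("no 'Percentage Used' field found")
    return int(m.group(1))

used = percentage_used(SAMPLE_SMARTCTL_OUTPUT)
print(f"endurance consumed: {used}%")
if used >= 90:
    print("WARNING: drive approaching rated endurance")
```

Feeding this into your own monitoring avoids the false-EOL reports described above.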
[ceph-users] Re: recommendation for barebones server with 8-12 direct attach NVMe?
> By HBA I suspect you mean a non-RAID HBA?

Yes, something like the HBA355.

> NVMe SSDs shouldn’t cost significantly more than SATA SSDs. Hint: certain
> tier-one chassis manufacturers mark both the fsck up. You can get a better
> warranty and pricing by buying drives from a VAR.

We stopped buying “Vendor FW” drives a long time ago. Although when the PowerEdge R750 originally came out they removed the ability for the DRAC to monitor the endurance of the non-vendor SSDs to penalize us; it took about 6 months of arguing to get them to put that back in.

> It’s a trap! Which is to say that the $/GB really isn’t far away, and in fact
> once you step back to TCO from the unit economics of the drive in isolation,
> the HDDs often turn out to be *more* expensive.

I suppose that depends on what DWPD/endurance you are assuming on the SSDs, but also, in my very specific case, we have PBs of HDDs in inventory, so that costs us…no additional money.

My comment on there being more economical NVMe disks available was simply that if we are all changing over to NVMe, but we don't right now need to be able to move 7GB/s per drive, it would be cool to just stop buying anything with SATA in it and then just change out the drives later. Which was kind of the vibe with SATA when SSDs were first introduced. Everyone disagrees with me on this point, but it doesn't really make sense that you have to choose between SATA and NVMe on a system with a backplane.

But yes, I see all of your points as far as if I were trying to build a Ceph cluster as primary storage and had a budget for this project. That would indeed change everything about my algebra.

Thanks for your time and consideration. I appreciate it.
[ceph-users] Re: recommendation for barebones server with 8-12 direct attach NVMe?
by “RBD for cloud”, do you mean VM / container general-purpose volumes on which a filesystem is usually built? Or large archive / backup volumes that are read and written sequentially without much concern for latency or throughput?

How many of those ultra-dense chassis in a cluster? Are all 60 off a single HBA?

I’ve experienced RGW clusters built from 4x 90-slot ultra-dense chassis, each of which had 2x server trays, so effectively 2x 45-slot chassis bound together. The bucket pool was EC 3,2 or 4,2. The motherboard was …. odd, as a certain chassis vendor had a thing for at a certain point in time. With only 12 DIMM slots each, they were chronically short on RAM, and the single HBA was a bottleneck. Performance was acceptable for the use-case …. at first. As the cluster filled up and got busier, that was no longer the case. And these were 8TB capped drives. Not all slots were filled, at least initially. The index pool was on separate 1U servers with SATA SSDs.

There were hotspots, usually relatively small objects that clients hammered on. A single OSD restarting and recovering would tank the API; we found it better to destroy and redeploy it. Expanding faster than data was coming in was a challenge, as we had to throttle the heck out of the backfill to avoid rampant slow requests and API impact. QLC with a larger number of OSD-node failure domains was a net win in that RAS was dramatically increased, and expensive engineer-hours weren’t soaked up fighting performance and availability issues.

ymmv, especially if one’s organization has unreasonably restrictive purchasing policies, row after row of empty DC racks, etc.

I’ve suffered LFF spinners -- just 3 / 4 TB -- misused for OpenStack Cinder and Glance. Filestore with (wince) colocated journals * with 3R pools -- EC for RBD was not yet a thing, else we would have been forced to make it even worse. The stated goal of the person who spec'd the hardware was for every instance to have the performance of its own 5400 RPM HDD. Three fallacies there: 1) that anyone would consider that acceptable, 2) that it would be sustainable during heavy usage or backfill/recovery, and especially 3) that 450 / 3 = 2000. It was just miserable. I suspect that your use-case is different. If spinners work for your purposes and you don’t need IOPs or the ability to provision SSDs down the road, more power to you.

* Which tickled a certain HDD mfg’s design flaws in a manner that substantially risked data availability and durability, in turn directly costing the organization substantial user dissatisfaction and hundreds of thousands of dollars.

> These kinds of statements make me at least ask questions. Dozens of 14TB HDDs
> have worked reasonably well for us for four years of RBD for cloud, and
> hundreds of 16TB HDDs have satisfied our requirements for two years of RGW
> operations, such that we are deploying 22TB HDDs in the next batch. It
> remains to be seen how well 60-disk SAS-attached JBOD chassis work, but we
> believe we have an effective use case.
[ceph-users] Re: recommendation for barebones server with 8-12 direct attach NVMe?
On 12/1/24 22:32, Drew Weaver wrote:

> So we were going to replace a Ceph cluster with some hardware we had laying
> around using SATA HBAs but I was told that the only right way to build Ceph
> in 2023 is with direct attach NVMe.

These kinds of statements make me at least ask questions. Dozens of 14TB HDDs have worked reasonably well for us for four years of RBD for cloud, and hundreds of 16TB HDDs have satisfied our requirements for two years of RGW operations, such that we are deploying 22TB HDDs in the next batch. It remains to be seen how well 60-disk SAS-attached JBOD chassis work, but we believe we have an effective use case.
[ceph-users] Re: recommendation for barebones server with 8-12 direct attach NVMe?
>> Now that you say it's just backups/archival, QLC might be excessive for you
>> (or a great fit if the backups are churned often).

PLC isn’t out yet, though, and probably won’t have a conventional block interface.

> USD70/TB is the best public large-NVMe pricing I'm aware of presently, for
> QLC 30TB drives. Smaller-capacity drives do get down to USD50/TB.
> 2.5" SATA spinning disk is USD20-30/TB.

2.5” spinners top out at 5TB last I checked, and a certain chassis vendor only resells half that capacity. But as I’ve written, *drive* unit economics are myopic. We don’t run palletloads of drives, we run *servers* with drive bays, admin overhead, switch ports, etc., that take up RUs, eat amps, and fart out watts.

> PCIe bandwidth: this goes for NVMe as well as SATA/SAS. I won't name the
> vendor, but I saw a weird NVMe server with 50+ drive slots. Each drive slot
> was x4 lane width but had a number of PCIe expanders in the path from the
> motherboard, so if you were trying to max it out, simultaneously using all
> the drives, each drive only got ~1.7 usable PCIe 4.0 lanes.

I’ve seen a 2U server with … 102 IIRC E1.L bays, but it was only Gen3.

> Compare that to the Supermicro servers I suggested: the AMD variants use an
> H13SSF motherboard, which provides 64x PCIe 5.0 lanes, split into 32x E3.S
> drive slots, and each drive slot has 4x PCIe 4.0, no over-subscription.

Having the lanes and filling them are two different things, though.
[ceph-users] Re: recommendation for barebones server with 8-12 direct attach NVMe?
>> So we were going to replace a Ceph cluster with some hardware we had laying
>> around using SATA HBAs but I was told that the only right way to build Ceph
>> in 2023 is with direct attach NVMe.

My impressions are somewhat different:

* Nowadays it is rather more difficult to find 2.5in SAS or SATA "enterprise" SSDs than most NVMe types. NVMe as a host bus also has much greater bandwidth than SAS or SATA, but Ceph is mostly about IOPS rather than single-device bandwidth. So in general, willing or less willing, one has got to move to NVMe.

* Ceph was designed (and most people have forgotten it) for many small-capacity 1-OSD cheap servers, and lots of them, but unfortunately it is not easy to find small, cheap "enterprise" SSD servers. In part because many people rather unwisely use capacity per server-price as a figure of merit, most NVMe servers have many slots, which means either RAID-ing devices into a small number of large OSDs, which goes against all Ceph stands for, or running many OSD daemons on one system, which works-ish but is not best.

>> Does anyone have any recommendation for a 1U barebones server (we just drop
>> in ram, disks and cpus) with 8-10 2.5" NVMe bays that are direct attached to
>> the motherboard without a bridge or HBA, for Ceph specifically?

> If you're buying new, Supermicro would be my first choice for vendor based on
> experience. https://www.supermicro.com/en/products/nvme

Indeed, SuperMicro does them fairly well, and there are also GigaByte, and Tyan I think; I have not yet seen Intel-based models.

> You said 2.5" bays, which makes me think you have existing drives. There are
> models to fit that, but if you're also considering new drives, you can get
> further density in E1/E3

BTW "NVMe" is a bus specification (something not too different from SCSI-over-PCIe), and there are several different physical specifications, like 2.5in U.2 (SFF-8639), 2.5in U.3 (SFF-TA-1001), and various types of EDSFF (SFF-TA-1006,7,8). U.3 is still difficult to find, but its connector supports SATA, SAS and NVMe U.2. I have not yet seen EDSFF boxes actually available retail without enormous delivery times; I guess the big internet companies buy all the available production.

https://nvmexpress.org/wp-content/uploads/Session-4-NVMe-Form-Factors-Developer-Day-SSD-Form-Factors-v8.pdf
https://media.kingston.com/kingston/content/ktc-content-nvme-general-ssd-form-factors-graph-en-3.jpg
https://media.kingston.com/kingston/pdf/ktc-article-understanding-ssd-technology-en.pdf
https://www.snia.org/sites/default/files/SSSI/OCP%20EDSFF%20JM%20Hands.pdf

> The only caveat is that you will absolutely want to put a better NIC in these
> systems, because 2x10G is easy to saturate with a pile of NVMe.

That's one reason why Ceph was designed for many small 1-OSD servers (ideally distributed across several racks) :-). Note: to maximize chances of many-to-many traffic instead of many-to-one. Anyhow, Ceph again is all about lots of IOPS more than bandwidth, but if you need bandwidth, nowadays many 10Gb NICs support 25Gb/s too, and 40Gb/s and 100Gb/s are no longer that expensive (but the cables are horrible).
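The "2x10G is easy to saturate" caveat is simple arithmetic: compare aggregate drive bandwidth to usable NIC bandwidth. A rough sketch with illustrative drive figures (a modest ~2 GB/s per NVMe drive; real drives vary widely):

```python
# Rough check of how quickly a pile of NVMe outruns common NIC configurations.
# Per-drive throughput and the 0.95 protocol-efficiency factor are
# illustrative assumptions, not measurements.

def gbit_to_gbyte(gbit: float, efficiency: float = 0.95) -> float:
    """Convert line-rate Gbit/s to approximate usable GB/s."""
    return gbit / 8 * efficiency

drives = 10
per_drive_read_gbs = 2.0                 # assumed sequential read, GB/s
aggregate = drives * per_drive_read_gbs  # 20 GB/s of raw drive bandwidth

for nic_gbit in (2 * 10, 2 * 25, 2 * 100):   # bonded NIC pairs, Gbit/s
    usable = gbit_to_gbyte(nic_gbit)
    verdict = "saturated" if aggregate > usable else "ok"
    print(f"2x{nic_gbit // 2}G: {usable:5.2f} GB/s usable vs "
          f"{aggregate:.0f} GB/s of NVMe -> {verdict}")
```

Even 2x25G falls well short of ten such drives, which is why the advice above is to budget for a better NIC in any dense NVMe node.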
[ceph-users] Re: recommendation for barebones server with 8-12 direct attach NVMe?
On Mon, Jan 15, 2024 at 03:21:11PM +, Drew Weaver wrote:

> Oh, well what I was going to do was just use SATA HBAs on PowerEdge R740s
> because we don't really care about performance as this is just used as a copy
> point for backups/archival but the current Ceph cluster we have [which is
> based on HDDs attached to Dell RAID controllers with each disk in RAID-0 and
> works just fine for us] is on EL7 and that is going to be EOL soon. So I
> thought it might be better on the new cluster to use HBAs instead of having
> the OSDs just be single-disk RAID-0 volumes because I am pretty sure that's
> the least good scenario whether or not it has been working for us for like 8
> years now.
>
> So I asked on the list for recommendations and also read on the website and
> it really sounds like the only "right way" to run Ceph is by directly
> attaching disks to a motherboard. I had thought that HBAs were okay before
> but I am probably confusing that with ZFS/BSD or some other equally
> hyperspecific requirement. The other note was about how using NVMe seems to
> be the only right way now too.
>
> I would've rather just stuck to SATA but I figured if I was going to have to
> buy all new servers that direct attach the SATA ports right off the
> motherboards to a backplane I may as well do it with NVMe (even though the
> price of the media will be a lot higher).
>
> It would be cool if someone made NVMe drives that were cost competitive and
> had similar performance to hard drives (meaning, not super expensive but not
> lightning fast either) because the $/GB on datacenter NVMe drives like
> Kioxia, etc is still pretty far away from what it is for HDDs (obviously).

I think, as a collective, the mailing list didn't do enough to ask about your use case for the Ceph cluster earlier in the thread. Now that you say it's just backups/archival, QLC might be excessive for you (or a great fit if the backups are churned often).

USD70/TB is the best public large-NVMe pricing I'm aware of presently, for QLC 30TB drives. Smaller-capacity drives do get down to USD50/TB. 2.5" SATA spinning disk is USD20-30/TB. All of those are much higher than the USD15-20/TB for 3.5" spinning disk made for 24/7 operation.

Maybe it would also help as a community to explain the "why" behind the perceptions of the "right way". It's a tradeoff in what you're doing; you don't want to bottleneck/saturate critical parts of the system.

PCIe bandwidth: this goes for NVMe as well as SATA/SAS. I won't name the vendor, but I saw a weird NVMe server with 50+ drive slots. Each drive slot was x4 lane width but had a number of PCIe expanders in the path from the motherboard, so if you were trying to max it out, simultaneously using all the drives, each drive only got ~1.7 usable PCIe 4.0 lanes.

Compare that to the Supermicro servers I suggested: the AMD variants use an H13SSF motherboard, which provides 64x PCIe 5.0 lanes, split into 32x E3.S drive slots, and each drive slot has 4x PCIe 4.0, with no over-subscription.

On that same Supermicro system, how do you get the data out? There are two PCIe 5.0 x16 slots for your network cards, so you only need to saturate at most HALF the drives to saturate the network.

Taking this back to the SATA/SAS servers: if you had a 16-port HBA with only PCIe 2.0 x8, that's a theoretical max of 4GB/sec. Say you filled it with Samsung QVO drives and efficiently used them at 560MB/sec each: the drives can collectively reach almost 9GB/sec. => probably worthwhile to buy a better HBA.

On the HBA side, some of the controllers, in any RAID mode (including single-disk RAID0), cannot handle saturating every port at the same time: the little CPU is just doing too much work. Those same controllers in a passthrough/IT mode are fine, because the CPU doesn't do the work anymore.

This turned out more rambling than I intended, but how can we capture the 'why' of the recommendations into something usable by the community, and have everybody be able to read it (esp. for those that don't want to engage on a mailing list)?

--
Robin Hugh Johnson
Gentoo Linux: Dev, Infra Lead, Foundation President & Treasurer
E-Mail : robb...@gentoo.org
GnuPG FP : 11ACBA4F 4778E3F6 E4EDF38E B27B944E 34884E85
GnuPG FP : 7D0B3CEB E9B85B1F 825BCECF EE05E6F6 A48F6136
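The 16-port HBA example above can be worked through as arithmetic, using approximate usable per-lane PCIe bandwidth:

```python
# Oversubscription of a 16-port SAS/SATA HBA on a PCIe 2.0 x8 host link
# vs. 16 SATA SSDs at ~560 MB/s each (the numbers from the message above).

# Approximate usable GB/s per lane by PCIe generation (after encoding overhead)
PCIE_GBS_PER_LANE = {2: 0.5, 3: 0.985, 4: 1.969, 5: 3.938}

def hba_ceiling(gen: int, lanes: int) -> float:
    """Theoretical host-link throughput of the HBA, in GB/s."""
    return PCIE_GBS_PER_LANE[gen] * lanes

host_link = hba_ceiling(gen=2, lanes=8)   # 4.0 GB/s
drive_aggregate = 16 * 0.560              # ~8.96 GB/s

print(f"HBA host link:    {host_link:.1f} GB/s")
print(f"16 drives total:  {drive_aggregate:.2f} GB/s")
print(f"oversubscription: {drive_aggregate / host_link:.1f}x")  # ~2.2x
```

At roughly 2.2x oversubscription, the drives can collectively deliver more than twice what the host link can carry, hence "probably worthwhile to buy a better HBA".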
[ceph-users] Re: recommendation for barebones server with 8-12 direct attach NVMe?
> Oh, well what I was going to do was just use SATA HBAs on PowerEdge R740s
> because we don't really care about performance

That is important context.

> as this is just used as a copy point for backups/archival but the current
> Ceph cluster we have [which is based on HDDs attached to Dell RAID
> controllers with each disk in RAID-0 and works just fine for us]

The H330? You can set a passthrough / JBOD / HBA personality and avoid the RAID-0 dance.

> is on EL7 and that is going to be EOL soon. So I thought it might be better
> on the new cluster to use HBAs instead of having the OSDs just be single-disk
> RAID-0 volumes because I am pretty sure that's the least good scenario
> whether or not it has been working for us for like 8 years now.

See above.

> So I asked on the list for recommendations and also read on the website and
> it really sounds like the only "right way" to run Ceph is by directly
> attaching disks to a motherboard.

That isn’t quite what I meant. If one is speccing out *new* hardware:

* HDDs are a false economy
* SATA / SAS SSDs hobble performance for little or no cost savings over NVMe
* RAID HBAs are fussy and a waste of money in 2023

> I had thought that HBAs were okay before

By HBA I suspect you mean a non-RAID HBA?

> but I am probably confusing that with ZFS/BSD or some other equally
> hyperspecific requirement.

ZFS indeed prefers as little as possible between it and the drives. The benefits for Ceph are not identical, but very congruent.

> The other note was about how using NVMe seems to be the only right way now
> too.

If we predicate that HDDs are a dead end, then that leaves us with SAS/SATA SSD vs NVMe SSD. SAS is all but dead and carries a price penalty. SATA SSDs are steadily declining in the market; 5-10 years from now I suspect that no more than one manufacturer of enterprise-class SATA SSDs will remain. The future is PCIe. SATA SSDs don’t save any money over NVMe SSDs, and additionally require some sort of HBA, be it an add-in card or on the motherboard. SATA and NVMe SSDs use the same NAND, just with a different interface.

> I would've rather just stuck to SATA but I figured if I was going to have to
> buy all new servers that direct attach the SATA ports right off the
> motherboards to a backplane

On-board SATA chips may be relatively weak, but I don’t know much about current implementations.

> I may as well do it with NVMe (even though the price of the media will be a
> lot higher).

NVMe SSDs shouldn’t cost significantly more than SATA SSDs. Hint: certain tier-one chassis manufacturers mark both the fsck up. You can get a better warranty and pricing by buying drives from a VAR.

> It would be cool if someone made NVMe drives that were cost competitive and
> had similar performance to hard drives (meaning, not super expensive but not
> lightning fast either) because the $/GB on datacenter NVMe drives like
> Kioxia, etc. is still pretty far away from what it is for HDDs (obviously).

It’s a trap! Which is to say that the $/GB really isn’t far away, and in fact once you step back from the unit economics of the drive in isolation to TCO, the HDDs often turn out to be *more* expensive. Pore through this:

https://www.snia.org/forums/cmsi/programs/TCOcalc

* $/IOPS are higher for any HDD compared to NAND

* HDDs are available up to what, 22TB these days? With the same tired SATA interface as when they were 2TB. That’s rather a bottleneck. We see HDD clusters limiting themselves to 8-10TB HDDs all the time; in fact AIUI RHCS stipulates no larger than 10TB. Feed that into the equation and the TCO changes a bunch

* HDDs not only hobble steady-state performance, but under duress -- expansion, component failure, etc. -- the impact to client operations will be higher, and recovery to desired redundancy will take much longer. I’ve seen a cluster -- especially when using EC -- take *4 weeks* to weight an 8TB HDD OSD up or down. Consider the operational cost and risk of that. The SNIA calc has a performance multiplier that accounts for this.

* A SATA chassis is stuck with SATA; 5-10 years from now that will be increasingly limiting, especially if you go with LFF drives

* RUs cost money. A 1U LFF server can hold what, at most 88TB raw when using HDDs? With 60TB SSDs (*) one can fit 600TB of raw space into the same RU.

* If they meet your needs

> Anyway thanks.
> -Drew
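The rack-unit density point above can be put in numbers, using the figures from the message (88TB of HDD vs 600TB of SSD per 1U); the 10PB target below is an illustrative assumption:

```python
# Raw capacity per rack unit, using the thread's figures:
# a 1U LFF server maxes out around 88TB of HDD, while 10x 60TB NVMe
# reaches 600TB in the same RU.

def tb_per_ru(drives_per_u: int, tb_per_drive: float) -> float:
    return drives_per_u * tb_per_drive

hdd = tb_per_ru(4, 22)     # 4x 22TB LFF HDD  -> 88 TB/RU
nvme = tb_per_ru(10, 60)   # 10x 60TB NVMe    -> 600 TB/RU
print(f"HDD : {hdd:.0f} TB/RU")
print(f"NVMe: {nvme:.0f} TB/RU  ({nvme / hdd:.1f}x denser)")

# Fewer RUs means fewer chassis, NICs, switch ports, and amps for the
# same raw capacity -- the point about drive-only $/GB being myopic.
target_pb_raw = 10_000  # 10 PB raw, an illustrative fleet size
print(f"10PB raw needs ~{target_pb_raw / hdd:.0f} RU of HDD"
      f" vs ~{target_pb_raw / nvme:.0f} RU of NVMe")
```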
[ceph-users] Re: recommendation for barebones server with 8-12 direct attach NVMe?
Oh, well what I was going to do was just use SATA HBAs on PowerEdge R740s because we don't really care about performance, as this is just used as a copy point for backups/archival; but the current Ceph cluster we have [which is based on HDDs attached to Dell RAID controllers with each disk in RAID-0 and works just fine for us] is on EL7, and that is going to be EOL soon. So I thought it might be better on the new cluster to use HBAs instead of having the OSDs just be single-disk RAID-0 volumes, because I am pretty sure that's the least good scenario, whether or not it has been working for us for like 8 years now.

So I asked on the list for recommendations and also read on the website, and it really sounds like the only "right way" to run Ceph is by directly attaching disks to a motherboard. I had thought that HBAs were okay before, but I am probably confusing that with ZFS/BSD or some other equally hyperspecific requirement. The other note was about how using NVMe seems to be the only right way now too.

I would've rather just stuck to SATA, but I figured if I was going to have to buy all new servers that direct attach the SATA ports right off the motherboards to a backplane, I may as well do it with NVMe (even though the price of the media will be a lot higher).

It would be cool if someone made NVMe drives that were cost competitive and had similar performance to hard drives (meaning, not super expensive but not lightning fast either), because the $/GB on datacenter NVMe drives like Kioxia, etc. is still pretty far away from what it is for HDDs (obviously).

Anyway, thanks.
-Drew

-----Original Message-----
From: Robin H. Johnson
Sent: Sunday, January 14, 2024 5:00 PM
To: ceph-users@ceph.io
Subject: [ceph-users] Re: recommendation for barebones server with 8-12 direct attach NVMe?

On Fri, Jan 12, 2024 at 02:32:12PM +, Drew Weaver wrote:
> Hello,
>
> So we were going to replace a Ceph cluster with some hardware we had
> laying around using SATA HBAs but I was told that the only right way
> to build Ceph in 2023 is with direct attach NVMe.
>
> Does anyone have any recommendation for a 1U barebones server (we just
> drop in ram, disks and cpus) with 8-10 2.5" NVMe bays that are direct
> attached to the motherboard without a bridge or HBA for Ceph
> specifically?

If you're buying new, Supermicro would be my first choice for vendor based on experience. https://www.supermicro.com/en/products/nvme

You said 2.5" bays, which makes me think you have existing drives. There are models to fit that, but if you're also considering new drives, you can get further density in E1/E3.

The only caveat is that you will absolutely want to put a better NIC in these systems, because 2x10G is easy to saturate with a pile of NVMe.

--
Robin Hugh Johnson
Gentoo Linux: Dev, Infra Lead, Foundation President & Treasurer
E-Mail : robb...@gentoo.org
GnuPG FP : 11ACBA4F 4778E3F6 E4EDF38E B27B944E 34884E85
GnuPG FP : 7D0B3CEB E9B85B1F 825BCECF EE05E6F6 A48F6136
[ceph-users] Re: recommendation for barebones server with 8-12 direct attach NVMe?
Agreed, though today either limits one’s choices of manufacturer.

> There are models to fit that, but if you're also considering new drives,
> you can get further density in E1/E3
[ceph-users] Re: recommendation for barebones server with 8-12 direct attach NVMe?
On 14/1/2024 1:57 pm, Anthony D'Atri wrote:
> The OP is asking about new servers I think.

I was looking at his statement below relating to using hardware laying around, just putting out there some options which worked for us.

> So we were going to replace a Ceph cluster with some hardware we had
> laying around using SATA HBAs but I was told that the only right way
> to build Ceph in 2023 is with direct attach NVMe.

Cheers
[ceph-users] Re: recommendation for barebones server with 8-12 direct attach NVMe?
The OP is asking about new servers I think.

> On Jan 13, 2024, at 9:36 PM, Mike O'Connor wrote:
>
> Because it's almost impossible to purchase the equipment required to convert
> old drive bays to U.2 etc.
>
> The M.2's we purchased are enterprise class.
>
> Mike
>
>> On 14/1/2024 12:53 pm, Anthony D'Atri wrote:
>> Why use such a card and M.2 drives that I suspect aren’t enterprise-class?
>> Instead of U.2, E1.s, or E3.s?
>>
>>> On Jan 13, 2024, at 5:10 AM, Mike O'Connor wrote:
>>>
>>> On 13/1/2024 1:02 am, Drew Weaver wrote:
>>>> Hello,
>>>>
>>>> So we were going to replace a Ceph cluster with some hardware we had
>>>> laying around using SATA HBAs but I was told that the only right way
>>>> to build Ceph in 2023 is with direct attach NVMe.
>>>>
>>>> Does anyone have any recommendation for a 1U barebones server (we just
>>>> drop in ram disks and cpus) with 8-10 2.5" NVMe bays that are direct
>>>> attached to the motherboard without a bridge or HBA for Ceph
>>>> specifically?
>>>>
>>>> Thanks,
>>>> -Drew
>>>
>>> Hi
>>>
>>> You need to use a PCIe card with a PCIe switch; cards with 4 x M.2 NVMe
>>> are cheap enough, around $USD180 from AliExpress.
>>>
>>> There are companies with cards which have many more M.2 ports but the
>>> cost goes up greatly.
>>>
>>> We just built a 3x 1RU G9 HP cluster with 4 x 2T M.2 NVMe using dual 40G
>>> Ethernet ports and dual 10G Ethernet and a second-hand Arista 16-port
>>> 40G switch.
>>>
>>> It works really well.
>>>
>>> Cheers
>>>
>>> Mike
[ceph-users] Re: recommendation for barebones server with 8-12 direct attach NVMe?
Because it's almost impossible to purchase the equipment required to convert old drive bays to U.2 etc.

The M.2's we purchased are enterprise class.

Mike

On 14/1/2024 12:53 pm, Anthony D'Atri wrote:
> Why use such a card and M.2 drives that I suspect aren’t enterprise-class?
> Instead of U.2, E1.s, or E3.s?
>
>> On Jan 13, 2024, at 5:10 AM, Mike O'Connor wrote:
>>
>> On 13/1/2024 1:02 am, Drew Weaver wrote:
>>> Hello,
>>>
>>> So we were going to replace a Ceph cluster with some hardware we had
>>> laying around using SATA HBAs but I was told that the only right way to
>>> build Ceph in 2023 is with direct attach NVMe.
>>>
>>> Does anyone have any recommendation for a 1U barebones server (we just
>>> drop in ram disks and cpus) with 8-10 2.5" NVMe bays that are direct
>>> attached to the motherboard without a bridge or HBA for Ceph
>>> specifically?
>>>
>>> Thanks,
>>> -Drew
>>
>> Hi
>>
>> You need to use a PCIe card with a PCIe switch; cards with 4 x M.2 NVMe
>> are cheap enough, around $USD180 from AliExpress.
>>
>> There are companies with cards which have many more M.2 ports but the
>> cost goes up greatly.
>>
>> We just built a 3x 1RU G9 HP cluster with 4 x 2T M.2 NVMe using dual 40G
>> Ethernet ports and dual 10G Ethernet and a second-hand Arista 16-port 40G
>> switch.
>>
>> It works really well.
>>
>> Cheers
>>
>> Mike
[ceph-users] Re: recommendation for barebones server with 8-12 direct attach NVMe?
Why use such a card and M.2 drives that I suspect aren’t enterprise-class? Instead of U.2, E1.s, or E3.s?

> On Jan 13, 2024, at 5:10 AM, Mike O'Connor wrote:
>
> On 13/1/2024 1:02 am, Drew Weaver wrote:
>> Hello,
>>
>> So we were going to replace a Ceph cluster with some hardware we had
>> laying around using SATA HBAs but I was told that the only right way to
>> build Ceph in 2023 is with direct attach NVMe.
>>
>> Does anyone have any recommendation for a 1U barebones server (we just
>> drop in ram disks and cpus) with 8-10 2.5" NVMe bays that are direct
>> attached to the motherboard without a bridge or HBA for Ceph specifically?
>>
>> Thanks,
>> -Drew
>
> Hi
>
> You need to use a PCIe card with a PCIe switch; cards with 4 x M.2 NVMe are
> cheap enough, around $USD180 from AliExpress.
>
> There are companies with cards which have many more M.2 ports but the cost
> goes up greatly.
>
> We just built a 3x 1RU G9 HP cluster with 4 x 2T M.2 NVMe using dual 40G
> Ethernet ports and dual 10G Ethernet and a second-hand Arista 16-port 40G
> switch.
>
> It works really well.
>
> Cheers
>
> Mike
[ceph-users] Re: recommendation for barebones server with 8-12 direct attach NVMe?
On 13/1/2024 1:02 am, Drew Weaver wrote:
> Hello,
>
> So we were going to replace a Ceph cluster with some hardware we had
> laying around using SATA HBAs but I was told that the only right way to
> build Ceph in 2023 is with direct attach NVMe.
>
> Does anyone have any recommendation for a 1U barebones server (we just
> drop in ram disks and cpus) with 8-10 2.5" NVMe bays that are direct
> attached to the motherboard without a bridge or HBA for Ceph specifically?
>
> Thanks,
> -Drew

Hi

You need to use a PCIe card with a PCIe switch; cards with 4 x M.2 NVMe are cheap enough, around $USD180 from AliExpress.

There are companies with cards which have many more M.2 ports but the cost goes up greatly.

We just built a 3x 1RU G9 HP cluster with 4 x 2T M.2 NVMe using dual 40G Ethernet ports and dual 10G Ethernet and a second-hand Arista 16-port 40G switch.

It works really well.

Cheers

Mike
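[Editor's note: the reason those quad-M.2 cards carry a PCIe switch chip can be sketched with simple lane/link arithmetic. The widths below are assumed-typical values (x4 per M.2 device), not figures from the thread.]

```python
# Why quad-M.2 carrier cards need a PCIe switch (or slot bifurcation):
# a physical slot is a single PCIe link, but four M.2 NVMe drives are
# four separate x4 endpoints. Widths here are assumed-typical.
LANES_PER_M2 = 4
M2_SLOTS = 4
downstream_lanes = LANES_PER_M2 * M2_SLOTS  # 16 lanes on the drive side

links_needed = M2_SLOTS            # one link per NVMe endpoint
slot_links = 1                     # one slot = one link to the host
board_bifurcates_x4x4x4x4 = False  # often unsupported on older boards
                                   # (e.g. the G9-era HP mentioned above)

needs_switch = links_needed > slot_links and not board_bifurcates_x4x4x4x4
print(downstream_lanes, needs_switch)
```

If the motherboard can bifurcate the slot into x4/x4/x4/x4, a cheap passive card suffices; otherwise the ~$180 switch-equipped cards are the workaround.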