Well said, Herr Kraftmayer. - aad
> On Oct 28, 2023, at 4:22 AM, Joachim Kraftmayer - ceph ambassador <joachim.kraftma...@clyso.com> wrote:
>
> Hi,
>
> I know similar requirements, and the motivation and need behind them. We have chosen a clear approach to this, one which also does not make the whole setup too complicated to operate.
>
> 1.) Everything that doesn't require strong consistency we do with other tools, especially when it comes to NVMe, PCIe 5.0 and newer technologies with high IOPS and low latencies.
>
> 2.) Everything that requires high data security, strong consistency and failure domains higher than host we do with Ceph.
>
> Joachim
>
> ___________________________________
> ceph ambassador DACH
> ceph consultant since 2012
>
> Clyso GmbH - Premier Ceph Foundation Member
>
> https://www.clyso.com/
>
> On 27.10.23 at 17:58, Anthony D'Atri wrote:
>> Ceph is all about strong consistency and data durability. There can also be a distinction between the performance of the cluster in aggregate vs. a single client, especially in a virtualization scenario where, to avoid the noisy-neighbor dynamic, you deliberately throttle IOPS and bandwidth per client.
>>
>>> For my discussion I am assuming today's PCIe-based NVMe drives, which are capable of writing about 8 GiB/s, which is about 64 Gbit/s.
>>
>> Written how, though? Benchmarks are sometimes run with 100% sequential workloads, top-SKU CPUs that mortals can't afford, and especially with a queue depth of like 256.
>>
>> With most Ceph deployments, the IO a given drive experiences is often pretty much random and at lower QD. And depending on the drive, significant read traffic may impact write bandwidth to a degree. At ..... Mountpoint (Vancouver BC 2018) someone gave a presentation about the difficulties of saturating NVMe bandwidth.
>>
>>> Now consider the situation that you have 5 nodes, each with 4 of those drives; that will make all small and mid-sized companies go bankrupt ;-) just from buying the corresponding networking switches.
>>
>> Depending where you get your components...
>>
>> * You probably don't need "mixed-use" (~3 DWPD) drives; for most purposes "read intensive" (~1 DWPD) (or less, sometimes) are plenty. But please please please stick with real enterprise-class drives.
>>
>> * Chassis brands mark up their storage (and RAM) quite a bit. You can often get SSDs elsewhere for half of what they cost from your chassis manufacturer.
>>
>>> But the server hardware is still simplistic commodity hardware which can easily saturate any given commodity network hardware. If I want to be able to use the full 64 Gbit/s I would require at least 100 Gbit/s networking, or tons of trunked ports and cabling with lower-bandwidth switches.
>>
>> Throughput and latency are different things, though. Also, are you assuming here the traditional topology of separate public and cluster/private/replication networks? With modern networking (and Ceph releases) that is often overkill and you can leave out the replication network (a minimal example is sketched below). Also, would your clients have the same networking provisioned? If you're
>>
>>> If we now also consider distributing the nodes over racks, buildings at the same location, or distributed datacenters, the costs will be even more painful.
>>
>> Don't you already have multiple racks? They don't need to be dedicated only to Ceph.
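>> For illustration, a rough sketch of the single-network approach. The subnet below is just a placeholder for the example; with no cluster_network defined, OSD replication traffic simply shares the public network:
>>
>>     # /etc/ceph/ceph.conf (fragment, illustrative only)
>>     [global]
>>         # one network for both client and replication traffic
>>         public_network = 10.0.0.0/24
>>         # note: no cluster_network line; OSD-to-OSD replication
>>         # then uses the public network as well
>>
>> On recent releases the same can be set at runtime with "ceph config set global public_network 10.0.0.0/24".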
>>
>>> The ceph commit requirement will be 2 copies on different OSDs (comparable to a mirrored drive) and in total 3 or 4 copies on the cluster (comparable to a RAID with multiple-disk redundancy).
>>
>> Not entirely comparable, but the distinctions mostly don't matter here.
>>
>>> In all our tests so far, we could not control how ceph persists these 2 copies. It will always try to persist them somehow over the network.
>>> Q1: Is this behavior mandatory?
>>
>> It's a question of how important the data is, and how bad it would be to lose some.
>>
>>> Our common workload, and afaik that of nearly all webservice-based applications, is:
>>> - a short burst of high bandwidth (e.g. multiple MiB/s or even GiB/s)
>>> - and probably mostly a 1-write-to-4-reads or even 1:6 ratio in utilizing the cluster
>>
>> QLC might help your costs; look into the D5-P5430, D5-P5366, etc. Though these days if you shop smart you can get TLC for close to the same cost. That won't always be true though, and you can't get a 60TB TLC SKU ;)
>>
>>> Hope I could explain the situation here well enough.
>>> Now assuming my ideal world with ceph, if ceph would do:
>>> 1. commit 2 copies to local drives on the node the ceph client is connected to
>>> 2. after commit, sync (optimized/queued) the data over the network to fulfill the common needs of ceph storage with 4 copies
>>
>> You could, I think, craft a CRUSH rule to do that (a rough sketch follows at the end of this reply). The default for replicated pools, FWIW, is 3 copies, not 4.
>>
>>> 3. maybe optionally move 1 copy away from the initial node which still holds the 2 local copies...
>>
>> I don't know of an elegant way to change placement after the fact.
>>
>>> this behaviour would ensure that:
>>> - the perceived performance for the ceph clients will be the full bandwidth of the local NVMes, since 2 copies are delivered to the local NVMes at 64 Gbit/s, and the latency would be comparable to writing locally
>>> - we would have 2 copies nearly "immediately" reported to any ceph client
>>
>> I was once told that writes return to the client when min_size copies are written; later I was told that it's actually not until all copies are written.
>>
>> But say we could do this. Think about what happens if one of those two local drives -- or the entire server -- dies before any copies are persisted to other servers, or if only one copy is persisted to another server. You risk data loss.
>>
>>> - bandwidth utilization will be optimized, since we do not duplicate the stored data transfers on the network immediately; we defer them from the initial write of the ceph client and can thus make better use of a queuing mechanism
>>
>> Unless you have an unusually random IO pattern, I'm not sure that would affect bandwidth much.
>>
>>> - IMHO the scalability with commodity networking would be far easier to implement, since the networking requirements are factors lower
>>
>> How so? I would think you'd still need the same networking. Also remember that having PCIe lanes and keeping them full are very different things.
>>
>>> Maybe I have a totally wrong understanding of ceph clusters and the data distribution of the copies.
>>> Q2: If so, please let me know where I may read more about this?
>>
>> https://www.amazon.com/Learning-Ceph-scalable-reliable-solution-ebook/dp/B01NBP2D9I
>>
>> ;)
>>
>> You might be able to achieve parts of what you envision here with commercial NVMeoF solutions. When I researched them they tended to have low latency, but some required proprietary hardware. Mostly they defaulted to only 2 replicas and had significant scaling and flexibility limitations. It all depends on what you're solving for.
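>> To make the CRUSH idea above concrete, here is a rough, untested sketch of a rule that puts 2 copies on each of 2 hosts for a size=4 pool. The rule name, id, and the "default" root are just placeholders. Caveat: CRUSH places by cluster topology, not by which node the client happens to run on, so this does not give you client-local writes, and losing a single host takes out two copies at once.
>>
>>     rule two_per_host {
>>         id 10                                 # any unused rule id
>>         type replicated
>>         step take default                     # start at the default root
>>         step choose firstn 2 type host        # pick 2 distinct hosts
>>         step chooseleaf firstn 2 type osd     # then 2 OSDs within each host
>>         step emit
>>     }
>>
>> Roughly: export the CRUSH map with "ceph osd getcrushmap -o cm.bin", decompile with "crushtool -d cm.bin -o cm.txt", add the rule, recompile with "crushtool -c cm.txt -o cm.new", inject it with "ceph osd setcrushmap -i cm.new", then "ceph osd pool set <pool> crush_rule two_per_host" and "ceph osd pool set <pool> size 4". Check the placement with "crushtool --test" before pointing a real pool at it.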
>>
>>> So to bring it quickly down:
>>> Q3: is it possible to configure ceph to behave as in my ideal world described above, i.e. to first write a minimal n copies to local drives and defer the syncing of the other copies over the network?
>>> Q4: if not, are there any plans in this direction?
>>> Q5: if it is possible, is there good documentation for it?
>>> Q6: we would still like to be able to distribute over racks, enclosures and datacenters
>>>
>>> best wishes
>>> Hans
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io