Well said, Herr Kraftmayer.

- aad

> On Oct 28, 2023, at 4:22 AM, Joachim Kraftmayer - ceph ambassador 
> <joachim.kraftma...@clyso.com> wrote:
> 
> Hi,
> 
> I'm familiar with similar requirements and the motivation and need behind them.
> We have chosen a clear approach to this, one that also keeps the whole 
> setup from becoming too complicated to operate.
> 1.) Everything that doesn't require strong consistency we do with other 
> tools, especially when it comes to NVMe, PCIe 5.0 and newer technologies 
> with high IOPS and low latencies.
> 
> 2.) Everything that requires high data security, strong consistency and 
> failure domains above the host level we do with Ceph.
> 
> Joachim
> 
> ___________________________________
> ceph ambassador DACH
> ceph consultant since 2012
> 
> Clyso GmbH - Premier Ceph Foundation Member
> 
> https://www.clyso.com/
> 
> On 27.10.23 at 17:58, Anthony D'Atri wrote:
>> Ceph is all about strong consistency and data durability.  There can also be 
>> a distinction between the performance of the cluster in aggregate and that of 
>> a single client, especially in a virtualization scenario where, to avoid the 
>> noisy-neighbor dynamic, you deliberately throttle IOPS and bandwidth per 
>> client.
>> 
>>> For my discussion I am assuming today's PCIe-based NVMe drives, which are 
>>> capable of writing about 8 GiB/s, which is roughly 64 Gbit/s.
>> Written how, though?  Benchmark numbers are often produced with 100% 
>> sequential workloads, top-SKU CPUs that mortals can't afford, and especially 
>> with a queue depth of something like 256.
>> 
>> With most Ceph deployments, the IO a given drive experiences is often pretty 
>> much random and at a lower queue depth.  And depending on the drive, 
>> significant read traffic may impact write bandwidth to a degree.  At ..... 
>> Mountpoint (Vancouver BC, 2018) someone gave a presentation about the 
>> difficulty of saturating NVMe bandwidth.
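>> 
>> For a rough feel for the gap (the device path and parameters here are only 
>> placeholders, and running fio against a raw device destroys its data, so use 
>> a scratch drive), something like:
>> 
>>     # spec-sheet style: large sequential writes at a deep queue
>>     fio --name=seqwrite --filename=/dev/nvme0n1 --ioengine=libaio --direct=1 \
>>         --rw=write --bs=128k --iodepth=256 --runtime=60 --time_based
>> 
>>     # closer to what an OSD often sees: small random writes, modest queue depth
>>     fio --name=randwrite --filename=/dev/nvme0n1 --ioengine=libaio --direct=1 \
>>         --rw=randwrite --bs=4k --iodepth=4 --runtime=60 --time_based
>> 
>> The second number is usually the one that matters for Ceph sizing.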
>> 
>>> Now consider the situation where you have 5 nodes, each with 4 of those 
>>> drives;
>>> that will drive all small and mid-sized companies into bankruptcy ;-) just 
>>> from buying the corresponding network switches.
>> Depending on where you get your components...
>> 
>> * You probably don't need "mixed-use" (~3 DWPD) drives; for most purposes 
>> "read-intensive" (~1 DWPD, or sometimes less) drives are plenty.  But please 
>> please please stick with real enterprise-class drives.
>> 
>> * Chassis brands mark up their storage (and RAM) quite a bit.  You can often 
>> get SSDs elsewhere for half of what they cost from your chassis manufacturer.
>> 
>>>   But the server hardware is still simple commodity hardware, which can 
>>> easily saturate any given commodity network hardware.
>>> If I want to be able to use the full 64 Gbit/s I would require at least 
>>> 100 Gbit/s networking, or tons of trunked ports and cabling with 
>>> lower-bandwidth switches.
>> Throughput and latency are different things, though.  Also, are you assuming 
>> here the traditional topology of separate public and 
>> cluster/private/replication networks?  With modern networking (and Ceph 
>> releases) that is often overkill and you can leave out the replication 
>> network.
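>> 
>> As a minimal sketch (the subnet is made up), a single-network ceph.conf would 
>> just be:
>> 
>>     [global]
>>         public_network = 10.0.0.0/24
>>         # no cluster_network entry: replication traffic shares the public network
>> 
>> The same option can also be set at runtime via "ceph config set global 
>> public_network <subnet>".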
>> 
>> Also, would your clients have the same networking provisioned?  If you're
>> 
>>>   If we now also consider distributing the nodes over racks, buildings at 
>>> the same location, or distributed datacenters, the costs will be even more 
>>> painful.
>> Don't you already have multiple racks?  They don't need to be dedicated only 
>> to Ceph.
>> 
>>> The ceph commit requirement would be 2 copies on different OSDs (comparable 
>>> to a mirrored drive) and 3 or 4 copies in total on the cluster (comparable 
>>> to a RAID with multiple-disk redundancy)
>> Not entirely comparable, but the distinctions mostly don't matter here.
>> 
>>> In all our tests so far, we could not control how ceph persists these 2 
>>> copies. It will always try to persist them somehow over the network.
>>> Q1: Is this behavior mandatory?
>> It's a question of how important the data is, and how bad it would be to 
>> lose some.
>> 
>>>   Our common workload, and AFAIK that of nearly all webservice-based 
>>> applications, is:
>>> - short bursts of high bandwidth (e.g. multiple MiB/s or even GiB/s)
>>> - and probably mostly a 1:4 or even 1:6 write-to-read ratio in utilizing the 
>>> cluster
>> QLC might help your costs; look into the D5-P5430, D5-P5366, etc.  Though 
>> these days if you shop smart you can get TLC for close to the same cost.  
>> That won't always be true though, and you can't get a 60TB TLC SKU ;)
>> 
>>> Hope I could explain the situation here well enough.
>>>     Now, assuming my ideal world with ceph:
>>> if ceph would:
>>> 1. commit 2 copies to local drives on the node the ceph client is 
>>> connected to
>>> 2. after the commit, sync (optimized/queued) the data over the network to 
>>> fulfill the common needs of ceph storage with 4 copies
>> You could, I think, craft a CRUSH rule to do that; a rough sketch follows 
>> below.  FWIW the default for replicated pools is 3 copies, not 4.
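>> 
>> Something like this, as an untested sketch (the rule name, id, and the 
>> "default" root are assumptions; you would edit the decompiled CRUSH map, 
>> recompile it with crushtool, and test it before trusting it).  Note that 
>> CRUSH places by PG, so "local" here means local to the PG's primary OSD, not 
>> necessarily to the client doing the write:
>> 
>>     rule two_local_plus_remote {
>>         id 5
>>         type replicated
>>         # first pass: pick one host, then two OSDs within it
>>         step take default
>>         step choose firstn 1 type host
>>         step choose firstn 2 type osd
>>         step emit
>>         # second pass: the remaining (pool size minus 2) copies, one per host
>>         step take default
>>         step chooseleaf firstn -2 type host
>>         step emit
>>     }
>> 
>> Caveat: the second pass doesn't know what the first pass chose, so one of the 
>> "remote" copies can land on the same host as the two "local" ones; with a 
>> pool of size 4 you'd want to verify placements with crushtool --test before 
>> using it.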
>> 
>>> 3. maybe optionally move 1 copy away from the initial node, which still 
>>> holds the 2 local copies...
>> I don't know of an elegant way to change placement after the fact.
>> 
>>>   this behaviour would ensure that:
>>> - the perceived performance for the OSD clients would be the full bandwidth 
>>> of the local NVMes, since 2 copies are delivered to the local NVMes at 
>>> 64 Gbit/s and the latency would be comparable to writing locally
>>> - we would have 2 copies reported to the ceph client nearly "immediately"
>> I was once told that writes return to the client when min_size copies are 
>> written; later I was told that it's actually not until all copies are 
>> written.
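>> 
>> Either way, what a pool is actually configured for is easy to check (the pool 
>> name here is just an example):
>> 
>>     ceph osd pool get rbd size       # total number of replicas
>>     ceph osd pool get rbd min_size   # copies required for the PG to accept I/O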
>> 
>> But say we could do this.  Think about what happens if one of those two 
>> local drives, or the entire server, dies before any copies have been 
>> persisted to other servers, or when only one copy has been persisted to 
>> another server.  You risk data loss.
>> 
>>> - bandwidth utilization would be optimized, since we would not duplicate the 
>>> stored data transfers on the network immediately; we would defer them from 
>>> the initial write of the ceph client and could thus make better use of a 
>>> queuing mechanism
>> Unless you have an unusually random I/O pattern, I'm not sure that would 
>> affect bandwidth much.
>> 
>>> - IMHO scalability with commodity networking would be far easier to 
>>> achieve, since the networking requirements are lower by a large factor
>> How so?  I would think you'd still need the same networking.  Also remember 
>> that having PCIe lanes and keeping them full are very different things.
>> 
>>>   Maybe I have a totally wrong understanding of the ceph cluster and the 
>>> distribution of the data copies.
>>> Q2: If so, please let me know where I can read more about this.
>> https://www.amazon.com/Learning-Ceph-scalable-reliable-solution-ebook/dp/B01NBP2D9I
>> 
>> ;)
>> 
>> 
>> You might be able to achieve parts of what you envision here with commercial 
>> NVMeoF solutions.  When I researched them they tended to have low latency, 
>> but some required proprietary hardware.  Mostly they defaulted to only 2 
>> replicas and had significant scaling and flexibility limitations.  All 
>> depends on what you're solving for.
>> 
>> 
>>> So to sum it up quickly:
>>> Q3: is it possible to configure ceph to behave as described above in my 
>>> ideal world?
>>>    i.e. to first write a minimal number of copies to local drives, and defer 
>>> the syncing of the other copies over the network
>>> Q4: if not, are there any plans in this direction?
>>> Q5: if it is possible, is there good documentation for it?
>>> Q6: we would still like to be able to distribute over racks, enclosures and 
>>> datacenters
>>>   best wishes
>>> Hans
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
