Ceph is all about strong consistency and data durability.  There is also a 
distinction between the performance of the cluster in aggregate vs. that of a 
single client, especially in a virtualization scenario where you deliberately 
throttle IOPS and bandwidth per client to avoid the noisy-neighbor dynamic.

> For my discussion I am assuming today's PCIe-based NVMe drives, which are 
> capable of writing about 8 GiB/s, which is about 64 Gbit/s.

Written how, though?  Benchmark numbers are often produced with 100% sequential 
workloads, top-SKU CPUs that mortals can't afford, and especially with a queue 
depth of like 256.

With most Ceph deployments, the IO a given drive experiences is pretty much 
random and at a lower queue depth.  And depending on the drive, significant 
read traffic may impact write bandwidth to a degree.  At ..... Mountpoint 
(Vancouver BC, 2018) someone gave a presentation about the difficulties of 
saturating NVMe bandwidth.
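If you want to sanity-check what a drive really does under a more Ceph-like 
pattern, something along these lines is more telling than a QD256 sequential 
run.  This is just a sketch: the device path, queue depth and runtime are 
placeholders, and it writes to the raw device, so only point it at a scratch 
drive:

    fio --name=randwrite-qd16 --filename=/dev/nvme0n1 \
        --ioengine=libaio --direct=1 --rw=randwrite \
        --bs=4k --iodepth=16 --numjobs=1 \
        --runtime=60 --time_based --group_reporting

Compare that against a --rw=write --bs=1m --iodepth=256 run and you'll usually 
see where the datasheet number comes from.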

> Now consider the situation where you have 5 nodes, each with 4 of those 
> drives; that will make all small and mid-sized companies go bankrupt ;-) 
> just from buying the corresponding networking switches.

Depending where you get your components...

* You probably don't need "mixed-use" (~3 DWPD) drives; for most purposes "read 
intensive" (~1 DWPD) (or less, sometimes) are plenty.  But please please please 
stick with real enterprise-class drives.

* Chassis brands mark up their storage (and RAM) quite a bit.  You can often 
get SSDs elsewhere for half of what they cost from your chassis manufacturer.

>   But the server hardware is still simple commodity hardware, which can 
> easily saturate any given commodity network hardware.
> If I want to be able to use the full 64 Gbit/s I would require at least 
> 100 Gbit/s networking, or tons of trunked ports and cabling with 
> lower-bandwidth switches.

Throughput and latency are different things, though.  Also, are you assuming 
here the traditional topology of separate public and 
cluster/private/replication networks?  With modern networking (and Ceph 
releases) that is often overkill and you can leave out the replication network.
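As a rough sketch (the subnet is made up, adjust to your environment), a single 
flat network just means defining only the public network and omitting the 
cluster network entirely in ceph.conf:

    [global]
        public_network = 10.10.0.0/24
        # no cluster_network defined: replication traffic
        # simply shares the public network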

Also, would your clients have the same networking provisioned?

>   If we now also consider distributing the nodes over racks, buildings at 
> the same location, or distributed datacenters, the costs will be even more 
> painful.

Don't you already have multiple racks?  They don't need to be dedicated only to 
Ceph.

> The ceph commit requirement will be 2 copies on different OSDs (comparable to 
> a mirrored drive) and in total 3 or 4 copies on the cluster (comparable to a 
> RAID with multiple-disk redundancy)

Not entirely comparable, but the distinctions mostly don't matter here.

> In all our tests so far, we could not control how ceph persists these 
> 2 copies. It will always try to persist them somehow over the network.
> Q1: Is this behavior mandatory?

It's a question of how important the data is, and how bad it would be to lose 
some.

>   Our common workload, and afaik that of nearly all webservice-based 
> applications, is:
> - a short burst of high bandwidth (e.g. multiple MiB/s or even GiB/s)
> - and probably mostly a 1:4 or even 1:6 write-to-read ratio when utilizing 
> the cluster

QLC might help your costs; look into the D5-P5430, D5-P5366, etc.  Though these 
days, if you shop smart, you can get TLC for close to the same cost.  That won't 
always be true, though, and you can't get a 60TB TLC SKU ;)

> Hope I could explain the situation here well enough.
>     Now assuming my ideal world with ceph:
> if ceph would:
> 1. commit 2 copies to local drives on the node the ceph client is connected 
> to
> 2. after the commit, sync (optimized/queued) the data over the network to 
> fulfill the common needs of ceph storage with 4 copies

You could, I think, craft a CRUSH rule to do that.  FWIW the default for 
replicated pools is 3 copies, not 4.
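Something along these lines (an untested sketch; the rule name and id are 
arbitrary) would land 2 copies on each of 2 hosts for a size=4 pool.  Note 
though that CRUSH picks the hosts pseudo-randomly; it can't pin the primary to 
whichever node your client happens to be talking from:

    rule two_copies_per_host {
        id 10
        type replicated
        step take default
        step choose firstn 2 type host      # pick 2 distinct hosts
        step choose firstn 2 type osd       # then 2 OSDs on each of them
        step emit
    }

Compile and inject it with crushtool, then assign it to the pool with 
"ceph osd pool set <pool> crush_rule two_copies_per_host".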

> 3. maybe optionally move 1 copy away from the initial node which still holds 
> the 2 local copies...

I don't know of an elegant way to change placement after the fact.

>   this behaviour would ensure that:
> - the perceived performance for the OSD clients will be the full bandwidth of 
> the local NVMes, since 2 copies are delivered to the local NVMes at 64 Gbit/s 
> and the latency would be comparable to writing locally
> - we would have 2 copies nearly "immediately" reported to any ceph client

I was once told that writes return to the client when min_size copies are 
written; later I was told that it's actually not until all copies are written.
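Either way, size and min_size are just pool properties you can inspect and tune 
(the pool name here is only an example).  As I understand it, min_size governs 
how many copies must be available for IO to be served at all, not when the 
client gets its ack:

    ceph osd pool get mypool size        # total number of replicas
    ceph osd pool get mypool min_size    # replicas required for IO to proceed
    ceph osd pool set mypool min_size 2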

But say we could do this.  Think about what happens if one of those two local 
drives -- or the entire server -- dies before any copies are persisted to 
other servers, or while only one copy has made it to another server.  You risk 
data loss.

> - bandwidth utilization will be optimized, since we do not duplicate the 
> stored data transfers on the network immediately; we defer them from the 
> initial write of the ceph client and can thus make better use of a queuing 
> mechanism

Unless you have an unusually random IO pattern, I'm not sure that would 
affect bandwidth much.

> - IMHO scalability with commodity networking would be far easier to 
> implement, since the networking requirements are factors lower

How so?  I would think you'd still need the same networking.  Also remember 
that having PCIe lanes and keeping them full are very different things.

>   Maybe I have a totally wrong understanding of ceph clusters and the data 
> distribution of the copies.
> Q2: If so, please let me know where I may read more about this?

https://www.amazon.com/Learning-Ceph-scalable-reliable-solution-ebook/dp/B01NBP2D9I

;)


You might be able to achieve parts of what you envision here with commercial 
NVMeoF solutions.  When I researched them, they tended to have low latency, but 
some required proprietary hardware.  Most defaulted to only 2 replicas and had 
significant scaling and flexibility limitations.  It all depends on what you're 
solving for.


> 
> So, to boil it down quickly:
> Q3: is it possible to configure ceph to behave as described above, in my 
> ideal world?
>    i.e. to first write n minimal copies to local drives, and defer the 
> syncing of the other copies over the network
> Q4: if not, are there any plans in this direction?
> Q5: if possible, is there good documentation for it?
> Q6: we would still like to be able to distribute over racks, enclosures and 
> datacenters
>   best wishes
> Hans