[ceph-users] Re: Stickyness of writing vs full network storage writing

2023-10-28 Thread Anthony D'Atri
Well said, Herr Kraftmayer.

- aad

> On Oct 28, 2023, at 4:22 AM, Joachim Kraftmayer - ceph ambassador 
>  wrote:
> 
> Hi,
> 
> I know similar requirements, the motivation and the need behind them.
> We have chosen a clear approach to this, which also does not make the whole 
> setup too complicated to operate.
> 1.) Everything that doesn't require strong consistency we do with other 
> tools, especially when it comes to NVMe, PCIe 5.0 and newer technologies with 
> high IOPS and low latencies.
> 
> 2.) Everything that requires high data security, strong consistency and 
> failure domains higher than host we do with Ceph.
> 
> Joachim
> 
> ___
> ceph ambassador DACH
> ceph consultant since 2012
> 
> Clyso GmbH - Premier Ceph Foundation Member
> 
> https://www.clyso.com/
> 
> On 27.10.23 at 17:58, Anthony D'Atri wrote:
>> Ceph is all about strong consistency and data durability.  There can also be 
>> a distinction between performance of the cluster in aggregate vs a single 
>> client, especially in a virtualization scenario where to avoid the 
>> noisy-neighbor dynamic you deliberately throttle iops and bandwidth per 
>> client.
>> 
>>> For my discussion I am assuming modern PCIe-based NVMe drives, which are 
>>> capable of writing about 8 GiB/s, which is about 64 GBit/s.
>> Written how, though?  Benchmarks are sometimes run with 100% sequential 
>> workloads, top-SKU CPUs that mortals can't afford, and especially with a 
>> queue depth of like 256.
>> 
>> With most Ceph deployments, the IO a given drive experiences is often pretty 
>> much random and with lower QD.  And depending on the drive, significant read 
>> traffic may impact write bandwidth to a degree.  At Mountpoint 
>> (Vancouver BC 2018) someone gave a presentation about the difficulties 
>> saturating NVMe bandwidth.
>> 
>>> Now consider the situation that you have 5 nodes, each with 4 of those 
>>> drives;
>>> this will make all small and mid-sized companies go bankrupt ;-) just from 
>>> buying the corresponding networking switches.
>> Depending where you get your components...
>> 
>> * You probably don't need "mixed-use" (~3 DWPD) drives; for most purposes 
>> "read intensive" (~1 DWPD) (or less, sometimes) are plenty.  But please 
>> please please stick with real enterprise-class drives.
>> 
>> * Chassis brands mark up their storage (and RAM) quite a bit.  You can often 
>> get SSDs elsewhere for half of what they cost from your chassis manufacturer.
>> 
>>>   But the server hardware is still simple commodity hardware which 
>>> can easily saturate any given commodity network hardware.
>>> If I want to be able to use the full 64 GBit/s I would require at least 
>>> 100 GBit/s networking or tons of trunked ports and cabling with lower 
>>> bandwidth switches.
>> Throughput and latency are different things, though.  Also, are you assuming 
>> here the traditional topology of separate public and 
>> cluster/private/replication networks?  With modern networking (and Ceph 
>> releases) that is often overkill and you can leave out the replication 
>> network.
>> 
>> Also, would your clients have the same networking provisioned?  If you're
>> 
>>>   If we now also consider distributing the nodes over racks, buildings at 
>>> the same location, or distributed datacenters, the costs will be even more 
>>> painful.
>> Don't you already have multiple racks?  They don't need to be dedicated only 
>> to Ceph.
>> 
>>> The ceph commit requirement will be 2 copies on different OSDs (comparable 
>>> to a mirrored drive) and in total 3 or 4 copies on the cluster (comparable 
>>> to a RAID with multiple-disk redundancy)
>> Not entirely comparable, but the distinctions mostly don't matter here.
>> 
>>> In all our tests so far, we could not control the behavior of how ceph 
>>> persists these 2 copies. It will always try to persist them somehow over the 
>>> network.
>>> Q1: Is this behavior mandatory?
>> It's a question of how important the data is, and how bad it would be to 
>> lose some.
>> 
>>>   Our common workload, and afaik that of nearly all webservice-based 
>>> applications, is:
>>> - a short burst of high bandwidth (e.g. multiple MiB/s or even GiB/s)
>>> - and probably mostly a 1-write-to-4-reads or even 1:6 ratio when utilizing the 
>>> cluster
>> QLC might help your costs; look into the D5-P5430, D5-P5366, etc.  Though 
>> these days if you shop smart you can get TLC for close to the same cost.  Won't 
>> always be true though, and you can't get a 60TB TLC SKU ;)
>> 
>>> Hope I could explain the situation here well enough.
>>> Now assuming my ideal world with ceph:
>>> if ceph would do:
>>> 1. commit 2 copies to local drives on the node where the ceph client is 
>>> connected
>>> 2. after the commit, sync (optimized/queued) the data over the network to 
>>> fulfill the common needs of ceph storage with 4 copies
>> You could I think craft a CRUSH rule to do that.  Default for replicated 
>> pools FWIW is 3 copies not 4.

[ceph-users] Re: Stickyness of writing vs full network storage writing

2023-10-28 Thread Joachim Kraftmayer - ceph ambassador

Hi,

I know similar requirements, the motivation and the need behind them.
We have chosen a clear approach to this, which also does not make the 
whole setup too complicated to operate.
1.) Everything that doesn't require strong consistency we do with other 
tools, especially when it comes to NVMe, PCIe 5.0 and newer technologies 
with high IOPS and low latencies.


2.) Everything that requires high data security, strong consistency and 
failure domains higher than host we do with Ceph.


Joachim

___
ceph ambassador DACH
ceph consultant since 2012

Clyso GmbH - Premier Ceph Foundation Member

https://www.clyso.com/

On 27.10.23 at 17:58, Anthony D'Atri wrote:

Ceph is all about strong consistency and data durability.  There can also be a 
distinction between performance of the cluster in aggregate vs a single client, 
especially in a virtualization scenario where to avoid the noisy-neighbor 
dynamic you deliberately throttle iops and bandwidth per client.


For my discussion I am assuming modern PCIe-based NVMe drives, which are 
capable of writing about 8 GiB/s, which is about 64 GBit/s.

Written how, though?  Benchmarks are sometimes run with 100% sequential 
workloads, top-SKU CPUs that mortals can't afford, and especially with a queue 
depth of like 256.

With most Ceph deployments, the IO a given drive experiences is often pretty 
much random and with lower QD.  And depending on the drive, significant read 
traffic may impact write bandwidth to a degree.  At Mountpoint (Vancouver 
BC 2018) someone gave a presentation about the difficulties saturating NVMe 
bandwidth.


Now consider the situation that you have 5 nodes, each with 4 of those drives;
this will make all small and mid-sized companies go bankrupt ;-) just from buying 
the corresponding networking switches.

Depending where you get your components...

* You probably don't need "mixed-use" (~3 DWPD) drives; for most purposes "read 
intensive" (~1 DWPD) (or less, sometimes) are plenty.  But please please please 
stick with real enterprise-class drives.

* Chassis brands mark up their storage (and RAM) quite a bit.  You can often 
get SSDs elsewhere for half of what they cost from your chassis manufacturer.


   But the server hardware is still simple commodity hardware which can 
easily saturate any given commodity network hardware.
If I want to be able to use the full 64 GBit/s I would require at least 100 GBit/s 
networking or tons of trunked ports and cabling with lower bandwidth switches.

Throughput and latency are different things, though.  Also, are you assuming 
here the traditional topology of separate public and 
cluster/private/replication networks?  With modern networking (and Ceph 
releases) that is often overkill and you can leave out the replication network.

Also, would your clients have the same networking provisioned?  If you're


   If we now also consider distributing the nodes over racks, buildings at the same 
location, or distributed datacenters, the costs will be even more painful.

Don't you already have multiple racks?  They don't need to be dedicated only to 
Ceph.


The ceph commit requirement will be 2 copies on different OSDs (comparable to a 
mirrored drive) and in total 3 or 4 copies on the cluster (comparable to a RAID 
with multiple-disk redundancy)

Not entirely comparable, but the distinctions mostly don't matter here.


In all our tests so far, we could not control the behavior of how ceph 
persists these 2 copies. It will always try to persist them somehow over the 
network.
Q1: Is this behavior mandatory?

It's a question of how important the data is, and how bad it would be to lose 
some.


   Our common workload, and afaik that of nearly all webservice-based applications, is:
- a short burst of high bandwidth (e.g. multiple MiB/s or even GiB/s)
- and probably mostly a 1-write-to-4-reads or even 1:6 ratio when utilizing the cluster

QLC might help your costs; look into the D5-P5430, D5-P5366, etc.  Though these 
days if you shop smart you can get TLC for close to the same cost.  Won't always 
be true though, and you can't get a 60TB TLC SKU ;)


Hope I could explain the situation here well enough.
 Now assuming my ideal world with ceph:
if ceph would do:
1. commit 2 copies to local drives on the node where the ceph client is connected
2. after the commit, sync (optimized/queued) the data over the network to fulfill 
the common needs of ceph storage with 4 copies

You could I think craft a CRUSH rule to do that.  Default for replicated pools 
FWIW is 3 copies not 4.


3. maybe optionally move 1 copy away from the initial node which still holds the 
2 local copies...

I don't know of an elegant way to change placement after the fact.


   this behaviour would ensure that:
- the perceived performance of the OSD clients will be the full bandwidth of the 
local NVMes, since 2 copies are delivered to the local NVMes with 64 GBit/s and 
the latency would be comparable to writing locally
- we would have 2 copies nearly "immediately" reported to any ceph client

[ceph-users] Re: Stickyness of writing vs full network storage writing

2023-10-27 Thread Anthony D'Atri
Ceph is all about strong consistency and data durability.  There can also be a 
distinction between performance of the cluster in aggregate vs a single client, 
especially in a virtualization scenario where to avoid the noisy-neighbor 
dynamic you deliberately throttle iops and bandwidth per client.
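
For RBD specifically, that throttling is just a per-pool (or per-image) setting; a
minimal sketch with a hypothetical pool/image name and placeholder values:

    # cap all images in pool "rbd" at 1000 IOPS and ~200 MB/s
    rbd config pool set rbd rbd_qos_iops_limit 1000
    rbd config pool set rbd rbd_qos_bps_limit 209715200
    # or per image (placeholder image name):
    # rbd config image set rbd/vm-disk-1 rbd_qos_iops_limit 500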

> For my discussion I am assuming modern PCIe-based NVMe drives, which are 
> capable of writing about 8 GiB/s, which is about 64 GBit/s.

Written how, though?  Benchmarks are sometimes run with 100% sequential 
workloads, top-SKU CPUs that mortals can't afford, and especially with a queue 
depth of like 256.

With most Ceph deployments, the IO a given drive experiences is often pretty 
much random and with lower QD.  And depending on the drive, significant read 
traffic may impact write bandwidth to a degree.  At Mountpoint (Vancouver 
BC 2018) someone gave a presentation about the difficulties saturating NVMe 
bandwidth.  
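
For a rough feel of the difference, the spec-sheet number usually comes from
something like the first fio run below, while the second is much closer to what
an OSD actually sees (device path and numbers are placeholders; writing to a raw
device destroys its contents):

    # vendor-style: large sequential writes at a deep queue depth
    fio --name=seq --filename=/dev/nvme0n1 --ioengine=libaio --direct=1 \
        --rw=write --bs=1M --iodepth=256 --runtime=60 --time_based

    # closer to an OSD's life: small random writes at a modest queue depth
    fio --name=rand --filename=/dev/nvme0n1 --ioengine=libaio --direct=1 \
        --rw=randwrite --bs=4k --iodepth=16 --numjobs=4 --runtime=60 --time_based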

> Now consider the situation that you have 5 nodes, each with 4 of those drives;
> this will make all small and mid-sized companies go bankrupt ;-) just from 
> buying the corresponding networking switches.

Depending where you get your components...

* You probably don't need "mixed-use" (~3 DWPD) drives; for most purposes "read 
intensive" (~1 DWPD) (or less, sometimes) are plenty.  But please please please 
stick with real enterprise-class drives.

* Chassis brands mark up their storage (and RAM) quite a bit.  You can often 
get SSDs elsewhere for half of what they cost from your chassis manufacturer.

>   But the server hardware is still simple commodity hardware which can 
> easily saturate any given commodity network hardware.
> If I want to be able to use the full 64 GBit/s I would require at least 100 GBit/s 
> networking or tons of trunked ports and cabling with lower bandwidth 
> switches.

Throughput and latency are different things, though.  Also, are you assuming 
here the traditional topology of separate public and 
cluster/private/replication networks?  With modern networking (and Ceph 
releases) that is often overkill and you can leave out the replication network.
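
Concretely, a single flat network just means setting the public network and
omitting the cluster network entirely; a sketch with placeholder subnets:

    # one network carries both client and replication traffic
    ceph config set global public_network 10.0.0.0/24
    # only if you really want a dedicated replication network:
    # ceph config set global cluster_network 10.0.1.0/24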

Also, would your clients have the same networking provisioned?  If you're 

>   If we now also consider distributing the nodes over racks, buildings at the same 
> location, or distributed datacenters, the costs will be even more painful.

Don't you already have multiple racks?  They don't need to be dedicated only to 
Ceph.
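
And if/when you do spread nodes over racks, telling Ceph about it is only a
matter of the CRUSH hierarchy; for example (bucket and host names are
placeholders):

    # declare a rack bucket and hang an existing host under it
    ceph osd crush add-bucket rack1 rack
    ceph osd crush move rack1 root=default
    ceph osd crush move node1 rack=rack1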

> The ceph commit requirement will be 2 copies on different OSDs (comparable to 
> a mirrored drive) and in total 3 or 4 copies on the cluster (comparable to a 
> RAID with multiple-disk redundancy)

Not entirely comparable, but the distinctions mostly don't matter here.

> In all our tests so far, we could not control the behavior of how ceph 
> persists these 2 copies. It will always try to persist them somehow over the 
> network.
> Q1: Is this behavior mandatory?

It's a question of how important the data is, and how bad it would be to lose 
some.
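
For reference, how many copies are kept and how few may remain before the pool
stops accepting I/O are per-pool settings; a sketch with a hypothetical pool name:

    # keep 4 copies in total, keep serving I/O as long as at least 2 are available
    ceph osd pool set mypool size 4
    ceph osd pool set mypool min_size 2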

>   Our common workload, and afaik that of nearly all webservice-based applications, is:
> - a short burst of high bandwidth (e.g. multiple MiB/s or even GiB/s)
> - and probably mostly a 1-write-to-4-reads or even 1:6 ratio when utilizing the 
> cluster

QLC might help your costs; look into the D5-P5430, D5-P5366, etc.  Though these 
days if you shop smart you can get TLC for close to the same cost.  Won't always 
be true though, and you can't get a 60TB TLC SKU ;)

> Hope I could explain the situation here well enough.
> Now assuming my ideal world with ceph:
> if ceph would do:
> 1. commit 2 copies to local drives on the node where the ceph client is 
> connected
> 2. after the commit, sync (optimized/queued) the data over the network to fulfill 
> the common needs of ceph storage with 4 copies

You could I think craft a CRUSH rule to do that.  Default for replicated pools 
FWIW is 3 copies not 4.
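
If you wanted to experiment with that, it would be a hand-written rule in the
decompiled CRUSH map.  An untested sketch -- and note that "local" here means
whichever host CRUSH picks first, not necessarily the host your client runs on
(pool and rule names are placeholders):

    # dump and decompile the current CRUSH map
    ceph osd getcrushmap -o crushmap.bin
    crushtool -d crushmap.bin -o crushmap.txt

    # add a rule roughly like this to crushmap.txt:
    #
    #   rule two_local_two_remote {
    #       id 10
    #       type replicated
    #       step take default
    #       step choose firstn 1 type host
    #       step choose firstn 2 type osd
    #       step emit
    #       step take default
    #       step chooseleaf firstn -2 type host
    #       step emit
    #   }
    #
    # (caveat: the second pass does not exclude the host chosen in the first pass)

    # recompile, inject, and point a pool at the new rule
    crushtool -c crushmap.txt -o crushmap.new
    ceph osd setcrushmap -i crushmap.new
    ceph osd pool set mypool crush_rule two_local_two_remote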

> 3. maybe optionally move 1 copy away from the initial node which still holds 
> the 2 local copies...

I don't know of an elegant way to change placement after the fact.

>   this behaviour would ensure that:
> - the perceived performance of the OSD clients will be the full bandwidth of the 
> local NVMes, since 2 copies are delivered to the local NVMes with 64 GBit/s 
> and the latency would be comparable to writing locally
> - we would have 2 copies nearly "immediately" reported to any ceph client

I was once told that writes return to the client when min_size copies are 
written; later I was told that it's actually not until all copies are written.

But say we could do this.  Think about what happens if one of those two local 
drives -- or the entire server -- dies before any copies are persisted to 
other servers, or while only one copy has been persisted to another server.  
You risk data loss.

> - bandwidth utilization will be optimized, since we do not duplicate the 
> stored data transfers on the network immediately; we defer it from the 
> initial