Hi Michel,

I don't have experience with LRC profiles. They may reduce cross-site traffic 
at the expense of extra storage overhead. With EC profiles that already have a 
large m, this extra overhead might not matter much anyway. If you do 
experiments with this, please let the list know.

I would like to add a remark here that might be of general interest. One 
particular advantage of the 4+5 and 8+7 profiles, in addition to k being a 
power of 2, is the following. In case of a DC failure, one has 6 or 10 shards 
left, respectively, which is equivalent to a healthy 4+2 or 8+2 pool. In other 
words, the 4+5 and 8+7 profiles allow maintenance under degraded conditions, 
because one can lose 1 DC + 1 other host and still have RW access. This is an 
advantage over 3-times replication (and with less overhead!), where, if 1 DC is 
down, one really needs to keep everything else running at all costs.
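To spell out the arithmetic: with a rule that puts an equal number of shards 
into each of the 3 DCs, 4+5 has 3 shards per DC and 8+7 has 5. Losing a DC 
therefore leaves 9-3=6 and 15-5=10 shards, and with k=4 respectively k=8 and 
the usual min_size=k+1 that is exactly the situation of a healthy 4+2 or 8+2 
pool: one more host may fail before the pool drops to read-only.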

In addition to that, the 4+5 profile tolerates 4 and the 8+7 profile 6 hosts 
down while retaining RW access. This means that one can use DCs that are 
administered independently. A ceph upgrade would require only minimal 
coordination to synchronize the major steps. Specifically, after MONs and MGRs 
are upgraded, the OSD upgrade can proceed independently, host by host, in each 
DC without service outage. Even if something goes wrong in one DC, the others 
could proceed without service outage.
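Just to illustrate, a rough sketch of what such a per-host OSD step could look 
like on a package-based install (host and package manager are placeholders; 
cephadm/orchestrator deployments would use ceph orch upgrade instead):

  # optional: avoid rebalancing while single hosts restart their OSDs
  ceph osd set noout

  # on each OSD host of this DC, one host at a time:
  #   install new packages, restart the OSDs, wait for active+clean before the next host
  apt-get install -y ceph-osd        # or dnf update ceph-osd on RPM systems
  systemctl restart ceph-osd.target
  ceph -s                            # wait for HEALTH_OK / all PGs active+clean

  ceph osd unset noout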

In many discussions I miss the impact of the replication scheme on 
maintainability (under degraded conditions). I'm adding it here because the 
value of maintainability often greatly outweighs the cost of the extra 
redundancy. For example, I gave up on trying to squeeze the last byte out of 
disks that are actually very cheap compared to my salary, and rather have a 
system that runs without me doing overtime on incidents. It's much cheaper in 
the long run. A cluster with 50-60% usable capacity is very easy and cheap to 
administer.

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Michel Jouvin <michel.jou...@ijclab.in2p3.fr>
Sent: Monday, April 3, 2023 10:19 PM
To: Frank Schilder; ceph-users@ceph.io
Subject: Re: [ceph-users] Crushmap rule for multi-datacenter erasure coding

Hi Frank,

Thanks for this detailed answer. About your point of 4+2 or similar schemes 
defeating the purpose of a 3-datacenter configuration, you're right in 
principle. In our case, the goal is to avoid any impact for replicated pools 
(in particular RBD for the cloud), but it may be acceptable for some pools to 
be read-only during a short period. But I'll explore your alternative k+m 
scenarios, as some may be interesting.

I'm also interested in experience/feedback with LRC EC, even if I don't think 
it changes the problem of resilience to a DC failure.

Best regards,

Michel
Sent from my mobile

On 3 April 2023 at 21:57:41, Frank Schilder <fr...@dtu.dk> wrote:

Hi Michel,

failure domain = datacenter doesn't work, because crush wants to put 1 shard 
per failure domain and you have 3 data centers, not 6. The modified crush rule 
you wrote should work -- I believe equally well with x=0 or x=2 -- but try it 
out before doing anything to your cluster.
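For reference, a decompiled rule implementing "choose 3 DCs, then 2 hosts in 
each" could look roughly like the following (the rule name, id and the root 
"default" are placeholders for whatever your map actually uses):

  rule ec_3dc_by_host {
          id 99
          type erasure
          step set_chooseleaf_tries 5
          step set_choose_tries 100
          step take default
          step choose indep 3 type datacenter
          step chooseleaf indep 2 type host   # x=2; x=0 ("as many as needed") should also work
          step emit
  }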

The easiest way to test non-destructively is to download the osdmap from your 
cluster and extract the crush map from it. You can then, *without* modifying 
your cluster, update the crush map in the (off-line) copy of the OSD map and 
let it compute mappings (the commands for all of this are in the ceph docs, 
look for osdmaptool). You can then check whether these mappings are as you 
want. There was an earlier case where someone posted a script to confirm 
mappings automatically. I used some awk magic, it's not that difficult.
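A minimal sketch of that workflow, with arbitrary file names and with the 
pool/rule ids to be replaced by yours:

  # grab the OSD map and extract its crush map, without touching the cluster
  ceph osd getmap -o om
  osdmaptool om --export-crush cm
  crushtool -d cm -o cm.txt

  # edit cm.txt (add the rule), recompile, put it into the off-line OSD map copy
  crushtool -c cm.txt -o cm.new
  osdmaptool om --import-crush cm.new

  # compute and dump the PG mappings of the pool from the off-line copy,
  # then check (awk, script, ...) that every PG has 2 shards per DC
  osdmaptool om --test-map-pgs-dump --pool <pool-id>

  # alternatively, test the rule directly against the crush map
  crushtool -i cm.new --test --rule <rule-id> --num-rep 6 --show-mappings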

As a note of warning: if you want to be failure resistant, don't use 4+2. It's 
not worth the effort of having 3 data centers. In case you lose one DC, you 
have only 4 shards left, in which case the pool becomes read-only. Don't even 
consider setting min_size=4, that again completely defeats the purpose of 
having 3 DCs in the first place.

The smallest profile you can use that ensures RW access in case of a DC 
failure is 5+4 (55% usable capacity). If 1 DC fails, you have 6 shards left, 
which is equivalent to 5+1, so you still have RW access with 1 DC down. 
However, k=5 is a prime number with a negative performance impact; powers of 2 
are ideal for k. The alternative is k=4, m=5 (44% usable capacity) with good 
performance but higher redundancy overhead.

You can construct valid schemes by looking at all N that are multiples of 3 
and choosing k <= 2N/3 - 1, so that the 2N/3 shards surviving a DC failure 
still satisfy min_size = k+1 (see the small loop after the list):

N=6  -> k=2, m=4
N=9  -> k=4, m=5 ; k=5, m=4
N=12 -> k=6, m=6 ; k=7, m=5
N=15 -> k=8, m=7 ; k=9, m=6
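A trivial loop that just spells out this condition and prints the usable 
capacity k/N for each candidate (nothing ceph-specific; it also lists smaller 
k, of which the combinations above are the most capacity-efficient):

  for N in 6 9 12 15; do
    for k in $(seq 2 $(( 2*N/3 - 1 ))); do
      m=$(( N - k ))
      printf 'N=%d: k=%d, m=%d (usable %d%%)\n' "$N" "$k" "$m" $(( 100*k/N ))
    done
  done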

As you can see, the larger N, the smaller the overhead. The downside is larger 
stripes, meaning that larger N only makes sense if you have large 
files/objects. An often overlooked advantage of profiles with large m is that 
you can defeat tail latencies for read operations by setting fast_read=true on 
the pool. This is really great when you have silently failing disks. 
Unfortunately, there is no fast_write counterpart (which would not be useful 
in your use case anyway).
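Enabling it is a single pool setting, for example (the pool name is a 
placeholder):

  ceph osd pool set <ec-pool-name> fast_read 1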

There are only a few useful profiles with k a power of 2 (4+5, 8+7). Some 
people use 7+5 with success, and 5+4 looks somewhat OK as well. If you use a 
recent ceph with bluestore min_alloc_size = 4K, stripe size is less of an 
issue and 8+7 is a really good candidate that I would give a shot in a 
benchmark.
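For completeness, a hedged sketch of creating such a profile and pool (names 
and PG count are placeholders; you would then point the pool at the multi-DC 
crush rule from above rather than the rule ceph generates from the profile):

  # 8+7 profile; crush-failure-domain only affects the auto-generated rule,
  # which gets replaced by the multi-DC rule anyway
  ceph osd erasure-code-profile set ec87 k=8 m=7 crush-failure-domain=host

  ceph osd pool create ec87pool 256 256 erasure ec87
  ceph osd pool set ec87pool crush_rule ec_3dc_by_host   # the multi-DC rule from above
  ceph osd pool set ec87pool allow_ec_overwrites true    # needed for RBD/CephFS on EC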

You should benchmark a number of different profiles on your system to get an 
idea of how important the profile is for performance and how much replication 
overhead you can afford. Remember to benchmark in degraded conditions as well: 
while as an admin you might be happy that stuff is up, users will still 
complain if things are suddenly unbearably slow. Run long-lasting tests in a 
degraded state to catch all the pitfalls (MONs not trimming logs etc.), so 
that you end up with a reliable configuration that doesn't let you down the 
first time it rains.
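For the benchmarks themselves, plain rados bench against the test pool already 
tells you a lot; a simple sketch (pool name, runtime and thread count are just 
examples), run once healthy and once with a DC and/or an extra host down:

  # write for 10 minutes with 16 threads of 4M objects, keep the data for the read tests
  rados bench -p ec87pool 600 write -b 4M -t 16 --no-cleanup

  # sequential and random reads against the data written above
  rados bench -p ec87pool 600 seq -t 16
  rados bench -p ec87pool 600 rand -t 16

  # remove the benchmark objects afterwards
  rados -p ec87pool cleanup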

Good luck and best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Michel Jouvin <michel.jou...@ijclab.in2p3.fr>
Sent: Monday, April 3, 2023 6:40 PM
To: ceph-users@ceph.io
Subject: [ceph-users] Crushmap rule for multi-datacenter erasure coding

Hi,

We have a 3-site Ceph cluster and would like to create a 4+2 EC pool
with 2 chunks per datacenter, to maximise the resilience in case of 1
datacenter being down. I have not found a way to create an EC profile
with this 2-level allocation strategy. I created an EC profile with a
failure domain = datacenter but it doesn't work as, I guess, it would like to 
ensure it always has 5 OSDs up (so that the pool remains R/W), whereas with a 
failure domain = datacenter the guarantee is only 4.
My idea was to create a 2-step allocation and a failure domain=host to
achieve our desired configuration, with something like the following in
the crushmap rule:

step choose indep 3 datacenter
step chooseleaf indep x host
step emit

Is it the right approach? If yes, what should be 'x'? Would 0 work?

 From what I have seen, there is no way to create such a rule with the
'ceph osd crush' commands: I have to download the current CRUSHMAP, edit
it and upload the modified version. Am I right?

Thanks in advance for your help or suggestions. Best regards,

Michel

_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io