Re: [ceph-users] Replication vs Erasure Coding with only 2 elements in the failure-domain.

2017-03-08 Thread Maxime Guyot
Hi,

If using Erasure Coding, I think this should be using “choose indep” rather
than “firstn” (according to
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2014-January/007306.html):

“- min_size 4
- max_size 4
- step take <room1>
- step chooseleaf firstn 2 type host
- step emit
- step take <room2>
- step chooseleaf firstn 2 type host
- step emit

Unfortunately I'm not aware of a solution. It would require replacing the fixed
'step take <room1>' and 'step take <room2>' steps with some way of iterating
over the rooms. Iteration is not part of crush as far as I know. Maybe someone
else can give some more insight into this.”

How about something like this:

“rule eck2m2_ruleset {
  ruleset 0
  type erasure
  min_size 4
  max_size 4
  step take default
  step choose indep 2 type room
  step chooseleaf indep 2 type host
  step emit
}”
Such a rule should put 2 shards in each room, on 4 different hosts in total.
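
Once that rule is compiled into a test copy of your crush map, you can
sanity-check the placement with crushtool before pointing a pool at it (just a
sketch; the map file name is an example, and the rule id matches the "ruleset 0"
above):

   crushtool --test -i crushmap.with_eck2m2 --rule 0 --num-rep 4 --show-mappings
   crushtool --test -i crushmap.with_eck2m2 --rule 0 --num-rep 4 --show-utilization

Each mapping line lists the 4 OSDs selected for one input value; if the rule
behaves as intended, every line should contain 2 OSDs from each room, on 4
distinct hosts.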

If you are serious about surviving the loss of one of the rooms, you might want
to consider the recovery time and how likely it is to have an OSD failure in
the surviving room during the recovery phase. Something like EC(n,n+1) or LRC
(http://docs.ceph.com/docs/master/rados/operations/erasure-code-lrc/) might
help.
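
For example, an LRC profile could look something like this (only a sketch: the
profile and pool names are made up, and the k/m/l values are purely
illustrative, so check the LRC documentation for the constraints between them):

   ceph osd erasure-code-profile set lrc_room plugin=lrc k=4 m=2 l=3 \
       ruleset-failure-domain=host ruleset-locality=room
   ceph osd pool create ecpool_lrc 128 128 erasure lrc_room

The extra local parity chunk created for every l chunks lets a single failed
OSD be rebuilt from chunks within its locality (here: the room), which cuts
down on cross-room recovery traffic.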

Cheers,
Maxime

From: ceph-users on behalf of Burkhard Linke
Date: Wednesday 8 March 2017 08:05
To: "ceph-users@lists.ceph.com"
Subject: Re: [ceph-users] Replication vs Erasure Coding with only 2
elements in the failure-domain.


Hi,

On 03/07/2017 05:53 PM, Francois Blondel wrote:

Hi all,



We have (only) 2 separate "rooms" (crush buckets) and would like to build a
cluster that is able to handle the complete loss of one room.

*snipsnap*

Our second idea would be to use Erasure Coding, as it fits our performance
requirements and would use less raw space.



Creating an EC profile like:

   “ceph osd erasure-code-profile set eck2m2room k=2 m=2 
ruleset-failure-domain=room”

and a pool using that EC profile, with “ceph osd pool create ecpool 128 128 
erasure eck2m2room” of course leads to having “128 creating+incomplete” PGs, as 
we only have 2 rooms.



Is there somehow a way to store the “parity chunks” (m) in both rooms, so that
the loss of a room would be possible?



If I understood correctly, an erasure coding profile with, for example, k=2 and
m=2 would use the same raw space as replication with size 2, but would be more
reliable, as we could afford the loss of more OSDs at the same time.

Would it be possible to instruct the crush rule to store the first k and m
chunks in room 1, and the second k and m chunks in room 2?

As far as I understand erasure coding, there is no special handling for parity
or data chunks. To assemble an EC object you just need any k chunks, regardless
of whether they are data or parity chunks.

You should be able to distribute the chunks among two rooms by creating a new 
crush rule:

- min_size 4
- max_size 4
- step take <room1>
- step chooseleaf firstn 2 type host
- step emit
- step take <room2>
- step chooseleaf firstn 2 type host
- step emit

I'm not 100% sure whether chooseleaf is correct here or whether another choose
step is necessary to ensure that two OSDs from different hosts are chosen (if
necessary). The important point is using two choose-emit cycles and using the
correct start points. Just insert the crush labels for the rooms.
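
For reference, one way to get such a rule into the cluster and attach it to the
EC pool would be something like this (a sketch; the file names are arbitrary
and <rule id> is whatever id the new rule gets in the decompiled map):

   ceph osd getcrushmap -o map.bin
   crushtool -d map.bin -o map.txt
   # edit map.txt: add the rule, using the real room bucket names
   crushtool -c map.txt -o map.new
   ceph osd setcrushmap -i map.new
   ceph osd pool set ecpool crush_ruleset <rule id>

(ecpool being the pool name used earlier in this thread.)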

This approach should work, but it has two drawbacks:

- crash handling
In case of a host failing in a room, the PGs from that host will be replicated
to another host in the same room. You have to ensure that there's enough
capacity in each room (vs. having enough capacity in the complete cluster),
which might be tricky.

- bandwidth / host utilization
Almost all Ceph-based applications/libraries use the 'primary' OSD for
accessing data in a PG. The primary OSD is the first one generated by the crush
rule. In the example above, the primary OSDs will all be located in the first
room, so all client traffic will be heading to hosts in that room. Depending on
your setup this might not be a desirable solution.
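
If you want to check where the primaries actually end up, the PG listing shows
the acting set and the primary of each PG (ecpool is the pool name from earlier
in this thread; <pgid> is any PG id taken from that listing):

   ceph pg ls-by-pool ecpool
   ceph pg map <pgid>

The first OSD in the acting set is the primary, so it is easy to see whether
they all sit in one room.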

Unfortunately I'm not aware of a solution. It would require replacing the fixed
'step take <room1>' and 'step take <room2>' steps with some way of iterating
over the rooms. Iteration is not part of crush as far as I know. Maybe someone
else can give some more insight into this.

Regards,
Burkhard
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

