[ceph-users] Even data distribution across OSD - Impossible Achievement?

2016-10-14 Thread info
Hi all, 

after encountering a warning about one of my OSDs running out of space, I tried
to better understand how data distribution works.

I'm running a Hammer Ceph cluster, v0.94.7.

I ran some tests with crushtool, trying to figure out how to achieve even data
distribution across OSDs.

Let's take this simple CRUSH MAP: 

# begin crush map 
tunable choose_local_tries 0 
tunable choose_local_fallback_tries 0 
tunable choose_total_tries 50 
tunable chooseleaf_descend_once 1 
tunable straw_calc_version 1 
tunable chooseleaf_vary_r 1 

# devices 
# ceph-osd-001 
device 0 osd.0 # sata-p 
device 1 osd.1 # sata-p 
device 3 osd.3 # sata-p 
device 4 osd.4 # sata-p 
device 5 osd.5 # sata-p 
device 7 osd.7 # sata-p 
device 9 osd.9 # sata-p 
device 10 osd.10 # sata-p 
device 11 osd.11 # sata-p 
device 13 osd.13 # sata-p 
# ceph-osd-002 
device 14 osd.14 # sata-p 
device 15 osd.15 # sata-p 
device 16 osd.16 # sata-p 
device 18 osd.18 # sata-p 
device 19 osd.19 # sata-p 
device 21 osd.21 # sata-p 
device 23 osd.23 # sata-p 
device 24 osd.24 # sata-p 
device 25 osd.25 # sata-p 
device 26 osd.26 # sata-p 
# ceph-osd-003 
device 28 osd.28 # sata-p 
device 29 osd.29 # sata-p 
device 30 osd.30 # sata-p 
device 31 osd.31 # sata-p 
device 32 osd.32 # sata-p 
device 33 osd.33 # sata-p 
device 34 osd.34 # sata-p 
device 35 osd.35 # sata-p 
device 36 osd.36 # sata-p 
device 41 osd.41 # sata-p 
# types 
type 0 osd 
type 1 server 
type 3 datacenter 

# buckets 

### CEPH-OSD-003 ### 
server ceph-osd-003-sata-p { 
id -12 
alg straw 
hash 0 # rjenkins1 
item osd.28 weight 1.000 
item osd.29 weight 1.000 
item osd.30 weight 1.000 
item osd.31 weight 1.000 
item osd.32 weight 1.000 
item osd.33 weight 1.000 
item osd.34 weight 1.000 
item osd.35 weight 1.000 
item osd.36 weight 1.000 
item osd.41 weight 1.000 
} 

### CEPH-OSD-002 ### 
server ceph-osd-002-sata-p { 
id -9 
alg straw 
hash 0 # rjenkins1 
item osd.14 weight 1.000 
item osd.15 weight 1.000 
item osd.16 weight 1.000 
item osd.18 weight 1.000 
item osd.19 weight 1.000 
item osd.21 weight 1.000 
item osd.23 weight 1.000 
item osd.24 weight 1.000 
item osd.25 weight 1.000 
item osd.26 weight 1.000 
} 

### CEPH-OSD-001 ### 
server ceph-osd-001-sata-p { 
id -5 
alg straw 
hash 0 # rjenkins1 
item osd.0 weight 1.000 
item osd.1 weight 1.000 
item osd.3 weight 1.000 
item osd.4 weight 1.000 
item osd.5 weight 1.000 
item osd.7 weight 1.000 
item osd.9 weight 1.000 
item osd.10 weight 1.000 
item osd.11 weight 1.000 
item osd.13 weight 1.000 
} 

# DATACENTER 
datacenter dc1 { 
id -1 
alg straw 
hash 0 # rjenkins1 
item ceph-osd-001-sata-p weight 10.000 
item ceph-osd-002-sata-p weight 10.000 
item ceph-osd-003-sata-p weight 10.000 
} 

# rules 
rule sata-p { 
ruleset 0 
type replicated 
min_size 2 
max_size 10 
step take dc1 
step chooseleaf firstn 0 type server 
step emit 
} 

# end crush map 


Basically it's 30 OSDs spread across 3 servers. One rule exists, the classic
replica-3 setup.


cephadm@cephadm01:/etc/ceph/$ crushtool -i crushprova.c --test 
--show-utilization --num-rep 3 --tree --max-x 1 

ID WEIGHT TYPE NAME 
-1 30.0 datacenter milano1 
-5 10.0 server ceph-osd-001-sata-p 
0 1.0 osd.0 
1 1.0 osd.1 
3 1.0 osd.3 
4 1.0 osd.4 
5 1.0 osd.5 
7 1.0 osd.7 
9 1.0 osd.9 
10 1.0 osd.10 
11 1.0 osd.11 
13 1.0 osd.13 
-9 10.0 server ceph-osd-002-sata-p 
14 1.0 osd.14 
15 1.0 osd.15 
16 1.0 osd.16 
18 1.0 osd.18 
19 1.0 osd.19 
21 1.0 osd.21 
23 1.0 osd.23 
24 1.0 osd.24 
25 1.0 osd.25 
26 1.0 osd.26 
-12 10.0 server ceph-osd-003-sata-p 
28 1.0 osd.28 
29 1.0 osd.29 
30 1.0 osd.30 
31 1.0 osd.31 
32 1.0 osd.32 
33 1.0 osd.33 
34 1.0 osd.34 
35 1.0 osd.35 
36 1.0 osd.36 
41 1.0 osd.41 

rule 0 (sata-performance), x = 0..1023, numrep = 3..3 
rule 0 (sata-performance) num_rep 3 result size == 3: 1024/1024 
device 0: stored : 95 expected : 102.49 
device 1: stored : 95 expected : 102.49 
device 3: stored : 104 expected : 102.49 
device 4: stored : 95 expected : 102.49 
device 5: stored : 110 expected : 102.49 
device 7: stored : 111 expected : 102.49 
device 9: stored : 106 expected : 102.49 
device 10: stored : 97 expected : 102.49 
device 11: stored : 105 expected : 102.49 
device 13: stored : 106 expected : 102.49 
device 14: stored : 107 expected : 102.49 
device 15: stored : 107 expected : 102.49 
device 16: stored : 101 expected : 102.49 
device 18: stored : 93 expected : 102.49 
device 19: stored : 102 expected : 102.49 
device 21: stored : 112 expected : 102.49 
device 23: stored : 115 expected : 102.49 
device 24: stored : 95 expected : 102.49 
device 25: stored : 98 expected : 102.49 
device 26: stored : 94 expected : 102.49 
device 28: stored : 92 expected : 102.49 
device 29: stored : 87 expected : 102.49 
device 30: stored : 109 expected : 102.49 
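
For reference, the expected value crushtool prints lines up with a simple back-of-the-envelope figure (total placed replicas spread evenly over equally weighted OSDs):

3 replicas * 1024 inputs / 30 OSDs ~= 102.4 per OSD

which is in line with the 102.49 reported above. The stored counts visible here range from 87 to 115, i.e. roughly -15% / +12% around that mean, which is the kind of spread straw placement typically produces at this scale.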

Re: [ceph-users] Even data distribution across OSD - Impossible Achievement?

2016-10-16 Thread Wido den Hollander

> Op 14 oktober 2016 om 19:13 schreef i...@witeq.com:
> 
> 
> Hi all, 
> 
> after encountering a warning about one of my OSDs running out of space, I
> tried to better understand how data distribution works.
> 

100% perfect data distribution is not possible with straw. It is very hard to
accomplish even with a deterministic algorithm. It's a trade-off between
balance and performance.

You might want to read the original paper from Sage: 
http://ceph.com/papers/weil-crush-sc06.pdf

Another thing to look at is: 
http://docs.ceph.com/docs/jewel/rados/operations/crush-map/#crush-map-parameters

With different algorithms like list and uniform you could do other things, but 
use them carefully! I would say, read the PDF first.

Wido
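
As a concrete illustration of the above (a sketch, not a recommendation): switching a bucket of identical, equally weighted devices from straw to another algorithm is only an edit to the decompiled map, followed by a recompile and an offline test with crushtool. The bucket name below comes from the map earlier in this thread; the file names are arbitrary, and changing a bucket algorithm on a live cluster will move a lot of data, so simulate first.

ceph osd getcrushmap -o crushmap.bin
crushtool -d crushmap.bin -o crushmap.txt

# in crushmap.txt, inside e.g. "server ceph-osd-001-sata-p { ... }", change:
#     alg straw
# to:
#     alg uniform    # only sensible when all items have the same weight

crushtool -c crushmap.txt -o crushmap.new
crushtool -i crushmap.new --test --show-utilization --num-rep 3 --tree

# only after reviewing the simulated placement:
# ceph osd setcrushmap -i crushmap.new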


Re: [ceph-users] Even data distribution across OSD - Impossible Achievement?

2016-10-17 Thread info
Hi Wido, 

thanks for the explanation. Generally speaking, what is the best practice when a
couple of OSDs are reaching near-full capacity?

I could set their weight to something like 0.9, but this seems like only a temporary
solution.
Of course I can add more OSDs, but that radically changes my perspective in
terms of capacity planning. What would you do in production?

Thanks 
Giordano 
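
As a side note, the per-OSD fill levels behind a near-full warning can be listed directly before touching any weights; a minimal check, assuming 0.94.x command availability:

ceph osd df
ceph health detail

ceph osd df shows weight, reweight and %USE per OSD, and ceph health detail names the specific OSDs that crossed the nearfull/full thresholds.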


From: "Wido den Hollander"  
To: ceph-users@lists.ceph.com, i...@witeq.com 
Sent: Monday, October 17, 2016 8:57:16 AM 
Subject: Re: [ceph-users] Even data distribution across OSD - Impossible 
Achievement? 

> Op 14 oktober 2016 om 19:13 schreef i...@witeq.com: 
> 
> 
> Hi all, 
> 
> after encountering a warning about one of my OSDs running out of space i 
> tried to study better how data distribution works. 
> 

100% perfect data distribution is not possible with straw. It is even very hard 
to accomplish this with a deterministic algorithm. It's a trade-off between 
balance and performance. 

You might want to read the original paper from Sage: 
http://ceph.com/papers/weil-crush-sc06.pdf 

Another thing to look at is: 
http://docs.ceph.com/docs/jewel/rados/operations/crush-map/#crush-map-parameters
 

With different algorithms like list and uniform you could do other things, but 
use them carefully! I would say, read the PDF first. 

Wido 

> I'm running a Hammer Ceph cluster v. 0.94.7 
> 
> I did some test with crushtool trying to figure out how to achieve even data 
> distribution across OSDs. 
> 
> Let's take this simple CRUSH MAP: 
> 
> # begin crush map 
> tunable choose_local_tries 0 
> tunable choose_local_fallback_tries 0 
> tunable choose_total_tries 50 
> tunable chooseleaf_descend_once 1 
> tunable straw_calc_version 1 
> tunable chooseleaf_vary_r 1 
> 
> # devices 
> # ceph-osd-001 
> device 0 osd.0 # sata-p 
> device 1 osd.1 # sata-p 
> device 3 osd.3 # sata-p 
> device 4 osd.4 # sata-p 
> device 5 osd.5 # sata-p 
> device 7 osd.7 # sata-p 
> device 9 osd.9 # sata-p 
> device 10 osd.10 # sata-p 
> device 11 osd.11 # sata-p 
> device 13 osd.13 # sata-p 
> # ceph-osd-002 
> device 14 osd.14 # sata-p 
> device 15 osd.15 # sata-p 
> device 16 osd.16 # sata-p 
> device 18 osd.18 # sata-p 
> device 19 osd.19 # sata-p 
> device 21 osd.21 # sata-p 
> device 23 osd.23 # sata-p 
> device 24 osd.24 # sata-p 
> device 25 osd.25 # sata-p 
> device 26 osd.26 # sata-p 
> # ceph-osd-003 
> device 28 osd.28 # sata-p 
> device 29 osd.29 # sata-p 
> device 30 osd.30 # sata-p 
> device 31 osd.31 # sata-p 
> device 32 osd.32 # sata-p 
> device 33 osd.33 # sata-p 
> device 34 osd.34 # sata-p 
> device 35 osd.35 # sata-p 
> device 36 osd.36 # sata-p 
> device 41 osd.41 # sata-p 
> # types 
> type 0 osd 
> type 1 server 
> type 3 datacenter 
> 
> # buckets 
> 
> ### CEPH-OSD-003 ### 
> server ceph-osd-003-sata-p { 
> id -12 
> alg straw 
> hash 0 # rjenkins1 
> item osd.28 weight 1.000 
> item osd.29 weight 1.000 
> item osd.30 weight 1.000 
> item osd.31 weight 1.000 
> item osd.32 weight 1.000 
> item osd.33 weight 1.000 
> item osd.34 weight 1.000 
> item osd.35 weight 1.000 
> item osd.36 weight 1.000 
> item osd.41 weight 1.000 
> } 
> 
> ### CEPH-OSD-002 ### 
> server ceph-osd-002-sata-p { 
> id -9 
> alg straw 
> hash 0 # rjenkins1 
> item osd.14 weight 1.000 
> item osd.15 weight 1.000 
> item osd.16 weight 1.000 
> item osd.18 weight 1.000 
> item osd.19 weight 1.000 
> item osd.21 weight 1.000 
> item osd.23 weight 1.000 
> item osd.24 weight 1.000 
> item osd.25 weight 1.000 
> item osd.26 weight 1.000 
> } 
> 
> ### CEPH-OSD-001 ### 
> server ceph-osd-001-sata-p { 
> id -5 
> alg straw 
> hash 0 # rjenkins1 
> item osd.0 weight 1.000 
> item osd.1 weight 1.000 
> item osd.3 weight 1.000 
> item osd.4 weight 1.000 
> item osd.5 weight 1.000 
> item osd.7 weight 1.000 
> item osd.9 weight 1.000 
> item osd.10 weight 1.000 
> item osd.11 weight 1.000 
> item osd.13 weight 1.000 
> } 
> 
> # DATACENTER 
> datacenter dc1 { 
> id -1 
> alg straw 
> hash 0 # rjenkins1 
> item ceph-osd-001-sata-p weight 10.000 
> item ceph-osd-002-sata-p weight 10.000 
> item ceph-osd-003-sata-p weight 10.000 
> } 
> 
> # rules 
> rule sata-p { 
> ruleset 0 
> type replicated 
> min_size 2 
> max_size 10 
> step take dc1 
> step chooseleaf firstn 0 type server 
> step emit 
> } 
> 
> # end crush map 
> 
> 
> Basically it's 30 OSDs spanned across 3 servers. One rule exists, the classic 
> replica-3 
> 
> 
> cephadm@cephadm01:/etc/ceph/$ crushtool -i crushprova.c --test 
> --show-utili

Re: [ceph-users] Even data distribution across OSD - Impossible Achievement?

2016-10-17 Thread Christian Balzer

Hello,

On Mon, 17 Oct 2016 09:42:09 +0200 (CEST) i...@witeq.com wrote:

> Hi Wido, 
> 
> thanks for the explanation. Generally speaking, what is the best practice when
> a couple of OSDs are reaching near-full capacity?
> 
This has (of course) been discussed here many times.
Google is your friend (when it's not creepy).

> I could set their weight to something like 0.9, but this seems like only a
> temporary solution.
> Of course I can add more OSDs, but that radically changes my perspective in
> terms of capacity planning. What would you do in production?
>

Manually re-weighting (CRUSH weight, not reweight) is one approach; IMHO it's
better to give the least-utilized OSDs a higher score than the other way
around, while keeping per-node scores as equal as possible.

Doing the reweight-by-utilization dance, which in the latest Hammer
versions is much improved and has a dry-run option, is another approach.
I don't like it because it's a temporary setting, lost if the OSD is ever
set OUT.
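
To make the two approaches concrete (a sketch only: the OSD ids are taken from the crushtool output earlier in the thread, the weights and threshold are made-up examples, and the dry-run command depends on running a recent enough 0.94.x):

# permanent CRUSH weight change, survives the OSD being marked out:
ceph osd crush reweight osd.29 1.05     # nudge an under-utilized OSD up
ceph osd crush reweight osd.23 0.95     # or an over-full one down

# temporary override weights driven by current utilization:
ceph osd test-reweight-by-utilization 110   # dry run, where available
ceph osd reweight-by-utilization 110        # apply, threshold in percent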

Both of these can create a very even cluster, but may make imbalances
during OSD adds/removals worse than otherwise.

The larger your cluster gets, the less likely you'll run into extreme
outliers, but of course it's something to monitor (graph) anyway.
The smaller your cluster, the less painful it is to manually adjust things.

If you have only one or a few over-utilized OSDs and don't want to add more
OSDs, fiddle with their weights.

However, if one OSD is getting near-full I'd also take that as a hint to check
the numbers, i.e. what would happen if you lost an OSD (or two) or a host:
could Ceph survive this without everything getting full?

Christian
 