Re: [ceph-users] how to improve ceph cluster capacity usage

2015-09-08 Thread Gregory Farnum
On Tue, Sep 1, 2015 at 3:58 PM, huang jun  wrote:
> hi,all
>
> Recently, i did some experiments on OSD data distribution,
> we set up a cluster with 72 OSDs,all 2TB sata disk,
> and ceph version is v0.94.3 and linux kernel version is 3.18,
> and set "ceph osd crush tunables optimal".
> There are 3 pools:
> pool 0 'rbd' replicated size 2 min_size 1 crush_ruleset 0 object_hash
> rjenkins pg_num 512 pgp_num 512 last_change 302 stripe_width 0
> pool 1 'data' replicated size 2 min_size 1 crush_ruleset 0 object_hash
> rjenkins pg_num 4096 pgp_num 4096 last_change 832
> crash_replay_interval 45 stripe_width 0
> pool 2 'metadata' replicated size 3 min_size 1 crush_ruleset 0
> object_hash rjenkins pg_num 512 pgp_num 512 last_change 302
> stripe_width 0
>
> the osd pg num of each osd:
> pool  :  0    1    2  | SUM
> ----------------------------
> osd.0   13  105  18  | 136
> osd.1   17  110  26  | 153
> osd.2   15  114  20  | 149
> osd.3   11  101  17  | 129
> osd.4    8  106  17  | 131
> osd.5   12  102  19  | 133
> osd.6   19  114  29  | 162
> osd.7   16  115  21  | 152
> osd.8   15  117  25  | 157
> osd.9   13  117  23  | 153
> osd.10  13  133  16  | 162
> osd.11  14  105  21  | 140
> osd.12  11   94  16  | 121
> osd.13  12  110  21  | 143
> osd.14  20  119  26  | 165
> osd.15  12  125  19  | 156
> osd.16  15  126  22  | 163
> osd.17  13  109  19  | 141
> osd.18   8  119  19  | 146
> osd.19  14  114  19  | 147
> osd.20  17  113  29  | 159
> osd.21  17  111  27  | 155
> osd.22  13  121  20  | 154
> osd.23  14   95  23  | 132
> osd.24  17  110  26  | 153
> osd.25  13  133  15  | 161
> osd.26  17  124  24  | 165
> osd.27  16  119  20  | 155
> osd.28  19  134  30  | 183
> osd.29  13  121  20  | 154
> osd.30  11   97  20  | 128
> osd.31  12  109  18  | 139
> osd.32  10  112  15  | 137
> osd.33  18  114  28  | 160
> osd.34  19  112  29  | 160
> osd.35  16  121  32  | 169
> osd.36  13  111  18  | 142
> osd.37  15  107  22  | 144
> osd.38  21  129  24  | 174
> osd.39   9  121  17  | 147
> osd.40  11  102  18  | 131
> osd.41  14  101  19  | 134
> osd.42  16  119  25  | 160
> osd.43  12  118  13  | 143
> osd.44  17  114  25  | 156
> osd.45  11  114  15  | 140
> osd.46  12  107  16  | 135
> osd.47  15  111  23  | 149
> osd.48  14  115  20  | 149
> osd.49   9   94  13  | 116
> osd.50  14  117  18  | 149
> osd.51  13  112  19  | 144
> osd.52  11  126  22  | 159
> osd.53  12  122  18  | 152
> osd.54  13  121  20  | 154
> osd.55  17  114  25  | 156
> osd.56  11  118  18  | 147
> osd.57  22  137  25  | 184
> osd.58  15  105  22  | 142
> osd.59  13  120  18  | 151
> osd.60  12  110  19  | 141
> osd.61  21  114  28  | 163
> osd.62  12   97  18  | 127
> osd.63  19  109  31  | 159
> osd.64  10  132  21  | 163
> osd.65  19  137  21  | 177
> osd.66  22  107  32  | 161
> osd.67  12  107  20  | 139
> osd.68  14  100  22  | 136
> osd.69  16  110  24  | 150
> osd.70   9  101  14  | 124
> osd.71  15  112  24  | 151
>
> ----------------------------
> SUM   : 1024 8192 1536 |
>
> We can see that, for pool id 1 (the data pool),
> osd.57 and osd.65 both have 137 PGs while osd.12 and osd.49 have only 94 PGs,
> which may cause data distribution imbalance and reduce the usable
> capacity of the cluster.
>
> Using "crushtool -i crush.raw --test --show-mappings --rule 0 --num-rep
> 2 --min-x 1 --max-x %s",
> we tested different pool pg_num values:
>
> Total PG num  PG num stats
> ------------  ------------
> 4096  avg: 113.78  (avg = average PG count per OSD)
>       total: 8192  (total = total PG count, including replicas)
>       max: 139  +0.221680  (max = maximum PG count on an OSD; the
>       ratio is its deviation above the average)
>       min: 113  -0.226562  (min = minimum PG count on an OSD; the
>       ratio is its deviation below the average)
>
> 8192 avg: 227.56
> total: 16384
> max: 267 0.173340
> min: 226 -0.129883
>
> 16384 avg: 455.11
> total: 32768
> max: 502 0.103027
> min: 455 -0.127686
>
> 32768 avg: 910.22
> total: 65536
> max: 966 0.061279
> min: 910 -0.076050
>
> With a bigger pg_num, the gap between the maximum and the minimum shrinks.
> But it is unreasonable to set such a large pg_num, which would increase
> the OSD and MON load.
>
> Is there any way to get a more balanced PG distribution across the cluster?
> We tried "ceph osd reweight-by-pg 110 data" many times, but that did
> not resolve the problem.

The numbers you're seeing here
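The shrinking max/min spread reported above can be reproduced outside crushtool with a toy placement simulation. This is a sketch, not CRUSH: it picks replica OSDs with a plain uniform hash, so the exact numbers will differ from crushtool's, but the trend (spread falls as pg_num grows) is the same.

```python
import hashlib

NUM_OSDS = 72
REPLICAS = 2

def place_pg(pg: int) -> list:
    """Pick REPLICAS distinct OSDs for a PG via a stable hash (stand-in for CRUSH)."""
    chosen, salt = [], 0
    while len(chosen) < REPLICAS:
        digest = hashlib.md5(f"{pg}:{salt}".encode()).hexdigest()
        osd = int(digest, 16) % NUM_OSDS
        if osd not in chosen:
            chosen.append(osd)
        salt += 1
    return chosen

spread = {}
for pg_num in (4096, 8192, 16384, 32768):
    counts = [0] * NUM_OSDS
    for pg in range(pg_num):
        for osd in place_pg(pg):
            counts[osd] += 1
    avg = sum(counts) / NUM_OSDS
    spread[pg_num] = (max(counts) / avg - 1, 1 - min(counts) / avg)
    print(pg_num, "max: %+.4f" % spread[pg_num][0], "min: -%.4f" % spread[pg_num][1])
```

Per-OSD PG counts behave roughly binomially, so the relative deviation scales like 1/sqrt(PGs per OSD), matching the trend in the crushtool results.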

Re: [ceph-users] how to improve ceph cluster capacity usage

2015-09-02 Thread huang jun
After searching the source code, I found the ceph_psim tool, which can
simulate object distribution,
but it seems a little simplistic.
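A minimal simulator in the same spirit can be sketched in a few lines of Python (hypothetical, not the actual ceph_psim code): hash object names to PGs, map PGs to OSDs, and compare per-OSD usage.

```python
import hashlib
import random

PG_NUM = 4096
NUM_OSDS = 72
REPLICAS = 2
OBJECT_SIZE = 4 * 1024 * 1024  # assume uniform 4 MB objects

# Hypothetical PG -> OSD map; a real run would take this from CRUSH.
random.seed(42)
pg_to_osds = [random.sample(range(NUM_OSDS), REPLICAS) for _ in range(PG_NUM)]

def object_to_pg(name: str) -> int:
    # Stand-in for Ceph's rjenkins object hash; any uniform hash works here.
    return int(hashlib.md5(name.encode()).hexdigest(), 16) % PG_NUM

bytes_per_osd = [0] * NUM_OSDS
for i in range(100_000):
    for osd in pg_to_osds[object_to_pg(f"obj_{i}")]:
        bytes_per_osd[osd] += OBJECT_SIZE

avg = sum(bytes_per_osd) / NUM_OSDS
print("max/avg: %.3f  min/avg: %.3f"
      % (max(bytes_per_osd) / avg, min(bytes_per_osd) / avg))
```

Swapping in a real PG-to-OSD map (e.g. parsed from `ceph pg dump`) would turn this into a rough capacity-usage predictor for an actual cluster.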



2015-09-01 22:58 GMT+08:00 huang jun :
> [...]

[ceph-users] how to improve ceph cluster capacity usage

2015-09-01 Thread huang jun
hi,all

Recently, i did some experiments on OSD data distribution,
we set up a cluster with 72 OSDs,all 2TB sata disk,
and ceph version is v0.94.3 and linux kernel version is 3.18,
and set "ceph osd crush tunables optimal".
There are 3 pools:
pool 0 'rbd' replicated size 2 min_size 1 crush_ruleset 0 object_hash
rjenkins pg_num 512 pgp_num 512 last_change 302 stripe_width 0
pool 1 'data' replicated size 2 min_size 1 crush_ruleset 0 object_hash
rjenkins pg_num 4096 pgp_num 4096 last_change 832
crash_replay_interval 45 stripe_width 0
pool 2 'metadata' replicated size 3 min_size 1 crush_ruleset 0
object_hash rjenkins pg_num 512 pgp_num 512 last_change 302
stripe_width 0

the osd pg num of each osd:
pool  :  0    1    2  | SUM
----------------------------
osd.0   13  105  18  | 136
osd.1   17  110  26  | 153
osd.2   15  114  20  | 149
osd.3   11  101  17  | 129
osd.4    8  106  17  | 131
osd.5   12  102  19  | 133
osd.6   19  114  29  | 162
osd.7   16  115  21  | 152
osd.8   15  117  25  | 157
osd.9   13  117  23  | 153
osd.10  13  133  16  | 162
osd.11  14  105  21  | 140
osd.12  11   94  16  | 121
osd.13  12  110  21  | 143
osd.14  20  119  26  | 165
osd.15  12  125  19  | 156
osd.16  15  126  22  | 163
osd.17  13  109  19  | 141
osd.18   8  119  19  | 146
osd.19  14  114  19  | 147
osd.20  17  113  29  | 159
osd.21  17  111  27  | 155
osd.22  13  121  20  | 154
osd.23  14   95  23  | 132
osd.24  17  110  26  | 153
osd.25  13  133  15  | 161
osd.26  17  124  24  | 165
osd.27  16  119  20  | 155
osd.28  19  134  30  | 183
osd.29  13  121  20  | 154
osd.30  11   97  20  | 128
osd.31  12  109  18  | 139
osd.32  10  112  15  | 137
osd.33  18  114  28  | 160
osd.34  19  112  29  | 160
osd.35  16  121  32  | 169
osd.36  13  111  18  | 142
osd.37  15  107  22  | 144
osd.38  21  129  24  | 174
osd.39   9  121  17  | 147
osd.40  11  102  18  | 131
osd.41  14  101  19  | 134
osd.42  16  119  25  | 160
osd.43  12  118  13  | 143
osd.44  17  114  25  | 156
osd.45  11  114  15  | 140
osd.46  12  107  16  | 135
osd.47  15  111  23  | 149
osd.48  14  115  20  | 149
osd.49   9   94  13  | 116
osd.50  14  117  18  | 149
osd.51  13  112  19  | 144
osd.52  11  126  22  | 159
osd.53  12  122  18  | 152
osd.54  13  121  20  | 154
osd.55  17  114  25  | 156
osd.56  11  118  18  | 147
osd.57  22  137  25  | 184
osd.58  15  105  22  | 142
osd.59  13  120  18  | 151
osd.60  12  110  19  | 141
osd.61  21  114  28  | 163
osd.62  12   97  18  | 127
osd.63  19  109  31  | 159
osd.64  10  132  21  | 163
osd.65  19  137  21  | 177
osd.66  22  107  32  | 161
osd.67  12  107  20  | 139
osd.68  14  100  22  | 136
osd.69  16  110  24  | 150
osd.70   9  101  14  | 124
osd.71  15  112  24  | 151

----------------------------
SUM   : 1024 8192 1536 |

We can see that, for pool id 1 (the data pool),
osd.57 and osd.65 both have 137 PGs while osd.12 and osd.49 have only 94 PGs,
which may cause data distribution imbalance and reduce the usable
capacity of the cluster.

Using "crushtool -i crush.raw --test --show-mappings --rule 0 --num-rep
2 --min-x 1 --max-x %s",
we tested different pool pg_num values:

Total PG num  PG num stats
------------  ------------
4096  avg: 113.78  (avg = average PG count per OSD)
      total: 8192  (total = total PG count, including replicas)
      max: 139  +0.221680  (max = maximum PG count on an OSD; the
      ratio is its deviation above the average)
      min: 113  -0.226562  (min = minimum PG count on an OSD; the
      ratio is its deviation below the average)

8192 avg: 227.56
total: 16384
max: 267 0.173340
min: 226 -0.129883

16384 avg: 455.11
total: 32768
max: 502 0.103027
min: 455 -0.127686

32768 avg: 910.22
total: 65536
max: 966 0.061279
min: 910 -0.076050

With a bigger pg_num, the gap between the maximum and the minimum shrinks.
But it is unreasonable to set such a large pg_num, which would increase
the OSD and MON load.

Is there any way to get a more balanced PG distribution across the cluster?
We tried "ceph osd reweight-by-pg 110 data" many times, but that did
not resolve the problem.
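For what it's worth, the idea behind reweight-by-pg can be sketched as follows (a hypothetical helper, not the actual Ceph implementation): lower the reweight of any OSD whose PG count sits above a threshold over the average, so that its expected share drops back toward the mean.

```python
def reweight_overfull(pg_counts, threshold=1.10):
    """Suggest reweights (0..1] for OSDs whose PG count exceeds
    threshold * average, mimicking the spirit of reweight-by-pg."""
    avg = sum(pg_counts) / len(pg_counts)
    reweights = {}
    for osd, count in enumerate(pg_counts):
        if count > threshold * avg:
            # Lower the weight proportionally so the expected count drops to avg.
            reweights[osd] = round(avg / count, 4)
    return reweights

# A few pool-1 extremes from the table above:
print(reweight_overfull([137, 94, 113, 133, 105]))
```

A single pass like this shifts PGs off the overfull OSDs, but CRUSH then re-randomizes placement, which is why repeated invocations converge slowly, as observed in this thread.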

Another question: if we can ensure PGs are distributed evenly, does
that also guarantee the data
distribution is as balanced as the PGs?
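One way to reason about it: even a perfectly even PG distribution balances data only statistically, because objects are hashed into PGs. As a back-of-the-envelope model (assuming uniform hashing and equal object sizes), with N objects over P PGs the relative standard deviation of objects per PG is roughly 1/sqrt(N/P), so data balance improves as the PGs fill up:

```python
import math

def rel_std_objects_per_pg(num_objects: int, pg_num: int) -> float:
    # Poisson-like model: relative std dev of objects per PG is
    # about 1 / sqrt(objects per PG).
    return 1.0 / math.sqrt(num_objects / pg_num)

for n in (10_000, 1_000_000, 100_000_000):
    print(n, "objects:", "%.4f" % rel_std_objects_per_pg(n, 4096))
```

So with few large objects, per-PG (and hence per-OSD) data can deviate noticeably even when PG counts are perfectly even.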

By the way, we will write data to this cluster until one or more OSDs
get full; we set the full ratio to 0.98,
and we expect the cluster to be able to use 0.9 of its total capacity.
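A quick sanity check against that 0.9 target, using the pool-1 numbers above and assuming space usage tracks PG count: the busiest OSD reaches the 0.98 full ratio first, so the usable fraction of raw capacity scales with avg/max.

```python
avg_pgs = 8192 / 72   # average pool-1 PG count per OSD (about 113.78)
max_pgs = 137         # busiest OSDs in the table (osd.57, osd.65)
full_ratio = 0.98

# Cluster-wide usable fraction when the fullest OSD hits full_ratio.
usable_fraction = full_ratio * avg_pgs / max_pgs
print("usable fraction: %.3f" % usable_fraction)  # falls short of the 0.9 target
```

Under this model, reaching 0.9 would require the busiest OSD to carry no more than about 9% over the average PG count, which is why the distribution question matters for capacity.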

Any tips are welcome.

-- 
thanks
huangjun