[ceph-users] Re: Balancing with upmap

2021-02-01 Thread Francois Legrand
This is the PG distribution per pool and per OSD, as given by the command I found here:
http://cephnotes.ksperis.com/blog/2015/02/23/get-the-number-of-placement-groups-per-osd



pool :    35   44   36   31   32   33    2   34    43   | SUM

osd.0 0    5    26    1    1    1    1    1    2    | 38
osd.1 1    6    23    1    1    1    2    1    1    | 37
osd.2 1    6    22    1    1    1    2    1    2    | 37
osd.3 1    11   35    1    1    1    3    2    3    | 58
osd.4 1    6    23    1    1    1    1    1    1    | 36
osd.5 0    6    22    1    1    1    1    1    1    | 34
osd.6 1    6    23    1    1    1    1    1    2    | 37
osd.7 1    6    22    1    1    1    2    1    2    | 37
osd.8 1    5    26    1    1    1    2    1    1    | 39
osd.9 0    5    26    1    0    1    2    1    2    | 38
osd.10    1    6    30    1    1    1    2    1    2    | 45
osd.11    1    6    26    1    1    1    2    0    2    | 40
osd.12    1    5    25    1    1    1    1    1    1    | 37
osd.13    1    6    22    1    1    1    2    1    1    | 36
osd.14    1    8    22    1    1    1    2    1    3    | 40
osd.15    1    6    26    1    1    1    1    1    2    | 40
osd.16    0    6    23    1    1    1    2    1    2    | 37
osd.17    1    5    25    1    0    0    2    1    1    | 36
osd.18    1    6    28    1    1    1    2    1    2    | 43
osd.19    1    6    22    1    1    1    2    1    2    | 37
osd.20    1    5    22    1    1    1    1    1    1    | 34
osd.21    0    5    22    1    1    0    1    1    2    | 33
osd.22    1    6    30    1    1    1    2    1    3    | 46
osd.23    1    9    35    1    1    1    3    2    2    | 55
osd.24    1    5    24    1    1    1    2    1    2    | 38
osd.25    1    6    24    1    1    1    1    1    2    | 38
osd.26    1    8    23    1    1    1    1    1    2    | 39
osd.27    1    6    22    1    1    1    1    0    1    | 34
osd.28    0    5    22    1    1    1    1    0    1    | 32
osd.29    1    5    24    0    1    1    1    1    0    | 34
osd.30    1    6    24    1    1    1    2    1    2    | 39
osd.31    1    6    22    1    1    1    1    1    1    | 35
osd.32    0    5    25    1    0    1    1    1    0    | 34
osd.33    1    5    25    1    1    0    1    1    1    | 36
osd.34    0    9    28    1    1    1    1    0    2    | 43
osd.35    1    6    22    1    1    1    2    1    2    | 37
osd.36    1    5    25    0    0    1    1    0    2    | 35
osd.37    1    5    24    0    0    0    1    1    2    | 34
osd.38    0    6    26    1    1    1    1    0    2    | 38
osd.39    1    6    23    1    1    1    1    0    2    | 36
osd.40    1    6    24    1    1    0    2    1    1    | 37
osd.41    1    6    22    1    0    0    1    0    2    | 33
osd.42    1    7    25    1    1    1    2    1    2    | 41
osd.43    1    6    24    0    1    1    1    0    1    | 35
osd.44    1    6    24    0    0    1    2    1    1    | 36
osd.45    1    5    25    0    1    0    2    0    1    | 35
osd.46    1    6    22    0    1    1    1    1    2    | 35
osd.47    1    6    26    1    1    1    2    1    2    | 41
osd.48    0    5    22    0    1    1    1    1    2    | 33
osd.49    1    5    26    1    0    0    2    1    0    | 36
osd.50    1    9    23    1    1    1    2    0    2    | 40
osd.51    1    6    22    0    1    0    2    0    2    | 34
osd.52    1    5    22    0    1    0    1    1    2    | 33
osd.53    0    5    22    1    1    1    1    0    2    | 33
osd.54    0    6    24    0    1    1    1    0    2    | 35
osd.55    1    6    22    1    0    1    2    0    2    | 35
osd.56    0    6    22    1    1    1    2    1    0    | 34
osd.57    1    6    24    1    1    1    1    0    1    | 36
osd.58    1    6    25    0    1    1    2    1    2    | 39
osd.59    1    6    26    1    1    1    2    0    2    | 40
osd.60    0    4    22    1    1    0    1    0    1    | 30
osd.61    1    5    25    0    0    0    2    1    2    | 36
osd.62    1    9    22    1    1    1    2    1    2    | 40
osd.63    1    9    35    1    1    1    3    1    3    | 55
osd.64    1    11   36    2    1    1    2    1    2    | 57
osd.65    1    8    37    1    1    2    2    1    3    | 56
osd.66    1    9    35    1    1    2    2    2    2    | 55
osd.67    1    10   34    1    1    2    2    2    3    | 56
osd.68    1    9    40    1    1    1    2    1    2    | 58
osd.69    1    8    40    1    1    1    2    1    3    | 58
osd.70    1    8    34    1    1    1    2    1    2    | 51
osd.71    1    11   36    1    1    2    2    1    2    | 57
osd.72    1    6    26    1    1    1    1    1    1    | 39
osd.73    1    8    37    1    1    2    3    1    1    | 55
osd.74    1    6    22    1    0    1    1    1    1    | 34
osd.75    1    6    22    1    1    1    2    1    1    | 36
osd.76    1    6    22    1    0    1    1    1    
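For what it's worth, a rough equivalent of the command above, using the JSON output of ceph pg dump (a sketch only -- the exact JSON wrapping varies between releases, so the pg_stats key handling is an assumption):

ceph pg dump pgs_brief --format json 2>/dev/null | python3 -c '
import json, sys, collections

data = json.load(sys.stdin)
# some releases wrap the list, e.g. {"pg_stats": [...]}
pgs = data.get("pg_stats", data) if isinstance(data, dict) else data

per_osd = collections.defaultdict(lambda: collections.defaultdict(int))
pools = set()
for pg in pgs:
    pool = pg["pgid"].split(".")[0]
    pools.add(pool)
    for osd in pg["up"]:          # count each PG once per OSD in its up set
        per_osd[osd][pool] += 1

cols = sorted(pools, key=int)
print("pool: " + " ".join("%3s" % p for p in cols) + " | SUM")
for osd in sorted(per_osd):
    counts = [per_osd[osd][p] for p in cols]
    print("osd.%s %s | %d" % (osd, " ".join("%3d" % c for c in counts), sum(counts)))
'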

[ceph-users] Re: Balancing with upmap

2021-02-01 Thread Dan van der Ster
On Mon, Feb 1, 2021 at 10:03 AM Francois Legrand  wrote:
>
> Hi,
>
> Actually we have no EC pools... all are replica 3. And we have only 9 pools.
>
> The average number of PGs per OSD is not very high (40.6).
>
> Here are the details of the pools:
>
> pool 2 replicated size 3 min_size 1 crush_rule 2 object_hash rjenkins
> pg_num 64 pgp_num 64 last_change 623105 lfor 0/608315/608313 flags
> hashpspool,selfmanaged_snaps stripe_width 0 application rbd
> pool 31 replicated size 3 min_size 1 crush_rule 2 object_hash rjenkins
> pg_num 32 pgp_num 32 autoscale_mode on last_change 621529 lfor
> 0/0/171563 flags hashpspool,selfmanaged_snaps stripe_width 0 application rbd
> pool 32 replicated size 3 min_size 1 crush_rule 2 object_hash rjenkins
> pg_num 32 pgp_num 32 autoscale_mode on last_change 621529 lfor
> 436085/436085/436085 flags hashpspool,selfmanaged_snaps stripe_width 0
> application rbd
> pool 33 replicated size 3 min_size 1 crush_rule 2 object_hash rjenkins
> pg_num 32 pgp_num 32 autoscale_mode on last_change 621529 lfor
> 0/0/171554 flags hashpspool,selfmanaged_snaps stripe_width 0 application rbd
> pool 34 replicated size 3 min_size 1 crush_rule 2 object_hash rjenkins
> pg_num 32 pgp_num 32 autoscale_mode on last_change 623470 lfor
> 0/0/171558 flags hashpspool,selfmanaged_snaps stripe_width 0 application rbd
> pool 35 replicated size 3 min_size 1 crush_rule 2 object_hash rjenkins
> pg_num 32 pgp_num 32 last_change 621529 lfor 0/598286/598284 flags
> hashpspool stripe_width 0 pg_autoscale_bias 4 pg_num_min 16
> recovery_priority 5 application cephfs
> pool 36 replicated size 3 min_size 1 crush_rule 2 object_hash rjenkins
> pg_num 1024 pgp_num 1024 autoscale_mode warn last_change 624174 flags
> hashpspool,selfmanaged_snaps stripe_width 0 application cephfs
> pool 43 replicated size 3 min_size 2 crush_rule 2 object_hash rjenkins
> pg_num 64 pgp_num 64 autoscale_mode warn last_change 624174 flags
> hashpspool,selfmanaged_snaps stripe_width 0 application cephfs
> pool 44 replicated size 3 min_size 3 crush_rule 2 object_hash rjenkins
> pg_num 256 pgp_num 256 autoscale_mode warn last_change 622177 lfor
> 0/0/449412 flags hashpspool,selfmanaged_snaps stripe_width 0
> expected_num_objects 400 target_size_bytes 17592186044416 application rbd
>
> Pools 35 (metadata), 36 and 43 (data) are for cephfs.
>

How does the distribution for pool 36 look? This pool has the best
chance to be balanced -- the others have too few PGs so you shouldn't
even be worried.
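For example, with the ceph-pool-pg-distribution script mentioned elsewhere in this thread:

./ceph-pool-pg-distribution 36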

> The key point is probably the crush rule. Indeed, as we have servers in 2
> different rooms, we have a crush rule to ensure that at least one copy
> of the data is stored in each room (for disaster recovery):
>
> {
>  "rule_id": 2,
>  "rule_name": "replicated3over2rooms",
>  "ruleset": 2,
>  "type": 1,
>  "min_size": 3,
>  "max_size": 4,
>  "steps": [
>  {
>  "op": "take",
>  "item": -1,
>  "item_name": "default"
>  },
>  {
>  "op": "choose_firstn",
>  "num": 0,
>  "type": "room"
>  },
>  {
>  "op": "chooseleaf_firstn",
>  "num": 2,
>  "type": "host"
>  },
>  {
>  "op": "emit"
>  }
>  ]
>  },
>
> This rule should pick a room, put 2 copies on different hosts in that
> room, and put the third copy on any host in the second room.
>
> I understand that it will not lead to a totally uniform distribution, but
> statistically it should not be too far off.
>
> The distribution of disks between the rooms is the following: 4 servers x 16
> disks of 8 TB in the first room, and 1 server x 24 disks of 16 TB + 1 x 16 +
> 1 x 12 disks of 8 TB in the second room.
>
> This distribution is not homogeneous (4 servers in the first room and 3
> in the second, 64 disks in one room and 52 in the other, and disks of
> different capacities), and we clearly have an excess capacity of 12 x 8 TB
> in the second room (I am aware that this capacity is "lost" for now...
> it will become usable in the future if we add some new servers in the first
> room).

This non-trivial crush rule and "tree imbalance" are probably confusing
the balancer a lot.

-- dan

P.S. min_size 1 will lead to tears down the road
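(It can be raised per pool with, e.g.: ceph osd pool set <poolname> min_size 2.)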

>
> But in theory (which I agree is often far from reality) a rather
> balanced distribution of the data should be reached.
>
> F.
>
>
>
> On 31/01/2021 at 17:30, Dan van der Ster wrote:
> > Hi,
> >
> > I think what's happening is that because you have few PGs and many
> > pools, the balancer cannot achieve a good uniform distribution.
> > The upmap balancer works to make the PGs uniform for each pool
> > individually -- it doesn't look at the total PGs per OSD, so perhaps
> > with your low # PGs per pool per OSD you are just unlucky.
> >
> > You can use a script like this:
> > 

[ceph-users] Re: Balancing with upmap

2021-02-01 Thread Francois Legrand

Hi,

Actually we have no EC pools... all are replica 3. And we have only 9 pools.

The average number of PGs per OSD is not very high (40.6).

Here are the details of the pools:

pool 2 replicated size 3 min_size 1 crush_rule 2 object_hash rjenkins 
pg_num 64 pgp_num 64 last_change 623105 lfor 0/608315/608313 flags 
hashpspool,selfmanaged_snaps stripe_width 0 application rbd
pool 31 replicated size 3 min_size 1 crush_rule 2 object_hash rjenkins 
pg_num 32 pgp_num 32 autoscale_mode on last_change 621529 lfor 
0/0/171563 flags hashpspool,selfmanaged_snaps stripe_width 0 application rbd
pool 32 replicated size 3 min_size 1 crush_rule 2 object_hash rjenkins 
pg_num 32 pgp_num 32 autoscale_mode on last_change 621529 lfor 
436085/436085/436085 flags hashpspool,selfmanaged_snaps stripe_width 0 
application rbd
pool 33 replicated size 3 min_size 1 crush_rule 2 object_hash rjenkins 
pg_num 32 pgp_num 32 autoscale_mode on last_change 621529 lfor 
0/0/171554 flags hashpspool,selfmanaged_snaps stripe_width 0 application rbd
pool 34 replicated size 3 min_size 1 crush_rule 2 object_hash rjenkins 
pg_num 32 pgp_num 32 autoscale_mode on last_change 623470 lfor 
0/0/171558 flags hashpspool,selfmanaged_snaps stripe_width 0 application rbd
pool 35 replicated size 3 min_size 1 crush_rule 2 object_hash rjenkins 
pg_num 32 pgp_num 32 last_change 621529 lfor 0/598286/598284 flags 
hashpspool stripe_width 0 pg_autoscale_bias 4 pg_num_min 16 
recovery_priority 5 application cephfs
pool 36 replicated size 3 min_size 1 crush_rule 2 object_hash rjenkins 
pg_num 1024 pgp_num 1024 autoscale_mode warn last_change 624174 flags 
hashpspool,selfmanaged_snaps stripe_width 0 application cephfs
pool 43 replicated size 3 min_size 2 crush_rule 2 object_hash rjenkins 
pg_num 64 pgp_num 64 autoscale_mode warn last_change 624174 flags 
hashpspool,selfmanaged_snaps stripe_width 0 application cephfs
pool 44 replicated size 3 min_size 3 crush_rule 2 object_hash rjenkins 
pg_num 256 pgp_num 256 autoscale_mode warn last_change 622177 lfor 
0/0/449412 flags hashpspool,selfmanaged_snaps stripe_width 0 
expected_num_objects 400 target_size_bytes 17592186044416 application rbd


Pools 35 (metadata), 36 and 43 (data) are for cephfs.

The key point is probably the crush rule. Indeed, as we have servers in 2
different rooms, we have a crush rule to ensure that at least one copy
of the data is stored in each room (for disaster recovery):


{
    "rule_id": 2,
    "rule_name": "replicated3over2rooms",
    "ruleset": 2,
    "type": 1,
    "min_size": 3,
    "max_size": 4,
    "steps": [
    {
    "op": "take",
    "item": -1,
    "item_name": "default"
    },
    {
    "op": "choose_firstn",
    "num": 0,
    "type": "room"
    },
    {
    "op": "chooseleaf_firstn",
    "num": 2,
    "type": "host"
    },
    {
    "op": "emit"
    }
    ]
    },

This rule should pick a room, put 2 copies on different hosts in that
room, and put the third copy on any host in the second room.
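As a quick sanity check of what the rule actually does (a sketch, assuming the rule id 2 shown above), it can be tested against the compiled crush map with crushtool:

ceph osd getcrushmap -o /tmp/crushmap.bin
crushtool -i /tmp/crushmap.bin --test --rule 2 --num-rep 3 --show-mappings | head
crushtool -i /tmp/crushmap.bin --test --rule 2 --num-rep 3 --show-utilization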


I understand that it will not lead to a totally uniform distribution, but
statistically it should not be too far off.


The distribution of disks between the rooms is the following: 4 servers x 16
disks of 8 TB in the first room, and 1 server x 24 disks of 16 TB + 1 x 16 +
1 x 12 disks of 8 TB in the second room.


This distribution is not homogeneous (4 servers in the first room and 3
in the second, 64 disks in one room and 52 in the other, and disks of
different capacities), and we clearly have an excess capacity of 12 x 8 TB
in the second room (I am aware that this capacity is "lost" for now...
it will become usable in the future if we add some new servers in the first
room).


But in theory (which I agree is often far from reality) a rather
balanced distribution of the data should be reached.


F.



On 31/01/2021 at 17:30, Dan van der Ster wrote:

Hi,

I think what's happening is that because you have few PGs and many
pools, the balancer cannot achieve a good uniform distribution.
The upmap balancer works to make the PGs uniform for each pool
individually -- it doesn't look at the total PGs per OSD, so perhaps
with your low # PGs per pool per OSD you are just unlucky.
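To see which upmap exceptions the balancer has already injected, the pg_upmap_items entries show up in the osdmap dump, e.g.:

ceph osd dump | grep pg_upmap_items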

You can use a script like this:
https://github.com/cernceph/ceph-scripts/blob/master/tools/ceph-pool-pg-distribution
to see the PG distribution for any given pool. E.g on one of my clusters:

# ./ceph-pool-pg-distribution 38
Searching for PGs in pools: ['38']
Summary: 32 pgs on 52 osds

Num OSDs with X PGs:
   1: 21
   2: 20
   3: 9
   4: 2

That shows a pretty non-uniform distribution, because this example
pool id 38 has up to 4 PGs on some OSDs but 1 or 2 on most.
(this is a cluster with the balancer disabled).

The other explanation I can think of is that you have relatively wide
EC pools and few hosts. In that case there would 

[ceph-users] Re: Balancing with upmap

2021-01-31 Thread Francois Legrand

Hi,

After 2 days, the recovery ended. The situation is clearly better (but
still not perfect), with 339.8 TiB available in the pools (out of 575.8 TiB
available in the whole cluster).


The balancing is still not perfect (31 to 47 PGs on the 8 TB disks), and
ceph osd df tree returns:


ID  CLASS WEIGHT REWEIGHT SIZE RAW USE DATA OMAP    META    
AVAIL   %USE  VAR  PGS STATUS TYPE NAME
 -1   1018.65833    -  466 TiB 214 TiB 214 TiB 126 GiB 609 GiB 
251 TiB 0    0   -    root default
-15    465.66577    -  466 TiB 214 TiB 214 TiB 126 GiB 609 GiB 
251 TiB 46.04 1.06   -    room 1222-2-10
 -3    116.41678    -  116 TiB  53 TiB  53 TiB 24 GiB 152 GiB  
64 TiB 45.45 1.05   -    host lpnceph01
  0   hdd    7.27599  1.0  7.3 TiB 3.7 TiB 3.7 TiB 2.5 GiB  16 GiB 
3.5 TiB 51.34 1.18  38 up osd.0
  4   hdd    7.27599  1.0  7.3 TiB 3.2 TiB 3.2 TiB 2.4 GiB 8.7 GiB 
4.1 TiB 44.12 1.01  36 up osd.4
  8   hdd    7.27699  1.0  7.3 TiB 3.5 TiB 3.5 TiB 2.3 GiB 9.3 GiB 
3.7 TiB 48.52 1.12  39 up osd.8
 12   hdd    7.27599  1.0  7.3 TiB 3.4 TiB 3.4 TiB 2.4 GiB 9.5 GiB 
3.9 TiB 46.69 1.07  37 up osd.12
 16   hdd    7.27599  1.0  7.3 TiB 3.5 TiB 3.4 TiB 38 MiB 9.7 GiB 
3.8 TiB 47.49 1.09  37 up osd.16
 20   hdd    7.27599  1.0  7.3 TiB 3.1 TiB 3.0 TiB 2.4 GiB 8.7 GiB 
4.2 TiB 41.95 0.96  34 up osd.20
 24   hdd    7.27599  1.0  7.3 TiB 3.5 TiB 3.5 TiB 2.3 GiB 9.8 GiB 
3.8 TiB 48.45 1.11  38 up osd.24
 28   hdd    7.27599  1.0  7.3 TiB 3.0 TiB 3.0 TiB 55 MiB 8.2 GiB 
4.2 TiB 41.74 0.96  32 up osd.28
 32   hdd    7.27599  1.0  7.3 TiB 3.2 TiB 3.1 TiB 32 MiB 8.4 GiB 
4.1 TiB 43.33 1.00  34 up osd.32
 36   hdd    7.27599  1.0  7.3 TiB 3.7 TiB 3.7 TiB 2.4 GiB  11 GiB 
3.6 TiB 50.50 1.16  35 up osd.36
 40   hdd    7.27599  1.0  7.3 TiB 3.4 TiB 3.3 TiB 2.4 GiB 9.1 GiB 
3.9 TiB 46.15 1.06  37 up osd.40
 44   hdd    7.27599  1.0  7.3 TiB 3.4 TiB 3.4 TiB 2.3 GiB 9.2 GiB 
3.9 TiB 46.28 1.06  36 up osd.44
 48   hdd    7.27599  1.0  7.3 TiB 3.3 TiB 3.3 TiB 92 MiB 8.8 GiB 
4.0 TiB 44.88 1.03  33 up osd.48
 52   hdd    7.27599  1.0  7.3 TiB 3.3 TiB 3.3 TiB 2.4 GiB 9.0 GiB 
4.0 TiB 44.86 1.03  33 up osd.52
 56   hdd    7.27599  1.0  7.3 TiB 2.9 TiB 2.9 TiB 23 MiB 8.3 GiB 
4.4 TiB 39.79 0.92  34 up osd.56
 60   hdd    7.27599  1.0  7.3 TiB 3.0 TiB 3.0 TiB 40 MiB 8.3 GiB 
4.3 TiB 41.12 0.95  30 up osd.60
 -5    116.41600    -  116 TiB  54 TiB  54 TiB 30 GiB 150 GiB  
63 TiB 46.12 1.06   -    host lpnceph02
  1   hdd    7.27599  1.0  7.3 TiB 3.2 TiB 3.2 TiB 2.2 GiB 8.9 GiB 
4.0 TiB 44.53 1.02  37 up osd.1
  5   hdd    7.27599  1.0  7.3 TiB 3.1 TiB 3.1 TiB 24 MiB 8.3 GiB 
4.2 TiB 42.56 0.98  34 up osd.5
  9   hdd    7.27599  1.0  7.3 TiB 3.8 TiB 3.8 TiB 42 MiB  11 GiB 
3.4 TiB 52.61 1.21  38 up osd.9
 13   hdd    7.27599  1.0  7.3 TiB 3.1 TiB 3.1 TiB 2.3 GiB 9.7 GiB 
4.2 TiB 42.89 0.99  36 up osd.13
 17   hdd    7.27599  1.0  7.3 TiB 3.4 TiB 3.4 TiB 2.3 GiB 9.1 GiB 
3.9 TiB 46.80 1.08  36 up osd.17
 21   hdd    7.27599  1.0  7.3 TiB 3.3 TiB 3.3 TiB 41 MiB 9.2 GiB 
4.0 TiB 44.90 1.03  33 up osd.21
 25   hdd    7.27599  1.0  7.3 TiB 3.5 TiB 3.5 TiB 2.4 GiB 9.4 GiB 
3.7 TiB 48.75 1.12  38 up osd.25
 29   hdd    7.27599  1.0  7.3 TiB 3.0 TiB 3.0 TiB 2.3 GiB 8.7 GiB 
4.2 TiB 41.91 0.96  34 up osd.29
 33   hdd    7.27599  1.0  7.3 TiB 3.4 TiB 3.4 TiB 2.3 GiB 9.4 GiB 
3.9 TiB 46.60 1.07  36 up osd.33
 37   hdd    7.27599  1.0  7.3 TiB 3.5 TiB 3.5 TiB 4.6 GiB  10 GiB 
3.8 TiB 47.90 1.10  34 up osd.37
 41   hdd    7.27599  1.0  7.3 TiB 3.3 TiB 3.3 TiB 2.2 GiB  11 GiB 
3.9 TiB 45.91 1.06  33 up osd.41
 45   hdd    7.27599  1.0  7.3 TiB 3.4 TiB 3.4 TiB 2.4 GiB 9.3 GiB 
3.9 TiB 46.85 1.08  35 up osd.45
 49   hdd    7.27599  1.0  7.3 TiB 3.3 TiB 3.3 TiB 2.3 GiB 8.9 GiB 
4.0 TiB 45.35 1.04  36 up osd.49
 53   hdd    7.27599  1.0  7.3 TiB 3.3 TiB 3.3 TiB 36 MiB 9.0 GiB 
4.0 TiB 44.85 1.03  33 up osd.53
 57   hdd    7.27599  1.0  7.3 TiB 3.3 TiB 3.3 TiB 2.3 GiB 9.0 GiB 
4.0 TiB 45.67 1.05  36 up osd.57
 61   hdd    7.27599  1.0  7.3 TiB 3.6 TiB 3.6 TiB 2.4 GiB 9.8 GiB 
3.7 TiB 49.75 1.14  36 up osd.61
 -9    116.41600    -  116 TiB  56 TiB  56 TiB 35 GiB 159 GiB  
61 TiB 48.03 1.10   -    host lpnceph04
  7   hdd    7.27599  1.0  7.3 TiB 3.3 TiB 3.3 TiB 2.4 GiB 9.4 GiB 
3.9 TiB 45.96 1.06  37 up osd.7
 11   hdd    7.27599  1.0  7.3 TiB 3.9 TiB 3.9 TiB 4.7 GiB  11 GiB 
3.4 TiB 53.20 1.22  40 up osd.11
 15   hdd    7.27599  1.0  7.3 TiB 3.8 TiB 3.8 TiB 2.3 GiB 9.8 GiB 
3.5 TiB 51.84 1.19  40 up osd.15
 27   hdd    7.27599  1.0  7.3 TiB 3.1 TiB 3.1 TiB 2.3 GiB 8.5 GiB 
4.2 TiB 42.50 0.98  34 up osd.27
 31   hdd    7.27599  1.0  7.3 TiB 3.1 TiB 3.1 TiB 2.2 GiB 8.7 GiB 
4.2 TiB 42.61 0.98  

[ceph-users] Re: Balancing with upmap

2021-01-30 Thread Francois Legrand

Hi,

Thanks for your advice. Here is the output of ceph osd df tree:

ID  CLASS WEIGHT REWEIGHT SIZE RAW USE DATA OMAP    META    
AVAIL   %USE  VAR  PGS STATUS TYPE NAME
 -1   1018.65833    -  466 TiB 214 TiB 213 TiB 117 GiB 605 GiB 
252 TiB 0    0   -    root default
-15    465.66577    -  466 TiB 214 TiB 213 TiB 117 GiB 605 GiB 
252 TiB 45.88 1.06   -    room 1222-2-10
 -3    116.41678    -  116 TiB  52 TiB  52 TiB 24 GiB 153 GiB  
64 TiB 44.91 1.04   -    host lpnceph01
  0   hdd    7.27599  1.0  7.3 TiB 3.6 TiB 3.6 TiB 2.5 GiB  16 GiB 
3.7 TiB 49.31 1.14  35 up osd.0
  4   hdd    7.27599  1.0  7.3 TiB 3.2 TiB 3.1 TiB 2.4 GiB 8.5 GiB 
4.1 TiB 43.39 1.00  35 up osd.4
  8   hdd    7.27699  1.0  7.3 TiB 3.1 TiB 3.1 TiB 2.3 GiB 9.1 GiB 
4.1 TiB 43.23 1.00  33 up osd.8
 12   hdd    7.27599  1.0  7.3 TiB 3.0 TiB 3.0 TiB 2.4 GiB 8.8 GiB 
4.3 TiB 40.85 0.94  32 up osd.12
 16   hdd    7.27599  1.0  7.3 TiB 3.5 TiB 3.5 TiB 40 MiB 9.7 GiB 
3.8 TiB 47.95 1.11  36 up osd.16
 20   hdd    7.27599  1.0  7.3 TiB 2.8 TiB 2.8 TiB 2.4 GiB 8.3 GiB 
4.5 TiB 38.00 0.88  33 up osd.20
 24   hdd    7.27599  1.0  7.3 TiB 3.6 TiB 3.6 TiB 2.3 GiB  10 GiB 
3.6 TiB 49.98 1.15  37 up osd.24
 28   hdd    7.27599  1.0  7.3 TiB 2.6 TiB 2.6 TiB 50 MiB 8.3 GiB 
4.7 TiB 35.39 0.82  26 up osd.28
 32   hdd    7.27599  1.0  7.3 TiB 3.2 TiB 3.2 TiB 31 MiB 9.3 GiB 
4.1 TiB 44.21 1.02  32 up osd.32
 36   hdd    7.27599  1.0  7.3 TiB 4.2 TiB 4.2 TiB 2.6 GiB  11 GiB 
3.1 TiB 57.79 1.33  37 up osd.36
 40   hdd    7.27599  1.0  7.3 TiB 3.5 TiB 3.5 TiB 2.4 GiB 9.1 GiB 
3.8 TiB 47.84 1.10  42 up osd.40
 44   hdd    7.27599  1.0  7.3 TiB 3.5 TiB 3.5 TiB 2.3 GiB 9.2 GiB 
3.8 TiB 48.44 1.12  39 up osd.44
 48   hdd    7.27599  1.0  7.3 TiB 3.1 TiB 3.0 TiB 91 MiB 9.0 GiB 
4.2 TiB 41.93 0.97  30 up osd.48
 52   hdd    7.27599  1.0  7.3 TiB 3.5 TiB 3.5 TiB 2.4 GiB 9.7 GiB 
3.8 TiB 47.59 1.10  33 up osd.52
 56   hdd    7.27599  1.0  7.3 TiB 3.0 TiB 3.0 TiB 23 MiB 8.2 GiB 
4.2 TiB 41.88 0.97  42 up osd.56
 60   hdd    7.27599  1.0  7.3 TiB 3.0 TiB 3.0 TiB 38 MiB 8.3 GiB 
4.3 TiB 40.76 0.94  29 up osd.60
 -5    116.41600    -  116 TiB  54 TiB  53 TiB 28 GiB 150 GiB  
63 TiB 46.02 1.06   -    host lpnceph02
  1   hdd    7.27599  1.0  7.3 TiB 2.9 TiB 2.9 TiB 26 MiB 8.0 GiB 
4.4 TiB 40.19 0.93  34 up osd.1
  5   hdd    7.27599  1.0  7.3 TiB 2.7 TiB 2.7 TiB 26 MiB 7.9 GiB 
4.6 TiB 36.92 0.85  26 up osd.5
  9   hdd    7.27599  1.0  7.3 TiB 4.0 TiB 4.0 TiB 42 MiB  11 GiB 
3.3 TiB 54.44 1.26  38 up osd.9
 13   hdd    7.27599  1.0  7.3 TiB 3.0 TiB 3.0 TiB 2.3 GiB 9.6 GiB 
4.3 TiB 41.47 0.96  37 up osd.13
 17   hdd    7.27599  1.0  7.3 TiB 3.4 TiB 3.4 TiB 2.3 GiB 9.4 GiB 
3.9 TiB 46.79 1.08  37 up osd.17
 21   hdd    7.27599  1.0  7.3 TiB 3.2 TiB 3.2 TiB 41 MiB 9.2 GiB 
4.1 TiB 44.18 1.02  30 up osd.21
 25   hdd    7.27599  1.0  7.3 TiB 3.7 TiB 3.7 TiB 2.4 GiB  10 GiB 
3.5 TiB 51.33 1.19  41 up osd.25
 29   hdd    7.27599  1.0  7.3 TiB 3.1 TiB 3.1 TiB 2.4 GiB 8.7 GiB 
4.2 TiB 42.14 0.97  35 up osd.29
 33   hdd    7.27599  1.0  7.3 TiB 3.5 TiB 3.5 TiB 2.3 GiB 9.4 GiB 
3.8 TiB 48.01 1.11  39 up osd.33
 37   hdd    7.27599  1.0  7.3 TiB 3.2 TiB 3.2 TiB 4.5 GiB 9.8 GiB 
4.0 TiB 44.57 1.03  30 up osd.37
 41   hdd    7.27599  1.0  7.3 TiB 3.8 TiB 3.8 TiB 2.2 GiB  11 GiB 
3.5 TiB 52.50 1.21  36 up osd.41
 45   hdd    7.27599  1.0  7.3 TiB 3.4 TiB 3.4 TiB 2.4 GiB 9.4 GiB 
3.9 TiB 46.87 1.08  36 up osd.45
 49   hdd    7.27599  1.0  7.3 TiB 3.3 TiB 3.3 TiB 2.3 GiB 9.0 GiB 
4.0 TiB 45.39 1.05  39 up osd.49
 53   hdd    7.27599  1.0  7.3 TiB 3.2 TiB 3.2 TiB 37 MiB 8.9 GiB 
4.1 TiB 43.80 1.01  31 up osd.53
 57   hdd    7.27599  1.0  7.3 TiB 3.4 TiB 3.4 TiB 2.3 GiB 9.2 GiB 
3.9 TiB 47.01 1.09  38 up osd.57
 61   hdd    7.27599  1.0  7.3 TiB 3.7 TiB 3.7 TiB 2.4 GiB 9.8 GiB 
3.6 TiB 50.64 1.17  36 up osd.61
 -9    116.41600    -  116 TiB  56 TiB  56 TiB 31 GiB 158 GiB  
60 TiB 48.12 1.11   -    host lpnceph04
  7   hdd    7.27599  1.0  7.3 TiB 3.3 TiB 3.3 TiB 2.4 GiB 9.2 GiB 
3.9 TiB 45.74 1.06  34 up osd.7
 11   hdd    7.27599  1.0  7.3 TiB 3.9 TiB 3.9 TiB 7.1 GiB  11 GiB 
3.4 TiB 53.24 1.23  39 up osd.11
 15   hdd    7.27599  1.0  7.3 TiB 3.5 TiB 3.5 TiB 43 MiB 9.3 GiB 
3.7 TiB 48.54 1.12  38 up osd.15
 27   hdd    7.27599  1.0  7.3 TiB 2.9 TiB 2.9 TiB 2.3 GiB 8.2 GiB 
4.4 TiB 39.91 0.92  33 up osd.27
 31   hdd    7.27599  1.0  7.3 TiB 2.8 TiB 2.8 TiB 24 MiB 8.1 GiB 
4.4 TiB 39.16 0.90  34 up osd.31
 35   hdd    7.27599  1.0  7.3 TiB 3.8 TiB 3.7 TiB 2.3 GiB  13 GiB 
3.5 TiB 51.71 1.19  40 up osd.35
 39   hdd    7.27599  1.0  7.3 TiB 3.8 TiB 3.7 TiB 65 MiB  13 GiB 
3.5 TiB 51.65 

[ceph-users] Re: Balancing with upmap

2021-01-29 Thread Dan van der Ster
Thanks, and thanks for the log file OTR which simply showed:

2021-01-29 23:17:32.567 7f6155cae700  4 mgr[balancer] prepared 0/10 changes

This indeed means that the balancer believes those pools are all balanced
according to the config (which you have left at the defaults).

Could you please also share the output of `ceph osd df tree` so we can
see the distribution and OSD weights?

You might simply need to decrease upmap_max_deviation from the
default of 5. On our clusters we do:

ceph config set mgr mgr/balancer/upmap_max_deviation 1
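(If useful, the value currently in effect can be checked with, e.g., ceph config get mgr mgr/balancer/upmap_max_deviation -- assuming the centralized config database is in use.)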

Cheers, Dan

On Fri, Jan 29, 2021 at 11:25 PM Francois Legrand  wrote:
>
> Hi Dan,
>
> Here is the output of ceph balancer status:
>
> ceph balancer status
> {
>     "last_optimize_duration": "0:00:00.074965",
>     "plans": [],
>     "mode": "upmap",
>     "active": true,
>     "optimize_result": "Unable to find further optimization, or
> pool(s) pg_num is decreasing, or distribution is already perfect",
>     "last_optimize_started": "Fri Jan 29 23:13:31 2021"
> }
>
>
> F.
>
> On 29/01/2021 at 10:57, Dan van der Ster wrote:
> > Hi Francois,
> >
> > What is the output of `ceph balancer status` ?
> > Also, can you increase the debug_mgr to 4/5 then share the log file of
> > the active mgr?
> >
> > Best,
> >
> > Dan
> >
> > On Fri, Jan 29, 2021 at 10:54 AM Francois Legrand  
> > wrote:
> >> Thanks for your suggestion. I will have a look!
> >>
> >> But I am a bit surprised that the "official" balancer seems so inefficient!
> >>
> >> F.
> >>
> >> On 28/01/2021 at 12:00, Jonas Jelten wrote:
> >>> Hi!
> >>>
> >>> We also suffer heavily from this so I wrote a custom balancer which 
> >>> yields much better results:
> >>> https://github.com/TheJJ/ceph-balancer
> >>>
> >>> After you run it, it echoes the PG movements it suggests. You can then
> >>> just run those commands and the cluster will balance more.
> >>> It's kinda work in progress, so I'm glad about your feedback.
> >>>
> >>> Maybe it helps you :)
> >>>
> >>> -- Jonas
> >>>
> >>> On 27/01/2021 17.15, Francois Legrand wrote:
>  Hi all,
> I have a cluster with 116 disks (24 new disks of 16 TB added in December
> and the rest of 8 TB) running Nautilus 14.2.16.
> I moved (8 months ago) from crush_compat to upmap balancing.
> But the cluster does not seem well balanced, with the number of PGs on the
> 8 TB disks varying from 26 to 52, and a usage from 35 to 69%.
> The recent 16 TB disks are more homogeneous, with 48 to 61 PGs and usage
> between 30 and 43%.
> Last week, I realized that some OSDs were maybe not using upmap, because I
> did a ceph osd crush weight-set ls and got (compat) as a result.
> Thus I ran a ceph osd crush weight-set rm-compat, which triggered some
> rebalancing. There has been no more recovery for 2 days, but the cluster
> is still unbalanced.
> As far as I understand, upmap is supposed to reach an equal number of
> PGs on all the disks (I guess weighted by their capacity).
> Thus I would expect more or less 30 PGs on the 8 TB disks and 60 on the
> 16 TB disks, and around 50% usage on all of them. Which is not the case (by far).
> The problem is that this impacts the free space available in the pools
> (264 TiB while there is more than 578 TiB free in the cluster), because free
> space seems to be based on the space available before the first OSD becomes
> full!
> Is this normal? Did I miss something? What can I do?
> 
>  F.
>  ___
>  ceph-users mailing list -- ceph-users@ceph.io
>  To unsubscribe send an email to ceph-users-le...@ceph.io
> >> ___
> >> ceph-users mailing list -- ceph-users@ceph.io
> >> To unsubscribe send an email to ceph-users-le...@ceph.io
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Balancing with upmap

2021-01-29 Thread Dan van der Ster
Hi Francois,

What is the output of `ceph balancer status` ?
Also, can you increase the debug_mgr to 4/5 then share the log file of
the active mgr?
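For example, one possible way via the centralized config (assuming default log locations), reverting afterwards:

ceph config set mgr debug_mgr 4/5
# reproduce a balancer run, then grab /var/log/ceph/ceph-mgr.<active>.log
ceph config rm mgr debug_mgr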

Best,

Dan

On Fri, Jan 29, 2021 at 10:54 AM Francois Legrand  wrote:
>
> Thanks for your suggestion. I will have a look!
>
> But I am a bit surprised that the "official" balancer seems so inefficient!
>
> F.
>
> On 28/01/2021 at 12:00, Jonas Jelten wrote:
> > Hi!
> >
> > We also suffer heavily from this so I wrote a custom balancer which yields 
> > much better results:
> > https://github.com/TheJJ/ceph-balancer
> >
> > After you run it, it echoes the PG movements it suggests. You can then just
> > run those commands and the cluster will balance more.
> > It's kinda work in progress, so I'm glad about your feedback.
> >
> > Maybe it helps you :)
> >
> > -- Jonas
> >
> > On 27/01/2021 17.15, Francois Legrand wrote:
> >> Hi all,
> >> I have a cluster with 116 disks (24 new disks of 16 TB added in December
> >> and the rest of 8 TB) running Nautilus 14.2.16.
> >> I moved (8 months ago) from crush_compat to upmap balancing.
> >> But the cluster does not seem well balanced, with the number of PGs on the
> >> 8 TB disks varying from 26 to 52, and a usage from 35 to 69%.
> >> The recent 16 TB disks are more homogeneous, with 48 to 61 PGs and usage
> >> between 30 and 43%.
> >> Last week, I realized that some OSDs were maybe not using upmap, because I
> >> did a ceph osd crush weight-set ls and got (compat) as a result.
> >> Thus I ran a ceph osd crush weight-set rm-compat, which triggered some
> >> rebalancing. There has been no more recovery for 2 days, but the cluster is
> >> still unbalanced.
> >> As far as I understand, upmap is supposed to reach an equal number of PGs
> >> on all the disks (I guess weighted by their capacity).
> >> Thus I would expect more or less 30 PGs on the 8 TB disks and 60 on the
> >> 16 TB disks, and around 50% usage on all of them. Which is not the case (by far).
> >> The problem is that this impacts the free space available in the pools
> >> (264 TiB while there is more than 578 TiB free in the cluster), because free
> >> space seems to be based on the space available before the first OSD becomes full!
> >> Is this normal? Did I miss something? What can I do?
> >>
> >> F.
> >> ___
> >> ceph-users mailing list -- ceph-users@ceph.io
> >> To unsubscribe send an email to ceph-users-le...@ceph.io
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Balancing with upmap

2021-01-29 Thread Francois Legrand

Thanks for your suggestion. I will have a look!

But I am a bit surprised that the "official" balancer seems so inefficient!

F.

On 28/01/2021 at 12:00, Jonas Jelten wrote:

Hi!

We also suffer heavily from this so I wrote a custom balancer which yields much 
better results:
https://github.com/TheJJ/ceph-balancer

After you run it, it echoes the PG movements it suggests. You can then just run
those commands and the cluster will balance more.
It's kinda work in progress, so I'm glad about your feedback.

Maybe it helps you :)

-- Jonas

On 27/01/2021 17.15, Francois Legrand wrote:

Hi all,
I have a cluster with 116 disks (24 new disks of 16 TB added in December and the
rest of 8 TB) running Nautilus 14.2.16.
I moved (8 months ago) from crush_compat to upmap balancing.
But the cluster does not seem well balanced, with the number of PGs on the 8 TB
disks varying from 26 to 52, and a usage from 35 to 69%.
The recent 16 TB disks are more homogeneous, with 48 to 61 PGs and usage between
30 and 43%.
Last week, I realized that some OSDs were maybe not using upmap, because I did a
ceph osd crush weight-set ls and got (compat) as a result.
Thus I ran a ceph osd crush weight-set rm-compat, which triggered some
rebalancing. There has been no more recovery for 2 days, but the cluster is still
unbalanced.
As far as I understand, upmap is supposed to reach an equal number of PGs on
all the disks (I guess weighted by their capacity).
Thus I would expect more or less 30 PGs on the 8 TB disks and 60 on the 16 TB
disks, and around 50% usage on all of them. Which is not the case (by far).
The problem is that this impacts the free space available in the pools (264 TiB
while there is more than 578 TiB free in the cluster), because free space seems to
be based on the space available before the first OSD becomes full!
Is this normal? Did I miss something? What can I do?

F.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Balancing with upmap

2021-01-28 Thread Jonas Jelten
Hi!

We also suffer heavily from this so I wrote a custom balancer which yields much 
better results:
https://github.com/TheJJ/ceph-balancer

After you run it, it echoes the PG movements it suggests. You can then just run
those commands and the cluster will balance more.
It's kinda work in progress, so I'm glad about your feedback.
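A rough usage sketch (the entry point is placementoptimizer.py in that repo; check its README for the exact subcommands and flags, which may have changed):

./placementoptimizer.py balance | tee /tmp/balance-upmaps
bash /tmp/balance-upmaps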

Maybe it helps you :)

-- Jonas

On 27/01/2021 17.15, Francois Legrand wrote:
> Hi all,
> I have a cluster with 116 disks (24 new disks of 16 TB added in December and
> the rest of 8 TB) running Nautilus 14.2.16.
> I moved (8 months ago) from crush_compat to upmap balancing.
> But the cluster does not seem well balanced, with the number of PGs on the 8 TB
> disks varying from 26 to 52, and a usage from 35 to 69%.
> The recent 16 TB disks are more homogeneous, with 48 to 61 PGs and usage
> between 30 and 43%.
> Last week, I realized that some OSDs were maybe not using upmap, because I did
> a ceph osd crush weight-set ls and got (compat) as a result.
> Thus I ran a ceph osd crush weight-set rm-compat, which triggered some
> rebalancing. There has been no more recovery for 2 days, but the cluster is
> still unbalanced.
> As far as I understand, upmap is supposed to reach an equal number of PGs on
> all the disks (I guess weighted by their capacity).
> Thus I would expect more or less 30 PGs on the 8 TB disks and 60 on the 16 TB
> disks, and around 50% usage on all of them. Which is not the case (by far).
> The problem is that this impacts the free space available in the pools (264 TiB
> while there is more than 578 TiB free in the cluster), because free space seems
> to be based on the space available before the first OSD becomes full!
> Is this normal? Did I miss something? What can I do?
> 
> F.
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Balancing with upmap

2021-01-27 Thread Francois Legrand

Nope!

On 27/01/2021 at 17:40, Anthony D'Atri wrote:

Do you have any override reweights set to values less than 1.0?

The REWEIGHT column when you run `ceph osd df`
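(If any were below 1.0, they could be reset with, e.g.: ceph osd reweight <osd-id> 1.0.)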


On Jan 27, 2021, at 8:15 AM, Francois Legrand  wrote:

Hi all,
I have a cluster with 116 disks (24 new disks of 16 TB added in December and the
rest of 8 TB) running Nautilus 14.2.16.
I moved (8 months ago) from crush_compat to upmap balancing.
But the cluster does not seem well balanced, with the number of PGs on the 8 TB
disks varying from 26 to 52, and a usage from 35 to 69%.
The recent 16 TB disks are more homogeneous, with 48 to 61 PGs and usage between
30 and 43%.
Last week, I realized that some OSDs were maybe not using upmap, because I did a
ceph osd crush weight-set ls and got (compat) as a result.
Thus I ran a ceph osd crush weight-set rm-compat, which triggered some
rebalancing. There has been no more recovery for 2 days, but the cluster is still
unbalanced.
As far as I understand, upmap is supposed to reach an equal number of PGs on
all the disks (I guess weighted by their capacity).
Thus I would expect more or less 30 PGs on the 8 TB disks and 60 on the 16 TB
disks, and around 50% usage on all of them. Which is not the case (by far).
The problem is that this impacts the free space available in the pools (264 TiB
while there is more than 578 TiB free in the cluster), because free space seems to
be based on the space available before the first OSD becomes full!
Is this normal? Did I miss something? What can I do?

F.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io