Here is the PG distribution, as given by the command I found here: http://cephnotes.ksperis.com/blog/2015/02/23/get-the-number-of-placement-groups-per-osd

pool :    35   44   36   31   32   33    2   34    43   | SUM
----------------------------------------------------------------------------------------
osd.0     0    5    26    1    1    1    1    1    2    | 38
osd.1     1    6    23    1    1    1    2    1    1    | 37
osd.2     1    6    22    1    1    1    2    1    2    | 37
osd.3     1    11   35    1    1    1    3    2    3    | 58
osd.4     1    6    23    1    1    1    1    1    1    | 36
osd.5     0    6    22    1    1    1    1    1    1    | 34
osd.6     1    6    23    1    1    1    1    1    2    | 37
osd.7     1    6    22    1    1    1    2    1    2    | 37
osd.8     1    5    26    1    1    1    2    1    1    | 39
osd.9     0    5    26    1    0    1    2    1    2    | 38
osd.10    1    6    30    1    1    1    2    1    2    | 45
osd.11    1    6    26    1    1    1    2    0    2    | 40
osd.12    1    5    25    1    1    1    1    1    1    | 37
osd.13    1    6    22    1    1    1    2    1    1    | 36
osd.14    1    8    22    1    1    1    2    1    3    | 40
osd.15    1    6    26    1    1    1    1    1    2    | 40
osd.16    0    6    23    1    1    1    2    1    2    | 37
osd.17    1    5    25    1    0    0    2    1    1    | 36
osd.18    1    6    28    1    1    1    2    1    2    | 43
osd.19    1    6    22    1    1    1    2    1    2    | 37
osd.20    1    5    22    1    1    1    1    1    1    | 34
osd.21    0    5    22    1    1    0    1    1    2    | 33
osd.22    1    6    30    1    1    1    2    1    3    | 46
osd.23    1    9    35    1    1    1    3    2    2    | 55
osd.24    1    5    24    1    1    1    2    1    2    | 38
osd.25    1    6    24    1    1    1    1    1    2    | 38
osd.26    1    8    23    1    1    1    1    1    2    | 39
osd.27    1    6    22    1    1    1    1    0    1    | 34
osd.28    0    5    22    1    1    1    1    0    1    | 32
osd.29    1    5    24    0    1    1    1    1    0    | 34
osd.30    1    6    24    1    1    1    2    1    2    | 39
osd.31    1    6    22    1    1    1    1    1    1    | 35
osd.32    0    5    25    1    0    1    1    1    0    | 34
osd.33    1    5    25    1    1    0    1    1    1    | 36
osd.34    0    9    28    1    1    1    1    0    2    | 43
osd.35    1    6    22    1    1    1    2    1    2    | 37
osd.36    1    5    25    0    0    1    1    0    2    | 35
osd.37    1    5    24    0    0    0    1    1    2    | 34
osd.38    0    6    26    1    1    1    1    0    2    | 38
osd.39    1    6    23    1    1    1    1    0    2    | 36
osd.40    1    6    24    1    1    0    2    1    1    | 37
osd.41    1    6    22    1    0    0    1    0    2    | 33
osd.42    1    7    25    1    1    1    2    1    2    | 41
osd.43    1    6    24    0    1    1    1    0    1    | 35
osd.44    1    6    24    0    0    1    2    1    1    | 36
osd.45    1    5    25    0    1    0    2    0    1    | 35
osd.46    1    6    22    0    1    1    1    1    2    | 35
osd.47    1    6    26    1    1    1    2    1    2    | 41
osd.48    0    5    22    0    1    1    1    1    2    | 33
osd.49    1    5    26    1    0    0    2    1    0    | 36
osd.50    1    9    23    1    1    1    2    0    2    | 40
osd.51    1    6    22    0    1    0    2    0    2    | 34
osd.52    1    5    22    0    1    0    1    1    2    | 33
osd.53    0    5    22    1    1    1    1    0    2    | 33
osd.54    0    6    24    0    1    1    1    0    2    | 35
osd.55    1    6    22    1    0    1    2    0    2    | 35
osd.56    0    6    22    1    1    1    2    1    0    | 34
osd.57    1    6    24    1    1    1    1    0    1    | 36
osd.58    1    6    25    0    1    1    2    1    2    | 39
osd.59    1    6    26    1    1    1    2    0    2    | 40
osd.60    0    4    22    1    1    0    1    0    1    | 30
osd.61    1    5    25    0    0    0    2    1    2    | 36
osd.62    1    9    22    1    1    1    2    1    2    | 40
osd.63    1    9    35    1    1    1    3    1    3    | 55
osd.64    1    11   36    2    1    1    2    1    2    | 57
osd.65    1    8    37    1    1    2    2    1    3    | 56
osd.66    1    9    35    1    1    2    2    2    2    | 55
osd.67    1    10   34    1    1    2    2    2    3    | 56
osd.68    1    9    40    1    1    1    2    1    2    | 58
osd.69    1    8    40    1    1    1    2    1    3    | 58
osd.70    1    8    34    1    1    1    2    1    2    | 51
osd.71    1    11   36    1    1    2    2    1    2    | 57
osd.72    1    6    26    1    1    1    1    1    1    | 39
osd.73    1    8    37    1    1    2    3    1    1    | 55
osd.74    1    6    22    1    0    1    1    1    1    | 34
osd.75    1    6    22    1    1    1    2    1    1    | 36
osd.76    1    6    22    1    0    1    1    1    1    | 34
osd.77    1    6    23    1    0    0    2    1    1    | 35
osd.78    1    6    24    1    0    0    1    1    1    | 35
osd.79    1    6    22    1    1    0    2    1    1    | 35
osd.80    1    6    22    1    1    0    1    1    1    | 34
osd.81    1    6    24    1    1    0    2    1    2    | 38
osd.82    1    6    23    1    0    1    1    1    0    | 34
osd.83    0    6    23    1    1    0    1    1    0    | 33
osd.84    1    6    25    1    1    1    2    1    2    | 40
osd.85    1    6    22    1    0    0    2    0    2    | 34
osd.86    0    6    22    1    0    0    1    0    1    | 31
osd.87    1    6    22    0    0    0    2    1    2    | 34
osd.88    1    8    34    1    1    0    2    1    3    | 51
osd.89    1    7    22    1    1    1    2    1    2    | 38
osd.90    1    6    25    0    1    1    2    1    2    | 39
osd.91    1    8    32    0    1    1    2    1    1    | 47
osd.92    1    6    22    0    1    2    1    1    2    | 36
osd.93    1    7    22    1    1    1    2    1    2    | 38
osd.94    1    6    27    0    1    1    1    1    1    | 39
osd.95    1    7    30    0    1    1    2    1    1    | 44
osd.96    1    10   35    1    1    1    3    1    3    | 56
osd.97    1    6    28    1    1    1    1    1    1    | 41
osd.98    1    6    22    0    1    1    2    0    1    | 34
osd.99    1    6    29    1    1    1    2    1    1    | 43
osd.100   1    6    26    1    1    0    2    0    2    | 39
osd.101   0    6    24    1    0    1    2    1    1    | 36
osd.102   0    6    22    1    0    1    2    0    2    | 34
osd.103   1    6    22    0    1    1    2    1    2    | 36
osd.104   0    6    30    1    1    1    2    1    2    | 44
osd.105   0    6    26    1    1    1    1    0    1    | 37
osd.106   1    11   34    1    1    1    1    1    2    | 53
osd.107   1    8    38    1    1    0    2    1    2    | 54
osd.108   1    8    34    1    1    2    2    1    3    | 53
osd.109   1    9    34    1    1    1    1    1    3    | 52
osd.110   1    8    37    1    1    0    3    1    3    | 55
osd.111   1    8    40    1    1    0    2    1    1    | 55
osd.112   1    8    37    1    1    2    3    1    1    | 55
osd.113   1    8    34    1    1    0    1    1    1    | 48
osd.114   1    11   34    1    1    1    1    1    2    | 53
osd.115   1    11   34    1    1    0    1    1    1    | 51
----------------------------------------------------------------------------------------
SUM :    96   768  3072   96   96   96  192   96   192  |

F.

On 01/02/2021 at 10:26, Dan van der Ster wrote:
On Mon, Feb 1, 2021 at 10:03 AM Francois Legrand <f...@lpnhe.in2p3.fr> wrote:
Hi,

Actually we have no EC pools: all are replicated with size 3. And we have only 9 pools.

The average number of PGs per OSD is not very high (40.6).

Here is the detail of the pools :

pool 2 replicated size 3 min_size 1 crush_rule 2 object_hash rjenkins pg_num 64 pgp_num 64 last_change 623105 lfor 0/608315/608313 flags hashpspool,selfmanaged_snaps stripe_width 0 application rbd
pool 31 replicated size 3 min_size 1 crush_rule 2 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 621529 lfor 0/0/171563 flags hashpspool,selfmanaged_snaps stripe_width 0 application rbd
pool 32 replicated size 3 min_size 1 crush_rule 2 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 621529 lfor 436085/436085/436085 flags hashpspool,selfmanaged_snaps stripe_width 0 application rbd
pool 33 replicated size 3 min_size 1 crush_rule 2 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 621529 lfor 0/0/171554 flags hashpspool,selfmanaged_snaps stripe_width 0 application rbd
pool 34 replicated size 3 min_size 1 crush_rule 2 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 623470 lfor 0/0/171558 flags hashpspool,selfmanaged_snaps stripe_width 0 application rbd
pool 35 replicated size 3 min_size 1 crush_rule 2 object_hash rjenkins pg_num 32 pgp_num 32 last_change 621529 lfor 0/598286/598284 flags hashpspool stripe_width 0 pg_autoscale_bias 4 pg_num_min 16 recovery_priority 5 application cephfs
pool 36 replicated size 3 min_size 1 crush_rule 2 object_hash rjenkins pg_num 1024 pgp_num 1024 autoscale_mode warn last_change 624174 flags hashpspool,selfmanaged_snaps stripe_width 0 application cephfs
pool 43 replicated size 3 min_size 2 crush_rule 2 object_hash rjenkins pg_num 64 pgp_num 64 autoscale_mode warn last_change 624174 flags hashpspool,selfmanaged_snaps stripe_width 0 application cephfs
pool 44 replicated size 3 min_size 3 crush_rule 2 object_hash rjenkins pg_num 256 pgp_num 256 autoscale_mode warn last_change 622177 lfor 0/0/449412 flags hashpspool,selfmanaged_snaps stripe_width 0 expected_num_objects 400 target_size_bytes 17592186044416 application rbd

Pools 35 (metadata), 36 and 43 (data) are for CephFS.
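
For reference, the 40.6 PGs per OSD quoted above can be recomputed from the pool listing. The following is only a small Python sketch: the variable and function names are illustrative, and it ignores OSD weights and the CRUSH rule, so it gives plain averages over the 116 OSDs mentioned in this thread.

```python
# pg_num per pool id, copied from the pool listing; every pool has size 3.
PG_NUM = {2: 64, 31: 32, 32: 32, 33: 32, 34: 32, 35: 32, 36: 1024, 43: 64, 44: 256}
REPLICA_SIZE = 3
NUM_OSDS = 116  # number of disks given in the thread


def mean_pg_copies_per_osd(pg_num: int, size: int = REPLICA_SIZE, osds: int = NUM_OSDS) -> float:
    """Average number of PG replicas one OSD holds for a pool (weights ignored)."""
    return pg_num * size / osds


per_pool = {pool: mean_pg_copies_per_osd(n) for pool, n in PG_NUM.items()}
total = sum(per_pool.values())

for pool, mean in sorted(per_pool.items()):
    print(f"pool {pool:>2}: {mean:5.2f} PG copies per OSD on average")
print(f"all pools: {total:.1f} PG copies per OSD on average")
```

Only pool 36 averages more than a handful of PG copies per OSD; the pools with 32 PGs average less than one copy per OSD, so an individual OSD can only hold zero, one or two of their PGs.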

How does the distribution for pool 36 look? This pool has the best
chance to be balanced -- the others have too few PGs so you shouldn't
even be worried.

The cause is probably the CRUSH rule. Indeed, as we have servers in 2
different rooms, we have a CRUSH rule to ensure that at least one copy
of the data is stored in each room (for disaster recovery):

{
          "rule_id": 2,
          "rule_name": "replicated3over2rooms",
          "ruleset": 2,
          "type": 1,
          "min_size": 3,
          "max_size": 4,
          "steps": [
              {
                  "op": "take",
                  "item": -1,
                  "item_name": "default"
              },
              {
                  "op": "choose_firstn",
                  "num": 0,
                  "type": "room"
              },
              {
                  "op": "chooseleaf_firstn",
                  "num": 2,
                  "type": "host"
              },
              {
                  "op": "emit"
              }
          ]
      },

This rule should pick a room, put 2 copies on different hosts in that
room, and put the third copy on any host in the second room.

I understand that it will not lead to a totally uniform distribution, but
statistically it should not be too far off.

The disks are distributed between the rooms as follows: 4 servers x 16
disks of 8 TB in the first room, and 1 server x 24 disks of 16 TB + 1 x 16 +
1 x 12 disks of 8 TB in the second room.

This distribution is not homogeneous (4 servers in the first room and 3
in the second, 64 disks in one room and 52 in the other, and disks of
different capacities), and we certainly have an excess capacity of 12 x 8 TB
in the second room (I am aware that this capacity is "lost" for now;
it will become usable in the future if we add some new servers to the
first room).
This non trivial crush rule and "tree imbalance" is probably confusing
the balancer a lot.
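
To get a feeling for how much the rule and the uneven tree matter on their own, here is a rough Python model. It is only a sketch under loudly stated assumptions: buckets are drawn with probability proportional to their CRUSH weight, the room drawn first holds 2 replicas on 2 distinct hosts, the other room holds 1, and upmap entries are ignored. Real CRUSH (straw2 with retries) and the balancer will differ in detail. The host weights and OSD counts are taken from the `ceph osd df tree` output in this thread; all function names are made up.

```python
from itertools import permutations

# CRUSH weights of the non-empty hosts, from the `ceph osd df tree` output.
ROOMS = {
    "1222-2-10": {"lpnceph01": 116.41678, "lpnceph02": 116.41600,
                  "lpnceph04": 116.41600, "lpnceph06": 116.41699},
    "1222-SS-09": {"lpnceph03": 116.41600, "lpnceph07": 87.31200,
                   "lpnceph09": 349.26453},
}
OSDS_PER_HOST = {"lpnceph01": 16, "lpnceph02": 16, "lpnceph04": 16, "lpnceph06": 16,
                 "lpnceph03": 16, "lpnceph07": 12, "lpnceph09": 24}
TOTAL_PGS = 1568  # sum of pg_num over the nine pools


def inclusion_probability(weights: dict, picks: int) -> dict:
    """P(item is among `picks` distinct items drawn one by one, each draw
    proportional to weight among the items not drawn yet)."""
    total = sum(weights.values())
    prob = dict.fromkeys(weights, 0.0)
    for order in permutations(weights, picks):
        p, remaining = 1.0, total
        for name in order:
            p *= weights[name] / remaining
            remaining -= weights[name]
        for name in order:
            prob[name] += p
    return prob


def expected_copies_per_host() -> dict:
    """Expected number of replicas of one PG landing on each host (model only)."""
    room_weight = {room: sum(hosts.values()) for room, hosts in ROOMS.items()}
    all_weight = sum(room_weight.values())
    copies = {}
    for room, hosts in ROOMS.items():
        p_first = room_weight[room] / all_weight  # two rooms: P(drawn first)
        two = inclusion_probability(hosts, 2)     # room drawn first: 2 replicas
        one = inclusion_probability(hosts, 1)     # room drawn second: 1 replica
        for host in hosts:
            copies[host] = p_first * two[host] + (1 - p_first) * one[host]
    return copies


copies = expected_copies_per_host()
pgs_per_osd = {h: c * TOTAL_PGS / OSDS_PER_HOST[h] for h, c in copies.items()}
for host, pgs in pgs_per_osd.items():
    print(f"{host}: about {pgs:.1f} PGs per OSD expected")
```

In this model a 16 TB OSD on lpnceph09 ends up with clearly fewer than twice the PGs of an 8 TB OSD, because a single host can hold at most one replica of any PG, and the 8 TB OSDs of the second room get more PGs than those of the first room. That points in the same direction as the observed PGS column, but it is only an approximation.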

-- dan

P.S. min_size 1 will lead to tears down the road....

But in theory (which I agree is generally far from reality) a fairly
balanced distribution of data should be reached.

F.



On 31/01/2021 at 17:30, Dan van der Ster wrote:
Hi,

I think what's happening is that because you have few PGs and many
pools, the balancer cannot achieve a good uniform distribution.
The upmap balancer works to make the PGs uniform for each pool
individually -- it doesn't look at the total PGs per OSD, so perhaps
with your low # PGs per pool per OSD you are just unlucky.

You can use a script like this:
https://github.com/cernceph/ceph-scripts/blob/master/tools/ceph-pool-pg-distribution
to see the PG distribution for any given pool. E.g. on one of my clusters:

# ./ceph-pool-pg-distribution 38
Searching for PGs in pools: ['38']
Summary: 32 pgs on 52 osds

Num OSDs with X PGs:
    1: 21
    2: 20
    3: 9
    4: 2

That shows a pretty non-uniform distribution, because this example
pool id 38 has up to 4 PGs on some OSDs but 1 or 2 on most.
(this is a cluster with the balancer disabled).
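
The kind of summary shown above ("Num OSDs with X PGs") can be sketched in a few lines of Python. This is not the linked script: the function name and the PG-to-OSD mapping are hypothetical, and a real tool would first have to obtain the mapping from the cluster.

```python
from collections import Counter


def osds_per_pg_count(pg_to_osds: dict) -> dict:
    """Return {number of PGs: number of OSDs holding that many PGs} for one pool.

    OSDs that hold no PG of the pool do not appear in the mapping and are
    therefore not counted.
    """
    pgs_on_osd = Counter(osd for osds in pg_to_osds.values() for osd in osds)
    return dict(sorted(Counter(pgs_on_osd.values()).items()))


# Hypothetical pool with 4 PGs and 3 replicas each (OSD ids are made up).
example = {"38.0": [0, 1, 2], "38.1": [0, 3, 4], "38.2": [0, 1, 5], "38.3": [2, 3, 6]}
histogram = osds_per_pg_count(example)
print(histogram)
```

As a consistency check on the output quoted above, 21*1 + 20*2 + 9*3 + 2*4 = 96, which is what 32 PGs with 3 replicas each would give.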

The other explanation I can think of is that you have relatively wide
EC pools and few hosts. In that case there would be very little that
the balancer could do to flatten the distribution.
If in doubt, please share your pool details and crush rules so we can
investigate further.

Cheers, Dan




On Sun, Jan 31, 2021 at 5:10 PM Francois Legrand <f...@lpnhe.in2p3.fr> wrote:
Hi,

After 2 days, the recovery ended. The situation is clearly better (but
still not perfect), with 339.8 Ti available in pools (for 575.8 Ti
available in the whole cluster).

The balance is still not perfect (31 to 47 PGs on the 8 TB disks), and
`ceph osd df tree` returns:

 ID CLASS     WEIGHT REWEIGHT     SIZE RAW USE    DATA    OMAP    META   AVAIL  %USE  VAR PGS STATUS TYPE NAME
 -1       1018.65833        -  466 TiB 214 TiB 214 TiB 126 GiB 609 GiB 251 TiB     0    0   -        root default
-15        465.66577        -  466 TiB 214 TiB 214 TiB 126 GiB 609 GiB 251 TiB 46.04 1.06   -            room 1222-2-10
 -3        116.41678        -  116 TiB  53 TiB  53 TiB  24 GiB 152 GiB  64 TiB 45.45 1.05   -                host lpnceph01
  0   hdd    7.27599  1.00000  7.3 TiB 3.7 TiB 3.7 TiB 2.5 GiB  16 GiB 3.5 TiB 51.34 1.18  38     up osd.0
  4   hdd    7.27599  1.00000  7.3 TiB 3.2 TiB 3.2 TiB 2.4 GiB 8.7 GiB 4.1 TiB 44.12 1.01  36     up osd.4
  8   hdd    7.27699  1.00000  7.3 TiB 3.5 TiB 3.5 TiB 2.3 GiB 9.3 GiB 3.7 TiB 48.52 1.12  39     up osd.8
 12   hdd    7.27599  1.00000  7.3 TiB 3.4 TiB 3.4 TiB 2.4 GiB 9.5 GiB 3.9 TiB 46.69 1.07  37     up osd.12
 16   hdd    7.27599  1.00000  7.3 TiB 3.5 TiB 3.4 TiB  38 MiB 9.7 GiB 3.8 TiB 47.49 1.09  37     up osd.16
 20   hdd    7.27599  1.00000  7.3 TiB 3.1 TiB 3.0 TiB 2.4 GiB 8.7 GiB 4.2 TiB 41.95 0.96  34     up osd.20
 24   hdd    7.27599  1.00000  7.3 TiB 3.5 TiB 3.5 TiB 2.3 GiB 9.8 GiB 3.8 TiB 48.45 1.11  38     up osd.24
 28   hdd    7.27599  1.00000  7.3 TiB 3.0 TiB 3.0 TiB  55 MiB 8.2 GiB 4.2 TiB 41.74 0.96  32     up osd.28
 32   hdd    7.27599  1.00000  7.3 TiB 3.2 TiB 3.1 TiB  32 MiB 8.4 GiB 4.1 TiB 43.33 1.00  34     up osd.32
 36   hdd    7.27599  1.00000  7.3 TiB 3.7 TiB 3.7 TiB 2.4 GiB  11 GiB 3.6 TiB 50.50 1.16  35     up osd.36
 40   hdd    7.27599  1.00000  7.3 TiB 3.4 TiB 3.3 TiB 2.4 GiB 9.1 GiB 3.9 TiB 46.15 1.06  37     up osd.40
 44   hdd    7.27599  1.00000  7.3 TiB 3.4 TiB 3.4 TiB 2.3 GiB 9.2 GiB 3.9 TiB 46.28 1.06  36     up osd.44
 48   hdd    7.27599  1.00000  7.3 TiB 3.3 TiB 3.3 TiB  92 MiB 8.8 GiB 4.0 TiB 44.88 1.03  33     up osd.48
 52   hdd    7.27599  1.00000  7.3 TiB 3.3 TiB 3.3 TiB 2.4 GiB 9.0 GiB 4.0 TiB 44.86 1.03  33     up osd.52
 56   hdd    7.27599  1.00000  7.3 TiB 2.9 TiB 2.9 TiB  23 MiB 8.3 GiB 4.4 TiB 39.79 0.92  34     up osd.56
 60   hdd    7.27599  1.00000  7.3 TiB 3.0 TiB 3.0 TiB  40 MiB 8.3 GiB 4.3 TiB 41.12 0.95  30     up osd.60
 -5        116.41600        -  116 TiB  54 TiB  54 TiB  30 GiB 150 GiB  63 TiB 46.12 1.06   -                host lpnceph02
  1   hdd    7.27599  1.00000  7.3 TiB 3.2 TiB 3.2 TiB 2.2 GiB 8.9 GiB 4.0 TiB 44.53 1.02  37     up osd.1
  5   hdd    7.27599  1.00000  7.3 TiB 3.1 TiB 3.1 TiB  24 MiB 8.3 GiB 4.2 TiB 42.56 0.98  34     up osd.5
  9   hdd    7.27599  1.00000  7.3 TiB 3.8 TiB 3.8 TiB  42 MiB  11 GiB 3.4 TiB 52.61 1.21  38     up osd.9
 13   hdd    7.27599  1.00000  7.3 TiB 3.1 TiB 3.1 TiB 2.3 GiB 9.7 GiB 4.2 TiB 42.89 0.99  36     up osd.13
 17   hdd    7.27599  1.00000  7.3 TiB 3.4 TiB 3.4 TiB 2.3 GiB 9.1 GiB 3.9 TiB 46.80 1.08  36     up osd.17
 21   hdd    7.27599  1.00000  7.3 TiB 3.3 TiB 3.3 TiB  41 MiB 9.2 GiB 4.0 TiB 44.90 1.03  33     up osd.21
 25   hdd    7.27599  1.00000  7.3 TiB 3.5 TiB 3.5 TiB 2.4 GiB 9.4 GiB 3.7 TiB 48.75 1.12  38     up osd.25
 29   hdd    7.27599  1.00000  7.3 TiB 3.0 TiB 3.0 TiB 2.3 GiB 8.7 GiB 4.2 TiB 41.91 0.96  34     up osd.29
 33   hdd    7.27599  1.00000  7.3 TiB 3.4 TiB 3.4 TiB 2.3 GiB 9.4 GiB 3.9 TiB 46.60 1.07  36     up osd.33
 37   hdd    7.27599  1.00000  7.3 TiB 3.5 TiB 3.5 TiB 4.6 GiB  10 GiB 3.8 TiB 47.90 1.10  34     up osd.37
 41   hdd    7.27599  1.00000  7.3 TiB 3.3 TiB 3.3 TiB 2.2 GiB  11 GiB 3.9 TiB 45.91 1.06  33     up osd.41
 45   hdd    7.27599  1.00000  7.3 TiB 3.4 TiB 3.4 TiB 2.4 GiB 9.3 GiB 3.9 TiB 46.85 1.08  35     up osd.45
 49   hdd    7.27599  1.00000  7.3 TiB 3.3 TiB 3.3 TiB 2.3 GiB 8.9 GiB 4.0 TiB 45.35 1.04  36     up osd.49
 53   hdd    7.27599  1.00000  7.3 TiB 3.3 TiB 3.3 TiB  36 MiB 9.0 GiB 4.0 TiB 44.85 1.03  33     up osd.53
 57   hdd    7.27599  1.00000  7.3 TiB 3.3 TiB 3.3 TiB 2.3 GiB 9.0 GiB 4.0 TiB 45.67 1.05  36     up osd.57
 61   hdd    7.27599  1.00000  7.3 TiB 3.6 TiB 3.6 TiB 2.4 GiB 9.8 GiB 3.7 TiB 49.75 1.14  36     up osd.61
 -9        116.41600        -  116 TiB  56 TiB  56 TiB  35 GiB 159 GiB  61 TiB 48.03 1.10   -                host lpnceph04
  7   hdd    7.27599  1.00000  7.3 TiB 3.3 TiB 3.3 TiB 2.4 GiB 9.4 GiB 3.9 TiB 45.96 1.06  37     up osd.7
 11   hdd    7.27599  1.00000  7.3 TiB 3.9 TiB 3.9 TiB 4.7 GiB  11 GiB 3.4 TiB 53.20 1.22  40     up osd.11
 15   hdd    7.27599  1.00000  7.3 TiB 3.8 TiB 3.8 TiB 2.3 GiB 9.8 GiB 3.5 TiB 51.84 1.19  40     up osd.15
 27   hdd    7.27599  1.00000  7.3 TiB 3.1 TiB 3.1 TiB 2.3 GiB 8.5 GiB 4.2 TiB 42.50 0.98  34     up osd.27
 31   hdd    7.27599  1.00000  7.3 TiB 3.1 TiB 3.1 TiB 2.2 GiB 8.7 GiB 4.2 TiB 42.61 0.98  35     up osd.31
 35   hdd    7.27599  1.00000  7.3 TiB 3.5 TiB 3.5 TiB 2.3 GiB  12 GiB 3.8 TiB 48.27 1.11  37     up osd.35
 39   hdd    7.27599  1.00000  7.3 TiB 3.6 TiB 3.6 TiB 2.2 GiB 8.4 GiB 3.7 TiB 49.45 1.14  36     up osd.39
 43   hdd    7.27599  1.00000  7.3 TiB 3.3 TiB 3.3 TiB 2.4 GiB 9.4 GiB 4.0 TiB 45.71 1.05  35     up osd.43
 47   hdd    7.27599  1.00000  7.3 TiB 3.8 TiB 3.8 TiB 3.0 GiB  12 GiB 3.5 TiB 52.31 1.20  41     up osd.47
 51   hdd    7.27599  1.00000  7.3 TiB 3.4 TiB 3.3 TiB 2.3 GiB  10 GiB 3.9 TiB 46.13 1.06  34     up osd.51
 55   hdd    7.27599  1.00000  7.3 TiB 3.3 TiB 3.3 TiB 4.1 GiB  11 GiB 4.0 TiB 45.71 1.05  35     up osd.55
 59   hdd    7.27599  1.00000  7.3 TiB 3.8 TiB 3.8 TiB 2.2 GiB  10 GiB 3.5 TiB 52.19 1.20  40     up osd.59
100   hdd    7.27599  1.00000  7.3 TiB 3.8 TiB 3.8 TiB 2.3 GiB  10 GiB 3.5 TiB 52.22 1.20  39     up osd.100
101   hdd    7.27599  1.00000  7.3 TiB 3.3 TiB 3.3 TiB  26 MiB 9.0 GiB 3.9 TiB 45.82 1.05  36     up osd.101
102   hdd    7.27599  1.00000  7.3 TiB 3.3 TiB 3.3 TiB  75 MiB 9.0 GiB 3.9 TiB 45.79 1.05  34     up osd.102
105   hdd    7.27599  1.00000  7.3 TiB 3.6 TiB 3.5 TiB  57 MiB 9.9 GiB 3.7 TiB 48.83 1.12  37     up osd.105
-13        116.41699        -  116 TiB  52 TiB  52 TiB  37 GiB 148 GiB  65 TiB 44.58 1.03   -                host lpnceph06
 19   hdd    7.27699  1.00000  7.3 TiB 3.3 TiB 3.3 TiB 2.2 GiB 8.8 GiB 3.9 TiB 45.97 1.06  37     up osd.19
 72   hdd    7.27599  1.00000  7.3 TiB 3.6 TiB 3.5 TiB 2.6 GiB 9.4 GiB 3.7 TiB 48.84 1.12  39     up osd.72
 74   hdd    7.27599  1.00000  7.3 TiB 3.1 TiB 3.1 TiB 2.3 GiB 8.5 GiB 4.2 TiB 42.36 0.97  34     up osd.74
 75   hdd    7.27599  1.00000  7.3 TiB 3.1 TiB 3.1 TiB 2.4 GiB 8.6 GiB 4.2 TiB 42.85 0.99  36     up osd.75
 76   hdd    7.27599  1.00000  7.3 TiB 3.1 TiB 3.1 TiB 2.9 GiB 9.9 GiB 4.2 TiB 42.47 0.98  34     up osd.76
 77   hdd    7.27599  1.00000  7.3 TiB 3.2 TiB 3.2 TiB 2.4 GiB 8.7 GiB 4.1 TiB 44.34 1.02  35     up osd.77
 78   hdd    7.27599  1.00000  7.3 TiB 3.3 TiB 3.3 TiB 4.6 GiB  12 GiB 4.0 TiB 45.56 1.05  35     up osd.78
 79   hdd    7.27599  1.00000  7.3 TiB 3.1 TiB 3.1 TiB 2.4 GiB 8.4 GiB 4.2 TiB 42.94 0.99  35     up osd.79
 80   hdd    7.27599  1.00000  7.3 TiB 3.1 TiB 3.1 TiB 2.4 GiB 8.5 GiB 4.2 TiB 42.47 0.98  34     up osd.80
 81   hdd    7.27599  1.00000  7.3 TiB 3.6 TiB 3.6 TiB 2.3 GiB 9.1 GiB 3.7 TiB 48.99 1.13  38     up osd.81
 82   hdd    7.27599  1.00000  7.3 TiB 3.0 TiB 3.0 TiB 2.3 GiB 8.5 GiB 4.3 TiB 40.98 0.94  34     up osd.82
 83   hdd    7.27599  1.00000  7.3 TiB 3.0 TiB 3.0 TiB  22 MiB 8.3 GiB 4.3 TiB 41.03 0.94  33     up osd.83
 84   hdd    7.27599  1.00000  7.3 TiB 3.7 TiB 3.7 TiB 2.4 GiB  11 GiB 3.6 TiB 50.66 1.17  40     up osd.84
 85   hdd    7.27599  1.00000  7.3 TiB 3.3 TiB 3.3 TiB 2.4 GiB 9.4 GiB 4.0 TiB 45.66 1.05  34     up osd.85
 86   hdd    7.27599  1.00000  7.3 TiB 3.1 TiB 3.1 TiB  51 MiB 8.7 GiB 4.2 TiB 42.33 0.97  31     up osd.86
 87   hdd    7.27599  1.00000  7.3 TiB 3.3 TiB 3.3 TiB 3.3 GiB  10 GiB 3.9 TiB 45.77 1.05  34     up osd.87
-16        552.99255        -  349 TiB 124 TiB 123 TiB  65 GiB 321 GiB 225 TiB     0    0   -            room 1222-SS-09
-21                0        -      0 B     0 B     0 B     0 B     0 B     0 B     0    0   -                host lpnceph00
 -7        116.41600        -  116 TiB  60 TiB  60 TiB  35 GiB 176 GiB  56 TiB 51.73 1.19   -                host lpnceph03
  2   hdd    7.27599  1.00000  7.3 TiB 3.3 TiB 3.3 TiB 2.4 GiB 8.9 GiB 3.9 TiB 46.01 1.06  37     up osd.2
  6   hdd    7.27599  1.00000  7.3 TiB 3.4 TiB 3.4 TiB 4.8 GiB  16 GiB 3.8 TiB 47.26 1.09  37     up osd.6
 10   hdd    7.27599  1.00000  7.3 TiB 4.3 TiB 4.2 TiB 2.4 GiB  12 GiB 3.0 TiB 58.59 1.35  45     up osd.10
 14   hdd    7.27599  1.00000  7.3 TiB 3.9 TiB 3.9 TiB 2.4 GiB  12 GiB 3.4 TiB 53.62 1.23  40     up osd.14
 18   hdd    7.27599  1.00000  7.3 TiB 4.0 TiB 4.0 TiB 3.4 GiB  12 GiB 3.2 TiB 55.45 1.28  43     up osd.18
 22   hdd    7.27599  1.00000  7.3 TiB 4.5 TiB 4.5 TiB 2.2 GiB  12 GiB 2.8 TiB 61.64 1.42  46     up osd.22
 26   hdd    7.27599  1.00000  7.3 TiB 3.7 TiB 3.7 TiB 2.3 GiB  11 GiB 3.6 TiB 51.11 1.18  39     up osd.26
 30   hdd    7.27599  1.00000  7.3 TiB 3.7 TiB 3.6 TiB 2.4 GiB  10 GiB 3.6 TiB 50.23 1.16  39     up osd.30
 34   hdd    7.27599  1.00000  7.3 TiB 4.2 TiB 4.2 TiB  59 MiB  11 GiB 3.1 TiB 58.04 1.33  43     up osd.34
 38   hdd    7.27599  1.00000  7.3 TiB 3.8 TiB 3.8 TiB  44 MiB 9.8 GiB 3.5 TiB 51.86 1.19  38     up osd.38
 42   hdd    7.27599  1.00000  7.3 TiB 3.7 TiB 3.7 TiB 2.7 GiB  11 GiB 3.5 TiB 51.35 1.18  41     up osd.42
 46   hdd    7.27599  1.00000  7.3 TiB 3.3 TiB 3.3 TiB 3.0 GiB 9.3 GiB 4.0 TiB 45.60 1.05  35     up osd.46
 50   hdd    7.27599  1.00000  7.3 TiB 3.6 TiB 3.6 TiB 2.5 GiB  11 GiB 3.7 TiB 49.59 1.14  40     up osd.50
 54   hdd    7.27599  1.00000  7.3 TiB 3.5 TiB 3.5 TiB  54 MiB 9.8 GiB 3.7 TiB 48.78 1.12  35     up osd.54
 58   hdd    7.27599  1.00000  7.3 TiB 3.7 TiB 3.7 TiB 2.3 GiB 9.7 GiB 3.6 TiB 50.55 1.16  39     up osd.58
 62   hdd    7.27599  1.00000  7.3 TiB 3.5 TiB 3.5 TiB 2.4 GiB 9.2 GiB 3.8 TiB 48.00 1.10  40     up osd.62
-11                0        -      0 B     0 B     0 B     0 B     0 B     0 B     0    0   -                host lpnceph05
-19         87.31200        -   87 TiB  44 TiB  44 TiB  31 GiB 127 GiB  43 TiB 50.92 1.17   -                host lpnceph07
 89   hdd    7.27599  1.00000  7.3 TiB 3.4 TiB 3.4 TiB 4.2 GiB  11 GiB 3.9 TiB 46.67 1.07  38     up osd.89
 90   hdd    7.27599  1.00000  7.3 TiB 3.7 TiB 3.7 TiB 2.4 GiB 9.9 GiB 3.6 TiB 50.71 1.17  39     up osd.90
 91   hdd    7.27599  1.00000  7.3 TiB 4.4 TiB 4.4 TiB 2.4 GiB  11 GiB 2.9 TiB 60.10 1.38  47     up osd.91
 92   hdd    7.27599  1.00000  7.3 TiB 3.3 TiB 3.3 TiB 4.1 GiB  11 GiB 4.0 TiB 45.67 1.05  36     up osd.92
 93   hdd    7.27599  1.00000  7.3 TiB 3.4 TiB 3.4 TiB 2.2 GiB 9.3 GiB 3.9 TiB 46.63 1.07  38     up osd.93
 94   hdd    7.27599  1.00000  7.3 TiB 3.7 TiB 3.7 TiB 2.3 GiB 9.8 GiB 3.6 TiB 50.50 1.16  39     up osd.94
 95   hdd    7.27599  1.00000  7.3 TiB 4.1 TiB 4.1 TiB 2.3 GiB  11 GiB 3.2 TiB 56.33 1.30  44     up osd.95
 97   hdd    7.27599  1.00000  7.3 TiB 3.8 TiB 3.8 TiB 2.4 GiB  10 GiB 3.5 TiB 52.08 1.20  41     up osd.97
 98   hdd    7.27599  1.00000  7.3 TiB 3.1 TiB 3.1 TiB 3.2 GiB 9.4 GiB 4.2 TiB 42.94 0.99  34     up osd.98
 99   hdd    7.27599  1.00000  7.3 TiB 3.9 TiB 3.9 TiB 2.9 GiB  12 GiB 3.3 TiB 54.05 1.24  43     up osd.99
103   hdd    7.27599  1.00000  7.3 TiB 3.3 TiB 3.3 TiB 2.3 GiB 9.7 GiB 3.9 TiB 45.87 1.06  36     up osd.103
104   hdd    7.27599  1.00000  7.3 TiB 4.3 TiB 4.3 TiB  56 MiB  13 GiB 3.0 TiB 59.43 1.37  44     up osd.104
-23        349.26453        -  349 TiB 124 TiB 123 TiB  65 GiB 321 GiB 225 TiB 35.45 0.82   -                host lpnceph09
  3   hdd   14.55269  1.00000   15 TiB 5.3 TiB 5.3 TiB 4.0 GiB  14 GiB 9.2 TiB 36.65 0.84  58     up osd.3
 23   hdd   14.55269  1.00000   15 TiB 5.0 TiB 5.0 TiB 2.3 GiB  12 GiB 9.5 TiB 34.52 0.79  55     up osd.23
 63   hdd   14.55269  1.00000   15 TiB 5.2 TiB 5.2 TiB 2.4 GiB  13 GiB 9.3 TiB 35.98 0.83  55     up osd.63
 64   hdd   14.55269  1.00000   15 TiB 5.2 TiB 5.2 TiB 3.6 GiB  15 GiB 9.3 TiB 35.81 0.82  57     up osd.64
 65   hdd   14.55269  1.00000   15 TiB 5.7 TiB 5.7 TiB 4.6 GiB  16 GiB 8.8 TiB 39.41 0.91  56     up osd.65
 66   hdd   14.55269  1.00000   15 TiB 5.0 TiB 5.0 TiB 2.4 GiB  13 GiB 9.6 TiB 34.33 0.79  55     up osd.66
 67   hdd   14.55269  1.00000   15 TiB 5.1 TiB 5.1 TiB 2.4 GiB  13 GiB 9.4 TiB 35.31 0.81  56     up osd.67
 68   hdd   14.55269  1.00000   15 TiB 5.6 TiB 5.5 TiB 2.3 GiB  14 GiB 9.0 TiB 38.24 0.88  58     up osd.68
 69   hdd   14.55269  1.00000   15 TiB 5.9 TiB 5.8 TiB 2.3 GiB  15 GiB 8.7 TiB 40.30 0.93  58     up osd.69
 70   hdd   14.55269  1.00000   15 TiB 4.8 TiB 4.8 TiB 3.0 GiB  13 GiB 9.7 TiB 33.21 0.76  51     up osd.70
 71   hdd   14.55269  1.00000   15 TiB 5.2 TiB 5.2 TiB 2.2 GiB  13 GiB 9.4 TiB 35.74 0.82  57     up osd.71
 73   hdd   14.55269  1.00000   15 TiB 5.0 TiB 5.0 TiB 2.3 GiB  12 GiB 9.6 TiB 34.24 0.79  55     up osd.73
 88   hdd   14.55269  1.00000   15 TiB 5.0 TiB 5.0 TiB 2.3 GiB  12 GiB 9.5 TiB 34.61 0.80  51     up osd.88
 96   hdd   14.55269  1.00000   15 TiB 5.3 TiB 5.3 TiB 2.3 GiB  13 GiB 9.3 TiB 36.28 0.83  56     up osd.96
106   hdd   14.55269  1.00000   15 TiB 4.9 TiB 4.9 TiB 2.5 GiB  13 GiB 9.6 TiB 33.96 0.78  53     up osd.106
107   hdd   14.55269  1.00000   15 TiB 5.3 TiB 5.3 TiB 3.2 GiB  15 GiB 9.3 TiB 36.28 0.83  54     up osd.107
108   hdd   14.55269  1.00000   15 TiB 5.0 TiB 5.0 TiB 2.3 GiB  13 GiB 9.5 TiB 34.70 0.80  53     up osd.108
109   hdd   14.55269  1.00000   15 TiB 5.1 TiB 5.1 TiB 2.4 GiB  12 GiB 9.5 TiB 34.82 0.80  52     up osd.109
110   hdd   14.55269  1.00000   15 TiB 5.5 TiB 5.5 TiB 2.8 GiB  16 GiB 9.0 TiB 37.91 0.87  55     up osd.110
111   hdd   14.55269  1.00000   15 TiB 5.3 TiB 5.3 TiB 3.2 GiB  14 GiB 9.3 TiB 36.35 0.84  55     up osd.111
112   hdd   14.55269  1.00000   15 TiB 5.0 TiB 5.0 TiB 2.9 GiB  14 GiB 9.6 TiB 34.18 0.79  55     up osd.112
113   hdd   14.55269  1.00000   15 TiB 4.6 TiB 4.6 TiB 2.3 GiB  12 GiB  10 TiB 31.47 0.72  48     up osd.113
114   hdd   14.55269  1.00000   15 TiB 5.0 TiB 4.9 TiB 3.3 GiB  13 GiB 9.6 TiB 34.07 0.78  53     up osd.114
115   hdd   14.55269  1.00000   15 TiB 4.7 TiB 4.7 TiB 2.3 GiB  12 GiB 9.8 TiB 32.47 0.75  51     up osd.115
                       TOTAL  1019 TiB 443 TiB 441 TiB 258 GiB 1.2 TiB 576 TiB 43.48
MIN/MAX VAR: 0.72/1.42  STDDEV: 6.69


and `ceph balancer status` returns:
{
       "last_optimize_duration": "0:00:02.223977",
       "plans": [],
       "mode": "upmap",
       "active": true,
       "optimize_result": "Unable to find further optimization, or pool(s) pg_num is decreasing, or distribution is already perfect",
       "last_optimize_started": "Sun Jan 31 17:07:47 2021"
}

Could the CRUSH placement rules be to blame for the uneven distribution?

F.

On 29/01/2021 at 23:44, Dan van der Ster wrote:
Thanks, and thanks for the log file OTR which simply showed:

       2021-01-29 23:17:32.567 7f6155cae700  4 mgr[balancer] prepared 0/10 changes

This indeed means that the balancer believes those pools are all balanced
according to the config (which you have set to the defaults).

Could you please also share the output of `ceph osd df tree` so we can
see the distribution and OSD weights?

You might simply need to decrease upmap_max_deviation from the
default of 5. On our clusters we do:

       ceph config set mgr mgr/balancer/upmap_max_deviation 1

Cheers, Dan

On Fri, Jan 29, 2021 at 11:25 PM Francois Legrand <f...@lpnhe.in2p3.fr> wrote:
Hi Dan,

Here is the output of `ceph balancer status`:

{
    "last_optimize_duration": "0:00:00.074965",
    "plans": [],
    "mode": "upmap",
    "active": true,
    "optimize_result": "Unable to find further optimization, or pool(s) pg_num is decreasing, or distribution is already perfect",
    "last_optimize_started": "Fri Jan 29 23:13:31 2021"
}


F.

On 29/01/2021 at 10:57, Dan van der Ster wrote:
Hi Francois,

What is the output of `ceph balancer status` ?
Also, can you increase the debug_mgr to 4/5 then share the log file of
the active mgr?

Best,

Dan

On Fri, Jan 29, 2021 at 10:54 AM Francois Legrand <f...@lpnhe.in2p3.fr> wrote:
Thanks for your suggestion. I will have a look!

But I am a bit surprised that the "official" balancer seems so inefficient!

F.

On 28/01/2021 at 12:00, Jonas Jelten wrote:
Hi!

We also suffer heavily from this, so I wrote a custom balancer which yields
much better results:
https://github.com/TheJJ/ceph-balancer

After you run it, it echoes the PG movements it suggests. You can then just run
those commands and the cluster will be better balanced.
It's kinda work in progress, so I'd be glad to get your feedback.

Maybe it helps you :)

-- Jonas

On 27/01/2021 17.15, Francois Legrand wrote:
Hi all,
I have a cluster with 116 disks (24 new 16 TB disks added in December, the
rest 8 TB) running Nautilus 14.2.16.
I moved from crush-compat to upmap balancing 8 months ago.
But the cluster does not seem well balanced: the number of PGs on the 8 TB
disks varies from 26 to 52, and their usage from 35 to 69%.
The recent 16 TB disks are more homogeneous, with 48 to 61 PGs and usage
between 30 and 43%.
Last week I realized that some OSDs were perhaps not using upmap, because
`ceph osd crush weight-set ls` returned (compat).
So I ran `ceph osd crush weight-set rm-compat`, which triggered some
rebalancing. There has been no recovery for 2 days now, but the cluster is
still unbalanced.
As far as I understand, upmap is supposed to reach an equal number of PGs on
all the disks (weighted by their capacity, I guess).
So I would expect roughly 30 PGs on the 8 TB disks, 60 on the 16 TB disks, and
around 50% usage on all of them, which is far from the case.
The problem is that this reduces the free space available in the pools (264 Ti,
while more than 578 Ti is free in the cluster), because the free space seems to
be based on the space available before the first OSD becomes full.
Is this normal? Did I miss something? What could I do?
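
The last point, that usable space is bounded by the OSD that fills up first, can be illustrated with a toy model in Python. This is a simplified sketch and not Ceph's actual MAX AVAIL calculation (which, among other things, also takes full ratios and the replica count into account); the function name, OSD names and numbers are made up for illustration.

```python
def writable_before_first_full(avail_tib: dict, share: dict) -> float:
    """Raw data (TiB) that can be written before one OSD runs out of space,
    assuming each OSD receives the fraction share[osd] of every new write."""
    return min(avail_tib[osd] / share[osd] for osd in avail_tib)


# Two OSDs that each receive half of the writes, with 8 TiB free in total.
balanced = writable_before_first_full({"osd.a": 4.0, "osd.b": 4.0},
                                      {"osd.a": 0.5, "osd.b": 0.5})
skewed = writable_before_first_full({"osd.a": 2.0, "osd.b": 6.0},
                                    {"osd.a": 0.5, "osd.b": 0.5})
print(balanced, skewed)
```

In both cases 8 TiB is free, but in the skewed case only 4 TiB can be written before osd.a is full, which is the effect described above.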

F.
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io