>> Very small and/or non-uniform clusters can be corner cases for many things, 
>> especially if they don’t have enough PGs.  What is your failure domain — 
>> host or OSD?
> 
> Failure domain is host,

Your host buckets vary in weight by roughly a factor of two.  They will naturally 
receive PGs roughly in proportion to their aggregate CRUSH weight, and so will the 
OSDs within each host.

> and PG number should be fairly reasonable.

Reasonable is in the eye of the beholder.  I make the PG ratio for the cluster as a 
whole out to be ~90.  I would definitely add more; that should help.
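
A quick way to eyeball that ratio from `ceph osd df`, assuming the PGS column is 
the second-to-last field on each OSD row (untested, off the cuff):

    ceph osd df | awk '$NF=="up" {pgs+=$(NF-1); osds++} END {printf "avg PG replicas per OSD: %.0f\n", pgs/osds}'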

>> Are your OSDs sized uniformly?  Please send the output of the following 
>> commands:
> 
> OSDs are definitely not uniform in size.  This might be the issue with
> the automation.
> 
> You asked for it, but I do apologize for the wall of text that follows...

You described a small cluster, so this is peanuts.

>> `ceph osd tree`
> 
> ID   CLASS  WEIGHT     TYPE NAME        STATUS  REWEIGHT  PRI-AFF
> -1         131.65762  root default
> -25          16.46977      host k8s1
> 14    hdd    5.45799          osd.14       up   0.90002  1.00000
> 19    hdd   10.91409          osd.19       up   1.00000  1.00000
> 22    ssd    0.09769          osd.22       up   1.00000  1.00000
> -13          25.56458      host k8s3
>  2    hdd   10.91409          osd.2        up   0.84998  1.00000
>  3    hdd    1.81940          osd.3        up   0.75002  1.00000
> 20    hdd   12.73340          osd.20       up   1.00000  1.00000
> 10    ssd    0.09769          osd.10       up   1.00000  1.00000
> -14          12.83107      host k8s4
>  0    hdd   10.91399          osd.0        up   1.00000  1.00000
>  5    hdd    1.81940          osd.5        up   1.00000  1.00000
> 11    ssd    0.09769          osd.11       up   1.00000  1.00000
> -2          14.65048      host k8s5
>  1    hdd    1.81940          osd.1        up   0.70001  1.00000
> 17    hdd   12.73340          osd.17       up   1.00000  1.00000
> 12    ssd    0.09769          osd.12       up   1.00000  1.00000
> -6          14.65048      host k8s6
>  4    hdd    1.81940          osd.4        up   0.75000  1.00000
> 16    hdd   12.73340          osd.16       up   0.95001  1.00000
> 13    ssd    0.09769          osd.13       up   1.00000  1.00000
> -3          23.74518      host k8s7
>  6    hdd   12.73340          osd.6        up   1.00000  1.00000
> 15    hdd   10.91409          osd.15       up   0.95001  1.00000
>  8    ssd    0.09769          osd.8        up   1.00000  1.00000
> -9          23.74606      host k8s8
>  7    hdd   14.55269          osd.7        up   1.00000  1.00000
> 18    hdd    9.09569          osd.18       up   1.00000  1.00000
>  9    ssd    0.09769          osd.9        up   1.00000  1.00000

Looks like one 100GB SSD OSD per host?  This is AIUI the screaming minimum size 
for an OSD.  With WAL, DB, cluster maps, and other overhead there doesn’t end 
up being much space left for payload data; on larger OSDs that overhead fades 
into the noise floor.  Given the size of these SSD OSDs, I suspect at least one 
of the following is true:

1) They’re client aka desktop SSDs, not “enterprise”
2) They’re a partition of a larger device shared with other purposes

I suspect that this alone would be enough to frustrate the balancer, which 
AFAIK doesn’t take overhead into consideration.  You might disable the balancer 
module, reset the reweights to 1.00, and try the JJ balancer, but I dunno that 
it would be night vs. day.
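
Off the top of my head, the first two steps would look something like this (OSD 
IDs taken from your `ceph osd df` output; the JJ balancer lives at 
github.com/TheJJ/ceph-balancer IIRC, see its README for how to run it):

    # stop the built-in balancer module from fighting the change
    ceph balancer off
    # reset the override reweights back to 1.0
    for id in 1 2 3 4 14 15 16; do ceph osd reweight $id 1.0; done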


> Note this cluster is in the middle of re-creating all the OSDs to
> modify the OSD allocation size

min_alloc_size?  Were they created on an older Ceph release?  Current defaults 
for [non]rotational media are both 4KB; the HDD default used to be 64KB but was 
changed, with some churn, around the Octopus / Pacific era IIRC.  If you’re 
re-creating to minimize space amp, does that mean you’re running RGW with a 
significant fraction of small objects?  With RBD — or CephFS with larger files 
— that isn’t so much of an issue.
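
If you want to confirm what you’re getting, something like this (osd.6 picked 
arbitrarily as an example):

    # defaults that new OSDs would be created with
    ceph config get osd bluestore_min_alloc_size_hdd
    ceph config get osd bluestore_min_alloc_size_ssd
    # what an existing OSD was actually built with; IIRC newer releases
    # report bluestore_min_alloc_size in the OSD metadata
    ceph osd metadata 6 | grep -i alloc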


> I have scrubbing disabled since I'm
> basically rewriting just about everything in the cluster weekly right
> now but normally that would be on.
> 
>  cluster:
>    id:     ba455d73-116e-4f24-8a34-a45e3ba9f44c
>    health: HEALTH_WARN
>            noscrub,nodeep-scrub flag(s) set
>            546 pgs not deep-scrubbed in time
>            542 pgs not scrubbed in time
> 
>  services:
>    mon: 3 daemons, quorum e,f,g (age 7d)
>    mgr: a(active, since 7d)
>    mds: 1/1 daemons up, 1 hot standby
>    osd: 22 osds: 22 up (since 5h), 22 in (since 33h); 101 remapped pgs
>         flags noscrub,nodeep-scrub
>    rgw: 1 daemon active (1 hosts, 1 zones)
> 
>  data:
>    volumes: 1/1 healthy
>    pools:   13 pools, 617 pgs
>    objects: 9.36M objects, 33 TiB
>    usage:   67 TiB used, 65 TiB / 132 TiB avail
>    pgs:     1778936/21708668 objects misplaced (8.195%)
>             516 active+clean
>             100 active+remapped+backfill_wait
>             1   active+remapped+backfilling
> 
>  io:
>    client:   371 KiB/s rd, 2.8 MiB/s wr, 2 op/s rd, 7 op/s wr
>    recovery: 25 MiB/s, 6 objects/s
> 
>  progress:
>    Global Recovery Event (7d)
>      [=======================.....] (remaining: 36h)
> 
>> `ceph osd df`
> 
> Note that these are not in a steady state right now.  OSD 6 in
> particular was just re-created and is repopulating.  A few of the
> reweights were set to deal with some gross issues in balance - when it
> all settles down I plan to optimize them.
> 
> ID  CLASS  WEIGHT    REWEIGHT  SIZE     RAW USE  DATA     OMAP     META     AVAIL     %USE   VAR   PGS  STATUS
> 14    hdd   5.45799   0.90002  5.5 TiB  3.0 TiB  3.0 TiB  2.0 MiB   11 GiB   2.4 TiB  55.51  1.09   72      up
> 19    hdd  10.91409   1.00000   11 TiB  6.2 TiB  6.2 TiB  3.1 MiB   16 GiB   4.7 TiB  57.12  1.12  144      up

Unless you were to carefully segregate larger and smaller HDDs into separate 
pools, right-sizing the PG count is tricky.  144 is okay, 72 is a bit low, 
upstream guidance notwithstanding.  I would still bump some of your pg_nums a 
bit.

> 22    ssd   0.09769   1.00000  100 GiB  2.4 GiB  1.8 GiB  167 MiB  504 MiB    98 GiB   2.43  0.05   32      up
>  2    hdd  10.91409   0.84998   11 TiB  4.5 TiB  4.5 TiB  5.0 MiB  9.7 GiB   6.4 TiB  41.11  0.81   99      up
>  3    hdd   1.81940   0.75002  1.8 TiB  1.0 TiB  1.0 TiB  2.3 MiB  3.8 GiB   818 GiB  56.11  1.10   21      up
> 20    hdd  12.73340   1.00000   13 TiB  7.1 TiB  7.1 TiB  3.7 MiB   16 GiB   5.6 TiB  56.01  1.10  165      up
> 10    ssd   0.09769   1.00000  100 GiB  1.3 GiB  299 MiB  185 MiB  835 MiB    99 GiB   1.29  0.03   38      up
>  0    hdd  10.91399   1.00000   11 TiB  6.5 TiB  6.5 TiB  3.7 MiB   15 GiB   4.4 TiB  59.41  1.17  144      up
>  5    hdd   1.81940   1.00000  1.8 TiB  845 GiB  842 GiB  1.7 MiB  3.3 GiB  1018 GiB  45.36  0.89   23      up
> 11    ssd   0.09769   1.00000  100 GiB  3.1 GiB  1.3 GiB  157 MiB  1.6 GiB    97 GiB   3.09  0.06   33      up
>  1    hdd   1.81940   0.70001  1.8 TiB  983 GiB  979 GiB  1.3 MiB  3.4 GiB   880 GiB  52.76  1.04   26      up
> 17    hdd  12.73340   1.00000   13 TiB  7.3 TiB  7.2 TiB  3.6 MiB   15 GiB   5.5 TiB  56.95  1.12  159      up
> 12    ssd   0.09769   1.00000  100 GiB  1.5 GiB  120 MiB   55 MiB  1.3 GiB    99 GiB   1.49  0.03   21      up
>  4    hdd   1.81940   0.75000  1.8 TiB  1.0 TiB  1.0 TiB  2.5 MiB  3.0 GiB   820 GiB  55.98  1.10   24      up
> 16    hdd  12.73340   0.95001   13 TiB  7.6 TiB  7.5 TiB  7.9 MiB   16 GiB   5.2 TiB  59.32  1.17  171      up
> 13    ssd   0.09769   1.00000  100 GiB  2.4 GiB  528 MiB  196 MiB  1.7 GiB    98 GiB   2.38  0.05   33      up
>  6    hdd  12.73340   1.00000   13 TiB  1.7 TiB  1.7 TiB  1.3 MiB  4.5 GiB    11 TiB  13.66  0.27   48      up
> 15    hdd  10.91409   0.95001   11 TiB  6.5 TiB  6.5 TiB  5.2 MiB   13 GiB   4.4 TiB  59.42  1.17  155      up
>  8    ssd   0.09769   1.00000  100 GiB  1.9 GiB  1.1 GiB  116 MiB  788 MiB    98 GiB   1.95  0.04   26      up
>  7    hdd  14.55269   1.00000   15 TiB  7.8 TiB  7.7 TiB  3.9 MiB   16 GiB   6.8 TiB  53.32  1.05  172      up
> 18    hdd   9.09569   1.00000  9.1 TiB  4.9 TiB  4.9 TiB  3.9 MiB   11 GiB   4.2 TiB  53.96  1.06  109      up
>  9    ssd   0.09769   1.00000  100 GiB  2.2 GiB  391 MiB  264 MiB  1.6 GiB    98 GiB   2.25  0.04   40      up
>                          TOTAL  132 TiB   67 TiB   67 TiB  1.2 GiB  164 GiB    65 TiB  50.82
> MIN/MAX VAR: 0.03/1.17  STDDEV: 29.78
> 
> 
>> `ceph osd dump | grep pool`
> 
> pool 1 '.mgr' replicated size 3 min_size 2 crush_rule 7 object_hash rjenkins 
> pg_num 1 pgp_num 1 autoscale_mode on pg_num_max 32 pg_num_min 1 application 
> mgr

Check the CRUSH rule for this pool.  On my clusters Rook creates it without 
specifying a device class, but the other pools get rules that do specify one.  
By way of the shadow CRUSH topology, this looks like multiple CRUSH roots to 
the pg_autoscaler, which is why you get no output from the status command 
below.  I added a bit to the docs earlier this year to call this out.  Perhaps 
the Rook folks on the list have thoughts about preventing this situation; I 
don’t recall if I created a GitHub issue for it.
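
To check and fix it by hand, something along these lines (the rule name and the 
ssd class are just examples, pick whichever class you want .mgr on):

    ceph osd pool get .mgr crush_rule
    ceph osd crush rule dump            # look for a rule with no device class
    # create a class-constrained replicated rule and point the pool at it
    ceph osd crush rule create-replicated replicated-ssd default host ssd
    ceph osd pool set .mgr crush_rule replicated-ssd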

That said, I’m personally not a fan of the pg autoscaler and tend to disable 
it; YMMV.  Unless you enable the “bulk” option, it may well be that you have 
too few PGs for effective bin packing.  Think about filling a 55-gallon drum 
with beach balls vs. golf balls.
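
FWIW the relevant knobs, whichever way you go (pool name is a placeholder):

    ceph osd pool set <pool> bulk true              # tell the autoscaler to aim high up front
    ceph osd pool set <pool> pg_autoscale_mode off  # or turn it off per pool
    ceph config set global osd_pool_default_pg_autoscale_mode off   # default for new pools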

So many pools for such a small cluster … are you actively using CephFS, RBD, 
*and* RGW?  If not, I’d suggest removing whatever you aren’t using and 
adjusting pg_num for the pools you are using.
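
e.g. something like this (double-check pool names before pulling the trigger; 
deletion requires mon_allow_pool_delete, and 256 is just an example pg_num):

    ceph config set mon mon_allow_pool_delete true
    ceph osd pool rm <pool> <pool> --yes-i-really-really-mean-it
    ceph osd pool set <pool> pg_num 256    # pgp_num follows automatically on Nautilus and later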

> pool 2 'myfs-metadata' replicated size 3 min_size 2 crush_rule 25 object_hash 
> rjenkins pg_num 16 pgp_num 16 
> pool 3 'myfs-replicated' replicated size 2 min_size 1 crush_rule 26 
> object_hash rjenkins pg_num 256 pgp_num 256 
> pool 4 'pvc-generic-pool' replicated size 3 min_size 2 crush_rule 17 
> object_hash rjenkins pg_num 128 pgp_num 128 
> pool 13 'myfs-eck2m2' erasure profile myfs-eck2m2_ecprofile size 4 min_size 3 
> crush_rule 8  pg_num 128 pgp_num 128
> pool 22 'my-store.rgw.otp' replicated size 3 min_size 2 crush_rule 24 pg_num 
> 8 pgp_num 8
> pool 23 'my-store.rgw.buckets.index' replicated size 3 min_size 2 pg_num 8 
> pgp_num 8
> pool 24 'my-store.rgw.log' replicated size 3 min_size 2 crush_rule 23 pg_num 
> 8 pgp_num 8
> pool 25 'my-store.rgw.control' replicated size 3 min_size 2 crush_rule 19 
> object_hash rjenkins pg_num 8 pgp_num 8
> pool 26 '.rgw.root' replicated size 3 min_size 2 crush_rule 18 pg_num 8 
> pgp_num 8
> pool 27 'my-store.rgw.buckets.non-ec' replicated size 3 min_size 2 pg_num 8 
> pgp_num 8
> pool 28 'my-store.rgw.meta' replicated size 3 min_size 2 pg_num 8 pgp_num 8
> pool 29 'my-store.rgw.buckets.data' erasure profile 
> my-store.rgw.buckets.data_ecprofile size 4 min_size 3 pg_num 32 pgp_num 32 
> autoscale_mode on

Is that a 2,2 or 3,1 profile? 
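
You can check with:

    ceph osd erasure-code-profile get myfs-eck2m2_ecprofile
    ceph osd erasure-code-profile get my-store.rgw.buckets.data_ecprofile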

> 
>> `ceph balancer status`
> 
> This does have normal output when the cluster isn't in the middle of recovery.

> 
> {
>    "active": true,
>    "last_optimize_duration": "0:00:00.000107",
>    "last_optimize_started": "Tue Nov 28 22:11:56 2023",
>    "mode": "upmap",
>    "no_optimization_needed": true,
>    "optimize_result": "Too many objects (0.081907 > 0.050000) are
> misplaced; try again later",
>    "plans": []
> }
> 
>> `ceph osd pool autoscale-status`
> 
> No output for this.  I'm not sure why

See above, I suspected this.
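
If you want to see what the autoscaler is tripping over:

    ceph osd crush tree --show-shadow    # shows the per-device-class shadow hierarchy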


> - this has given output in the
> past.  Might be due to being in the middle of recovery, or it might be
> a Reef issue (I don't think I've looked at this since upgrading).  In
> any case, PG counts are in the osd dump, and I have the hdd storage
> classes set to warn I think.
> 
>> The balancer module can be confounded by certain complex topologies like 
>> multiple device classes and/or CRUSH roots.
>> 
>> Since you’re using Rook, I wonder if you might be hitting something that 
>> I’ve seen myself; the above commands will tell the tale.
> 
> Yeah, if it is designed for equally-sized OSDs then it isn't going to
> work quite right for me.  I do try to keep hosts reasonably balanced,
> but not individual OSDs. 

Ceph is fantastic for flexibility, but it’s not above giving us enough rope to 
hang ourselves with.

> 
> --
> Rich

