[ceph-users] Re: Best Practice for OSD Balancing

2023-11-28 Thread Anthony D'Atri

Sent too quickly — also note that consumer / client SSDs often don’t have 
power-loss protection, so if your whole cluster were to lose power at the wrong 
time, you might lose data.

> On Nov 28, 2023, at 8:16 PM, Anthony D'Atri  wrote:
> 
> 
>>> 
>>> 1) They’re client aka desktop SSDs, not “enterprise”
>>> 2) They’re a partition of a larger OSD shared with other purposes
>> 
>> Yup.  They're a mix of SATA SSDs and NVMes, but everything is
>> consumer-grade.  They're only 10% full on average and I'm not
>> super-concerned with performance.  If they did get full I'd allocate
>> more space for them.  Performance is more than adequate for the very
>> light loads they have.
> 
> Fair enough.  We sometimes see people bringing a toothpick to a gun fight and 
> expecting a different result, so I had to ask.  Just keep an eye on their 
> endurance burn.
> 
>> 
>> 
>> It is interesting because Quincy had no issues with the autoscaler
>> with the exact same cluster config.  It might be a Rook issue, or it
>> might just be because so many PGs are remapped.  I'll take another
>> look at that once it reaches more of a steady state.
>> 
>> In any case, if the balancer is designed more for equal-sized OSDs I
>> can always just play with reweights to balance things.
> 
> Look into the JJ balancer, I’ve read good things about it.
> 
>> 
>> --
>> Rich


[ceph-users] Re: Best Practice for OSD Balancing

2023-11-28 Thread Anthony D'Atri

>> 
>> 1) They’re client aka desktop SSDs, not “enterprise”
>> 2) They’re a partition of a larger OSD shared with other purposes
> 
> Yup.  They're a mix of SATA SSDs and NVMes, but everything is
> consumer-grade.  They're only 10% full on average and I'm not
> super-concerned with performance.  If they did get full I'd allocate
> more space for them.  Performance is more than adequate for the very
> light loads they have.

Fair enough.  We sometimes see people bringing a toothpick to a gun fight and 
expecting a different result, so I had to ask.  Just keep an eye on their 
endurance burn.
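If it helps, Ceph's own device health tracking will show the SMART data it has 
scraped, or you can go straight to smartctl on each host.  A rough sketch — the 
<devid> and /dev/sdX bits are placeholders, and wear attribute names vary by vendor:

  # via the mgr devicehealth module, if it's enabled:
  ceph device ls
  ceph device get-health-metrics <devid>

  # or per drive on the host (field names differ between vendors):
  smartctl -a /dev/sdX | grep -Ei 'wear|percent|written'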

> 
> 
> It is interesting because Quincy had no issues with the autoscaler
> with the exact same cluster config.  It might be a Rook issue, or it
> might just be because so many PGs are remapped.  I'll take another
> look at that once it reaches more of a steady state.
> 
> In any case, if the balancer is designed more for equal-sized OSDs I
> can always just play with reweights to balance things.

Look into the JJ balancer, I’ve read good things about it.
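It lives at https://github.com/TheJJ/ceph-balancer.  A minimal dry run looks 
roughly like this — flags may differ between versions, so check its README first:

  git clone https://github.com/TheJJ/ceph-balancer.git
  cd ceph-balancer
  # show a handful of proposed moves without changing anything:
  ./placementoptimizer.py -v balance --max-pg-moves 10 | tee /tmp/balance-upmaps
  # review, then apply the generated upmap commands:
  bash /tmp/balance-upmaps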

> 
> --
> Rich


[ceph-users] Re: Best Practice for OSD Balancing

2023-11-28 Thread Rich Freeman
On Tue, Nov 28, 2023 at 6:25 PM Anthony D'Atri  wrote:
> Looks like one 100GB SSD OSD per host? This is AIUI the screaming minimum 
> size for an OSD.  With WAL, DB, cluster maps, and other overhead there 
> doesn’t end up being much space left for payload data.  On larger OSDs the 
> overhead is much closer to the noise floor.  Given the size of these SSD 
> OSDs, I suspect at least one of the following is true?
>
> 1) They’re client aka desktop SSDs, not “enterprise”
> 2) They’re a partition of a larger OSD shared with other purposes

Yup.  They're a mix of SATA SSDs and NVMes, but everything is
consumer-grade.  They're only 10% full on average and I'm not
super-concerned with performance.  If they did get full I'd allocate
more space for them.  Performance is more than adequate for the very
light loads they have.

>
> I suspect that this alone would be enough to frustrate the balancer, which 
> AFAIK doesn’t take overhead into consideration.  You might disable the 
> balancer module, reset the reweights to 1.00, and try the JJ balancer but I 
> dunno that it would be night vs day.

I'm not really all that concerned with SSD balancing, since if data
needs to be moved around it happens almost instantaneously.  They're
small and on 10GbE.

Also, there are no pools that cross the hdd/ssd device classes, so I
would hope the balancer wouldn't get confused by having both in the
cluster.
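For what it's worth, that's easy to confirm — each pool's rule should take from 
a single device class:

  ceph osd dump | grep pool                    # shows which crush_rule each pool uses
  ceph osd crush rule dump | grep -E '"rule_name"|"item_name"'   # class rules show e.g. default~hdd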

> min_alloc_size?  Were they created on an older Ceph release?  Current 
> defaults for [non]rotational media are both 4KB; they used to be 64KB but 
> were changed with some churn …. around the Pacific / Octopus era IIRC.  If 
> you’re re-creating to minimize space amp, does that mean you’re running RGW 
> with a significant fraction of small objects?  With RBD — or CephFS with 
> larger files — that isn’t so much an issue.

They were created with 4k min_alloc_size. I'm increasing this to 64k
for the hdd osds.  I'm hoping that will improve performance a bit on
large files (average file size is multiple MB at least I think), and
if nothing else it seems to greatly reduce OSD RAM consumption so that
alone is useful.
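For anyone following along, the value is fixed at OSD creation time, so the rough 
sequence (a sketch — adjust for however your OSDs get deployed, Rook in my case) is:

  # default for newly created HDD OSDs: 64 KiB
  ceph config set osd bluestore_min_alloc_size_hdd 65536
  # then destroy and re-create each HDD OSD; existing OSDs keep their old value.
  # on recent releases the applied value is reported in the OSD metadata:
  ceph osd metadata 19 | grep min_alloc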

>
> Unless you were to carefully segregate larger and smaller HDDs into separate 
> pools, right-sizing the PG count is tricky.  144 is okay, 72 is a bit low, 
> upstream guidance notwithstanding.  I would still bump some of your pg_nums a 
> bit.

The larger OSDs (which is the bulk of the capacity) have 150+ PGs
right now.  The small ones of course have far less.  I might bump up
one of the CephFS pools as it is starting to accumulate a bit more
data.
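If I do bump it, it's just something like the following (the pool name here is a 
placeholder; with the autoscaler on, pg_num_min is the knob that keeps it from 
being undone):

  ceph osd pool set cephfs_data pg_num 256
  # pgp_num follows automatically on recent releases; otherwise:
  ceph osd pool set cephfs_data pgp_num 256
  # or, with the autoscaler enabled, set a floor instead:
  ceph osd pool set cephfs_data pg_num_min 256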

>
>> pool 1 '.mgr' replicated size 3 min_size 2 crush_rule 7 object_hash rjenkins 
>> pg_num 1 pgp_num 1 autoscale_mode on pg_num_max 32 pg_num_min 1 application 
>> mgr
>
>
> Check the CRUSH rule for this pool.  On my clusters Rook creates it without 
> specifying a device-class, but the other pools get rules that do specify a 
> device class.

The .mgr pool has 1 pg, and is set to use ssd devices only.  Its 3
OSDs are all SSDs right now.
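For anyone who needs to do the same, a class-specific replicated rule is all it 
takes — the rule name below is arbitrary:

  ceph osd crush rule create-replicated replicated_ssd default host ssd
  ceph osd pool set .mgr crush_rule replicated_ssd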

> So many pools for such a small cluster …. are you actively using CephFS, RBD, 
> *and* RGW?  If not, I’d suggest removing whatever you aren’t using and 
> adjusting pg_num for the pools you are using.

So, I'm using RBD on SSD (128 PGs - maybe a bit overkill for this but
those OSDs don't have anything else going on), and the bulk of the
storage is on CephFS on HDD with two pools.  I've been experimenting a
bit with RGW but those pools are basically empty and mostly have 8 PGs
each.

> Is that a 2,2 or 3,1 profile?

The EC pool?  That is k=2, m=2.  I am thinking about moving that to a
3+2 pool once I'm done with all the migration to be a bit more
space-efficient, but I only have 7 nodes and they aren't completely
balanced so I don't really want to stripe the data more than that.
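Since the EC profile of an existing pool can't be changed, that move would mean a 
new profile and pool plus a data migration — roughly this, with placeholder names:

  ceph osd erasure-code-profile set ec-3-2 k=3 m=2 crush-failure-domain=host
  ceph osd pool create cephfs_data_ec32 erasure ec-3-2
  ceph osd pool set cephfs_data_ec32 allow_ec_overwrites true
  ceph fs add_data_pool <fsname> cephfs_data_ec32
  # then point directories at the new pool via a file layout and copy the data across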

It is interesting because Quincy had no issues with the autoscaler
with the exact same cluster config.  It might be a Rook issue, or it
might just be because so many PGs are remapped.  I'll take another
look at that once it reaches more of a steady state.

In any case, if the balancer is designed more for equal-sized OSDs I
can always just play with reweights to balance things.

--
Rich


[ceph-users] Re: Best Practice for OSD Balancing

2023-11-28 Thread Anthony D'Atri

>> Very small and/or non-uniform clusters can be corner cases for many things, 
>> especially if they don’t have enough PGs.  What is your failure domain — 
>> host or OSD?
> 
> Failure domain is host,

Your host buckets do vary in weight by roughly a factor of two.  They will 
naturally get PGs in proportion to their aggregate CRUSH weight, and so will the 
OSDs on each host.

> and PG number should be fairly reasonable.

Reasonable is in the eye of the beholder.  I make the PG ratio for the cluster as 
a whole to be ~90.  I would definitely add more PGs; that should help.
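That ratio being total PG placements across all pools divided by the number of 
OSDs.  Assuming the usual `ceph osd df` layout, where PGS is the second-to-last 
column, a quick way to eyeball the average is:

  ceph osd df | awk '$1 ~ /^[0-9]+$/ {sum += $(NF-1); n++} END {print sum/n}'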

>> Are your OSDs sized uniformly?  Please send the output of the following 
>> commands:
> 
> OSDs are definitely not uniform in size.  This might be the issue with
> the automation.
> 
> You asked for it, but I do apologize for the wall of text that follows...

You described a small cluster, so this is peanuts.

>> `ceph osd tree`
> 
> ID   CLASS  WEIGHT     TYPE NAME      STATUS  REWEIGHT  PRI-AFF
>  -1         131.65762  root default
> -25          16.46977      host k8s1
>  14    hdd    5.45799          osd.14     up   0.90002  1.0
>  19    hdd   10.91409          osd.19     up   1.0      1.0
>  22    ssd    0.09769          osd.22     up   1.0      1.0
> -13          25.56458      host k8s3
>   2    hdd   10.91409          osd.2      up   0.84998  1.0
>   3    hdd    1.81940          osd.3      up   0.75002  1.0
>  20    hdd   12.73340          osd.20     up   1.0      1.0
>  10    ssd    0.09769          osd.10     up   1.0      1.0
> -14          12.83107      host k8s4
>   0    hdd   10.91399          osd.0      up   1.0      1.0
>   5    hdd    1.81940          osd.5      up   1.0      1.0
>  11    ssd    0.09769          osd.11     up   1.0      1.0
>  -2          14.65048      host k8s5
>   1    hdd    1.81940          osd.1      up   0.70001  1.0
>  17    hdd   12.73340          osd.17     up   1.0      1.0
>  12    ssd    0.09769          osd.12     up   1.0      1.0
>  -6          14.65048      host k8s6
>   4    hdd    1.81940          osd.4      up   0.75000  1.0
>  16    hdd   12.73340          osd.16     up   0.95001  1.0
>  13    ssd    0.09769          osd.13     up   1.0      1.0
>  -3          23.74518      host k8s7
>   6    hdd   12.73340          osd.6      up   1.0      1.0
>  15    hdd   10.91409          osd.15     up   0.95001  1.0
>   8    ssd    0.09769          osd.8      up   1.0      1.0
>  -9          23.74606      host k8s8
>   7    hdd   14.55269          osd.7      up   1.0      1.0
>  18    hdd    9.09569          osd.18     up   1.0      1.0
>   9    ssd    0.09769          osd.9      up   1.0      1.0

Looks like one 100GB SSD OSD per host? This is AIUI the screaming minimum size 
for an OSD.  With WAL, DB, cluster maps, and other overhead there doesn’t end 
up being much space left for payload data.  On larger OSDs the overhead is much 
closer to the noise floor.  Given the size of these SSD OSDs, I suspect at 
least one of the following is true?

1) They’re client aka desktop SSDs, not “enterprise”
2) They’re a partition of a larger OSD shared with other purposes

I suspect that this alone would be enough to frustrate the balancer, which 
AFAIK doesn’t take overhead into consideration.  You might disable the balancer 
module, reset the reweights to 1.00, and try the JJ balancer but I dunno that 
it would be night vs day.
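A sketch of that, using the OSD IDs from your tree that currently sit below 1.0 — 
double-check before pasting into a live cluster:

  ceph balancer off
  for id in 14 2 3 1 4 16 15; do ceph osd reweight $id 1.0; done
  # then run the JJ balancer / placementoptimizer and apply its upmaps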


> Note this cluster is in the middle of re-creating all the OSDs to
> modify the OSD allocation size

min_alloc_size?  Were they created on an older Ceph release?  Current defaults 
for [non]rotational media are both 4KB; they used to be 64KB but were changed 
with some churn …. around the Pacific / Octopus era IIRC.  If you’re 
re-creating to minimize space amp, does that mean you’re running RGW with a 
significant fraction of small objects?  With RBD — or CephFS with larger files 
— that isn’t so much an issue.


> I have scrubbing disabled since I'm
> basically rewriting just about everything in the cluster weekly right
> now but normally that would be on.
> 
>  cluster:
>id: ba455d73-116e-4f24-8a34-a45e3ba9f44c
>health: HEALTH_WARN
>noscrub,nodeep-scrub flag(s) set
>546 pgs not deep-scrubbed in time
>542 pgs not scrubbed in time
> 
>  services:
>mon: 3 daemons, quorum e,f,g (age 7d)
>mgr: a(active, since 7d)
>mds: 1/1 daemons up, 1 hot standby
>osd: 22 osds: 22 up (since 5h), 22 in (since 33h); 101 remapped pgs
> flags noscrub,nodeep-scrub
>rgw: 1 daemon active (1 hosts, 1 zones)
> 
>  data:
>volumes: 1/1 healthy
>pools:   13 pools, 617 pgs
>objects: 9.36M objects, 33 TiB
>usage:   67 TiB used, 65 TiB / 132 TiB avail
>pgs: 1778936/21708668 objects misplaced (8.195%)
> 516 active+clean
> 100 active+remapped+backfill_wait
> 1   active+remapped+backfilling

[ceph-users] Re: Best Practice for OSD Balancing

2023-11-28 Thread Rich Freeman
On Tue, Nov 28, 2023 at 3:52 PM Anthony D'Atri  wrote:
>
> Very small and/or non-uniform clusters can be corner cases for many things, 
> especially if they don’t have enough PGs.  What is your failure domain — host 
> or OSD?

Failure domain is host, and PG number should be fairly reasonable.

>
> Are your OSDs sized uniformly?  Please send the output of the following 
> commands:

OSDs are definitely not uniform in size.  This might be the issue with
the automation.

You asked for it, but I do apologize for the wall of text that follows...

>
> `ceph osd tree`

ID   CLASS  WEIGHT     TYPE NAME      STATUS  REWEIGHT  PRI-AFF
 -1         131.65762  root default
-25          16.46977      host k8s1
 14    hdd    5.45799          osd.14     up   0.90002  1.0
 19    hdd   10.91409          osd.19     up   1.0      1.0
 22    ssd    0.09769          osd.22     up   1.0      1.0
-13          25.56458      host k8s3
  2    hdd   10.91409          osd.2      up   0.84998  1.0
  3    hdd    1.81940          osd.3      up   0.75002  1.0
 20    hdd   12.73340          osd.20     up   1.0      1.0
 10    ssd    0.09769          osd.10     up   1.0      1.0
-14          12.83107      host k8s4
  0    hdd   10.91399          osd.0      up   1.0      1.0
  5    hdd    1.81940          osd.5      up   1.0      1.0
 11    ssd    0.09769          osd.11     up   1.0      1.0
 -2          14.65048      host k8s5
  1    hdd    1.81940          osd.1      up   0.70001  1.0
 17    hdd   12.73340          osd.17     up   1.0      1.0
 12    ssd    0.09769          osd.12     up   1.0      1.0
 -6          14.65048      host k8s6
  4    hdd    1.81940          osd.4      up   0.75000  1.0
 16    hdd   12.73340          osd.16     up   0.95001  1.0
 13    ssd    0.09769          osd.13     up   1.0      1.0
 -3          23.74518      host k8s7
  6    hdd   12.73340          osd.6      up   1.0      1.0
 15    hdd   10.91409          osd.15     up   0.95001  1.0
  8    ssd    0.09769          osd.8      up   1.0      1.0
 -9          23.74606      host k8s8
  7    hdd   14.55269          osd.7      up   1.0      1.0
 18    hdd    9.09569          osd.18     up   1.0      1.0
  9    ssd    0.09769          osd.9      up   1.0      1.0

>
> so that we can see the topology.
>
> `ceph -s`

Note this cluster is in the middle of re-creating all the OSDs to
modify the OSD allocation size - I have scrubbing disabled since I'm
basically rewriting just about everything in the cluster weekly right
now but normally that would be on.

  cluster:
id: ba455d73-116e-4f24-8a34-a45e3ba9f44c
health: HEALTH_WARN
noscrub,nodeep-scrub flag(s) set
546 pgs not deep-scrubbed in time
542 pgs not scrubbed in time

  services:
mon: 3 daemons, quorum e,f,g (age 7d)
mgr: a(active, since 7d)
mds: 1/1 daemons up, 1 hot standby
osd: 22 osds: 22 up (since 5h), 22 in (since 33h); 101 remapped pgs
 flags noscrub,nodeep-scrub
rgw: 1 daemon active (1 hosts, 1 zones)

  data:
volumes: 1/1 healthy
pools:   13 pools, 617 pgs
objects: 9.36M objects, 33 TiB
usage:   67 TiB used, 65 TiB / 132 TiB avail
pgs: 1778936/21708668 objects misplaced (8.195%)
 516 active+clean
 100 active+remapped+backfill_wait
 1   active+remapped+backfilling

  io:
client:   371 KiB/s rd, 2.8 MiB/s wr, 2 op/s rd, 7 op/s wr
recovery: 25 MiB/s, 6 objects/s

  progress:
Global Recovery Event (7d)
  [===.] (remaining: 36h)

> `ceph osd df`

Note that these are not in a steady state right now.  OSD 6 in
particular was just re-created and is repopulating.  A few of the
reweights were set to deal with some gross issues in balance - when it
all settles down I plan to optimize them.

ID  CLASS  WEIGHT    REWEIGHT  SIZE     RAW USE  DATA     OMAP     META     AVAIL    %USE   VAR   PGS  STATUS
14    hdd   5.45799   0.90002  5.5 TiB  3.0 TiB  3.0 TiB  2.0 MiB   11 GiB  2.4 TiB  55.51  1.09   72  up
19    hdd  10.91409   1.0       11 TiB  6.2 TiB  6.2 TiB  3.1 MiB   16 GiB  4.7 TiB  57.12  1.12  144  up
22    ssd   0.09769   1.0      100 GiB  2.4 GiB  1.8 GiB  167 MiB  504 MiB   98 GiB   2.43  0.05   32  up
 2    hdd  10.91409   0.84998   11 TiB  4.5 TiB  4.5 TiB  5.0 MiB  9.7 GiB  6.4 TiB  41.11  0.81   99  up
 3    hdd   1.81940   0.75002  1.8 TiB  1.0 TiB  1.0 TiB  2.3 MiB  3.8 GiB  818 GiB  56.11  1.10   21  up
20    hdd  12.73340   1.0       13 TiB  7.1 TiB  7.1 TiB  3.7 MiB   16 GiB  5.6 TiB  56.01  1.10  165  up
10    ssd   0.09769   1.0      100 GiB  1.3 GiB  299 MiB  185 MiB  835 MiB   99 GiB   1.29  0.03   38  up
 0    hdd  10.91399   1.0       11 TiB  6.5 TiB  6.5 TiB  3.7 MiB   15 GiB  4.4 TiB  59.41  1.17  144  up
 5    hdd   1.81940   1.0      1.8 TiB  

[ceph-users] Re: Best Practice for OSD Balancing

2023-11-28 Thread Wesley Dillingham
It's a complicated topic and there is no single answer; it varies from cluster to
cluster.  You have a good lay of the land.

I just wanted to mention that the correct "foundation" for equally utilized
OSDs within a cluster relies on two important factors:

- Symmetry of disk/osd quantity and capacity (weight) between hosts.
- Achieving the correct amount of PGs-per-osd (typically between 100 and
200).

Without reasonable settings for these two factors, the various higher-level
balancing techniques won't work well, or at all.

Respectfully,

*Wes Dillingham*
w...@wesdillingham.com


On Tue, Nov 28, 2023 at 3:27 PM Rich Freeman  wrote:

> I'm fairly new to Ceph and running Rook on a fairly small cluster
> (half a dozen nodes, about 15 OSDs).  I notice that OSD space use can
> vary quite a bit - upwards of 10-20%.
>
> In the documentation I see multiple ways of managing this, but no
> guidance on what the "correct" or best way to go about this is.  As
> far as I can tell there is the balancer, manual manipulation of upmaps
> via the command line tools, and OSD reweight.  The last two can be
> optimized with tools to calculate appropriate corrections.  There is
> also the new read/active upmap (at least for non-EC pools), which is
> manually triggered.
>
> The balancer alone is leaving fairly wide deviations in space use, and
> at times during recovery this can become more significant.  I've seen
> OSDs hit the 80% threshold and start impacting IO when the entire
> cluster is only 50-60% full during recovery.
>
> I've started using ceph osd reweight-by-utilization and that seems
> much more effective at balancing things, but this seems redundant with
> the balancer which I have turned on.
>
> What is generally considered the best practice for OSD balancing?
>
> --
> Rich


[ceph-users] Re: Best Practice for OSD Balancing

2023-11-28 Thread Anthony D'Atri


> 
> I'm fairly new to Ceph and running Rook on a fairly small cluster
> (half a dozen nodes, about 15 OSDs).

Very small and/or non-uniform clusters can be corner cases for many things, 
especially if they don’t have enough PGs.  What is your failure domain — host 
or OSD?

Are your OSDs sized uniformly?  Please send the output of the following 
commands:

`ceph osd tree`

so that we can see the topology.

`ceph -s`
`ceph osd df`
`ceph osd dump | grep pool`
`ceph balancer status`
`ceph osd pool autoscale-status`


>  I notice that OSD space use can
> vary quite a bit - upwards of 10-20%.
> 
> In the documentation I see multiple ways of managing this, but no
> guidance on what the "correct" or best way to go about this is.

Assuming that you’re running a recent release, and that the balancer module is 
enabled, that *should* be the right way.

The balancer module can be confounded by certain complex topologies like 
multiple device classes and/or CRUSH roots.

Since you’re using Rook, I wonder if you might be hitting something that I’ve 
seen myself; the above commands will tell the tale.


>  As far as I can tell there is the balancer, manual manipulation of upmaps
> via the command line tools, and OSD reweight.  The last two can be
> optimized with tools to calculate appropriate corrections.  There is
> also the new read/active upmap (at least for non-EC pools), which is
> manually triggered.
> 
> The balancer alone is leaving fairly wide deviations in space use, and
> at times during recovery this can become more significant.  I've seen
> OSDs hit the 80% threshold and start impacting IO when the entire
> cluster is only 50-60% full during recovery.
> 
> I've started using ceph osd reweight-by-utilization and that seems
> much more effective at balancing things, but this seems redundant with
> the balancer which I have turned on.

Reweight-by-utilization was widely used before the balancer module.  I personally 
haven’t had to use it since Luminous.  It adjusts the override reweights, which 
will contend with the balancer module if both are enabled.
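If you keep the balancer module, the usual approach is to clear the override 
reweights that reweight-by-utilization has set and let upmap mode do the work — 
something like this, where <osd-id> is each OSD that isn't at 1.0:

  ceph osd reweight <osd-id> 1.0
  ceph osd set-require-min-compat-client luminous   # upmap needs luminous+ clients
  ceph balancer mode upmap
  ceph balancer on
  ceph balancer status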

There’s an alternative “JJ Balancer” out on the net that some report success 
with, but let’s see what your cluster looks like before we go there.


> 
> What is generally considered the best practice for OSD balancing?
> 
> --
> Rich