[ceph-users] Re: Balancer Distribution Help

2022-09-22 Thread Bailey Allison
Hi Reed,

Just taking a quick glance at the Pastebin provided, I have to say your cluster 
balance is already pretty damn good, all things considered. 

We've seen the upmap balancer at its best in practice provide a deviation of 
about 10-20% across OSDs, which seems to match up with your cluster. It's 
something it can do a better and better job of as you add more nodes and OSDs 
of equal size and as the PG count on the cluster increases, but in practice 
about a 10% difference between OSDs is very normal.

Something to note about the video you linked is that they were using a cluster 
with 28PB of storage available, so who knows how many OSDs/nodes/PGs per 
pool/etc. their cluster has the luxury and ability to balance across.

The only thing I can think to suggest is increasing the PG count, as you've 
already mentioned. The ideal setting is about 100 PGs per OSD, and looking at 
your cluster both the SSDs and the smaller HDDs have only about 50 PGs per OSD.

If you're able to get both of those device classes closer to a 100 PG per OSD 
ratio it should help a lot more with the balancing. More PGs means more places 
to distribute data. 

It will be tricky in that, I'm just noticing, for the HDDs you have some 
hosts/chassis with 24 OSDs each and others with only 6, so getting the PG 
distribution even across those will be challenging, but for the SSDs it should 
be quite simple to get to 100 PGs per OSD.
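
If it helps, the quickest way I know to eyeball the PGs-per-OSD number and to 
nudge a pool upward is something like the following (pool name and target value 
are placeholders, not a recommendation for your cluster):

$ ceph osd df tree                     # the PGS column is the per-OSD placement group count
$ ceph osd pool set <pool> pg_num 512  # placeholder; pick a power of two that lands near ~100 PGs/OSD

If I remember right, on Octopus pg_num changes are applied gradually and 
pgp_num follows along automatically, so the resulting data movement is spread 
out over time.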
 
Taking a further look, while I will say the actual data stored is balanced 
well across the entire cluster, there are a couple of OSDs where the 
OMAP/metadata is not balanced as well as the others.

Where you are using EC pools for CephFS, OMAP data cannot be stored within EC, 
so all of it will be stored in a replicated CephFS data pool, most likely your 
hdd_cephfs pool. 

Just something to keep in mind: it's important to make sure not only that the 
data is balanced, but that the OMAP data and metadata are balanced as well.
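
For what it's worth, the per-OSD OMAP/metadata usage shows up right alongside 
the data usage, so spotting the outliers is a one-liner:

$ ceph osd df tree   # compare the OMAP and META columns across OSDs, not just DATA/%USE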

Otherwise, I would recommend just trying to get your cluster to a point where 
each OSD has roughly 100 PGs, or at least as close to that as you are able to 
get given your cluster's CRUSH rulesets. 

This should then help the balancer spread the data across the cluster, but 
again, unless I overlooked something, your cluster already appears to be 
extremely well balanced.

There is a PG calculator you can use online at: 

https://old.ceph.com/pgcalc/

There is also a PG calc on the Red Hat website, but it requires a subscription. 

Both calculators are essentially the same, but I have noticed the free one will 
round the PG count down while the Red Hat one will round it up.
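
For reference, both calculators are doing roughly the same arithmetic; the 
numbers below are made up purely for illustration:

$ python3 -c "print(100 * 60 * 0.40 / 3)"   # target PGs/OSD x OSD count x pool's share of the data / replica size (or k+m for EC)
800.0
# -> round to a power of two: 1024 if rounding up, 512 if rounding down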

Regards,

Bailey

-Original Message-
From: Reed Dier  
Sent: September 22, 2022 4:48 PM
To: ceph-users 
Subject: [ceph-users] Balancer Distribution Help

Hoping someone can point me to possible tunables that could better tighten my 
OSD distribution.

Cluster is currently
> "ceph version 15.2.16 (d46a73d6d0a67a79558054a3a5a72cb561724974) 
> octopus (stable)": 307
With plans to begin moving to pacific before end of year, with a possible 
interim stop at octopus.17 on the way.

Cluster was born on jewel, and is fully bluestore/straw2.
The upmap balancer works/is working, but not to the degree that I believe it 
could/should work, which seems like it should be much closer to near perfect 
than what I’m seeing.

https://imgur.com/a/lhtZswo <- Histograms of my OSD distribution

https://pastebin.com/raw/dk3fd4GH <- pastebin of cluster/pool/crush relevant bits

To put it succinctly, I’m hoping to get much tighter OSD distribution, but I’m 
not sure what knobs to try turning next, as the upmap balancer has gone as far 
as it can, and I end up playing “reweight the most full OSD” whack-a-mole as 
OSDs get nearfull.
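
(For the record, the manual intervention I mean is just along these lines, with 
123 standing in for whichever OSD is currently the most full:)

$ ceph osd reweight 123 0.95   # temporary override reweight; walked back toward 1.0 once backfill relieves the OSD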

My goal is obviously something akin to the near-perfect distribution shown 
here: https://www.youtube.com/watch?v=niFNZN5EKvE&t=1353s


I am looking to tweak the PG counts for a few pools.
Namely, ssd-radosobj has shrunk in size and needs far fewer PGs now.
Similarly, hdd-cephfs has shrunk in size as well and needs fewer PGs (as ceph 
health shows).
And on the flip side, the ec*-cephfs pools likely need more PGs as they have 
grown in size.
However, I was hoping to get more breathing room of free space on my most full 
OSDs before starting any big PG expands/shrinks.
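
(When I do get around to it, I'm assuming the per-pool changes will look 
roughly like this, with the actual targets still to be worked out:)

$ ceph osd pool get ssd-radosobj pg_num
$ ceph osd pool set ssd-radosobj pg_num 128        # placeholder target; shrinking since the pool shrank
$ ceph osd pool set <ec-cephfs-pool> pg_num 1024   # placeholder name and target; growing since those pools grew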

I am assuming that my wacky mix of replicated and multiple EC storage pools, 
coupled with hybrid SSD+HDD pools, is throwing off the balance more than a more 
homogeneous CRUSH ruleset would, but this is what exists and is what I’m 
working with.
Also, since it will look odd in the tree view, the CRUSH rulesets for hdd pools 
are chooseleaf chassis, while ssd pools are chooseleaf host.

Any tips or help would be greatly appreciated.

[ceph-users] Re: Balancer Distribution Help

2022-09-22 Thread Stefan Kooman

On 9/22/22 21:48, Reed Dier wrote:



Any tips or help would be greatly appreciated.


Try JJ's Ceph balancer [1]. In our case it turned out to be *way* more 
efficient than the built-in balancer (faster convergence, fewer movements 
involved), and it was able to achieve a very good PG distribution and "reclaim" 
lots of space. Highly recommended.


Gr. Stefan

[1]: https://github.com/TheJJ/ceph-balancer
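
The basic workflow, roughly as I remember it (check the repo's README for the 
current options), is to generate a batch of upmap commands and then apply them:

$ ./placementoptimizer.py balance --max-pg-moves 10 | tee moves
$ bash moves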


[ceph-users] Re: Balancer Distribution Help

2022-09-22 Thread Eugen Block

+1 for increasing PG numbers, those are quite low.


[ceph-users] Re: Balancer Distribution Help

2022-09-23 Thread Wyll Ingersoll


When doing manual remapping/rebalancing with tools like pgremapper and 
placementoptimizer, what are the recommended settings for norebalance, 
norecover, and nobackfill?
Should the balancer module be disabled if we are manually issuing the pg remap 
commands generated by those scripts, so it doesn't interfere?

Something like this:

$ ceph osd set norebalance
$ ceph osd set norecover
$ ceph osd set nobackfill
$ ceph balancer off

$ pgremapper cancel-backfill --yes   # to stop all pending operations
$ placementoptimizer.py balance --max-pg-moves 100 | tee upmap-moves
$ bash upmap-moves

Repeat the above 3 steps until balance is achieved, then re-enable the balancer 
and unset the "no" flags set earlier?


[ceph-users] Re: Balancer Distribution Help

2022-09-23 Thread Stefan Kooman

On 9/23/22 17:05, Wyll Ingersoll wrote:


When doing manual remapping/rebalancing with tools like pgremapper and 
placementoptimizer, what are the recommended settings for norebalance, 
norecover, nobackfill?
Should the balancer module be disabled if we are manually issuing the pg remap 
commands generated by those scripts so it doesn't interfere?

Something like this:

$ ceph osd set norebalance
$ ceph osd set norecover
$ ceph osd set nobackfill
$ ceph balancer off

$ pgremapper cancel-backfill --yes   # to stop all pending operations
$ placementoptimizer.py balance --max-pg-moves 100 | tee upmap-moves
$ bash upmap-moves


Disabling the balancer module is a good idea. After you are finished with 
placementoptimizer, it won't be able to do any work anyway, so you can 
safely turn it back on :-).


Setting the flags you suggested makes sense for the pgremapper phase. 
But as soon as everything is mapped back, you need to unset them, because 
you need to be able to move data around when optimizing. Otherwise it 
either won't work (there might be a check on cluster state, I'm not sure), 
or it will only start moving data once you unset those flags.
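
In other words, once the remapping phase is done, something like:

$ ceph osd unset nobackfill
$ ceph osd unset norecover
$ ceph osd unset norebalance
$ ceph balancer on   # optional; with the upmaps in place it should have little left to do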


Gr. Stefan


[ceph-users] Re: Balancer Distribution Help

2022-09-23 Thread Josh Baergen
Hey Wyll,

> $ pgremapper cancel-backfill --yes   # to stop all pending operations
> $ placementoptimizer.py balance --max-pg-moves 100 | tee upmap-moves
> $ bash upmap-moves
>
> Repeat the above 3 steps until balance is achieved, then re-enable the 
> balancer and unset the "no" flags set earlier?

You don't want to run cancel-backfill after placementoptimizer,
otherwise it will undo the balancing backfill.
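
In other words (my reading of the intended order), cancel-backfill runs once up
front, and only the balance/apply steps get repeated:

$ pgremapper cancel-backfill --yes                                     # once, to freeze pending backfill
$ placementoptimizer.py balance --max-pg-moves 100 | tee upmap-moves   # repeat this pair
$ bash upmap-moves                                                     # until balanced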

Josh


[ceph-users] Re: Balancer Distribution Help

2022-09-23 Thread Wyll Ingersoll
Understood, that was a typo on my part.

Definitely don't run cancel-backfill after generating the moves from 
placementoptimizer.

