Jonas, would you be interested in joining one of our performance
meetings and presenting some of your work there? Seems like we can
have a good discussion about further improvements to the balancer.
Thanks,
Neha
On Mon, Oct 25, 2021 at 11:39 AM Josh Salomon wrote:
>
> Hi Jonas,
>
> I have some
Hi Josh,
yes, there are many factors to optimize... which makes it quite hard to achieve
an optimal solution.
I think we have to consider all these things, in ascending priority:
* 1: Minimize distance to CRUSH (prefer fewest upmaps, and remove upmap items
if balance is better)
* 2: Relocation
Hi Erich!
Yes, in most cases the mgr-balancer will happily accept jj-balancer movements
and neither reverts nor worsens its optimizations.
It just generates new upmap items or removes existing ones, just like the
mgr-balancer (which has to be in upmap mode of course).
So the intended usage is
Hi Jonas,
I'm impressed, thanks!
I have a question about the usage: do I have to turn off the automatic
balancing feature (ceph balancer off)? Do the upmap balancer and your
customizations get in each other's way, or can I run your script from time
to time?
Thanks
Erich
On Mon, Oct 25, 2021
Hi Dan,
basically it's this: when you have a server so big that CRUSH can't utilize
it the same way as the other, smaller servers because of the placement
constraints, the balancer no longer balances data on the smaller servers,
because it just "sees" the big one as too empty.
To
Hi!
How would you balance the workload? We could distribute PGs independently of the OSD sizes, assuming that an HDD can
handle a low and constant number of IOPS, say 250, no matter how big it is. If we distributed PGs just by predicted
device IOPS, we would optimize better for workload.
My
> On Oct 20, 2021, at 1:49 PM, Josh Salomon wrote:
>
> but in the extreme case (some capacity on 1TB devices and some on 6TB
> devices) the workload can't be balanced. I
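The IOPS-weighted idea above can be sketched in a few lines. This is a toy illustration only; the device names, PG count, and IOPS figures are invented, and the 250-IOPS-per-HDD assumption comes straight from the message above:

```python
# Sketch: distribute PG "shares" by predicted device IOPS instead of capacity.
# Assumption from the thread: an HDD delivers ~250 IOPS no matter its size.

def pg_targets(devices, total_pgs):
    """Return the ideal PG count per device, weighted by predicted IOPS."""
    total_iops = sum(d["iops"] for d in devices)
    return {d["name"]: total_pgs * d["iops"] / total_iops for d in devices}

# Hypothetical cluster: a 1 TB HDD, a 6 TB HDD, and an SSD rated 5000 IOPS.
devices = [
    {"name": "osd.0", "size_tb": 1, "iops": 250},   # small HDD
    {"name": "osd.1", "size_tb": 6, "iops": 250},   # big HDD: same IOPS budget
    {"name": "osd.2", "size_tb": 1, "iops": 5000},  # SSD
]
targets = pg_targets(devices, total_pgs=1100)
# Both HDDs get the same PG target despite the 6x size difference.
```

This also makes the "extreme case" in the quote concrete: balancing by IOPS leaves the 6 TB device mostly empty, so workload balance and capacity balance pull in opposite directions.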
It’s also super easy in such a scenario to
a) Have the larger drives not uniformly spread across failure domains, which
> Doesn't the existing mgr balancer already balance the PGs for each pool
> individually? So in your example, the PGs from the loaded pool will be
> balanced across all osds, as will the idle pool's PGs. So the net load is
> uniform, right?
If there’s a single CRUSH root and all pools share
Hi Josh,
Okay, but do you agree that for any given pool, the load is uniform across
its PGs?
Doesn't the existing mgr balancer already balance the PGs for each pool
individually? So in your example, the PGs from the loaded pool will be
balanced across all osds, as will the idle pool's
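The per-pool argument in this message can be illustrated with a toy calculation; the pool sizes and OSD count here are made up:

```python
# Toy illustration: if each pool's PGs are spread evenly across all OSDs,
# the per-OSD totals come out uniform as well.

NUM_OSDS = 4
pools = {"loaded_pool": 8, "idle_pool": 4}  # PGs per pool (invented numbers)

per_osd = [0] * NUM_OSDS
for pg_count in pools.values():
    for pg in range(pg_count):          # round-robin = perfect per-pool balance
        per_osd[pg % NUM_OSDS] += 1

# Every OSD ends up with 2 PGs from the loaded pool and 1 from the idle pool,
# so the net load is uniform -- as long as the PG counts divide evenly.
```

The caveat is the case where the counts don't divide evenly, which is exactly the "+1" skew discussed elsewhere in the thread.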
Hi Josh,
That's another interesting dimension...
Indeed, a cluster that has plenty of free capacity could be balanced
by workload/IOPS, but once it reaches maybe 60 or 70% full, then I think
capacity would need to take priority.
But to be honest I don't really understand the workload/iops
Hi,
I don't quite understand your "huge server" scenario, other than a basic
understanding that the balancer cannot do magic in some impossible cases.
But anyway, I wonder if this sort of higher order balancing could/should be
added as a "part two" to the mgr balancer. The existing code does a
Hi Dan,
I'm not kidding, these were real-world observations, hence my motivation to
create this balancer :)
First I tried "fixing" the mgr balancer, but after understanding the exact
algorithm there I thought of a completely different approach.
For us the main reason things got out of balance
Hi Jonas,
From your readme:
"the best possible solution is some OSDs having an offset of 1 PG to the
ideal count. As a PG-distribution-optimization is done per pool, without
checking other pool's distribution at all, some devices will be the +1 more
often than others. At worst one OSD is the +1
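The accumulation the readme describes can be shown with a toy example; the pool count, PG counts, and OSD count below are invented, and the worst case (the remainder always landing on the same OSD) is forced deliberately:

```python
# Toy example of the "+1" skew: each pool is balanced on its own, so the
# PG remainder (+1) can land on the same OSD for every pool.

NUM_OSDS = 3
pool_pg_counts = [10, 10, 10]  # 10 PGs over 3 OSDs -> ideal 3, remainder 1

per_osd = [0] * NUM_OSDS
for pgs in pool_pg_counts:
    base, extra = divmod(pgs, NUM_OSDS)
    for osd in range(NUM_OSDS):
        per_osd[osd] += base
    per_osd[0] += extra  # worst case: osd.0 is always the "+1"

# per_osd is now [12, 9, 9]: osd.0 carries 3 extra PGs overall, even though
# every single pool looks optimally balanced (offset of at most 1 PG).
```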