[ceph-users] Re: jj's "improved" ceph balancer

2021-10-27 Thread Neha Ojha
Jonas, would you be interested in joining one of our performance meetings and presenting some of your work there? Seems like we can have a good discussion about further improvements to the balancer. Thanks, Neha On Mon, Oct 25, 2021 at 11:39 AM Josh Salomon wrote: > > Hi Jonas, > > I have some

[ceph-users] Re: jj's "improved" ceph balancer

2021-10-25 Thread Jonas Jelten
Hi Josh, yes, there are many factors to optimize... which makes it kinda hard to achieve an optimal solution. I think we have to consider all these things, in ascending priority: * 1: Minimize distance to CRUSH (prefer fewest upmaps, and remove upmap items if balance is better) * 2: Relocation
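A minimal sketch (not taken from the balancer itself) of how several such objectives can be combined: score each candidate set of upmaps as a tuple with the most important objective first and compare lexicographically. All field names and numbers below are invented for illustration.

```python
# Hypothetical sketch: rank candidate upmap sets by several objectives at once.
# Python compares tuples lexicographically, so putting the most important
# objective first gives a strict priority ordering (all field names invented).
from dataclasses import dataclass

@dataclass(frozen=True)
class Candidate:
    fullness_spread: float  # e.g. max - min OSD utilization after the moves
    moved_bytes: int        # relocation cost of applying the upmaps
    upmap_count: int        # distance to plain CRUSH (fewer upmaps preferred)

    def score(self) -> tuple:
        # lower is better for every field
        return (self.fullness_spread, self.moved_bytes, self.upmap_count)

candidates = [
    Candidate(fullness_spread=0.03, moved_bytes=400 << 30, upmap_count=12),
    Candidate(fullness_spread=0.03, moved_bytes=400 << 30, upmap_count=8),
]
best = min(candidates, key=Candidate.score)
print(best)  # picks the candidate with fewer upmaps when the other fields tie
```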

[ceph-users] Re: jj's "improved" ceph balancer

2021-10-25 Thread Jonas Jelten
Hi Erich! Yes, in most cases the mgr-balancer will happily accept jj-balancer movements and neither reverts nor worsens its optimizations. It just generates new upmap items or removes existing ones, just like the mgr-balancer (which has to be in upmap mode of course). So the intended usage is
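For context, the "generates new upmap items or removes existing ones" part maps onto two stock Ceph commands, `ceph osd pg-upmap-items` and `ceph osd rm-pg-upmap-items`. The sketch below only formats those command strings for a proposed move; it is illustrative and not code from the jj-balancer.

```python
# Illustrative only: turn a proposed PG move into the stock Ceph upmap commands.
# "ceph osd pg-upmap-items <pgid> <from> <to> ..." remaps individual PG shards;
# "ceph osd rm-pg-upmap-items <pgid>" drops that exception again.
def upmap_cmd(pgid, moves):
    # moves is a list of (from_osd, to_osd) pairs
    pairs = " ".join(f"{src} {dst}" for src, dst in moves)
    return f"ceph osd pg-upmap-items {pgid} {pairs}"

def rm_upmap_cmd(pgid):
    return f"ceph osd rm-pg-upmap-items {pgid}"

# e.g. move one shard of PG 4.2a from osd.17 to osd.31:
print(upmap_cmd("4.2a", [(17, 31)]))  # -> ceph osd pg-upmap-items 4.2a 17 31
print(rm_upmap_cmd("4.2a"))           # -> ceph osd rm-pg-upmap-items 4.2a
```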

[ceph-users] Re: jj's "improved" ceph balancer

2021-10-25 Thread E Taka
Hi Jonas, I'm impressed, Thanks! I have a question about the usage: do I have to turn off the automatic balancing feature (ceph balancer off)? Do the upmap balancer and your customizations get in each other's way, or can I run your script from time to time? Thanks, Erich On Mon, 25 Oct 2021

[ceph-users] Re: jj's "improved" ceph balancer

2021-10-25 Thread Jonas Jelten
Hi Dan, basically it's this: when you have a server that is so big that CRUSH can't utilize it the same way as the other, smaller servers because of the placement constraints, the balancer doesn't balance data on the smaller servers any more, because it just "sees" the big one as too empty. To
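A toy example (all numbers invented) of that constraint: with replicated size 3 and failure domain = host, a single host can hold at most one replica of each PG, i.e. at most a third of all data, so an oversized host stays comparatively empty while the smaller hosts fill up.

```python
# Invented numbers: one big host next to several small ones, replicated size=3,
# failure domain = host, so a single host can hold at most 1/size of all data.
size = 3
hosts = {"big": 400, "a": 100, "b": 100, "c": 100, "d": 100}  # capacity in TB
total = sum(hosts.values())            # 800 TB raw
cluster_fill = 0.70                    # cluster is 70% full overall
data = cluster_fill * total            # 560 TB of replicas to place

# Upper bound on what the big host can take: one replica per PG -> data/size.
big_max = data / size                  # ~187 TB
big_util = big_max / hosts["big"]      # ~0.47 -> at best 47% full

# Everything the big host cannot take lands on the small hosts.
rest = data - big_max
small_util = rest / (total - hosts["big"])  # ~0.93 -> 93% full
print(f"big host <= {big_util:.0%}, small hosts >= {small_util:.0%}")
```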

[ceph-users] Re: jj's "improved" ceph balancer

2021-10-22 Thread Jonas Jelten
Hi! How would you balance the workload? We could distribute PGs independently of the OSD sizes, assuming that an HDD can handle a low and constant number of IOPS, say 250, no matter how big it is. If we distribute PGs just by predicted device IOPS, we would optimize for workload better. My
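A minimal sketch of that idea, with invented devices and a flat 250-IOPS-per-HDD assumption: PG targets weighted by predicted IOPS come out equal for HDDs of different sizes, unlike capacity-weighted targets.

```python
# Invented example: PG targets weighted by a predicted-IOPS model instead of by
# capacity. A flat 250 IOPS per HDD means a 6 TB and a 16 TB HDD get the same
# target, whereas capacity weighting loads the big drive more heavily.
HDD_IOPS = 250
osds = {"osd.0": 6, "osd.1": 6, "osd.2": 16}   # size in TB
total_pg_replicas = 900                         # sum of pg_num * size over pools

total_size = sum(osds.values())
by_capacity = {o: total_pg_replicas * s / total_size for o, s in osds.items()}

iops = {o: HDD_IOPS for o in osds}              # same budget for every HDD
total_iops = sum(iops.values())
by_iops = {o: total_pg_replicas * iops[o] / total_iops for o in osds}

for o in osds:
    print(f"{o}: capacity-weighted {by_capacity[o]:.0f} PGs, "
          f"iops-weighted {by_iops[o]:.0f} PGs")
```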

[ceph-users] Re: jj's "improved" ceph balancer

2021-10-20 Thread Anthony D'Atri
> On Oct 20, 2021, at 1:49 PM, Josh Salomon wrote: > > but in the extreme case (some capacity on 1TB devices and some on 6TB > devices) the workload can't be balanced. It’s also super easy in such a scenario to a) Have the larger drives not uniformly spread across failure domains, which

[ceph-users] Re: jj's "improved" ceph balancer

2021-10-20 Thread Anthony D'Atri
> Doesn't the existing mgr balancer already balance the PGs for each pool > individually? So in your example, the PGs from the loaded pool will be > balanced across all osds, as will the idle pool's PGs. So the net load is > uniform, right? If there’s a single CRUSH root and all pools share

[ceph-users] Re: jj's "improved" ceph balancer

2021-10-20 Thread Dan van der Ster
Hi Josh, Okay, but do you agree that for any given pool, the load is uniform across its PGs, right? Doesn't the existing mgr balancer already balance the PGs for each pool individually? So in your example, the PGs from the loaded pool will be balanced across all osds, as will the idle pool's

[ceph-users] Re: jj's "improved" ceph balancer

2021-10-20 Thread Dan van der Ster
Hi Josh, That's another interesting dimension... A cluster that has plenty of free capacity could indeed be balanced by workload/iops, but once it reaches maybe 60 or 70% full, I think capacity would need to take priority. But to be honest I don't really understand the workload/iops

[ceph-users] Re: jj's "improved" ceph balancer

2021-10-20 Thread Dan van der Ster
Hi, I don't quite understand your "huge server" scenario, other than that the balancer cannot do magic in some impossible cases. But anyway, I wonder if this sort of higher-order balancing could/should be added as a "part two" to the mgr balancer. The existing code does a

[ceph-users] Re: jj's "improved" ceph balancer

2021-10-20 Thread Jonas Jelten
Hi Dan, I'm not kidding, these were real-world observations, hence my motivation to create this balancer :) First I tried "fixing" the mgr balancer, but after understanding the exact algorithm there I thought of a completely different approach. For us the main reason things got out of balance

[ceph-users] Re: jj's "improved" ceph balancer

2021-10-20 Thread Dan van der Ster
Hi Jonas, From your readme: "the best possible solution is some OSDs having an offset of 1 PG to the ideal count. As a PG-distribution-optimization is done per pool, without checking other pool's distribution at all, some devices will be the +1 more often than others. At worst one OSD is the +1
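To make the quoted "+1" effect concrete, a toy example with invented pool sizes: each pool's ideal per-OSD count is fractional, so some OSDs must take one extra PG per pool, and if the per-pool optimization keeps picking the same OSDs for the +1, the extras add up across pools.

```python
# Toy illustration of the "+1" effect: 10 OSDs, several pools whose PG replica
# counts are not divisible by 10, and a per-pool placement that always hands
# the leftover +1 PGs to the lowest-numbered OSDs (worst case from the readme).
num_osds = 10
pools = {"rbd": 123, "cephfs_data": 257, "rgw": 71}  # PG replicas per pool

per_osd_total = [0] * num_osds
for pg_replicas in pools.values():
    base = pg_replicas // num_osds            # ideal per-OSD count, rounded down
    leftover = pg_replicas - base * num_osds  # this many OSDs must take +1
    for osd in range(num_osds):
        per_osd_total[osd] += base + (1 if osd < leftover else 0)

print(per_osd_total)
# -> [47, 46, 46, 45, 45, 45, 45, 44, 44, 44]: the first OSDs collect the +1
#    from every pool, a spread of 3 PGs, even though each pool on its own is
#    balanced to within +/- 1 PG.
```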