Hi Morten, On 19 May 2016 at 17:36, Morten Rasmussen <morten.rasmus...@arm.com> wrote: > On Fri, May 13, 2016 at 10:22:19AM +0200, Vincent Guittot wrote: >> On 12 May 2016 at 23:48, Yuyang Du <yuyang...@intel.com> wrote: >> > On Thu, May 12, 2016 at 03:31:51AM -0700, tip-bot for Morten Rasmussen >> > wrote: >> >> Commit-ID: cfa10334318d8212d007da8c771187643c9cef35 >> >> Gitweb: >> >> http://git.kernel.org/tip/cfa10334318d8212d007da8c771187643c9cef35 >> >> Author: Morten Rasmussen <morten.rasmus...@arm.com> >> >> AuthorDate: Fri, 29 Apr 2016 20:32:40 +0100 >> >> Committer: Ingo Molnar <mi...@kernel.org> >> >> CommitDate: Thu, 12 May 2016 09:55:33 +0200 >> >> >> >> sched/fair: Correct unit of load_above_capacity >> >> >> >> In calculate_imbalance() load_above_capacity currently has the unit >> >> [capacity] while it is used as being [load/capacity]. Not only is it >> >> wrong it also makes it unlikely that load_above_capacity is ever used >> >> as the subsequent code picks the smaller of load_above_capacity and >> >> the avg_load >> >> >> >> This patch ensures that load_above_capacity has the right unit >> >> [load/capacity]. >> >> load_above_capacity has the unit [load] and is then compared with other load > > I'm not sure if talk about the same code, I'm referring to: > > max_pull = min(busiest->avg_load - sds->avg_load, load_above_capacity); > > a bit further down. Here the first option has unit [load/capacity], and > the subsequent code multiplies the result with [capacity] to get back to > [load]:
My understand is that it multiplies and divides by capacity so we select the minimum between max_pull * busiest->group_capacity/SCHED_CAPACITY_SCALE and (sds->avg_load - local->avg_load) * local->group_capacity / SCHED_CAPACITY_SCALE so the unit of imbalance is the same as max_pull because we multiply and divide by [capacity] the imbalance's unit is [load] so max_pull also has a unit [load] and max_pull has the same unit of load_above_capacity > > /* How much load to actually move to equalise the imbalance */ > env->imbalance = min( > max_pull * busiest->group_capacity, > (sds->avg_load - local->avg_load) * local->group_capacity > ) / SCHED_CAPACITY_SCALE; > > That lead me to the conclusion that load_above_capacity would have to be > 'per capacity', i.e. [load/capacity], as well for the code to make > sense. > >> >> >> >> >> Signed-off-by: Morten Rasmussen <morten.rasmus...@arm.com> >> >> [ Changed changelog to note it was in capacity unit; +rebase. ] >> >> Signed-off-by: Peter Zijlstra (Intel) <pet...@infradead.org> >> >> Cc: Dietmar Eggemann <dietmar.eggem...@arm.com> >> >> Cc: Linus Torvalds <torva...@linux-foundation.org> >> >> Cc: Mike Galbraith <efa...@gmx.de> >> >> Cc: Peter Zijlstra <pet...@infradead.org> >> >> Cc: Thomas Gleixner <t...@linutronix.de> >> >> Cc: linux-kernel@vger.kernel.org >> >> Link: >> >> http://lkml.kernel.org/r/1461958364-675-4-git-send-email-dietmar.eggem...@arm.com >> >> Signed-off-by: Ingo Molnar <mi...@kernel.org> >> >> --- >> >> kernel/sched/fair.c | 6 ++++-- >> >> 1 file changed, 4 insertions(+), 2 deletions(-) >> >> >> >> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c >> >> index 2338105..218f8e8 100644 >> >> --- a/kernel/sched/fair.c >> >> +++ b/kernel/sched/fair.c >> >> @@ -7067,9 +7067,11 @@ static inline void calculate_imbalance(struct >> >> lb_env *env, struct sd_lb_stats *s >> >> if (busiest->group_type == group_overloaded && >> >> local->group_type == group_overloaded) { >> >> load_above_capacity = busiest->sum_nr_running * >> >> SCHED_CAPACITY_SCALE; >> >> - if (load_above_capacity > busiest->group_capacity) >> >> + if (load_above_capacity > busiest->group_capacity) { >> >> load_above_capacity -= busiest->group_capacity; >> >> - else >> >> + load_above_capacity *= NICE_0_LOAD; >> >> + load_above_capacity /= busiest->group_capacity; >> >> If I follow correctly the sequence, >> load_above_capacity = (busiest->sum_nr_running * SCHED_CAPACITY_SCALE >> - busiest->group_capacity) * NICE_0_LOAD / busiest->group_capacity >> so >> load_above_capacity = busiest->sum_nr_running * SCHED_CAPACITY_SCALE / >> busiest->group_capacity * NICE_0_LOAD - NICE_0_LOAD >> >> so the unit is [load] > > I'm with you for the equation, but not for the unit and I find it hard > convince myself what the unit really is :-( the sole NICE_0_LOAD in the equation above tends to confirms that it's only [load] > > I think it is down to the fact that we are comparing [load] directly > with [capacity] here, basically saying [load] == [capacity]. Most other > places we scale [load] by [capacity], we don't compare the two or > otherwise imply that they are the same unit. Do we have any other > examples of this? IIUC the goal of this part of calculate_imbalance, we make the assumption that one task with NICE_0_LOAD should fit in one core with SCHED_CAPACITY_SCALE capacity and we want to ensure that we will not make a CPU idle > > The name load_above_capacity suggests that it should have unit [load], > but we actually compute it by subtracting to values, where the latter > clearly has unit [capacity]. PeterZ's recent patch (1be0eb2a97 > sched/fair: Clean up scale confusion) changes the former to be > [capacity]. I have made a comment on this. The original equation was load_above_capacity -= busiest->group_capacity which come from the optimization of load_above_capacity -= busiest->group_capacity * SCHED_LOAD_SCALE / SCHED_CAPACITY_SCALE The assumption of the optimization was the SCHED_LOAD_SCALE == SCHED_CAPACITY_SCALE and SCHED_LOAD_SCALE == NICE_0_LOAD but it's not always true if we enable the extra precision for 64bits system > > load_above_capacity is later multiplied by [capacity] to determine the > imbalance which must have unit [load]. So working our way backwards, it's multiplied by xxx->group_capacity and divided by / SCHED_CAPACITY_SCALE; > load_above_capacity must have unit [load/capacity]. However, if [load] = > [capacity] then what we really have is just group-normalized [load]. > > As said in my other reply, the code only really makes sense for under > special circumstances where [load] == [capacity]. The easy, and possibly > best, way out of this mess would be to replace the code with something > like PeterZ's suggestion that I tried to implement in the patch included > in my other reply. I'm fine with replacing this part by something more simple > >> Lets take the example of a group made of 2 cores with 2 threads per >> core and the capacity of each core is 1,5* SCHED_CAPACITY_SCALE so the >> capacity of the group is 3*SCHED_CAPACITY_SCALE. >> If we have 6 tasks on this group : >> load_above capacity = 1 *NICE_0_LOAD which means we want to remove no >> more than 1 tasks from the group and let 5 tasks in the group whereas >> we don't expect to have more than 4 tasks as we have 4 CPUs and >> probably less because the compute capacity of each CPU is less than >> the default one >> >> So i would have expected >> load_above_capacity = busiest->sum_nr_running * NICE_0_LOAD - >> busiest->group_capacity * NICE_0_LOAD / SCHED_CAPACITY_SCALE >> load_above capacity = 3*NICE_0_LOAD which means we want to remove no >> more than 3 tasks and let 3 tasks in the group > > And this is exactly you get with this patch :-) load_above_capacity > (through max_pull) is multiplied by the group capacity to compute that > actual amount of [load] to remove: I forgot that additional weight of the load by group's capacity/SCHED_CAPACITY_SCALE > > env->imbalance = load_above_capacity * busiest->group_capacity / > SCHED_CAPACITY_SCALE > > = 1*NICE_0_LOAD * 3*SCHED_CAPACITY_SCALE / > SCHED_CAPACITY_SCALE > > = 3*NICE_0_LOAD > > I don't think we disagree on how it should work :-) Without the capacity > scaling in this patch you get: > > env->imbalance = (6*SCHED_CAPACITY_SCALE - 3*SCHED_CAPACITY_SCALE) * > 3*SCHED_CAPACITY_SCALE / SCHED_CAPACITY_SCALE > > = 9*SCHED_CAPACITY_SCALE > > Coming back to Yuyang's question. I think it should be NICE_0_LOAD to > ensure that the resulting imbalance has the proper unit [load]. yes, i agree it's should NICE_0_LOAD > > I'm happy to scrap this patch and replace the whole thing with something > else that makes more sense, but IMHO it is needed if the current mess > should make any sense at all.