Re: [PATCH v2 4/8] sched/fair: rework load_balance

2019-08-30 Thread Valentin Schneider
On 29/08/2019 15:26, Vincent Guittot wrote:
[...]
>> Seeing how much stuff we already do in just computing the stats, do we
>> really save that much by doing this? I'd expect it to be negligible with
>> modern architectures and all of the OoO/voodoo, but maybe I need a
>> refresher course.
> 
> We are not only running on top-end/latest architectures
> 

I know, and I'm not going to argue for a mere division either. I think I
made my point.

[...]> + if (busiest->group_type == group_misfit_task) {
> + /* Set imbalance to allow misfit task to be balanced. */
> + env->balance_type = migrate_misfit;
> + env->imbalance = busiest->group_misfit_task_load;

 AFAICT we don't ever use this value, other than setting it to 0 in
 detach_tasks(), so what we actually set it to doesn't matter (as long as
 it's > 0).
>>>
>>> not yet.
>>> it's only in patch 8/8 that we check if the task fits the CPU's
>>> capacity during detach_tasks()
>>>
>>
>> But that doesn't use env->imbalance, right? With that v3 patch it's just
>> the task's util, so AFAICT my comment still stands.
> 
> no, the misfit case keeps using load and imbalance like the current
> implementation in this patch.
> The modifications to the way misfit tasks are handled are all in patch 8
> 
Right, my reply was a bit too terse. What I meant is that with patch 8 the
value of env->imbalance is irrelevant when dealing with misfit tasks - we
only check the task's utilization in detach_tasks(), we don't do any
comparison of the task's signals with env->imbalance.

Setting the imbalance to the load value and zeroing it in detach_tasks(),
or setting it to 1 and decrementing it there, gives the same result. That's
why I was saying it conceptually fits with the migrate_task logic, since we
can set the imbalance to 1 (we only want to migrate one task).
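To make the equivalence concrete, here is a minimal sketch of the folded
variant being argued for, reusing the names from the patch under discussion
(an illustration of the argument only, not what the posted series does):

	/* In calculate_imbalance() -- hypothetical folded variant: */
	if (busiest->group_type == group_misfit_task) {
		/*
		 * We only ever want to move one (misfit) task, so express
		 * the imbalance as "1 task" and let the generic
		 * migrate_task accounting in detach_tasks() decrement it.
		 */
		env->balance_type = migrate_task;
		env->imbalance = 1;
	}

The imbalance accounting then behaves exactly like the misfit case above:
detach_tasks() stops after the first detached task either way. Picking
*which* task to detach would still need the misfit-specific checks, which
is the special casing discussed elsewhere in the thread.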



Re: [PATCH v2 4/8] sched/fair: rework load_balance

2019-08-29 Thread Vincent Guittot
On Wed, 28 Aug 2019 at 16:19, Valentin Schneider
 wrote:
>
> On 26/08/2019 11:11, Vincent Guittot wrote:
> >>> + case group_fully_busy:
> >>> + /*
> >>> +  * Select the fully busy group with highest avg_load.
> >>> +  * In theory, there is no need to pull task from such
> >>> +  * kind of group because tasks have all compute
> >>> +  * capacity that they need but we can still improve the
> >>> +  * overall throughput by reducing contention when
> >>> +  * accessing shared HW resources.
> >>> +  * XXX for now avg_load is not computed and always 0 so
> >>> +  * we select the 1st one.
> >>> +  */
> >>
> >> What's wrong with unconditionally computing avg_load in 
> >> update_sg_lb_stats()?
> >
> > removing a useless division, which can be expensive
> >
>
> Seeing how much stuff we already do in just computing the stats, do we
> really save that much by doing this? I'd expect it to be negligible with
> modern architectures and all of the OoO/voodoo, but maybe I need a
> refresher course.

We are not only running on top-end/latest architectures

>
> >> We already unconditionally accumulate group_load anyway.
> >
> > the accumulation must be done while looping over the group, whereas
> > computing avg_load can be done only when needed
> >
> >>
> >> If it's to make way for patch 6/8 (using load instead of runnable load),
> >> then I think you are doing things in the wrong order. IMO in this patch we
> >> should unconditionally compute avg_load (using runnable load), and then
> >> you could tweak it up in a subsequent patch.
> >
> > In fact, it's not linked to patch 6/8.
> > It's only that I initially wanted to use load only when overloaded,
> > but then I got this case and thought that comparing avg_load could be
> > interesting, though without any proof that it's worth it.
> > As mentioned in the comment, tasks in this group have enough capacity
> > and there is no need to move tasks in theory. This is there mainly to
> > trigger the discussion and keep in mind a possible area of improvement
> > for a next step.
> > I haven't run tests or done more study on this particular case to make
> > sure that there would be some benefit in comparing avg_load.
> >
> > So in the future, we might end up always computing avg_load and
> > comparing it for selecting the busiest fully busy group
> >
>
> Okay, that definitely wants testing then.
>
> [...]
> >>> + if (busiest->group_type == group_misfit_task) {
> >>> + /* Set imbalance to allow misfit task to be balanced. */
> >>> + env->balance_type = migrate_misfit;
> >>> + env->imbalance = busiest->group_misfit_task_load;
> >>
> >> AFAICT we don't ever use this value, other than setting it to 0 in
> >> detach_tasks(), so what we actually set it to doesn't matter (as long as
> >> it's > 0).
> >
> > not yet.
> > it's only in patch 8/8 that we check if the task fits the CPU's
> > capacity during detach_tasks()
> >
>
> But that doesn't use env->imbalance, right? With that v3 patch it's just
> the task's util, so AFAICT my comment still stands.

no, the misfit case keeps using load and imbalance like the current
implementation in this patch.
The modifications to the way misfit tasks are handled are all in patch 8

>
> >>
> >> I'd re-suggest folding migrate_misfit into migrate_task, which is doable if
> >> we re-instore lb_env.src_grp_type (or rather, not delete it in this patch),
> >> though it makes some other places somewhat uglier. The good thing is that
> >> it actually does end up being implemented as a special kind of task
> >> migration, rather than being loosely related.
> >
> > I prefer to keep it separate instead of adding a sub-case to migrate_task
> >
>
> My argument here is that ideally they shouldn't be separated, since the misfit
> migration is a subcase of task migration (or an extension of it - in any
> case, they're related). I haven't found a nicer way to express it though,
> and I agree that the special casing in detach_tasks()/fbq()/etc is meh.
>
> [...]
> >>> @@ -8765,7 +8942,7 @@ static int load_balance(int this_cpu, struct rq 
> >>> *this_rq,
> >>>   env.src_rq = busiest;
> >>>
> >>>   ld_moved = 0;
> >>> - if (busiest->cfs.h_nr_running > 1) {
> >>> + if (busiest->nr_running > 1) {
> >>
> >> Shouldn't that stay h_nr_running ? We can't do much if those aren't CFS
> >> tasks.
> >
> > There is the case raised by Srikar where we have, for example, 1 RT task
> > and 1 CFS task: cfs.h_nr_running == 1, but we don't need active balance
> > because the CFS task is not the running one
> >
> >>
> >>>   /*
> >>>* Attempt to move tasks. If find_busiest_group has found
> >>>* an imbalance but busiest->nr_running <= 1, the group is
> >>>


Re: [PATCH v2 4/8] sched/fair: rework load_balance

2019-08-28 Thread Valentin Schneider
On 26/08/2019 11:11, Vincent Guittot wrote:
>>> + case group_fully_busy:
>>> + /*
>>> +  * Select the fully busy group with highest avg_load.
>>> +  * In theory, there is no need to pull task from such
>>> +  * kind of group because tasks have all compute
>>> +  * capacity that they need but we can still improve the
>>> +  * overall throughput by reducing contention when
>>> +  * accessing shared HW resources.
>>> +  * XXX for now avg_load is not computed and always 0 so
>>> +  * we select the 1st one.
>>> +  */
>>
>> What's wrong with unconditionally computing avg_load in update_sg_lb_stats()?
> 
> removing a useless division, which can be expensive
> 

Seeing how much stuff we already do in just computing the stats, do we
really save that much by doing this? I'd expect it to be negligible with
modern architectures and all of the OoO/voodoo, but maybe I need a 
refresher course.

>> We already unconditionally accumulate group_load anyway.
> 
> the accumulation must be done while looping over the group, whereas
> computing avg_load can be done only when needed
> 
>>
>> If it's to make way for patch 6/8 (using load instead of runnable load),
>> then I think you are doing things in the wrong order. IMO in this patch we
>> should unconditionally compute avg_load (using runnable load), and then
>> you could tweak it up in a subsequent patch.
> 
> In fact, it's not linked to patch 6/8.
> It's only that I initially wanted to use load only when overloaded,
> but then I got this case and thought that comparing avg_load could be
> interesting, though without any proof that it's worth it.
> As mentioned in the comment, tasks in this group have enough capacity
> and there is no need to move tasks in theory. This is there mainly to
> trigger the discussion and keep in mind a possible area of improvement
> for a next step.
> I haven't run tests or done more study on this particular case to make
> sure that there would be some benefit in comparing avg_load.
> 
> So in the future, we might end up always computing avg_load and
> comparing it for selecting the busiest fully busy group
> 

Okay, that definitely wants testing then.
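As a concrete illustration of the "compute it only when needed" option, a
sketch of what update_sd_pick_busiest() could do, assuming group_load and
group_capacity are already accumulated by update_sg_lb_stats() (the posted
patch simply leaves avg_load at 0 here, so this is only a sketch of a
possible follow-up, not the patch itself):

	case group_fully_busy:
		/*
		 * Pay for the division only when two fully busy groups
		 * actually have to be compared, rather than on every
		 * update_sg_lb_stats() call.
		 */
		sgs->avg_load = (sgs->group_load * SCHED_CAPACITY_SCALE) /
						sgs->group_capacity;
		busiest->avg_load = (busiest->group_load * SCHED_CAPACITY_SCALE) /
						busiest->group_capacity;

		if (sgs->avg_load <= busiest->avg_load)
			return false;
		break;

Whether the extra divisions are worth it is exactly what the "wants
testing" remark above refers to.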

[...]
>>> + if (busiest->group_type == group_misfit_task) {
>>> + /* Set imbalance to allow misfit task to be balanced. */
>>> + env->balance_type = migrate_misfit;
>>> + env->imbalance = busiest->group_misfit_task_load;
>>
>> AFAICT we don't ever use this value, other than setting it to 0 in
>> detach_tasks(), so what we actually set it to doesn't matter (as long as
>> it's > 0).
> 
> not yet.
> it's only in patch 8/8 that we check if the task fits the CPU's
> capacity during detach_tasks()
> 

But that doesn't use env->imbalance, right? With that v3 patch it's just
the task's util, so AFAICT my comment still stands.

>>
>> I'd re-suggest folding migrate_misfit into migrate_task, which is doable if
>> we re-instore lb_env.src_grp_type (or rather, not delete it in this patch),
>> though it makes some other places somewhat uglier. The good thing is that
>> it actually does end up being implemented as a special kind of task
>> migration, rather than being loosely related.
> 
> I prefer to keep it separate instead of adding a sub-case to migrate_task
> 

My argument here is that ideally they shouldn't be separated, since the misfit
migration is a subcase of task migration (or an extension of it - in any
case, they're related). I haven't found a nicer way to express it though,
and I agree that the special casing in detach_tasks()/fbq()/etc is meh.

[...]
>>> @@ -8765,7 +8942,7 @@ static int load_balance(int this_cpu, struct rq 
>>> *this_rq,
>>>   env.src_rq = busiest;
>>>
>>>   ld_moved = 0;
>>> - if (busiest->cfs.h_nr_running > 1) {
>>> + if (busiest->nr_running > 1) {
>>
>> Shouldn't that stay h_nr_running ? We can't do much if those aren't CFS
>> tasks.
> 
> There is the case raised by Srikar where we have, for example, 1 RT task
> and 1 CFS task: cfs.h_nr_running == 1, but we don't need active balance
> because the CFS task is not the running one
> 
>>
>>>   /*
>>>* Attempt to move tasks. If find_busiest_group has found
>>>* an imbalance but busiest->nr_running <= 1, the group is
>>>


Re: [PATCH v2 4/8] sched/fair: rework load_balance

2019-08-28 Thread Valentin Schneider
On 26/08/2019 10:26, Vincent Guittot wrote:
[...]
>>>   busiest group.
>>> - calculate_imbalance() decides what have to be moved.
>>
>> That's nothing new, isn't it? I think what you mean there is that the
> 
> There are 2 things:
> - part of the algorithm is new and fixes wrong task placement
> - everything has been consolidated in the 3 functions above, whereas
> there were some bypasses and hacks in the current code
> 

Right, something like that could be added in the changelog then.

[...]
>>> @@ -7745,10 +7793,10 @@ struct sg_lb_stats {
>>>  struct sd_lb_stats {
>>>   struct sched_group *busiest;/* Busiest group in this sd */
>>>   struct sched_group *local;  /* Local group in this sd */
>>> - unsigned long total_running;
>>
>> Could be worth calling out in the log that this gets snipped out. Or it
>> could go into its own small cleanup patch, since it's just an unused field.
> 
> I can mention it more specifically in the log, but that's part of those
> meaningless metrics which are no longer used

I'm a git blame addict so I like having things split up as much as possible
(within reason). Since that cleanup can live in its own patch, it should
be split as such IMO.


Re: [PATCH v2 4/8] sched/fair: rework load_balance

2019-08-26 Thread Vincent Guittot
On Tue, 6 Aug 2019 at 19:17, Valentin Schneider
 wrote:
>
> Second batch, get it while it's hot...
>
> On 01/08/2019 15:40, Vincent Guittot wrote:
> [...]
> > @@ -7438,19 +7453,53 @@ static int detach_tasks(struct lb_env *env)
> >   if (!can_migrate_task(p, env))
> >   goto next;
> >
> > - load = task_h_load(p);
> > + switch (env->balance_type) {
> > + case migrate_task:
> > + /* Migrate task */
> > + env->imbalance--;
> > + break;
> >
> > - if (sched_feat(LB_MIN) && load < 16 && 
> > !env->sd->nr_balance_failed)
> > - goto next;
> > + case migrate_util:
> > + util = task_util_est(p);
> >
> > - if ((load / 2) > env->imbalance)
> > - goto next;
> > + if (util > env->imbalance)
> > + goto next;
> > +
> > + env->imbalance -= util;
> > + break;
> > +
> > + case migrate_load:
> > + load = task_h_load(p);
> > +
> > + if (sched_feat(LB_MIN) &&
> > + load < 16 && !env->sd->nr_balance_failed)
> > + goto next;
> > +
> > + if ((load / 2) > env->imbalance)
> > + goto next;
> > +
> > + env->imbalance -= load;
> > + break;
> > +
> > + case migrate_misfit:
> > + load = task_h_load(p);
> > +
> > + /*
> > +  * utilization of misfit task might decrease a bit
> > +  * since it has been recorded. Be conservative in the
> > +  * condition.
> > +  */
> > + if (load < env->imbalance)
> > + goto next;
> > +
> > + env->imbalance = 0;
> > + break;
> > + }
> >
> >   detach_task(p, env);
> >   list_add(&p->se.group_node, &env->tasks);
> >
> >   detached++;
> > - env->imbalance -= load;
> >
>
> It's missing something like this:

yes

>
> -8<-
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index b0631ff2a4bd..055563e19090 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -7506,7 +7506,7 @@ static int detach_tasks(struct lb_env *env)
>
> /*
>  * We only want to steal up to the prescribed amount of
> -* runnable load.
> +* load/util/tasks
>  */
> if (env->imbalance <= 0)
> break;
> ->8-
>
> [...]
> > @@ -8013,19 +8059,26 @@ group_smaller_max_cpu_capacity(struct sched_group 
> > *sg, struct sched_group *ref)
> >  }
> >
> >  static inline enum
> > -group_type group_classify(struct sched_group *group,
> > +group_type group_classify(struct lb_env *env,
> > +   struct sched_group *group,
> > struct sg_lb_stats *sgs)
> >  {
> > - if (sgs->group_no_capacity)
> > + if (group_is_overloaded(env, sgs))
> >   return group_overloaded;
> >
> >   if (sg_imbalanced(group))
> >   return group_imbalanced;
> >
> > + if (sgs->group_asym_capacity)
> > + return group_asym_capacity;
> > +
> >   if (sgs->group_misfit_task_load)
> >   return group_misfit_task;
> >
> > - return group_other;
> > + if (!group_has_capacity(env, sgs))
> > + return group_fully_busy;
>
> If I got my booleans right, reaching group_fully_busy means
> !group_is_overloaded() && !group_has_capacity(), which gives us:
>
> - If nr_running > group_weight, then we had to have
>   sgs->group_capacity * 100 == sgs->group_util * env->sd->imbalance_pct
>
> - If nr_running == group_weight, then we had to have
>   sgs->group_capacity * 100 <= sgs->group_util * env->sd->imbalance_pct
>
> - (if nr_running < group_weight we go to group_has_spare)
>
> That first equality seems somewhat odd considering how rarely it will
> occur, but then we still want the util check to fall down to
> group_has_spare when
>
>   nr_running >= group_weight &&
>   sgs->group_capacity * 100 > sgs->group_util * env->sd->imbalance_pct
>
> Maybe we could change group_has_capacity()'s util comparison from '>' to
> '>=' to avoid this weird "artefact".
>
> > +
> > + return group_has_spare;
> >  }
> >
> >  static bool update_nohz_stats(struct rq *rq, bool force)
>
> [...]
> > @@ -8147,11 +8223,67 @@ static bool update_sd_pick_busiest(struct lb_env 
> > *env,
> >   if (sgs->group_type < busiest->group_type)
> >   return false;
> >
> > - if (sgs->avg_load <= busiest->avg_load)
> > + /*
> > +  * The candidate and the current busiest group are the 

Re: [PATCH v2 4/8] sched/fair: rework load_balance

2019-08-26 Thread Vincent Guittot
On Tue, 6 Aug 2019 at 17:56, Peter Zijlstra  wrote:
>
> On Thu, Aug 01, 2019 at 04:40:20PM +0200, Vincent Guittot wrote:
> > The load_balance algorithm contains some heuristics which have becomes
> > meaningless since the rework of metrics and the introduction of PELT.
> >
> > Furthermore, it's sometimes difficult to fix wrong scheduling decisions
> > because everything is based on load whereas some imbalances are not
> > related to the load. The current algorithm ends up to create virtual and
> > meaningless value like the avg_load_per_task or tweaks the state of a
> > group to make it overloaded whereas it's not, in order to try to migrate
> > tasks.
> >
> > load_balance should better qualify the imbalance of the group and define
> > cleary what has to be moved to fix this imbalance.
> >
> > The type of sched_group has been extended to better reflect the type of
> > imbalance. We now have :
> >   group_has_spare
> >   group_fully_busy
> >   group_misfit_task
> >   group_asym_capacity
> >   group_imbalanced
> >   group_overloaded
> >
> > Based on the type of sched_group, load_balance now sets what it wants to
> > move in order to fix the imnbalance. It can be some load as before but
> > also some utilization, a number of task or a type of task:
> >   migrate_task
> >   migrate_util
> >   migrate_load
> >   migrate_misfit
> >
> > This new load_balance algorithm fixes several pending wrong tasks
> > placement:
> > - the 1 task per CPU case with asymetrics system
> > - the case of cfs task preempted by other class
> > - the case of tasks not evenly spread on groups with spare capacity
> >
> > The load balance decisions have been gathered in 3 functions:
> > - update_sd_pick_busiest() select the busiest sched_group.
> > - find_busiest_group() checks if there is an imabalance between local and
> >   busiest group.
> > - calculate_imbalance() decides what have to be moved.
>
> I like this lots, very good stuff.
>
> It is weird how misfit is still load, but istr later patches cure that.
>
> Here's a whole pile of nits, some overlap with what Valentin already
> noted. His suggestions for the changelog also make sense to me.

ok.
Thanks

>
> ---
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -8205,10 +8205,10 @@ static bool update_sd_pick_busiest(struc
>  {
> struct sg_lb_stats *busiest = &sds->busiest_stat;
>
> -
> /* Make sure that there is at least one task to pull */
> if (!sgs->sum_h_nr_running)
> return false;
> +
> /*
>  * Don't try to pull misfit tasks we can't help.
>  * We can use max_capacity here as reduction in capacity on some
> @@ -8239,11 +8239,11 @@ static bool update_sd_pick_busiest(struc
> break;
>
> case group_imbalanced:
> -   /* Select the 1st imbalanced group as we don't have
> -* any way to choose one more than another
> +   /*
> +* Select the 1st imbalanced group as we don't have any way to
> +* choose one more than another
>  */
> return false;
> -   break;
>
> case group_asym_capacity:
> /* Prefer to move from lowest priority CPU's work */
> @@ -8253,8 +8253,8 @@ static bool update_sd_pick_busiest(struc
>
> case group_misfit_task:
> /*
> -* If we have more than one misfit sg go with the
> -* biggest misfit.
> +* If we have more than one misfit sg go with the biggest
> +* misfit.
>  */
> if (sgs->group_misfit_task_load < 
> busiest->group_misfit_task_load)
> return false;
> @@ -8262,14 +8262,14 @@ static bool update_sd_pick_busiest(struc
>
> case group_fully_busy:
> /*
> -* Select the fully busy group with highest avg_load.
> -* In theory, there is no need to pull task from such
> -* kind of group because tasks have all compute
> -* capacity that they need but we can still improve the
> -* overall throughput by reducing contention when
> -* accessing shared HW resources.
> -* XXX for now avg_load is not computed and always 0 so
> -* we select the 1st one.
> +* Select the fully busy group with highest avg_load.  In
> +* theory, there is no need to pull task from such kind of
> +* group because tasks have all compute capacity that they 
> need
> +* but we can still improve the overall throughput by reducing
> +* contention when accessing shared HW resources.
> +*
> +* XXX for now avg_load is not computed and always 0 so we
> +* select the 1st one.
>  */
> if 

Re: [PATCH v2 4/8] sched/fair: rework load_balance

2019-08-26 Thread Vincent Guittot
On Mon, 5 Aug 2019 at 19:07, Valentin Schneider
 wrote:
>
> Hi Vincent,
>
> Here's another batch of comments, still need to go through some more of it.
>
> On 01/08/2019 15:40, Vincent Guittot wrote:
> > The load_balance algorithm contains some heuristics which have becomes
>
> s/becomes/become/

yep, for this one and the other typos

>
> > meaningless since the rework of metrics and the introduction of PELT.
>^^
> Which metrics? I suppose you mean the s*_lb_stats structs?
>
> >
> > Furthermore, it's sometimes difficult to fix wrong scheduling decisions
> > because everything is based on load whereas some imbalances are not
> > related to the load.
>
> Hmm, well, they don't start as wrong decisions, right? It's just that
> certain imbalance scenarios can't be solved by looking only at load.
>
> What about something along those lines?
>
> """
> Furthermore, load is an ill-suited metric for solving certain task
> placement imbalance scenarios. For instance, in the presence of idle CPUs,
> we should simply try to get at least one task per CPU, whereas the current
> load-based algorithm can actually leave idle CPUs alone simply because the
> load is somewhat balanced.
> """
>
> > The current algorithm ends up to create virtual and
>
> s/to create/creating/
>
> > meaningless value like the avg_load_per_task or tweaks the state of a
> > group to make it overloaded whereas it's not, in order to try to migrate
> > tasks.
> >
> > load_balance should better qualify the imbalance of the group and define
> > cleary what has to be moved to fix this imbalance.
>
> s/define cleary/clearly define/
>
> >
> > The type of sched_group has been extended to better reflect the type of
> > imbalance. We now have :
> >   group_has_spare
> >   group_fully_busy
> >   group_misfit_task
> >   group_asym_capacity
> >   group_imbalanced
> >   group_overloaded
> >
> > Based on the type of sched_group, load_balance now sets what it wants to
> > move in order to fix the imnbalance. It can be some load as before but
>
> s/imnbalance/imbalance/
>
> > also some utilization, a number of task or a type of task:
> >   migrate_task
> >   migrate_util
> >   migrate_load
> >   migrate_misfit
> >
> > This new load_balance algorithm fixes several pending wrong tasks
> > placement:
> > - the 1 task per CPU case with asymetrics system
>
> s/asymetrics/asymmetric/
>
> > - the case of cfs task preempted by other class
> > - the case of tasks not evenly spread on groups with spare capacity
> >
> > The load balance decisions have been gathered in 3 functions:
> > - update_sd_pick_busiest() select the busiest sched_group.
> > - find_busiest_group() checks if there is an imabalance between local and
>
> s/imabalance/imbalance/
>
> >   busiest group.
> > - calculate_imbalance() decides what have to be moved.
>
> That's nothing new, isn't it? I think what you mean there is that the

There are 2 things:
- part of the algorithm is new and fixes wrong task placement
- everything has been consolidated in the 3 functions above, whereas
there were some bypasses and hacks in the current code

> decisions have been consolidated in those areas, rather than being spread

I would not say that the code was spread all over the place, because
90% was already correctly placed, but there were a few cases that had
been added outside these functions

> all over the place.
>
> >
> > Signed-off-by: Vincent Guittot 
> > ---
> >  kernel/sched/fair.c | 581 
> > ++--
> >  1 file changed, 379 insertions(+), 202 deletions(-)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index d7f4a7e..a8681c3 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -7136,13 +7136,28 @@ static unsigned long __read_mostly 
> > max_load_balance_interval = HZ/10;
> >
> >  enum fbq_type { regular, remote, all };
> >
> > +/*
> > + * group_type describes the group of CPUs at the moment of the load 
> > balance.
> > + * The enum is ordered by pulling priority, with the group with lowest 
> > priority
> > + * first so the groupe_type can be simply compared when selecting the 
> > busiest
> > + * group. see update_sd_pick_busiest().
> > + */
> >  enum group_type {
> > - group_other = 0,
> > + group_has_spare = 0,
> > + group_fully_busy,
> >   group_misfit_task,
> > + group_asym_capacity,
>
> That one got me confused - it's about asym packing, not asym capacity, and
> the name should reflect that. I'd vote for group_asym_packing to stay in
> line with what Quentin did for the sd shortcut pointers in

yep asym_packing is probably better

>
>   011b27bb5d31 ("sched/topology: Add lowest CPU asymmetry sched_domain level 
> pointer")
>
> >   group_imbalanced,
> >   group_overloaded,
> >  };
> >
> > +enum migration_type {
> > + migrate_task = 0,
> > + migrate_util,
> > + migrate_load,
> > + migrate_misfit,
>
> 

Re: [PATCH v2 4/8] sched/fair: rework load_balance

2019-08-07 Thread Valentin Schneider
On 06/08/2019 18:17, Valentin Schneider wrote:
>> @@ -8765,7 +8942,7 @@ static int load_balance(int this_cpu, struct rq 
>> *this_rq,
>>  env.src_rq = busiest;
>>  
>>  ld_moved = 0;
>> -if (busiest->cfs.h_nr_running > 1) {
>> +if (busiest->nr_running > 1) {
> 
> Shouldn't that stay h_nr_running ? We can't do much if those aren't CFS
> tasks.
> 

Wait, so that seems to be a correction of an over-zealous rename in patch
2/8, but I think we actually *do* want it to be a cfs.h_nr_running check
here.

And actually this made me have a think about our active balance checks;
I'm cooking something up in that regard.
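For context, the counter-argument raised elsewhere in the thread (the
1 RT + 1 CFS task case) boils down to a runqueue state like this
(illustrative sketch only, not kernel code):

	/*
	 * Busiest rq runs one RT task and has one queued CFS task:
	 *
	 *     busiest->nr_running       == 2   (RT + CFS)
	 *     busiest->cfs.h_nr_running == 1   (the queued CFS task)
	 *
	 * The rq-wide check still lets detach_tasks() run and pull the
	 * queued CFS task directly, whereas gating on cfs.h_nr_running > 1
	 * skips it and heads towards the active-balance path even though
	 * the CFS task is not the one currently running.
	 */
	if (busiest->nr_running > 1) {
		/* attempt to detach and move queued tasks */
	}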


Re: [PATCH v2 4/8] sched/fair: rework load_balance

2019-08-06 Thread Valentin Schneider
Second batch, get it while it's hot...

On 01/08/2019 15:40, Vincent Guittot wrote:
[...]
> @@ -7438,19 +7453,53 @@ static int detach_tasks(struct lb_env *env)
>   if (!can_migrate_task(p, env))
>   goto next;
>  
> - load = task_h_load(p);
> + switch (env->balance_type) {
> + case migrate_task:
> + /* Migrate task */
> + env->imbalance--;
> + break;
>  
> - if (sched_feat(LB_MIN) && load < 16 && 
> !env->sd->nr_balance_failed)
> - goto next;
> + case migrate_util:
> + util = task_util_est(p);
>  
> - if ((load / 2) > env->imbalance)
> - goto next;
> + if (util > env->imbalance)
> + goto next;
> +
> + env->imbalance -= util;
> + break;
> +
> + case migrate_load:
> + load = task_h_load(p);
> +
> + if (sched_feat(LB_MIN) &&
> + load < 16 && !env->sd->nr_balance_failed)
> + goto next;
> +
> + if ((load / 2) > env->imbalance)
> + goto next;
> +
> + env->imbalance -= load;
> + break;
> +
> + case migrate_misfit:
> + load = task_h_load(p);
> +
> + /*
> +  * utilization of misfit task might decrease a bit
> +  * since it has been recorded. Be conservative in the
> +  * condition.
> +  */
> + if (load < env->imbalance)
> + goto next;
> +
> + env->imbalance = 0;
> + break;
> + }
>  
>   detach_task(p, env);
>   list_add(&p->se.group_node, &env->tasks);
>  
>   detached++;
> - env->imbalance -= load;
>  

It's missing something like this:

-8<-
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b0631ff2a4bd..055563e19090 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7506,7 +7506,7 @@ static int detach_tasks(struct lb_env *env)
 
/*
 * We only want to steal up to the prescribed amount of
-* runnable load.
+* load/util/tasks
 */
if (env->imbalance <= 0)
break;
->8-

[...]
> @@ -8013,19 +8059,26 @@ group_smaller_max_cpu_capacity(struct sched_group 
> *sg, struct sched_group *ref)
>  }
>  
>  static inline enum
> -group_type group_classify(struct sched_group *group,
> +group_type group_classify(struct lb_env *env,
> +   struct sched_group *group,
> struct sg_lb_stats *sgs)
>  {
> - if (sgs->group_no_capacity)
> + if (group_is_overloaded(env, sgs))
>   return group_overloaded;
>  
>   if (sg_imbalanced(group))
>   return group_imbalanced;
>  
> + if (sgs->group_asym_capacity)
> + return group_asym_capacity;
> +
>   if (sgs->group_misfit_task_load)
>   return group_misfit_task;
>  
> - return group_other;
> + if (!group_has_capacity(env, sgs))
> + return group_fully_busy;

If I got my booleans right, reaching group_fully_busy means
!group_is_overloaded() && !group_has_capacity(), which gives us:

- If nr_running > group_weight, then we had to have
  sgs->group_capacity * 100 == sgs->group_util * env->sd->imbalance_pct

- If nr_running == group_weight, then we had to have
  sgs->group_capacity * 100 <= sgs->group_util * env->sd->imbalance_pct

- (if nr_running < group_weight we go to group_has_spare)

That first equality seems somewhat odd considering how rarely it will
occur, but then we still want the util check to fall down to
group_has_spare when

  nr_running >= group_weight && 
  sgs->group_capacity * 100 > sgs->group_util * env->sd->imbalance_pct

Maybe we could change group_has_capacity()'s util comparison from '>' to
'>=' to avoid this weird "artefact".
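For reference, the two predicates being combined here look roughly like
this (paraphrased from the kernel around that time; exact field names may
differ from the posted series, and the suggested '>' to '>=' change is
marked with a comment):

	/* Paraphrased sketch, not a verbatim copy of the posted series. */
	static inline bool
	group_has_capacity(struct lb_env *env, struct sg_lb_stats *sgs)
	{
		if (sgs->sum_h_nr_running < sgs->group_weight)
			return true;

		/* Making this '>=' is the tweak suggested above. */
		if ((sgs->group_capacity * 100) >
		    (sgs->group_util * env->sd->imbalance_pct))
			return true;

		return false;
	}

	static inline bool
	group_is_overloaded(struct lb_env *env, struct sg_lb_stats *sgs)
	{
		if (sgs->sum_h_nr_running <= sgs->group_weight)
			return false;

		if ((sgs->group_capacity * 100) <
		    (sgs->group_util * env->sd->imbalance_pct))
			return true;

		return false;
	}

With those, landing in group_fully_busy (neither overloaded nor has
capacity) with nr_running > group_weight does indeed pin
group_capacity * 100 == group_util * imbalance_pct, which is the equality
called out above.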

> +
> + return group_has_spare;
>  }
>  
>  static bool update_nohz_stats(struct rq *rq, bool force)

[...]
> @@ -8147,11 +8223,67 @@ static bool update_sd_pick_busiest(struct lb_env *env,
>   if (sgs->group_type < busiest->group_type)
>   return false;
>  
> - if (sgs->avg_load <= busiest->avg_load)
> + /*
> +  * The candidate and the current busiest group are the same type of
> +  * group. Let check which one is the busiest according to the type.
> +  */
> +
> + switch (sgs->group_type) {
> + case group_overloaded:
> + /* Select the overloaded group with highest avg_load. */
> + if (sgs->avg_load <= busiest->avg_load)
> +

Re: [PATCH v2 4/8] sched/fair: rework load_balance

2019-08-06 Thread Peter Zijlstra
On Thu, Aug 01, 2019 at 04:40:20PM +0200, Vincent Guittot wrote:
> The load_balance algorithm contains some heuristics which have becomes
> meaningless since the rework of metrics and the introduction of PELT.
> 
> Furthermore, it's sometimes difficult to fix wrong scheduling decisions
> because everything is based on load whereas some imbalances are not
> related to the load. The current algorithm ends up to create virtual and
> meaningless value like the avg_load_per_task or tweaks the state of a
> group to make it overloaded whereas it's not, in order to try to migrate
> tasks.
> 
> load_balance should better qualify the imbalance of the group and define
> cleary what has to be moved to fix this imbalance.
> 
> The type of sched_group has been extended to better reflect the type of
> imbalance. We now have :
>   group_has_spare
>   group_fully_busy
>   group_misfit_task
>   group_asym_capacity
>   group_imbalanced
>   group_overloaded
> 
> Based on the type of sched_group, load_balance now sets what it wants to
> move in order to fix the imnbalance. It can be some load as before but
> also some utilization, a number of task or a type of task:
>   migrate_task
>   migrate_util
>   migrate_load
>   migrate_misfit
> 
> This new load_balance algorithm fixes several pending wrong tasks
> placement:
> - the 1 task per CPU case with asymetrics system
> - the case of cfs task preempted by other class
> - the case of tasks not evenly spread on groups with spare capacity
> 
> The load balance decisions have been gathered in 3 functions:
> - update_sd_pick_busiest() select the busiest sched_group.
> - find_busiest_group() checks if there is an imabalance between local and
>   busiest group.
> - calculate_imbalance() decides what have to be moved.

I like this lots, very good stuff.

It is weird how misfit is still load, but istr later patches cure that.

Here's a whole pile of nits, some overlap with what Valentin already
noted. His suggestions for the changelog also make sense to me.

---
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8205,10 +8205,10 @@ static bool update_sd_pick_busiest(struc
 {
struct sg_lb_stats *busiest = &sds->busiest_stat;
 
-
/* Make sure that there is at least one task to pull */
if (!sgs->sum_h_nr_running)
return false;
+
/*
 * Don't try to pull misfit tasks we can't help.
 * We can use max_capacity here as reduction in capacity on some
@@ -8239,11 +8239,11 @@ static bool update_sd_pick_busiest(struc
break;
 
case group_imbalanced:
-   /* Select the 1st imbalanced group as we don't have
-* any way to choose one more than another
+   /*
+* Select the 1st imbalanced group as we don't have any way to
+* choose one more than another
 */
return false;
-   break;
 
case group_asym_capacity:
/* Prefer to move from lowest priority CPU's work */
@@ -8253,8 +8253,8 @@ static bool update_sd_pick_busiest(struc
 
case group_misfit_task:
/*
-* If we have more than one misfit sg go with the
-* biggest misfit.
+* If we have more than one misfit sg go with the biggest
+* misfit.
 */
if (sgs->group_misfit_task_load < 
busiest->group_misfit_task_load)
return false;
@@ -8262,14 +8262,14 @@ static bool update_sd_pick_busiest(struc
 
case group_fully_busy:
/*
-* Select the fully busy group with highest avg_load.
-* In theory, there is no need to pull task from such
-* kind of group because tasks have all compute
-* capacity that they need but we can still improve the
-* overall throughput by reducing contention when
-* accessing shared HW resources.
-* XXX for now avg_load is not computed and always 0 so
-* we select the 1st one.
+* Select the fully busy group with highest avg_load.  In
+* theory, there is no need to pull task from such kind of
+* group because tasks have all compute capacity that they need
+* but we can still improve the overall throughput by reducing
+* contention when accessing shared HW resources.
+*
+* XXX for now avg_load is not computed and always 0 so we
+* select the 1st one.
 */
if (sgs->avg_load <= busiest->avg_load)
return false;
@@ -8277,11 +8277,11 @@ static bool update_sd_pick_busiest(struc
 
case group_has_spare:
/*
-* Select not overloaded group with lowest number of
-* idle cpus. 

Re: [PATCH v2 4/8] sched/fair: rework load_balance

2019-08-05 Thread Valentin Schneider
Hi Vincent,

Here's another batch of comments, still need to go through some more of it.

On 01/08/2019 15:40, Vincent Guittot wrote:
> The load_balance algorithm contains some heuristics which have becomes

s/becomes/become/

> meaningless since the rework of metrics and the introduction of PELT.
   ^^
Which metrics? I suppose you mean the s*_lb_stats structs?

> 
> Furthermore, it's sometimes difficult to fix wrong scheduling decisions
> because everything is based on load whereas some imbalances are not
> related to the load.

Hmm, well, they don't start as wrong decisions, right? It's just that
certain imbalance scenarios can't be solved by looking only at load.

What about something along those lines?

"""
Furthermore, load is an ill-suited metric for solving certain task
placement imbalance scenarios. For instance, in the presence of idle CPUs,
we should simply try to get at least one task per CPU, whereas the current
load-based algorithm can actually leave idle CPUs alone simply because the
load is somewhat balanced.
"""

> The current algorithm ends up to create virtual and

s/to create/creating/

> meaningless value like the avg_load_per_task or tweaks the state of a
> group to make it overloaded whereas it's not, in order to try to migrate
> tasks.
> 
> load_balance should better qualify the imbalance of the group and define
> cleary what has to be moved to fix this imbalance.

s/define cleary/clearly define/

> 
> The type of sched_group has been extended to better reflect the type of
> imbalance. We now have :
>   group_has_spare
>   group_fully_busy
>   group_misfit_task
>   group_asym_capacity
>   group_imbalanced
>   group_overloaded
> 
> Based on the type of sched_group, load_balance now sets what it wants to
> move in order to fix the imnbalance. It can be some load as before but

s/imnbalance/imbalance/

> also some utilization, a number of task or a type of task:
>   migrate_task
>   migrate_util
>   migrate_load
>   migrate_misfit
> 
> This new load_balance algorithm fixes several pending wrong tasks
> placement:
> - the 1 task per CPU case with asymetrics system

s/asymetrics/asymmetric/

> - the case of cfs task preempted by other class
> - the case of tasks not evenly spread on groups with spare capacity
> 
> The load balance decisions have been gathered in 3 functions:
> - update_sd_pick_busiest() select the busiest sched_group.
> - find_busiest_group() checks if there is an imabalance between local and

s/imabalance/imbalance/

>   busiest group.
> - calculate_imbalance() decides what have to be moved.

That's nothing new, isn't it? I think what you mean there is that the
decisions have been consolidated in those areas, rather than being spread
all over the place.

> 
> Signed-off-by: Vincent Guittot 
> ---
>  kernel/sched/fair.c | 581 
> ++--
>  1 file changed, 379 insertions(+), 202 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index d7f4a7e..a8681c3 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -7136,13 +7136,28 @@ static unsigned long __read_mostly 
> max_load_balance_interval = HZ/10;
>  
>  enum fbq_type { regular, remote, all };
>  
> +/*
> + * group_type describes the group of CPUs at the moment of the load balance.
> + * The enum is ordered by pulling priority, with the group with lowest 
> priority
> + * first so the groupe_type can be simply compared when selecting the busiest
> + * group. see update_sd_pick_busiest().
> + */
>  enum group_type {
> - group_other = 0,
> + group_has_spare = 0,
> + group_fully_busy,
>   group_misfit_task,
> + group_asym_capacity,

That one got me confused - it's about asym packing, not asym capacity, and
the name should reflect that. I'd vote for group_asym_packing to stay in
line with what Quentin did for the sd shortcut pointers in

  011b27bb5d31 ("sched/topology: Add lowest CPU asymmetry sched_domain level 
pointer")

>   group_imbalanced,
>   group_overloaded,
>  };
>  
> +enum migration_type {
> + migrate_task = 0,
> + migrate_util,
> + migrate_load,
> + migrate_misfit,

nitpicking here: AFAICT the ordering of this doesn't really matter,
could we place migrate_misfit next to migrate_task? As you call out in the
header, we can migrate a number of tasks or a type of task, so these should
be somewhat together.

If we're afraid that we'll add other types of tasks later on and that this
won't result in a neat append-to-the-end, we could reverse the order like
so:

migrate_load
migrate_util
migrate_task
migrate_misfit
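Spelled out as the enum itself, the suggested ordering would be (a sketch
of the proposal above, not the enum as posted):

	enum migration_type {
		migrate_load = 0,
		migrate_util,
		migrate_task,
		/* misfit stays appended, next to the task-count case */
		migrate_misfit,
	};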

[...]
> @@ -7745,10 +7793,10 @@ struct sg_lb_stats {
>  struct sd_lb_stats {
>   struct sched_group *busiest;/* Busiest group in this sd */
>   struct sched_group *local;  /* Local group in this sd */
> - unsigned long total_running;

Could be worth calling out in the log that this gets snipped out. Or it
could go into its own small cleanup patch, since it's just an unused field.