Re: [PATCH updated v2] sched/fair: core wide cfs task priority comparison

2020-06-07 Thread Ning, Hongyu
On 2020/5/14 21:02, Peter Zijlstra wrote:
> On Fri, May 08, 2020 at 08:34:57PM +0800, Aaron Lu wrote:
>> With this said, I realized a workaround for the issue described above:
>> when the core went from 'compatible mode'(step 1-3) to 'incompatible
>> mode'(step 4), reset all root level sched entities' vruntime to be the
>> same as the core wide min_vruntime. After all, the core is transforming
>> from two runqueue mode to single runqueue mode... I think this can solve
>> the issue to some extent but I may miss other scenarios.
> 
> A little something like so, this syncs min_vruntime when we switch to
> single queue mode. This is very much SMT2 only, I got my head in twist
> when thinking about more siblings, I'll have to try again later.
> 
> This very much retains the horrible approximation of S we always do.
> 
> Also, it is _completely_ untested...
> 
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -102,7 +102,6 @@ static inline int __task_prio(struct tas
>  /* real prio, less is less */
>  static inline bool prio_less(struct task_struct *a, struct task_struct *b)
>  {
> -
>   int pa = __task_prio(a), pb = __task_prio(b);
>  
>   if (-pa < -pb)
> @@ -114,19 +113,8 @@ static inline bool prio_less(struct task
>   if (pa == -1) /* dl_prio() doesn't work because of stop_class above */
>   return !dl_time_before(a->dl.deadline, b->dl.deadline);
>  
> - if (pa == MAX_RT_PRIO + MAX_NICE)  { /* fair */
> - u64 vruntime = b->se.vruntime;
> -
> - /*
> -  * Normalize the vruntime if tasks are in different cpus.
> -  */
> - if (task_cpu(a) != task_cpu(b)) {
> - vruntime -= task_cfs_rq(b)->min_vruntime;
> - vruntime += task_cfs_rq(a)->min_vruntime;
> - }
> -
> - return !((s64)(a->se.vruntime - vruntime) <= 0);
> - }
> + if (pa == MAX_RT_PRIO + MAX_NICE)
> + return cfs_prio_less(a, b);
>  
>   return false;
>  }
> @@ -4293,10 +4281,11 @@ static struct task_struct *
>  pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
>  {
>   struct task_struct *next, *max = NULL;
> + int old_active = 0, new_active = 0;
>   const struct sched_class *class;
>   const struct cpumask *smt_mask;
> - int i, j, cpu;
>   bool need_sync = false;
> + int i, j, cpu;
>  
>   cpu = cpu_of(rq);
>   if (cpu_is_offline(cpu))
> @@ -4349,10 +4338,14 @@ pick_next_task(struct rq *rq, struct tas
>   rq_i->core_pick = NULL;
>  
>   if (rq_i->core_forceidle) {
> + // XXX is_idle_task(rq_i->curr) && rq_i->nr_running ??
>   need_sync = true;
>   rq_i->core_forceidle = false;
>   }
>  
> + if (!is_idle_task(rq_i->curr))
> + old_active++;
> +
>   if (i != cpu)
>   update_rq_clock(rq_i);
>   }
> @@ -4463,8 +4456,12 @@ next_class:;
>  
>   WARN_ON_ONCE(!rq_i->core_pick);
>  
> - if (is_idle_task(rq_i->core_pick) && rq_i->nr_running)
> - rq_i->core_forceidle = true;
> + if (is_idle_task(rq_i->core_pick)) {
> + if (rq_i->nr_running)
> + rq_i->core_forceidle = true;
> + } else {
> + new_active++;
> + }
>  
>   if (i == cpu)
>   continue;
> @@ -4476,6 +4473,16 @@ next_class:;
>   WARN_ON_ONCE(!cookie_match(next, rq_i->core_pick));
>   }
>  
> + /* XXX SMT2 only */
> + if (new_active == 1 && old_active > 1) {
> + /*
> +  * We just dropped into single-rq mode, increment the sequence
> +  * count to trigger the vruntime sync.
> +  */
> + rq->core->core_sync_seq++;
> + }
> + rq->core->core_active = new_active;
> +
>  done:
>   set_next_task(rq, next);
>   return next;
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -386,6 +386,12 @@ is_same_group(struct sched_entity *se, s
>   return NULL;
>  }
>  
> +static inline bool
> +is_same_tg(struct sched_entity *se, struct sched_entity *pse)
> +{
> + return se->cfs_rq->tg == pse->cfs_rq->tg;
> +}
> +
>  static inline struct sched_entity *parent_entity(struct sched_entity *se)
>  {
>   return se->parent;
> @@ -394,8 +400,6 @@ static inline struct sched_entity *paren
>  static void
>  find_matching_se(struct sched_entity **se, struct sched_entity **pse)
>  {
> - int se_depth, pse_depth;
> -
>   /*
>* preemption test can be made between sibling entities who are in the
>* same cfs_rq i.e who have a common parent. Walk up the hierarchy of
> @@ -403,23 +407,16 @@ find_matching_se(struct sched_entity **s
>* parent.
>*/
>  
> - /* First walk up until both entities are at same depth */

Re: [PATCH updated v2] sched/fair: core wide cfs task priority comparison

2020-05-22 Thread Aaron Lu
On Sat, May 16, 2020 at 11:42:30AM +0800, Aaron Lu wrote:
> On Thu, May 14, 2020 at 03:02:48PM +0200, Peter Zijlstra wrote:
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -4476,6 +4473,16 @@ next_class:;
> > WARN_ON_ONCE(!cookie_match(next, rq_i->core_pick));
> > }
> >  
> > +   /* XXX SMT2 only */
> > +   if (new_active == 1 && old_active > 1) {
> 
> There is a case where an incompatible task appears but we fail to 'drop
> into single-rq mode' per the above condition check. The TLDR is: when
> there is a task sitting on the sibling rq with the same cookie as 'max',
> new_active will be 2 instead of 1, and that causes us to miss the chance
> to do a sync of core min_vruntime.

FWIW: when I disable the feature of running the cookie_pick task on the
sibling and thus enforce a strict single-rq mode, Peter's patch works well
for the scenario described below.

> This is how it happens:
> 1) 2 tasks of the same cgroup with different weights run on 2 siblings,
>    say cg0_A with weight 1024 bound to cpu0 and cg0_B with weight 2 bound
>    to cpu1 (assume cpu0 and cpu1 are siblings);
> 2) Since new_active == 2, we didn't trigger the min_vruntime sync. For
>    simplicity, let's assume both siblings' root cfs_rq's min_vruntime and
>    core_vruntime are all at 0 now;
> 3) let the two tasks run for a while;
> 4) a new task cg1_C of another cgroup gets queued on cpu1. Since cpu1's
>    existing task has a very small weight, its cfs_rq's min_vruntime can
>    be much larger than cpu0's cfs_rq min_vruntime. So cg1_C's vruntime is
>    much larger than cg0_A's, and the 'max' of the core wide task
>    selection goes to cg0_A;
> 5) Now I suppose we should drop into single-rq mode and, by doing a sync
>    of core min_vruntime, cg1_C's turn would come. But the problem is, our
>    current selection logic prefers not to waste CPU time, so after it
>    decides cg0_A is the 'max', the sibling will also do a cookie_pick()
>    and get cg0_B to run. This is where the problem arises: new_active is
>    2 instead of the expected 1.
> 6) Because we didn't do the sync of core min_vruntime, the newly queued
>    cg1_C will wait a long time before cg0_A's vruntime catches up.

P.S. this is what I did to enforce a strict single-rq mode:

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 1fa5b48b742a..0f5580bc7e96 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4411,7 +4411,7 @@ pick_task(struct rq *rq, const struct sched_class *class, struct task_struct *max
(!max || prio_less(max, class_pick)))
return class_pick;
 
-   return cookie_pick;
+   return NULL;
 }
 
 static struct task_struct *


Re: [PATCH updated v2] sched/fair: core wide cfs task priority comparison

2020-05-15 Thread Aaron Lu
On Thu, May 14, 2020 at 03:02:48PM +0200, Peter Zijlstra wrote:
> On Fri, May 08, 2020 at 08:34:57PM +0800, Aaron Lu wrote:
> > With this said, I realized a workaround for the issue described above:
> > when the core went from 'compatible mode'(step 1-3) to 'incompatible
> > mode'(step 4), reset all root level sched entities' vruntime to be the
> > same as the core wide min_vruntime. After all, the core is transforming
> > from two runqueue mode to single runqueue mode... I think this can solve
> > the issue to some extent but I may miss other scenarios.
> 
> A little something like so, this syncs min_vruntime when we switch to
> single queue mode. This is very much SMT2 only, I got my head in twist
> when thinking about more siblings, I'll have to try again later.

Thanks a lot for the patch, I now see that "there is no need to adjust
every se's vruntime". :-)

> This very much retains the horrible approximation of S we always do.
> 
> Also, it is _completely_ untested...

I've been testing it.

One problem below.

> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -4293,10 +4281,11 @@ static struct task_struct *
>  pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
>  {
>   struct task_struct *next, *max = NULL;
> + int old_active = 0, new_active = 0;
>   const struct sched_class *class;
>   const struct cpumask *smt_mask;
> - int i, j, cpu;
>   bool need_sync = false;
> + int i, j, cpu;
>  
>   cpu = cpu_of(rq);
>   if (cpu_is_offline(cpu))
> @@ -4349,10 +4338,14 @@ pick_next_task(struct rq *rq, struct tas
>   rq_i->core_pick = NULL;
>  
>   if (rq_i->core_forceidle) {
> + // XXX is_idle_task(rq_i->curr) && rq_i->nr_running ??
>   need_sync = true;
>   rq_i->core_forceidle = false;
>   }
>  
> + if (!is_idle_task(rq_i->curr))
> + old_active++;
> +
>   if (i != cpu)
>   update_rq_clock(rq_i);
>   }
> @@ -4463,8 +4456,12 @@ next_class:;
>  
>   WARN_ON_ONCE(!rq_i->core_pick);
>  
> - if (is_idle_task(rq_i->core_pick) && rq_i->nr_running)
> - rq_i->core_forceidle = true;
> + if (is_idle_task(rq_i->core_pick)) {
> + if (rq_i->nr_running)
> + rq_i->core_forceidle = true;
> + } else {
> + new_active++;
> + }
>  
>   if (i == cpu)
>   continue;
> @@ -4476,6 +4473,16 @@ next_class:;
>   WARN_ON_ONCE(!cookie_match(next, rq_i->core_pick));
>   }
>  
> + /* XXX SMT2 only */
> + if (new_active == 1 && old_active > 1) {

There is a case where an incompatible task appears but we fail to 'drop
into single-rq mode' per the above condition check. The TLDR is: when
there is a task sitting on the sibling rq with the same cookie as 'max',
new_active will be 2 instead of 1, and that causes us to miss the chance
to do a sync of core min_vruntime.

This is how it happens:
1) 2 tasks of the same cgroup with different weights run on 2 siblings,
   say cg0_A with weight 1024 bound to cpu0 and cg0_B with weight 2 bound
   to cpu1 (assume cpu0 and cpu1 are siblings);
2) Since new_active == 2, we didn't trigger the min_vruntime sync. For
   simplicity, let's assume both siblings' root cfs_rq's min_vruntime and
   core_vruntime are all at 0 now;
3) let the two tasks run for a while;
4) a new task cg1_C of another cgroup gets queued on cpu1. Since cpu1's
   existing task has a very small weight, its cfs_rq's min_vruntime can
   be much larger than cpu0's cfs_rq min_vruntime. So cg1_C's vruntime is
   much larger than cg0_A's, and the 'max' of the core wide task
   selection goes to cg0_A;
5) Now I suppose we should drop into single-rq mode and, by doing a sync
   of core min_vruntime, cg1_C's turn would come. But the problem is, our
   current selection logic prefers not to waste CPU time, so after it
   decides cg0_A is the 'max', the sibling will also do a cookie_pick()
   and get cg0_B to run. This is where the problem arises: new_active is
   2 instead of the expected 1.
6) Because we didn't do the sync of core min_vruntime, the newly queued
   cg1_C will wait a long time before cg0_A's vruntime catches up.

One naive way of precisely determining when to drop into single-rq mode
is to track how many tasks of a particular tag exist and use that to
decide if the core is in compatible mode (all tasks belong to the same
cgroup, IOW, have the same core_cookie) or not, and act accordingly.
Except: does this sound too complex and inefficient?...
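
To make the bookkeeping concrete, here is a rough userspace-style sketch of
what I mean (all struct/function names below are made up for illustration;
this is not against the actual core scheduling code):

#include <stdbool.h>

/*
 * Hypothetical per-core accounting: 'compatible mode' means every runnable
 * task on the core carries the same cookie.
 */
struct core_cookie_count {
	unsigned long cookie;		/* cookie currently being counted */
	unsigned int nr_same;		/* runnable tasks with that cookie */
	unsigned int nr_other;		/* runnable tasks with any other cookie */
};

static void core_account_enqueue(struct core_cookie_count *cc, unsigned long cookie)
{
	if (!cc->nr_same && !cc->nr_other)
		cc->cookie = cookie;	/* first runnable task defines the cookie */

	if (cookie == cc->cookie)
		cc->nr_same++;
	else
		cc->nr_other++;
}

static void core_account_dequeue(struct core_cookie_count *cc, unsigned long cookie)
{
	if (cookie == cc->cookie)
		cc->nr_same--;
	else
		cc->nr_other--;

	/*
	 * Note: cc->cookie goes stale if nr_same drops to 0 while nr_other
	 * tasks remain; a real implementation would have to re-elect the
	 * dominant cookie here, which is part of the complexity I worry about.
	 */
}

/* Compatible mode: all runnable tasks on the core share one cookie. */
static bool core_is_compatible(const struct core_cookie_count *cc)
{
	return cc->nr_other == 0;
}

These hooks would have to be driven from the enqueue/dequeue paths, which is
exactly the extra per-task accounting I'm not sure is worth it.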

> + /*
> +  * We just dropped into single-rq mode, increment the sequence
> +  * count to trigger the vruntime sync.
> +  */
> + rq->core->core_sync_seq++;
> + }
> + rq->core->core_active = new_active;
> +

Re: [PATCH updated v2] sched/fair: core wide cfs task priority comparison

2020-05-15 Thread Vineeth Remanan Pillai
On Fri, May 15, 2020 at 6:39 AM Peter Zijlstra  wrote:
>
> It's complicated ;-)
>
> So this sync is basically a relative reset of S to 0.
>
> So with 2 queues, when one goes idle, we drop them both to 0 and one
> then increases due to not being idle, and the idle one builds up lag to
> get re-elected. So far so simple, right?
>
> When there's 3, we can have the situation where 2 run and one is idle,
> we sync to 0 and let the idle one build up lag to get re-election. Now
> suppose another one also drops idle. At this point dropping all to 0
> again would destroy the built-up lag from the queue that was already
> idle, not good.
>
Thanks for the clarification :-).

I was suggesting an idea of a corewide force_idle state. We sync the
core_vruntime on the first force_idle of a sibling in the core and start
using core_vruntime for priority comparison from then on. That way, we
don't reset the lag on every force_idle and the lag builds up from the
first sibling that was forced idle. I think this would work with
infeasible weights as well, but I need to think more to see if it would
break. A sample check to enter this core wide force_idle state is:
(cpumask_weight(cpu_smt_mask(cpu)) == old_active && new_active < old_active)

And we exit the core wide force_idle state when the last sibling goes out
of force_idle, and can start using min_vruntime for priority comparison
from then on.
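
A rough sketch of that state machine (the names core_fi_state, core_wide_fi
and update_core_wide_fi are made up here; this is not a patch against the
actual code):

#include <stdbool.h>

/* Hypothetical per-core state for the idea above. */
struct core_fi_state {
	int nr_siblings;	/* cpumask_weight(cpu_smt_mask(cpu)) */
	bool core_wide_fi;	/* compare with core_vruntime while true */
};

/*
 * Called after the core wide pick; old_active/new_active are the counts
 * from Peter's patch, nr_force_idle is how many siblings this pick left
 * forced idle.
 */
static void update_core_wide_fi(struct core_fi_state *cs, int old_active,
				int new_active, int nr_force_idle)
{
	/* Enter: all siblings were busy and at least one just got forced idle. */
	if (!cs->core_wide_fi &&
	    cs->nr_siblings == old_active && new_active < old_active)
		cs->core_wide_fi = true;	/* sync core_vruntime at this point */

	/* Exit: the last forced idle sibling is gone; back to min_vruntime. */
	if (cs->core_wide_fi && nr_force_idle == 0)
		cs->core_wide_fi = false;
}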

When there is a cookie match on all siblings, we don't do a priority
comparison now. But I think we need to do the priority comparison for
cookie matches also, so that we update 'max' in the loop. And for this
comparison in a no-forced_idle scenario, I hope it is fine to use
min_vruntime. Updating 'max' in the loop when cookies match is not really
needed for SMT2, but would be needed for SMTn.

This is just a wild idea on top of your patches. It might not be accurate
in all cases and I need to think more about the corner cases. I thought I
would think out loud here :-)

> So instead of syncing everything, we can:
>
>   less := !((s64)(s_a - s_b) <= 0)
>
>   (v_a - S_a) - (v_b - S_b) == v_a - v_b - S_a + S_b
> == v_a - (v_b + S_a - S_b)
>
> IOW, we can recast the (lag) comparison to a one-sided difference.
> So then, instead of syncing the whole queue, sync the idle queue
> against the active queue with S_a + S_b at the point where we sync.
>
> (XXX consider the implication of living in a cyclic group: N / 2^n N)
>
> This gives us a means of syncing single queues against the active queue,
> and lets already idle queues preserve their built-up lag.
>
> Of course, then we get the situation where there's 2 active and one
> going idle, who do we pick to sync against? Theory would have us sync
> against the combined S, but as we've already demonstrated, there is no
> such thing in infeasible weight scenarios.
>
> One thing I've considered, and this is where that core_active rudiment
> came from, is having active queues sync up between themselves after
> every tick. This limits the observed divergence due to work
> conservation.
>
> On top of that, we can improve upon things by moving away from our
> horrible (10) hack and moving to (9) and employing (13) here.
>
> Anyway, I got partway through that in the past days, but then my head
> hurt. I'll consider it some more :-)
This sounds much better and like a more accurate approach than the one I
mentioned above. Please share the code when you have it in some form :-)

>
> > > +   new_active++;
> > I think we need to reset new_active on restarting the selection.
>
> But this loop is after selection has been done; we don't modify
> new_active during selection.
My bad, sorry about this false alarm!

> > > +
> > > +   vruntime_a = se_a->vruntime - cfs_rq_a->core_vruntime;
> > > +   vruntime_b = se_b->vruntime - cfs_rq_b->core_vruntime;
> > Should we be using core_vruntime conditionally? should it be min_vruntime for
> > default comparisons and core_vruntime during force_idle?
>
> At the very least it should be min_vruntime when cfs_rq_a == cfs_rq_b,
> ie. when we're on the same CPU.
>
yes, this makes sense.

The issue I was thinking about is this: when there is no force_idle and
all siblings run compatible tasks for a while, min_vruntime progresses
but core_vruntime lags behind. When a new task gets enqueued, it gets the
min_vruntime, but during the comparison it might then be treated unfairly.

Consider a small example of two rqs rq1 and rq2.
rq1->cfs->min_vruntime = 1000
rq2->cfs->min_vruntime = 2000

During a force_idle, core_vruntime gets synced and

rq1->cfs->core_vruntime = 1000
rq2->cfs->core_vruntime = 2000

Now, suppose the core is out of force_idle and runs two compatible tasks
for a while, where the task on rq1 has more weight. min_vruntime progresses
on both, but more slowly on rq1. Say the progress looks like:
rq1->cfs->min_vruntime = 1200, se1->vruntime = 1200
rq2->cfs->min_vruntime = 2500, se2->vruntime = 2500

If a new incompatible task

Re: [PATCH updated v2] sched/fair: core wide cfs task priority comparison

2020-05-15 Thread Peter Zijlstra
On Fri, May 15, 2020 at 12:38:44PM +0200, Peter Zijlstra wrote:
>   less := !((s64)(s_a - s_b) <= 0)
> 
>   (v_a - S_a) - (v_b - S_b) == v_a - v_b - S_a + S_b
> == v_a - (v_b + S_a - S_b)
> 

> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -469,7 +469,7 @@ bool cfs_prio_less(struct task_struct *a
>  {
>   struct sched_entity *se_a = &a->se, *se_b = &b->se;
>   struct cfs_rq *cfs_rq_a, *cfs_rq_b;
> - u64 vruntime_a, vruntime_b;
> + u64 s_a, s_b, S_a, S_b;
>  
>   while (!is_same_tg(se_a, se_b)) {
>   int se_a_depth = se_a->depth;
> @@ -484,10 +484,16 @@ bool cfs_prio_less(struct task_struct *a
>   cfs_rq_a = cfs_rq_of(se_a);
>   cfs_rq_b = cfs_rq_of(se_b);
>  
> - vruntime_a = se_a->vruntime - cfs_rq_a->core_vruntime;
> - vruntime_b = se_b->vruntime - cfs_rq_b->core_vruntime;
> + S_a = cfs_rq_a->core_vruntime;
> + S_b = cfs_rq_b->core_vruntime;
>  
> - return !((s64)(vruntime_a - vruntime_b) <= 0);
> + if (cfs_rq_a == cfs_rq_b)
> + S_a = S_b = cfs_rq_a->min_vruntime;
> +
> + s_a = se_a->vruntime - S_a;
> + s_b = se_b->vruntime - S_b;
> +
> + return !((s64)(s_a - s_b) <= 0);
>  }

Clearly I'm not awake yet; 's/s_/l_/g', 's/v_/s_/g', IOW:

  l_a = s_a - S_a
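
That is, applied to the hunk above (same code, only the notation renamed):

  S_a = cfs_rq_a->core_vruntime   /* cfs_rq_a->min_vruntime when cfs_rq_a == cfs_rq_b */
  s_a = se_a->vruntime
  l_a = s_a - S_a

  less := !((s64)(l_a - l_b) <= 0)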




Re: [PATCH updated v2] sched/fair: core wide cfs task priority comparison

2020-05-15 Thread Peter Zijlstra
On Thu, May 14, 2020 at 06:51:27PM -0400, Vineeth Remanan Pillai wrote:
> On Thu, May 14, 2020 at 9:02 AM Peter Zijlstra  wrote:
> >
> > A little something like so, this syncs min_vruntime when we switch to
> > single queue mode. This is very much SMT2 only, I got my head in twist
> > when thinking about more siblings, I'll have to try again later.
> >
> Thanks for the quick patch! :-)
> 
> For SMT-n, would it work if we sync vruntime when at least one sibling is
> forced idle? Since force_idle is for all the rqs, I think it would
> be correct to sync the vruntime if at least one cpu is forced idle.

It's complicated ;-)

So this sync is basically a relative reset of S to 0.

So with 2 queues, when one goes idle, we drop them both to 0 and one
then increases due to not being idle, and the idle one builds up lag to
get re-elected. So far so simple, right?

When there's 3, we can have the situation where 2 run and one is idle,
we sync to 0 and let the idle one build up lag to get re-election. Now
suppose another one also drops idle. At this point dropping all to 0
again would destroy the built-up lag from the queue that was already
idle, not good.
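
To put made-up numbers on that (purely illustrative):

  before the third queue drops idle:  lag(B) = 40, lag(C) = 0
  after resetting everything to 0:    lag(B) = 0,  lag(C) = 0

i.e. B loses the 40 units of entitlement it had built up while idle and gets
re-elected on equal footing with C, even though it has waited much longer.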

So instead of syncing everything, we can:

  less := !((s64)(s_a - s_b) <= 0)

  (v_a - S_a) - (v_b - S_b) == v_a - v_b - S_a + S_b
                            == v_a - (v_b + S_a - S_b)

IOW, we can recast the (lag) comparison to a one-sided difference.
So then, instead of syncing the whole queue, sync the idle queue
against the active queue with S_a + S_b at the point where we sync.

(XXX consider the implication of living in a cyclic group: N / 2^n N)

This gives us a means of syncing single queues against the active queue,
and lets already idle queues preserve their built-up lag.
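
Reading that back: comparing lags only needs the difference of the two zero
points,

  lag_a - lag_b = (v_a - S_a) - (v_b - S_b)
                = v_a - (v_b + (S_a - S_b))

so comparing a task on an idle queue against one on the active queue only
requires S_a - S_b from the point where we synced, not a common absolute S;
that is what lets a queue that was already idle keep its built-up lag
instead of being reset along with everything else.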

Of course, then we get the situation where there's 2 active and one
going idle, who do we pick to sync against? Theory would have us sync
against the combined S, but as we've already demonstrated, there is no
such thing in infeasible weight scenarios.

One thing I've considered, and this is where that core_active rudiment
came from, is having active queues sync up between themselves after
every tick. This limits the observed divergence due to work
conservation.

On top of that, we can improve upon things by moving away from our
horrible (10) hack and moving to (9) and employing (13) here.

Anyway, I got partway through that in the past days, but then my head
hurt. I'll consider it some more :-)

> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > -   if (is_idle_task(rq_i->core_pick) && rq_i->nr_running)
> > -   rq_i->core_forceidle = true;
> > +   if (is_idle_task(rq_i->core_pick)) {
> > +   if (rq_i->nr_running)
> > +   rq_i->core_forceidle = true;
> > +   } else {
> > +   new_active++;
> I think we need to reset new_active on restarting the selection.

But this loop is after selection has been done; we don't modify
new_active during selection.

> > +   /*
> > +* We just dropped into single-rq mode, increment the sequence
> > +* count to trigger the vruntime sync.
> > +*/
> > +   rq->core->core_sync_seq++;
> > +   }
> > +   rq->core->core_active = new_active;
> core_active seems to be unused.

Correct; those are rudiments from an SMT-n attempt.

> > +bool cfs_prio_less(struct task_struct *a, struct task_struct *b)
> > +{
> > +   struct sched_entity *se_a = &a->se, *se_b = &b->se;
> > +   struct cfs_rq *cfs_rq_a, *cfs_rq_b;
> > +   u64 vruntime_a, vruntime_b;
> > +
> > +   while (!is_same_tg(se_a, se_b)) {
> > +   int se_a_depth = se_a->depth;
> > +   int se_b_depth = se_b->depth;
> > +
> > +   if (se_a_depth <= se_b_depth)
> > +   se_b = parent_entity(se_b);
> > +   if (se_a_depth >= se_b_depth)
> > +   se_a = parent_entity(se_a);
> > +   }
> > +
> > +   cfs_rq_a = cfs_rq_of(se_a);
> > +   cfs_rq_b = cfs_rq_of(se_b);
> > +
> > +   vruntime_a = se_a->vruntime - cfs_rq_a->core_vruntime;
> > +   vruntime_b = se_b->vruntime - cfs_rq_b->core_vruntime;
> Should we be using core_vruntime conditionally? should it be min_vruntime for
> default comparisons and core_vruntime during force_idle?

At the very least it should be min_vruntime when cfs_rq_a == cfs_rq_b,
ie. when we're on the same CPU.

For the other case I was considering that tick based active sync, but
never got that finished and admittedly it all looks a bit weird. But I
figured I'd send it out so we can at least advance the discussion.

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -469,7 +469,7 @@ bool cfs_prio_less(struct task_struct *a
 {
struct sched_entity *se_a = &a->se, *se_b = &b->se;
 struct cfs_rq *cfs_rq_a, *cfs_rq_b;
-   u64 vruntime_a, vruntime_b;
+  

Re: [PATCH updated v2] sched/fair: core wide cfs task priority comparison

2020-05-14 Thread Vineeth Remanan Pillai
Hi Peter,

On Thu, May 14, 2020 at 9:02 AM Peter Zijlstra  wrote:
>
> A little something like so, this syncs min_vruntime when we switch to
> single queue mode. This is very much SMT2 only, I got my head in twist
> when thinking about more siblings, I'll have to try again later.
>
Thanks for the quick patch! :-)

For SMT-n, would it work if we sync vruntime when at least one sibling is
forced idle? Since force_idle is for all the rqs, I think it would
be correct to sync the vruntime if at least one cpu is forced idle.

> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> -   if (is_idle_task(rq_i->core_pick) && rq_i->nr_running)
> -   rq_i->core_forceidle = true;
> +   if (is_idle_task(rq_i->core_pick)) {
> +   if (rq_i->nr_running)
> +   rq_i->core_forceidle = true;
> +   } else {
> +   new_active++;
I think we need to reset new_active on restarting the selection.

> +   }
>
> if (i == cpu)
> continue;
> @@ -4476,6 +4473,16 @@ next_class:;
> WARN_ON_ONCE(!cookie_match(next, rq_i->core_pick));
> }
>
> +   /* XXX SMT2 only */
> +   if (new_active == 1 && old_active > 1) {
As I mentioned above, would it be correct to check if at least one sibling is
forced_idle? Something like:
if (cpumask_weight(cpu_smt_mask(cpu)) == old_active && new_active < old_active)

> +   /*
> +* We just dropped into single-rq mode, increment the sequence
> +* count to trigger the vruntime sync.
> +*/
> +   rq->core->core_sync_seq++;
> +   }
> +   rq->core->core_active = new_active;
core_active seems to be unused.

> +bool cfs_prio_less(struct task_struct *a, struct task_struct *b)
> +{
> +   struct sched_entity *se_a = &a->se, *se_b = &b->se;
> +   struct cfs_rq *cfs_rq_a, *cfa_rq_b;
> +   u64 vruntime_a, vruntime_b;
> +
> +   while (!is_same_tg(se_a, se_b)) {
> +   int se_a_depth = se_a->depth;
> +   int se_b_depth = se_b->depth;
> +
> +   if (se_a_depth <= se_b_depth)
> +   se_b = parent_entity(se_b);
> +   if (se_a_depth >= se_b_depth)
> +   se_a = parent_entity(se_a);
> +   }
> +
> +   cfs_rq_a = cfs_rq_of(se_a);
> +   cfs_rq_b = cfs_rq_of(se_b);
> +
> +   vruntime_a = se_a->vruntime - cfs_rq_a->core_vruntime;
> +   vruntime_b = se_b->vruntime - cfs_rq_b->core_vruntime;
Should we be using core_vruntime conditionally? should it be min_vruntime for
default comparisons and core_vruntime during force_idle?

Thanks,
Vineeth


Re: [PATCH updated v2] sched/fair: core wide cfs task priority comparison

2020-05-14 Thread Peter Zijlstra
On Fri, May 08, 2020 at 08:34:57PM +0800, Aaron Lu wrote:
> With this said, I realized a workaround for the issue described above:
> when the core went from 'compatible mode'(step 1-3) to 'incompatible
> mode'(step 4), reset all root level sched entities' vruntime to be the
> same as the core wide min_vruntime. After all, the core is transforming
> from two runqueue mode to single runqueue mode... I think this can solve
> the issue to some extent but I may miss other scenarios.

A little something like so, this syncs min_vruntime when we switch to
single queue mode. This is very much SMT2 only, I got my head in twist
when thinking about more siblings, I'll have to try again later.

This very much retains the horrible approximation of S we always do.

Also, it is _completely_ untested...

--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -102,7 +102,6 @@ static inline int __task_prio(struct tas
 /* real prio, less is less */
 static inline bool prio_less(struct task_struct *a, struct task_struct *b)
 {
-
int pa = __task_prio(a), pb = __task_prio(b);
 
if (-pa < -pb)
@@ -114,19 +113,8 @@ static inline bool prio_less(struct task
if (pa == -1) /* dl_prio() doesn't work because of stop_class above */
return !dl_time_before(a->dl.deadline, b->dl.deadline);
 
-   if (pa == MAX_RT_PRIO + MAX_NICE)  { /* fair */
-   u64 vruntime = b->se.vruntime;
-
-   /*
-* Normalize the vruntime if tasks are in different cpus.
-*/
-   if (task_cpu(a) != task_cpu(b)) {
-   vruntime -= task_cfs_rq(b)->min_vruntime;
-   vruntime += task_cfs_rq(a)->min_vruntime;
-   }
-
-   return !((s64)(a->se.vruntime - vruntime) <= 0);
-   }
+   if (pa == MAX_RT_PRIO + MAX_NICE)
+   return cfs_prio_less(a, b);
 
return false;
 }
@@ -4293,10 +4281,11 @@ static struct task_struct *
 pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 {
struct task_struct *next, *max = NULL;
+   int old_active = 0, new_active = 0;
const struct sched_class *class;
const struct cpumask *smt_mask;
-   int i, j, cpu;
bool need_sync = false;
+   int i, j, cpu;
 
cpu = cpu_of(rq);
if (cpu_is_offline(cpu))
@@ -4349,10 +4338,14 @@ pick_next_task(struct rq *rq, struct tas
rq_i->core_pick = NULL;
 
if (rq_i->core_forceidle) {
+   // XXX is_idle_task(rq_i->curr) && rq_i->nr_running ??
need_sync = true;
rq_i->core_forceidle = false;
}
 
+   if (!is_idle_task(rq_i->curr))
+   old_active++;
+
if (i != cpu)
update_rq_clock(rq_i);
}
@@ -4463,8 +4456,12 @@ next_class:;
 
WARN_ON_ONCE(!rq_i->core_pick);
 
-   if (is_idle_task(rq_i->core_pick) && rq_i->nr_running)
-   rq_i->core_forceidle = true;
+   if (is_idle_task(rq_i->core_pick)) {
+   if (rq_i->nr_running)
+   rq_i->core_forceidle = true;
+   } else {
+   new_active++;
+   }
 
if (i == cpu)
continue;
@@ -4476,6 +4473,16 @@ next_class:;
WARN_ON_ONCE(!cookie_match(next, rq_i->core_pick));
}
 
+   /* XXX SMT2 only */
+   if (new_active == 1 && old_active > 1) {
+   /*
+* We just dropped into single-rq mode, increment the sequence
+* count to trigger the vruntime sync.
+*/
+   rq->core->core_sync_seq++;
+   }
+   rq->core->core_active = new_active;
+
 done:
set_next_task(rq, next);
return next;
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -386,6 +386,12 @@ is_same_group(struct sched_entity *se, s
return NULL;
 }
 
+static inline bool
+is_same_tg(struct sched_entity *se, struct sched_entity *pse)
+{
+   return se->cfs_rq->tg == pse->cfs_rq->tg;
+}
+
 static inline struct sched_entity *parent_entity(struct sched_entity *se)
 {
return se->parent;
@@ -394,8 +400,6 @@ static inline struct sched_entity *paren
 static void
 find_matching_se(struct sched_entity **se, struct sched_entity **pse)
 {
-   int se_depth, pse_depth;
-
/*
 * preemption test can be made between sibling entities who are in the
 * same cfs_rq i.e who have a common parent. Walk up the hierarchy of
@@ -403,23 +407,16 @@ find_matching_se(struct sched_entity **s
 * parent.
 */
 
-   /* First walk up until both entities are at same depth */
-   se_depth = (*se)->depth;
-   pse_depth = (*pse)->depth;
-
-   while (se_depth > pse_depth) {
-   se_depth--;
-   *se = parent

Re: [PATCH updated v2] sched/fair: core wide cfs task priority comparison

2020-05-08 Thread Aaron Lu
On Fri, May 08, 2020 at 11:09:25AM +0200, Peter Zijlstra wrote:
> On Fri, May 08, 2020 at 04:44:19PM +0800, Aaron Lu wrote:
> > On Wed, May 06, 2020 at 04:35:06PM +0200, Peter Zijlstra wrote:
> 
> > > Aside from this being way too complicated for what it does -- you
> > > could've saved the min_vruntime for each rq and compared them with
> > > subtraction -- it is also terminally broken afaict.
> > >
> > > Consider any infeasible weight scenario. Take for instance two tasks,
> > > each bound to their respective sibling, one with weight 1 and one with
> > > weight 2. Then the lower weight task will run ahead of the higher weight
> > > task without bound.
> > 
> > I don't follow how this could happen. Even if the lower weight task runs
> > first, after some time the higher weight task will get its turn, and
> > from then on the higher weight task will get more chances to run (due to
> > its higher weight and thus slower accumulation of vruntime).
> 
> That seems to assume they're mutually exclusive. In that case, as I
> argued, we only have a single runqueue and then yes it works. But if
> they're not exclusive, and can run concurrently, it comes apart.

Ah right, now I see what you mean. Sorry for the misunderstanding.

And yes, that 'utterly destroys the concept of a shared time base' and
then bad things can happen:
1) two same-tagged tasks (t1 and t2) running on two siblings, with t1's
   weight lower than t2's;
2) both tasks are cpu intensive;
3) over time, the lower weight task (t1)'s vruntime grows ever further
   ahead of t2's vruntime, and the core wide min_vruntime is the same as
   t1's vruntime per this patch (see the toy illustration right after
   this list);
4) a new task gets enqueued on the same sibling as t1; if the new task
   has an incompatible tag, it will be starved by t2 because t2's
   vruntime is way smaller than the core wide min_vruntime.
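
(Toy illustration of step 3, not kernel code; the NICE_0 weight value and
the per-tick delta_exec are arbitrary. vruntime advances roughly as
delta_exec * NICE_0_WEIGHT / weight, so t1's vruntime runs away from t2's
when both are always running.)

#include <stdio.h>

#define NICE_0_WEIGHT 1024ULL

int main(void)
{
	unsigned long long v1 = 0, v2 = 0;	/* t1 weight 1, t2 weight 2 */
	unsigned long long delta_exec = 1000;	/* arbitrary runtime per tick */
	int t;

	for (t = 1; t <= 5; t++) {
		v1 += delta_exec * NICE_0_WEIGHT / 1;	/* lower weight: fast vruntime */
		v2 += delta_exec * NICE_0_WEIGHT / 2;	/* higher weight: slow vruntime */
		printf("tick %d: t1 vruntime=%llu t2 vruntime=%llu gap=%llu\n",
		       t, v1, v2, v1 - v2);
	}
	return 0;
}

The gap grows every tick, so a core wide min_vruntime that follows t1 drifts
arbitrarily far ahead of t2's vruntime, which is what bites in step 4.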

With this said, I realized a workaround for the issue described above:
when the core went from 'compatible mode'(step 1-3) to 'incompatible
mode'(step 4), reset all root level sched entities' vruntime to be the
same as the core wide min_vruntime. After all, the core is transforming
from two runqueue mode to single runqueue mode... I think this can solve
the issue to some extent but I may miss other scenarios.

I'll also re-read your last email about the 'lag' idea.


Re: [PATCH updated v2] sched/fair: core wide cfs task priority comparison

2020-05-08 Thread Peter Zijlstra
On Fri, May 08, 2020 at 04:44:19PM +0800, Aaron Lu wrote:
> On Wed, May 06, 2020 at 04:35:06PM +0200, Peter Zijlstra wrote:

> > Aside from this being way too complicated for what it does -- you
> > could've saved the min_vruntime for each rq and compared them with
> > subtraction -- it is also terminally broken afaict.
> >
> > Consider any infeasible weight scenario. Take for instance two tasks,
> > each bound to their respective sibling, one with weight 1 and one with
> > weight 2. Then the lower weight task will run ahead of the higher weight
> > task without bound.
> 
> I don't follow how this could happen. Even if the lower weight task runs
> first, after some time the higher weight task will get its turn, and
> from then on the higher weight task will get more chances to run (due to
> its higher weight and thus slower accumulation of vruntime).

That seems to assume they're mutually exclusive. In that case, as I
argued, we only have a single runqueue and then yes it works. But if
they're not exclusive, and can run concurrently, it comes apart.



Re: [PATCH updated v2] sched/fair: core wide cfs task priority comparison

2020-05-08 Thread Aaron Lu
On Wed, May 06, 2020 at 04:35:06PM +0200, Peter Zijlstra wrote:
> 
> Sorry for being verbose; I've been procrastinating replying, and in
> doing so the things I wanted to say kept growing.
> 
> On Fri, Apr 24, 2020 at 10:24:43PM +0800, Aaron Lu wrote:
> 
> > To make this work, the root level sched entities' vruntime of the two
> > threads must be directly comparable. So one hyperthread's root
> > cfs_rq's min_vruntime is chosen as the core wide one and all root level
> > sched entities' vruntime is normalized against it.
> 
> > +/*
> > + * This is called in stop machine context so no need to take the rq lock.
> > + *
> > + * Core scheduling is going to be enabled and the root level sched entities
> > + * of both siblings will use cfs_rq->min_vruntime as the common cfs_rq
> > + * min_vruntime, so it's necessary to normalize vruntime of existing root
> > + * level sched entities in sibling_cfs_rq.
> > + *
> > + * Update of sibling_cfs_rq's min_vruntime isn't necessary as we will be
> > + * only using cfs_rq->min_vruntime during the entire run of core 
> > scheduling.
> > + */
> > +void sched_core_normalize_se_vruntime(int cpu)
> > +{
> > +   struct cfs_rq *cfs_rq = &cpu_rq(cpu)->cfs;
> > +   int i;
> > +
> > +   for_each_cpu(i, cpu_smt_mask(cpu)) {
> > +   struct sched_entity *se, *next;
> > +   struct cfs_rq *sibling_cfs_rq;
> > +   s64 delta;
> > +
> > +   if (i == cpu)
> > +   continue;
> > +
> > +   sibling_cfs_rq = &cpu_rq(i)->cfs;
> > +   if (!sibling_cfs_rq->nr_running)
> > +   continue;
> > +
> > +   delta = cfs_rq->min_vruntime - sibling_cfs_rq->min_vruntime;
> > +   rbtree_postorder_for_each_entry_safe(se, next,
> > +   &sibling_cfs_rq->tasks_timeline.rb_root,
> > +   run_node) {
> > +   se->vruntime += delta;
> > +   }
> > +   }
> > +}
> 
> Aside from this being way too complicated for what it does -- you
> could've saved the min_vruntime for each rq and compared them with
> subtraction -- it is also terminally broken afaict.
>
> Consider any infeasible weight scenario. Take for instance two tasks,
> each bound to their respective sibling, one with weight 1 and one with
> weight 2. Then the lower weight task will run ahead of the higher weight
> task without bound.

I don't follow how this could happen. Even if the lower weight task runs
first, after some time the higher weight task will get its turn, and
from then on the higher weight task will get more chances to run (due to
its higher weight and thus slower accumulation of vruntime).

We used to have the following patch as a standalone one in v4:
sched/fair : Wake up forced idle siblings if needed
https://lore.kernel.org/lkml/cover.1572437285.git.vpil...@digitalocean.com/T/#md22d25d0e2932d059013e9b56600d8a847b02a13
Which originates from:
https://lore.kernel.org/lkml/20190725143344.GD992@aaronlu/

And in this series, it seems to be merged in:
[RFC PATCH 07/13] sched: Add core wide task selection and scheduling
https://lore.kernel.org/lkml/e942da7fd881977923463f19648085c1bfaa37f8.1583332765.git.vpil...@digitalocean.com/

My local test shows that when the two cgroups' shares are both set to 1024,
each cgroup is bound to one sibling of a core, and a cpu intensive task is
started in each cgroup, then the two cpu intensive tasks each consume 50%
cpu. When one cgroup's share is set to 512, it consumes about 33% while the
other consumes 67%, as expected.

I think the current patch works fine when 2 differently tagged tasks are
competing for CPU, but when there are 3 tasks or more, things can get less
fair.


Re: [PATCH updated v2] sched/fair: core wide cfs task priority comparison

2020-05-06 Thread Peter Zijlstra


Sorry for being verbose; I've been procrastinating replying, and in
doing so the things I wanted to say kept growing.

On Fri, Apr 24, 2020 at 10:24:43PM +0800, Aaron Lu wrote:

> To make this work, the root level sched entities' vruntime of the two
> threads must be directly comparable. So one hyperthread's root
> cfs_rq's min_vruntime is chosen as the core wide one and all root level
> sched entities' vruntime is normalized against it.

> +/*
> + * This is called in stop machine context so no need to take the rq lock.
> + *
> + * Core scheduling is going to be enabled and the root level sched entities
> + * of both siblings will use cfs_rq->min_vruntime as the common cfs_rq
> + * min_vruntime, so it's necessary to normalize vruntime of existing root
> + * level sched entities in sibling_cfs_rq.
> + *
> + * Update of sibling_cfs_rq's min_vruntime isn't necessary as we will be
> + * only using cfs_rq->min_vruntime during the entire run of core scheduling.
> + */
> +void sched_core_normalize_se_vruntime(int cpu)
> +{
> + struct cfs_rq *cfs_rq = &cpu_rq(cpu)->cfs;
> + int i;
> +
> + for_each_cpu(i, cpu_smt_mask(cpu)) {
> + struct sched_entity *se, *next;
> + struct cfs_rq *sibling_cfs_rq;
> + s64 delta;
> +
> + if (i == cpu)
> + continue;
> +
> + sibling_cfs_rq = &cpu_rq(i)->cfs;
> + if (!sibling_cfs_rq->nr_running)
> + continue;
> +
> + delta = cfs_rq->min_vruntime - sibling_cfs_rq->min_vruntime;
> + rbtree_postorder_for_each_entry_safe(se, next,
> + &sibling_cfs_rq->tasks_timeline.rb_root,
> + run_node) {
> + se->vruntime += delta;
> + }
> + }
> +}

Aside from this being way too complicated for what it does -- you
could've saved the min_vruntime for each rq and compared them with
subtraction -- it is also terminally broken afaict.

Consider any infeasible weight scenario. Take for instance two tasks,
each bound to their respective sibling, one with weight 1 and one with
weight 2. Then the lower weight task will run ahead of the higher weight
task without bound.

This utterly destroys the concept of a shared time base.

Remember: all this is about proportionally fair scheduling, where each
task receives:

  dt_i = (w_i / \Sum_j w_j) dt                         (1)

which we do by tracking a virtual time, s_i:

  s_i = (1 / w_i) d[t]_i                               (2)

Where d[t] is a delta of discrete time, while dt is an infinitesimal.
The immediate corollary is that the ideal schedule S, i.e. (2) with an
infinitesimal delta, is:

  S = (1 / \Sum_i w_i) dt                              (3)

From which we can define the lag, or deviation from the ideal, as:

  lag(i) = S - s_i (4)

And since the one and only purpose is to approximate S, we get that:

  \Sum_i w_i lag(i) := 0   (5)

If this were not so, we no longer converge to S, and we can no longer
claim our scheduler has any of the properties we derive from S. This is
exactly what you did above, you broke it!


Let's continue for a while though, to see if there is anything useful to
be learned. We can combine (1)-(3) or (4)-(5) and express S in s_i:

  S = \Sum_i w_i s_i / \Sum_i w_i                      (6)

Which gives us a way to compute S, given our s_i. Now, if you've read
our code, you know that we do not in fact do this; the reason is
two-fold. Firstly, computing S in that way requires a 64bit division
every time we'd use it (see 12), and secondly, this only describes the
steady state, it doesn't handle dynamics.
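
Just to make (6) concrete, the naive evaluation would be something like the
below (illustration only; overflow and all dynamics ignored, not proposed
code):

#include <stdint.h>

struct entity {
	uint64_t weight;	/* w_i */
	uint64_t vtime;		/* s_i */
};

/* Direct evaluation of (6): S = \Sum_i w_i s_i / \Sum_i w_i */
static uint64_t avg_vtime(const struct entity *e, int nr)
{
	uint64_t sum_ws = 0, sum_w = 0;
	int i;

	for (i = 0; i < nr; i++) {
		sum_ws += e[i].weight * e[i].vtime;
		sum_w += e[i].weight;
	}

	/* the 64bit division that makes this unattractive in a fast path */
	return sum_w ? sum_ws / sum_w : 0;
}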

Anyway, in (6):  s_i -> x + (s_i - x), to get:

  S - x = \Sum_i w_i (s_i - x) / \Sum_i w_i            (7)

Which shows that S and s_i transform alike (which makes perfect sense
given that S is basically the (weighted) average of s_i).

Then:

  x -> s_min := min{s_i}   (8)

to obtain:

  S = s_min + \Sum_i w_i (s_i - s_min) / \Sum_i w_i    (9)

Which already looks familiar, and is the basis for our current
approximation:

  S ~= s_min  (10)

Now, obviously, (10) is absolute crap :-), but it sorta works.

So the thing to remember is that the above is strictly UP. It is
possible to generalize to multiple runqueues -- however it gets really
yuck when you have to add affinity support, as illustrated by our very
first counter-example.

XXX segue into the load-balance issues related to this:

  - how a negative lag task on a 'heavy' runqueue should not
remai