Re: [RFC PATCH v2 13/17] sched: Add core wide task selection and scheduling.

2019-05-21 Thread Vineeth Pillai
> > Please try this and see how it compares with the vanilla v2. I think it's
> > time for a v3 now, and we shall be posting it soon after some more
> > testing and benchmarking.
>
> Are there any potential changes between pre-v3 and v3? I prefer working
> based on v3 so that everyone is on the same page.
>
Makes sense, testing can wait until v3 is posted. I don't expect many
changes from the above, but it's better to test on the posted v3.

Thanks,


Re: [RFC PATCH v2 13/17] sched: Add core wide task selection and scheduling.

2019-05-21 Thread Aubrey Li
On Mon, May 20, 2019 at 10:04 PM Vineeth Pillai wrote:
>
> > > The following patch improved my test cases.
> > > Comments are welcome.
> > >
> >
> > This is certainly better than violating the point of the core scheduler :)
> >
> > If I'm understanding this right, what will happen in this case is that instead
> > of using the idle process selected by the sibling, we do the core scheduling
> > again. This may start with a newidle_balance, which might bring over something
> > to run that matches what we want to put on the sibling. If that works, then I
> > can see this helping.
> >
> > But I'd be a little concerned that we could end up thrashing. Once we do core
> > scheduling again here, we'd force the sibling to resched, and if we got a
> > different result which "helped" him pick idle, we'd go around again.

Thrashing means more IPIs, right? That's not what I observed: because the idle
task has less chance to get onto the CPU, rescheduling is reduced accordingly.

> > I think inherent in the concept of core scheduling (barring a perfectly
> > aligned set of jobs) is some extra idle time on siblings.

Yeah, I understand and agree with this, but 10-15% idle time on an overloaded
system makes me try to figure out how this could happen and whether we can
improve it.

> >
> >
> I was also thinking along the same lines. This change basically always
> tries to avoid idle, thereby constantly interrupting the sibling.
> While this change might benefit a very small subset of workloads, it
> might introduce thrashing more often.

Thrashing is not observed in the overloaded case, but it may happen under a
light or medium load; I need to investigate more.

>
> One other reason you might be seeing a performance improvement is the
> bugs that caused both siblings to go idle even though there were runnable
> and compatible threads in the queue. Most of those issues have been fixed
> based on the feedback received on v2. We have a GitHub repo with the
> pre-v3 changes here:
> https://github.com/digitalocean/linux-coresched/tree/coresched

Okay, thanks. It looks like the core functions pick_next_task() and pick_task()
have a lot of changes compared to v2. This needs more brain power...

>
> Please try this and see how it compares with the vanilla v2. I think it's
> time for a v3 now, and we shall be posting it soon after some more
> testing and benchmarking.

Are there any potential changes between pre-v3 and v3? I prefer working
based on v3 so that everyone is on the same page.

Thanks,
-Aubrey


Re: [RFC PATCH v2 13/17] sched: Add core wide task selection and scheduling.

2019-05-20 Thread Vineeth Pillai
> > The following patch improved my test cases.
> > Comments are welcome.
> >
>
> This is certainly better than violating the point of the core scheduler :)
>
> If I'm understanding this right, what will happen in this case is that instead
> of using the idle process selected by the sibling, we do the core scheduling
> again. This may start with a newidle_balance, which might bring over something
> to run that matches what we want to put on the sibling. If that works, then I
> can see this helping.
>
> But I'd be a little concerned that we could end up thrashing. Once we do core
> scheduling again here, we'd force the sibling to resched, and if we got a
> different result which "helped" him pick idle, we'd go around again.
>
> I think inherent in the concept of core scheduling (barring a perfectly
> aligned set of jobs) is some extra idle time on siblings.
>
I was also thinking along the same lines. This change basically always
tries to avoid idle, thereby constantly interrupting the sibling.
While this change might benefit a very small subset of workloads, it
might introduce thrashing more often.

One other reason you might be seeing a performance improvement is the
bugs that caused both siblings to go idle even though there were runnable
and compatible threads in the queue. Most of those issues have been fixed
based on the feedback received on v2. We have a GitHub repo with the
pre-v3 changes here:
https://github.com/digitalocean/linux-coresched/tree/coresched

Please try this and see how it compares with the vanilla v2. I think it's
time for a v3 now, and we shall be posting it soon after some more
testing and benchmarking.

Thanks,


Re: [RFC PATCH v2 13/17] sched: Add core wide task selection and scheduling.

2019-05-20 Thread Phil Auld
On Sat, May 18, 2019 at 11:37:56PM +0800 Aubrey Li wrote:
> On Wed, Apr 24, 2019 at 12:18 AM Vineeth Remanan Pillai wrote:
> >
> > From: Peter Zijlstra (Intel) 
> >
> > Instead of only selecting a local task, select a task for all SMT
> > siblings for every reschedule on the core (irrespective which logical
> > CPU does the reschedule).
> >
> > NOTE: there is still potential for siblings rivalry.
> > NOTE: this is far too complicated; but thus far I've failed to
> >   simplify it further.
> >
> > Signed-off-by: Peter Zijlstra (Intel) 
> > ---
> >  kernel/sched/core.c  | 222 ++-
> >  kernel/sched/sched.h |   5 +-
> >  2 files changed, 224 insertions(+), 3 deletions(-)
> >
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index e5bdc1c4d8d7..9e6e90c6f9b9 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -3574,7 +3574,7 @@ static inline void schedule_debug(struct task_struct *prev)
> >   * Pick up the highest-prio task:
> >   */
> >  static inline struct task_struct *
> > -pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
> > +__pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
> >  {
> > const struct sched_class *class;
> > struct task_struct *p;
> > @@ -3619,6 +3619,220 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
> > BUG();
> >  }
> >
> > +#ifdef CONFIG_SCHED_CORE
> > +
> > +static inline bool cookie_match(struct task_struct *a, struct task_struct *b)
> > +{
> > +   if (is_idle_task(a) || is_idle_task(b))
> > +   return true;
> > +
> > +   return a->core_cookie == b->core_cookie;
> > +}
> > +
> > +// XXX fairness/fwd progress conditions
> > +static struct task_struct *
> > +pick_task(struct rq *rq, const struct sched_class *class, struct task_struct *max)
> > +{
> > +   struct task_struct *class_pick, *cookie_pick;
> > +   unsigned long cookie = 0UL;
> > +
> > +   /*
> > +* We must not rely on rq->core->core_cookie here, because we fail to reset
> > +* rq->core->core_cookie on new picks, such that we can detect if we need
> > +* to do single vs multi rq task selection.
> > +*/
> > +
> > +   if (max && max->core_cookie) {
> > +   WARN_ON_ONCE(rq->core->core_cookie != max->core_cookie);
> > +   cookie = max->core_cookie;
> > +   }
> > +
> > +   class_pick = class->pick_task(rq);
> > +   if (!cookie)
> > +   return class_pick;
> > +
> > +   cookie_pick = sched_core_find(rq, cookie);
> > +   if (!class_pick)
> > +   return cookie_pick;
> > +
> > +   /*
> > +* If class > max && class > cookie, it is the highest priority task on
> > +* the core (so far) and it must be selected, otherwise we must go with
> > +* the cookie pick in order to satisfy the constraint.
> > +*/
> > +   if (cpu_prio_less(cookie_pick, class_pick) && core_prio_less(max, class_pick))
> > +   return class_pick;
> > +
> > +   return cookie_pick;
> > +}
> > +
> > +static struct task_struct *
> > +pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
> > +{
> > +   struct task_struct *next, *max = NULL;
> > +   const struct sched_class *class;
> > +   const struct cpumask *smt_mask;
> > +   int i, j, cpu;
> > +
> > +   if (!sched_core_enabled(rq))
> > +   return __pick_next_task(rq, prev, rf);
> > +
> > +   /*
> > +* If there were no {en,de}queues since we picked (IOW, the task
> > +* pointers are all still valid), and we haven't scheduled the last
> > +* pick yet, do so now.
> > +*/
> > +   if (rq->core->core_pick_seq == rq->core->core_task_seq &&
> > +   rq->core->core_pick_seq != rq->core_sched_seq) {
> > +   WRITE_ONCE(rq->core_sched_seq, rq->core->core_pick_seq);
> > +
> > +   next = rq->core_pick;
> > +   if (next != prev) {
> > +   put_prev_task(rq, prev);
> > +   set_next_task(rq, next);
> > +   }
> > +   return next;
> > +   }
> > +
> 
> The following patch improved my test cases.
> Comments are welcome.
> 

This is certainly better than violating the point of the core scheduler :)

If I'm understanding this right, what will happen in this case is that instead
of using the idle process selected by the sibling, we do the core scheduling
again. This may start with a newidle_balance, which might bring over something
to run that matches what we want to put on the sibling. If that works, then I
can see this helping.
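
Roughly, the path I understand we'd take when the cached pick is idle looks like
this (a sketch only, using the helpers from this patch series; not the actual diff):

	/*
	 * Sketch: when the sibling picked idle for us, skip the cached-pick
	 * fast path and redo the core-wide selection instead.
	 */
	if (rq->core->core_pick_seq == rq->core->core_task_seq &&
	    rq->core->core_pick_seq != rq->core_sched_seq &&
	    !is_idle_task(rq->core_pick)) {
		/* reuse the pick the sibling made on our behalf */
	} else {
		prev->sched_class->put_prev_task(rq, prev, rf);
		if (!rq->nr_running)
			newidle_balance(rq, rf);	/* may pull a compatible task */
		/* ... the full core-wide pick runs again and can resched the sibling */
	}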

But I'd be a little concerned that we could end up thrashing. Once we do core
scheduling again here, we'd force the sibling to resched, and if we got a
different result which "helped" him pick idle, we'd go around again.

I think inherent in the concept of core scheduling (barring a perfectly
aligned set of jobs) is some extra idle time on siblings.

Re: [RFC PATCH v2 13/17] sched: Add core wide task selection and scheduling.

2019-05-18 Thread Aubrey Li
On Wed, Apr 24, 2019 at 12:18 AM Vineeth Remanan Pillai wrote:
>
> From: Peter Zijlstra (Intel) 
>
> Instead of only selecting a local task, select a task for all SMT
> siblings for every reschedule on the core (irrespective which logical
> CPU does the reschedule).
>
> NOTE: there is still potential for siblings rivalry.
> NOTE: this is far too complicated; but thus far I've failed to
>   simplify it further.
>
> Signed-off-by: Peter Zijlstra (Intel) 
> ---
>  kernel/sched/core.c  | 222 ++-
>  kernel/sched/sched.h |   5 +-
>  2 files changed, 224 insertions(+), 3 deletions(-)
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index e5bdc1c4d8d7..9e6e90c6f9b9 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -3574,7 +3574,7 @@ static inline void schedule_debug(struct task_struct *prev)
>   * Pick up the highest-prio task:
>   */
>  static inline struct task_struct *
> -pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
> +__pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
>  {
> const struct sched_class *class;
> struct task_struct *p;
> @@ -3619,6 +3619,220 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
> BUG();
>  }
>
> +#ifdef CONFIG_SCHED_CORE
> +
> +static inline bool cookie_match(struct task_struct *a, struct task_struct *b)
> +{
> +   if (is_idle_task(a) || is_idle_task(b))
> +   return true;
> +
> +   return a->core_cookie == b->core_cookie;
> +}
> +
> +// XXX fairness/fwd progress conditions
> +static struct task_struct *
> +pick_task(struct rq *rq, const struct sched_class *class, struct task_struct *max)
> +{
> +   struct task_struct *class_pick, *cookie_pick;
> +   unsigned long cookie = 0UL;
> +
> +   /*
> +* We must not rely on rq->core->core_cookie here, because we fail to reset
> +* rq->core->core_cookie on new picks, such that we can detect if we need
> +* to do single vs multi rq task selection.
> +*/
> +
> +   if (max && max->core_cookie) {
> +   WARN_ON_ONCE(rq->core->core_cookie != max->core_cookie);
> +   cookie = max->core_cookie;
> +   }
> +
> +   class_pick = class->pick_task(rq);
> +   if (!cookie)
> +   return class_pick;
> +
> +   cookie_pick = sched_core_find(rq, cookie);
> +   if (!class_pick)
> +   return cookie_pick;
> +
> +   /*
> +* If class > max && class > cookie, it is the highest priority task on
> +* the core (so far) and it must be selected, otherwise we must go with
> +* the cookie pick in order to satisfy the constraint.
> +*/
> +   if (cpu_prio_less(cookie_pick, class_pick) && core_prio_less(max, class_pick))
> +   return class_pick;
> +
> +   return cookie_pick;
> +}
> +
> +static struct task_struct *
> +pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
> +{
> +   struct task_struct *next, *max = NULL;
> +   const struct sched_class *class;
> +   const struct cpumask *smt_mask;
> +   int i, j, cpu;
> +
> +   if (!sched_core_enabled(rq))
> +   return __pick_next_task(rq, prev, rf);
> +
> +   /*
> +* If there were no {en,de}queues since we picked (IOW, the task
> +* pointers are all still valid), and we haven't scheduled the last
> +* pick yet, do so now.
> +*/
> +   if (rq->core->core_pick_seq == rq->core->core_task_seq &&
> +   rq->core->core_pick_seq != rq->core_sched_seq) {
> +   WRITE_ONCE(rq->core_sched_seq, rq->core->core_pick_seq);
> +
> +   next = rq->core_pick;
> +   if (next != prev) {
> +   put_prev_task(rq, prev);
> +   set_next_task(rq, next);
> +   }
> +   return next;
> +   }
> +

The following patch improved my test cases.
Comments are welcome.

Thanks,
-Aubrey

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 3e3162f..86031f4 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3685,10 +3685,12 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
/*
 * If there were no {en,de}queues since we picked (IOW, the task
 * pointers are all still valid), and we haven't scheduled the last
-* pick yet, do so now.
+* pick yet, do so now. If the last pick is idle task, we abandon
+* last pick and try to pick up task this time.
 */
if (rq->core->core_pick_seq == rq->core->core_task_seq &&
-   rq->core->core_pick_seq != rq->core_sched_seq) {
+   rq->core->core_pick_seq != rq->core_sched_seq &&
+   !is_idle_task(rq->core_pick)) {
WRITE_ONCE(rq->core_sched_seq, rq->core->core_pick_seq);

next = 
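
For clarity, the fast path with this change applied would read roughly as follows
(reconstructed from the v2 hunk quoted above; a sketch, not the literal patch):

	/*
	 * If there were no {en,de}queues since we picked (IOW, the task
	 * pointers are all still valid), and we haven't scheduled the last
	 * pick yet, do so now. If the last pick is the idle task, abandon
	 * it and try to pick up a task this time.
	 */
	if (rq->core->core_pick_seq == rq->core->core_task_seq &&
	    rq->core->core_pick_seq != rq->core_sched_seq &&
	    !is_idle_task(rq->core_pick)) {
		WRITE_ONCE(rq->core_sched_seq, rq->core->core_pick_seq);

		next = rq->core_pick;
		if (next != prev) {
			put_prev_task(rq, prev);
			set_next_task(rq, next);
		}
		return next;
	}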

Re: [RFC PATCH v2 13/17] sched: Add core wide task selection and scheduling.

2019-04-29 Thread Aaron Lu
On Tue, Apr 23, 2019 at 04:18:18PM +, Vineeth Remanan Pillai wrote:
> +// XXX fairness/fwd progress conditions
> +static struct task_struct *
> +pick_task(struct rq *rq, const struct sched_class *class, struct task_struct *max)
> +{
> + struct task_struct *class_pick, *cookie_pick;
> + unsigned long cookie = 0UL;
> +
> + /*
> +  * We must not rely on rq->core->core_cookie here, because we fail to reset
> +  * rq->core->core_cookie on new picks, such that we can detect if we need
> +  * to do single vs multi rq task selection.
> +  */
> +
> + if (max && max->core_cookie) {
> + WARN_ON_ONCE(rq->core->core_cookie != max->core_cookie);
> + cookie = max->core_cookie;
> + }
> +
> + class_pick = class->pick_task(rq);
> + if (!cookie)
> + return class_pick;
> +
> + cookie_pick = sched_core_find(rq, cookie);
> + if (!class_pick)
> + return cookie_pick;
> +
> + /*
> +  * If class > max && class > cookie, it is the highest priority task on
> +  * the core (so far) and it must be selected, otherwise we must go with
> +  * the cookie pick in order to satisfy the constraint.
> +  */
> + if (cpu_prio_less(cookie_pick, class_pick) && core_prio_less(max, class_pick))

It appears to me that the cpu_prio_less(cookie_pick, class_pick) check isn't
needed.

If cookie_pick is the idle task, then cpu_prio_less(cookie_pick, class_pick)
is always true;
If cookie_pick is not the idle task and has the same sched class as
class_pick, then class_pick is the best candidate to run according to
their sched class. In this case, cpu_prio_less(cookie_pick, class_pick)
shouldn't return false, or it feels like a bug;
If cookie_pick is not the idle task and has a different sched class from
class_pick:
 - if cookie_pick's sched class has higher priority than class_pick's
   sched class, then cookie_pick should have been selected in a previous
   sched class iteration; and since its cookie matches max,
   everything should have been finished already;
 - if cookie_pick's sched class has lower priority than class_pick's
   sched class, then cpu_prio_less(cookie_pick, class_pick) will still
   return true.

So it looks like cpu_prio_less(cookie_pick, class_pick) should always
return true and is thus not needed; see the sketch after the quoted code below.

> + return class_pick;
> +
> + return cookie_pick;
> +}
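
If that reasoning holds, the tail of pick_task() could then be simplified to
something like this (just a sketch of the idea, not a tested patch):

	/*
	 * Assuming cpu_prio_less(cookie_pick, class_pick) is always true by
	 * this point, only the comparison against max is needed to decide
	 * whether class_pick may override the cookie constraint.
	 */
	if (core_prio_less(max, class_pick))
		return class_pick;

	return cookie_pick;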


[RFC PATCH v2 13/17] sched: Add core wide task selection and scheduling.

2019-04-23 Thread Vineeth Remanan Pillai
From: Peter Zijlstra (Intel) 

Instead of only selecting a local task, select a task for all SMT
siblings for every reschedule on the core (irrespective which logical
CPU does the reschedule).

NOTE: there is still potential for siblings rivalry.
NOTE: this is far too complicated; but thus far I've failed to
  simplify it further.

Signed-off-by: Peter Zijlstra (Intel) 
---
 kernel/sched/core.c  | 222 ++-
 kernel/sched/sched.h |   5 +-
 2 files changed, 224 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index e5bdc1c4d8d7..9e6e90c6f9b9 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3574,7 +3574,7 @@ static inline void schedule_debug(struct task_struct *prev)
  * Pick up the highest-prio task:
  */
 static inline struct task_struct *
-pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
+__pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 {
const struct sched_class *class;
struct task_struct *p;
@@ -3619,6 +3619,220 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
BUG();
 }
 
+#ifdef CONFIG_SCHED_CORE
+
+static inline bool cookie_match(struct task_struct *a, struct task_struct *b)
+{
+   if (is_idle_task(a) || is_idle_task(b))
+   return true;
+
+   return a->core_cookie == b->core_cookie;
+}
+
+// XXX fairness/fwd progress conditions
+static struct task_struct *
+pick_task(struct rq *rq, const struct sched_class *class, struct task_struct *max)
+{
+   struct task_struct *class_pick, *cookie_pick;
+   unsigned long cookie = 0UL;
+
+   /*
+* We must not rely on rq->core->core_cookie here, because we fail to reset
+* rq->core->core_cookie on new picks, such that we can detect if we need
+* to do single vs multi rq task selection.
+*/
+
+   if (max && max->core_cookie) {
+   WARN_ON_ONCE(rq->core->core_cookie != max->core_cookie);
+   cookie = max->core_cookie;
+   }
+
+   class_pick = class->pick_task(rq);
+   if (!cookie)
+   return class_pick;
+
+   cookie_pick = sched_core_find(rq, cookie);
+   if (!class_pick)
+   return cookie_pick;
+
+   /*
+* If class > max && class > cookie, it is the highest priority task on
+* the core (so far) and it must be selected, otherwise we must go with
+* the cookie pick in order to satisfy the constraint.
+*/
+   if (cpu_prio_less(cookie_pick, class_pick) && core_prio_less(max, class_pick))
+   return class_pick;
+
+   return cookie_pick;
+}
+
+static struct task_struct *
+pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
+{
+   struct task_struct *next, *max = NULL;
+   const struct sched_class *class;
+   const struct cpumask *smt_mask;
+   int i, j, cpu;
+
+   if (!sched_core_enabled(rq))
+   return __pick_next_task(rq, prev, rf);
+
+   /*
+* If there were no {en,de}queues since we picked (IOW, the task
+* pointers are all still valid), and we haven't scheduled the last
+* pick yet, do so now.
+*/
+   if (rq->core->core_pick_seq == rq->core->core_task_seq &&
+   rq->core->core_pick_seq != rq->core_sched_seq) {
+   WRITE_ONCE(rq->core_sched_seq, rq->core->core_pick_seq);
+
+   next = rq->core_pick;
+   if (next != prev) {
+   put_prev_task(rq, prev);
+   set_next_task(rq, next);
+   }
+   return next;
+   }
+
+   prev->sched_class->put_prev_task(rq, prev, rf);
+   if (!rq->nr_running)
+   newidle_balance(rq, rf);
+
+   cpu = cpu_of(rq);
+   smt_mask = cpu_smt_mask(cpu);
+
+   /*
+* core->core_task_seq, core->core_pick_seq, rq->core_sched_seq
+*
+* @task_seq guards the task state ({en,de}queues)
+* @pick_seq is the @task_seq we did a selection on
+* @sched_seq is the @pick_seq we scheduled
+*
+* However, preemptions can cause multiple picks on the same task set.
+* 'Fix' this by also increasing @task_seq for every pick.
+*/
+   rq->core->core_task_seq++;
+
+   /* reset state */
+   for_each_cpu(i, smt_mask) {
+   struct rq *rq_i = cpu_rq(i);
+
+   rq_i->core_pick = NULL;
+
+   if (i != cpu)
+   update_rq_clock(rq_i);
+   }
+
+   /*
+* Try and select tasks for each sibling in descending sched_class
+* order.
+*/
+   for_each_class(class) {
+again:
+   for_each_cpu_wrap(i, smt_mask, cpu) {
+   struct rq *rq_i = cpu_rq(i);
+   struct task_struct *p;
+
+   if (rq_i->core_pick)
+