Re: [RFC PATCH v2 00/17] Core scheduling v2

2019-05-17 Thread Li, Aubrey
On 2019/5/18 8:58, Li, Aubrey wrote:
> On 2019/4/30 12:42, Ingo Molnar wrote:
>>
>>>> What's interesting is how in the over-saturated case (the last three
>>>> rows: 128, 256 and 512 total threads) coresched-SMT leaves 20-30% CPU
>>>> performance on the floor according to the load figures.
>>>
> 
> Sorry for the delay, I got a chance to obtain some profiling results. Here
> is the story on my side. I still used the previous 128/128 test case
> (256 threads in total), and focused on CPU53 (randomly picked) only.
> 
> Firstly, mpstat reports cpu utilization,
> - baseline is 100%,
> - coresched-SMT is 87.51%
> 
> Then I traced the sched_switch trace point over a 100s sampling period,
> - baseline context switch 14083 times, next task idle 0 times
> - coresched-SMT context switch 15101 times, next task idle 880 times
> 
> So I guess pick_next_task() is the most interesting place, so I
> dug into the trace log of the coresched-SMT case:
> - CPU53 selected idle task 767 times (matched with the data of sched_switch)
> 
> There are 3 branches of CPU53 selecting idle task in pick_next_task():
> - pick pre selected 765 times
> - unconstrained pick 1 time
> - picked: swapper/53/0 1 time
> 
> Where does CPU53's "pick pre selected idle task" come from? I guess it's from its
> sibling CPU1, so I checked CPU1's trace log and found:
> - CPU1 helped its sibling CPU53 select idle task 800 times
> 
> So for CPU53, the most interesting part occurs in pick_task(), that is:
> - The sibling CPU1 helped to select the idle task in pick_task()
> 
> Forgive me for pasting this routine here:
> =
> +// XXX fairness/fwd progress conditions
> +static struct task_struct *
> +pick_task(struct rq *rq, const struct sched_class *class, struct task_struct *max)
> +{
> + struct task_struct *class_pick, *cookie_pick;
> + unsigned long cookie = 0UL;
> +
> + /*
> +  * We must not rely on rq->core->core_cookie here, because we fail to reset
> +  * rq->core->core_cookie on new picks, such that we can detect if we need
> +  * to do single vs multi rq task selection.
> +  */
> +
> + if (max && max->core_cookie) {
> + WARN_ON_ONCE(rq->core->core_cookie != max->core_cookie);
> + cookie = max->core_cookie;
> + }
> +
> + class_pick = class->pick_task(rq);
> + if (!cookie)
> + return class_pick;
> +
> + cookie_pick = sched_core_find(rq, cookie);
> + if (!class_pick)
> + return cookie_pick;
> +
> + /*
> +  * If class > max && class > cookie, it is the highest priority task on
> +  * the core (so far) and it must be selected, otherwise we must go with
> +  * the cookie pick in order to satisfy the constraint.
> +  */
> + if (cpu_prio_less(cookie_pick, class_pick) && core_prio_less(max, class_pick))
> + return class_pick;
> +
> + return cookie_pick;
> +}
> =
> 
> And the most related log of the case:
> =
> <...>-21553 [001] dN.. 87341.514992: __schedule: cpu(1): selected: gemmbench/21294 23df8900
> <...>-21553 [001] dN.. 87341.514992: __schedule: max: gemmbench/21294 23df8900
> <...>-21553 [001] dN.. 87341.514995: __schedule: (swapper/53/0;140,0,0) ?< (sysbench/21503;140,457178607302,0)
> <...>-21553 [001] dN.. 87341.514996: __schedule: (gemmbench/21294;119,219715519947,0) ?< (sysbench/21503;119,457178607302,0)
> <...>-21553 [001] dN.. 87341.514996: __schedule: cpu(53): selected: swapper/53/0 0
> 
> It said,
> - CPU1 selected gemmbench for itself
> - and gemmbench was assigned to max of this core
> - then CPU1 helped CPU53 to pick_task()
> -- CPU1 used class->pick_task(), selected sysbench for CPU53
> -- CPU1 used cookie_pick, selected swapper(idle task) for CPU53
> -- the class_pick (sysbench) unfortunately didn't pass the priority check
> - the idle task was picked in the end (sadly).
> 
> So, I think if we want to improve CPU utilization under this scenario,
> the straightforward tweak is to pick class_pick if cookie_pick is idle.

Another quick thought: in CPU53's own path through pick_next_task(), should it give up
the pre-selected task (chosen by CPU1) if that pre-selected task is idle?
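
Roughly something like this before pick_next_task() uses the pre-selected
core_pick (just a rough, untested sketch; rq->core_pick is the field from
the v2 series and is_idle_task() is the generic helper):
=
	/*
	 * If the sibling pre-selected the idle task for us but we do have
	 * runnable tasks here, drop the pre-selection and fall through to
	 * a fresh core-wide pick instead.
	 */
	if (rq->core_pick && is_idle_task(rq->core_pick) && rq->nr_running)
		rq->core_pick = NULL;
=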


> But I know this is a violation of the design philosophy (avoiding L1TF) of
> this proposal.
> 
> Does it make sense to add a knob to switch between security and performance?
> Any comments are welcome!
> 
> Thanks,
> -Aubrey
> 



Re: [RFC PATCH v2 00/17] Core scheduling v2

2019-05-17 Thread Li, Aubrey
On 2019/4/30 12:42, Ingo Molnar wrote:
> 
>>> What's interesting is how in the over-saturated case (the last three
>>> rows: 128, 256 and 512 total threads) coresched-SMT leaves 20-30% CPU
>>> performance on the floor according to the load figures.
>>

Sorry for the delay, I got a chance to obtain some profiling results. Here
is the story on my side. I still used the previous 128/128 test case
(256 threads in total), and focused on CPU53 (randomly picked) only.

Firstly, mpstat reports cpu utilization,
- baseline is 100%,
- coresched-SMT is 87.51%

Then I traced the sched_switch trace point over a 100s sampling period,
- baseline context switch 14083 times, next task idle 0 times
- coresched-SMT context switch 15101 times, next task idle 880 times

So I guess pick_next_task() is the most interesting place, so I
dug into the trace log of the coresched-SMT case:
- CPU53 selected idle task 767 times (matched with the data of sched_switch)

There are 3 branches of CPU53 selecting idle task in pick_next_task():
- pick pre selected 765 times
- unconstrained pick 1 time
- picked: swapper/53/0 1 time

Where does CPU53's "pick pre selected idle task" come from? I guess it's from its
sibling CPU1, so I checked CPU1's trace log and found:
- CPU1 helped its sibling CPU53 select idle task 800 times

So for CPU53, the most interesting part occurs in pick_task(), that is:
- The sibling CPU1 helped to select the idle task in pick_task()

Forgive me for pasting this routine here:
=
+// XXX fairness/fwd progress conditions
+static struct task_struct *
+pick_task(struct rq *rq, const struct sched_class *class, struct task_struct *max)
+{
+   struct task_struct *class_pick, *cookie_pick;
+   unsigned long cookie = 0UL;
+
+   /*
+* We must not rely on rq->core->core_cookie here, because we fail to reset
+* rq->core->core_cookie on new picks, such that we can detect if we need
+* to do single vs multi rq task selection.
+*/
+
+   if (max && max->core_cookie) {
+   WARN_ON_ONCE(rq->core->core_cookie != max->core_cookie);
+   cookie = max->core_cookie;
+   }
+
+   class_pick = class->pick_task(rq);
+   if (!cookie)
+   return class_pick;
+
+   cookie_pick = sched_core_find(rq, cookie);
+   if (!class_pick)
+   return cookie_pick;
+
+   /*
+* If class > max && class > cookie, it is the highest priority task on
+* the core (so far) and it must be selected, otherwise we must go with
+* the cookie pick in order to satisfy the constraint.
+*/
+   if (cpu_prio_less(cookie_pick, class_pick) && core_prio_less(max, class_pick))
+   return class_pick;
+
+   return cookie_pick;
+}
=

And the most related log of the case:
=
<...>-21553 [001] dN.. 87341.514992: __schedule: cpu(1): selected: gemmbench/21294 23df8900
<...>-21553 [001] dN.. 87341.514992: __schedule: max: gemmbench/21294 23df8900
<...>-21553 [001] dN.. 87341.514995: __schedule: (swapper/53/0;140,0,0) ?< (sysbench/21503;140,457178607302,0)
<...>-21553 [001] dN.. 87341.514996: __schedule: (gemmbench/21294;119,219715519947,0) ?< (sysbench/21503;119,457178607302,0)
<...>-21553 [001] dN.. 87341.514996: __schedule: cpu(53): selected: swapper/53/0 0

It said,
- CPU1 selected gemmbench for itself
- and gemmbench was assigned to max of this core
- then CPU1 helped CPU53 to pick_task()
-- CPU1 used class->pick_task(), selected sysbench for CPU53
-- CPU1 used cookie_pick, selected swapper(idle task) for CPU53
-- the class_pick (sysbench) unfortunately didn't pass the priority check
- the idle task was picked in the end (sadly).

So, I think if we want to improve CPU utilization under this scenario,
the straightforward tweak is to pick class_pick if cookie_pick is idle.
But I know this is a violation of the design philosophy (avoiding L1TF) of
this proposal.
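
Just to illustrate, that tweak would look roughly like this in pick_task()
(a rough sketch only, not tested):
=
	cookie_pick = sched_core_find(rq, cookie);
	if (!class_pick)
		return cookie_pick;

	/*
	 * Rough idea: if the cookie pick is only the idle task while this
	 * class has a runnable task, take the class pick and keep the CPU
	 * busy.  This trades the L1TF isolation for utilization.
	 */
	if (is_idle_task(cookie_pick))
		return class_pick;
=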

Does it make sense to add a knob to switch between security and performance?
Any comments are welcome!
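
If a knob sounds acceptable, a minimal sketch could be a sysctl (the name
sched_core_relax_idle is invented here, default off, wired up through the
usual ctl_table entry) that simply gates the relaxation above:
=
/* 0 = strict core isolation (default), 1 = prefer utilization */
unsigned int sysctl_sched_core_relax_idle __read_mostly;

	/* in pick_task(), instead of relaxing unconditionally: */
	if (sysctl_sched_core_relax_idle && is_idle_task(cookie_pick))
		return class_pick;
=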

Thanks,
-Aubrey


Re: [RFC PATCH v2 00/17] Core scheduling v2

2019-05-15 Thread Vineeth Remanan Pillai
> Thanks for pointing this out. I think the ideal fix would be to
> correctly initialize/cleanup the coresched attributes in the cpu
> hotplug code path so that the lock could be taken successfully if the
> sibling is offlined/onlined after coresched was enabled. We are
> working on another bug related to the hotplug path and shall introduce
> the fix in v3.
>
A possible fix for handling the runqueues during cpu offline/online
is attached herewith.

Thanks,
Vineeth

---
 kernel/sched/core.c | 28 +++++++++++++++++++++++++---
 1 file changed, 25 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index e8e5f26db052..1a809849a1e7 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -253,7 +253,7 @@ static int __sched_core_stopper(void *data)
bool enabled = !!(unsigned long)data;
int cpu;
 
-   for_each_possible_cpu(cpu)
+   for_each_online_cpu(cpu)
cpu_rq(cpu)->core_enabled = enabled;
 
return 0;
@@ -3764,6 +3764,9 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
struct rq *rq_i = cpu_rq(i);
struct task_struct *p;
 
+   if (cpu_is_offline(i))
+   continue;
+
if (rq_i->core_pick)
continue;
 
@@ -3866,6 +3869,9 @@ next_class:;
for_each_cpu(i, smt_mask) {
struct rq *rq_i = cpu_rq(i);
 
+   if (cpu_is_offline(i))
+   continue;
+
WARN_ON_ONCE(!rq_i->core_pick);
 
rq_i->core_pick->core_occupation = occ;
@@ -6410,8 +6416,14 @@ int sched_cpu_activate(unsigned int cpu)
/*
 * When going up, increment the number of cores with SMT present.
 */
-   if (cpumask_weight(cpu_smt_mask(cpu)) == 2)
+   if (cpumask_weight(cpu_smt_mask(cpu)) == 2) {
static_branch_inc_cpuslocked(&sched_smt_present);
+#ifdef CONFIG_SCHED_CORE
+   if (static_branch_unlikely(&__sched_core_enabled)) {
+   rq->core_enabled = true;
+   }
+#endif
+   }
 #endif
set_cpu_active(cpu, true);
 
@@ -6459,8 +6471,15 @@ int sched_cpu_deactivate(unsigned int cpu)
/*
 * When going down, decrement the number of cores with SMT present.
 */
-   if (cpumask_weight(cpu_smt_mask(cpu)) == 2)
+   if (cpumask_weight(cpu_smt_mask(cpu)) == 2) {
+#ifdef CONFIG_SCHED_CORE
+   struct rq *rq = cpu_rq(cpu);
+   if (static_branch_unlikely(&__sched_core_enabled)) {
+   rq->core_enabled = false;
+   }
+#endif
static_branch_dec_cpuslocked(&sched_smt_present);
+   }
 #endif
 
if (!sched_smp_initialized)
@@ -6537,6 +6556,9 @@ int sched_cpu_dying(unsigned int cpu)
update_max_interval();
nohz_balance_exit_idle(rq);
hrtick_clear(rq);
+#ifdef CONFIG_SCHED_CORE
+   rq->core = NULL;
+#endif
return 0;
 }
 #endif
-- 
2.17.1


Re: [RFC PATCH v2 00/17] Core scheduling v2

2019-05-15 Thread Vineeth Remanan Pillai
> It's clear now, thanks.
> I don't immediately see how my isolation fix would make your fix stop
> working, will need to check. But I'm busy with other stuff so it will
> take a while.
>
We have identified the issue and have a fix for this. The issue is the
same as before: a forced-idle sibling has a runnable process which is
starved due to an unconstrained-pick bug.

One sample scenario is like this:
cpu0 and cpu1 are siblings. cpu0 selects an untagged process 'a',
which forces idle on cpu1 even though cpu1 had a runnable tagged
process 'b' that the code determines to be of lesser priority.
cpu1 can then go into deep idle.

During the next schedule in cpu0, the following could happen:
 - cpu0 selects swapper as there is nothing to run and hence
   prev_cookie is 0, it does an unconstrained pick of swapper.
   So both cpu0 and 1 are idling and cpu1 might be deep idle.
 - cpu0 again goes to schedule and selects 'a', which is runnable
   now. Since prev_cookie is 0, 'a' is an unconstrained pick and
   'b' on cpu1 is forgotten again.

This continues with swapper and process 'a' taking turns without
considering sibling until a tagged process becomes runnable in cpu0
and then we don't get into unconstrained pick.

The above is one of a couple of scenarios we have seen; each has a
slightly different path, which ultimately leads to an unconstrained
pick, starving the sibling's runnable thread.

The fix is to mark whether a core has gone forced-idle while it had a
runnable process, and then not do an unconstrained pick if a forced
idle happened in the last pick.

I am attaching herewith the patch that fixes the above issue. The patch
is on top of Peter's fix and your correctness fix that we modified for
v2. We have a public repository with all the changes, including this
fix:
https://github.com/digitalocean/linux-coresched/tree/coresched

We are working on a v3 where the last 3 commits will be squashed into
their related patches in v2. We hope to come up with a v3 next week
with all the suggestions and fixes posted in v2.

Thanks,
Vineeth

---
 kernel/sched/core.c  | 26 ++++++++++++++++++++++----
 kernel/sched/sched.h |  1 +
 2 files changed, 23 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 413d46bde17d..3aba0f8fe384 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3653,8 +3653,8 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
struct task_struct *next, *max = NULL;
const struct sched_class *class;
const struct cpumask *smt_mask;
-   unsigned long prev_cookie;
int i, j, cpu, occ = 0;
+   bool need_sync = false;
 
if (!sched_core_enabled(rq))
return __pick_next_task(rq, prev, rf);
@@ -3702,7 +3702,7 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 * 'Fix' this by also increasing @task_seq for every pick.
 */
rq->core->core_task_seq++;
-   prev_cookie = rq->core->core_cookie;
+   need_sync = !!rq->core->core_cookie;
 
/* reset state */
rq->core->core_cookie = 0UL;
@@ -3711,6 +3711,11 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 
rq_i->core_pick = NULL;
 
+   if (rq_i->core_forceidle) {
+   need_sync = true;
+   rq_i->core_forceidle = false;
+   }
+
if (i != cpu)
update_rq_clock(rq_i);
}
@@ -3743,7 +3748,7 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 * If there weren't no cookies; we don't need
 * to bother with the other siblings.
 */
-   if (i == cpu && !prev_cookie)
+   if (i == cpu && !need_sync)
goto next_class;
 
continue;
@@ -3753,7 +3758,7 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 * Optimize the 'normal' case where there aren't any
 * cookies and we don't need to sync up.
 */
-   if (i == cpu && !prev_cookie && !p->core_cookie) {
+   if (i == cpu && !need_sync && !p->core_cookie) {
next = p;
rq->core_pick = NULL;
rq->core->core_cookie = 0UL;
@@ -3816,7 +3821,16 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
}
occ = 1;
goto again;
+   } else {
+   /*
+* Once we select a

Re: [RFC PATCH v2 00/17] Core scheduling v2

2019-05-08 Thread Aaron Lu
On Wed, May 08, 2019 at 01:49:09PM -0400, Julien Desfossez wrote:
> On 08-May-2019 10:30:09 AM, Aaron Lu wrote:
> > On Mon, May 06, 2019 at 03:39:37PM -0400, Julien Desfossez wrote:
> > > On 29-Apr-2019 11:53:21 AM, Aaron Lu wrote:
> > > > This is what I have used to make sure no two unmatched tasks being
> > > > scheduled on the same core: (on top of v1, I think it's easier to just
> > > > show the diff instead of commenting on various places of the patches :-)
> > > 
> > > We imported this fix in v2 and made some small changes and optimizations
> > > (with and without Peter’s fix from https://lkml.org/lkml/2019/4/26/658)
> > > and in both cases, the performance problem where the core can end up
> > 
> > By 'core', do you mean a logical CPU(hyperthread) or the entire core?
> No I really meant the entire core.
> 
> I’m sorry, I should have added a little bit more context. This relates
> to a performance issue we saw in v1 and discussed here:
> https://lore.kernel.org/lkml/20190410150116.gi2...@worktop.programming.kicks-ass.net/T/#mb9f1f54a99bac468fc5c55b06a9da306ff48e90b
> 
> We proposed a fix that solved this, Peter came up with a better one
> (https://lkml.org/lkml/2019/4/26/658), but if we add your isolation fix
> as posted above, the same problem reappears. Hope this clarifies your
> ask.

It's clear now, thanks.
I don't immediately see how my isolation fix would make your fix stop
working, will need to check. But I'm busy with other stuff so it will
take a while.

> 
> I hope that we did not miss anything crucial while integrating your fix
> on top of v2 + Peter’s fix. The changes are conceptually similar, but we
> refactored it slightly to make the logic clear. Please have a look and
> let us know

I suppose you already have a branch that has all the bits there? I
wonder if you can share that branch somewhere so I can start working on
top of it to make sure we are on the same page?

Also, it would be good if you can share the workload, cmdline options,
how many workers to start, etc., to reproduce this issue.

Thanks.


Re: [RFC PATCH v2 00/17] Core scheduling v2

2019-05-08 Thread Julien Desfossez
On 08-May-2019 10:30:09 AM, Aaron Lu wrote:
> On Mon, May 06, 2019 at 03:39:37PM -0400, Julien Desfossez wrote:
> > On 29-Apr-2019 11:53:21 AM, Aaron Lu wrote:
> > > This is what I have used to make sure no two unmatched tasks being
> > > scheduled on the same core: (on top of v1, I think it's easier to just
> > > show the diff instead of commenting on various places of the patches :-)
> > 
> > We imported this fix in v2 and made some small changes and optimizations
> > (with and without Peter’s fix from https://lkml.org/lkml/2019/4/26/658)
> > and in both cases, the performance problem where the core can end up
> 
> By 'core', do you mean a logical CPU(hyperthread) or the entire core?
No I really meant the entire core.

I’m sorry, I should have added a little bit more context. This relates
to a performance issue we saw in v1 and discussed here:
https://lore.kernel.org/lkml/20190410150116.gi2...@worktop.programming.kicks-ass.net/T/#mb9f1f54a99bac468fc5c55b06a9da306ff48e90b

We proposed a fix that solved this, Peter came up with a better one
(https://lkml.org/lkml/2019/4/26/658), but if we add your isolation fix
as posted above, the same problem reappears. Hope this clarifies your
ask.

I hope that we did not miss anything crucial while integrating your fix
on top of v2 + Peter’s fix. The changes are conceptually similar, but we
refactored them slightly to make the logic clearer. Please have a look and
let us know.

Thanks,

Julien


Re: [RFC PATCH v2 00/17] Core scheduling v2

2019-05-07 Thread Aaron Lu
On Mon, May 06, 2019 at 03:39:37PM -0400, Julien Desfossez wrote:
> On 29-Apr-2019 11:53:21 AM, Aaron Lu wrote:
> > This is what I have used to make sure no two unmatched tasks being
> > scheduled on the same core: (on top of v1, I think it's easier to just
> > show the diff instead of commenting on various places of the patches :-)
> 
> We imported this fix in v2 and made some small changes and optimizations
> (with and without Peter’s fix from https://lkml.org/lkml/2019/4/26/658)
> and in both cases, the performance problem where the core can end up

By 'core', do you mean a logical CPU(hyperthread) or the entire core?

> idle with tasks in its runqueues came back.

Assume you meant a hyperthread, then the question is: when a hyperthread
is idle with tasks sitting in its runqueue, do these tasks match with the
other hyperthread's rq->curr? If so, then it is a problem that needs to
be addressed; if not, then this is due to the constraint imposed by the
mitigation of L1TF.

Thanks.


Re: [RFC PATCH v2 00/17] Core scheduling v2

2019-05-06 Thread Julien Desfossez
On 29-Apr-2019 11:53:21 AM, Aaron Lu wrote:
> On Tue, Apr 23, 2019 at 06:45:27PM +, Vineeth Remanan Pillai wrote:
> > >> - Processes with different tags can still share the core
> > 
> > > I may have missed something... Could you explain this statement?
> > 
> > > This, to me, is the whole point of the patch series. If it's not
> > > doing this then ... what?
> > 
> > What I meant was, the patch needs some more work to be accurate.
> > There are some race conditions where the core violation can still
> > happen. In our testing, we saw around 1 to 5% of the time being
> > shared with incompatible processes. One example of this happening
> > is as follows(let cpu 0 and 1 be siblings):
> > - cpu 0 selects a process with a cookie
> > - cpu 1 selects a higher priority process without cookie
> > - Selection process restarts for cpu 0 and it might select a
> >   process with cookie but with lesser priority.
> > - Since it is lesser priority, the logic in pick_next_task
> >   doesn't compare again for the cookie(trusts pick_task) and
> >   proceeds.
> > 
> > This is one of the scenarios that we saw from traces, but there
> > might be other race conditions as well. Fix seems a little
> > involved and we are working on that.
> 
> This is what I have used to make sure no two unmatched tasks being
> > scheduled on the same core: (on top of v1, I think it's easier to just
> show the diff instead of commenting on various places of the patches :-)

We imported this fix in v2 and made some small changes and optimizations
(with and without Peter’s fix from https://lkml.org/lkml/2019/4/26/658)
and in both cases, the performance problem where the core can end up
idle with tasks in its runqueues came back.

This is pretty easy to reproduce with a multi-file disk write benchmark.

Here is the patch based on your changes applied on v2 (on top of Peter’s
fix):

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 07f3f0c..e09fa25 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3653,6 +3653,13 @@ static inline bool cookie_match(struct task_struct *a, struct task_struct *b)
 }
 
 // XXX fairness/fwd progress conditions
+/*
+ * Returns
+ * - NULL if there is no runnable task for this class.
+ * - the highest priority task for this runqueue if it matches
+ *   rq->core->core_cookie or its priority is greater than max.
+ * - Else returns idle_task.
+ */
 static struct task_struct *
 pick_task(struct rq *rq, const struct sched_class *class, struct task_struct *max)
 {
@@ -3660,19 +3667,36 @@ pick_task(struct rq *rq, const struct sched_class *class, struct task_struct *max)
unsigned long cookie = rq->core->core_cookie;
 
class_pick = class->pick_task(rq);
-   if (!cookie)
+   if (!class_pick)
+   return NULL;
+
+   if (!cookie) {
+   /*
+* If class_pick is tagged, return it only if it has
+* higher priority than max.
+*/
+   if (max && class_pick->core_cookie &&
+   core_prio_less(class_pick, max))
+   return idle_sched_class.pick_task(rq);
+
+   return class_pick;
+   }
+
+   /*
+* If there is a cookie match here, return early.
+*/
+   if (class_pick->core_cookie == cookie)
return class_pick;
 
cookie_pick = sched_core_find(rq, cookie);
-   if (!class_pick)
-   return cookie_pick;
 
/*
 * If class > max && class > cookie, it is the highest priority task on
 * the core (so far) and it must be selected, otherwise we must go with
 * the cookie pick in order to satisfy the constraint.
 */
-   if (cpu_prio_less(cookie_pick, class_pick) && core_prio_less(max, class_pick))
+   if (cpu_prio_less(cookie_pick, class_pick) &&
+   (!max || core_prio_less(max, class_pick)))
return class_pick;
 
return cookie_pick;
@@ -3742,8 +3766,16 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 
rq_i->core_pick = NULL;
 
-   if (i != cpu)
+   if (i != cpu) {
update_rq_clock(rq_i);
+
+   /*
+* If a sibling is idle, we can initiate an
+* unconstrained pick.
+*/
+   if (is_idle_task(rq_i->curr) && prev_cookie)
+   prev_cookie = 0UL;
+   }
}
 
/*
@@ -3820,12 +3852,14 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
/*
 * If this new candidate is of higher priority than the
 * previous; and they're incompatible; we need to wipe
-* the slate and start over.
+* the slate and start over. pick_task makes sure that
+

Re: [RFC PATCH v2 00/17] Core scheduling v2

2019-04-29 Thread Ingo Molnar


* Aubrey Li  wrote:

> On Tue, Apr 30, 2019 at 12:01 AM Ingo Molnar  wrote:
> > * Li, Aubrey  wrote:
> >
> > > > I.e. showing the approximate CPU thread-load figure column would be
> > > > very useful too, where '50%' shows half-loaded, '100%' fully-loaded,
> > > > '200%' over-saturated, etc. - for each row?
> > >
> > > See below, hope this helps.
> > > .--.
> > > |NA/AVX vanilla-SMT [std% / sem%] cpu% |coresched-SMT   [std% / 
> > > sem%] +/- cpu% |  no-SMT [std% / sem%]   +/-  cpu% |
> > > |--|
> > > |  1/1508.5 [ 0.2%/ 0.0%] 2.1% |504.7   [ 1.1%/ 
> > > 0.1%]-0.8%2.1% |   509.0 [ 0.2%/ 0.0%]   0.1% 4.3% |
> > > |  2/2   1000.2 [ 1.4%/ 0.1%] 4.1% |   1004.1   [ 1.6%/ 
> > > 0.2%] 0.4%4.1% |   997.6 [ 1.2%/ 0.1%]  -0.3% 8.1% |
> > > |  4/4   1912.1 [ 1.0%/ 0.1%] 7.9% |   1904.2   [ 1.1%/ 
> > > 0.1%]-0.4%7.9% |  1914.9 [ 1.3%/ 0.1%]   0.1%15.1% |
> > > |  8/8   3753.5 [ 0.3%/ 0.0%]14.9% |   3748.2   [ 0.3%/ 
> > > 0.0%]-0.1%   14.9% |  3751.3 [ 0.4%/ 0.0%]  -0.1%30.5% |
> > > | 16/16  7139.3 [ 2.4%/ 0.2%]30.3% |   7137.9   [ 1.8%/ 
> > > 0.2%]-0.0%   30.3% |  7049.2 [ 2.4%/ 0.2%]  -1.3%60.4% |
> > > | 32/32 10899.0 [ 4.2%/ 0.4%]60.3% |  10780.3   [ 4.4%/ 
> > > 0.4%]-1.1%   55.9% | 10339.2 [ 9.6%/ 0.9%]  -5.1%97.7% |
> > > | 64/64 15086.1 [11.5%/ 1.2%]97.7% |  14262.0   [ 8.2%/ 
> > > 0.8%]-5.5%   82.0% | 11168.7 [22.2%/ 1.7%] -26.0%   100.0% |
> > > |128/12815371.9 [22.0%/ 2.2%]   100.0% |  14675.8   [14.4%/ 
> > > 1.4%]-4.5%   82.8% | 10963.9 [18.5%/ 1.4%] -28.7%   100.0% |
> > > |256/25615990.8 [22.0%/ 2.2%]   100.0% |  12227.9   [10.3%/ 
> > > 1.0%]   -23.5%   73.2% | 10469.9 [19.6%/ 1.7%] -34.5%   100.0% |
> > > '--'
> >
> > Very nice, thank you!
> >
> > What's interesting is how in the over-saturated case (the last three
> > rows: 128, 256 and 512 total threads) coresched-SMT leaves 20-30% CPU
> > performance on the floor according to the load figures.
> 
> Yeah, I found the next focus.
> 
> > Is this true idle time (which shows up as 'id' during 'top'), or some 
> > load average artifact?
> 
> vmstat periodically reported intermediate CPU utilization every
> second; it was running simultaneously while the benchmarks ran. The cpu%
> is computed as the average of the (100-idle) series.

Ok - so 'vmstat' uses /proc/stat, which uses cpustat[CPUTIME_IDLE] (or 
its NOHZ work-alike), so this should be true idle time - to the extent 
the HZ process clock's sampling is accurate.

So I guess the answer to my question is "yes". ;-)

BTW., for robustness sake you might want to add iowait to idle time (it's 
the 'wa' field of vmstat) - it shouldn't matter for this particular 
benchmark which doesn't do much IO, but it might for others.

Both CPUTIME_IDLE and CPUTIME_IOWAIT are idle states when a CPU is not 
utilized.
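
E.g. a quick standalone sketch of the arithmetic I mean (not from the
series, just reading /proc/stat directly), counting iowait as idle when
computing cpu%:

  #include <stdio.h>
  #include <unistd.h>

  struct cpu_sample { unsigned long long idle, total; };

  static struct cpu_sample read_cpu(void)
  {
          /* first /proc/stat line: cpu user nice system idle iowait irq softirq steal */
          unsigned long long v[8] = { 0 };
          struct cpu_sample s = { 0, 0 };
          FILE *f = fopen("/proc/stat", "r");
          int i;

          if (f && fscanf(f, "cpu %llu %llu %llu %llu %llu %llu %llu %llu",
                          &v[0], &v[1], &v[2], &v[3], &v[4], &v[5], &v[6], &v[7]) >= 5) {
                  s.idle = v[3] + v[4];           /* idle + iowait */
                  for (i = 0; i < 8; i++)
                          s.total += v[i];
          }
          if (f)
                  fclose(f);
          return s;
  }

  int main(void)
  {
          struct cpu_sample a = read_cpu(), b;
          unsigned long long dtotal, dbusy;

          sleep(1);
          b = read_cpu();
          dtotal = b.total - a.total;
          dbusy  = (b.total - b.idle) - (a.total - a.idle);
          if (dtotal)
                  printf("cpu%%: %.1f\n", 100.0 * (double)dbusy / (double)dtotal);
          return 0;
  }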

[ Side note: we should really implement precise idle time accounting when 
  CONFIG_IRQ_TIME_ACCOUNTING=y is enabled. We pay all the costs of the 
  timestamps, but AFAICS we don't propagate that into the idle cputime
  metrics. ]

Thanks,

Ingo


Re: [RFC PATCH v2 00/17] Core scheduling v2

2019-04-29 Thread Aubrey Li
On Tue, Apr 30, 2019 at 12:01 AM Ingo Molnar  wrote:
> * Li, Aubrey  wrote:
>
> > > I.e. showing the approximate CPU thread-load figure column would be
> > > very useful too, where '50%' shows half-loaded, '100%' fully-loaded,
> > > '200%' over-saturated, etc. - for each row?
> >
> > See below, hope this helps.
> > .--.
> > |NA/AVX vanilla-SMT [std% / sem%] cpu% |coresched-SMT   [std% / 
> > sem%] +/- cpu% |  no-SMT [std% / sem%]   +/-  cpu% |
> > |--|
> > |  1/1508.5 [ 0.2%/ 0.0%] 2.1% |504.7   [ 1.1%/ 
> > 0.1%]-0.8%2.1% |   509.0 [ 0.2%/ 0.0%]   0.1% 4.3% |
> > |  2/2   1000.2 [ 1.4%/ 0.1%] 4.1% |   1004.1   [ 1.6%/ 
> > 0.2%] 0.4%4.1% |   997.6 [ 1.2%/ 0.1%]  -0.3% 8.1% |
> > |  4/4   1912.1 [ 1.0%/ 0.1%] 7.9% |   1904.2   [ 1.1%/ 
> > 0.1%]-0.4%7.9% |  1914.9 [ 1.3%/ 0.1%]   0.1%15.1% |
> > |  8/8   3753.5 [ 0.3%/ 0.0%]14.9% |   3748.2   [ 0.3%/ 
> > 0.0%]-0.1%   14.9% |  3751.3 [ 0.4%/ 0.0%]  -0.1%30.5% |
> > | 16/16  7139.3 [ 2.4%/ 0.2%]30.3% |   7137.9   [ 1.8%/ 
> > 0.2%]-0.0%   30.3% |  7049.2 [ 2.4%/ 0.2%]  -1.3%60.4% |
> > | 32/32 10899.0 [ 4.2%/ 0.4%]60.3% |  10780.3   [ 4.4%/ 
> > 0.4%]-1.1%   55.9% | 10339.2 [ 9.6%/ 0.9%]  -5.1%97.7% |
> > | 64/64 15086.1 [11.5%/ 1.2%]97.7% |  14262.0   [ 8.2%/ 
> > 0.8%]-5.5%   82.0% | 11168.7 [22.2%/ 1.7%] -26.0%   100.0% |
> > |128/12815371.9 [22.0%/ 2.2%]   100.0% |  14675.8   [14.4%/ 
> > 1.4%]-4.5%   82.8% | 10963.9 [18.5%/ 1.4%] -28.7%   100.0% |
> > |256/25615990.8 [22.0%/ 2.2%]   100.0% |  12227.9   [10.3%/ 
> > 1.0%]   -23.5%   73.2% | 10469.9 [19.6%/ 1.7%] -34.5%   100.0% |
> > '--'
>
> Very nice, thank you!
>
> What's interesting is how in the over-saturated case (the last three
> rows: 128, 256 and 512 total threads) coresched-SMT leaves 20-30% CPU
> performance on the floor according to the load figures.

Yeah, I found the next focus.

>
> Is this true idle time (which shows up as 'id' during 'top'), or some
> load average artifact?
>

vmstat periodically reported intermediate CPU utilization every second; it was
running simultaneously while the benchmarks ran. The cpu% is computed as
the average of the (100-idle) series.

Thanks,
-Aubrey


Re: [RFC PATCH v2 00/17] Core scheduling v2

2019-04-29 Thread Aubrey Li
On Mon, Apr 29, 2019 at 11:39 PM Phil Auld  wrote:
>
> On Mon, Apr 29, 2019 at 09:25:35PM +0800 Li, Aubrey wrote:
> > .--.
> > |NA/AVX vanilla-SMT [std% / sem%] cpu% |coresched-SMT   [std% / 
> > sem%] +/- cpu% |  no-SMT [std% / sem%]   +/-  cpu% |
> > |--|
> > |  1/1508.5 [ 0.2%/ 0.0%] 2.1% |504.7   [ 1.1%/ 
> > 0.1%]-0.8%2.1% |   509.0 [ 0.2%/ 0.0%]   0.1% 4.3% |
> > |  2/2   1000.2 [ 1.4%/ 0.1%] 4.1% |   1004.1   [ 1.6%/ 
> > 0.2%] 0.4%4.1% |   997.6 [ 1.2%/ 0.1%]  -0.3% 8.1% |
> > |  4/4   1912.1 [ 1.0%/ 0.1%] 7.9% |   1904.2   [ 1.1%/ 
> > 0.1%]-0.4%7.9% |  1914.9 [ 1.3%/ 0.1%]   0.1%15.1% |
> > |  8/8   3753.5 [ 0.3%/ 0.0%]14.9% |   3748.2   [ 0.3%/ 
> > 0.0%]-0.1%   14.9% |  3751.3 [ 0.4%/ 0.0%]  -0.1%30.5% |
> > | 16/16  7139.3 [ 2.4%/ 0.2%]30.3% |   7137.9   [ 1.8%/ 
> > 0.2%]-0.0%   30.3% |  7049.2 [ 2.4%/ 0.2%]  -1.3%60.4% |
> > | 32/32 10899.0 [ 4.2%/ 0.4%]60.3% |  10780.3   [ 4.4%/ 
> > 0.4%]-1.1%   55.9% | 10339.2 [ 9.6%/ 0.9%]  -5.1%97.7% |
> > | 64/64 15086.1 [11.5%/ 1.2%]97.7% |  14262.0   [ 8.2%/ 
> > 0.8%]-5.5%   82.0% | 11168.7 [22.2%/ 1.7%] -26.0%   100.0% |
> > |128/12815371.9 [22.0%/ 2.2%]   100.0% |  14675.8   [14.4%/ 
> > 1.4%]-4.5%   82.8% | 10963.9 [18.5%/ 1.4%] -28.7%   100.0% |
> > |256/25615990.8 [22.0%/ 2.2%]   100.0% |  12227.9   [10.3%/ 
> > 1.0%]   -23.5%   73.2% | 10469.9 [19.6%/ 1.7%] -34.5%   100.0% |
> > '--'
> >
>
> That's really nice and clear.
>
> We start to see the penalty for the coresched at 32/32, leaving some cpus 
> more idle than otherwise.
> But it's pretty good overall, for this benchmark at least.
>
> Is this with stock v2 or with any of the fixes posted after? I wonder how
> much the fixes for the race that violates the rule affect this, for example.
>

Yeah, this data is based on v2 without any of the fixes posted after.
I also tried some fixes with potential performance impact, but no luck so far.
Please let me know if there is anything I missed.

Thanks,
-Aubrey


Re: [RFC PATCH v2 00/17] Core scheduling v2

2019-04-29 Thread Ingo Molnar


* Li, Aubrey  wrote:

> > I.e. showing the approximate CPU thread-load figure column would be 
> > very useful too, where '50%' shows half-loaded, '100%' fully-loaded, 
> > '200%' over-saturated, etc. - for each row?
> 
> See below, hope this helps.
> .--.
> |NA/AVX vanilla-SMT [std% / sem%] cpu% |coresched-SMT   [std% / sem%] 
> +/- cpu% |  no-SMT [std% / sem%]   +/-  cpu% |
> |--|
> |  1/1508.5 [ 0.2%/ 0.0%] 2.1% |504.7   [ 1.1%/ 0.1%] 
>-0.8%2.1% |   509.0 [ 0.2%/ 0.0%]   0.1% 4.3% |
> |  2/2   1000.2 [ 1.4%/ 0.1%] 4.1% |   1004.1   [ 1.6%/ 0.2%] 
> 0.4%4.1% |   997.6 [ 1.2%/ 0.1%]  -0.3% 8.1% |
> |  4/4   1912.1 [ 1.0%/ 0.1%] 7.9% |   1904.2   [ 1.1%/ 0.1%] 
>-0.4%7.9% |  1914.9 [ 1.3%/ 0.1%]   0.1%15.1% |
> |  8/8   3753.5 [ 0.3%/ 0.0%]14.9% |   3748.2   [ 0.3%/ 0.0%] 
>-0.1%   14.9% |  3751.3 [ 0.4%/ 0.0%]  -0.1%30.5% |
> | 16/16  7139.3 [ 2.4%/ 0.2%]30.3% |   7137.9   [ 1.8%/ 0.2%] 
>-0.0%   30.3% |  7049.2 [ 2.4%/ 0.2%]  -1.3%60.4% |
> | 32/32 10899.0 [ 4.2%/ 0.4%]60.3% |  10780.3   [ 4.4%/ 0.4%] 
>-1.1%   55.9% | 10339.2 [ 9.6%/ 0.9%]  -5.1%97.7% |
> | 64/64 15086.1 [11.5%/ 1.2%]97.7% |  14262.0   [ 8.2%/ 0.8%] 
>-5.5%   82.0% | 11168.7 [22.2%/ 1.7%] -26.0%   100.0% |
> |128/12815371.9 [22.0%/ 2.2%]   100.0% |  14675.8   [14.4%/ 1.4%] 
>-4.5%   82.8% | 10963.9 [18.5%/ 1.4%] -28.7%   100.0% |
> |256/25615990.8 [22.0%/ 2.2%]   100.0% |  12227.9   [10.3%/ 1.0%] 
>   -23.5%   73.2% | 10469.9 [19.6%/ 1.7%] -34.5%   100.0% |
> '--'

Very nice, thank you!

What's interesting is how in the over-saturated case (the last three 
rows: 128, 256 and 512 total threads) coresched-SMT leaves 20-30% CPU 
performance on the floor according to the load figures.

Is this true idle time (which shows up as 'id' during 'top'), or some 
load average artifact?

Ingo


Re: [RFC PATCH v2 00/17] Core scheduling v2

2019-04-29 Thread Phil Auld
On Mon, Apr 29, 2019 at 09:25:35PM +0800 Li, Aubrey wrote:
> On 2019/4/29 14:14, Ingo Molnar wrote:
> > 
> > * Li, Aubrey  wrote:
> > 
> >>> I suspect it's pretty low, below 1% for all rows?
> >>
>> Hope my mail box works for this...
> >>
> >> .-.
> >> |NA/AVX vanilla-SMT [std% / sem%] | coresched-SMT   [std% / sem%] 
> >> +/- |  no-SMT [std% / sem%]+/-  |
> >> |-|
> >> |  1/1508.5 [ 0.2%/ 0.0%] | 504.7   [ 1.1%/ 0.1%]
> >> -0.8%|   509.0 [ 0.2%/ 0.0%]0.1% |
> >> |  2/2   1000.2 [ 1.4%/ 0.1%] |1004.1   [ 1.6%/ 0.2%] 
> >> 0.4%|   997.6 [ 1.2%/ 0.1%]   -0.3% |
> >> |  4/4   1912.1 [ 1.0%/ 0.1%] |1904.2   [ 1.1%/ 0.1%]
> >> -0.4%|  1914.9 [ 1.3%/ 0.1%]0.1% |
> >> |  8/8   3753.5 [ 0.3%/ 0.0%] |3748.2   [ 0.3%/ 0.0%]
> >> -0.1%|  3751.3 [ 0.4%/ 0.0%]   -0.1% |
> >> | 16/16  7139.3 [ 2.4%/ 0.2%] |7137.9   [ 1.8%/ 0.2%]
> >> -0.0%|  7049.2 [ 2.4%/ 0.2%]   -1.3% |
> >> | 32/32 10899.0 [ 4.2%/ 0.4%] |   10780.3   [ 4.4%/ 0.4%]
> >> -1.1%| 10339.2 [ 9.6%/ 0.9%]   -5.1% |
> >> | 64/64 15086.1 [11.5%/ 1.2%] |   14262.0   [ 8.2%/ 0.8%]
> >> -5.5%| 11168.7 [22.2%/ 1.7%]  -26.0% |
> >> |128/12815371.9 [22.0%/ 2.2%] |   14675.8   [14.4%/ 1.4%]
> >> -4.5%| 10963.9 [18.5%/ 1.4%]  -28.7% |
> >> |256/25615990.8 [22.0%/ 2.2%] |   12227.9   [10.3%/ 1.0%]   
> >> -23.5%| 10469.9 [19.6%/ 1.7%]  -34.5% |
> >> '-'
> > 
> > Perfectly presented, thank you very much!
> 
> My pleasure! ;-)
> 
> > 
> > My final question would be about the environment:
> > 
> >> Skylake server, 2 numa nodes, 104 CPUs (HT on)
> > 
> > Is the typical nr_running value the sum of 'NA+AVX', i.e. is it ~256 
> > threads for the 128/128 row for example - or is it 128 parallel tasks?
> 
> That means 128 sysbench threads and 128 gemmbench tasks, so 256 threads in 
> sum.
> > 
> > I.e. showing the approximate CPU thread-load figure column would be very 
> > useful too, where '50%' shows half-loaded, '100%' fully-loaded, '200%' 
> > over-saturated, etc. - for each row?
> 
> See below, hope this helps.
> .--.
> |NA/AVX vanilla-SMT [std% / sem%] cpu% |coresched-SMT   [std% / sem%] 
> +/- cpu% |  no-SMT [std% / sem%]   +/-  cpu% |
> |--|
> |  1/1508.5 [ 0.2%/ 0.0%] 2.1% |504.7   [ 1.1%/ 0.1%] 
>-0.8%2.1% |   509.0 [ 0.2%/ 0.0%]   0.1% 4.3% |
> |  2/2   1000.2 [ 1.4%/ 0.1%] 4.1% |   1004.1   [ 1.6%/ 0.2%] 
> 0.4%4.1% |   997.6 [ 1.2%/ 0.1%]  -0.3% 8.1% |
> |  4/4   1912.1 [ 1.0%/ 0.1%] 7.9% |   1904.2   [ 1.1%/ 0.1%] 
>-0.4%7.9% |  1914.9 [ 1.3%/ 0.1%]   0.1%15.1% |
> |  8/8   3753.5 [ 0.3%/ 0.0%]14.9% |   3748.2   [ 0.3%/ 0.0%] 
>-0.1%   14.9% |  3751.3 [ 0.4%/ 0.0%]  -0.1%30.5% |
> | 16/16  7139.3 [ 2.4%/ 0.2%]30.3% |   7137.9   [ 1.8%/ 0.2%] 
>-0.0%   30.3% |  7049.2 [ 2.4%/ 0.2%]  -1.3%60.4% |
> | 32/32 10899.0 [ 4.2%/ 0.4%]60.3% |  10780.3   [ 4.4%/ 0.4%] 
>-1.1%   55.9% | 10339.2 [ 9.6%/ 0.9%]  -5.1%97.7% |
> | 64/64 15086.1 [11.5%/ 1.2%]97.7% |  14262.0   [ 8.2%/ 0.8%] 
>-5.5%   82.0% | 11168.7 [22.2%/ 1.7%] -26.0%   100.0% |
> |128/12815371.9 [22.0%/ 2.2%]   100.0% |  14675.8   [14.4%/ 1.4%] 
>-4.5%   82.8% | 10963.9 [18.5%/ 1.4%] -28.7%   100.0% |
> |256/25615990.8 [22.0%/ 2.2%]   100.0% |  12227.9   [10.3%/ 1.0%] 
>   -23.5%   73.2% | 10469.9 [19.6%/ 1.7%] -34.5%   100.0% |
> '--'
> 

That's really nice and clear.

We start to see the penalty for the coresched at 32/32, leaving some cpus more 
idle than otherwise.  
But it's pretty good overall, for this benchmark at least.

Is this with stock v2 or with any of the fixes posted after? I wonder how much
the fixes for the race that violates the rule affect this, for example.



Cheers,
Phil


> Thanks,
> -Aubrey

-- 


Re: [RFC PATCH v2 00/17] Core scheduling v2

2019-04-29 Thread Li, Aubrey
On 2019/4/29 14:14, Ingo Molnar wrote:
> 
> * Li, Aubrey  wrote:
> 
>>> I suspect it's pretty low, below 1% for all rows?
>>
>> Hope my mail box works for this...
>>
>> .-.
>> |NA/AVX vanilla-SMT [std% / sem%] | coresched-SMT   [std% / sem%] 
>> +/- |  no-SMT [std% / sem%]+/-  |
>> |-|
>> |  1/1508.5 [ 0.2%/ 0.0%] | 504.7   [ 1.1%/ 0.1%]
>> -0.8%|   509.0 [ 0.2%/ 0.0%]0.1% |
>> |  2/2   1000.2 [ 1.4%/ 0.1%] |1004.1   [ 1.6%/ 0.2%] 
>> 0.4%|   997.6 [ 1.2%/ 0.1%]   -0.3% |
>> |  4/4   1912.1 [ 1.0%/ 0.1%] |1904.2   [ 1.1%/ 0.1%]
>> -0.4%|  1914.9 [ 1.3%/ 0.1%]0.1% |
>> |  8/8   3753.5 [ 0.3%/ 0.0%] |3748.2   [ 0.3%/ 0.0%]
>> -0.1%|  3751.3 [ 0.4%/ 0.0%]   -0.1% |
>> | 16/16  7139.3 [ 2.4%/ 0.2%] |7137.9   [ 1.8%/ 0.2%]
>> -0.0%|  7049.2 [ 2.4%/ 0.2%]   -1.3% |
>> | 32/32 10899.0 [ 4.2%/ 0.4%] |   10780.3   [ 4.4%/ 0.4%]
>> -1.1%| 10339.2 [ 9.6%/ 0.9%]   -5.1% |
>> | 64/64 15086.1 [11.5%/ 1.2%] |   14262.0   [ 8.2%/ 0.8%]
>> -5.5%| 11168.7 [22.2%/ 1.7%]  -26.0% |
>> |128/12815371.9 [22.0%/ 2.2%] |   14675.8   [14.4%/ 1.4%]
>> -4.5%| 10963.9 [18.5%/ 1.4%]  -28.7% |
>> |256/25615990.8 [22.0%/ 2.2%] |   12227.9   [10.3%/ 1.0%]   
>> -23.5%| 10469.9 [19.6%/ 1.7%]  -34.5% |
>> '-'
> 
> Perfectly presented, thank you very much!

My pleasure! ;-)

> 
> My final question would be about the environment:
> 
>> Skylake server, 2 numa nodes, 104 CPUs (HT on)
> 
> Is the typical nr_running value the sum of 'NA+AVX', i.e. is it ~256 
> threads for the 128/128 row for example - or is it 128 parallel tasks?

That means 128 sysbench threads and 128 gemmbench tasks, so 256 threads in sum.
> 
> I.e. showing the approximate CPU thread-load figure column would be very 
> useful too, where '50%' shows half-loaded, '100%' fully-loaded, '200%' 
> over-saturated, etc. - for each row?

See below, hope this helps.
.-------------------------------------------------------------------------------------------------------------------------------.
| NA/AVX   vanilla-SMT [std% / sem%]    cpu% | coresched-SMT [std% / sem%]     +/-    cpu% |  no-SMT [std% / sem%]     +/-    cpu% |
|-------------------------------------------------------------------------------------------------------------------------------|
|   1/1       508.5 [ 0.2%/ 0.0%]       2.1% |     504.7 [ 1.1%/ 0.1%]       -0.8%    2.1% |   509.0 [ 0.2%/ 0.0%]    0.1%    4.3% |
|   2/2      1000.2 [ 1.4%/ 0.1%]       4.1% |    1004.1 [ 1.6%/ 0.2%]        0.4%    4.1% |   997.6 [ 1.2%/ 0.1%]   -0.3%    8.1% |
|   4/4      1912.1 [ 1.0%/ 0.1%]       7.9% |    1904.2 [ 1.1%/ 0.1%]       -0.4%    7.9% |  1914.9 [ 1.3%/ 0.1%]    0.1%   15.1% |
|   8/8      3753.5 [ 0.3%/ 0.0%]      14.9% |    3748.2 [ 0.3%/ 0.0%]       -0.1%   14.9% |  3751.3 [ 0.4%/ 0.0%]   -0.1%   30.5% |
|  16/16     7139.3 [ 2.4%/ 0.2%]      30.3% |    7137.9 [ 1.8%/ 0.2%]       -0.0%   30.3% |  7049.2 [ 2.4%/ 0.2%]   -1.3%   60.4% |
|  32/32    10899.0 [ 4.2%/ 0.4%]      60.3% |   10780.3 [ 4.4%/ 0.4%]       -1.1%   55.9% | 10339.2 [ 9.6%/ 0.9%]   -5.1%   97.7% |
|  64/64    15086.1 [11.5%/ 1.2%]      97.7% |   14262.0 [ 8.2%/ 0.8%]       -5.5%   82.0% | 11168.7 [22.2%/ 1.7%]  -26.0%  100.0% |
| 128/128   15371.9 [22.0%/ 2.2%]     100.0% |   14675.8 [14.4%/ 1.4%]       -4.5%   82.8% | 10963.9 [18.5%/ 1.4%]  -28.7%  100.0% |
| 256/256   15990.8 [22.0%/ 2.2%]     100.0% |   12227.9 [10.3%/ 1.0%]      -23.5%   73.2% | 10469.9 [19.6%/ 1.7%]  -34.5%  100.0% |
'-------------------------------------------------------------------------------------------------------------------------------'

Thanks,
-Aubrey


Re: [RFC PATCH v2 00/17] Core scheduling v2

2019-04-28 Thread Ingo Molnar


* Li, Aubrey  wrote:

> > I suspect it's pretty low, below 1% for all rows?
> 
> Hope my mail box works for this...
> 
> .-.
> |NA/AVX vanilla-SMT [std% / sem%] | coresched-SMT   [std% / sem%] +/- 
> |  no-SMT [std% / sem%]+/-  |
> |-|
> |  1/1508.5 [ 0.2%/ 0.0%] | 504.7   [ 1.1%/ 0.1%]
> -0.8%|   509.0 [ 0.2%/ 0.0%]0.1% |
> |  2/2   1000.2 [ 1.4%/ 0.1%] |1004.1   [ 1.6%/ 0.2%] 
> 0.4%|   997.6 [ 1.2%/ 0.1%]   -0.3% |
> |  4/4   1912.1 [ 1.0%/ 0.1%] |1904.2   [ 1.1%/ 0.1%]
> -0.4%|  1914.9 [ 1.3%/ 0.1%]0.1% |
> |  8/8   3753.5 [ 0.3%/ 0.0%] |3748.2   [ 0.3%/ 0.0%]
> -0.1%|  3751.3 [ 0.4%/ 0.0%]   -0.1% |
> | 16/16  7139.3 [ 2.4%/ 0.2%] |7137.9   [ 1.8%/ 0.2%]
> -0.0%|  7049.2 [ 2.4%/ 0.2%]   -1.3% |
> | 32/32 10899.0 [ 4.2%/ 0.4%] |   10780.3   [ 4.4%/ 0.4%]
> -1.1%| 10339.2 [ 9.6%/ 0.9%]   -5.1% |
> | 64/64 15086.1 [11.5%/ 1.2%] |   14262.0   [ 8.2%/ 0.8%]
> -5.5%| 11168.7 [22.2%/ 1.7%]  -26.0% |
> |128/12815371.9 [22.0%/ 2.2%] |   14675.8   [14.4%/ 1.4%]
> -4.5%| 10963.9 [18.5%/ 1.4%]  -28.7% |
> |256/25615990.8 [22.0%/ 2.2%] |   12227.9   [10.3%/ 1.0%]   
> -23.5%| 10469.9 [19.6%/ 1.7%]  -34.5% |
> '-'

Perfectly presented, thank you very much!

My final question would be about the environment:

> Skylake server, 2 numa nodes, 104 CPUs (HT on)

Is the typical nr_running value the sum of 'NA+AVX', i.e. is it ~256 
threads for the 128/128 row for example - or is it 128 parallel tasks?

I.e. showing the approximate CPU thread-load figure column would be very 
useful too, where '50%' shows half-loaded, '100%' fully-loaded, '200%' 
over-saturated, etc. - for each row?

Thanks,

Ingo


Re: [RFC PATCH v2 00/17] Core scheduling v2

2019-04-28 Thread Aaron Lu
On Tue, Apr 23, 2019 at 06:45:27PM +, Vineeth Remanan Pillai wrote:
> >> - Processes with different tags can still share the core
> 
> > I may have missed something... Could you explain this statement?
> 
> > This, to me, is the whole point of the patch series. If it's not
> > doing this then ... what?
> 
> What I meant was, the patch needs some more work to be accurate.
> There are some race conditions where the core violation can still
> happen. In our testing, we saw around 1 to 5% of the time being
> shared with incompatible processes. One example of this happening
> is as follows(let cpu 0 and 1 be siblings):
> - cpu 0 selects a process with a cookie
> - cpu 1 selects a higher priority process without cookie
> - Selection process restarts for cpu 0 and it might select a
>   process with cookie but with lesser priority.
> - Since it is lesser priority, the logic in pick_next_task
>   doesn't compare again for the cookie(trusts pick_task) and
>   proceeds.
> 
> This is one of the scenarios that we saw from traces, but there
> might be other race conditions as well. Fix seems a little
> involved and we are working on that.

This is what I have used to make sure no two unmatched tasks are being
scheduled on the same core (on top of v1; I think it's easier to just
show the diff instead of commenting on various places of the patches :-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index cb24a0141e57..0cdb1c6a00a4 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -186,6 +186,10 @@ struct task_struct *sched_core_find(struct rq *rq, unsigned long cookie)
 */
match = idle_sched_class.pick_task(rq);
 
+   /* TODO: untagged tasks are not in the core tree */
+   if (!cookie)
+   goto out;
+
while (node) {
node_task = container_of(node, struct task_struct, core_node);
 
@@ -199,6 +203,7 @@ struct task_struct *sched_core_find(struct rq *rq, unsigned long cookie)
}
}
 
+out:
return match;
 }
 
@@ -3634,6 +3639,8 @@ static inline bool cookie_match(struct task_struct *a, struct task_struct *b)
 }
 
 // XXX fairness/fwd progress conditions
+// when max is unset, return class_pick;
+// when max is set, return cookie_pick unless class_pick has higher priority.
 static struct task_struct *
 pick_task(struct rq *rq, const struct sched_class *class, struct task_struct *max)
 {
@@ -3652,7 +3659,19 @@ pick_task(struct rq *rq, const struct sched_class *class, struct task_struct *max)
}
 
class_pick = class->pick_task(rq);
-   if (!cookie)
+   /*
+* we can only return class_pick here when max is not set.
+*
+* when max is set and cookie is 0, we still have to check if
+* class_pick's cookie matches with max, or we can end up picking
+* an unmatched task. e.g. max is untagged and class_pick here
+* is tagged.
+*/
+   if (!cookie && !max)
+   return class_pick;
+
+   /* in case class_pick matches with max, no need to check priority */
+   if (class_pick && cookie_match(class_pick, max))
return class_pick;
 
cookie_pick = sched_core_find(rq, cookie);
@@ -3663,8 +3682,11 @@ pick_task(struct rq *rq, const struct sched_class *class, struct task_struct *max)
 * If class > max && class > cookie, it is the highest priority task on
 * the core (so far) and it must be selected, otherwise we must go with
 * the cookie pick in order to satisfy the constraint.
+*
+* class_pick and cookie_pick are on the same cpu so use cpu_prio_less()
+* max and class_pick are on different cpus so use core_prio_less()
 */
-   if (cpu_prio_less(cookie_pick, class_pick) && cpu_prio_less(max, class_pick))
+   if (cpu_prio_less(cookie_pick, class_pick) && core_prio_less(max, class_pick))
return class_pick;
 
return cookie_pick;
@@ -3731,8 +3753,17 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 
rq_i->core_pick = NULL;
 
-   if (i != cpu)
+   if (i != cpu) {
update_rq_clock(rq_i);
+   /*
+* we are going to pick tasks for both cpus, if our
+* sibling is idle and we have core_cookie set, now
+* is the time to clear/reset it so that we can do
+* an unconstrained pick.
+*/
+   if (is_idle_task(rq_i->curr) && rq_i->core->core_cookie)
+   rq_i->core->core_cookie = 0;
+   }
}
 
/*
@@ -3794,20 +3825,42 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 *
 * NOTE: this is a linear max-filter and is thus bounded
 * in execution time.

Re: [RFC PATCH v2 00/17] Core scheduling v2

2019-04-28 Thread Li, Aubrey
On 2019/4/28 20:17, Ingo Molnar wrote:
> 
> * Aubrey Li  wrote:
> 
>> On Sun, Apr 28, 2019 at 5:33 PM Ingo Molnar  wrote:
>>> So because I'm a big fan of presenting data in a readable fashion, here
>>> are your results, tabulated:
>>
>> I thought I tried my best to make it readable, but this one looks much 
>> better,
>> thanks, ;-)
>>>
>>>  #
>>>  # Sysbench throughput comparison of 3 different kernels at different
>>>  # load levels, higher numbers are better:
>>>  #
>>>
>>>  
>>> .--|.
>>>  |  NA/AVX vanilla-SMT[stddev%] |coresched-SMT   [stddev%]   +/-  | 
>>>   no-SMT[stddev%]   +/-  |
>>>  
>>> |--||
>>>  |   1/1 508.5[  0.2% ] |504.7   [  1.1% ]   0.8% | 
>>>509.0[  0.2% ]   0.1% |
>>>  |   2/21000.2[  1.4% ] |   1004.1   [  1.6% ]   0.4% | 
>>>997.6[  1.2% ]   0.3% |
>>>  |   4/41912.1[  1.0% ] |   1904.2   [  1.1% ]   0.4% | 
>>>   1914.9[  1.3% ]   0.1% |
>>>  |   8/83753.5[  0.3% ] |   3748.2   [  0.3% ]   0.1% | 
>>>   3751.3[  0.4% ]   0.1% |
>>>  |  16/16   7139.3[  2.4% ] |   7137.9   [  1.8% ]   0.0% | 
>>>   7049.2[  2.4% ]   1.3% |
>>>  |  32/32  10899.0[  4.2% ] |  10780.3   [  4.4% ]  -1.1% | 
>>>  10339.2[  9.6% ]  -5.1% |
>>>  |  64/64  15086.1[ 11.5% ] |  14262.0   [  8.2% ]  -5.5% | 
>>>  11168.7[ 22.2% ] -26.0% |
>>>  | 128/128 15371.9[ 22.0% ] |  14675.8   [ 14.4% ]  -4.5% | 
>>>  10963.9[ 18.5% ] -28.7% |
>>>  | 256/256 15990.8[ 22.0% ] |  12227.9   [ 10.3% ] -23.5% | 
>>>  10469.9[ 19.6% ] -34.5% |
>>>  
>>> '--|'
>>>
>>> One major thing that sticks out is that if we compare the stddev numbers
>>> to the +/- comparisons then it's pretty clear that the benchmarks are
>>> very noisy: in all but the last row stddev is actually higher than the
>>> measured effect.
>>>
>>> So what does 'stddev' mean here, exactly? The stddev of multipe runs,
>>> i.e. measured run-to-run variance? Or is it some internal metric of the
>>> benchmark?
>>>
>>
>> The benchmark periodically reports intermediate statistics in one second,
>> the raw log looks like below:
>> [ 11s ] thds: 256 eps: 14346.72 lat (ms,95%): 44.17
>> [ 12s ] thds: 256 eps: 14328.45 lat (ms,95%): 44.17
>> [ 13s ] thds: 256 eps: 13773.06 lat (ms,95%): 43.39
>> [ 14s ] thds: 256 eps: 13752.31 lat (ms,95%): 43.39
>> [ 15s ] thds: 256 eps: 15362.79 lat (ms,95%): 43.39
>> [ 16s ] thds: 256 eps: 26580.65 lat (ms,95%): 35.59
>> [ 17s ] thds: 256 eps: 15011.78 lat (ms,95%): 36.89
>> [ 18s ] thds: 256 eps: 15025.78 lat (ms,95%): 39.65
>> [ 19s ] thds: 256 eps: 15350.87 lat (ms,95%): 39.65
>> [ 20s ] thds: 256 eps: 15491.70 lat (ms,95%): 36.89
>>
>> I have a Python script to parse eps (events per second) and lat (latency)
>> out and compute the average and stddev (and I can draw a curve locally).
>> 
>> It's noisy indeed when the number of tasks is greater than the number of CPUs.
>> It's probably caused by frequent load balancing and context switches.
> 
> Ok, so it's basically an internal workload noise metric, it doesn't 
> represent the run-to-run noise.
> 
> So it's the real stddev of the workload - but we don't know whether the 
> measured performance figure is exactly in the middle of the runtime 
> probability distribution.
> 
>> Do you have any suggestions? Or any other information I can provide?
> 
> Yeah, so we don't just want to know the "standard deviation" of the 
> measured throughput values, but also the "standard error of the mean".
> 
> I suspect it's pretty low, below 1% for all rows?

Hope my mail box works for this...

.----------------------------------------------------------------------------------------------------------.
| NA/AVX   vanilla-SMT [std% / sem%] | coresched-SMT [std% / sem%]      +/-  |  no-SMT [std% / sem%]    +/-  |
|----------------------------------------------------------------------------------------------------------|
|   1/1       508.5 [ 0.2%/ 0.0%]    |     504.7 [ 1.1%/ 0.1%]        -0.8%  |   509.0 [ 0.2%/ 0.0%]   0.1%  |
|   2/2      1000.2 [ 1.4%/ 0.1%]    |    1004.1 [ 1.6%/ 0.2%]         0.4%  |   997.6 [ 1.2%/ 0.1%]  -0.3%  |
|   4/4      1912.1 [ 1.0%/ 0.1%]    |    1904.2 [ 1.1%/ 0.1%]        -0.4%  |  1914.9 [ 1.3%/ 0.1%]   0.1%  |
|   8/8      3753.5 [ 0.3%/ 0.0%]    |    3748.2 [ 0.3%/ 0.0%]        -0.1%  |  3751.3 [ 0.4%/ 0.0%]  -0.1%  |
|  16/16     7139.3 [ 2.4%/ 0.2%]    |    7137.9 [ 1.8%/ 0.2%]        -0.0%  |  7049.2 [ 2.4%/ 0.2%]  -1.3%  |
|  32/32    10899.0 [ 4.2%/ 0.4%]    |   10780.3 [ 4.4%/ 0.4%]        -1.1%  | 10339.2 [ 9.6%/ 0.9%]  -5.1%  |
|  64/64    15086.1 [11.5%/ 1.2%]    |   14262.0 [ 8.2%/ 0.8%]        -5.5%  | 11168.7 [22.2%/ 1.7%] -26.0%  |
| 128/128   15371.9 [22.0%/ 2.2%]    |   14675.8 [14.4%/ 1.4%]        -4.5%  | 10963.9 [18.5%/ 1.4%] -28.7%  |
| 256/256   15990.8 [22.0%/ 2.2%]    |   12227.9 [10.3%/ 1.0%]       -23.5%  | 10469.9 [19.6%/ 1.7%] -34.5%  |
'----------------------------------------------------------------------------------------------------------'

Re: [RFC PATCH v2 00/17] Core scheduling v2

2019-04-28 Thread Ingo Molnar


* Aubrey Li  wrote:

> On Sun, Apr 28, 2019 at 5:33 PM Ingo Molnar  wrote:
> > So because I'm a big fan of presenting data in a readable fashion, here
> > are your results, tabulated:
> 
> I thought I tried my best to make it readable, but this one looks much better,
> thanks, ;-)
> >
> >  #
> >  # Sysbench throughput comparison of 3 different kernels at different
> >  # load levels, higher numbers are better:
> >  #
> >
> >  
> > .--|.
> >  |  NA/AVX vanilla-SMT[stddev%] |coresched-SMT   [stddev%]   +/-  | 
> >   no-SMT[stddev%]   +/-  |
> >  
> > |--||
> >  |   1/1 508.5[  0.2% ] |504.7   [  1.1% ]   0.8% | 
> >509.0[  0.2% ]   0.1% |
> >  |   2/21000.2[  1.4% ] |   1004.1   [  1.6% ]   0.4% | 
> >997.6[  1.2% ]   0.3% |
> >  |   4/41912.1[  1.0% ] |   1904.2   [  1.1% ]   0.4% | 
> >   1914.9[  1.3% ]   0.1% |
> >  |   8/83753.5[  0.3% ] |   3748.2   [  0.3% ]   0.1% | 
> >   3751.3[  0.4% ]   0.1% |
> >  |  16/16   7139.3[  2.4% ] |   7137.9   [  1.8% ]   0.0% | 
> >   7049.2[  2.4% ]   1.3% |
> >  |  32/32  10899.0[  4.2% ] |  10780.3   [  4.4% ]  -1.1% | 
> >  10339.2[  9.6% ]  -5.1% |
> >  |  64/64  15086.1[ 11.5% ] |  14262.0   [  8.2% ]  -5.5% | 
> >  11168.7[ 22.2% ] -26.0% |
> >  | 128/128 15371.9[ 22.0% ] |  14675.8   [ 14.4% ]  -4.5% | 
> >  10963.9[ 18.5% ] -28.7% |
> >  | 256/256 15990.8[ 22.0% ] |  12227.9   [ 10.3% ] -23.5% | 
> >  10469.9[ 19.6% ] -34.5% |
> >  
> > '--|'
> >
> > One major thing that sticks out is that if we compare the stddev numbers
> > to the +/- comparisons then it's pretty clear that the benchmarks are
> > very noisy: in all but the last row stddev is actually higher than the
> > measured effect.
> >
> > So what does 'stddev' mean here, exactly? The stddev of multiple runs,
> > i.e. measured run-to-run variance? Or is it some internal metric of the
> > benchmark?
> >
> 
> The benchmark periodically reports intermediate statistics in one second,
> the raw log looks like below:
> [ 11s ] thds: 256 eps: 14346.72 lat (ms,95%): 44.17
> [ 12s ] thds: 256 eps: 14328.45 lat (ms,95%): 44.17
> [ 13s ] thds: 256 eps: 13773.06 lat (ms,95%): 43.39
> [ 14s ] thds: 256 eps: 13752.31 lat (ms,95%): 43.39
> [ 15s ] thds: 256 eps: 15362.79 lat (ms,95%): 43.39
> [ 16s ] thds: 256 eps: 26580.65 lat (ms,95%): 35.59
> [ 17s ] thds: 256 eps: 15011.78 lat (ms,95%): 36.89
> [ 18s ] thds: 256 eps: 15025.78 lat (ms,95%): 39.65
> [ 19s ] thds: 256 eps: 15350.87 lat (ms,95%): 39.65
> [ 20s ] thds: 256 eps: 15491.70 lat (ms,95%): 36.89
> 
> I have a Python script to parse eps (events per second) and lat (latency)
> out and compute the average and stddev (and I can draw a curve locally).
> 
> It's noisy indeed when the number of tasks is greater than the number of CPUs.
> It's probably caused by frequent load balancing and context switches.

Ok, so it's basically an internal workload noise metric, it doesn't 
represent the run-to-run noise.

So it's the real stddev of the workload - but we don't know whether the 
measured performance figure is exactly in the middle of the runtime 
probability distribution.

> Do you have any suggestions? Or any other information I can provide?

Yeah, so we don't just want to know the "standard deviation" of the 
measured throughput values, but also the "standard error of the mean".

I suspect it's pretty low, below 1% for all rows?
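
For reference, a trivial standalone sketch (not from your scripts, just to
pin down the definition) that computes std% and sem% from the per-second
eps samples you quoted:

  #include <math.h>
  #include <stdio.h>

  /* std% = stddev/mean, sem% = (stddev/sqrt(N))/mean, both as percentages */
  static void stats(const double *eps, int n)
  {
          double mean = 0.0, var = 0.0;
          int i;

          for (i = 0; i < n; i++)
                  mean += eps[i];
          mean /= n;

          for (i = 0; i < n; i++)
                  var += (eps[i] - mean) * (eps[i] - mean);
          var /= n;       /* population variance; use n - 1 for the sample variance */

          printf("mean %.1f  std%% %.1f  sem%% %.1f\n", mean,
                 100.0 * sqrt(var) / mean,
                 100.0 * sqrt(var / n) / mean);
  }

  int main(void)
  {
          /* the ten 256/256 eps samples quoted earlier in this thread */
          double eps[] = { 14346.72, 14328.45, 13773.06, 13752.31, 15362.79,
                           26580.65, 15011.78, 15025.78, 15350.87, 15491.70 };

          stats(eps, (int)(sizeof(eps) / sizeof(eps[0])));
          return 0;
  }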

Thanks,

Ingo


Re: [RFC PATCH v2 00/17] Core scheduling v2

2019-04-28 Thread Aubrey Li
On Sun, Apr 28, 2019 at 5:33 PM Ingo Molnar  wrote:
> So because I'm a big fan of presenting data in a readable fashion, here
> are your results, tabulated:

I thought I tried my best to make it readable, but this one looks much better,
thanks, ;-)
>
>  #
>  # Sysbench throughput comparison of 3 different kernels at different
>  # load levels, higher numbers are better:
>  #
>
>  
> .--|.
>  |  NA/AVX vanilla-SMT[stddev%] |coresched-SMT   [stddev%]   +/-  |   
> no-SMT[stddev%]   +/-  |
>  
> |--||
>  |   1/1 508.5[  0.2% ] |504.7   [  1.1% ]   0.8% |   
>  509.0[  0.2% ]   0.1% |
>  |   2/21000.2[  1.4% ] |   1004.1   [  1.6% ]   0.4% |   
>  997.6[  1.2% ]   0.3% |
>  |   4/41912.1[  1.0% ] |   1904.2   [  1.1% ]   0.4% |   
> 1914.9[  1.3% ]   0.1% |
>  |   8/83753.5[  0.3% ] |   3748.2   [  0.3% ]   0.1% |   
> 3751.3[  0.4% ]   0.1% |
>  |  16/16   7139.3[  2.4% ] |   7137.9   [  1.8% ]   0.0% |   
> 7049.2[  2.4% ]   1.3% |
>  |  32/32  10899.0[  4.2% ] |  10780.3   [  4.4% ]  -1.1% |  
> 10339.2[  9.6% ]  -5.1% |
>  |  64/64  15086.1[ 11.5% ] |  14262.0   [  8.2% ]  -5.5% |  
> 11168.7[ 22.2% ] -26.0% |
>  | 128/128 15371.9[ 22.0% ] |  14675.8   [ 14.4% ]  -4.5% |  
> 10963.9[ 18.5% ] -28.7% |
>  | 256/256 15990.8[ 22.0% ] |  12227.9   [ 10.3% ] -23.5% |  
> 10469.9[ 19.6% ] -34.5% |
>  
> '--|'
>
> One major thing that sticks out is that if we compare the stddev numbers
> to the +/- comparisons then it's pretty clear that the benchmarks are
> very noisy: in all but the last row stddev is actually higher than the
> measured effect.
>
> So what does 'stddev' mean here, exactly? The stddev of multiple runs,
> i.e. measured run-to-run variance? Or is it some internal metric of the
> benchmark?
>

The benchmark periodically reports intermediate statistics in one second,
the raw log looks like below:
[ 11s ] thds: 256 eps: 14346.72 lat (ms,95%): 44.17
[ 12s ] thds: 256 eps: 14328.45 lat (ms,95%): 44.17
[ 13s ] thds: 256 eps: 13773.06 lat (ms,95%): 43.39
[ 14s ] thds: 256 eps: 13752.31 lat (ms,95%): 43.39
[ 15s ] thds: 256 eps: 15362.79 lat (ms,95%): 43.39
[ 16s ] thds: 256 eps: 26580.65 lat (ms,95%): 35.59
[ 17s ] thds: 256 eps: 15011.78 lat (ms,95%): 36.89
[ 18s ] thds: 256 eps: 15025.78 lat (ms,95%): 39.65
[ 19s ] thds: 256 eps: 15350.87 lat (ms,95%): 39.65
[ 20s ] thds: 256 eps: 15491.70 lat (ms,95%): 36.89

I have a Python script to parse eps (events per second) and lat (latency)
out and compute the average and stddev (and I can draw a curve locally).

It's noisy indeed when the number of tasks is greater than the number of CPUs.
It's probably caused by frequent load balancing and context switches.
Do you have any suggestions? Or any other information I can provide?

Thanks,
-Aubrey


Re: [RFC PATCH v2 00/17] Core scheduling v2

2019-04-28 Thread Ingo Molnar


* Aubrey Li  wrote:

> > But what we are really interested in are throughput numbers under 
> > these three kernel variants, right?
> 
> These are sysbench events per second number, higher is better.
> 
> NA/AVX  baseline(std%)  coresched(std%) +/- nosmt(std%) +/-
> 1/1       508.5( 0.2%)    504.7( 1.1%)  -0.8%    509.0( 0.2%)   0.1%
> NA/AVX  baseline(std%)  coresched(std%) +/- nosmt(std%) +/-
> 2/2      1000.2( 1.4%)   1004.1( 1.6%)   0.4%    997.6( 1.2%)  -0.3%
> NA/AVX  baseline(std%)  coresched(std%) +/- nosmt(std%) +/-
> 4/4      1912.1( 1.0%)   1904.2( 1.1%)  -0.4%   1914.9( 1.3%)   0.1%
> NA/AVX  baseline(std%)  coresched(std%) +/- nosmt(std%) +/-
> 8/8      3753.5( 0.3%)   3748.2( 0.3%)  -0.1%   3751.3( 0.4%)  -0.1%
> NA/AVX  baseline(std%)  coresched(std%) +/- nosmt(std%) +/-
> 16/16    7139.3( 2.4%)   7137.9( 1.8%)  -0.0%   7049.2( 2.4%)  -1.3%
> NA/AVX  baseline(std%)  coresched(std%) +/- nosmt(std%) +/-
> 32/32   10899.0( 4.2%)  10780.3( 4.4%)  -1.1%  10339.2( 9.6%)  -5.1%
> NA/AVX  baseline(std%)  coresched(std%) +/- nosmt(std%) +/-
> 64/64   15086.1(11.5%)  14262.0( 8.2%)  -5.5%  11168.7(22.2%) -26.0%
> NA/AVX  baseline(std%)  coresched(std%) +/- nosmt(std%) +/-
> 128/128 15371.9(22.0%)  14675.8(14.4%)  -4.5%  10963.9(18.5%) -28.7%
> NA/AVX  baseline(std%)  coresched(std%) +/- nosmt(std%) +/-
> 256/256 15990.8(22.0%)  12227.9(10.3%) -23.5%  10469.9(19.6%) -34.5%

So because I'm a big fan of presenting data in a readable fashion, here 
are your results, tabulated:

 #
 # Sysbench throughput comparison of 3 different kernels at different 
 # load levels, higher numbers are better:
 #

 
   NA/AVX | vanilla-SMT [stddev%] | coresched-SMT [stddev%]     +/- |      no-SMT [stddev%]     +/-
  --------|-----------------------|---------------------------------|------------------------------
      1/1 |       508.5 [  0.2% ] |         504.7 [  1.1% ]    0.8% |       509.0 [  0.2% ]    0.1%
      2/2 |      1000.2 [  1.4% ] |        1004.1 [  1.6% ]    0.4% |       997.6 [  1.2% ]    0.3%
      4/4 |      1912.1 [  1.0% ] |        1904.2 [  1.1% ]    0.4% |      1914.9 [  1.3% ]    0.1%
      8/8 |      3753.5 [  0.3% ] |        3748.2 [  0.3% ]    0.1% |      3751.3 [  0.4% ]    0.1%
    16/16 |      7139.3 [  2.4% ] |        7137.9 [  1.8% ]    0.0% |      7049.2 [  2.4% ]    1.3%
    32/32 |     10899.0 [  4.2% ] |       10780.3 [  4.4% ]   -1.1% |     10339.2 [  9.6% ]   -5.1%
    64/64 |     15086.1 [ 11.5% ] |       14262.0 [  8.2% ]   -5.5% |     11168.7 [ 22.2% ]  -26.0%
  128/128 |     15371.9 [ 22.0% ] |       14675.8 [ 14.4% ]   -4.5% |     10963.9 [ 18.5% ]  -28.7%
  256/256 |     15990.8 [ 22.0% ] |       12227.9 [ 10.3% ]  -23.5% |     10469.9 [ 19.6% ]  -34.5%

One major thing that sticks out is that if we compare the stddev numbers 
to the +/- comparisons then it's pretty clear that the benchmarks are 
very noisy: in all but the last row stddev is actually higher than the 
measured effect.

So what does 'stddev' mean here, exactly? The stddev of multiple runs, 
i.e. measured run-to-run variance? Or is it some internal metric of the 
benchmark?

Thanks,

Ingo


Re: [RFC PATCH v2 00/17] Core scheduling v2

2019-04-27 Thread Aubrey Li
On Sat, Apr 27, 2019 at 10:21 PM Ingo Molnar  wrote:
>
> * Aubrey Li  wrote:
>
> > On Sat, Apr 27, 2019 at 5:17 PM Ingo Molnar  wrote:
> > >
> > >
> > > * Aubrey Li  wrote:
> > >
> > > > I have the same environment setup above, for nosmt cases, I used
> > > > /sys interface Thomas mentioned, below is the result:
> > > >
> > > > NA/AVX  baseline(std%)  coresched(std%) +/- nosmt(std%) +/-
> > > > 1/1  1.987( 1.97%)   2.043( 1.76%) -2.84% 1.985( 1.70%)  0.12%
> > > > NA/AVX  baseline(std%)  coresched(std%) +/- nosmt(std%) +/-
> > > > 2/2  2.074( 1.16%)   2.057( 2.09%)  0.81% 2.072( 0.77%)  0.10%
> > > > NA/AVX  baseline(std%)  coresched(std%) +/- nosmt(std%) +/-
> > > > 4/4  2.140( 0.00%)   2.138( 0.49%)  0.09% 2.137( 0.89%)  0.12%
> > > > NA/AVX  baseline(std%)  coresched(std%) +/- nosmt(std%) +/-
> > > > 8/8  2.140( 0.00%)   2.144( 0.53%) -0.17% 2.140( 0.00%)  0.00%
> > > > NA/AVX  baseline(std%)  coresched(std%) +/- nosmt(std%) +/-
> > > > 16/16    2.361( 2.99%)   2.369( 2.65%) -0.30% 2.406( 2.53%) -1.87%
> > > > NA/AVX  baseline(std%)  coresched(std%) +/- nosmt(std%) +/-
> > > > 32/32    5.032( 8.68%)   3.485( 0.49%) 30.76% 6.002(27.21%) -19.27%
> > > > NA/AVX  baseline(std%)  coresched(std%) +/- nosmt(std%) +/-
> > > > 64/64    7.577(34.35%)   3.972(23.18%) 47.57% 18.235(14.14%) -140.68%
> > > > NA/AVX  baseline(std%)  coresched(std%) +/- nosmt(std%) +/-
> > > > 128/128 24.639(14.28%)  27.440( 8.24%) -11.37% 34.746( 6.92%) -41.02%
> > > > NA/AVX  baseline(std%)  coresched(std%) +/- nosmt(std%) +/-
> > > > 256/256 38.797( 8.59%)  44.067(16.20%) -13.58% 42.536( 7.57%) -9.64%
> > >
> > > What do these numbers mean? Are these latencies, i.e. lower is better?
> >
> > Yeah, with the setup above, I ran sysbench (non-AVX task, NA) and gemmbench
> > (AVX512 task, AVX) at different utilization levels. The machine has 104
> > CPUs, so nosmt has 52 CPUs. These numbers are the 95th percentile latency
> > of sysbench; lower is better.
>
> But what we are really interested in are throughput numbers under these
> three kernel variants, right?
>

These are sysbench events-per-second numbers; higher is better.

NA/AVX  baseline(std%)  coresched(std%) +/- nosmt(std%) +/-
1/1       508.5( 0.2%)    504.7( 1.1%)  -0.8%    509.0( 0.2%)   0.1%
NA/AVX  baseline(std%)  coresched(std%) +/- nosmt(std%) +/-
2/2      1000.2( 1.4%)   1004.1( 1.6%)   0.4%    997.6( 1.2%)  -0.3%
NA/AVX  baseline(std%)  coresched(std%) +/- nosmt(std%) +/-
4/4      1912.1( 1.0%)   1904.2( 1.1%)  -0.4%   1914.9( 1.3%)   0.1%
NA/AVX  baseline(std%)  coresched(std%) +/- nosmt(std%) +/-
8/8      3753.5( 0.3%)   3748.2( 0.3%)  -0.1%   3751.3( 0.4%)  -0.1%
NA/AVX  baseline(std%)  coresched(std%) +/- nosmt(std%) +/-
16/16    7139.3( 2.4%)   7137.9( 1.8%)  -0.0%   7049.2( 2.4%)  -1.3%
NA/AVX  baseline(std%)  coresched(std%) +/- nosmt(std%) +/-
32/32   10899.0( 4.2%)  10780.3( 4.4%)  -1.1%  10339.2( 9.6%)  -5.1%
NA/AVX  baseline(std%)  coresched(std%) +/- nosmt(std%) +/-
64/64   15086.1(11.5%)  14262.0( 8.2%)  -5.5%  11168.7(22.2%) -26.0%
NA/AVX  baseline(std%)  coresched(std%) +/- nosmt(std%) +/-
128/128 15371.9(22.0%)  14675.8(14.4%)  -4.5%  10963.9(18.5%) -28.7%
NA/AVX  baseline(std%)  coresched(std%) +/- nosmt(std%) +/-
256/256 15990.8(22.0%)  12227.9(10.3%) -23.5%  10469.9(19.6%) -34.5%


Re: [RFC PATCH v2 00/17] Core scheduling v2

2019-04-27 Thread Ingo Molnar


* Aubrey Li  wrote:

> On Sat, Apr 27, 2019 at 5:17 PM Ingo Molnar  wrote:
> >
> >
> > * Aubrey Li  wrote:
> >
> > > I have the same environment setup above, for nosmt cases, I used
> > > /sys interface Thomas mentioned, below is the result:
> > >
> > > NA/AVX  baseline(std%)  coresched(std%) +/- nosmt(std%) +/-
> > > 1/1  1.987( 1.97%)   2.043( 1.76%) -2.84% 1.985( 1.70%)  0.12%
> > > NA/AVX  baseline(std%)  coresched(std%) +/- nosmt(std%) +/-
> > > 2/2  2.074( 1.16%)   2.057( 2.09%)  0.81% 2.072( 0.77%)  0.10%
> > > NA/AVX  baseline(std%)  coresched(std%) +/- nosmt(std%) +/-
> > > 4/4  2.140( 0.00%)   2.138( 0.49%)  0.09% 2.137( 0.89%)  0.12%
> > > NA/AVX  baseline(std%)  coresched(std%) +/- nosmt(std%) +/-
> > > 8/8  2.140( 0.00%)   2.144( 0.53%) -0.17% 2.140( 0.00%)  0.00%
> > > NA/AVX  baseline(std%)  coresched(std%) +/- nosmt(std%) +/-
> > > 16/16    2.361( 2.99%)   2.369( 2.65%) -0.30% 2.406( 2.53%) -1.87%
> > > NA/AVX  baseline(std%)  coresched(std%) +/- nosmt(std%) +/-
> > > 32/32    5.032( 8.68%)   3.485( 0.49%) 30.76% 6.002(27.21%) -19.27%
> > > NA/AVX  baseline(std%)  coresched(std%) +/- nosmt(std%) +/-
> > > 64/64    7.577(34.35%)   3.972(23.18%) 47.57% 18.235(14.14%) -140.68%
> > > NA/AVX  baseline(std%)  coresched(std%) +/- nosmt(std%) +/-
> > > 128/128 24.639(14.28%)  27.440( 8.24%) -11.37% 34.746( 6.92%) -41.02%
> > > NA/AVX  baseline(std%)  coresched(std%) +/- nosmt(std%) +/-
> > > 256/256 38.797( 8.59%)  44.067(16.20%) -13.58% 42.536( 7.57%) -9.64%
> >
> > What do these numbers mean? Are these latencies, i.e. lower is better?
> 
> Yeah, with the setup above, I ran sysbench (non-AVX task, NA) and gemmbench
> (AVX512 task, AVX) at different utilization levels. The machine has 104
> CPUs, so nosmt has 52 CPUs. These numbers are the 95th percentile latency
> of sysbench; lower is better.

But what we are really interested in are throughput numbers under these 
three kernel variants, right?

Thanks,

Ingo


Re: [RFC PATCH v2 00/17] Core scheduling v2

2019-04-27 Thread Aubrey Li
On Sat, Apr 27, 2019 at 5:17 PM Ingo Molnar  wrote:
>
>
> * Aubrey Li  wrote:
>
> > I have the same environment setup above, for nosmt cases, I used
> > /sys interface Thomas mentioned, below is the result:
> >
> > NA/AVX  baseline(std%)  coresched(std%) +/- nosmt(std%) +/-
> > 1/1  1.987( 1.97%)   2.043( 1.76%) -2.84% 1.985( 1.70%)  0.12%
> > NA/AVX  baseline(std%)  coresched(std%) +/- nosmt(std%) +/-
> > 2/2  2.074( 1.16%)   2.057( 2.09%)  0.81% 2.072( 0.77%)  0.10%
> > NA/AVX  baseline(std%)  coresched(std%) +/- nosmt(std%) +/-
> > 4/4  2.140( 0.00%)   2.138( 0.49%)  0.09% 2.137( 0.89%)  0.12%
> > NA/AVX  baseline(std%)  coresched(std%) +/- nosmt(std%) +/-
> > 8/8  2.140( 0.00%)   2.144( 0.53%) -0.17% 2.140( 0.00%)  0.00%
> > NA/AVX  baseline(std%)  coresched(std%) +/- nosmt(std%) +/-
> > 16/16    2.361( 2.99%)   2.369( 2.65%) -0.30% 2.406( 2.53%) -1.87%
> > NA/AVX  baseline(std%)  coresched(std%) +/- nosmt(std%) +/-
> > 32/32    5.032( 8.68%)   3.485( 0.49%) 30.76% 6.002(27.21%) -19.27%
> > NA/AVX  baseline(std%)  coresched(std%) +/- nosmt(std%) +/-
> > 64/64    7.577(34.35%)   3.972(23.18%) 47.57% 18.235(14.14%) -140.68%
> > NA/AVX  baseline(std%)  coresched(std%) +/- nosmt(std%) +/-
> > 128/128 24.639(14.28%)  27.440( 8.24%) -11.37% 34.746( 6.92%) -41.02%
> > NA/AVX  baseline(std%)  coresched(std%) +/- nosmt(std%) +/-
> > 256/256 38.797( 8.59%)  44.067(16.20%) -13.58% 42.536( 7.57%) -9.64%
>
> What do these numbers mean? Are these latencies, i.e. lower is better?

Yeah, with the setup above, I ran sysbench (non-AVX task, NA) and gemmbench
(AVX512 task, AVX) at different utilization levels. The machine has 104 CPUs,
so nosmt has 52 CPUs. These numbers are the 95th percentile latency of
sysbench; lower is better.

Thanks,
-Aubrey


Re: [RFC PATCH v2 00/17] Core scheduling v2

2019-04-27 Thread Ingo Molnar


* Aubrey Li  wrote:

> I have the same environment setup above, for nosmt cases, I used
> /sys interface Thomas mentioned, below is the result:
> 
> NA/AVX  baseline(std%)  coresched(std%) +/- nosmt(std%) +/-
> 1/1  1.987( 1.97%)   2.043( 1.76%) -2.84% 1.985( 1.70%)  0.12%
> NA/AVX  baseline(std%)  coresched(std%) +/- nosmt(std%) +/-
> 2/2  2.074( 1.16%)   2.057( 2.09%)  0.81% 2.072( 0.77%)  0.10%
> NA/AVX  baseline(std%)  coresched(std%) +/- nosmt(std%) +/-
> 4/4  2.140( 0.00%)   2.138( 0.49%)  0.09% 2.137( 0.89%)  0.12%
> NA/AVX  baseline(std%)  coresched(std%) +/- nosmt(std%) +/-
> 8/8  2.140( 0.00%)   2.144( 0.53%) -0.17% 2.140( 0.00%)  0.00%
> NA/AVX  baseline(std%)  coresched(std%) +/- nosmt(std%) +/-
> 16/16    2.361( 2.99%)   2.369( 2.65%) -0.30% 2.406( 2.53%) -1.87%
> NA/AVX  baseline(std%)  coresched(std%) +/- nosmt(std%) +/-
> 32/32    5.032( 8.68%)   3.485( 0.49%) 30.76% 6.002(27.21%) -19.27%
> NA/AVX  baseline(std%)  coresched(std%) +/- nosmt(std%) +/-
> 64/64    7.577(34.35%)   3.972(23.18%) 47.57% 18.235(14.14%) -140.68%
> NA/AVX  baseline(std%)  coresched(std%) +/- nosmt(std%) +/-
> 128/128 24.639(14.28%)  27.440( 8.24%) -11.37% 34.746( 6.92%) -41.02%
> NA/AVX  baseline(std%)  coresched(std%) +/- nosmt(std%) +/-
> 256/256 38.797( 8.59%)  44.067(16.20%) -13.58% 42.536( 7.57%) -9.64%

What do these numbers mean? Are these latencies, i.e. lower is better?

Thanks,

Ingo


Re: [RFC PATCH v2 00/17] Core scheduling v2

2019-04-27 Thread Ingo Molnar


* Mel Gorman  wrote:

> On Fri, Apr 26, 2019 at 11:45:45AM +0200, Ingo Molnar wrote:
> > 
> > * Mel Gorman  wrote:
> > 
> > > > > I can show a comparison with equal levels of parallelisation but with 
> > > > > HT off, it is a completely broken configuration and I do not think a 
> > > > > comparison like that makes any sense.
> > > > 
> > > > I would still be interested in that comparison, because I'd like
> > > > to learn whether there's any true *inherent* performance advantage to 
> > > > HyperThreading for that particular workload, for exactly tuned 
> > > > parallelism.
> > > > 
> > > 
> > > It really isn't a fair comparison. MPI seems to behave very differently
> > > when a machine is saturated. It's documented as changing its behaviour
> > > as it tries to avoid the worst consequences of saturation.
> > > 
> > > Curiously, the results on the 2-socket machine were not as bad as I
> > > feared when the HT configuration is running with twice the number of
> > > threads as there are CPUs
> > > 
> > > Amean bt  771.15 (   0.00%) 1086.74 * -40.93%*
> > > Amean cg  445.92 (   0.00%)  543.41 * -21.86%*
> > > Amean ep   70.01 (   0.00%)   96.29 * -37.53%*
> > > Amean is   16.75 (   0.00%)   21.19 * -26.51%*
> > > Amean lu  882.84 (   0.00%)  595.14 *  32.59%*
> > > Amean mg   84.10 (   0.00%)   80.02 *   4.84%*
> > > Amean sp 1353.88 (   0.00%) 1384.10 *  -2.23%*
> > 
> > Yeah, so what I wanted to suggest is a parallel numeric throughput test 
> > with few inter-process data dependencies, and see whether HT actually 
> > improves total throughput versus the no-HT case.
> > 
> > No over-saturation - but exactly as many threads as logical CPUs.
> > 
> > I.e. with 20 physical cores and 40 logical CPUs the numbers to compare 
> > would be a 'nosmt' benchmark running 20 threads, versus a SMT test 
> > running 40 threads.
> > 
> > I.e. how much does SMT improve total throughput when the workload's 
> > parallelism is tuned to utilize 100% of the available CPUs?
> > 
> > Does this make sense?
> > 
> 
> Yes. Here is the comparison.
> 
> Amean bt  678.75 (   0.00%)  789.13 * -16.26%*
> Amean cg  261.22 (   0.00%)  428.82 * -64.16%*
> Amean ep   55.36 (   0.00%)   84.41 * -52.48%*
> Amean is   13.25 (   0.00%)   17.82 * -34.47%*
> Amean lu 1065.08 (   0.00%) 1090.44 (  -2.38%)
> Amean mg   89.96 (   0.00%)   84.28 *   6.31%*
> Amean sp 1579.52 (   0.00%) 1506.16 *   4.64%*
> Amean ua  611.87 (   0.00%)  663.26 *  -8.40%*
> 
> This is the socket machine and with HT On, there are 80 logical CPUs
> versus HT Off with 40 logical CPUs.

That's very interesting - so for most workloads HyperThreading is a 
massive loss, and for 'mg' and 'sp' it's a 5-6% win?

I'm wondering how much of say the 'cg' workload's -64% loss could be task 
placement inefficiency - or are these all probable effects of 80 threads 
trying to use too many cache and memory resources and thus utilizing it 
all way too inefficiently?

Are these relatively simple numeric workloads, with not much scheduling 
and good overall pinning of tasks, or is it more complex than that?

Also, the takeaway appears to be: by using HT there's a potential 
advantage of +6% on the benefit side, but a potential -50%+ performance 
hit on the risk side?

I believe these results also *strongly* support a much stricter task 
placement policy in up to 50% saturation of SMT systems - it's almost 
always going to be a win for workloads that are actually trying to fill 
in some useful role.

Thanks,

Ingo


Re: [RFC PATCH v2 00/17] Core scheduling v2

2019-04-26 Thread Aubrey Li
On Thu, Apr 25, 2019 at 5:55 PM Ingo Molnar  wrote:
> * Aubrey Li  wrote:
> > On Wed, Apr 24, 2019 at 10:00 PM Julien Desfossez
> >  wrote:
> > >
> > > On 24-Apr-2019 09:13:10 PM, Aubrey Li wrote:
> > > > On Wed, Apr 24, 2019 at 12:18 AM Vineeth Remanan Pillai
> > > >  wrote:
> > > > >
> > > > > Second iteration of the core-scheduling feature.
> > > > >
> > > > > This version fixes apparent bugs and performance issues in v1. This
> > > > > doesn't fully address the issue of core sharing between processes
> > > > > with different tags. Core sharing still happens 1% to 5% of the time
> > > > > based on the nature of workload and timing of the runnable processes.
> > > > >
> > > > > Changes in v2
> > > > > -
> > > > > - rebased on mainline commit: 6d906f99817951e2257d577656899da02bb33105
> > > >
> > > > Thanks for posting v2. Based on this version, here are my benchmark results.
> > > >
> > > > Environment setup
> > > > --
> > > > Skylake server, 2 numa nodes, 104 CPUs (HT on)
> > > > cgroup1 workload, sysbench (CPU intensive non AVX workload)
> > > > cgroup2 workload, gemmbench (AVX512 workload)
> > > >
> > > > Case 1: task number < CPU num
> > > > 
> > > > 36 sysbench threads in cgroup1
> > > > 36 gemmbench threads in cgroup2
> > > >
> > > > core sched off:
> > > > - sysbench 95th percentile latency(ms): avg = 4.952, stddev = 0.55342
> > > > core sched on:
> > > > - sysbench 95th percentile latency(ms): avg = 3.549, stddev = 0.04449
> > > >
> > > > Due to core cookie matching, sysbench tasks won't be affected by AVX512
> > > > tasks, latency has ~28% improvement!!!
> > > >
> > > > Case 2: task number > CPU number
> > > > -
> > > > 72 sysbench threads in cgroup1
> > > > 72 gemmbench threads in cgroup2
> > > >
> > > > core sched off:
> > > > - sysbench 95th percentile latency(ms): avg = 11.914, stddev = 3.259
> > > > core sched on:
> > > > - sysbench 95th percentile latency(ms): avg = 13.289, stddev = 4.863
> > > >
> > > > So it's not only about power now; security and performance are also a
> > > > pair of contradictions.
> > > > Due to core cookies not matching and the forced idle introduced,
> > > > latency has a ~12% regression.
> > > >
> > > > Any comments?
> > >
> > > Would it be possible to post the results with HT off as well ?
> >
> > What's the point here to turn HT off? The latency is sensitive to the
> > relationship
> > between the task number and the CPU number. Usually fewer CPUs mean more
> > run queue wait time and worse results.
>
> HT-off numbers are mandatory: turning HT off is by far the simplest way
> to solve the security bugs in these CPUs.
>
> Any core-scheduling solution *must* perform better than HT-off for all
> relevant workloads, otherwise what's the point?
>
I have the same environment setup above, for nosmt cases, I used
/sys interface Thomas mentioned, below is the result:

NA/AVX  baseline(std%)  coresched(std%) +/- nosmt(std%) +/-
1/1  1.987( 1.97%)   2.043( 1.76%) -2.84% 1.985( 1.70%)  0.12%
NA/AVX  baseline(std%)  coresched(std%) +/- nosmt(std%) +/-
2/2  2.074( 1.16%)   2.057( 2.09%)  0.81% 2.072( 0.77%)  0.10%
NA/AVX  baseline(std%)  coresched(std%) +/- nosmt(std%) +/-
4/4  2.140( 0.00%)   2.138( 0.49%)  0.09% 2.137( 0.89%)  0.12%
NA/AVX  baseline(std%)  coresched(std%) +/- nosmt(std%) +/-
8/8  2.140( 0.00%)   2.144( 0.53%) -0.17% 2.140( 0.00%)  0.00%
NA/AVX  baseline(std%)  coresched(std%) +/- nosmt(std%) +/-
16/16    2.361( 2.99%)   2.369( 2.65%) -0.30% 2.406( 2.53%) -1.87%
NA/AVX  baseline(std%)  coresched(std%) +/- nosmt(std%) +/-
32/32    5.032( 8.68%)   3.485( 0.49%) 30.76% 6.002(27.21%) -19.27%
NA/AVX  baseline(std%)  coresched(std%) +/- nosmt(std%) +/-
64/64    7.577(34.35%)   3.972(23.18%) 47.57% 18.235(14.14%) -140.68%
NA/AVX  baseline(std%)  coresched(std%) +/- nosmt(std%) +/-
128/128 24.639(14.28%)  27.440( 8.24%) -11.37% 34.746( 6.92%) -41.02%
NA/AVX  baseline(std%)  coresched(std%) +/- nosmt(std%) +/-
256/256 38.797( 8.59%)  44.067(16.20%) -13.58% 42.536( 7.57%) -9.64%

Thanks,
-Aubrey


Re: [RFC PATCH v2 00/17] Core scheduling v2

2019-04-26 Thread Mel Gorman
On Fri, Apr 26, 2019 at 11:37:11AM -0700, Subhra Mazumdar wrote:
> > > So we avoid a maybe 0.1% scheduler placement overhead but inflict 5-10%
> > > harm on the workload, and also blow up stddev by randomly co-scheduling
> > > two tasks on the same physical core? Not a good trade-off.
> > > 
> > > I really think we should implement a relatively strict physical core
> > > placement policy in the under-utilized case, and resist any attempts to
> > > weaken this for special workloads that ping-pong quickly and benefit from
> > > sharing the same physical core.
> > > 
> > It's worth a shot at least. Changes should mostly be in the wake_affine
> > path for most loads of interest.
>
> Doesn't select_idle_sibling already try to do that by calling
> select_idle_core? For our OLTP workload we in fact found the cost of
> select_idle_core was actually hurting more than it helped to find a fully
> idle core, so a net negative.
> 

select_idle_sibling is not guaranteed to call select_idle_core or to avoid
selecting an HT sibling whose other sibling is !idle, but yes, in that path,
the search cost is a general concern which is why any change there is
tricky at best.

-- 
Mel Gorman
SUSE Labs


Re: [RFC PATCH v2 00/17] Core scheduling v2

2019-04-26 Thread Subhra Mazumdar



On 4/26/19 3:43 AM, Mel Gorman wrote:

> On Fri, Apr 26, 2019 at 10:42:22AM +0200, Ingo Molnar wrote:
> > > It should, but it's not perfect. For example, wake_affine_idle does not
> > > take sibling activity into account even though select_idle_sibling *may*
> > > take it into account. Even select_idle_sibling in its fast path may use
> > > an SMT sibling instead of searching.
> > >
> > > There are also potential side-effects with cpuidle. Some workloads
> > > migrate around the socket as they are communicating because of how the
> > > search for an idle CPU works. With SMT on, there is potentially a longer
> > > opportunity for a core to reach a deep c-state and incur a bigger wakeup
> > > latency. This is a very weak theory but I've seen cases where latency
> > > sensitive workloads with only two communicating tasks are affected by
> > > CPUs reaching low c-states due to migrations.
> > >
> > > > Clearly it doesn't.
> > > >
> > > It's more that it's best effort to wakeup quickly instead of being perfect
> > > by using an expensive search every time.
> >
> > Yeah, but your numbers suggest that for *most* not heavily interacting
> > under-utilized CPU bound workloads we hurt in the 5-10% range compared to
> > no-SMT - more in some cases.
> >
> Indeed, it was higher than expected and we can't even use the excuse that
> more resources are available to a single logical CPU as the scheduler is
> meant to keep them apart.
>
> > So we avoid a maybe 0.1% scheduler placement overhead but inflict 5-10%
> > harm on the workload, and also blow up stddev by randomly co-scheduling
> > two tasks on the same physical core? Not a good trade-off.
> >
> > I really think we should implement a relatively strict physical core
> > placement policy in the under-utilized case, and resist any attempts to
> > weaken this for special workloads that ping-pong quickly and benefit from
> > sharing the same physical core.
> >
> It's worth a shot at least. Changes should mostly be in the wake_affine
> path for most loads of interest.

Doesn't select_idle_sibling already try to do that by calling
select_idle_core? For our OLTP workload we in fact found the cost of
select_idle_core was actually hurting more than it helped to find a fully
idle core, so a net negative.



Re: [RFC PATCH v2 00/17] Core scheduling v2

2019-04-26 Thread Phil Auld
On Thu, Apr 25, 2019 at 08:53:43PM +0200 Ingo Molnar wrote:
> Interesting. This strongly suggests sub-optimal SMT-scheduling in the 
> non-saturated HT case, i.e. a scheduler balancing bug.
> 
> As long as loads are clearly below the physical cores count (which they 
> are in the early phases of your table) the scheduler should spread tasks 
> without overlapping two tasks on the same core.
> 
> Clearly it doesn't.
> 

That's especially true if there are cgroups with different numbers of
tasks involved.

Here's an example showing the average number of tasks on each of the 4 numa
nodes during a test run. 20 cpus per node. There are 78 threads total, 76
for lu and 2 stress cpu hogs. So fewer than the 80 CPUs on the box. The GROUP
test has the two stresses and lu in distinct cgroups. The NORMAL test has them
all in one. This is from 5.0-rc3+, but the version doesn't matter. It's
reproducible on any kernel. SMT is on, but that also doesn't matter here.

The first two lines show where the stress jobs ran and the second show where
the 76 threads of lu ran.

GROUP_1.stress.ps.numa.hist      Average    1.00   1.00
NORMAL_1.stress.ps.numa.hist     Average    0.00   1.10   0.90

lu.C.x_76_GROUP_1.ps.numa.hist   Average   10.97  11.78  26.28  26.97
lu.C.x_76_NORMAL_1.ps.numa.hist  Average   19.70  18.70  17.80  19.80

The NORMAL test is evenly balanced across the 20 cpus per numa node.  There
is between a 4x and 10x performance hit to the lu benchmark between group
and normal in any of these test runs. In this particular case it was 10x:

76_GROUP Mop/s
min      q1       median   q3       max
3776.51  3776.51  3776.51  3776.51  3776.51
76_GROUP time
min      q1       median   q3       max
539.92   539.92   539.92   539.92   539.92
76_NORMAL Mop/s
min      q1       median   q3       max
39386    39386    39386    39386    39386
76_NORMAL time
min      q1       median   q3       max
51.77    51.77    51.77    51.77    51.77


This is a bit off topic, but since balancing bugs were mentioned and I've been
trying to track this down for a while (and learning the scheduler code in
the process) I figured I'd just throw it out there :)


Cheers,
Phil

-- 


Re: [RFC PATCH v2 00/17] Core scheduling v2

2019-04-26 Thread Mel Gorman
On Fri, Apr 26, 2019 at 10:42:22AM +0200, Ingo Molnar wrote:
> > It should, but it's not perfect. For example, wake_affine_idle does not
> > take sibling activity into account even though select_idle_sibling *may*
> > take it into account. Even select_idle_sibling in its fast path may use
> > an SMT sibling instead of searching.
> > 
> > There are also potential side-effects with cpuidle. Some workloads
> > migrate around the socket as they are communicating because of how the
> > search for an idle CPU works. With SMT on, there is potentially a longer
> > opportunity for a core to reach a deep c-state and incur a bigger wakeup
> > latency. This is a very weak theory but I've seen cases where latency
> > sensitive workloads with only two communicating tasks are affected by
> > CPUs reaching low c-states due to migrations.
> > 
> > > Clearly it doesn't.
> > > 
> > 
> > It's more that it's best effort to wakeup quickly instead of being perfect
> > by using an expensive search every time.
> 
> Yeah, but your numbers suggest that for *most* not heavily interacting 
> under-utilized CPU bound workloads we hurt in the 5-10% range compared to 
> no-SMT - more in some cases.
> 

Indeed, it was higher than expected and we can't even use the excuse that
more resources are available to a single logical CPU as the scheduler is
meant to keep them apart.

> So we avoid a maybe 0.1% scheduler placement overhead but inflict 5-10% 
> harm on the workload, and also blow up stddev by randomly co-scheduling 
> two tasks on the same physical core? Not a good trade-off.
> 
> I really think we should implement a relatively strict physical core 
> placement policy in the under-utilized case, and resist any attempts to 
> weaken this for special workloads that ping-pong quickly and benefit from 
> sharing the same physical core.
> 

It's worth a shot at least. Changes should mostly be in the wake_affine
path for most loads of interest.

> I.e. as long as load is kept below ~50% the SMT and !SMT benchmark 
> results and stddev numbers should match up. (With a bit of leeway if the
> workload gets near to 50% or occasionally goes above it.)
> 
> There's absolutely no excuse for these numbers at 30-40% load levels I
> think.
> 

Agreed. I'll put it on the todo list but there is no way I'll get to it
in the short term due to LSF/MM. Minimally I'll put some thought into
tooling to track how often siblings are used with some reporting on when a
sibling was used when there was an idle core available. That'll at least
quantify the problem and verify the hypothesis.
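
As a starting point, a rough sketch of that kind of check (only an
illustration, not the planned tooling): sample /proc/stat over a short
window and compare against the SMT topology in sysfs:

import glob
import time

def parse_cpulist(s):
    cpus = set()
    for chunk in s.strip().split(","):
        lo, _, hi = chunk.partition("-")
        cpus.update(range(int(lo), int(hi or lo) + 1))
    return frozenset(cpus)

def idle_ticks():
    ticks = {}
    with open("/proc/stat") as f:
        for line in f:
            if line.startswith("cpu") and line[3:4].isdigit():
                fields = line.split()
                ticks[int(fields[0][3:])] = int(fields[4])   # idle column
    return ticks

# Each core's SMT siblings, from the sysfs topology files.
cores = {parse_cpulist(open(p).read())
         for p in glob.glob("/sys/devices/system/cpu/cpu[0-9]*"
                            "/topology/thread_siblings_list")}

before = idle_ticks()
time.sleep(1)
after = idle_ticks()
# A CPU that accumulated no idle ticks over the window was (nearly) fully busy.
busy = {cpu for cpu in before if after[cpu] == before[cpu]}

both_busy = [c for c in cores if c <= busy]
all_idle = [c for c in cores if not (c & busy)]
print("%d cores fully busy, %d cores fully idle" % (len(both_busy), len(all_idle)))
if both_busy and all_idle:
    print("-> SMT siblings were used while an idle core was available")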

-- 
Mel Gorman
SUSE Labs


Re: [RFC PATCH v2 00/17] Core scheduling v2

2019-04-26 Thread Mel Gorman
On Fri, Apr 26, 2019 at 11:45:45AM +0200, Ingo Molnar wrote:
> 
> * Mel Gorman  wrote:
> 
> > > > I can show a comparison with equal levels of parallelisation but with 
> > > > HT off, it is a completely broken configuration and I do not think a 
> > > > comparison like that makes any sense.
> > > 
> > > I would still be interested in that comparison, because I'd like
> > > to learn whether there's any true *inherent* performance advantage to 
> > > HyperThreading for that particular workload, for exactly tuned 
> > > parallelism.
> > > 
> > 
> > It really isn't a fair comparison. MPI seems to behave very differently
> > when a machine is saturated. It's documented as changing its behaviour
> > as it tries to avoid the worst consequences of saturation.
> > 
> > Curiously, the results on the 2-socket machine were not as bad as I
> > feared when the HT configuration is running with twice the number of
> > threads as there are CPUs
> > 
> > Amean bt  771.15 (   0.00%) 1086.74 * -40.93%*
> > Amean cg  445.92 (   0.00%)  543.41 * -21.86%*
> > Amean ep   70.01 (   0.00%)   96.29 * -37.53%*
> > Amean is   16.75 (   0.00%)   21.19 * -26.51%*
> > Amean lu  882.84 (   0.00%)  595.14 *  32.59%*
> > Amean mg   84.10 (   0.00%)   80.02 *   4.84%*
> > Amean sp 1353.88 (   0.00%) 1384.10 *  -2.23%*
> 
> Yeah, so what I wanted to suggest is a parallel numeric throughput test 
> with few inter-process data dependencies, and see whether HT actually 
> improves total throughput versus the no-HT case.
> 
> No over-saturation - but exactly as many threads as logical CPUs.
> 
> I.e. with 20 physical cores and 40 logical CPUs the numbers to compare 
> would be a 'nosmt' benchmark running 20 threads, versus a SMT test 
> running 40 threads.
> 
> I.e. how much does SMT improve total throughput when the workload's 
> parallelism is tuned to utilize 100% of the available CPUs?
> 
> Does this make sense?
> 

Yes. Here is the comparison.

Amean bt  678.75 (   0.00%)  789.13 * -16.26%*
Amean cg  261.22 (   0.00%)  428.82 * -64.16%*
Amean ep   55.36 (   0.00%)   84.41 * -52.48%*
Amean is   13.25 (   0.00%)   17.82 * -34.47%*
Amean lu 1065.08 (   0.00%) 1090.44 (  -2.38%)
Amean mg   89.96 (   0.00%)   84.28 *   6.31%*
Amean sp 1579.52 (   0.00%) 1506.16 *   4.64%*
Amean ua  611.87 (   0.00%)  663.26 *  -8.40%*

This is the socket machine and with HT On, there are 80 logical CPUs
versus HT Off with 40 logical CPUs.

-- 
Mel Gorman
SUSE Labs


Re: [RFC PATCH v2 00/17] Core scheduling v2

2019-04-26 Thread Ingo Molnar


* Aubrey Li  wrote:

> On Thu, Apr 25, 2019 at 5:55 PM Ingo Molnar  wrote:
> >
> >
> > * Aubrey Li  wrote:
> >
> > > On Wed, Apr 24, 2019 at 10:00 PM Julien Desfossez
> > >  wrote:
> > > >
> > > > On 24-Apr-2019 09:13:10 PM, Aubrey Li wrote:
> > > > > On Wed, Apr 24, 2019 at 12:18 AM Vineeth Remanan Pillai
> > > > >  wrote:
> > > > > >
> > > > > > Second iteration of the core-scheduling feature.
> > > > > >
> > > > > > This version fixes apparent bugs and performance issues in v1. This
> > > > > > doesn't fully address the issue of core sharing between processes
> > > > > > with different tags. Core sharing still happens 1% to 5% of the time
> > > > > > based on the nature of workload and timing of the runnable 
> > > > > > processes.
> > > > > >
> > > > > > Changes in v2
> > > > > > -
> > > > > > - rebased on mainline commit: 
> > > > > > 6d906f99817951e2257d577656899da02bb33105
> > > > >
> > > > > Thanks for posting v2. Based on this version, here are my benchmark
> > > > > results.
> > > > >
> > > > > Environment setup
> > > > > --
> > > > > Skylake server, 2 numa nodes, 104 CPUs (HT on)
> > > > > cgroup1 workload, sysbench (CPU intensive non AVX workload)
> > > > > cgroup2 workload, gemmbench (AVX512 workload)
> > > > >
> > > > > Case 1: task number < CPU num
> > > > > 
> > > > > 36 sysbench threads in cgroup1
> > > > > 36 gemmbench threads in cgroup2
> > > > >
> > > > > core sched off:
> > > > > - sysbench 95th percentile latency(ms): avg = 4.952, stddev = 0.55342
> > > > > core sched on:
> > > > > - sysbench 95th percentile latency(ms): avg = 3.549, stddev = 0.04449
> > > > >
> > > > > Due to core cookie matching, sysbench tasks won't be affected by AVX512
> > > > > tasks, latency has ~28% improvement!!!
> > > > >
> > > > > Case 2: task number > CPU number
> > > > > -
> > > > > 72 sysbench threads in cgroup1
> > > > > 72 gemmbench threads in cgroup2
> > > > >
> > > > > core sched off:
> > > > > - sysbench 95th percentile latency(ms): avg = 11.914, stddev = 3.259
> > > > > core sched on:
> > > > > - sysbench 95th percentile latency(ms): avg = 13.289, stddev = 4.863
> > > > >
> > > > > So it's not only about power now; security and performance are also a
> > > > > pair of contradictions.
> > > > > Due to core cookies not matching and the forced idle introduced,
> > > > > latency has a ~12% regression.
> > > > >
> > > > > Any comments?
> > > >
> > > > Would it be possible to post the results with HT off as well ?
> > >
> > > What's the point here to turn HT off? The latency is sensitive to the
> > > relationship
> > > between the task number and the CPU number. Usually fewer CPUs mean more
> > > run queue wait time and worse results.
> >
> > HT-off numbers are mandatory: turning HT off is by far the simplest way
> > to solve the security bugs in these CPUs.
> >
> > Any core-scheduling solution *must* perform better than HT-off for all
> > relevant workloads, otherwise what's the point?
> >
> Got it, I'll measure HT-off cases soon.

Thanks!

Ingo


Re: [RFC PATCH v2 00/17] Core scheduling v2

2019-04-26 Thread Ingo Molnar


* Mel Gorman  wrote:

> > Interesting.
> > 
> > Here too I'm wondering whether the scheduler could do something to 
> > improve the saturated case: which *is* an important workload, as kernel 
> > hackers tend to over-load their systems a bit when building kernel, to 
> > make sure the system is at least 100% utilized. ;-)
> > 
> 
> Every so often I try but I never managed to settle on a heuristic that 
> helped this case without breaking others. The biggest hurdle is that 
> typically things are better if migrations are low but it's hard to do 
> that in a way that does not also stack tasks on the same CPUs 
> prematurely.

So instead of using heuristics (which are fragile, and most of them are
also annoyingly non-deterministic, increase overall noise and make
measurements harder) I'd suggest using SCHED_BATCH just as a hardcore
toggle to maximize for CPU-bound throughput.

It's not used very much, but the kernel build could use it by default 
(i.e. we could use a "chrt -b" call within the main Makefile), so it 
would be the perfect guinea pig and wouldn't affect anything else.
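
For illustration only (nothing from this series, just the standard
interface): the shell form would be "chrt -b 0 make -j<n>", and the same
thing done programmatically is a one-liner around sched_setscheduler(),
e.g.:

import os

# Put the current process under SCHED_BATCH (static priority must be 0);
# children inherit the policy, so exec'ing make covers the whole build.
os.sched_setscheduler(0, os.SCHED_BATCH, os.sched_param(0))
os.execvp("make", ["make", "-j", str(os.cpu_count())])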

I.e. we could use SCHED_BATCH to maximize kernel build speed, with no
regard to latency (within SCHED_BATCH workloads). I suspect this will 
also maximize bandwidth of a lot of other real-world, highly parallel but 
interacting processing workloads.

[ I'd even be willing to rename it to SCHED_KBUILD, internally. ;-) ]

Thanks,

Ingo


Re: [RFC PATCH v2 00/17] Core scheduling v2

2019-04-26 Thread Ingo Molnar


* Mel Gorman  wrote:

> > > I can show a comparison with equal levels of parallelisation but with 
> > > HT off, it is a completely broken configuration and I do not think a 
> > > comparison like that makes any sense.
> > 
> > I would still be interested in that comparison, because I'd like
> > to learn whether there's any true *inherent* performance advantage to 
> > HyperThreading for that particular workload, for exactly tuned 
> > parallelism.
> > 
> 
> It really isn't a fair comparison. MPI seems to behave very differently
> when a machine is saturated. It's documented as changing its behaviour
> as it tries to avoid the worst consequences of saturation.
> 
> Curiously, the results on the 2-socket machine were not as bad as I
> feared when the HT configuration is running with twice the number of
> threads as there are CPUs
> 
> Amean bt  771.15 (   0.00%) 1086.74 * -40.93%*
> Amean cg  445.92 (   0.00%)  543.41 * -21.86%*
> Amean ep   70.01 (   0.00%)   96.29 * -37.53%*
> Amean is   16.75 (   0.00%)   21.19 * -26.51%*
> Amean lu  882.84 (   0.00%)  595.14 *  32.59%*
> Amean mg   84.10 (   0.00%)   80.02 *   4.84%*
> Amean sp 1353.88 (   0.00%) 1384.10 *  -2.23%*

Yeah, so what I wanted to suggest is a parallel numeric throughput test 
with few inter-process data dependencies, and see whether HT actually 
improves total throughput versus the no-HT case.

No over-saturation - but exactly as many threads as logical CPUs.

I.e. with 20 physical cores and 40 logical CPUs the numbers to compare 
would be a 'nosmt' benchmark running 20 threads, versus a SMT test 
running 40 threads.

I.e. how much does SMT improve total throughput when the workload's 
parallelism is tuned to utilize 100% of the available CPUs?

Does this make sense?

Thanks,

Ingo


Re: [RFC PATCH v2 00/17] Core scheduling v2

2019-04-26 Thread Ingo Molnar


* Mel Gorman  wrote:

> > > Same -- performance is better until the machine gets saturated and
> > > disabling HT hits scaling limits earlier.
> > 
> > Interesting. This strongly suggests sub-optimal SMT-scheduling in the 
> > non-saturated HT case, i.e. a scheduler balancing bug.
> > 
> 
> Yeah, it does but mpstat didn't appear to indicate that SMT siblings are
> being used prematurely so it's a bit of a curiosity.
> 
> > As long as loads are clearly below the physical cores count (which they 
> > are in the early phases of your table) the scheduler should spread tasks 
> > without overlapping two tasks on the same core.
> > 
> 
> It should, but it's not perfect. For example, wake_affine_idle does not
> take sibling activity into account even though select_idle_sibling *may*
> take it into account. Even select_idle_sibling in its fast path may use
> an SMT sibling instead of searching.
> 
> There are also potential side-effects with cpuidle. Some workloads
> migrate around the socket as they are communicating because of how the
> search for an idle CPU works. With SMT on, there is potentially a longer
> opportunity for a core to reach a deep c-state and incur a bigger wakeup
> latency. This is a very weak theory but I've seen cases where latency
> sensitive workloads with only two communicating tasks are affected by
> CPUs reaching low c-states due to migrations.
> 
> > Clearly it doesn't.
> > 
> 
> It's more that it's best effort to wakeup quickly instead of being perfect
> by using an expensive search every time.

Yeah, but your numbers suggest that for *most* not heavily interacting 
under-utilized CPU bound workloads we hurt in the 5-10% range compared to 
no-SMT - more in some cases.

So we avoid a maybe 0.1% scheduler placement overhead but inflict 5-10% 
harm on the workload, and also blow up stddev by randomly co-scheduling 
two tasks on the same physical core? Not a good trade-off.

I really think we should implement a relatively strict physical core 
placement policy in the under-utilized case, and resist any attempts to 
weaken this for special workloads that ping-pong quickly and benefit from 
sharing the same physical core.

I.e. as long as load is kept below ~50% the SMT and !SMT benchmark 
results and stddev numbers should match up. (With a bit of leeway if the
workload gets near to 50% or occasionally goes above it.)

There's absolutely no excuse for these numbers at 30-40% load levels I
think.

Thanks,

Ingo


Re: [RFC PATCH v2 00/17] Core scheduling v2

2019-04-25 Thread Aubrey Li
On Thu, Apr 25, 2019 at 5:55 PM Ingo Molnar  wrote:
>
>
> * Aubrey Li  wrote:
>
> > On Wed, Apr 24, 2019 at 10:00 PM Julien Desfossez
> >  wrote:
> > >
> > > On 24-Apr-2019 09:13:10 PM, Aubrey Li wrote:
> > > > On Wed, Apr 24, 2019 at 12:18 AM Vineeth Remanan Pillai
> > > >  wrote:
> > > > >
> > > > > Second iteration of the core-scheduling feature.
> > > > >
> > > > > This version fixes apparent bugs and performance issues in v1. This
> > > > > doesn't fully address the issue of core sharing between processes
> > > > > with different tags. Core sharing still happens 1% to 5% of the time
> > > > > based on the nature of workload and timing of the runnable processes.
> > > > >
> > > > > Changes in v2
> > > > > -
> > > > > - rebased on mainline commit: 6d906f99817951e2257d577656899da02bb33105
> > > >
> > > > Thanks for posting v2. Based on this version, here are my benchmark results.
> > > >
> > > > Environment setup
> > > > --
> > > > Skylake server, 2 numa nodes, 104 CPUs (HT on)
> > > > cgroup1 workload, sysbench (CPU intensive non AVX workload)
> > > > cgroup2 workload, gemmbench (AVX512 workload)
> > > >
> > > > Case 1: task number < CPU num
> > > > 
> > > > 36 sysbench threads in cgroup1
> > > > 36 gemmbench threads in cgroup2
> > > >
> > > > core sched off:
> > > > - sysbench 95th percentile latency(ms): avg = 4.952, stddev = 0.55342
> > > > core sched on:
> > > > - sysbench 95th percentile latency(ms): avg = 3.549, stddev = 0.04449
> > > >
> > > > Due to core cookie matching, sysbench tasks won't be affected by AVX512
> > > > tasks, latency has ~28% improvement!!!
> > > >
> > > > Case 2: task number > CPU number
> > > > -
> > > > 72 sysbench threads in cgroup1
> > > > 72 gemmbench threads in cgroup2
> > > >
> > > > core sched off:
> > > > - sysbench 95th percentile latency(ms): avg = 11.914, stddev = 3.259
> > > > core sched on:
> > > > - sysbench 95th percentile latency(ms): avg = 13.289, stddev = 4.863
> > > >
> > > > So it's not only about power now; security and performance are also a
> > > > pair of contradictions.
> > > > Due to core cookies not matching and the forced idle introduced,
> > > > latency has a ~12% regression.
> > > >
> > > > Any comments?
> > >
> > > Would it be possible to post the results with HT off as well ?
> >
> > What's the point here to turn HT off? The latency is sensitive to the
> > relationship
> > between the task number and the CPU number. Usually fewer CPUs mean more
> > run queue wait time and worse results.
>
> HT-off numbers are mandatory: turning HT off is by far the simplest way
> to solve the security bugs in these CPUs.
>
> Any core-scheduling solution *must* perform better than HT-off for all
> relevant workloads, otherwise what's the point?
>
Got it, I'll measure HT-off cases soon.

Thanks,
-Aubrey


Re: [RFC PATCH v2 00/17] Core scheduling v2

2019-04-25 Thread Mel Gorman
On Thu, Apr 25, 2019 at 08:53:43PM +0200, Ingo Molnar wrote:
> > I don't have the data in a format that can present everything clearly,
> > but here is an attempt anyway. This is long, but the central point is
> > that when a machine is lightly loaded, HT Off generally performs
> > better than HT On and even when heavily utilised, it's still not a
> > guaranteed loss. I only suggest reading after this if you have coffee
> > and time. Ideally all this would be updated with a comparison to core
> > scheduling but I may not get it queued on my test grid before I leave
> > for LSF/MM and besides, the authors pushing this feature should be able
> > to provide supporting data justifying the complexity of the series.
> 
> BTW., a side note: I'd suggest introducing a runtime toggle 'nosmt' 
> facility, i.e. to switch a system between SMT and non-SMT execution at
> runtime, with full reversibility between these states and no restrictions.
> 
> That should make both benchmarking more convenient (no kernel reboots and 
> kernel parameters to check), and it would also make it easier for system 
> administrators to experiment with how SMT and no-SMT affects their 
> typical workloads.
> 

Noted, I wasn't aware of the option Thomas laid out but even if I was, I
probably would have used the boot parameter anyway. The grid automation
reboots between tests and it knows how to add/remove kernel command
lines so it's trivial for me to setup. There is definite value for live
experimentation as long as they know to keep an eye on the CPU enumeration
when setting up cpumasks.

> > Here is a tbench comparison scaling from a low thread count to a high
> > thread count. I picked tbench because it's relatively uncomplicated and
> > tends to be reasonable at spotting scheduler regressions. The kernel
> > version is old but for the purposes of this discussion, it doesn't matter
> > 
> > 1-socket Skylake (8 logical CPUs HT On, 4 logical CPUs HT Off)
> 
> Side question: while obviously most of the core-sched interest is 
> concentrated around Intel's HyperThreading SMT, I'm wondering whether you 
> have any data regarding AMD systems - in particular Ryzen based CPUs 
> appear to have a pretty robust SMT implementation.
> 

Unfortunately not. Such machines are available internally but they are
heavily used for functional enablement. This might change in the future
and if so, I'll queue the test.

> > 2-socket Broadwell (80 logical CPUs HT On, 40 logical CPUs HT Off)
> > 
> > smt  nosmt
> > Hmean 1      514.28 (   0.00%)   540.90 *   5.18%*
> > Hmean 2      982.19 (   0.00%)  1042.98 *   6.19%*
> > Hmean 4     1820.02 (   0.00%)  1943.38 *   6.78%*
> > Hmean 8     3356.73 (   0.00%)  3655.92 *   8.91%*
> > Hmean 16    6240.53 (   0.00%)  7057.57 *  13.09%*
> > Hmean 32   10584.60 (   0.00%) 15934.82 *  50.55%*
> > Hmean 64   24967.92 (   0.00%) 21103.79 * -15.48%*
> > Hmean 128  27106.28 (   0.00%) 20822.46 * -23.18%*
> > Hmean 256  28345.15 (   0.00%) 21625.67 * -23.71%*
> > Hmean 320  28358.54 (   0.00%) 21768.70 * -23.24%*
> > Stddev1       2.10 (   0.00%)     3.44 ( -63.59%)
> > Stddev2       2.46 (   0.00%)     4.83 ( -95.91%)
> > Stddev4       7.57 (   0.00%)     6.14 (  18.86%)
> > Stddev8       6.53 (   0.00%)    11.80 ( -80.79%)
> > Stddev16     11.23 (   0.00%)    16.03 ( -42.74%)
> > Stddev32     18.99 (   0.00%)    22.04 ( -16.10%)
> > Stddev64     10.86 (   0.00%)    14.31 ( -31.71%)
> > Stddev128    25.10 (   0.00%)    16.08 (  35.93%)
> > Stddev256    29.95 (   0.00%)    71.39 (-138.36%)
> > 
> > Same -- performance is better until the machine gets saturated and
> > disabling HT hits scaling limits earlier.
> 
> Interesting. This strongly suggests sub-optimal SMT-scheduling in the 
> non-saturated HT case, i.e. a scheduler balancing bug.
> 

Yeah, it does but mpstat didn't appear to indicate that SMT siblings are
being used prematurely so it's a bit of a curiosity.

> As long as loads are clearly below the physical cores count (which they 
> are in the early phases of your table) the scheduler should spread tasks 
> without overlapping two tasks on the same core.
> 

It should, but it's not perfect. For example, wake_affine_idle does not
take sibling activity into account even though select_idle_sibling *may*
take it into account. Even select_idle_sibling in its fast path may use
an SMT sibling instead of searching.

There are also potential side-effects with cpuidle. Some workloads
migrate around the socket as they are communicating because of how the
search for an idle CPU works. With SMT on, there is potentially a longer
opportunity for a core to reach a deep c-state and incur a bigger wakeup
latency. This is a very weak theory but I've seen cases where latency
sensitive workloads with only two communicating tasks are affected by
CPUs reaching low c-states due to migrations.

Re: [RFC PATCH v2 00/17] Core scheduling v2

2019-04-25 Thread Ingo Molnar


* Thomas Gleixner  wrote:

> It exists already: /sys/devices/system/cpu/smt/control
> 
> Setting it to off will offline all siblings, on will online them again.

Indeed, added by 05736e4ac13c last year (and I promptly forgot about it 
...) - I was thrown off a bit by the 'nosmt' flag Mel used, but that's 
probably because he used an older kernel.

Thanks,

Ingo


Re: [RFC PATCH v2 00/17] Core scheduling v2

2019-04-25 Thread Thomas Gleixner
On Thu, 25 Apr 2019, Ingo Molnar wrote:
> * Mel Gorman  wrote:
> > I don't have the data in a format that can present everything clearly,
> > but here is an attempt anyway. This is long, but the central point is
> > that when a machine is lightly loaded, HT Off generally performs
> > better than HT On and even when heavily utilised, it's still not a
> > guaranteed loss. I only suggest reading after this if you have coffee
> > and time. Ideally all this would be updated with a comparison to core
> > scheduling but I may not get it queued on my test grid before I leave
> > for LSF/MM and besides, the authors pushing this feature should be able
> > to provide supporting data justifying the complexity of the series.
> 
> BTW., a side note: I'd suggest introducing a runtime toggle 'nosmt' 
> facility, i.e. to switch a system between SMT and non-SMT execution at
> runtime, with full reversibility between these states and no restrictions.

It exists already: /sys/devices/system/cpu/smt/control

Setting it to off will offline all siblings, on will online them again.
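
A minimal sketch of driving that knob (needs root; the path is the one
given above, and writing "on"/"off" is all there is to it):

import sys

CONTROL = "/sys/devices/system/cpu/smt/control"

# "off" offlines all SMT siblings, "on" onlines them again.
state = sys.argv[1] if len(sys.argv) > 1 else "off"
with open(CONTROL, "w") as f:
    f.write(state)
with open(CONTROL) as f:
    print("smt control:", f.read().strip())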

Thanks,

tglx


Re: [RFC PATCH v2 00/17] Core scheduling v2

2019-04-25 Thread Ingo Molnar


* Mel Gorman  wrote:

> On Thu, Apr 25, 2019 at 11:55:08AM +0200, Ingo Molnar wrote:
> > > > Would it be possible to post the results with HT off as well ?
> > > 
> > > What's the point here to turn HT off? The latency is sensitive to the
> > > relationship
> > between the task number and the CPU number. Usually fewer CPUs mean more
> > run queue wait time and worse results.
> > 
> > HT-off numbers are mandatory: turning HT off is by far the simplest way 
> > to solve the security bugs in these CPUs.
> > 
> > Any core-scheduling solution *must* perform better than HT-off for all 
> > relevant workloads, otherwise what's the point?
> > 
> 
> I agree. Not only should HT-off be evaluated but it should properly
> evaluate for different levels of machine utilisation to get a complete
> picture.
> 
> Around the same time this was first posted and because of kernel
> warnings from L1TF, I did a preliminary evaluation of HT On vs HT Off
> using nosmt -- this is sub-optimal in itself but it was convenient. The
> conventional wisdom that HT gets a 30% boost appears to be primarily based
> on academic papers evaluating HPC workloads on a Pentium 4 with a focus
> on embarrassingly parallel problems which is the ideal case for HT but not
> the universal case. The conventional wisdom is questionable at best. The
> only modern comparisons I could find were focused on games primarily
> which I think hit scaling limits before HT is a factor in some cases.
> 
> I don't have the data in a format that can present everything clearly,
> but here is an attempt anyway. This is long, but the central point is
> that when a machine is lightly loaded, HT Off generally performs
> better than HT On and even when heavily utilised, it's still not a
> guaranteed loss. I only suggest reading after this if you have coffee
> and time. Ideally all this would be updated with a comparison to core
> scheduling but I may not get it queued on my test grid before I leave
> for LSF/MM and besides, the authors pushing this feature should be able
> to provide supporting data justifying the complexity of the series.

BTW., a side note: I'd suggest introducing a runtime toggle 'nosmt' 
facility, i.e. to switch a system between SMT and non-SMT execution at
runtime, with full reversibility between these states and no restrictions.

That should make both benchmarking more convenient (no kernel reboots and 
kernel parameters to check), and it would also make it easier for system 
administrators to experiment with how SMT and no-SMT affects their 
typical workloads.

> Here is a tbench comparison scaling from a low thread count to a high
> thread count. I picked tbench because it's relatively uncomplicated and
> tends to be reasonable at spotting scheduler regressions. The kernel
> version is old but for the purposes of this discussion, it doesn't matter
> 
> 1-socket Skylake (8 logical CPUs HT On, 4 logical CPUs HT Off)

Side question: while obviously most of the core-sched interest is 
concentrated around Intel's HyperThreading SMT, I'm wondering whether you 
have any data regarding AMD systems - in particular Ryzen based CPUs 
appear to have a pretty robust SMT implementation.

> Hmean 1   484.00 (   0.00%)  519.95 *   7.43%*
> Hmean 2   925.02 (   0.00%) 1022.28 *  10.51%*
> Hmean 4  1730.34 (   0.00%) 2029.81 *  17.31%*
> Hmean 8  2883.57 (   0.00%) 2040.89 * -29.22%*
> Hmean 16 2830.61 (   0.00%) 2039.74 * -27.94%*
> Hmean 32 2855.54 (   0.00%) 2042.70 * -28.47%*
> Stddev1     1.16 (   0.00%)    0.62 (  46.43%)
> Stddev2     1.31 (   0.00%)    1.00 (  23.32%)
> Stddev4     4.89 (   0.00%)   12.86 (-163.14%)
> Stddev8     4.30 (   0.00%)    2.53 (  40.99%)
> Stddev16    3.38 (   0.00%)    5.92 ( -75.08%)
> Stddev32    5.47 (   0.00%)   14.28 (-160.77%)
> 
> Note that disabling HT performs better when cores are available but hits
> scaling limits past 4 CPUs when the machine is saturated with HT off.
> It's similar with 2 sockets
>
> 2-socket Broadwell (80 logical CPUs HT On, 40 logical CPUs HT Off)
> 
> smt  nosmt
> Hmean 1      514.28 (   0.00%)   540.90 *   5.18%*
> Hmean 2      982.19 (   0.00%)  1042.98 *   6.19%*
> Hmean 4     1820.02 (   0.00%)  1943.38 *   6.78%*
> Hmean 8     3356.73 (   0.00%)  3655.92 *   8.91%*
> Hmean 16    6240.53 (   0.00%)  7057.57 *  13.09%*
> Hmean 32   10584.60 (   0.00%) 15934.82 *  50.55%*
> Hmean 64   24967.92 (   0.00%) 21103.79 * -15.48%*
> Hmean 128  27106.28 (   0.00%) 20822.46 * -23.18%*
> Hmean 256  28345.15 (   0.00%) 21625.67 * -23.71%*
> Hmean 320  28358.54 (   0.00%) 21768.70 * -23.24%*
> Stddev1       2.10 (   0.00%)     3.44 ( -63.59%)
> Stddev2       2.46 (   0.00%)     4.83 ( -95.91%)
> Stddev4

Re: [RFC PATCH v2 00/17] Core scheduling v2

2019-04-25 Thread Mel Gorman
On Thu, Apr 25, 2019 at 11:55:08AM +0200, Ingo Molnar wrote:
> > > Would it be possible to post the results with HT off as well ?
> > 
> > What's the point here to turn HT off? The latency is sensitive to the
> > relationship
> > between the task number and the CPU number. Usually fewer CPUs mean more
> > run queue wait time and worse results.
> 
> HT-off numbers are mandatory: turning HT off is by far the simplest way 
> to solve the security bugs in these CPUs.
> 
> Any core-scheduling solution *must* perform better than HT-off for all 
> relevant workloads, otherwise what's the point?
> 

I agree. Not only should HT-off be evaluated but it should properly
evaluate for different levels of machine utilisation to get a complete
picture.

Around the same time this was first posted and because of kernel
warnings from L1TF, I did a preliminary evaluation of HT On vs HT Off
using nosmt -- this is sub-optimal in itself but it was convenient. The
conventional wisdom that HT gets a 30% boost appears to be primarily based
on academic papers evaluating HPC workloads on a Pentium 4 with a focus
on embarrassingly parallel problems which is the ideal case for HT but not
the universal case. The conventional wisdom is questionable at best. The
only modern comparisons I could find were focused on games primarily
which I think hit scaling limits before HT is a factor in some cases.

I don't have the data in a format that can present everything clearly,
but here is an attempt anyway. This is long, but the central point is
that when a machine is lightly loaded, HT Off generally performs
better than HT On and even when heavily utilised, it's still not a
guaranteed loss. I only suggest reading after this if you have coffee
and time. Ideally all this would be updated with a comparison to core
scheduling but I may not get it queued on my test grid before I leave
for LSF/MM and besides, the authors pushing this feature should be able
to provide supporting data justifying the complexity of the series.

Here is a tbench comparison scaling from a low thread count to a high
thread count. I picked tbench because it's relatively uncomplicated and
tends to be reasonable at spotting scheduler regressions. The kernel
version is old but for the purposes of this discussion, it doesn't matter

1-socket Skylake (8 logical CPUs HT On, 4 logical CPUs HT Off)
smt  nosmt
Hmean 1   484.00 (   0.00%)  519.95 *   7.43%*
Hmean 2   925.02 (   0.00%) 1022.28 *  10.51%*
Hmean 4  1730.34 (   0.00%) 2029.81 *  17.31%*
Hmean 8  2883.57 (   0.00%) 2040.89 * -29.22%*
Hmean 16 2830.61 (   0.00%) 2039.74 * -27.94%*
Hmean 32 2855.54 (   0.00%) 2042.70 * -28.47%*
Stddev1     1.16 (   0.00%)    0.62 (  46.43%)
Stddev2     1.31 (   0.00%)    1.00 (  23.32%)
Stddev4     4.89 (   0.00%)   12.86 (-163.14%)
Stddev8     4.30 (   0.00%)    2.53 (  40.99%)
Stddev16    3.38 (   0.00%)    5.92 ( -75.08%)
Stddev32    5.47 (   0.00%)   14.28 (-160.77%)

Note that disabling HT performs better when cores are available but hits
scaling limits past 4 CPUs when the machine is saturated with HT off.
It's similar with 2 sockets

2-socket Broadwell (80 logical CPUs HT On, 40 logical CPUs HT Off)

smt  nosmt
Hmean 1      514.28 (   0.00%)   540.90 *   5.18%*
Hmean 2      982.19 (   0.00%)  1042.98 *   6.19%*
Hmean 4     1820.02 (   0.00%)  1943.38 *   6.78%*
Hmean 8     3356.73 (   0.00%)  3655.92 *   8.91%*
Hmean 16    6240.53 (   0.00%)  7057.57 *  13.09%*
Hmean 32   10584.60 (   0.00%) 15934.82 *  50.55%*
Hmean 64   24967.92 (   0.00%) 21103.79 * -15.48%*
Hmean 128  27106.28 (   0.00%) 20822.46 * -23.18%*
Hmean 256  28345.15 (   0.00%) 21625.67 * -23.71%*
Hmean 320  28358.54 (   0.00%) 21768.70 * -23.24%*
Stddev1       2.10 (   0.00%)     3.44 ( -63.59%)
Stddev2       2.46 (   0.00%)     4.83 ( -95.91%)
Stddev4       7.57 (   0.00%)     6.14 (  18.86%)
Stddev8       6.53 (   0.00%)    11.80 ( -80.79%)
Stddev16     11.23 (   0.00%)    16.03 ( -42.74%)
Stddev32     18.99 (   0.00%)    22.04 ( -16.10%)
Stddev64     10.86 (   0.00%)    14.31 ( -31.71%)
Stddev128    25.10 (   0.00%)    16.08 (  35.93%)
Stddev256    29.95 (   0.00%)    71.39 (-138.36%)

Same -- performance is better until the machine gets saturated and
disabling HT hits scaling limits earlier.

The workload "mutilate" is a load generator for memcached that is meant
to simulate a workload interesting to Facebook.

1-socket
Hmean 1     28570.67 (   0.00%)  31632.92 *  10.72%*
Hmean 3     76904.93 (   0.00%)  89644.73 *  16.57%*
Hmean 5    107487.40 (   0.00%)  93418.09 * -13.09%*
Hmean

Re: [RFC PATCH v2 00/17] Core scheduling v2

2019-04-25 Thread Julien Desfossez
On 23-Apr-2019 04:18:05 PM, Vineeth Remanan Pillai wrote:
> Second iteration of the core-scheduling feature.
> 
> This version fixes apparent bugs and performance issues in v1. This
> doesn't fully address the issue of core sharing between processes
> with different tags. Core sharing still happens 1% to 5% of the time
> based on the nature of workload and timing of the runnable processes.
> 
> Changes in v2
> -
> - rebased on mainline commit: 6d906f99817951e2257d577656899da02bb33105

Here are our benchmark results.

Environment setup:
--
Skylake server, 2 numa nodes, total 72 CPUs with HT on
Workload in KVM virtual machines, one cpu cgroup per VM (including qemu
and vhost threads)


Case 1: MySQL TPC-C
---
1 12-vcpus-32gb MySQL server per numa node (clients on another physical
machine)
96 semi-idle 1-vcpu-512mb VMs per numa node (sending metrics over a VPN
every 15 seconds)
--> 3 vcpus per physical CPU
Average of 10 5-minute runs.

- baseline:
  - avg tps: 1878
  - stdev tps: 47
- nosmt:
  - avg tps: 959 (-49% from baseline)
  - stdev tps: 35
- core scheduling:
  - avg tps: 1406 (-25% from baseline)
  - stdev tps: 48
  - Co-scheduling stats (5 minutes sample):
- 48.9% VM threads
- 49.6% idle
- 1.3% foreign threads

So in v2, this case with a very noisy test benefits from core
scheduling (the baseline is also better than in v1, so we probably
benefit from other changes in the kernel).


Case 2: linpack with enough room

2 12-vcpus-32gb linpack VMs both pinned on the same NUMA node (36
hardware threads with SMT on).
100k context switches/sec.
Average of 5 15-minute runs.

- baseline:
  - avg gflops: 403
  - stdev: 20
- nosmt:
  - avg gflops: 355 (-12% from baseline)
  - stdev: 28
- core scheduling:
  - avg gflops: 364 (-9% from baseline)
  - stdev: 59
  - Co-scheduling stats (5 minutes sample):
- 39.3% VM threads
- 59.3% idle
- 0.07% foreign threads

No real difference between nosmt and core scheduling when there is
enough room to run a cpu-intensive workload even with smt off.


Case 3: full node linpack
-
3 12-vcpus-32gb linpack VMs all pinned on the same NUMA node (36
hardware threads with SMT on).
155k context switches/sec
Average of 5 15-minute runs.

- baseline:
  - avg gflops: 270
  - stdev: 5
- nosmt (switching to 2:1 ratio of vcpu to hardware threads):
  - avg gflops: 209 (-22.46% from baseline)
  - stdev: 6.2
- core scheduling
  - avg gflops: 269 (-0.11% from baseline)
  - stdev: 5.7
  - Co-scheduling stats (5 minutes sample):
- 93.7% VM threads
- 6.3% idle
- 0.04% foreign threads

Here, core scheduling is a major performance improvement compared
to nosmt.

Julien


Re: [RFC PATCH v2 00/17] Core scheduling v2

2019-04-25 Thread Ingo Molnar


* Aubrey Li  wrote:

> On Wed, Apr 24, 2019 at 10:00 PM Julien Desfossez
>  wrote:
> >
> > On 24-Apr-2019 09:13:10 PM, Aubrey Li wrote:
> > > On Wed, Apr 24, 2019 at 12:18 AM Vineeth Remanan Pillai
> > >  wrote:
> > > >
> > > > Second iteration of the core-scheduling feature.
> > > >
> > > > This version fixes apparent bugs and performance issues in v1. This
> > > > doesn't fully address the issue of core sharing between processes
> > > > with different tags. Core sharing still happens 1% to 5% of the time
> > > > based on the nature of workload and timing of the runnable processes.
> > > >
> > > > Changes in v2
> > > > -
> > > > - rebased on mainline commit: 6d906f99817951e2257d577656899da02bb33105
> > >
> > > Thanks to post v2, based on this version, here is my benchmarks result.
> > >
> > > Environment setup
> > > --
> > > Skylake server, 2 numa nodes, 104 CPUs (HT on)
> > > cgroup1 workload, sysbench (CPU intensive non AVX workload)
> > > cgroup2 workload, gemmbench (AVX512 workload)
> > >
> > > Case 1: task number < CPU num
> > > 
> > > 36 sysbench threads in cgroup1
> > > 36 gemmbench threads in cgroup2
> > >
> > > core sched off:
> > > - sysbench 95th percentile latency(ms): avg = 4.952, stddev = 0.55342
> > > core sched on:
> > > - sysbench 95th percentile latency(ms): avg = 3.549, stddev = 0.04449
> > >
> > > Due to core cookie matching, sysbench tasks won't be affect by AVX512
> > > tasks, latency has ~28% improvement!!!
> > >
> > > Case 2: task number > CPU number
> > > -
> > > 72 sysbench threads in cgroup1
> > > 72 gemmbench threads in cgroup2
> > >
> > > core sched off:
> > > - sysbench 95th percentile latency(ms): avg = 11.914, stddev = 3.259
> > > core sched on:
> > > - sysbench 95th percentile latency(ms): avg = 13.289, stddev = 4.863
> > >
> > > So not only power, now security and performance is a pair of 
> > > contradictions.
> > > Due to core cookie not matching and forced idle introduced, latency has 
> > > ~12%
> > > regression.
> > >
> > > Any comments?
> >
> > Would it be possible to post the results with HT off as well ?
> 
> What's the point here to turn HT off? The latency is sensitive to the
> relationship
> between the task number and CPU number. Usually less CPU number, more run
> queue wait time, and worse result.

HT-off numbers are mandatory: turning HT off is by far the simplest way 
to solve the security bugs in these CPUs.

Any core-scheduling solution *must* perform better than HT-off for all 
relevant workloads, otherwise what's the point?

Thanks,

Ingo


Re: [RFC PATCH v2 00/17] Core scheduling v2

2019-04-24 Thread Aubrey Li
On Wed, Apr 24, 2019 at 10:00 PM Julien Desfossez
 wrote:
>
> On 24-Apr-2019 09:13:10 PM, Aubrey Li wrote:
> > On Wed, Apr 24, 2019 at 12:18 AM Vineeth Remanan Pillai
> >  wrote:
> > >
> > > Second iteration of the core-scheduling feature.
> > >
> > > This version fixes apparent bugs and performance issues in v1. This
> > > doesn't fully address the issue of core sharing between processes
> > > with different tags. Core sharing still happens 1% to 5% of the time
> > > based on the nature of workload and timing of the runnable processes.
> > >
> > > Changes in v2
> > > -
> > > - rebased on mainline commit: 6d906f99817951e2257d577656899da02bb33105
> >
> > Thanks to post v2, based on this version, here is my benchmarks result.
> >
> > Environment setup
> > --
> > Skylake server, 2 numa nodes, 104 CPUs (HT on)
> > cgroup1 workload, sysbench (CPU intensive non AVX workload)
> > cgroup2 workload, gemmbench (AVX512 workload)
> >
> > Case 1: task number < CPU num
> > 
> > 36 sysbench threads in cgroup1
> > 36 gemmbench threads in cgroup2
> >
> > core sched off:
> > - sysbench 95th percentile latency(ms): avg = 4.952, stddev = 0.55342
> > core sched on:
> > - sysbench 95th percentile latency(ms): avg = 3.549, stddev = 0.04449
> >
> > Due to core cookie matching, sysbench tasks won't be affect by AVX512
> > tasks, latency has ~28% improvement!!!
> >
> > Case 2: task number > CPU number
> > -
> > 72 sysbench threads in cgroup1
> > 72 gemmbench threads in cgroup2
> >
> > core sched off:
> > - sysbench 95th percentile latency(ms): avg = 11.914, stddev = 3.259
> > core sched on:
> > - sysbench 95th percentile latency(ms): avg = 13.289, stddev = 4.863
> >
> > So not only power, now security and performance is a pair of contradictions.
> > Due to core cookie not matching and forced idle introduced, latency has ~12%
> > regression.
> >
> > Any comments?
>
> Would it be possible to post the results with HT off as well ?

What's the point of turning HT off here? The latency is sensitive to the
relationship between the number of tasks and the number of CPUs. Generally,
fewer CPUs means more run queue wait time and worse results.

Thanks,
-Aubrey


Re: [RFC PATCH v2 00/17] Core scheduling v2

2019-04-24 Thread Julien Desfossez
On 24-Apr-2019 09:13:10 PM, Aubrey Li wrote:
> On Wed, Apr 24, 2019 at 12:18 AM Vineeth Remanan Pillai
>  wrote:
> >
> > Second iteration of the core-scheduling feature.
> >
> > This version fixes apparent bugs and performance issues in v1. This
> > doesn't fully address the issue of core sharing between processes
> > with different tags. Core sharing still happens 1% to 5% of the time
> > based on the nature of workload and timing of the runnable processes.
> >
> > Changes in v2
> > -
> > - rebased on mainline commit: 6d906f99817951e2257d577656899da02bb33105
> 
> Thanks to post v2, based on this version, here is my benchmarks result.
> 
> Environment setup
> --
> Skylake server, 2 numa nodes, 104 CPUs (HT on)
> cgroup1 workload, sysbench (CPU intensive non AVX workload)
> cgroup2 workload, gemmbench (AVX512 workload)
> 
> Case 1: task number < CPU num
> 
> 36 sysbench threads in cgroup1
> 36 gemmbench threads in cgroup2
> 
> core sched off:
> - sysbench 95th percentile latency(ms): avg = 4.952, stddev = 0.55342
> core sched on:
> - sysbench 95th percentile latency(ms): avg = 3.549, stddev = 0.04449
> 
> Due to core cookie matching, sysbench tasks won't be affect by AVX512
> tasks, latency has ~28% improvement!!!
> 
> Case 2: task number > CPU number
> -
> 72 sysbench threads in cgroup1
> 72 gemmbench threads in cgroup2
> 
> core sched off:
> - sysbench 95th percentile latency(ms): avg = 11.914, stddev = 3.259
> core sched on:
> - sysbench 95th percentile latency(ms): avg = 13.289, stddev = 4.863
> 
> So not only power, now security and performance is a pair of contradictions.
> Due to core cookie not matching and forced idle introduced, latency has ~12%
> regression.
> 
> Any comments?

Would it be possible to post the results with HT off as well ?

Thanks,

Julien


Re: [RFC PATCH v2 00/17] Core scheduling v2

2019-04-24 Thread Aubrey Li
On Wed, Apr 24, 2019 at 12:18 AM Vineeth Remanan Pillai
 wrote:
>
> Second iteration of the core-scheduling feature.
>
> This version fixes apparent bugs and performance issues in v1. This
> doesn't fully address the issue of core sharing between processes
> with different tags. Core sharing still happens 1% to 5% of the time
> based on the nature of workload and timing of the runnable processes.
>
> Changes in v2
> -
> - rebased on mainline commit: 6d906f99817951e2257d577656899da02bb33105

Thanks for posting v2. Based on this version, here are my benchmark results.

Environment setup
--
Skylake server, 2 numa nodes, 104 CPUs (HT on)
cgroup1 workload, sysbench (CPU intensive non AVX workload)
cgroup2 workload, gemmbench (AVX512 workload)

Case 1: task number < CPU num

36 sysbench threads in cgroup1
36 gemmbench threads in cgroup2

core sched off:
- sysbench 95th percentile latency(ms): avg = 4.952, stddev = 0.55342
core sched on:
- sysbench 95th percentile latency(ms): avg = 3.549, stddev = 0.04449

Due to core cookie matching, sysbench tasks won't be affected by the AVX512
tasks; latency sees a ~28% improvement!!!

Case 2: task number > CPU number
-
72 sysbench threads in cgroup1
72 gemmbench threads in cgroup2

core sched off:
- sysbench 95th percentile latency(ms): avg = 11.914, stddev = 3.259
core sched on:
- sysbench 95th percentile latency(ms): avg = 13.289, stddev = 4.863

So now it is not only power: security and performance are also at odds.
Due to core cookies not matching and the forced idle this introduces,
latency has a ~12% regression.

Any comments?

Thanks,
-Aubrey


Re: [RFC PATCH v2 00/17] Core scheduling v2

2019-04-24 Thread Vineeth Remanan Pillai
> Is this one missed? Or fixed with a better impl?
>
> The boot up CPUs don't match the possible cpu map, so the not onlined
> CPU rq->core are not initialized, which causes NULL pointer dereference
> panic in online_fair_sched_group():
>
Thanks for pointing this out. I think the ideal fix would be to
correctly initialize/clean up the coresched attributes in the CPU
hotplug code path, so that the lock can be taken successfully if the
sibling is offlined/onlined after coresched has been enabled. We are
working on another bug related to the hotplug path and will introduce
the fix in v3.
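
For illustration, the kind of hotplug-path initialization being discussed
could look roughly like the sketch below, wired to a CPU-online callback.
The function name and the exact sibling walk are assumptions made up for
the sketch (this is not the v3 fix); the only point taken from the report
is that rq->core must be valid before online_fair_sched_group() runs:

/*
 * Illustrative sketch only: make sure a CPU coming online has rq->core
 * pointing somewhere sane before the fair-class online path uses it.
 * sched_core_cpu_online() is a hypothetical name, not from the series.
 */
static int sched_core_cpu_online(unsigned int cpu)
{
	struct rq *rq = cpu_rq(cpu);
	int sibling;

	/* Default to being our own core rq. */
	rq->core = rq;

	/* If an already-online sibling has core state, share it. */
	for_each_cpu(sibling, cpu_smt_mask(cpu)) {
		struct rq *srq = cpu_rq(sibling);

		if (srq != rq && srq->core) {
			rq->core = srq->core;
			break;
		}
	}
	return 0;
}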

Thanks


Re: [RFC PATCH v2 00/17] Core scheduling v2

2019-04-23 Thread Aubrey Li
On Wed, Apr 24, 2019 at 12:18 AM Vineeth Remanan Pillai
 wrote:
>
> Second iteration of the core-scheduling feature.
>
> This version fixes apparent bugs and performance issues in v1. This
> doesn't fully address the issue of core sharing between processes
> with different tags. Core sharing still happens 1% to 5% of the time
> based on the nature of workload and timing of the runnable processes.
>
> Changes in v2
> -
> - rebased on mainline commit: 6d906f99817951e2257d577656899da02bb33105
> - Fixes for couple of NULL pointer dereference crashes
>   - Subhra Mazumdar
>   - Tim Chen

Is this one missed? Or fixed with a better implementation?

The boot-up CPUs don't match the possible CPU map, so rq->core is left
uninitialized for CPUs that were never onlined, which causes a NULL
pointer dereference panic in online_fair_sched_group():

Thanks,
-Aubrey

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 85c728d..bdabf20 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10492,6 +10492,10 @@ void online_fair_sched_group(struct task_group *tg)
 		rq = cpu_rq(i);
 		se = tg->se[i];
 
+#ifdef CONFIG_SCHED_CORE
+		if (!rq->core)
+			continue;
+#endif
 		raw_spin_lock_irq(rq_lockp(rq));
 		update_rq_clock(rq);
 		attach_entity_cfs_rq(se);

> - Improves priority comparison logic for process in different cpus
>   - Peter Zijlstra
>   - Aaron Lu
> - Fixes a hard lockup in rq locking
>   - Vineeth Pillai
>   - Julien Desfossez
> - Fixes a performance issue seen on IO heavy workloads
>   - Vineeth Pillai
>   - Julien Desfossez
> - Fix for 32bit build
>   - Aubrey Li
>
> Issues
> --
> - Processes with different tags can still share the core
> - A crash when disabling cpus with core-scheduling on
>- https://paste.debian.net/plainh/fa6bcfa8
>
> ---
>
> Peter Zijlstra (16):
>   stop_machine: Fix stop_cpus_in_progress ordering
>   sched: Fix kerneldoc comment for ia64_set_curr_task
>   sched: Wrap rq::lock access
>   sched/{rt,deadline}: Fix set_next_task vs pick_next_task
>   sched: Add task_struct pointer to sched_class::set_curr_task
>   sched/fair: Export newidle_balance()
>   sched: Allow put_prev_task() to drop rq->lock
>   sched: Rework pick_next_task() slow-path
>   sched: Introduce sched_class::pick_task()
>   sched: Core-wide rq->lock
>   sched: Basic tracking of matching tasks
>   sched: A quick and dirty cgroup tagging interface
>   sched: Add core wide task selection and scheduling.
>   sched/fair: Add a few assertions
>   sched: Trivial forced-newidle balancer
>   sched: Debug bits...
>
> Vineeth Remanan Pillai (1):
>   sched: Wake up sibling if it has something to run
>
>  include/linux/sched.h|   9 +-
>  kernel/Kconfig.preempt   |   7 +-
>  kernel/sched/core.c  | 800 +--
>  kernel/sched/cpuacct.c   |  12 +-
>  kernel/sched/deadline.c  |  99 +++--
>  kernel/sched/debug.c |   4 +-
>  kernel/sched/fair.c  | 137 +--
>  kernel/sched/idle.c  |  42 +-
>  kernel/sched/pelt.h  |   2 +-
>  kernel/sched/rt.c|  96 +++--
>  kernel/sched/sched.h | 185 ++---
>  kernel/sched/stop_task.c |  35 +-
>  kernel/sched/topology.c  |   4 +-
>  kernel/stop_machine.c|   2 +
>  14 files changed, 1145 insertions(+), 289 deletions(-)
>
> --
> 2.17.1
>


Re: [RFC PATCH v2 00/17] Core scheduling v2

2019-04-23 Thread Vineeth Remanan Pillai
>> - Processes with different tags can still share the core

> I may have missed something... Could you explain this statement?

> This, to me, is the whole point of the patch series. If it's not
> doing this then ... what?

What I meant was that the patch needs some more work to be accurate.
There are some race conditions where the core violation can still
happen. In our testing, we saw the core being shared with incompatible
processes around 1 to 5% of the time. One example of this happening
is as follows (let cpu 0 and 1 be siblings):
- cpu 0 selects a process with a cookie
- cpu 1 selects a higher priority process without a cookie
- The selection process restarts for cpu 0 and it might select a
  process with a cookie but with lower priority.
- Since it is lower priority, the logic in pick_next_task doesn't
  compare the cookie again (it trusts pick_task) and proceeds.

This is one of the scenarios we saw in traces, but there might be
other race conditions as well. The fix looks a little involved and
we are working on it.
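
To make the violated constraint concrete, it can be written as a core-wide
post-selection check like the sketch below. This is only an illustration of
the invariant, not the fix being worked on: the helper name is made up, and
it leans on the core_pick/core_cookie fields used elsewhere in the series.

/*
 * Illustration only: after core-wide selection, every sibling's pick
 * should either be idle or carry the same cookie as the highest-priority
 * pick (max). The race described above ends up breaking exactly this.
 */
static bool core_picks_compatible(struct rq *rq, struct task_struct *max)
{
	int cpu;

	for_each_cpu(cpu, cpu_smt_mask(cpu_of(rq))) {
		struct task_struct *p = cpu_rq(cpu)->core_pick;

		if (!p || is_idle_task(p))
			continue;
		if (p->core_cookie != max->core_cookie)
			return false;	/* incompatible tasks sharing the core */
	}
	return true;
}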

Thanks


Re: [RFC PATCH v2 00/17] Core scheduling v2

2019-04-23 Thread Phil Auld
Hi,

On Tue, Apr 23, 2019 at 04:18:05PM + Vineeth Remanan Pillai wrote:
> Second iteration of the core-scheduling feature.

Thanks for spinning V2 of this.

> 
> This version fixes apparent bugs and performance issues in v1. This
> doesn't fully address the issue of core sharing between processes
> with different tags. Core sharing still happens 1% to 5% of the time
> based on the nature of workload and timing of the runnable processes.
> 
> Changes in v2
> -
> - rebased on mainline commit: 6d906f99817951e2257d577656899da02bb33105
> - Fixes for couple of NULL pointer dereference crashes
>   - Subhra Mazumdar
>   - Tim Chen
> - Improves priority comparison logic for process in different cpus
>   - Peter Zijlstra
>   - Aaron Lu
> - Fixes a hard lockup in rq locking
>   - Vineeth Pillai
>   - Julien Desfossez
> - Fixes a performance issue seen on IO heavy workloads
>   - Vineeth Pillai
>   - Julien Desfossez
> - Fix for 32bit build
>   - Aubrey Li
> 
> Issues
> --
> - Processes with different tags can still share the core

I may have missed something... Could you explain this statement?

This, to me, is the whole point of the patch series. If it's not
doing this then ... what?



Thanks,
Phil



> - A crash when disabling cpus with core-scheduling on
>- https://paste.debian.net/plainh/fa6bcfa8
> 
> ---
> 
> Peter Zijlstra (16):
>   stop_machine: Fix stop_cpus_in_progress ordering
>   sched: Fix kerneldoc comment for ia64_set_curr_task
>   sched: Wrap rq::lock access
>   sched/{rt,deadline}: Fix set_next_task vs pick_next_task
>   sched: Add task_struct pointer to sched_class::set_curr_task
>   sched/fair: Export newidle_balance()
>   sched: Allow put_prev_task() to drop rq->lock
>   sched: Rework pick_next_task() slow-path
>   sched: Introduce sched_class::pick_task()
>   sched: Core-wide rq->lock
>   sched: Basic tracking of matching tasks
>   sched: A quick and dirty cgroup tagging interface
>   sched: Add core wide task selection and scheduling.
>   sched/fair: Add a few assertions
>   sched: Trivial forced-newidle balancer
>   sched: Debug bits...
> 
> Vineeth Remanan Pillai (1):
>   sched: Wake up sibling if it has something to run
> 
>  include/linux/sched.h|   9 +-
>  kernel/Kconfig.preempt   |   7 +-
>  kernel/sched/core.c  | 800 +--
>  kernel/sched/cpuacct.c   |  12 +-
>  kernel/sched/deadline.c  |  99 +++--
>  kernel/sched/debug.c |   4 +-
>  kernel/sched/fair.c  | 137 +--
>  kernel/sched/idle.c  |  42 +-
>  kernel/sched/pelt.h  |   2 +-
>  kernel/sched/rt.c|  96 +++--
>  kernel/sched/sched.h | 185 ++---
>  kernel/sched/stop_task.c |  35 +-
>  kernel/sched/topology.c  |   4 +-
>  kernel/stop_machine.c|   2 +
>  14 files changed, 1145 insertions(+), 289 deletions(-)
> 
> -- 
> 2.17.1
> 

-- 


[RFC PATCH v2 00/17] Core scheduling v2

2019-04-23 Thread Vineeth Remanan Pillai
Second iteration of the core-scheduling feature.

This version fixes apparent bugs and performance issues in v1. This
doesn't fully address the issue of core sharing between processes
with different tags. Core sharing still happens 1% to 5% of the time
based on the nature of workload and timing of the runnable processes.

Changes in v2
-
- rebased on mainline commit: 6d906f99817951e2257d577656899da02bb33105
- Fixes for couple of NULL pointer dereference crashes
  - Subhra Mazumdar
  - Tim Chen
- Improves priority comparison logic for process in different cpus
  - Peter Zijlstra
  - Aaron Lu
- Fixes a hard lockup in rq locking
  - Vineeth Pillai
  - Julien Desfossez
- Fixes a performance issue seen on IO heavy workloads
  - Vineeth Pillai
  - Julien Desfossez
- Fix for 32bit build
  - Aubrey Li

Issues
--
- Processes with different tags can still share the core
- A crash when disabling cpus with core-scheduling on
   - https://paste.debian.net/plainh/fa6bcfa8

---

Peter Zijlstra (16):
  stop_machine: Fix stop_cpus_in_progress ordering
  sched: Fix kerneldoc comment for ia64_set_curr_task
  sched: Wrap rq::lock access
  sched/{rt,deadline}: Fix set_next_task vs pick_next_task
  sched: Add task_struct pointer to sched_class::set_curr_task
  sched/fair: Export newidle_balance()
  sched: Allow put_prev_task() to drop rq->lock
  sched: Rework pick_next_task() slow-path
  sched: Introduce sched_class::pick_task()
  sched: Core-wide rq->lock
  sched: Basic tracking of matching tasks
  sched: A quick and dirty cgroup tagging interface
  sched: Add core wide task selection and scheduling.
  sched/fair: Add a few assertions
  sched: Trivial forced-newidle balancer
  sched: Debug bits...

Vineeth Remanan Pillai (1):
  sched: Wake up sibling if it has something to run

 include/linux/sched.h|   9 +-
 kernel/Kconfig.preempt   |   7 +-
 kernel/sched/core.c  | 800 +--
 kernel/sched/cpuacct.c   |  12 +-
 kernel/sched/deadline.c  |  99 +++--
 kernel/sched/debug.c |   4 +-
 kernel/sched/fair.c  | 137 +--
 kernel/sched/idle.c  |  42 +-
 kernel/sched/pelt.h  |   2 +-
 kernel/sched/rt.c|  96 +++--
 kernel/sched/sched.h | 185 ++---
 kernel/sched/stop_task.c |  35 +-
 kernel/sched/topology.c  |   4 +-
 kernel/stop_machine.c|   2 +
 14 files changed, 1145 insertions(+), 289 deletions(-)

-- 
2.17.1