Re: [RFC PATCH V2] sched: Improve scalability of select_idle_sibling using SMT balance

2018-01-10 Thread Subhra Mazumdar



On 01/09/2018 06:50 AM, Steven Sistare wrote:

On 1/8/2018 5:18 PM, Peter Zijlstra wrote:

On Mon, Jan 08, 2018 at 02:12:37PM -0800, subhra mazumdar wrote:

@@ -2751,6 +2763,31 @@ context_switch(struct rq *rq, struct task_struct *prev,
   struct task_struct *next, struct rq_flags *rf)
  {
struct mm_struct *mm, *oldmm;
+   int this_cpu = rq->cpu;
+   struct sched_domain *sd;
+   int prev_busy, next_busy;
+
+   if (rq->curr_util == UTIL_UNINITIALIZED)
+   prev_busy = 0;
+   else
+   prev_busy = (prev != rq->idle);
+   next_busy = (next != rq->idle);
+
+   /*
+* From sd_llc downward update the SMT utilization.
+* Skip the lowest level 0.
+*/
+   sd = rcu_dereference_sched(per_cpu(sd_llc, this_cpu));
+   if (next_busy != prev_busy) {
+   for_each_lower_domain(sd) {
+   if (sd->level == 0)
+   break;
+   sd_context_switch(sd, rq, next_busy - prev_busy);
+   }
+   }
+

No, we're not going to be adding atomic ops here. We've been arguing
over adding a single memory barrier to this path; atomics are just not
going to happen.

Also this is entirely the wrong way to do this, we already have code
paths that _know_ if they're going into or coming out of idle.

Yes, it would be more efficient to adjust the busy-cpu count of each level
of the hierarchy in pick_next_task_idle and put_prev_task_idle.
OK, I have moved it to pick_next_task_idle/put_prev_task_idle. Will send 
out the v3.


Thanks,
Subhra


- Steve




[RFC PATCH V3] sched: Improve scalability of select_idle_sibling using SMT balance

2018-01-12 Thread subhra mazumdar
Currently, select_idle_sibling first tries to find a fully idle core using
select_idle_core, which can potentially search all cores; if that fails, it
looks for any idle cpu using select_idle_cpu, which can potentially search
all cpus in the LLC domain. This doesn't scale for large LLC domains and
will only get worse with more cores in the future.

This patch solves the scalability problem of potentially searching all
cores or cpus by using a randomized approach called SMT balance. It
maintains a per-scheduling-group SMT utilization based on the number of
busy CPUs in the group (henceforth referred to as SMT utilization). This is
accounted at the time of context switch. The SMT utilization is maintained
only for levels below the LLC domain, i.e. only for cores. During a context
switch, each cpu of a core atomically increments or decrements the SMT
utilization variable of that core depending on whether the cpu is going
busy or idle. Since the atomic variable is per core, it is updated only by
the hyperthreads of that core, with minimal contention.

In the wakeup fast path the scheduler compares the target cpu group of
select_idle_sibling with a random group in the LLC domain w.r.t. SMT
utilization to determine a better core to schedule on. It chooses the core
which has more SMT capacity (idle cpus) left. The SMT capacity is computed
simply by subtracting the SMT utilization from the group weight, so the
comparison can be done in O(1). Finally it does an idle cpu search only in
that core, starting from a random cpu index. The random number generation
needs to be fast and uses a per-cpu pseudo random number generator.
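
To make the O(1) comparison and the per-cpu PRNG concrete, below is a
minimal userspace sketch of the idea (not the patch itself); the names
core_group, xorshift_rand and pick_core are illustrative only and do not
correspond to the identifiers used in the actual patch.

/*
 * Userspace model of the SMT balance fast path: compare the target core
 * with one randomly chosen core in the LLC and pick whichever has more
 * SMT capacity (idle cpus) left.  Compile with: gcc -O2 smt_balance.c
 */
#include <stdio.h>
#include <stdint.h>

struct core_group {
	int weight;	/* number of SMT siblings in the core */
	int smt_util;	/* busy cpus, updated at context switch */
};

/* Cheap xorshift PRNG; the real thing would keep one state per cpu. */
static uint32_t xorshift_rand(uint32_t *state)
{
	uint32_t x = *state;

	x ^= x << 13;
	x ^= x >> 17;
	x ^= x << 5;
	return *state = x;
}

/* O(1) comparison: SMT capacity = group weight - SMT utilization. */
static const struct core_group *pick_core(const struct core_group *target,
					  const struct core_group *random)
{
	int cap_t = target->weight - target->smt_util;
	int cap_r = random->weight - random->smt_util;

	return cap_r > cap_t ? random : target;
}

int main(void)
{
	uint32_t prng = 0x12345678;	/* per-cpu seed in the real code */
	struct core_group llc[4] = { {2, 2}, {2, 1}, {2, 0}, {2, 2} };
	const struct core_group *target = &llc[0];
	const struct core_group *rnd = &llc[xorshift_rand(&prng) % 4];
	const struct core_group *best = pick_core(target, rnd);

	/* An idle cpu search would then run only inside 'best',
	 * starting from a random cpu index. */
	printf("chosen core has %d idle cpu(s)\n",
	       best->weight - best->smt_util);
	return 0;
}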

Following are the numbers with various benchmarks on a x86 2 socket system
with 22 cores per socket and 2 hyperthreads per core:

hackbench process:
groups  baseline-rc6(avg)  %stdev  patch(avg)       %stdev
1       0.4797             15.75   0.4324 (+9.86%)  2.23
2       0.4877             9.99    0.4535 (+7.01%)  3.36
4       0.8603             1.09    0.8376 (+2.64%)  0.95
8       1.496              0.60    1.4516 (+2.97%)  1.38
16      2.6642             0.37    2.5857 (+2.95%)  0.68
32      4.6715             0.40    4.5158 (+3.33%)  0.67

uperf pingpong throughput with loopback interface and message size = 8k:
threads  baseline-rc6(avg)  %stdev  patch(avg)        %stdev
8        49.47              0.35    51.16 (+3.42%)    0.53
16       95.28              0.77    101.02 (+6.03%)   0.43
32       156.77             1.17    181.52 (+15.79%)  0.96
48       193.24             0.22    212.90 (+10.17%)  0.45
64       216.21             9.33    264.14 (+22.17%)  0.69
128      379.62             10.29   416.36 (+9.68%)   1.04

Oracle DB TPC-C throughput normalized to baseline:
users   baseline-rc6 norm(avg)  %stdev  patch norm(avg)   %stdev
20      1                       0.94    1.0071 (+0.71%)   1.03
40      1                       0.82    1.0126 (+1.26%)   0.65
60      1                       1.10    0.9928 (-0.72%)   0.67
80      1                       0.63    1.003 (+0.30%)    0.64
100     1                       0.82    0.9957 (-0.43%)   0.15
120     1                       0.46    1.0034 (+0.34%)   1.74
140     1                       1.44    1.0247 (+2.47%)   0.15
160     1                       0.85    1.0445 (+4.45%)   0.81
180     1                       0.19    1.0382 (+3.82%)   0.57
200     1                       1.40    1.0295 (+2.95%)   0.94
220     1                       1.02    1.0242 (+2.42%)   0.85

Following is the cost (in us) of select_idle_sibling() with hackbench 16
groups:

function               baseline-rc6  %stdev  patch            %stdev
select_idle_sibling()  0.556         1.72    0.263 (-52.70%)  0.78

Signed-off-by: subhra mazumdar 
---
 include/linux/sched/topology.h |   2 +
 kernel/sched/core.c|  43 +++
 kernel/sched/fair.c| 251 -
 kernel/sched/idle_task.c   |   3 +-
 kernel/sched/sched.h   |  28 ++---
 kernel/sched/topology.c|  35 +-
 6 files changed, 208 insertions(+), 154 deletions(-)

diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index 7d065ab..cd1f129 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -146,6 +146,8 @@ struct sched_domain {
struct sched_domain_shared *shared;
 
unsigned int span_weight;
+   struct sched_group **sg_array;
+   int sg_num;
/*
 * Span of all CPUs in this domain.
 *
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index d17c5da..8e0f6bb 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2743,6 +2743,48 @@ asmlinkage __visible void schedule_tail(struct 
task_struct *prev)
put_user(task_pid_vnr(current), current->set_child_tid);
 }
 
+#ifdef CONFIG_SCHED_SMT
+
+/*
+ * From sd_llc downward update the SMT utilization.
+ * Skip the lowest level 0.
+ */
+void smt_util(struct rq *rq, int prev_busy, int n

[RESEND RFC PATCH V3] sched: Improve scalability of select_idle_sibling using SMT balance

2018-01-29 Thread subhra mazumdar
Currently, select_idle_sibling first tries to find a fully idle core using
select_idle_core, which can potentially search all cores; if that fails, it
looks for any idle cpu using select_idle_cpu, which can potentially search
all cpus in the LLC domain. This doesn't scale for large LLC domains and
will only get worse with more cores in the future.

This patch solves the scalability problem of potentially searching all
cores or cpus by using a randomized approach called SMT balance. It
maintains a per-scheduling-group SMT utilization based on the number of
busy CPUs in the group (henceforth referred to as SMT utilization). This is
accounted at the time of context switch. The SMT utilization is maintained
only for levels below the LLC domain, i.e. only for cores. During a context
switch, each cpu of a core atomically increments or decrements the SMT
utilization variable of that core depending on whether the cpu is going
busy or idle. Since the atomic variable is per core, it is updated only by
the hyperthreads of that core, with minimal contention.

In the wakeup fast path the scheduler compares the target cpu group of
select_idle_sibling with a random group in the LLC domain w.r.t. SMT
utilization to determine a better core to schedule on. It chooses the core
which has more SMT capacity (idle cpus) left. The SMT capacity is computed
simply by subtracting the SMT utilization from the group weight, so the
comparison can be done in O(1). Finally it does an idle cpu search only in
that core, starting from a random cpu index. The random number generation
needs to be fast and uses a per-cpu pseudo random number generator.

Following are the numbers with various benchmarks on a x86 2 socket system
with 22 cores per socket and 2 hyperthreads per core:

hackbench process:
groups  baseline-rc6(avg)  %stdev  patch(avg)       %stdev
1       0.4797             15.75   0.4324 (+9.86%)  2.23
2       0.4877             9.99    0.4535 (+7.01%)  3.36
4       0.8603             1.09    0.8376 (+2.64%)  0.95
8       1.496              0.60    1.4516 (+2.97%)  1.38
16      2.6642             0.37    2.5857 (+2.95%)  0.68
32      4.6715             0.40    4.5158 (+3.33%)  0.67

uperf pingpong throughput with loopback interface and message size = 8k:
threads  baseline-rc6(avg)  %stdev  patch(avg)        %stdev
8        49.47              0.35    51.16 (+3.42%)    0.53
16       95.28              0.77    101.02 (+6.03%)   0.43
32       156.77             1.17    181.52 (+15.79%)  0.96
48       193.24             0.22    212.90 (+10.17%)  0.45
64       216.21             9.33    264.14 (+22.17%)  0.69
128      379.62             10.29   416.36 (+9.68%)   1.04

Oracle DB TPC-C throughput normalized to baseline:
users   baseline-rc6 norm(avg)  %stdev  patch norm(avg)   %stdev
20      1                       0.94    1.0071 (+0.71%)   1.03
40      1                       0.82    1.0126 (+1.26%)   0.65
60      1                       1.10    0.9928 (-0.72%)   0.67
80      1                       0.63    1.003 (+0.30%)    0.64
100     1                       0.82    0.9957 (-0.43%)   0.15
120     1                       0.46    1.0034 (+0.34%)   1.74
140     1                       1.44    1.0247 (+2.47%)   0.15
160     1                       0.85    1.0445 (+4.45%)   0.81
180     1                       0.19    1.0382 (+3.82%)   0.57
200     1                       1.40    1.0295 (+2.95%)   0.94
220     1                       1.02    1.0242 (+2.42%)   0.85

Following is the cost (in us) of select_idle_sibling() with hackbench 16
groups:

function               baseline-rc6  %stdev  patch            %stdev
select_idle_sibling()  0.556         1.72    0.263 (-52.70%)  0.78

Signed-off-by: subhra mazumdar 
---
 include/linux/sched/topology.h |   2 +
 kernel/sched/core.c|  43 +++
 kernel/sched/fair.c| 247 -
 kernel/sched/idle_task.c   |   3 +-
 kernel/sched/sched.h   |  28 ++---
 kernel/sched/topology.c|  35 +-
 6 files changed, 206 insertions(+), 152 deletions(-)

diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index cf257c2..e63e4fb 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -147,6 +147,8 @@ struct sched_domain {
struct sched_domain_shared *shared;
 
unsigned int span_weight;
+   struct sched_group **sg_array;
+   int sg_num;
/*
 * Span of all CPUs in this domain.
 *
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index a7bf32a..58f8684 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2752,6 +2752,48 @@ asmlinkage __visible void schedule_tail(struct 
task_struct *prev)
put_user(task_pid_vnr(current), current->set_child_tid);
 }
 
+#ifdef CONFIG_SCHED_SMT
+
+/*
+ * From sd_llc downward update the SMT utilization.
+ * Skip the lowest level 0.
+ */
+void smt_util(struct rq *rq, int prev_busy, int n

[RFC PATCH V2] sched: Improve scalability of select_idle_sibling using SMT balance

2018-01-08 Thread subhra mazumdar
Currently, select_idle_sibling first tries to find a fully idle core using
select_idle_core, which can potentially search all cores; if that fails, it
looks for any idle cpu using select_idle_cpu, which can potentially search
all cpus in the LLC domain. This doesn't scale for large LLC domains and
will only get worse with more cores in the future.

This patch solves the scalability problem of potentially searching all
cores or cpus by using a randomized approach called SMT balance. It
maintains a per-scheduling-group SMT utilization based on the number of
busy CPUs in the group (henceforth referred to as SMT utilization). This is
accounted at the time of context switch. The SMT utilization is maintained
only for levels below the LLC domain, i.e. only for cores. During a context
switch, each cpu of a core atomically increments or decrements the SMT
utilization variable of that core depending on whether the cpu is going
busy or idle. Since the atomic variable is per core, it is updated only by
the hyperthreads of that core, with minimal contention. The cost of the
context_switch() function below confirms that the overhead of this patch is
minimal and mostly within the noise margin.

In the wakeup fast path the scheduler compares the target cpu group of
select_idle_sibling with a random group in the LLC domain w.r.t. SMT
utilization to determine a better core to schedule on. It chooses the core
which has more SMT capacity (idle cpus) left. The SMT capacity is computed
simply by subtracting the SMT utilization from the group weight, so the
comparison can be done in O(1). Finally it does an idle cpu search only in
that core, starting from a random cpu index. The random number generation
needs to be fast and uses a per-cpu pseudo random number generator.

Following are the numbers with various benchmarks on a x86 2 socket system
with 22 cores per socket and 2 hyperthreads per core:

hackbench process:
groups  baseline-rc6(avg)  %stdev  patch(avg)       %stdev
1       0.4797             15.75   0.4324 (+9.86%)  2.23
2       0.4877             9.99    0.4535 (+7.01%)  3.36
4       0.8603             1.09    0.8376 (+2.64%)  0.95
8       1.496              0.60    1.4516 (+2.97%)  1.38
16      2.6642             0.37    2.5857 (+2.95%)  0.68
32      4.6715             0.40    4.5158 (+3.33%)  0.67

uperf pingpong throughput with loopback interface and message size = 8k:
threads  baseline-rc6(avg)  %stdev  patch(avg)        %stdev
8        49.47              0.35    51.16 (+3.42%)    0.53
16       95.28              0.77    101.02 (+6.03%)   0.43
32       156.77             1.17    181.52 (+15.79%)  0.96
48       193.24             0.22    212.90 (+10.17%)  0.45
64       216.21             9.33    264.14 (+22.17%)  0.69
128      379.62             10.29   416.36 (+9.68%)   1.04

Oracle DB TPC-C throughput normalized to baseline:
users   baseline-rc6 norm(avg)  %stdev  patch norm(avg)   %stdev
20      1                       0.94    1.0071 (+0.71%)   1.03
40      1                       0.82    1.0126 (+1.26%)   0.65
60      1                       1.10    0.9928 (-0.72%)   0.67
80      1                       0.63    1.003 (+0.30%)    0.64
100     1                       0.82    0.9957 (-0.43%)   0.15
120     1                       0.46    1.0034 (+0.34%)   1.74
140     1                       1.44    1.0247 (+2.47%)   0.15
160     1                       0.85    1.0445 (+4.45%)   0.81
180     1                       0.19    1.0382 (+3.82%)   0.57
200     1                       1.40    1.0295 (+2.95%)   0.94
220     1                       1.02    1.0242 (+2.42%)   0.85

Following are the cost (in us) of context_switch() and
select_idle_sibling() with hackbench 16 groups:

function               baseline-rc6  %stdev  patch              %stdev
context_switch()       663.8799      4.46    687.4068 (+3.54%)  2.85
select_idle_sibling()  0.556         1.72    0.263 (-52.70%)    0.78

Signed-off-by: subhra mazumdar 
---
 include/linux/sched/topology.h |   2 +
 kernel/sched/core.c|  38 +++
 kernel/sched/fair.c| 245 -
 kernel/sched/idle_task.c   |   1 -
 kernel/sched/sched.h   |  26 ++---
 kernel/sched/topology.c|  35 +-
 6 files changed, 197 insertions(+), 150 deletions(-)

diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index 7d065ab..cd1f129 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -146,6 +146,8 @@ struct sched_domain {
struct sched_domain_shared *shared;
 
unsigned int span_weight;
+   struct sched_group **sg_array;
+   int sg_num;
/*
 * Span of all CPUs in this domain.
 *
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index d17c5da..805451b 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2743,6 +2743,18 @@ asmlinkage __visible void schedule_tail(struct 
task_struct *prev)
put_user(task_pi

Re: [RFC 00/60] Coscheduling for Linux

2018-09-27 Thread Subhra Mazumdar




On 09/24/2018 08:43 AM, Jan H. Schönherr wrote:

On 09/19/2018 11:53 PM, Subhra Mazumdar wrote:

Can we have a more generic interface, like specifying a set of task ids
to be co-scheduled with a particular level rather than tying this with
cgroups? KVMs may not always run with cgroups and there might be other
use cases where we might want co-scheduling that doesn't relate to
cgroups.

Currently: no.

At this point the implementation is tightly coupled to the cpu cgroup
controller. This *might* change, if the task group optimizations mentioned
in other parts of this e-mail thread are done, as I think, that it would
decouple the various mechanisms.

That said, what if you were able to disable the "group-based fairness"
aspect of the cpu cgroup controller? Then you would be able to control
just the coscheduling aspects on their own. Would that satisfy the use
case you have in mind?

Regards
Jan

Yes, that will suffice for the use case. We wish to experiment at some
point with co-scheduling certain worker threads in DB parallel query and
see if there is any benefit.

Thanks,
Subhra


Re: [RFC 00/60] Coscheduling for Linux

2018-09-27 Thread Subhra Mazumdar




On 09/26/2018 02:58 AM, Jan H. Schönherr wrote:

On 09/17/2018 02:25 PM, Peter Zijlstra wrote:

On Fri, Sep 14, 2018 at 06:25:44PM +0200, Jan H. Schönherr wrote:


Assuming, there is a cgroup-less solution that can prevent simultaneous
execution of tasks on a core, when they're not supposed to. How would you
tell the scheduler, which tasks these are?

Specifically for L1TF I hooked into/extended KVM's preempt_notifier
registration interface, which tells us which tasks are VCPUs and to
which VM they belong.

But if we want to actually expose this to userspace, we can either do a
prctl() or extend struct sched_attr.

Both, Peter and Subhra, seem to prefer an interface different than cgroups
to specify what to coschedule.

Can you provide some extra motivation for me, why you feel that way?
(ignoring the current scalability issues with the cpu group controller)


After all, cgroups where designed to create arbitrary groups of tasks and
to attach functionality to those groups.

If we were to introduce a different interface to control that, we'd need to
introduce a whole new group concept, so that you make tasks part of some
group while at the same time preventing unauthorized tasks from joining a
group.


I currently don't see any wins, just a loss in flexibility.

Regards
Jan

I think cgroups will get the job done for any use case. But we have, e.g.,
affinity control via both sched_setaffinity and cgroup cpusets. It would be
good to have an alternative way to specify co-scheduling too, for those who
don't want to use cgroups for some reason. It can be added later on though;
only how one will override the other will need to be sorted out.


[PATCH 5/5] sched: SIS_CORE to disable idle core search

2018-06-28 Thread subhra mazumdar
Use SIS_CORE to disable idle core search. For some workloads
select_idle_core becomes a scalability bottleneck, and removing it improves
throughput. There are also workloads where disabling it can hurt latency,
so it needs to be optional.

Signed-off-by: subhra mazumdar 
---
 kernel/sched/fair.c | 8 +---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 574cb14..33fbc47 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6464,9 +6464,11 @@ static int select_idle_sibling(struct task_struct *p, 
int prev, int target)
if (!sd)
return target;
 
-   i = select_idle_core(p, sd, target);
-   if ((unsigned)i < nr_cpumask_bits)
-   return i;
+   if (sched_feat(SIS_CORE)) {
+   i = select_idle_core(p, sd, target);
+   if ((unsigned)i < nr_cpumask_bits)
+   return i;
+   }
 
i = select_idle_cpu(p, sd, target);
if ((unsigned)i < nr_cpumask_bits)
-- 
2.9.3



[PATCH 4/5] sched: add sched feature to disable idle core search

2018-06-28 Thread subhra mazumdar
Add a new sched feature SIS_CORE to have an option to disable idle core
search (select_idle_core).

Signed-off-by: subhra mazumdar 
---
 kernel/sched/features.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index 85ae848..de15733 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -57,6 +57,7 @@ SCHED_FEAT(TTWU_QUEUE, true)
  */
 SCHED_FEAT(SIS_AVG_CPU, false)
 SCHED_FEAT(SIS_PROP, true)
+SCHED_FEAT(SIS_CORE, true)
 
 /*
  * Issue a WARN when we do multiple update_rq_clock() calls
-- 
2.9.3



[PATCH 1/5] sched: limit cpu search in select_idle_cpu

2018-06-28 Thread subhra mazumdar
Put an upper and lower limit on the cpu search of select_idle_cpu. The
lower limit is the number of cpus in a core, while the upper limit is twice
that. This ensures that for any architecture we will usually search beyond
a core. The upper limit also helps keep the search cost low and constant.

Signed-off-by: subhra mazumdar 
---
 kernel/sched/fair.c | 15 +++
 1 file changed, 11 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e497c05..7243146 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6372,7 +6372,7 @@ static int select_idle_cpu(struct task_struct *p, struct 
sched_domain *sd, int t
u64 avg_cost, avg_idle;
u64 time, cost;
s64 delta;
-   int cpu, nr = INT_MAX;
+   int cpu, limit, floor, nr = INT_MAX;
 
this_sd = rcu_dereference(*this_cpu_ptr(&sd_llc));
if (!this_sd)
@@ -6390,10 +6390,17 @@ static int select_idle_cpu(struct task_struct *p, 
struct sched_domain *sd, int t
 
if (sched_feat(SIS_PROP)) {
u64 span_avg = sd->span_weight * avg_idle;
-   if (span_avg > 4*avg_cost)
+   floor = cpumask_weight(topology_sibling_cpumask(target));
+   if (floor < 2)
+   floor = 2;
+   limit = 2*floor;
+   if (span_avg > floor*avg_cost) {
nr = div_u64(span_avg, avg_cost);
-   else
-   nr = 4;
+   if (nr > limit)
+   nr = limit;
+   } else {
+   nr = floor;
+   }
}
 
time = local_clock();
-- 
2.9.3
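
As a quick sanity check of the clamping logic in the hunk above, here is a
small userspace model (illustrative only; it just mirrors the floor/limit
arithmetic with made-up avg_idle/avg_cost values): for SMT2 the search
depth stays between 2 and 4 cpus, for SMT8 between 8 and 16.

/*
 * Userspace model of the search-depth clamp: nr is kept between one
 * core's worth of cpus (floor, minimum 2) and twice that (limit).
 */
#include <stdio.h>
#include <stdint.h>

static int clamp_nr(uint64_t span_avg, uint64_t avg_cost, int smt_per_core)
{
	int floor = smt_per_core < 2 ? 2 : smt_per_core;
	int limit = 2 * floor;
	int nr;

	if (span_avg > (uint64_t)floor * avg_cost) {
		nr = (int)(span_avg / avg_cost);
		if (nr > limit)
			nr = limit;
	} else {
		nr = floor;
	}
	return nr;
}

int main(void)
{
	/* SMT2: search between 2 and 4 cpus; SMT8: between 8 and 16. */
	printf("SMT2, mostly idle: nr = %d\n", clamp_nr(1000, 10, 2)); /* 4  */
	printf("SMT2, busy:        nr = %d\n", clamp_nr(10, 10, 2));   /* 2  */
	printf("SMT8, mostly idle: nr = %d\n", clamp_nr(1000, 10, 8)); /* 16 */
	return 0;
}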



[PATCH 3/5] sched: rotate the cpu search window for better spread

2018-06-28 Thread subhra mazumdar
Rotate the cpu search window for better spread of threads. This will ensure
an idle cpu will quickly be found if one exists.

Signed-off-by: subhra mazumdar 
---
 kernel/sched/fair.c | 10 --
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 7243146..574cb14 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6372,7 +6372,7 @@ static int select_idle_cpu(struct task_struct *p, struct 
sched_domain *sd, int t
u64 avg_cost, avg_idle;
u64 time, cost;
s64 delta;
-   int cpu, limit, floor, nr = INT_MAX;
+   int cpu, limit, floor, target_tmp, nr = INT_MAX;
 
this_sd = rcu_dereference(*this_cpu_ptr(&sd_llc));
if (!this_sd)
@@ -6403,9 +6403,15 @@ static int select_idle_cpu(struct task_struct *p, struct 
sched_domain *sd, int t
}
}
 
+   if (per_cpu(next_cpu, target) != -1)
+   target_tmp = per_cpu(next_cpu, target);
+   else
+   target_tmp = target;
+
time = local_clock();
 
-   for_each_cpu_wrap(cpu, sched_domain_span(sd), target) {
+   for_each_cpu_wrap(cpu, sched_domain_span(sd), target_tmp) {
+   per_cpu(next_cpu, target) = cpu;
if (!--nr)
return -1;
if (!cpumask_test_cpu(cpu, &p->cpus_allowed))
-- 
2.9.3



[PATCH 2/5] sched: introduce per-cpu var next_cpu to track search limit

2018-06-28 Thread subhra mazumdar
Introduce a per-cpu variable to track the limit up to which the idle cpu
search was done in select_idle_cpu(). This allows the next search to start
from there. This is necessary for rotating the search window over the
entire LLC domain.

Signed-off-by: subhra mazumdar 
---
 kernel/sched/core.c  | 2 ++
 kernel/sched/sched.h | 1 +
 2 files changed, 3 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 8d59b25..b3e4ec1 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -20,6 +20,7 @@
 #include 
 
 DEFINE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues);
+DEFINE_PER_CPU_SHARED_ALIGNED(int, next_cpu);
 
 #if defined(CONFIG_SCHED_DEBUG) && defined(HAVE_JUMP_LABEL)
 /*
@@ -5996,6 +5997,7 @@ void __init sched_init(void)
for_each_possible_cpu(i) {
struct rq *rq;
 
+   per_cpu(next_cpu, i) = -1;
rq = cpu_rq(i);
raw_spin_lock_init(&rq->lock);
rq->nr_running = 0;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 67702b4..eb12b50 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -912,6 +912,7 @@ static inline void update_idle_core(struct rq *rq) { }
 #endif
 
 DECLARE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues);
+DECLARE_PER_CPU_SHARED_ALIGNED(int, next_cpu);
 
 #define cpu_rq(cpu)(&per_cpu(runqueues, (cpu)))
 #define this_rq()  this_cpu_ptr(&runqueues)
-- 
2.9.3
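
For illustration, the rotation that patches 2/5 and 3/5 implement together
can be modelled in userspace as a persistent cursor that each scan resumes
from; next_cpu below is a plain global standing in for the per-cpu
variable, and all names are illustrative only.

/*
 * Userspace model of the rotating search window: a cursor remembers where
 * the previous idle-cpu scan stopped and the next scan resumes from there,
 * wrapping around the span.
 */
#include <stdio.h>

#define NR_CPUS	8

static int next_cpu = -1;

/* Scan at most 'nr' cpus for an idle one, resuming from the saved cursor. */
static int rotating_scan(const int *idle, int target, int nr)
{
	int start = (next_cpu != -1) ? next_cpu : target;
	int i;

	for (i = 0; i < nr; i++) {
		int cpu = (start + i) % NR_CPUS;

		next_cpu = cpu;		/* remember how far we got */
		if (idle[cpu])
			return cpu;
	}
	return -1;
}

int main(void)
{
	int idle[NR_CPUS] = { 0, 0, 0, 0, 0, 1, 0, 0 };

	/* Three scans of depth 3: each resumes where the previous one
	 * stopped, so the window rotates over the whole span and the
	 * idle cpu (5) is eventually found. */
	printf("scan 1: %d\n", rotating_scan(idle, 0, 3));	/* -1 */
	printf("scan 2: %d\n", rotating_scan(idle, 0, 3));	/* -1 */
	printf("scan 3: %d\n", rotating_scan(idle, 0, 3));	/*  5 */
	return 0;
}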



[RESEND RFC/RFT V2 PATCH 0/5] Improve scheduler scalability for fast path

2018-06-28 Thread subhra mazumdar
(1.07%)   1.05

Following are the schbench performance numbers with SIS_CORE false and
SIS_PROP false. Disabling SIS_PROP as well recovers the latency increase
seen with SIS_CORE false.

Schbench on 2 socket, 44 core and 88 threads Intel x86 machine with 44 
tasks (lower is better):
percentile  baseline  %stdev  patch              %stdev
50          94        2.82    93.33 (0.71%)      1.24
75          124       2.13    122.67 (1.08%)     1.7
90          152       1.74    149.33 (1.75%)     2.35
95          171       2.11    167 (2.34%)        2.74
99          512.67    104.96  206 (59.82%)       8.86
99.5        2296      82.55   3121.67 (-35.96%)  97.37
99.9        12517.33  2.38    12592 (-0.6%)      1.67

Changes since v1
 - Compute the upper and lower limit based on number of cpus in a core
 - Split up the search limit and search window rotation into separate
   patches
 - Add new sched feature to have option of disabling idle core search

subhra mazumdar (5):
  sched: limit cpu search in select_idle_cpu
  sched: introduce per-cpu var next_cpu to track search limit
  sched: rotate the cpu search window for better spread
  sched: add sched feature to disable idle core search
  sched: SIS_CORE to disable idle core search

 kernel/sched/core.c |  2 ++
 kernel/sched/fair.c | 31 +++
 kernel/sched/features.h |  1 +
 kernel/sched/sched.h|  1 +
 4 files changed, 27 insertions(+), 8 deletions(-)

-- 
2.9.3



Re: [PATCH 1/3] sched: remove select_idle_core() for scalability

2018-05-30 Thread Subhra Mazumdar




On 05/29/2018 02:36 PM, Peter Zijlstra wrote:

On Wed, May 02, 2018 at 02:58:42PM -0700, Subhra Mazumdar wrote:

I re-ran the test after fixing that bug but still get similar regressions
for hackbench
Hackbench process on 2 socket, 44 core and 88 threads Intel x86 machine
(lower is better):
groups  baseline   %stdev  patch %stdev
1   0.5742 21.13   0.5131 (10.64%) 4.11
2   0.5776 7.87    0.5387 (6.73%) 2.39
4   0.9578 1.12    1.0549 (-10.14%) 0.85
8   1.7018 1.35    1.8516 (-8.8%) 1.56
16  2.9955 1.36    3.2466 (-8.38%) 0.42
32  5.4354 0.59    5.7738 (-6.23%) 0.38

On my IVB-EP (2 socket, 10 core/socket, 2 threads/core):

bench:

   perf stat --null --repeat 10 -- perf bench sched messaging -g $i -t -l 1 2>&1 | 
grep "seconds time elapsed"

config + results:

ORIG (SIS_PROP, shift=9)

1:0.557325175 seconds time elapsed  
( +-  0.83% )
2:0.620646551 seconds time elapsed  
( +-  1.46% )
5:2.313514786 seconds time elapsed  
( +-  2.11% )
10:3.796233615 seconds time elapsed 
 ( +-  1.57% )
20:6.319403172 seconds time elapsed 
 ( +-  1.61% )
40:9.313219134 seconds time elapsed 
 ( +-  1.03% )

PROP+AGE+ONCE shift=0

1:0.559497993 seconds time elapsed  
( +-  0.55% )
2:0.631549599 seconds time elapsed  
( +-  1.73% )
5:2.195464815 seconds time elapsed  
( +-  1.77% )
10:3.703455811 seconds time elapsed 
 ( +-  1.30% )
20:6.440869566 seconds time elapsed 
 ( +-  1.23% )
40:9.537849253 seconds time elapsed 
 ( +-  2.00% )

FOLD+AGE+ONCE+PONIES shift=0

1:0.558893325 seconds time elapsed  
( +-  0.98% )
2:0.617426276 seconds time elapsed  
( +-  1.07% )
5:2.342727231 seconds time elapsed  
( +-  1.34% )
10:3.850449091 seconds time elapsed 
 ( +-  1.07% )
20:6.622412262 seconds time elapsed 
 ( +-  0.85% )
40:9.487138039 seconds time elapsed 
 ( +-  2.88% )

FOLD+AGE+ONCE+PONIES+PONIES2 shift=0

10:   3.695294317 seconds time elapsed  
( +-  1.21% )


Which seems to not hurt anymore.. can you confirm?

Also, I didn't run anything other than hackbench on it so far.

(sorry, the code is a right mess, it's what I ended up with after a day
of poking with no cleanups)


I tested with FOLD+AGE+ONCE+PONIES+PONIES2 shift=0 vs baseline but see some
regression for hackbench and uperf:

hackbench   BL  stdev%  test    stdev% %gain
1(40 tasks) 0.5816  8.94    0.5607  2.89 3.593535
2(80 tasks) 0.6428  10.64   0.5984  3.38 6.907280
4(160 tasks)    1.0152  1.99    1.0036  2.03 1.142631
8(320 tasks)    1.8128  1.40    1.7931  0.97 1.086716
16(640 tasks)   3.1666  0.80    3.2332  0.48 -2.103207
32(1280 tasks)  5.6084  0.83    5.8489  0.56 -4.288210

Uperf    BL  stdev%  test    stdev% %gain
8 threads   45.36   0.43    45.16   0.49 -0.433536
16 threads  87.81   0.82    88.6    0.38 0.899669
32 threads  151.18  0.01    149.98  0.04 -0.795925
48 threads  190.19  0.21    184.77  0.23 -2.849681
64 threads  190.42  0.35    183.78  0.08 -3.485217
128 threads 323.85  0.27    266.32  0.68 -17.766089

sysbench    BL  stdev%  test stdev% %gain
8 threads   2095.44 1.82    2102.63  0.29 0.343006
16 threads  4218.44 0.06    4179.82  0.49 -0.915413
32 threads  7531.36 0.48    7744.72  0.13 2.832912
48 threads  10206.42    0.20    10144.65 0.19 -0.605163
64 threads  12053.72    0.09    11784.38 0.32 -2.234547
128 threads 14810.33    0.04    14741.78 0.16 -0.462867

I have a patch which is much smaller but seems to work well for both
x86 and SPARC across the benchmarks I have run so far. It keeps the idle
cpu search between one and two cores' worth of cpus and also adds a new
sched feature to toggle idle core search. It can be on by default, but
for workloads (like Oracle DB on x86) we can turn it off. I plan to send
it after some more testing.


[PATCH 5/5] sched: SIS_CORE to disable idle core search

2018-06-12 Thread subhra mazumdar
Use SIS_CORE to disable idle core search. For some workloads
select_idle_core becomes a scalability bottleneck, and removing it improves
throughput. There are also workloads where disabling it can hurt latency,
so it needs to be optional.

Signed-off-by: subhra mazumdar 
---
 kernel/sched/fair.c | 8 +---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 849c7c8..35a076e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6464,9 +6464,11 @@ static int select_idle_sibling(struct task_struct *p, 
int prev, int target)
if (!sd)
return target;
 
-   i = select_idle_core(p, sd, target);
-   if ((unsigned)i < nr_cpumask_bits)
-   return i;
+   if (sched_feat(SIS_CORE)) {
+   i = select_idle_core(p, sd, target);
+   if ((unsigned)i < nr_cpumask_bits)
+   return i;
+   }
 
i = select_idle_cpu(p, sd, target);
if ((unsigned)i < nr_cpumask_bits)
-- 
2.9.3



[RFC/RFT V2 PATCH 0/5] Improve scheduler scalability for fast path

2018-06-12 Thread subhra mazumdar
(1.07%)   1.05

Following are the schbench performance numbers with SIS_CORE false and
SIS_PROP false. Disabling SIS_PROP as well recovers the latency increase
seen with SIS_CORE false.

Schbench on 2 socket, 44 core and 88 threads Intel x86 machine with 44 
tasks (lower is better):
percentile  baseline  %stdev  patch              %stdev
50          94        2.82    93.33 (0.71%)      1.24
75          124       2.13    122.67 (1.08%)     1.7
90          152       1.74    149.33 (1.75%)     2.35
95          171       2.11    167 (2.34%)        2.74
99          512.67    104.96  206 (59.82%)       8.86
99.5        2296      82.55   3121.67 (-35.96%)  97.37
99.9        12517.33  2.38    12592 (-0.6%)      1.67

Changes since v1
 - Compute the upper and lower limit based on number of cpus in a core
 - Split up the search limit and search window rotation into separate
   patches
 - Add new sched feature to have option of disabling idle core search

subhra mazumdar (5):
  sched: limit cpu search in select_idle_cpu
  sched: introduce per-cpu var next_cpu to track search limit
  sched: rotate the cpu search window for better spread
  sched: add sched feature to disable idle core search
  sched: SIS_CORE to disable idle core search

 kernel/sched/core.c |  2 ++
 kernel/sched/fair.c | 31 +++
 kernel/sched/features.h |  1 +
 kernel/sched/sched.h|  1 +
 4 files changed, 27 insertions(+), 8 deletions(-)

-- 
2.9.3



[PATCH 1/5] sched: limit cpu search in select_idle_cpu

2018-06-12 Thread subhra mazumdar
Put an upper and lower limit on the cpu search of select_idle_cpu. The
lower limit is the number of cpus in a core, while the upper limit is twice
that. This ensures that for any architecture we will usually search beyond
a core. The upper limit also helps keep the search cost low and constant.

Signed-off-by: subhra mazumdar 
---
 kernel/sched/fair.c | 15 +++
 1 file changed, 11 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e497c05..9a6d28d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6372,7 +6372,7 @@ static int select_idle_cpu(struct task_struct *p, struct 
sched_domain *sd, int t
u64 avg_cost, avg_idle;
u64 time, cost;
s64 delta;
-   int cpu, nr = INT_MAX;
+   int cpu, limit, floor, nr = INT_MAX;
 
this_sd = rcu_dereference(*this_cpu_ptr(&sd_llc));
if (!this_sd)
@@ -6389,11 +6389,18 @@ static int select_idle_cpu(struct task_struct *p, 
struct sched_domain *sd, int t
return -1;
 
if (sched_feat(SIS_PROP)) {
+   floor = cpumask_weight(topology_sibling_cpumask(target));
+   if (floor < 2)
+   floor = 2;
+   limit = 2*floor;
u64 span_avg = sd->span_weight * avg_idle;
-   if (span_avg > 4*avg_cost)
+   if (span_avg > floor*avg_cost) {
nr = div_u64(span_avg, avg_cost);
-   else
-   nr = 4;
+   if (nr > limit)
+   nr = limit;
+   } else {
+   nr = floor;
+   }
}
 
time = local_clock();
-- 
2.9.3



[PATCH 2/5] sched: introduce per-cpu var next_cpu to track search limit

2018-06-12 Thread subhra mazumdar
Introduce a per-cpu variable to track the limit up to which the idle cpu
search was done in select_idle_cpu(). This allows the next search to start
from there. This is necessary for rotating the search window over the
entire LLC domain.

Signed-off-by: subhra mazumdar 
---
 kernel/sched/core.c  | 2 ++
 kernel/sched/sched.h | 1 +
 2 files changed, 3 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 8d59b25..b3e4ec1 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -20,6 +20,7 @@
 #include 
 
 DEFINE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues);
+DEFINE_PER_CPU_SHARED_ALIGNED(int, next_cpu);
 
 #if defined(CONFIG_SCHED_DEBUG) && defined(HAVE_JUMP_LABEL)
 /*
@@ -5996,6 +5997,7 @@ void __init sched_init(void)
for_each_possible_cpu(i) {
struct rq *rq;
 
+   per_cpu(next_cpu, i) = -1;
rq = cpu_rq(i);
raw_spin_lock_init(&rq->lock);
rq->nr_running = 0;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 67702b4..eb12b50 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -912,6 +912,7 @@ static inline void update_idle_core(struct rq *rq) { }
 #endif
 
 DECLARE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues);
+DECLARE_PER_CPU_SHARED_ALIGNED(int, next_cpu);
 
 #define cpu_rq(cpu)(&per_cpu(runqueues, (cpu)))
 #define this_rq()  this_cpu_ptr(&runqueues)
-- 
2.9.3



[PATCH 3/5] sched: rotate the cpu search window for better spread

2018-06-12 Thread subhra mazumdar
Rotate the cpu search window for better spread of threads. This will ensure
an idle cpu will quickly be found if one exists.

Signed-off-by: subhra mazumdar 
---
 kernel/sched/fair.c | 10 --
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 9a6d28d..849c7c8 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6372,7 +6372,7 @@ static int select_idle_cpu(struct task_struct *p, struct 
sched_domain *sd, int t
u64 avg_cost, avg_idle;
u64 time, cost;
s64 delta;
-   int cpu, limit, floor, nr = INT_MAX;
+   int cpu, limit, floor, target_tmp, nr = INT_MAX;
 
this_sd = rcu_dereference(*this_cpu_ptr(&sd_llc));
if (!this_sd)
@@ -6403,9 +6403,15 @@ static int select_idle_cpu(struct task_struct *p, struct 
sched_domain *sd, int t
}
}
 
+   if (per_cpu(next_cpu, target) != -1)
+   target_tmp = per_cpu(next_cpu, target);
+   else
+   target_tmp = target;
+
time = local_clock();
 
-   for_each_cpu_wrap(cpu, sched_domain_span(sd), target) {
+   for_each_cpu_wrap(cpu, sched_domain_span(sd), target_tmp) {
+   per_cpu(next_cpu, target) = cpu;
if (!--nr)
return -1;
if (!cpumask_test_cpu(cpu, &p->cpus_allowed))
-- 
2.9.3



[PATCH 4/5] sched: add sched feature to disable idle core search

2018-06-12 Thread subhra mazumdar
Add a new sched feature SIS_CORE to have an option to disable idle core
search (select_idle_core).

Signed-off-by: subhra mazumdar 
---
 kernel/sched/features.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index 85ae848..de15733 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -57,6 +57,7 @@ SCHED_FEAT(TTWU_QUEUE, true)
  */
 SCHED_FEAT(SIS_AVG_CPU, false)
 SCHED_FEAT(SIS_PROP, true)
+SCHED_FEAT(SIS_CORE, true)
 
 /*
  * Issue a WARN when we do multiple update_rq_clock() calls
-- 
2.9.3



Re: [PATCH 1/5] sched: limit cpu search in select_idle_cpu

2018-06-12 Thread Subhra Mazumdar




On 06/12/2018 01:33 PM, kbuild test robot wrote:

Hi subhra,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on tip/sched/core]
[also build test WARNING on v4.17 next-20180612]
[if your patch is applied to the wrong git tree, please drop us a note to help 
improve the system]

url:
https://github.com/0day-ci/linux/commits/subhra-mazumdar/Improve-scheduler-scalability-for-fast-path/20180613-015158
config: i386-randconfig-x070-201823 (attached as .config)
compiler: gcc-7 (Debian 7.3.0-16) 7.3.0
reproduce:
 # save the attached .config to linux build tree
 make ARCH=i386

All warnings (new ones prefixed by >>):

kernel/sched/fair.c: In function 'select_idle_cpu':

kernel/sched/fair.c:6396:3: warning: ISO C90 forbids mixed declarations and 
code [-Wdeclaration-after-statement]

   u64 span_avg = sd->span_weight * avg_idle;
   ^~~





I fixed this patch, please try the following

---8<---

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e497c05..7243146 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6372,7 +6372,7 @@ static int select_idle_cpu(struct task_struct *p, 
struct sched_domain *sd, int t

    u64 avg_cost, avg_idle;
    u64 time, cost;
    s64 delta;
-   int cpu, nr = INT_MAX;
+   int cpu, limit, floor, nr = INT_MAX;

    this_sd = rcu_dereference(*this_cpu_ptr(&sd_llc));
    if (!this_sd)
@@ -6390,10 +6390,17 @@ static int select_idle_cpu(struct task_struct 
*p, struct sched_domain *sd, int t


    if (sched_feat(SIS_PROP)) {
    u64 span_avg = sd->span_weight * avg_idle;
-   if (span_avg > 4*avg_cost)
+   floor = cpumask_weight(topology_sibling_cpumask(target));
+   if (floor < 2)
+   floor = 2;
+   limit = 2*floor;
+   if (span_avg > floor*avg_cost) {
    nr = div_u64(span_avg, avg_cost);
-   else
-   nr = 4;
+   if (nr > limit)
+   nr = limit;
+   } else {
+   nr = floor;
+   }
    }

    time = local_clock();



Gang scheduling

2018-10-10 Thread Subhra Mazumdar

Hi,

I was following the Coscheduling patch discussion on lkml and Peter 
mentioned he had a patch series. I found the following on github.


https://github.com/pdxChen/gang/commits/sched_1.23-loadbal

I would like to test this with KVMs. Are the commits from 38d5acb to
f019876 sufficient? Also, is there any documentation on how to use it (any
knobs I need to turn on for gang scheduling to happen?) or is it enabled
by default for KVMs?


Thanks,
Subhra



Re: [RFC 00/60] Coscheduling for Linux

2018-10-29 Thread Subhra Mazumdar



On 10/26/18 4:44 PM, Jan H. Schönherr wrote:

On 19/10/2018 02.26, Subhra Mazumdar wrote:

Hi Jan,

Hi. Sorry for the delay.


On 9/7/18 2:39 PM, Jan H. Schönherr wrote:

The collective context switch from one coscheduled set of tasks to another
-- while fast -- is not atomic. If a use-case needs the absolute guarantee
that all tasks of the previous set have stopped executing before any task
of the next set starts executing, an additional hand-shake/barrier needs to
be added.


Do you know how long the delay is? I.e., what is the overlap time when a
thread of the new group starts executing on one HT while a thread of
another group is still running on the other HT?

The delay is roughly equivalent to the IPI latency, if we're just talking
about coscheduling at SMT level: one sibling decides to schedule another
group, sends an IPI to the other sibling(s), and may already start
executing a task of that other group, before the IPI is received on the
other end.

Can you point to where the leader is sending the IPI to other siblings?

I did an experiment and the delay seems to be sub-microsecond. I ran 2
threads that just loop, in one cosched group, affinitized to the 2 HTs of
a core. Then another thread in a different cosched group starts running,
affinitized to the first HT of the same core. I timestamped just before
context_switch() in __schedule() for the threads switching from one group
to another and from one group to idle. Following is what I get on cpus 1
and 45, which are siblings; cpu 1 is where the other thread preempts:

[  403.216625] cpu:45 sub1->idle:403216624579
[  403.238623] cpu:1 sub1->sub2:403238621585
[  403.238624] cpu:45 sub1->idle:403238621787
[  403.260619] cpu:1 sub1->sub2:403260619182
[  403.260620] cpu:45 sub1->idle:403260619413
[  403.282617] cpu:1 sub1->sub2:403282617157
[  403.282618] cpu:45 sub1->idle:403282617317
..

Not sure why the first switch to idle on cpu 45 happened. But from then
onwards the difference in timestamps is less than a microsecond. This is
just a crude way to get a sense of the delay; it may not be exact.

Thanks,
Subhra


Now, there are some things that may delay processing an IPI, but in those
cases the target CPU isn't executing user code.

I've yet to produce some current numbers for SMT-only coscheduling. An
older ballpark number I have is about 2 microseconds for the collective
context switch of one hierarchy level, but take that with a grain of salt.

Regards
Jan



Re: Gang scheduling

2018-10-15 Thread Subhra Mazumdar




On 10/12/2018 11:01 AM, Tim Chen wrote:

On 10/10/2018 05:09 PM, Subhra Mazumdar wrote:

Hi,

I was following the Coscheduling patch discussion on lkml and Peter mentioned 
he had a patch series. I found the following on github.

https://github.com/pdxChen/gang/commits/sched_1.23-loadbal

I would like to test this with KVMs. Are the commits from 38d5acb to f019876 
sufficient? Also, is there any documentation on how to use it (any knobs I need 
to turn on for gang scheduling to happen?) or is it enabled by default for KVMs?

Thanks,
Subhra


I would suggest you try
https://github.com/pdxChen/gang/tree/sched_1.23-base
without the load balancing part of gang scheduling.
It is enabled by default for KVMs.

Due to the constant change in gang scheduling status of the QEMU thread
depending on whether vcpu is loaded or unloaded,
the load balancing part of the code doesn't work very well.

Thanks. Does this mean each vcpu thread needs to be affinitized to a cpu?


The current version of the code need to be optimized further.  Right now
the QEMU thread constantly does vcpu load and unload during VM enter and exit.
We gang schedule only after vcpu load and register the thread to be gang
scheduled.  When we do vcpu unload, the thread is removed from the set
to be gang scheduled.  Each time there's a synchronization with the
sibling thread that's expensive.

However, for QEMU, there's a one to one correspondence between the QEMU
thread and vcpu.  So we don't have to change the gang scheduling status
for such a thread, to avoid the churn and sync with the sibling. That should
be helpful for VMs with lots of I/O causing constant VM exits.  We're
still working on this optimization.  And the load balancing should be
better after this change.

Tim



Also FYI I get the following error while building sched_1.23-base:

ERROR: "sched_ttwu_pending" [arch/x86/kvm/kvm-intel.ko] undefined!
scripts/Makefile.modpost:92: recipe for target '__modpost' failed

Adding the following fixed it:

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 46807dc..302b77d 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -21,6 +21,7 @@
 #include 

 DEFINE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues);
+EXPORT_SYMBOL_GPL(sched_ttwu_pending);

 #if defined(CONFIG_SCHED_DEBUG) && defined(HAVE_JUMP_LABEL)
 /*


Re: [RFC PATCH v2 1/1] pipe: busy wait for pipe

2018-11-05 Thread Subhra Mazumdar



On 11/5/18 2:08 AM, Mel Gorman wrote:

Adding Al Viro as per get_maintainers.pl.

On Tue, Sep 25, 2018 at 04:32:40PM -0700, subhra mazumdar wrote:

Introduce pipe_ll_usec field for pipes that indicates the amount of micro
seconds a thread should spin if pipe is empty or full before sleeping. This
is similar to network sockets.

Can you point to what pattern from network sockets you are duplicating
exactly? One would assume it's busy_loop_current_time and busy_loop_timeout
but it should be in the changelog because there are differences in polling
depending on where you are in the network subsystem (which I'm not very
familiar with).

I was referring to sk_busy_loop_timeout(), which uses sk_ll_usec. By
"similar" I meant having a comparable mechanism for pipes to busy wait.



Workloads like hackbench in pipe mode
benefit significantly from this by avoiding the sleep and wakeup overhead.
Other similar usecases can benefit. A tunable pipe_busy_poll is introduced
to enable or disable busy waiting via /proc. The value of it specifies the
amount of spin in microseconds. Default value is 0 indicating no spin.


Your lead mail indicates the spin was set to a "suitable spin time". How
should an administrator select this spin time? What works for hackbench
might not be suitable for another workload. What if the spin time
selected happens to be just slightly longer than the time it takes the
reader to respond? In such a case, the result would be "all spin, no
gain". While networking potentially suffers the same problem, it appears
to be opt-in per socket so it's up to the application not to shoot
itself in the foot.

Even for networking, sk_ll_usec is assigned the value of the tunable
sysctl_net_busy_read in sock_init_data() for all sockets initialized by
default. There is a way to set it per socket using sock_setsockopt(); for
pipes that can be added later if needed, in case different apps run on one
system. But there are cases where only one app runs (e.g. big DBs) and one
tunable will suffice. It can be set to a value that is tested to be
beneficial under the operating conditions.
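
For reference, the busy-wait-then-block pattern being discussed can be
modelled in plain userspace C roughly as follows (illustrative only; the
spin budget plays the role of pipe_ll_usec/sk_ll_usec, and usleep() stands
in for sleeping on the pipe wait queue). Compile with: gcc -O2 -pthread.

#include <pthread.h>
#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

static atomic_int data_ready;

static uint64_t now_ns(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (uint64_t)ts.tv_sec * 1000000000ull + ts.tv_nsec;
}

/* Spin up to busy_poll_usec waiting for the writer, then fall back to a
 * (simulated) blocking sleep.  Returns 1 if the spin caught the event. */
static int reader_wait(unsigned int busy_poll_usec)
{
	uint64_t start = now_ns();

	while (!atomic_load_explicit(&data_ready, memory_order_acquire)) {
		if ((now_ns() - start) / 1000 >= busy_poll_usec) {
			while (!atomic_load_explicit(&data_ready,
						     memory_order_acquire))
				usleep(50);	/* stand-in for pipe_wait() */
			return 0;
		}
	}
	return 1;
}

static void *writer(void *arg)
{
	(void)arg;
	usleep(20);			/* writer shows up after ~20us */
	atomic_store_explicit(&data_ready, 1, memory_order_release);
	return NULL;
}

int main(void)
{
	pthread_t t;

	pthread_create(&t, NULL, writer, NULL);
	printf("caught by spinning: %d\n", reader_wait(50 /* usec budget */));
	pthread_join(&t, NULL);
	return 0;
}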



Signed-off-by: subhra mazumdar 
---
  fs/pipe.c | 12 
  include/linux/pipe_fs_i.h |  2 ++
  kernel/sysctl.c   |  7 +++
  3 files changed, 21 insertions(+)

diff --git a/fs/pipe.c b/fs/pipe.c
index bdc5d3c..35d805b 100644
--- a/fs/pipe.c
+++ b/fs/pipe.c
@@ -26,6 +26,7 @@
  
  #include 

  #include 
+#include 
  
  #include "internal.h"
  
@@ -40,6 +41,7 @@ unsigned int pipe_max_size = 1048576;

   */
  unsigned long pipe_user_pages_hard;
  unsigned long pipe_user_pages_soft = PIPE_DEF_BUFFERS * INR_OPEN_CUR;
+unsigned int pipe_busy_poll;
  
  /*

   * We use a start+len construction, which provides full use of the
@@ -106,6 +108,7 @@ void pipe_double_lock(struct pipe_inode_info *pipe1,
  void pipe_wait(struct pipe_inode_info *pipe)
  {
DEFINE_WAIT(wait);
+   u64 start;
  
  	/*

 * Pipes are system-local resources, so sleeping on them
@@ -113,6 +116,10 @@ void pipe_wait(struct pipe_inode_info *pipe)
 */
prepare_to_wait(&pipe->wait, &wait, TASK_INTERRUPTIBLE);
pipe_unlock(pipe);
+   start = local_clock();
+   while (current->state != TASK_RUNNING &&
+  ((local_clock() - start) >> 10) < pipe->pipe_ll_usec)
+   cpu_relax();
schedule();
finish_wait(&pipe->wait, &wait);
pipe_lock(pipe);

Networking breaks this out better in terms of options instead of
hard-coding. This does not handle need_resched or signal delivery
properly where as networking does for example.

I don't disable preemption, so I don't think checking need_resched is
needed. Can you point to what you mean by handling signal delivery in the
case of networking? I'm not sure what I am missing. My initial version
broke it out like networking, but after Peter's suggestion I combined it.
I don't feel strongly either way.



@@ -825,6 +832,7 @@ static int do_pipe2(int __user *fildes, int flags)
struct file *files[2];
int fd[2];
int error;
+   struct pipe_inode_info *pipe;
  
  	error = __do_pipe_flags(fd, files, flags);

if (!error) {
@@ -838,6 +846,10 @@ static int do_pipe2(int __user *fildes, int flags)
fd_install(fd[0], files[0]);
fd_install(fd[1], files[1]);
}
+   pipe = files[0]->private_data;
+   pipe->pipe_ll_usec = pipe_busy_poll;
+   pipe = files[1]->private_data;
+   pipe->pipe_ll_usec = pipe_busy_poll;
}
return error;
  }

You add a pipe field but the value is always based on the sysctl,
so the information is redundant (barring a race condition on one
pipe write per sysctl update which is an irrelevant corner case). In
comparison, the network subsystem appears to

Re: [PATCH 00/10] steal tasks to improve CPU utilization

2018-11-02 Thread Subhra Mazumdar



On 10/22/18 7:59 AM, Steve Sistare wrote:

When a CPU has no more CFS tasks to run, and idle_balance() fails to
find a task, then attempt to steal a task from an overloaded CPU in the
same LLC. Maintain and use a bitmap of overloaded CPUs to efficiently
identify candidates.  To minimize search time, steal the first migratable
task that is found when the bitmap is traversed.  For fairness, search
for migratable tasks on an overloaded CPU in order of next to run.

This simple stealing yields a higher CPU utilization than idle_balance()
alone, because the search is cheap, so it may be called every time the CPU
is about to go idle.  idle_balance() does more work because it searches
widely for the busiest queue, so to limit its CPU consumption, it declines
to search if the system is too busy.  Simple stealing does not offload the
globally busiest queue, but it is much better than running nothing at all.

The bitmap of overloaded CPUs is a new type of sparse bitmap, designed to
reduce cache contention vs the usual bitmap when many threads concurrently
set, clear, and visit elements.


Is the bitmask saving much? I tried a simple stealing scheme that just
starts searching the domain from the current cpu and steals a thread from
the first cpu that has more than one runnable thread. It seems to perform
similarly to your patch.

hackbench on X6-2: 2 sockets * 22 cores * 2 hyperthreads = 88 CPUs
    baseline    %stdev  patch %stdev
1(40 tasks) 0.5524  2.36    0.5522 (0.045%) 3.82
2(80 tasks) 0.6482  11.4    0.7241 (-11.7%) 20.34
4(160 tasks)    0.9756  0.95    0.8276 (15.1%) 5.8
8(320 tasks)    1.7699  1.62    1.6655 (5.9%) 1.57
16(640 tasks)   3.1018  0.77    2.9858 (3.74%) 1.4
32(1280 tasks)  5.565   0.62    5.3388 (4.1%) 0.72

X6-2: 2 sockets * 22 cores * 2 hyperthreads = 88 CPUs
Oracle database OLTP, logging _enabled_

Users %speedup
20 1.2
40 -0.41
60 0.83
80 2.37
100 1.54
120 3.0
140 2.24
160 1.82
180 1.94
200 2.23
220 1.49

Below is the patch (not in best shape)

--->8

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f808ddf..1690451 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3540,6 +3540,7 @@ static inline unsigned long cfs_rq_load_avg(struct 
cfs_rq *cfs_rq)

 }

 static int idle_balance(struct rq *this_rq, struct rq_flags *rf);
+static int my_idle_balance(struct rq *this_rq, struct rq_flags *rf);

 static inline unsigned long task_util(struct task_struct *p)
 {
@@ -6619,6 +6620,8 @@ done: __maybe_unused;

 idle:
    new_tasks = idle_balance(rq, rf);
+   if (new_tasks == 0)
+   new_tasks = my_idle_balance(rq, rf);

    /*
 * Because idle_balance() releases (and re-acquires) rq->lock, 
it is

@@ -8434,6 +8437,75 @@ static int should_we_balance(struct lb_env *env)
    return balance_cpu == env->dst_cpu;
 }

+int get_best_cpu(int this_cpu, struct sched_domain *sd)
+{
+   struct rq *this_rq, *rq;
+   int i;
+   int best_cpu = -1;
+
+   this_rq = cpu_rq(this_cpu);
+   for_each_cpu_wrap(i, sched_domain_span(sd), this_cpu) {
+   if (this_rq->nr_running > 0)
+   return (-1);
+   if (i == this_cpu)
+   continue;
+   rq = cpu_rq(i);
+   if (rq->nr_running <= 1)
+   continue;
+   best_cpu = i;
+   break;
+   }
+   return (best_cpu);
+}
+static int my_load_balance(int this_cpu, struct rq *this_rq,
+  struct sched_domain *sd, enum cpu_idle_type idle)
+{
+   int ld_moved = 0;
+   struct rq *busiest;
+   unsigned long flags;
+   struct task_struct *p = NULL;
+   struct cpumask *cpus = this_cpu_cpumask_var_ptr(load_balance_mask);
+   int best_cpu;
+
+   struct lb_env env = {
+   .sd = sd,
+   .dst_cpu    = this_cpu,
+   .dst_rq = this_rq,
+   .dst_grpmask    = sched_group_span(sd->groups),
+   .idle   = idle,
+   .cpus   = cpus,
+   .tasks  = LIST_HEAD_INIT(env.tasks),
+   };
+
+   if (idle == CPU_NEWLY_IDLE)
+   env.dst_grpmask = NULL;
+
+   cpumask_and(cpus, sched_domain_span(sd), cpu_active_mask);
+
+   best_cpu = get_best_cpu(this_cpu, sd);
+
+   if (best_cpu >= 0)
+   busiest = cpu_rq(best_cpu);
+   else
+   goto out;
+
+   env.src_cpu = busiest->cpu;
+   env.src_rq = busiest;
+   raw_spin_lock_irqsave(&busiest->lock, flags);
+
+   p = detach_one_task(&env);
+   raw_spin_unlock(&busiest->lock);
+   if (p) {
+   attach_one_task(this_rq, p);
+   ld_moved++;
+   }
+   local_irq_restore(flags);
+
+out:
+   return ld_moved;
+}
+
 /*
  * Check this_cpu to ensure it is balanced within domain. Attempt to move
  

Re: [RFC 00/60] Coscheduling for Linux

2018-10-26 Thread Subhra Mazumdar





D) What can I *not* do with this?
-

Besides the missing load-balancing within coscheduled task-groups, this
implementation has the following properties, which might be considered
short-comings.

This particular implementation focuses on SCHED_OTHER tasks managed by CFS
and allows coscheduling them. Interrupts as well as tasks in higher
scheduling classes are currently out-of-scope: they are assumed to be
negligible interruptions as far as coscheduling is concerned and they do
*not* cause a preemption of a whole group. This implementation could be
extended to cover higher scheduling classes. Interrupts, however, are an
orthogonal issue.

The collective context switch from one coscheduled set of tasks to another
-- while fast -- is not atomic. If a use-case needs the absolute guarantee
that all tasks of the previous set have stopped executing before any task
of the next set starts executing, an additional hand-shake/barrier needs to
be added.


The leader doesn't kick the other cpus _immediately_ to switch to a
different cosched group. So threads from the previous cosched group will
keep running on the other HTs until their sched_slice is over (in the worst
case). Doesn't this still keep the L1TF vulnerability window open?


Re: [RFC 00/60] Coscheduling for Linux

2018-10-18 Thread Subhra Mazumdar

Hi Jan,

On 9/7/18 2:39 PM, Jan H. Schönherr wrote:

The collective context switch from one coscheduled set of tasks to another
-- while fast -- is not atomic. If a use-case needs the absolute guarantee
that all tasks of the previous set have stopped executing before any task
of the next set starts executing, an additional hand-shake/barrier needs to
be added.


Do you know how much is the delay? i.e what is overlap time when a thread
of new group starts executing on one HT while there is still thread of
another group running on the other HT?

Thanks,
Subhra


Re: [RFC PATCH 2/2] pipe: use pipe busy wait

2018-09-17 Thread Subhra Mazumdar




On 09/07/2018 05:25 AM, Peter Zijlstra wrote:

On Thu, Aug 30, 2018 at 01:24:58PM -0700, subhra mazumdar wrote:

+void pipe_busy_wait(struct pipe_inode_info *pipe)
+{
+   unsigned long wait_flag = pipe->pipe_wait_flag;
+   unsigned long start_time = pipe_busy_loop_current_time();
+
+   pipe_unlock(pipe);
+   preempt_disable();
+   for (;;) {
+   if (pipe->pipe_wait_flag > wait_flag) {
+   preempt_enable();
+   pipe_lock(pipe);
+   return;
+   }
+   if (pipe_busy_loop_timeout(pipe, start_time))
+   break;
+   cpu_relax();
+   }
+   preempt_enable();
+   pipe_lock(pipe);
+   if (pipe->pipe_wait_flag > wait_flag)
+   return;
+   pipe_wait(pipe);
+}
+
+void wake_up_busy_poll(struct pipe_inode_info *pipe)
+{
+   pipe->pipe_wait_flag++;
+}

Why not just busy wait on current->state ? A little something like:

diff --git a/fs/pipe.c b/fs/pipe.c
index bdc5d3c0977d..8d9f1c95ff99 100644
--- a/fs/pipe.c
+++ b/fs/pipe.c
@@ -106,6 +106,7 @@ void pipe_double_lock(struct pipe_inode_info *pipe1,
  void pipe_wait(struct pipe_inode_info *pipe)
  {
DEFINE_WAIT(wait);
+   u64 start;
  
  	/*

 * Pipes are system-local resources, so sleeping on them
@@ -113,7 +114,15 @@ void pipe_wait(struct pipe_inode_info *pipe)
 */
prepare_to_wait(&pipe->wait, &wait, TASK_INTERRUPTIBLE);
pipe_unlock(pipe);
-   schedule();
+
+   preempt_disable();
+   start = local_clock();
+   while (!need_resched() && current->state != TASK_RUNNING &&
+   (local_clock() - start) < pipe->poll_usec)
+   cpu_relax();
+   schedule_preempt_disabled();
+   preempt_enable();
+
finish_wait(&pipe->wait, &wait);
pipe_lock(pipe);
  }

This will make the current thread always spin and block as it itself does
the state change to TASK_RUNNING in finish_wait.


Re: [RFC 00/60] Coscheduling for Linux

2018-09-17 Thread Subhra Mazumdar




On 09/07/2018 02:39 PM, Jan H. Schönherr wrote:

This patch series extends CFS with support for coscheduling. The
implementation is versatile enough to cover many different coscheduling
use-cases, while at the same time being non-intrusive, so that behavior of
legacy workloads does not change.

Peter Zijlstra once called coscheduling a "scalability nightmare waiting to
happen". Well, with this patch series, coscheduling certainly happened.
However, I disagree on the scalability nightmare. :)

In the remainder of this email, you will find:

A) Quickstart guide for the impatient.
B) Why would I want this?
C) How does it work?
D) What can I *not* do with this?
E) What's the overhead?
F) High-level overview of the patches in this series.

Regards
Jan


A) Quickstart guide for the impatient.
--

Here is a quickstart guide to set up coscheduling at core-level for
selected tasks on an SMT-capable system:

1. Apply the patch series to v4.19-rc2.
2. Compile with "CONFIG_COSCHEDULING=y".
3. Boot into the newly built kernel with an additional kernel command line
argument "cosched_max_level=1" to enable coscheduling up to core-level.
4. Create one or more cgroups and set their "cpu.scheduled" to "1".
5. Put tasks into the created cgroups and set their affinity explicitly.
6. Enjoy tasks of the same group and on the same core executing
simultaneously, whenever they are executed.

You are not restricted to coscheduling at core-level. Just select higher
numbers in steps 3 and 4. See also further below for more information, esp.
when you want to try higher numbers on larger systems.

Setting affinity explicitly for tasks within coscheduled cgroups is
currently necessary, as the load balancing portion is still missing in this
series.


I don't get the affinity part. If I create two cgroups by giving them only
cpu shares (no cpuset) and set their cpu.scheduled=1, will this ensure
co-scheduling of each group on core level for all cores in the system?

Thanks,
Subhra


Re: [RFC PATCH 2/2] pipe: use pipe busy wait

2018-09-17 Thread Subhra Mazumdar




On 09/17/2018 03:43 PM, Peter Zijlstra wrote:

On Mon, Sep 17, 2018 at 02:05:40PM -0700, Subhra Mazumdar wrote:

On 09/07/2018 05:25 AM, Peter Zijlstra wrote:

Why not just busy wait on current->state ? A little something like:

diff --git a/fs/pipe.c b/fs/pipe.c
index bdc5d3c0977d..8d9f1c95ff99 100644
--- a/fs/pipe.c
+++ b/fs/pipe.c
@@ -106,6 +106,7 @@ void pipe_double_lock(struct pipe_inode_info *pipe1,
  void pipe_wait(struct pipe_inode_info *pipe)
  {
DEFINE_WAIT(wait);
+   u64 start;
/*
 * Pipes are system-local resources, so sleeping on them
@@ -113,7 +114,15 @@ void pipe_wait(struct pipe_inode_info *pipe)
 */
prepare_to_wait(&pipe->wait, &wait, TASK_INTERRUPTIBLE);
pipe_unlock(pipe);
-   schedule();
+
+   preempt_disable();
+   start = local_clock();
+   while (!need_resched() && current->state != TASK_RUNNING &&
+   (local_clock() - start) < pipe->poll_usec)
+   cpu_relax();
+   schedule_preempt_disabled();
+   preempt_enable();
+
finish_wait(&pipe->wait, &wait);
pipe_lock(pipe);
  }

This will make the current thread always spin and block as it itself does
the state change to TASK_RUNNING in finish_wait.

Nah, the actual wakeup will also do that state change. The one in
finish_wait() is for the case where the wait condition became true
without wakeup, such that we don't 'leak' the INTERRUPTIBLE state.

Ok, it works. I see improvements with hackbench similar to those of the
original patch.


Re: [PATCH 1/3] sched: remove select_idle_core() for scalability

2018-04-30 Thread Subhra Mazumdar



On 04/25/2018 10:49 AM, Peter Zijlstra wrote:

On Tue, Apr 24, 2018 at 02:45:50PM -0700, Subhra Mazumdar wrote:

So what you said makes sense in theory but is not borne out by real
world results. This indicates that threads of these benchmarks care more
about running immediately on any idle cpu rather than spending time to find
a fully idle core to run on.

But you only ran on Intel which emunerates siblings far apart in the
cpuid space. Which is not something we should rely on.


So by only doing a linear scan on CPU number you will actually fill
cores instead of equally spreading across cores. Worse still, by
limiting the scan to _4_ you only barely even get onto a next core for
SMT4 hardware, never mind SMT8.

Again this doesn't matter for the benchmarks I ran. Most are happy to make
the tradeoff on x86 (SMT2). Limiting the scan is mitigated by the fact that
the scan window is rotated over all cpus, so idle cpus will be found soon.

You've not been reading well. The Intel machine you tested this on most
likely doesn't suffer that problem because of the way it happens to
iterate SMT threads.

How does Sparc iterate its SMT siblings in cpuid space?

SPARC enumerates siblings sequentially, although it needs to be confirmed
through tests whether the non-sequential enumeration on x86 is the reason for
the improvements. I don't have a SPARC test system handy right now.


Also, your benchmarks chose an unfortunate nr of threads vs topology.
The 2^n thing chosen never hits the 100% core case (6,22 resp.).


So while I'm not adverse to limiting the empty core search; I do feel it
is important to have. Overloading cores when you don't have to is not
good.

Can we have a config or a way for enabling/disabling select_idle_core?

I like Rohit's suggestion of folding select_idle_core and
select_idle_cpu much better, then it stays SMT aware.

Something like the completely untested patch below.

I tried both the patches you suggested, the first with the merging of
select_idle_core and select_idle_cpu, the second with the new way of
calculating avg_idle, and finally both combined. I ran the following
benchmarks for each. The merge-only patch seems to give similar improvements
as my original patch for the Uperf and Oracle DB tests, but it regresses for
hackbench. If we can fix this I am OK with it. I can do a run of other
benchmarks after that.

I also noticed a possible bug later in the merge code. Shouldn't it be:

if (busy < best_busy) {
    best_busy = busy;
    best_cpu = first_idle;
}

Unfortunately I noticed it after all runs.

merge:

Hackbench process on 2 socket, 44 core and 88 threads Intel x86 machine
(lower is better):
groups  baseline   %stdev  patch %stdev
1   0.5742 21.13   0.5099 (11.2%) 2.24
2   0.5776 7.87    0.5385 (6.77%) 3.38
4   0.9578 1.12    1.0626 (-10.94%) 1.35
8   1.7018 1.35    1.8615 (-9.38%) 0.73
16  2.9955 1.36    3.2424 (-8.24%) 0.66
32  5.4354 0.59    5.749  (-5.77%) 0.55

Uperf pingpong on 2 socket, 44 core and 88 threads Intel x86 machine with
message size = 8k (higher is better):
threads baseline    %stdev  patch %stdev
8   49.47   0.35    49.98 (1.03%) 1.36
16  95.28   0.77    97.46 (2.29%) 0.11
32  156.77  1.17    167.03 (6.54%) 1.98
48  193.24  0.22    230.96 (19.52%) 2.44
64  216.21  9.33    299.55 (38.54%) 4
128 379.62  10.29   357.87 (-5.73%) 0.85

Oracle DB on 2 socket, 44 core and 88 threads Intel x86 machine
(normalized, higher is better):
users   baseline    %stdev  patch %stdev
20  1   1.35    0.9919 (-0.81%) 0.14
40  1   0.42    0.9959 (-0.41%) 0.72
60  1   1.54    0.9872 (-1.28%) 1.27
80  1   0.58    0.9925 (-0.75%) 0.5
100 1   0.77    1.0145 (1.45%) 1.29
120 1   0.35    1.0136 (1.36%) 1.15
140 1   0.19    1.0404 (4.04%) 0.91
160 1   0.09    1.0317 (3.17%) 1.41
180 1   0.99    1.0322 (3.22%) 0.51
200 1   1.03    1.0245 (2.45%) 0.95
220 1   1.69    1.0296 (2.96%) 2.83

new avg_idle:

Hackbench process on 2 socket, 44 core and 88 threads Intel x86 machine
(lower is better):
groups  baseline   %stdev  patch %stdev
1   0.5742 21.13   0.5241 (8.73%) 8.26
2   0.5776 7.87    0.5436 (5.89%) 8.53
4   0.9578 1.12    0.989 (-3.26%) 1.9
8   1.7018 1.35    1.7568 (-3.23%) 1.22
16  2.9955 1.36    3.1119 (-3.89%) 0.92
32  5.4354 0.59    5.5889 (-2.82%) 0.64

Uperf pingpong on 2 socket, 44 core and 88 threads Intel x86 machine with
message size = 8k (higher is better):
threads baseline    %stdev  patch %stdev
8   49.47   0.35    48.11 (-2.75%) 0.29
16  95.28   0.77    93.67 (-1.68%) 0.68
32  156.77 

Re: [PATCH 1/3] sched: remove select_idle_core() for scalability

2018-05-02 Thread Subhra Mazumdar



On 05/01/2018 11:03 AM, Peter Zijlstra wrote:

On Mon, Apr 30, 2018 at 04:38:42PM -0700, Subhra Mazumdar wrote:

I also noticed a possible bug later in the merge code. Shouldn't it be:

if (busy < best_busy) {
     best_busy = busy;
     best_cpu = first_idle;
}

Uhh, quite. I did say it was completely untested, but yes.. /me dons the
brown paper bag.

I re-ran the test after fixing that bug but still get similar regressions
for hackbench and similar improvements on Uperf. I didn't re-run the
Oracle DB tests but my guess is they will show similar improvement.

merge:

Hackbench process on 2 socket, 44 core and 88 threads Intel x86 machine
(lower is better):
groups  baseline   %stdev  patch %stdev
1   0.5742 21.13   0.5131 (10.64%) 4.11
2   0.5776 7.87    0.5387 (6.73%) 2.39
4   0.9578 1.12    1.0549 (-10.14%) 0.85
8   1.7018 1.35    1.8516 (-8.8%) 1.56
16  2.9955 1.36    3.2466 (-8.38%) 0.42
32  5.4354 0.59    5.7738 (-6.23%) 0.38

Uperf pingpong on 2 socket, 44 core and 88 threads Intel x86 machine with
message size = 8k (higher is better):
threads baseline    %stdev  patch %stdev
8   49.47   0.35    51.1 (3.29%) 0.13
16  95.28   0.77    98.45 (3.33%) 0.61
32  156.77  1.17    170.97 (9.06%) 5.62
48  193.24  0.22    245.89 (27.25%) 7.26
64  216.21  9.33    316.43 (46.35%) 0.37
128 379.62  10.29   337.85 (-11%) 3.68

I tried using the next_cpu technique with the merge but it didn't help. I am
open to suggestions.

merge + next_cpu:

Hackbench process on 2 socket, 44 core and 88 threads Intel x86 machine
(lower is better):
groups  baseline   %stdev  patch %stdev
1   0.5742 21.13   0.5107 (11.06%) 6.35
2   0.5776 7.87    0.5917 (-2.44%) 11.16
4   0.9578 1.12    1.0761 (-12.35%) 1.1
8   1.7018 1.35    1.8748 (-10.17%) 0.8
16  2.9955 1.36    3.2419 (-8.23%) 0.43
32  5.4354 0.59    5.6958 (-4.79%) 0.58

Uperf pingpong on 2 socket, 44 core and 88 threads Intel x86 machine with
message size = 8k (higher is better):
threads baseline    %stdev  patch %stdev
8   49.47   0.35    51.65 (4.41%) 0.26
16  95.28   0.77    99.8 (4.75%) 1.1
32  156.77  1.17    168.37 (7.4%) 0.6
48  193.24  0.22    228.8 (18.4%) 1.75
64  216.21  9.33    287.11 (32.79%) 10.82
128 379.62  10.29   346.22 (-8.8%) 4.7

Finally, there was an earlier suggestion by Peter to transpose the cpu offset
in select_task_rq_fair, which I had tried earlier but it also regressed on
hackbench. Just wanted to mention that so we have closure on it.

transpose cpu offset in select_task_rq_fair:

Hackbench process on 2 socket, 44 core and 88 threads Intel x86 machine
(lower is better):
groups  baseline   %stdev  patch %stdev
1   0.5742 21.13   0.5251 (8.55%) 2.57
2   0.5776 7.87    0.5471 (5.28%) 11
4   0.9578 1.12    1.0148 (-5.95%) 1.97
8   1.7018 1.35    1.798 (-5.65%) 0.97
16  2.9955 1.36    3.088 (-3.09%) 2.7
32  5.4354 0.59    5.2815 (2.8%) 1.26


Re: [PATCH 1/3] sched: remove select_idle_core() for scalability

2018-04-24 Thread Subhra Mazumdar



On 04/24/2018 05:46 AM, Peter Zijlstra wrote:

On Mon, Apr 23, 2018 at 05:41:14PM -0700, subhra mazumdar wrote:

select_idle_core() can potentially search all cpus to find the fully idle
core even if there is one such core. Removing this is necessary to achieve
scalability in the fast path.

So this removes the whole core awareness from the wakeup path; this
needs far more justification.

In general running on pure cores is much faster than running on threads.
If you plot performance numbers there's almost always a fairly
significant drop in slope at the moment when we run out of cores and
start using threads.

The only justification I have is that almost all the benchmarks I ran
improved, most importantly our internal Oracle DB tests which we care about
a lot. So what you said makes sense in theory but is not borne out by real
world results. This indicates that threads of these benchmarks care more
about running immediately on any idle cpu rather than spending time to find
a fully idle core to run on.

Also, depending on cpu enumeration, your next patch might not even leave
the core scanning for idle CPUs.

Now, typically on Intel systems, we first enumerate cores and then
siblings, but I've seen Intel systems that don't do this and enumerate
all threads together. Also other architectures are known to iterate full
cores together, both s390 and Power for example do this.

So by only doing a linear scan on CPU number you will actually fill
cores instead of equally spreading across cores. Worse still, by
limiting the scan to _4_ you only barely even get onto a next core for
SMT4 hardware, never mind SMT8.

Again, this doesn't matter for the benchmarks I ran. Most are happy to make
the tradeoff on x86 (SMT2). Limiting the scan is mitigated by the fact that
the scan window is rotated over all cpus, so idle cpus will be found soon.
There is also stealing by idle cpus. Also, this was an RFT, so I request this
to be tested on other architectures like SMT4/SMT8.


So while I'm not adverse to limiting the empty core search; I do feel it
is important to have. Overloading cores when you don't have to is not
good.

Can we have a config or a way for enabling/disabling select_idle_core?


Re: [PATCH 2/3] sched: introduce per-cpu var next_cpu to track search limit

2018-04-24 Thread Subhra Mazumdar



On 04/24/2018 05:47 AM, Peter Zijlstra wrote:

On Mon, Apr 23, 2018 at 05:41:15PM -0700, subhra mazumdar wrote:

@@ -17,6 +17,7 @@
  #include 
  
  DEFINE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues);

+DEFINE_PER_CPU_SHARED_ALIGNED(int, next_cpu);
  
  #if defined(CONFIG_SCHED_DEBUG) && defined(HAVE_JUMP_LABEL)

  /*
@@ -6018,6 +6019,7 @@ void __init sched_init(void)
struct rq *rq;
  
  		rq = cpu_rq(i);

+   per_cpu(next_cpu, i) = -1;

If you leave it uninitialized it'll be 0, and we can avoid that extra
branch in the next patch, no?

0 can be a valid cpu id. I wanted to distinguish the first time. The branch
predictor will be fully trained so it will not have any cost.


Re: [PATCH 3/3] sched: limit cpu search and rotate search window for scalability

2018-04-24 Thread Subhra Mazumdar



On 04/24/2018 05:48 AM, Peter Zijlstra wrote:

On Mon, Apr 23, 2018 at 05:41:16PM -0700, subhra mazumdar wrote:

+   if (per_cpu(next_cpu, target) != -1)
+   target_tmp = per_cpu(next_cpu, target);
+   else
+   target_tmp = target;
+

This one; what's the point here?

I want to start the search from target the first time and from next_cpu from
then onwards. If this doesn't make any difference in performance I can change
it. It will require re-running all the tests.


Re: [PATCH 3/3] sched: limit cpu search and rotate search window for scalability

2018-04-24 Thread Subhra Mazumdar



On 04/24/2018 05:48 AM, Peter Zijlstra wrote:

On Mon, Apr 23, 2018 at 05:41:16PM -0700, subhra mazumdar wrote:

Lower the lower limit of idle cpu search in select_idle_cpu() and also put
an upper limit. This helps in scalability of the search by restricting the
search window. Also rotating the search window with help of next_cpu
ensures any idle cpu is eventually found in case of high load.

So this patch does 2 (possibly 3) things, that's not good.

During testing I did try restricting only the search window first. That alone
wasn't enough to give the full benefit; rotating the search window was
essential to get the best results. I will break this up in the next version.


Re: [PATCH 3/3] sched: limit cpu search and rotate search window for scalability

2018-04-24 Thread Subhra Mazumdar



On 04/24/2018 05:53 AM, Peter Zijlstra wrote:

On Mon, Apr 23, 2018 at 05:41:16PM -0700, subhra mazumdar wrote:

Lower the lower limit of idle cpu search in select_idle_cpu() and also put
an upper limit. This helps in scalability of the search by restricting the
search window.
@@ -6297,15 +6297,24 @@ static int select_idle_cpu(struct task_struct *p, 
struct sched_domain *sd, int t
  
  	if (sched_feat(SIS_PROP)) {

u64 span_avg = sd->span_weight * avg_idle;
-   if (span_avg > 4*avg_cost)
+   if (span_avg > 2*avg_cost) {
nr = div_u64(span_avg, avg_cost);
-   else
-   nr = 4;
+   if (nr > 4)
+   nr = 4;
+   } else {
+   nr = 2;
+   }
}

Why do you need to put a max on? Why isn't the proportional thing
working as is? (is the average no good because of big variance or what)

Firstly, the choice of 512 seems arbitrary. Secondly, the logic here is that
the enqueuing cpu should search up to the time it can get work itself. Why is
that the optimal amount to search?


Again, why do you need to lower the min; what's wrong with 4?

The reason I picked 4 is that many laptops have 4 CPUs and desktops
really want to avoid queueing if at all possible.

To find the optimum upper and lower limit I varied them over many
combinations. 4 and 2 gave the best results across most benchmarks.
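
To put the proportional formula in perspective, here is a worked example with
hypothetical averages (the avg_idle = 100 and avg_cost = 20 values below are
assumptions, not measured numbers), on a 44-cpu LLC:

    span_avg    = sd->span_weight * avg_idle  = 44 * 100 = 4400
    uncapped nr = div_u64(span_avg, avg_cost) = 4400 / 20 = 220
    capped nr   = 4 (with this patch)

i.e. the uncapped rule can easily ask for a scan wider than the whole LLC.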


Re: [RFC][PATCH 00/16] sched: Core scheduling

2019-02-28 Thread Subhra Mazumdar



On 2/18/19 8:56 AM, Peter Zijlstra wrote:

A much 'demanded' feature: core-scheduling :-(

I still hate it with a passion, and that is part of why it took a little
longer than 'promised'.

While this one doesn't have all the 'features' of the previous (never
published) version and isn't L1TF 'complete', I tend to like the structure
better (relatively speaking: I hate it slightly less).

This one is sched class agnostic and therefore, in principle, doesn't horribly
wreck RT (in fact, RT could 'ab'use this by setting 'task->core_cookie = task'
to force-idle siblings).

Now, as hinted by that, there are semi sane reasons for actually having this.
Various hardware features like Intel RDT - Memory Bandwidth Allocation, work
per core (due to SMT fundamentally sharing caches) and therefore grouping
related tasks on a core makes it more reliable.

However; whichever way around you turn this cookie; it is expensive and nasty.


I am seeing the following hard lockup frequently now. Following is the full
kernel output:

[ 5846.412296] drop_caches (8657): drop_caches: 3
[ 5846.624823] drop_caches (8658): drop_caches: 3
[ 5850.604641] hugetlbfs: oracle (8671): Using mlock ulimits for SHM_HUGETLB is deprecated
[ 5962.930812] NMI watchdog: Watchdog detected hard LOCKUP on cpu 32
[ 5962.930814] Modules linked in: drbd lru_cache autofs4 cpufreq_powersave ipv6 crc_ccitt mxm_wmi iTCO_wdt iTCO_vendor_support btrfs raid6_pq zstd_compress zstd_decompress xor pcspkr i2c_i801 lpc_ich mfd_core ioatdma ixgbe dca mdio sg ipmi_ssif i2c_core ipmi_si ipmi_msghandler wmi pcc_cpufreq acpi_pad ext4 fscrypto jbd2 mbcache sd_mod ahci libahci nvme nvme_core megaraid_sas dm_mirror dm_region_hash dm_log dm_mod
[ 5962.930828] CPU: 32 PID: 10333 Comm: oracle_10333_tp Not tainted 5.0.0-rc7core_sched #1
[ 5962.930828] Hardware name: Oracle Corporation ORACLE SERVER X6-2L/ASM,MOBO TRAY,2U, BIOS 39050100 08/30/2016
[ 5962.930829] RIP: 0010:try_to_wake_up+0x98/0x470
[ 5962.930830] Code: 5b 5d 41 5c 41 5d 41 5e 41 5f c3 0f 1f 44 00 00 8b 43 3c 8b 73 60 85 f6 0f 85 a6 01 00 00 8b 43 38 85 c0 74 09 f3 90 8b 43 38 <85> c0 75 f7 48 8b 43 10 a8 02 b8 00 00 00 00 0f 85 d5 01 00 00 0f
[ 5962.930831] RSP: 0018:c9000f4dbcb8 EFLAGS: 0002
[ 5962.930832] RAX: 0001 RBX: 88dfb4af1680 RCX: 0041
[ 5962.930832] RDX: 0001 RSI:  RDI: 88dfb4af214c
[ 5962.930833] RBP:  R08: 0001 R09: c9000f4dbd80
[ 5962.930833] R10: 8880 R11: ea00f0003d80 R12: 88dfb4af214c
[ 5962.930834] R13: 0001 R14: 0046 R15: 0001
[ 5962.930834] FS:  7ff4fabd9ae0() GS:88dfbe28() knlGS:
[ 5962.930834] CS:  0010 DS:  ES:  CR0: 80050033
[ 5962.930835] CR2: 000f4cc84000 CR3: 003b93d36002 CR4: 003606e0
[ 5962.930835] DR0:  DR1:  DR2: 
[ 5962.930836] DR3:  DR6: fffe0ff0 DR7: 0400
[ 5962.930836] Call Trace:
[ 5962.930837]  ? __switch_to_asm+0x34/0x70
[ 5962.930837]  ? __switch_to_asm+0x40/0x70
[ 5962.930838]  ? __switch_to_asm+0x34/0x70
[ 5962.930838]  autoremove_wake_function+0x11/0x50
[ 5962.930838]  __wake_up_common+0x8f/0x160
[ 5962.930839]  ? __switch_to_asm+0x40/0x70
[ 5962.930839]  __wake_up_common_lock+0x7c/0xc0
[ 5962.930840]  pipe_write+0x24e/0x3f0
[ 5962.930840]  __vfs_write+0x127/0x1b0
[ 5962.930840]  vfs_write+0xb3/0x1b0
[ 5962.930841]  ksys_write+0x52/0xc0
[ 5962.930841]  do_syscall_64+0x5b/0x170
[ 5962.930842]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 5962.930842] RIP: 0033:0x3b5900e7b0
[ 5962.930843] Code: 97 20 00 31 d2 48 29 c2 64 89 11 48 83 c8 ff eb ea 90 90 90 90 90 90 90 90 90 83 3d f1 db 20 00 00 75 10 b8 01 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 31 c3 48 83 ec 08 e8 5e fa ff ff 48 89 04 24
[ 5962.930843] RSP: 002b:7ffedbcd93a8 EFLAGS: 0246 ORIG_RAX: 0001
[ 5962.930844] RAX: ffda RBX: 7ff4faa86e24 RCX: 003b5900e7b0
[ 5962.930845] RDX: 028f RSI: 7ff4faa9688e RDI: 000a
[ 5962.930845] RBP: 7ffedbcd93c0 R08: 7ffedbcd9458 R09: 0020
[ 5962.930846] R10:  R11: 0246 R12: 7ffedbcd9458
[ 5962.930847] R13: 7ff4faa9688e R14: 7ff4faa89cc8 R15: 7ff4faa86bd0
[ 5962.930847] Kernel panic - not syncing: Hard LOCKUP
[ 5962.930848] CPU: 32 PID: 10333 Comm: oracle_10333_tp Not tainted 5.0.0-rc7core_sched #1
[ 5962.930848] Hardware name: Oracle Corporation ORACLE SERVER X6-2L/ASM,MOBO TRAY,2U, BIOS 39050100 08/30/2016
[ 5962.930849] Call Trace:
[ 5962.930849]  
[ 5962.930849]  dump_stack+0x5c/0x7b
[ 5962.930850]  panic+0xfe/0x2b2
[ 5962.930850]  nmi_panic+0x35/0x40
[ 5962.930851]  watchdog_overflow_callback+0xef/0x100
[ 5962.930851]  __perf_event_overflow+0x5a/0xe0
[ 5962.930852]  handle_pmi_common+0x1d1/0x280
[ 5962.930852]  ? __set_pte_vaddr+0x32/0x50
[ 5962.930852]  ? __set

Re: [RFC][PATCH 03/16] sched: Wrap rq::lock access

2019-03-26 Thread Subhra Mazumdar



On 3/22/19 5:06 PM, Subhra Mazumdar wrote:


On 3/21/19 2:20 PM, Julien Desfossez wrote:
On Tue, Mar 19, 2019 at 10:31 PM Subhra Mazumdar wrote:

On 3/18/19 8:41 AM, Julien Desfossez wrote:

On further investigation, we could see that the contention is mostly in the
way rq locks are taken. With this patchset, we lock the whole core if
cpu.tag is set for at least one cgroup. Due to this, __schedule() is more or
less serialized for the core and that attributes to the performance loss
that we are seeing. We also saw that newidle_balance() takes considerably
long time in load_balance() due to the rq spinlock contention. Do you think
it would help if the core-wide locking was only performed when absolutely
needed ?

Is the core wide lock primarily responsible for the regression? I ran upto
patch 12 which also has the core wide lock for tagged cgroups and also calls
newidle_balance() from pick_next_task(). I don't see any regression. Of
course the core sched version of pick_next_task() may be doing more but
comparing with the __pick_next_task() it doesn't look too horrible.

I gathered some data with only 1 DB instance running (which also has 52%
slow down). Following are the numbers of pick_next_task() calls and their
avg cost for patch 12 and patch 15. The total number of calls seems to be
similar but the avg cost (in us) has more than doubled. For both the patches
I had put the DB instance into a cpu tagged cgroup.

                          patch12        patch15
count pick_next_task      62317898       58925395
avg cost pick_next_task   0.6566323209   1.4223810108


Re: [RFC][PATCH 03/16] sched: Wrap rq::lock access

2019-04-01 Thread Subhra Mazumdar



On 3/29/19 3:23 PM, Subhra Mazumdar wrote:


On 3/29/19 6:35 AM, Julien Desfossez wrote:
On Fri, Mar 22, 2019 at 8:09 PM Subhra Mazumdar wrote:

Is the core wide lock primarily responsible for the regression? I ran upto
patch 12 which also has the core wide lock for tagged cgroups and also calls
newidle_balance() from pick_next_task(). I don't see any regression. Of
course the core sched version of pick_next_task() may be doing more but
comparing with the __pick_next_task() it doesn't look too horrible.

On further testing and investigation, we also agree that spinlock contention
is not the major cause for the regression, but we feel that it should be one
of the major contributing factors to this performance loss.

I finally did some code bisection and found the following lines are
basically responsible for the regression. Commenting them out I don't see
the regressions. Can you confirm? I am yet to figure if this is needed for
the correctness of core scheduling and if so can we do this better?

>8-

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index fe3918c..3b3388a 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3741,8 +3741,8 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 				 * If there weren't no cookies; we don't need
 				 * to bother with the other siblings.
 				 */
-				if (i == cpu && !rq->core->core_cookie)
-					goto next_class;
+				//if (i == cpu && !rq->core->core_cookie)
+				//	goto next_class;
 
 				continue;
 			}

AFAICT this condition is not needed for correctness as cookie matching will
still be enforced. Peter, any thoughts? I get the following numbers with 1 DB
and 2 DB instances.

1 DB instance
users  baseline   %idle  core_sched %idle
16 1  84 -5.5% 84
24 1  76 -5% 76
32 1  69 -0.45% 69

2 DB instance
users  baseline   %idle  core_sched %idle
16 1  66 -23.8% 69
24 1  54 -3.1% 57
32 1  42 -21.1%  48


Re: [RFC][PATCH 13/16] sched: Add core wide task selection and scheduling.

2019-04-10 Thread Subhra Mazumdar



On 4/9/19 11:38 AM, Julien Desfossez wrote:

We found the source of the major performance regression we discussed
previously. It turns out there was a pattern where a task (a kworker in this
case) could be woken up, but the core could still end up idle before that
task had a chance to run.

Example sequence, cpu0 and cpu1 and siblings on the same core, task1 and
task2 are in the same cgroup with the tag enabled (each following line
happens in the increasing order of time):
- task1 running on cpu0, task2 running on cpu1
- sched_waking(kworker/0, target_cpu=cpu0)
- task1 scheduled out of cpu0
- kworker/0 cannot run on cpu0 because of task2 is still running on cpu1
   cpu0 is idle
- task2 scheduled out of cpu1
- cpu1 doesn’t select kworker/0 for cpu0, because the optimization path ends
   the task selection if core_cookie is NULL for currently selected process
   and the cpu1’s runqueue.
- cpu1 is idle
--> both siblings are idle but kworker/0 is still in the run queue of cpu0.
 Cpu0 may stay idle for longer if it goes deep idle.

With the fix below, we ensure to send an IPI to the sibling if it is idle
and has tasks waiting in its runqueue.
This fixes the performance issue we were seeing.

Now here is what we can measure with a disk write-intensive benchmark:
- no performance impact with enabling core scheduling without any tagged
   task,
- 5% overhead if one tagged task is competing with an untagged task,
- 10% overhead if 2 tasks tagged with a different tag are competing
   against each other.

We are starting more scaling tests, but this is very encouraging !


diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index e1fa10561279..02c862a5e973 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3779,7 +3779,22 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 
 			trace_printk("unconstrained pick: %s/%d %lx\n",
 				     next->comm, next->pid, next->core_cookie);
+			rq->core_pick = NULL;
 
+			/*
+			 * If the sibling is idling, we might want to wake it
+			 * so that it can check for any runnable but blocked
+			 * tasks due to previous task matching.
+			 */
+			for_each_cpu(j, smt_mask) {
+				struct rq *rq_j = cpu_rq(j);
+				rq_j->core_pick = NULL;
+				if (j != cpu && is_idle_task(rq_j->curr) &&
+				    rq_j->nr_running) {
+					resched_curr(rq_j);
+					trace_printk("IPI(%d->%d[%d]) idle preempt\n",
+						     cpu, j, rq_j->nr_running);
+				}
+			}
 			goto done;
 		}
  

I see a similar improvement with this patch as with removing the condition I
mentioned earlier, so that's not needed. I also included the patch for the
priority fix. For 2 DB instances, HT disabling stands at -22% for 32 users
(from earlier emails).


1 DB instance

users  baseline   %idle    core_sched %idle
16 1  84   -4.9% 84
24 1  76   -6.7% 75
32 1  69   -2.4% 69

2 DB instance

users  baseline   %idle    core_sched %idle
16 1  66   -19.5% 69
24 1  54   -9.8% 57
32 1  42   -27.2%    48


Re: [RFC][PATCH 03/16] sched: Wrap rq::lock access

2019-03-19 Thread Subhra Mazumdar



On 3/18/19 8:41 AM, Julien Desfossez wrote:

The case where we try to acquire the lock on 2 runqueues belonging to 2
different cores requires the rq_lockp wrapper as well otherwise we
frequently deadlock in there.

This fixes the crash reported in
1552577311-8218-1-git-send-email-jdesfos...@digitalocean.com

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 76fee56..71bb71f 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2078,7 +2078,7 @@ static inline void double_rq_lock(struct rq *rq1, struct rq *rq2)
 		raw_spin_lock(rq_lockp(rq1));
 		__acquire(rq2->lock);	/* Fake it out ;) */
 	} else {
-		if (rq1 < rq2) {
+		if (rq_lockp(rq1) < rq_lockp(rq2)) {
 			raw_spin_lock(rq_lockp(rq1));
 			raw_spin_lock_nested(rq_lockp(rq2), SINGLE_DEPTH_NESTING);
 		} else {
With this fix and my previous NULL pointer fix my stress tests are surviving.
I re-ran my 2 DB instance setup on a 44 core, 2 socket system by putting each
DB instance in a separate core scheduling group. The numbers look much worse
now.


users  baseline  %stdev  %idle  core_sched  %stdev %idle
16 1 0.3 66 -73.4%  136.8 82
24 1 1.6 54 -95.8%  133.2 81
32 1 1.5 42 -97.5%  124.3 89

I also notice that if I enable a bunch of debug configs related to mutexes,
spin locks, lockdep etc. (which I did earlier to debug the deadlock), it
opens up a can of worms with multiple crashes.


[RFC PATCH 4/9] sched: SIS_CORE to disable idle core search

2019-08-30 Thread subhra mazumdar
Use SIS_CORE to disable idle core search. For some workloads
select_idle_core becomes a scalability bottleneck, removing it improves
throughput. Also there are workloads where disabling it can hurt latency,
so need to have an option.

Signed-off-by: subhra mazumdar 
---
 kernel/sched/fair.c | 8 +---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c31082d..23ec9c6 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6268,9 +6268,11 @@ static int select_idle_sibling(struct task_struct *p, 
int prev, int target)
if (!sd)
return target;
 
-   i = select_idle_core(p, sd, target);
-   if ((unsigned)i < nr_cpumask_bits)
-   return i;
+   if (sched_feat(SIS_CORE)) {
+   i = select_idle_core(p, sd, target);
+   if ((unsigned)i < nr_cpumask_bits)
+   return i;
+   }
 
i = select_idle_cpu(p, sd, target);
if ((unsigned)i < nr_cpumask_bits)
-- 
2.9.3



[RFC PATCH 2/9] sched: add search limit as per latency-nice

2019-08-30 Thread subhra mazumdar
Put upper and lower limits on the CPU search in select_idle_cpu. The lower
limit is set to the number of CPUs in a core, while the upper limit is
derived from the latency-nice of the thread. This ensures that for any
architecture we will usually search beyond a core. Changing the latency-nice
value by the user will change the search cost, making it appropriate for the
given workload.

Signed-off-by: subhra mazumdar 
---
 kernel/sched/fair.c | 13 +++--
 1 file changed, 7 insertions(+), 6 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b08d00c..c31082d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6188,7 +6188,7 @@ static int select_idle_cpu(struct task_struct *p, struct 
sched_domain *sd, int t
u64 avg_cost, avg_idle;
u64 time, cost;
s64 delta;
-   int cpu, nr = INT_MAX;
+   int cpu, floor, nr = INT_MAX;
 
this_sd = rcu_dereference(*this_cpu_ptr(&sd_llc));
if (!this_sd)
@@ -6205,11 +6205,12 @@ static int select_idle_cpu(struct task_struct *p, 
struct sched_domain *sd, int t
return -1;
 
if (sched_feat(SIS_PROP)) {
-   u64 span_avg = sd->span_weight * avg_idle;
-   if (span_avg > 4*avg_cost)
-   nr = div_u64(span_avg, avg_cost);
-   else
-   nr = 4;
+   floor = cpumask_weight(topology_sibling_cpumask(target));
+   if (floor < 2)
+   floor = 2;
+   nr = (p->latency_nice * sd->span_weight) / LATENCY_NICE_MAX;
+   if (nr < floor)
+   nr = floor;
}
 
time = local_clock();
-- 
2.9.3



[RFC PATCH 1/9] sched,cgroup: Add interface for latency-nice

2019-08-30 Thread subhra mazumdar
Add Cgroup interface for latency-nice. Each CPU Cgroup adds a new file
"latency-nice" which is shared by all the threads in that Cgroup.

Signed-off-by: subhra mazumdar 
---
 include/linux/sched.h |  1 +
 kernel/sched/core.c   | 40 
 kernel/sched/fair.c   |  1 +
 kernel/sched/sched.h  |  8 
 4 files changed, 50 insertions(+)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 1183741..b4a79c3 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -631,6 +631,7 @@ struct task_struct {
int static_prio;
int normal_prio;
unsigned intrt_priority;
+   u64 latency_nice;
 
const struct sched_class*sched_class;
struct sched_entity se;
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 874c427..47969bc 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5976,6 +5976,7 @@ void __init sched_init(void)
init_dl_rq(&rq->dl);
 #ifdef CONFIG_FAIR_GROUP_SCHED
root_task_group.shares = ROOT_TASK_GROUP_LOAD;
+   root_task_group.latency_nice = LATENCY_NICE_DEFAULT;
INIT_LIST_HEAD(&rq->leaf_cfs_rq_list);
rq->tmp_alone_branch = &rq->leaf_cfs_rq_list;
/*
@@ -6345,6 +6346,7 @@ static void sched_change_group(struct task_struct *tsk, 
int type)
 */
tg = container_of(task_css_check(tsk, cpu_cgrp_id, true),
  struct task_group, css);
+   tsk->latency_nice = tg->latency_nice;
tg = autogroup_task_group(tsk, tg);
tsk->sched_task_group = tg;
 
@@ -6812,6 +6814,34 @@ static u64 cpu_rt_period_read_uint(struct 
cgroup_subsys_state *css,
 }
 #endif /* CONFIG_RT_GROUP_SCHED */
 
+static u64 cpu_latency_nice_read_u64(struct cgroup_subsys_state *css,
+struct cftype *cft)
+{
+   struct task_group *tg = css_tg(css);
+
+   return tg->latency_nice;
+}
+
+static int cpu_latency_nice_write_u64(struct cgroup_subsys_state *css,
+ struct cftype *cft, u64 latency_nice)
+{
+   struct task_group *tg = css_tg(css);
+   struct css_task_iter it;
+   struct task_struct *p;
+
+   if (latency_nice < LATENCY_NICE_MIN || latency_nice > LATENCY_NICE_MAX)
+   return -ERANGE;
+
+   tg->latency_nice = latency_nice;
+
+   css_task_iter_start(css, 0, &it);
+   while ((p = css_task_iter_next(&it)))
+   p->latency_nice = latency_nice;
+   css_task_iter_end(&it);
+
+   return 0;
+}
+
 static struct cftype cpu_legacy_files[] = {
 #ifdef CONFIG_FAIR_GROUP_SCHED
{
@@ -6848,6 +6878,11 @@ static struct cftype cpu_legacy_files[] = {
.write_u64 = cpu_rt_period_write_uint,
},
 #endif
+   {
+   .name = "latency-nice",
+   .read_u64 = cpu_latency_nice_read_u64,
+   .write_u64 = cpu_latency_nice_write_u64,
+   },
{ } /* Terminate */
 };
 
@@ -7015,6 +7050,11 @@ static struct cftype cpu_files[] = {
.write = cpu_max_write,
},
 #endif
+   {
+   .name = "latency-nice",
+   .read_u64 = cpu_latency_nice_read_u64,
+   .write_u64 = cpu_latency_nice_write_u64,
+   },
{ } /* terminate */
 };
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f35930f..b08d00c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10479,6 +10479,7 @@ int alloc_fair_sched_group(struct task_group *tg, 
struct task_group *parent)
goto err;
 
tg->shares = NICE_0_LOAD;
+   tg->latency_nice = LATENCY_NICE_DEFAULT;
 
init_cfs_bandwidth(tg_cfs_bandwidth(tg));
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index b52ed1a..365c928 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -143,6 +143,13 @@ static inline void cpu_load_update_active(struct rq 
*this_rq) { }
 #define NICE_0_LOAD(1L << NICE_0_LOAD_SHIFT)
 
 /*
+ * Latency-nice default value
+ */
+#defineLATENCY_NICE_DEFAULT5
+#defineLATENCY_NICE_MIN1
+#defineLATENCY_NICE_MAX100
+
+/*
  * Single value that decides SCHED_DEADLINE internal math precision.
  * 10 -> just above 1us
  * 9  -> just above 0.5us
@@ -362,6 +369,7 @@ struct cfs_bandwidth {
 /* Task group related information */
 struct task_group {
struct cgroup_subsys_state css;
+   u64 latency_nice;
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
/* schedulable entities of this group on each CPU */
-- 
2.9.3



[RFC PATCH 9/9] sched: rotate the cpu search window for better spread

2019-08-30 Thread subhra mazumdar
Rotate the cpu search window for better spread of threads. This will ensure
an idle cpu will quickly be found if one exists.

Signed-off-by: subhra mazumdar 
---
 kernel/sched/fair.c | 10 --
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 94dd4a32..7419b47 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6188,7 +6188,7 @@ static int select_idle_cpu(struct task_struct *p, struct 
sched_domain *sd, int t
u64 avg_cost, avg_idle;
u64 time, cost;
s64 delta;
-   int cpu, floor, nr = INT_MAX;
+   int cpu, floor, target_tmp, nr = INT_MAX;
 
this_sd = rcu_dereference(*this_cpu_ptr(&sd_llc));
if (!this_sd)
@@ -6213,9 +6213,15 @@ static int select_idle_cpu(struct task_struct *p, struct 
sched_domain *sd, int t
nr = floor;
}
 
+   if (per_cpu(next_cpu, target) != -1)
+   target_tmp = per_cpu(next_cpu, target);
+   else
+   target_tmp = target;
+
time = local_clock();
 
-   for_each_cpu_wrap(cpu, sched_domain_span(sd), target) {
+   for_each_cpu_wrap(cpu, sched_domain_span(sd), target_tmp) {
+   per_cpu(next_cpu, target) = cpu;
if (!--nr)
return -1;
if (!cpumask_test_cpu(cpu, &p->cpus_allowed))
-- 
2.9.3



[RFC PATCH 8/9] sched: introduce per-cpu var next_cpu to track search limit

2019-08-30 Thread subhra mazumdar
Introduce a per-cpu variable to track the limit up to which the idle cpu
search was done in select_idle_cpu(). This will help to start the search from
there next time. This is necessary for rotating the search window over the
entire LLC domain.

Signed-off-by: subhra mazumdar 
---
 kernel/sched/core.c  | 2 ++
 kernel/sched/sched.h | 1 +
 2 files changed, 3 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 47969bc..5862d54 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -24,6 +24,7 @@
 #include 
 
 DEFINE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues);
+DEFINE_PER_CPU_SHARED_ALIGNED(int, next_cpu);
 
 #if defined(CONFIG_SCHED_DEBUG) && defined(CONFIG_JUMP_LABEL)
 /*
@@ -5966,6 +5967,7 @@ void __init sched_init(void)
for_each_possible_cpu(i) {
struct rq *rq;
 
+   per_cpu(next_cpu, i) = -1;
rq = cpu_rq(i);
raw_spin_lock_init(&rq->lock);
rq->nr_running = 0;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 365c928..cca2b09 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1002,6 +1002,7 @@ static inline void update_idle_core(struct rq *rq) { }
 #endif
 
 DECLARE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues);
+DECLARE_PER_CPU_SHARED_ALIGNED(int, next_cpu);
 
 #define cpu_rq(cpu)(&per_cpu(runqueues, (cpu)))
 #define this_rq()  this_cpu_ptr(&runqueues)
-- 
2.9.3



[RFC PATCH 7/9] sched: search SMT before LLC domain

2019-08-30 Thread subhra mazumdar
Search SMT siblings before all CPUs in LLC domain for idle CPU. This helps
in L1 cache locality.
---
 kernel/sched/fair.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 8856503..94dd4a32 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6274,11 +6274,11 @@ static int select_idle_sibling(struct task_struct *p, 
int prev, int target)
return i;
}
 
-   i = select_idle_cpu(p, sd, target);
+   i = select_idle_smt(p, target);
if ((unsigned)i < nr_cpumask_bits)
return i;
 
-   i = select_idle_smt(p, target);
+   i = select_idle_cpu(p, sd, target);
if ((unsigned)i < nr_cpumask_bits)
return i;
 
-- 
2.9.3



[RFC PATCH 0/9] Task latency-nice

2019-08-30 Thread subhra mazumdar
Introduce a new per-task property, latency-nice, for controlling scalability
in the scheduler idle CPU search path. Valid latency-nice values are from 1 to
100, indicating 1% to 100% search of the LLC domain in select_idle_cpu. A new
CPU cgroup file cpu.latency-nice is added as an interface to set and get it.
All tasks in the same cgroup share the same latency-nice value. Using a lower
latency-nice value can help latency intolerant tasks, e.g. very short running
OLTP threads, where the full LLC search cost can be significant compared to
the run time of the threads. The default latency-nice value is 5.
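
As a worked example on the 2 socket, 44 core, 88 thread test system below
(44 CPUs per LLC domain, SMT2), the limit from patch 2/9 works out as:

    floor = CPUs per core = 2
    nr    = (latency_nice * sd->span_weight) / LATENCY_NICE_MAX
          = (5 * 44) / 100 = 2  -> clamped to floor, so 2 CPUs are scanned
    latency_nice = 100 would give nr = 44, i.e. the whole LLC domain.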

In addition to latency-nice, it also adds a new sched feature SIS_CORE to
be able to disable idle core search altogether which is costly and hurts
more than it helps in short running workloads.

Finally it also introduces a new per-cpu variable next_cpu to track
the limit of search so that every time search starts from where it ended.
This rotating search window over cpus in LLC domain ensures that idle
cpus are eventually found in case of high load.

Uperf pingpong on 2 socket, 44 core and 88 threads Intel x86 machine with
message size = 8k (higher is better):
threads baseline   latency-nice=5,SIS_CORE latency-nice=5,NO_SIS_CORE 
8   64.66  64.38 (-0.43%)  64.79 (0.2%)
16  123.34 122.88 (-0.37%) 125.87 (2.05%)
32  215.18 215.55 (0.17%)  247.77 (15.15%)
48  278.56 321.6 (15.45%)  321.2 (15.3%)
64  259.99 319.45 (22.87%) 333.95 (28.44%)
128 431.1  437.69 (1.53%)  431.09 (0%)

subhra mazumdar (9):
  sched,cgroup: Add interface for latency-nice
  sched: add search limit as per latency-nice
  sched: add sched feature to disable idle core search
  sched: SIS_CORE to disable idle core search
  sched: Define macro for number of CPUs in core
  x86/smpboot: Optimize cpumask_weight_sibling macro for x86
  sched: search SMT before LLC domain
  sched: introduce per-cpu var next_cpu to track search limit
  sched: rotate the cpu search window for better spread

 arch/x86/include/asm/smp.h  |  1 +
 arch/x86/include/asm/topology.h |  1 +
 arch/x86/kernel/smpboot.c   | 17 -
 include/linux/sched.h   |  1 +
 include/linux/topology.h|  4 
 kernel/sched/core.c | 42 +
 kernel/sched/fair.c | 34 +
 kernel/sched/features.h |  1 +
 kernel/sched/sched.h|  9 +
 9 files changed, 97 insertions(+), 13 deletions(-)

-- 
2.9.3



[RFC PATCH 5/9] sched: Define macro for number of CPUs in core

2019-08-30 Thread subhra mazumdar
Introduce macro topology_sibling_weight for number of sibling CPUs in a
core and use in select_idle_cpu

Signed-off-by: subhra mazumdar 
---
 include/linux/topology.h | 4 
 kernel/sched/fair.c  | 2 +-
 2 files changed, 5 insertions(+), 1 deletion(-)

diff --git a/include/linux/topology.h b/include/linux/topology.h
index cb0775e..a85aea1 100644
--- a/include/linux/topology.h
+++ b/include/linux/topology.h
@@ -190,6 +190,10 @@ static inline int cpu_to_mem(int cpu)
 #ifndef topology_sibling_cpumask
 #define topology_sibling_cpumask(cpu)  cpumask_of(cpu)
 #endif
+#ifndef topology_sibling_weight
+#define topology_sibling_weight(cpu)   \
+   cpumask_weight(topology_sibling_cpumask(cpu))
+#endif
 #ifndef topology_core_cpumask
 #define topology_core_cpumask(cpu) cpumask_of(cpu)
 #endif
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 23ec9c6..8856503 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6205,7 +6205,7 @@ static int select_idle_cpu(struct task_struct *p, struct 
sched_domain *sd, int t
return -1;
 
if (sched_feat(SIS_PROP)) {
-   floor = cpumask_weight(topology_sibling_cpumask(target));
+   floor = topology_sibling_weight(target);
if (floor < 2)
floor = 2;
nr = (p->latency_nice * sd->span_weight) / LATENCY_NICE_MAX;
-- 
2.9.3



[RFC PATCH 6/9] x86/smpboot: Optimize cpumask_weight_sibling macro for x86

2019-08-30 Thread subhra mazumdar
Use a per-CPU variable for the cpumask_weight_sibling macro in case of x86
for fast lookup in select_idle_cpu. This avoids reading multiple cache lines
in case of systems with large numbers of CPUs, where the bitmask can span
multiple cache lines. Even if the bitmask spans only one cache line, this
avoids looping through it to find the number of set bits and gets it in O(1).

Signed-off-by: subhra mazumdar 
---
 arch/x86/include/asm/smp.h  |  1 +
 arch/x86/include/asm/topology.h |  1 +
 arch/x86/kernel/smpboot.c   | 17 -
 3 files changed, 18 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/smp.h b/arch/x86/include/asm/smp.h
index da545df..1e90cbd 100644
--- a/arch/x86/include/asm/smp.h
+++ b/arch/x86/include/asm/smp.h
@@ -22,6 +22,7 @@ extern int smp_num_siblings;
 extern unsigned int num_processors;
 
 DECLARE_PER_CPU_READ_MOSTLY(cpumask_var_t, cpu_sibling_map);
+DECLARE_PER_CPU_READ_MOSTLY(unsigned int, cpumask_weight_sibling);
 DECLARE_PER_CPU_READ_MOSTLY(cpumask_var_t, cpu_core_map);
 /* cpus sharing the last level cache: */
 DECLARE_PER_CPU_READ_MOSTLY(cpumask_var_t, cpu_llc_shared_map);
diff --git a/arch/x86/include/asm/topology.h b/arch/x86/include/asm/topology.h
index 453cf38..dd19c71 100644
--- a/arch/x86/include/asm/topology.h
+++ b/arch/x86/include/asm/topology.h
@@ -111,6 +111,7 @@ extern const struct cpumask *cpu_coregroup_mask(int cpu);
 #ifdef CONFIG_SMP
 #define topology_core_cpumask(cpu) (per_cpu(cpu_core_map, cpu))
 #define topology_sibling_cpumask(cpu)  (per_cpu(cpu_sibling_map, cpu))
+#define topology_sibling_weight(cpu)   (per_cpu(cpumask_weight_sibling, cpu))
 
 extern unsigned int __max_logical_packages;
 #define topology_max_packages()(__max_logical_packages)
diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index 362dd89..57ad88d 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -85,6 +85,9 @@
 DEFINE_PER_CPU_READ_MOSTLY(cpumask_var_t, cpu_sibling_map);
 EXPORT_PER_CPU_SYMBOL(cpu_sibling_map);
 
+/* representing number of HT siblings of each CPU */
+DEFINE_PER_CPU_READ_MOSTLY(unsigned int, cpumask_weight_sibling);
+
 /* representing HT and core siblings of each logical CPU */
 DEFINE_PER_CPU_READ_MOSTLY(cpumask_var_t, cpu_core_map);
 EXPORT_PER_CPU_SYMBOL(cpu_core_map);
@@ -520,6 +523,8 @@ void set_cpu_sibling_map(int cpu)
 
if (!has_mp) {
cpumask_set_cpu(cpu, topology_sibling_cpumask(cpu));
+   per_cpu(cpumask_weight_sibling, cpu) =
+   cpumask_weight(topology_sibling_cpumask(cpu));
cpumask_set_cpu(cpu, cpu_llc_shared_mask(cpu));
cpumask_set_cpu(cpu, topology_core_cpumask(cpu));
c->booted_cores = 1;
@@ -529,8 +534,13 @@ void set_cpu_sibling_map(int cpu)
for_each_cpu(i, cpu_sibling_setup_mask) {
o = &cpu_data(i);
 
-   if ((i == cpu) || (has_smt && match_smt(c, o)))
+   if ((i == cpu) || (has_smt && match_smt(c, o))) {
link_mask(topology_sibling_cpumask, cpu, i);
+   per_cpu(cpumask_weight_sibling, cpu) =
+   cpumask_weight(topology_sibling_cpumask(cpu));
+   per_cpu(cpumask_weight_sibling, i) =
+   cpumask_weight(topology_sibling_cpumask(i));
+   }
 
if ((i == cpu) || (has_mp && match_llc(c, o)))
link_mask(cpu_llc_shared_mask, cpu, i);
@@ -1173,6 +1183,8 @@ static __init void disable_smp(void)
else
physid_set_mask_of_physid(0, &phys_cpu_present_map);
cpumask_set_cpu(0, topology_sibling_cpumask(0));
+   per_cpu(cpumask_weight_sibling, 0) =
+   cpumask_weight(topology_sibling_cpumask(0));
cpumask_set_cpu(0, topology_core_cpumask(0));
 }
 
@@ -1482,6 +1494,8 @@ static void remove_siblinginfo(int cpu)
 
for_each_cpu(sibling, topology_core_cpumask(cpu)) {
cpumask_clear_cpu(cpu, topology_core_cpumask(sibling));
+   per_cpu(cpumask_weight_sibling, sibling) =
+   cpumask_weight(topology_sibling_cpumask(sibling));
/*/
 * last thread sibling in this cpu core going down
 */
@@ -1495,6 +1509,7 @@ static void remove_siblinginfo(int cpu)
cpumask_clear_cpu(cpu, cpu_llc_shared_mask(sibling));
cpumask_clear(cpu_llc_shared_mask(cpu));
cpumask_clear(topology_sibling_cpumask(cpu));
+   per_cpu(cpumask_weight_sibling, cpu) = 0;
cpumask_clear(topology_core_cpumask(cpu));
c->cpu_core_id = 0;
c->booted_cores = 0;
-- 
2.9.3



[RFC PATCH 3/9] sched: add sched feature to disable idle core search

2019-08-30 Thread subhra mazumdar
Add a new sched feature SIS_CORE to have an option to disable idle core
search (select_idle_core).

Signed-off-by: subhra mazumdar 
---
 kernel/sched/features.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index 858589b..de4d506 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -57,6 +57,7 @@ SCHED_FEAT(TTWU_QUEUE, true)
  */
 SCHED_FEAT(SIS_AVG_CPU, false)
 SCHED_FEAT(SIS_PROP, true)
+SCHED_FEAT(SIS_CORE, true)
 
 /*
  * Issue a WARN when we do multiple update_rq_clock() calls
-- 
2.9.3



Panic on v5.3-rc4

2019-08-15 Thread Subhra Mazumdar

I am getting the following panic during boot of tag v5.3-rc4 of
git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git. I don't
see the panic on tag v5.2 on the same rig. Is it a bug or did something
legitimately change?

Thanks,
Subhra

[  147.184948] dracut Warning: No root device "block:/dev/mapper/vg_paex623-lv_root" found
[  147.282665] dracut Warning: LVM vg_paex623/lv_swap not found
[  147.354854] dracut Warning: LVM vg_paex623/lv_root not found
[  147.432099] dracut Warning: Boot has failed. To debug this issue add "rdshell" to the kernel command line.
[  147.549737] dracut Warning: Signal caught!
[  147.600145] dracut Warning: LVM vg_paex623/lv_swap not found
[  147.670692] dracut Warning: LVM vg_paex623/lv_root not found
[  147.738593] dracut Warning: Boot has failed. To debug this issue add "rdshell" to the kernel command line.
[  147.856206] Kernel panic - not syncing: Attempted to kill init! exitcode=0x0100
[  147.947859] CPU: 3 PID: 1 Comm: init Not tainted 5.3.0-rc4latency_nice_BL #3
[  148.032225] Hardware name: Oracle Corporation ORACLE SERVER X6-2L/ASM,MOBO TRAY,2U, BIOS 39050100 08/30/2016
[  148.149879] Call Trace:
[  148.179117] dump_stack+0x5c/0x7b
[  148.218760] panic+0xfe/0x2e2
[  148.254242] do_exit+0xbd8/0xbe0
[  148.292842] do_group_exit+0x3a/0xa0
[  148.335597] __x64_sys_exit_group+0x14/0x20
[  148.385644] do_syscall_64+0x5b/0x1d0
[  148.429448] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  148.489900] RIP: 0033:0x3a5ccad018
[  148.530582] Code: Bad RIP value.
[  148.569178] RSP: 002b:7ffc147a5b48 EFLAGS: 0246 ORIG_RAX: 00e7
[  148.659791] RAX: ffda RBX: 0004 RCX: 003a5ccad018
[  148.745197] RDX: 0001 RSI: 003c RDI: 0001
[  148.830605] RBP:  R08: 00e7 R09: ffa8
[  148.916011] R10: 0001 R11: 0246 R12: 00401fb0
[  149.001414] R13: 7ffc147a5e20 R14:  R15: 
[  149.086929] Kernel Offset: disabled
[  149.132815] ---[ end Kernel panic - not syncing: Attempted to kill init! exitcode=0x0100 ]---




Re: [RFC PATCH 7/9] sched: search SMT before LLC domain

2019-09-05 Thread Subhra Mazumdar



On 9/5/19 2:31 AM, Peter Zijlstra wrote:

On Fri, Aug 30, 2019 at 10:49:42AM -0700, subhra mazumdar wrote:

Search SMT siblings before all CPUs in LLC domain for idle CPU. This helps
in L1 cache locality.
---
  kernel/sched/fair.c | 4 ++--
  1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 8856503..94dd4a32 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6274,11 +6274,11 @@ static int select_idle_sibling(struct task_struct *p, 
int prev, int target)
return i;
}
  
-	i = select_idle_cpu(p, sd, target);

+   i = select_idle_smt(p, target);
if ((unsigned)i < nr_cpumask_bits)
return i;
  
-	i = select_idle_smt(p, target);

+   i = select_idle_cpu(p, sd, target);
if ((unsigned)i < nr_cpumask_bits)
return i;
  

But it is absolutely conceptually wrong. An idle core is a much better
target than an idle sibling.

This is select_idle_smt, not select_idle_core.


Re: [RFC PATCH 3/9] sched: add sched feature to disable idle core search

2019-09-05 Thread Subhra Mazumdar



On 9/5/19 3:17 AM, Patrick Bellasi wrote:

On Fri, Aug 30, 2019 at 18:49:38 +0100, subhra mazumdar wrote...


Add a new sched feature SIS_CORE to have an option to disable idle core
search (select_idle_core).

Signed-off-by: subhra mazumdar 
---
  kernel/sched/features.h | 1 +
  1 file changed, 1 insertion(+)

diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index 858589b..de4d506 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -57,6 +57,7 @@ SCHED_FEAT(TTWU_QUEUE, true)
   */
  SCHED_FEAT(SIS_AVG_CPU, false)
  SCHED_FEAT(SIS_PROP, true)
+SCHED_FEAT(SIS_CORE, true)

Why do we need a sched_feature? If you think there are systems in
which the usage of latency-nice does not make sense for in "Select Idle
Sibling", then we should probably better add a new Kconfig option.

This is not for latency-nice but to be able to disable a different aspect of
the scheduler, i.e. searching for idle cores. This can be made part of
latency-nice (i.e. not do the idle core search if latency-nice is below a
certain value), but even then having a feature to disable it doesn't hurt.
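
For illustration, something like the following is what I have in mind,
building on the SIS_CORE check in select_idle_sibling (a minimal sketch only;
LATENCY_NICE_CORE_CUTOFF is an assumed cutoff, not something defined in this
series):

	/*
	 * Hypothetical sketch: skip the costly idle core search for very
	 * latency-sensitive tasks. LATENCY_NICE_CORE_CUTOFF is an assumed
	 * tunable, not part of this patch set.
	 */
	if (sched_feat(SIS_CORE) &&
	    p->latency_nice >= LATENCY_NICE_CORE_CUTOFF) {
		i = select_idle_core(p, sd, target);
		if ((unsigned)i < nr_cpumask_bits)
			return i;
	}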


If that's the case, you can probably use the init/Kconfig's
"Scheduler features" section, recently added by:

   commit 69842cba9ace ("sched/uclamp: Add CPU's clamp buckets refcounting")


  /*
   * Issue a WARN when we do multiple update_rq_clock() calls

Best,
Patrick



Re: [RFC PATCH 2/3] sched: change scheduler to give preference to soft affinity CPUs

2019-07-16 Thread Subhra Mazumdar



On 7/2/19 10:58 PM, Peter Zijlstra wrote:

On Wed, Jun 26, 2019 at 03:47:17PM -0700, subhra mazumdar wrote:

The soft affinity CPUs present in the cpumask cpus_preferred is used by the
scheduler in two levels of search. First is in determining wake affine
which choses the LLC domain and secondly while searching for idle CPUs in
LLC domain. In the first level it uses cpus_preferred to prune out the
search space. In the second level it first searches the cpus_preferred and
then cpus_allowed. Using affinity_unequal flag it breaks early to avoid
any overhead in the scheduler fast path when soft affinity is not used.
This only changes the wake up path of the scheduler, the idle balancing
is unchanged; together they achieve the "softness" of scheduling.

I really dislike this implementation.

I thought the idea was to remain work conserving (in so far as that
we're that anyway), so changing select_idle_sibling() doesn't make sense
to me. If there is idle, we use it.

Same for newidle; which you already retained.

The scheduler is already not work conserving in many ways. Soft affinity is
only for those who want to use it and has no side effects when not used.
Also, given the way the first level of search is implemented in the
scheduler, it may not be possible to do it in a work-conserving way; I am
open to ideas.


This then leaves regular balancing, and for that we can fudge with
can_migrate_task() and nr_balance_failed or something.

Possibly but I don't know if similar performance behavior can be achieved
by the periodic load balancer. Do you want a performance comparison of the
two approaches?


And I also really don't want a second utilization tipping point; we
already have the overloaded thing.

The numbers in the cover letter show that a static tipping point will not
work for all workloads. What soft affinity is doing is essentially trading
off cache coherence for more CPU. The optimum tradeoff point will vary from
workload to workload and with system metrics such as coherence overhead. If
we just use the domain overload, that becomes a static definition of the
tipping point; we need something tunable that captures this tradeoff. The
ratio of CPU utilization seemed to work well and capture that.
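
To make it concrete, a rough sketch of the kind of ratio check I mean (the
function name, parameters, and exact semantics here are illustrative
assumptions, not the actual patch):

	/*
	 * Hypothetical illustration only: spill from cpus_preferred to the
	 * full cpus_allowed set once the utilization of the preferred CPUs
	 * crosses a tunable fraction (in percent) of the utilization of the
	 * allowed set.
	 */
	static inline bool spill_to_allowed(unsigned long util_preferred,
					    unsigned long util_allowed,
					    unsigned int ratio_pct)
	{
		return util_preferred * 100 > util_allowed * ratio_pct;
	}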


I also still dislike how you never looked into the numa balancer, which
already has peferred_nid stuff.

Not sure if you mean using the existing NUMA balancer or enhancing it. If
the former, I have numbers in the cover letter that show the NUMA balancer is
not making any difference. I allocated the memory of each DB instance to one
NUMA node using numactl, but the NUMA balancer still migrated pages, so
numactl only seems to control the initial allocation. Secondly, even though
the NUMA balancer migrated pages, it had no performance benefit compared to
disabling it.


Re: [RFC][PATCH 00/16] sched: Core scheduling

2019-02-20 Thread Subhra Mazumdar



On 2/20/19 1:42 AM, Peter Zijlstra wrote:

A: Because it messes up the order in which people normally read text.
Q: Why is top-posting such a bad thing?
A: Top-posting.
Q: What is the most annoying thing in e-mail?

On Tue, Feb 19, 2019 at 02:07:01PM -0800, Greg Kerr wrote:

Thanks for posting this patchset Peter. Based on the patch titled, "sched: A
quick and dirty cgroup tagging interface," I believe cgroups are used to
define co-scheduling groups in this implementation.

Chrome OS engineers (kerr...@google.com, mpden...@google.com, and
pal...@google.com) are considering an interface that is usable by unprivileged
userspace apps. cgroups are a global resource that require privileged access.
Have you considered an interface that is akin to namespaces? Consider the
following strawperson API proposal (I understand prctl() is generally
used for process
specific actions, so we aren't married to using prctl()):

I don't think we're anywhere near the point where I care about
interfaces with this stuff.

Interfaces are a trivial but tedious matter once the rest works to
satisfaction.

As it happens; there is actually a bug in that very cgroup patch that
can cause undesired scheduling. Try spotting and fixing that.

Another question is if we want to be L1TF complete (and how strict) or
not, and if so, build the missing pieces (for instance we currently
don't kick siblings on IRQ/trap/exception entry -- and yes that's nasty
and horrible code and missing for that reason).

I remember asking Paul about this and he mentioned he has an Address Space
Isolation proposal to cover this. So it seems this is out of the scope of
core scheduling?


So first; does this provide what we need? If that's sorted we can
bike-shed on uapi/abi.


Re: [RFC][PATCH 00/16] sched: Core scheduling

2019-02-20 Thread Subhra Mazumdar



On 2/18/19 9:49 AM, Linus Torvalds wrote:

On Mon, Feb 18, 2019 at 9:40 AM Peter Zijlstra  wrote:

However; whichever way around you turn this cookie; it is expensive and nasty.

Do you (or anybody else) have numbers for real loads?

Because performance is all that matters. If performance is bad, then
it's pointless, since just turning off SMT is the answer.

   Linus

I tested 2 Oracle DB instances running OLTP on a 2-socket, 44-core system.
This is on bare metal, no virtualization. In all cases I put each DB
instance in a separate cpu cgroup. Following are the avg throughput numbers
of the 2 instances. %stdev is the standard deviation between the 2
instances.

Baseline = build w/o CONFIG_SCHED_CORE
core_sched = build w/ CONFIG_SCHED_CORE
HT_disable = offlined sibling HT with baseline

Users  Baseline  %stdev  core_sched %stdev HT_disable   %stdev
16 997768    3.28    808193(-19%)   34 1053888(+5.6%)   2.9
24 1157314   9.4 974555(-15.8%) 40.5 1197904(+3.5%)   4.6
32 1693644   6.4 1237195(-27%)  42.8 1308180(-22.8%)  5.3

The regressions are substantial. I also noticed one of the DB instances had
much lower throughput than the other with core scheduling, which
brought down the avg and is also reflected in the very high %stdev. Disabling
HT has an effect at 32 users but is still better than core scheduling both in
terms of avg and %stdev. There is some issue with the DB setup because of
which I couldn't go beyond 32 users.


Re: [RFC PATCH v2 11/17] sched: Basic tracking of matching tasks

2019-05-08 Thread Subhra Mazumdar



On 5/8/19 8:49 AM, Aubrey Li wrote:

Pawan ran an experiment setting up 2 VMs, with one VM doing a parallel kernel 
build and one VM doing sysbench,
limiting both VMs to run on 16 cpu threads (8 physical cores), with 8 vcpu for 
each VM.
Making the fix did improve kernel build time by 7%.

I'm gonna agree with the patch below, but just wonder if the testing
result is consistent,
as I didn't see any improvement in my testing environment.

IIUC, from the code behavior, especially for 2 VMs case(only 2
different cookies), the
per-rq rb tree unlikely has nodes with different cookies, that is, all
the nodes on this
tree should have the same cookie, so:
- if the parameter cookie is equal to the rb tree cookie, we meet a
match and go the
third branch
- else, no matter we go left or right, we can't find a match, and
we'll return idle thread
finally.

Please correct me if I was wrong.

Thanks,
-Aubrey

This is searching in the per core rb tree (rq->core_tree) which can have
2 different cookies. But having said that, even I didn't see any
improvement with the patch for my DB test case. But logically it is
correct.



Re: [RFC PATCH v2 11/17] sched: Basic tracking of matching tasks

2019-05-08 Thread Subhra Mazumdar



On 5/8/19 11:19 AM, Subhra Mazumdar wrote:


On 5/8/19 8:49 AM, Aubrey Li wrote:
Pawan ran an experiment setting up 2 VMs, with one VM doing a 
parallel kernel build and one VM doing sysbench,
limiting both VMs to run on 16 cpu threads (8 physical cores), with 
8 vcpu for each VM.

Making the fix did improve kernel build time by 7%.

I'm gonna agree with the patch below, but just wonder if the testing
result is consistent,
as I didn't see any improvement in my testing environment.

IIUC, from the code behavior, especially for 2 VMs case(only 2
different cookies), the
per-rq rb tree unlikely has nodes with different cookies, that is, all
the nodes on this
tree should have the same cookie, so:
- if the parameter cookie is equal to the rb tree cookie, we meet a
match and go the
third branch
- else, no matter we go left or right, we can't find a match, and
we'll return idle thread
finally.

Please correct me if I was wrong.

Thanks,
-Aubrey

This is searching in the per core rb tree (rq->core_tree) which can have
2 different cookies. But having said that, even I didn't see any
improvement with the patch for my DB test case. But logically it is
correct.


Ah, my bad. It is per rq. But still can have 2 different cookies. Not sure
why you think it is unlikely?


Re: [RFC PATCH v2 11/17] sched: Basic tracking of matching tasks

2019-05-08 Thread Subhra Mazumdar



On 5/8/19 5:01 PM, Aubrey Li wrote:

On Thu, May 9, 2019 at 2:41 AM Subhra Mazumdar
 wrote:


On 5/8/19 11:19 AM, Subhra Mazumdar wrote:

On 5/8/19 8:49 AM, Aubrey Li wrote:

Pawan ran an experiment setting up 2 VMs, with one VM doing a
parallel kernel build and one VM doing sysbench,
limiting both VMs to run on 16 cpu threads (8 physical cores), with
8 vcpu for each VM.
Making the fix did improve kernel build time by 7%.

I'm gonna agree with the patch below, but just wonder if the testing
result is consistent,
as I didn't see any improvement in my testing environment.

IIUC, from the code behavior, especially for 2 VMs case(only 2
different cookies), the
per-rq rb tree unlikely has nodes with different cookies, that is, all
the nodes on this
tree should have the same cookie, so:
- if the parameter cookie is equal to the rb tree cookie, we meet a
match and go the
third branch
- else, no matter we go left or right, we can't find a match, and
we'll return idle thread
finally.

Please correct me if I was wrong.

Thanks,
-Aubrey

This is searching in the per core rb tree (rq->core_tree) which can have
2 different cookies. But having said that, even I didn't see any
improvement with the patch for my DB test case. But logically it is
correct.


Ah, my bad. It is per rq. But still can have 2 different cookies. Not sure
why you think it is unlikely?

Yeah, I meant 2 different cookies on the system, but unlikely 2
different cookies
on one same rq.

If I read the source correctly, for the sched_core_balance path, when try to
steal cookie from another CPU, sched_core_find() uses dst's cookie to search
if there is a cookie match in src's rq, and sched_core_find() returns idle or
matched task, and later put this matched task onto dst's rq (activate_task() in
sched_core_find()). At this moment, the nodes on the rq's rb tree should have
same cookies.

Thanks,
-Aubrey

Yes, but sched_core_find is also called from pick_task to find a local
matching task. The enqueue side logic of the scheduler is unchanged with
core scheduling, so it is possible tasks with different cookies are
enqueued on the same rq. So while searching for a matching task locally
doing it correctly should matter.
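For reference, the cookie search being discussed is roughly the following
(paraphrased from memory of the series; exact field and helper names may
differ from the actual patch). It walks the per-rq rb tree ordered by cookie
and falls back to the idle task when no node matches, which is the behavior
Aubrey described:

static struct task_struct *sched_core_find(struct rq *rq, unsigned long cookie)
{
	struct rb_node *node = rq->core_tree.rb_node;
	struct task_struct *node_task, *match;

	/* default to the idle task if no cookie match is found */
	match = idle_sched_class.pick_task(rq);

	while (node) {
		node_task = container_of(node, struct task_struct, core_node);

		if (cookie < node_task->core_cookie) {
			node = node->rb_left;
		} else if (cookie > node_task->core_cookie) {
			node = node->rb_right;
		} else {
			match = node_task;	/* the "third branch": cookie match */
			break;
		}
	}

	return match;
}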


Re: [RFC PATCH v2 11/17] sched: Basic tracking of matching tasks

2019-05-08 Thread Subhra Mazumdar



On 5/8/19 6:38 PM, Aubrey Li wrote:

On Thu, May 9, 2019 at 8:29 AM Subhra Mazumdar
 wrote:


On 5/8/19 5:01 PM, Aubrey Li wrote:

On Thu, May 9, 2019 at 2:41 AM Subhra Mazumdar
 wrote:

On 5/8/19 11:19 AM, Subhra Mazumdar wrote:

On 5/8/19 8:49 AM, Aubrey Li wrote:

Pawan ran an experiment setting up 2 VMs, with one VM doing a
parallel kernel build and one VM doing sysbench,
limiting both VMs to run on 16 cpu threads (8 physical cores), with
8 vcpu for each VM.
Making the fix did improve kernel build time by 7%.

I'm gonna agree with the patch below, but just wonder if the testing
result is consistent,
as I didn't see any improvement in my testing environment.

IIUC, from the code behavior, especially for 2 VMs case(only 2
different cookies), the
per-rq rb tree unlikely has nodes with different cookies, that is, all
the nodes on this
tree should have the same cookie, so:
- if the parameter cookie is equal to the rb tree cookie, we meet a
match and go the
third branch
- else, no matter we go left or right, we can't find a match, and
we'll return idle thread
finally.

Please correct me if I was wrong.

Thanks,
-Aubrey

This is searching in the per core rb tree (rq->core_tree) which can have
2 different cookies. But having said that, even I didn't see any
improvement with the patch for my DB test case. But logically it is
correct.


Ah, my bad. It is per rq. But still can have 2 different cookies. Not sure
why you think it is unlikely?

Yeah, I meant 2 different cookies on the system, but unlikely 2
different cookies
on one same rq.

If I read the source correctly, for the sched_core_balance path, when try to
steal cookie from another CPU, sched_core_find() uses dst's cookie to search
if there is a cookie match in src's rq, and sched_core_find() returns idle or
matched task, and later put this matched task onto dst's rq (activate_task() in
sched_core_find()). At this moment, the nodes on the rq's rb tree should have
same cookies.

Thanks,
-Aubrey

Yes, but sched_core_find is also called from pick_task to find a local
matching task.

Can a local searching introduce a different cookies? Where is it from?

No. I meant the local search uses the same binary search of sched_core_find
so it has to be correct.



The enqueue side logic of the scheduler is unchanged with
core scheduling,

But only the task with cookies is placed onto this rb tree?


so it is possible tasks with different cookies are
enqueued on the same rq. So while searching for a matching task locally
doing it correctly should matter.

May I know how exactly?

select_task_rq_* seems to be unchanged. So the search logic to find a cpu
to enqueue when a task becomes runnable is same as before and doesn't do
any kind of cookie matching.


Thanks,
-Aubrey


Re: [RFC PATCH v2 11/17] sched: Basic tracking of matching tasks

2019-05-09 Thread Subhra Mazumdar




select_task_rq_* seems to be unchanged. So the search logic to find a cpu
to enqueue when a task becomes runnable is same as before and doesn't do
any kind of cookie matching.

Okay, that's true in task wakeup path, and also load_balance seems to pull task
without checking cookie too. But my system is not over loaded when I tested this
patch, so there is none or only one task in rq and on the rq's rb
tree, so this patch
does not make a difference.

I had the same hypothesis for my tests.


The question is, should we do cookie checking for task selecting CPU and load
balance CPU pulling task?

The basic issue is keeping the CPUs busy. In the case of an overloaded system,
the trivial new idle balancer should be able to find a matching task
in case of forced idle. More problematic is the lower load scenario when
there aren't any matching tasks to be found but there are runnable tasks of
other groups. Also, the wakeup code path tries to balance threads across cores
(select_idle_core) first, which is the opposite of what core scheduling wants.
I will re-run my tests with select_idle_core disabled, but the issue is that
on x86 Intel systems (my test rig) the CPU ids are interleaved across cores,
so even select_idle_cpu will balance across cores first. Maybe others have
some better ideas?


Thanks,
-Aubrey


Re: [RFC][PATCH 03/16] sched: Wrap rq::lock access

2019-04-04 Thread Subhra Mazumdar




We tried to comment those lines and it doesn’t seem to get rid of the
performance regression we are seeing.
Can you elaborate a bit more about the test you are performing, what kind of
resources it uses ?

I am running 1 and 2 Oracle DB instances, each running a TPC-C workload. The
clients driving the instances also run on the same node. Each server-client
pair is put in its own cpu cgroup and tagged.

Can you also try to replicate our test and see if you see the same problem ?

cgcreate -g cpu,cpuset:set1
cat /sys/devices/system/cpu/cpu{0,2,4,6}/topology/thread_siblings_list
0,36
2,38
4,40
6,42

echo "0,2,4,6,36,38,40,42" | sudo tee /sys/fs/cgroup/cpuset/set1/cpuset.cpus
echo 0 | sudo tee /sys/fs/cgroup/cpuset/set1/cpuset.mems

echo 1 | sudo tee /sys/fs/cgroup/cpu,cpuacct/set1/cpu.tag

sysbench --test=fileio prepare
cgexec -g cpu,cpuset:set1 sysbench --threads=4 --test=fileio \
--file-test-mode=seqwr run

The reason we create a cpuset is to narrow down the investigation to just 4
cores on a highly powerful machine. It might not be needed if testing on a
smaller machine.

With this sysbench test I am not seeing any improvement from removing the
condition. Also, with hackbench I found it makes no difference, but that has a
much lower regression to begin with (18%).


Julien


Re: [RFC V2 2/2] sched/fair: Fallback to sched-idle CPU if idle CPU isn't found

2019-05-14 Thread Subhra Mazumdar



On 5/14/19 9:03 AM, Steven Sistare wrote:

On 5/13/2019 7:35 AM, Peter Zijlstra wrote:

On Mon, May 13, 2019 at 03:04:18PM +0530, Viresh Kumar wrote:

On 10-05-19, 09:21, Peter Zijlstra wrote:

I don't hate his per se; but the whole select_idle_sibling() thing is
something that needs looking at.

There was the task stealing thing from Steve that looked interesting and
that would render your apporach unfeasible.

I am surely missing something as I don't see how that patchset will
make this patchset perform badly, than what it already does.

Nah; I just misremembered. I know Oracle has a patch set poking at
select_idle_siblings() _somewhere_ (as do I), and I just found the wrong
one.

Basically everybody is complaining select_idle_sibling() is too
expensive for checking the entire LLC domain, except for FB (and thus
likely some other workloads too) that depend on it to kill their tail
latency.

But I suppose we could still do this, even if we scan only a subset of
the LLC, just keep track of the last !idle CPU running only SCHED_IDLE
tasks and pick that if you do not (in your limited scan) find a better
candidate.

Subhra posted a patch that incrementally searches for an idle CPU in the LLC,
remembering the last CPU examined, and searching a fixed number of CPUs from 
there.
That technique is compatible with the one that Viresh suggests; the incremental
search would stop if a SCHED_IDLE cpu was found.

This was the last version of the patchset I sent:
https://lkml.org/lkml/2018/6/28/810

Also, select_idle_core is a net negative for certain workloads like OLTP, so I
had put it behind a SCHED_FEAT to be able to disable it.
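Something like the following (untested sketch) is what combining the two could
look like inside the select_idle_cpu() scan loop of that series;
sched_idle_cpu() here is the helper from Viresh's patch and next_cpu is the
per-cpu boundary from my patchset:

	int best_sched_idle = -1;

	for_each_cpu_wrap(cpu, sched_domain_span(sd), target_tmp) {
		per_cpu(next_cpu, target) = cpu;	/* remember scan boundary */
		if (!--nr)
			break;				/* limited scan exhausted */
		if (!cpumask_test_cpu(cpu, &p->cpus_allowed))
			continue;
		if (available_idle_cpu(cpu))
			return cpu;			/* fully idle CPU wins */
		if (best_sched_idle == -1 && sched_idle_cpu(cpu))
			best_sched_idle = cpu;		/* remember, keep looking */
	}

	/* no idle CPU inside the window; settle for a sched-idle one */
	if (best_sched_idle != -1)
		return best_sched_idle;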

Thanks,
Subhra


I also fiddled with select_idle_sibling, maintaining a per-LLC bitmap of idle 
CPUs,
updated with atomic operations.  Performance was basically unchanged for the
workloads I tested, and I inserted timers around the idle search showing it was
a very small fraction of time both before and after my changes.  That led me to
ignore the push side and optimize the pull side with task stealing.

I would be very interested in hearing from folks that have workloads that 
demonstrate
that select_idle_sibling is too expensive.

- Steve


Re: [RFC V2 2/2] sched/fair: Fallback to sched-idle CPU if idle CPU isn't found

2019-05-14 Thread Subhra Mazumdar



On 5/14/19 10:27 AM, Subhra Mazumdar wrote:


On 5/14/19 9:03 AM, Steven Sistare wrote:

On 5/13/2019 7:35 AM, Peter Zijlstra wrote:

On Mon, May 13, 2019 at 03:04:18PM +0530, Viresh Kumar wrote:

On 10-05-19, 09:21, Peter Zijlstra wrote:

I don't hate his per se; but the whole select_idle_sibling() thing is
something that needs looking at.

There was the task stealing thing from Steve that looked 
interesting and

that would render your apporach unfeasible.

I am surely missing something as I don't see how that patchset will
make this patchset perform badly, than what it already does.

Nah; I just misremembered. I know Oracle has a patch set poking at
select_idle_siblings() _somewhere_ (as do I), and I just found the 
wrong

one.

Basically everybody is complaining select_idle_sibling() is too
expensive for checking the entire LLC domain, except for FB (and thus
likely some other workloads too) that depend on it to kill their tail
latency.

But I suppose we could still do this, even if we scan only a subset of
the LLC, just keep track of the last !idle CPU running only SCHED_IDLE
tasks and pick that if you do not (in your limited scan) find a better
candidate.
Subhra posted a patch that incrementally searches for an idle CPU in 
the LLC,
remembering the last CPU examined, and searching a fixed number of 
CPUs from there.
That technique is compatible with the one that Viresh suggests; the 
incremental

search would stop if a SCHED_IDLE cpu was found.

This was the last version of patchset I sent:
https://lkml.org/lkml/2018/6/28/810

Also select_idle_core is a net -ve for certain workloads like OLTP. So I
had put a SCHED_FEAT to be able to disable it.

Forgot to add: the cpumask_weight computation may not be O(1) with a large
number of CPUs, so it needs to be precomputed in a per-cpu variable to further
optimize. That part is missing from the above patchset.


Thanks,
Subhra


I also fiddled with select_idle_sibling, maintaining a per-LLC bitmap 
of idle CPUs,
updated with atomic operations.  Performance was basically unchanged 
for the
workloads I tested, and I inserted timers around the idle search 
showing it was
a very small fraction of time both before and after my changes.  That 
led me to

ignore the push side and optimize the pull side with task stealing.

I would be very interested in hearing from folks that have workloads 
that demonstrate

that select_idle_sibling is too expensive.

- Steve


Re: [RFC][PATCH 03/16] sched: Wrap rq::lock access

2019-03-29 Thread Subhra Mazumdar



On 3/29/19 6:35 AM, Julien Desfossez wrote:

On Fri, Mar 22, 2019 at 8:09 PM Subhra Mazumdar 
wrote:

Is the core wide lock primarily responsible for the regression? I ran
upto patch
12 which also has the core wide lock for tagged cgroups and also calls
newidle_balance() from pick_next_task(). I don't see any regression.  Of
course
the core sched version of pick_next_task() may be doing more but
comparing with
the __pick_next_task() it doesn't look too horrible.

On further testing and investigation, we also agree that spinlock contention
is not the major cause for the regression, but we feel that it should be one
of the major contributing factors to this performance loss.



I finally did some code bisection and found the following lines are
basically responsible for the regression. Commenting them out, I don't see
the regression. Can you confirm? I have yet to figure out whether this is
needed for the correctness of core scheduling and, if so, whether we can do
this better?

>8-

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index fe3918c..3b3388a 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3741,8 +3741,8 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 				 * If there weren't no cookies; we don't need
 				 * to bother with the other siblings.
 				 */
-				if (i == cpu && !rq->core->core_cookie)
-					goto next_class;
+				//if (i == cpu && !rq->core->core_cookie)
+				//	goto next_class;
 
 				continue;
 			}


Re: [RFC][PATCH 13/16] sched: Add core wide task selection and scheduling.

2019-04-19 Thread Subhra Mazumdar



On 4/19/19 1:40 AM, Ingo Molnar wrote:

* Subhra Mazumdar  wrote:


I see similar improvement with this patch as removing the condition I
earlier mentioned. So that's not needed. I also included the patch for the
priority fix. For 2 DB instances, HT disabling stands at -22% for 32 users
(from earlier emails).


1 DB instance

users  baseline   %idle    core_sched %idle
16 1  84   -4.9% 84
24 1  76   -6.7% 75
32 1  69   -2.4% 69

2 DB instance

users  baseline   %idle    core_sched %idle
16 1  66   -19.5% 69
24 1  54   -9.8% 57
32 1  42   -27.2%    48

So HT disabling slows down the 2DB instance by -22%, while core-sched
slows it down by -27.2%?

Would it be possible to see all the results in two larger tables (1 DB
instance and 2 DB instance) so that we can compare the performance of the
3 kernel variants with each other:

  - "vanilla +HT": Hyperthreading enabled,  vanilla scheduler
  - "vanilla -HT": Hyperthreading disabled, vanilla scheduler
  - "core_sched":  Hyperthreading enabled,  core-scheduling enabled

?

Thanks,

Ingo

Following are the numbers. Disabling HT gives an improvement in some cases.

1 DB instance

users  vanilla+HT   core_sched vanilla-HT
16 1    -4.9% -11.7%
24 1    -6.7% +13.7%
32 1    -2.4% +8%

2 DB instance

users  vanilla+HT   core_sched vanilla-HT
16 1    -19.5% +5.6%
24 1    -9.8% +3.5%
32 1    -27.2%    -22.8%


Re: [RFC PATCH v2 00/17] Core scheduling v2

2019-04-26 Thread Subhra Mazumdar



On 4/26/19 3:43 AM, Mel Gorman wrote:

On Fri, Apr 26, 2019 at 10:42:22AM +0200, Ingo Molnar wrote:

It should, but it's not perfect. For example, wake_affine_idle does not
take sibling activity into account even though select_idle_sibling *may*
take it into account. Even select_idle_sibling in its fast path may use
an SMT sibling instead of searching.

There are also potential side-effects with cpuidle. Some workloads
migrate around the socket as they are communicating because of how the
search for an idle CPU works. With SMT on, there is potentially a longer
opportunity for a core to reach a deep c-state and incur a bigger wakeup
latency. This is a very weak theory but I've seen cases where latency
sensitive workloads with only two communicating tasks are affected by
CPUs reaching low c-states due to migrations.


Clearly it doesn't.


It's more that it's best effort to wakeup quickly instead of being perfect
by using an expensive search every time.

Yeah, but your numbers suggest that for *most* not heavily interacting
under-utilized CPU bound workloads we hurt in the 5-10% range compared to
no-SMT - more in some cases.


Indeed, it was higher than expected and we can't even use the excuse that
more resources are available to a single logical CPU as the scheduler is
meant to keep them apart.


So we avoid a maybe 0.1% scheduler placement overhead but inflict 5-10%
harm on the workload, and also blow up stddev by randomly co-scheduling
two tasks on the same physical core? Not a good trade-off.

I really think we should implement a relatively strict physical core
placement policy in the under-utilized case, and resist any attempts to
weaken this for special workloads that ping-pong quickly and benefit from
sharing the same physical core.


It's worth a shot at least. Changes should mostly be in the wake_affine
path for most loads of interest.

Doesn't select_idle_sibling already try to do that by calling
select_idle_core? For our OLTP workload we in fact found the cost of
select_idle_core was actually hurting more than it helped in finding a fully
idle core, so a net negative.



Re: [PATCH v3 5/7] sched: SIS_CORE to disable idle core search

2019-07-13 Thread Subhra Mazumdar



On 7/4/19 6:04 PM, Parth Shah wrote:

Same experiment with hackbench and with perf analysis shows increase in L1
cache miss rate with these patches
(Lower is better)
                            Baseline(%)   Patch(%)
                            -----------   ---------
    Total Cache miss rate   17.01         19 (-11%)
    L1 icache miss rate     5.45          6.7 (-22%)



So is it possible for the idle_cpu search to try checking target_cpu first and
then go to the sliding window if it is not found?
The below diff works as expected on an IBM POWER9 system and resolves the
problem of far wakeups to a large extent.

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ff2e9b5c3ac5..fae035ce1162 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6161,6 +6161,7 @@ static int select_idle_cpu(struct task_struct *p, struct 
sched_domain *sd, int t
 u64 time, cost;
 s64 delta;
 int cpu, limit, floor, target_tmp, nr = INT_MAX;
+   struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
  
 this_sd = rcu_dereference(*this_cpu_ptr(&sd_llc));

 if (!this_sd)
@@ -6198,16 +6199,22 @@ static int select_idle_cpu(struct task_struct *p, 
struct sched_domain *sd, int t
  
 time = local_clock();
  
-   for_each_cpu_wrap(cpu, sched_domain_span(sd), target_tmp) {

+   cpumask_and(cpus, sched_domain_span(sd), &p->cpus_allowed);
+   for_each_cpu_wrap(cpu, cpu_smt_mask(target), target) {
+   __cpumask_clear_cpu(cpu, cpus);
+   if (available_idle_cpu(cpu))
+   goto idle_cpu_exit;
+   }
+
+   for_each_cpu_wrap(cpu, cpus, target_tmp) {
 per_cpu(next_cpu, target) = cpu;
 if (!--nr)
 return -1;
-   if (!cpumask_test_cpu(cpu, &p->cpus_allowed))
-   continue;
 if (available_idle_cpu(cpu))
 break;
 }
  
+idle_cpu_exit:

 time = local_clock() - time;
 cost = this_sd->avg_scan_cost;
 delta = (s64)(time - cost) / 8;



Best,
Parth

How about calling select_idle_smt before select_idle_cpu from
select_idle_sibling? That should have the same effect.
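i.e. something along these lines in select_idle_sibling() (untested sketch on
top of this series; select_idle_smt() is the existing helper in fair.c, just
moved ahead of the windowed LLC scan):

	if (sched_feat(SIS_CORE)) {
		i = select_idle_core(p, sd, target);
		if ((unsigned)i < nr_cpumask_bits)
			return i;
	}

	/* probe the target's SMT siblings before the windowed LLC scan */
	i = select_idle_smt(p, sd, target);
	if ((unsigned)i < nr_cpumask_bits)
		return i;

	i = select_idle_cpu(p, sd, target);
	if ((unsigned)i < nr_cpumask_bits)
		return i;

	return target;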


[RFC PATCH 3/3] sched: introduce tunables to control soft affinity

2019-06-26 Thread subhra mazumdar
For different workloads the optimal "softness" of soft affinity can be
different. Introduce tunables sched_allowed and sched_preferred that can
be tuned via /proc. This allows choosing at what utilization difference
the scheduler will choose cpus_allowed over cpus_preferred in the first
level of search. Depending on the extent of data sharing, the cache coherency
overhead of the system, etc., the optimal point may vary.

Signed-off-by: subhra mazumdar 
---
 include/linux/sched/sysctl.h |  2 ++
 kernel/sched/fair.c  | 19 ++-
 kernel/sched/sched.h |  2 ++
 kernel/sysctl.c  | 14 ++
 4 files changed, 36 insertions(+), 1 deletion(-)

diff --git a/include/linux/sched/sysctl.h b/include/linux/sched/sysctl.h
index 99ce6d7..0e75602 100644
--- a/include/linux/sched/sysctl.h
+++ b/include/linux/sched/sysctl.h
@@ -41,6 +41,8 @@ extern unsigned int sysctl_numa_balancing_scan_size;
 #ifdef CONFIG_SCHED_DEBUG
 extern __read_mostly unsigned int sysctl_sched_migration_cost;
 extern __read_mostly unsigned int sysctl_sched_nr_migrate;
+extern __read_mostly unsigned int sysctl_sched_preferred;
+extern __read_mostly unsigned int sysctl_sched_allowed;
 
 int sched_proc_update_handler(struct ctl_table *table, int write,
void __user *buffer, size_t *length,
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 53aa7f2..d222d78 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -85,6 +85,8 @@ unsigned int sysctl_sched_wakeup_granularity  
= 100UL;
 static unsigned int normalized_sysctl_sched_wakeup_granularity = 100UL;
 
 const_debug unsigned int sysctl_sched_migration_cost   = 50UL;
+const_debug unsigned int sysctl_sched_preferred= 1UL;
+const_debug unsigned int sysctl_sched_allowed  = 100UL;
 
 #ifdef CONFIG_SMP
 /*
@@ -6739,7 +6741,22 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, 
int sd_flag, int wake_f
int new_cpu = prev_cpu;
int want_affine = 0;
int sync = (wake_flags & WF_SYNC) && !(current->flags & PF_EXITING);
-   struct cpumask *cpus = &p->cpus_preferred;
+   int cpux, cpuy;
+   struct cpumask *cpus;
+
+   if (!p->affinity_unequal) {
+   cpus = &p->cpus_allowed;
+   } else {
+   cpux = cpumask_any(&p->cpus_preferred);
+   cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
+   cpumask_andnot(cpus, &p->cpus_allowed, &p->cpus_preferred);
+   cpuy = cpumask_any(cpus);
+   if (sysctl_sched_preferred * cpu_rq(cpux)->cfs.avg.util_avg >
+   sysctl_sched_allowed * cpu_rq(cpuy)->cfs.avg.util_avg)
+   cpus = &p->cpus_allowed;
+   else
+   cpus = &p->cpus_preferred;
+   }
 
if (sd_flag & SD_BALANCE_WAKE) {
record_wakee(p);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index b52ed1a..f856bdb 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1863,6 +1863,8 @@ extern void check_preempt_curr(struct rq *rq, struct 
task_struct *p, int flags);
 
 extern const_debug unsigned int sysctl_sched_nr_migrate;
 extern const_debug unsigned int sysctl_sched_migration_cost;
+extern const_debug unsigned int sysctl_sched_preferred;
+extern const_debug unsigned int sysctl_sched_allowed;
 
 #ifdef CONFIG_SCHED_HRTICK
 
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 7d1008b..bdffb48 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -383,6 +383,20 @@ static struct ctl_table kern_table[] = {
.mode   = 0644,
.proc_handler   = proc_dointvec,
},
+   {
+   .procname   = "sched_preferred",
+   .data   = &sysctl_sched_preferred,
+   .maxlen = sizeof(unsigned int),
+   .mode   = 0644,
+   .proc_handler   = proc_dointvec,
+   },
+   {
+   .procname   = "sched_allowed",
+   .data   = &sysctl_sched_allowed,
+   .maxlen = sizeof(unsigned int),
+   .mode   = 0644,
+   .proc_handler   = proc_dointvec,
+   },
 #ifdef CONFIG_SCHEDSTATS
{
.procname   = "sched_schedstats",
-- 
2.9.3



[RFC PATCH 0/3] Scheduler Soft Affinity

2019-06-26 Thread subhra mazumdar
ent optimal setting.

(5:4)
Hackbench   %gain with soft affinity
2*4         1.43
2*8         1.36
2*16        1.01
2*32        1.45
1*4         -2.55
1*8         -5.06
1*16        -8
1*32        -7.32
   
DB          %gain with soft affinity
2*16        0.46
2*24        3.68
2*32        -3.34
1*16        0.08
1*24        1.6
1*32        -1.29

Finally, I measured the overhead of soft affinity when it is NOT used, by
comparing it against the baseline kernel for the no-affinity and hard-affinity
cases with Hackbench. The following is the improvement of the soft affinity
kernel w.r.t. the baseline, but the numbers are really within the noise margin.
This shows soft affinity has no overhead when not used.

Hackbench   %diff of no affinity   %diff of hard affinity
2*4         0.11                   0.31
2*8         0.13                   0.55
2*16        0.61                   0.90
2*32        0.86                   1.01
1*4         0.48                   0.43
1*8         0.45                   0.33
1*16        0.61                   0.64
1*32        0.11                   0.63

A final set of experiments was done (numbers not shown) with the memory
of each DB instance spread evenly across both NUMA nodes. This showed
similar improvements with soft affinity for the 2-instance case, indicating
the improvement is due to saving LLC coherence overhead.

subhra mazumdar (3):
  sched: Introduce new interface for scheduler soft affinity
  sched: change scheduler to give preference to soft affinity CPUs
  sched: introduce tunables to control soft affinity

 arch/x86/entry/syscalls/syscall_64.tbl |   1 +
 include/linux/sched.h  |   5 +-
 include/linux/sched/sysctl.h   |   2 +
 include/linux/syscalls.h   |   3 +
 include/uapi/asm-generic/unistd.h  |   4 +-
 include/uapi/linux/sched.h |   3 +
 init/init_task.c   |   2 +
 kernel/compat.c|   2 +-
 kernel/rcu/tree_plugin.h   |   3 +-
 kernel/sched/core.c| 167 -
 kernel/sched/fair.c| 154 ++
 kernel/sched/sched.h   |   2 +
 kernel/sysctl.c|  14 +++
 13 files changed, 297 insertions(+), 65 deletions(-)

-- 
2.9.3



[RFC PATCH 1/3] sched: Introduce new interface for scheduler soft affinity

2019-06-26 Thread subhra mazumdar
A new system call sched_setaffinity2 is introduced for scheduler soft
affinity. It takes an extra parameter to specify hard or soft affinity,
where hard implies the same as the existing sched_setaffinity. A new cpumask
cpus_preferred is introduced for this purpose, which is always a subset of
cpus_allowed. A boolean affinity_unequal is used to record whether they are
unequal, for fast lookup. Setting hard affinity resets the soft affinity set to
be equal to it. Soft affinity is only allowed for CFS class threads.
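A hypothetical userspace usage sketch (not part of the patch; syscall number
434 on x86-64 as wired up below, SCHED_SOFT_AFFINITY from
include/uapi/linux/sched.h, and it obviously needs a kernel with this series):

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/syscall.h>

#ifndef __NR_sched_setaffinity2
#define __NR_sched_setaffinity2	434
#endif
#define SCHED_SOFT_AFFINITY	1

int main(void)
{
	cpu_set_t mask;

	CPU_ZERO(&mask);
	CPU_SET(0, &mask);		/* prefer CPUs 0 and 2 ... */
	CPU_SET(2, &mask);		/* ... but allow spill to cpus_allowed */

	if (syscall(__NR_sched_setaffinity2, getpid(), sizeof(mask),
		    &mask, SCHED_SOFT_AFFINITY)) {
		perror("sched_setaffinity2");
		return 1;
	}
	return 0;
}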

Signed-off-by: subhra mazumdar 
---
 arch/x86/entry/syscalls/syscall_64.tbl |   1 +
 include/linux/sched.h  |   5 +-
 include/linux/syscalls.h   |   3 +
 include/uapi/asm-generic/unistd.h  |   4 +-
 include/uapi/linux/sched.h |   3 +
 init/init_task.c   |   2 +
 kernel/compat.c|   2 +-
 kernel/rcu/tree_plugin.h   |   3 +-
 kernel/sched/core.c| 167 -
 9 files changed, 162 insertions(+), 28 deletions(-)

diff --git a/arch/x86/entry/syscalls/syscall_64.tbl 
b/arch/x86/entry/syscalls/syscall_64.tbl
index b4e6f9e..1dccdd2 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -355,6 +355,7 @@
 431common  fsconfig__x64_sys_fsconfig
 432common  fsmount __x64_sys_fsmount
 433common  fspick  __x64_sys_fspick
+434common  sched_setaffinity2  __x64_sys_sched_setaffinity2
 
 #
 # x32-specific system call numbers start at 512 to avoid cache impact
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 1183741..b863fa8 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -652,6 +652,8 @@ struct task_struct {
unsigned intpolicy;
int nr_cpus_allowed;
cpumask_t   cpus_allowed;
+   cpumask_t   cpus_preferred;
+   boolaffinity_unequal;
 
 #ifdef CONFIG_PREEMPT_RCU
int rcu_read_lock_nesting;
@@ -1784,7 +1786,8 @@ static inline void set_task_cpu(struct task_struct *p, 
unsigned int cpu)
 # define vcpu_is_preempted(cpu)false
 #endif
 
-extern long sched_setaffinity(pid_t pid, const struct cpumask *new_mask);
+extern long sched_setaffinity(pid_t pid, const struct cpumask *new_mask,
+ int flags);
 extern long sched_getaffinity(pid_t pid, struct cpumask *mask);
 
 #ifndef TASK_SIZE_OF
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index e2870fe..147a4e5 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -669,6 +669,9 @@ asmlinkage long sys_sched_rr_get_interval(pid_t pid,
struct __kernel_timespec __user *interval);
 asmlinkage long sys_sched_rr_get_interval_time32(pid_t pid,
 struct old_timespec32 __user 
*interval);
+asmlinkage long sys_sched_setaffinity2(pid_t pid, unsigned int len,
+  unsigned long __user *user_mask_ptr,
+  int flags);
 
 /* kernel/signal.c */
 asmlinkage long sys_restart_syscall(void);
diff --git a/include/uapi/asm-generic/unistd.h 
b/include/uapi/asm-generic/unistd.h
index a87904d..d77b366 100644
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -844,9 +844,11 @@ __SYSCALL(__NR_fsconfig, sys_fsconfig)
 __SYSCALL(__NR_fsmount, sys_fsmount)
 #define __NR_fspick 433
 __SYSCALL(__NR_fspick, sys_fspick)
+#define __NR_sched_setaffinity2 434
+__SYSCALL(__NR_sched_setaffinity2, sys_sched_setaffinity2)
 
 #undef __NR_syscalls
-#define __NR_syscalls 434
+#define __NR_syscalls 435
 
 /*
  * 32 bit systems traditionally used different
diff --git a/include/uapi/linux/sched.h b/include/uapi/linux/sched.h
index ed4ee17..f910cd5 100644
--- a/include/uapi/linux/sched.h
+++ b/include/uapi/linux/sched.h
@@ -52,6 +52,9 @@
 #define SCHED_FLAG_RECLAIM 0x02
 #define SCHED_FLAG_DL_OVERRUN  0x04
 
+#define SCHED_HARD_AFFINITY0
+#define SCHED_SOFT_AFFINITY1
+
 #define SCHED_FLAG_ALL (SCHED_FLAG_RESET_ON_FORK   | \
 SCHED_FLAG_RECLAIM | \
 SCHED_FLAG_DL_OVERRUN)
diff --git a/init/init_task.c b/init/init_task.c
index c70ef65..aa226a3 100644
--- a/init/init_task.c
+++ b/init/init_task.c
@@ -73,6 +73,8 @@ struct task_struct init_task
.normal_prio= MAX_PRIO - 20,
.policy = SCHED_NORMAL,
.cpus_allowed   = CPU_MASK_ALL,
+   .cpus_preferred = CPU_MASK_ALL,
+   .affinity_unequal = false,
.nr_cpus_allowed= NR_CPUS,
.mm = NULL,
.active_mm  = &init_mm,
diff --git a/kernel/compat.c b/kernel/compat.c
index b5f7063..96621d7 100644
--- a/kernel/compat.c
+++ b/ke

[RFC PATCH 2/3] sched: change scheduler to give preference to soft affinity CPUs

2019-06-26 Thread subhra mazumdar
The soft affinity CPUs present in the cpumask cpus_preferred are used by the
scheduler in two levels of search. The first is in determining wake affine,
which chooses the LLC domain, and the second is while searching for idle CPUs
in the LLC domain. In the first level it uses cpus_preferred to prune the
search space. In the second level it first searches cpus_preferred and then
cpus_allowed. Using the affinity_unequal flag it breaks out early to avoid
any overhead in the scheduler fast path when soft affinity is not used.
This only changes the wakeup path of the scheduler; the idle balancing
is unchanged; together they achieve the "softness" of scheduling.

Signed-off-by: subhra mazumdar 
---
 kernel/sched/fair.c | 137 ++--
 1 file changed, 100 insertions(+), 37 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f35930f..53aa7f2 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5807,7 +5807,7 @@ static unsigned long capacity_spare_without(int cpu, 
struct task_struct *p)
  */
 static struct sched_group *
 find_idlest_group(struct sched_domain *sd, struct task_struct *p,
- int this_cpu, int sd_flag)
+ int this_cpu, int sd_flag, struct cpumask *cpus)
 {
struct sched_group *idlest = NULL, *group = sd->groups;
struct sched_group *most_spare_sg = NULL;
@@ -5831,7 +5831,7 @@ find_idlest_group(struct sched_domain *sd, struct 
task_struct *p,
 
/* Skip over this group if it has no CPUs allowed */
if (!cpumask_intersects(sched_group_span(group),
-   &p->cpus_allowed))
+   cpus))
continue;
 
local_group = cpumask_test_cpu(this_cpu,
@@ -5949,7 +5949,8 @@ find_idlest_group(struct sched_domain *sd, struct 
task_struct *p,
  * find_idlest_group_cpu - find the idlest CPU among the CPUs in the group.
  */
 static int
-find_idlest_group_cpu(struct sched_group *group, struct task_struct *p, int 
this_cpu)
+find_idlest_group_cpu(struct sched_group *group, struct task_struct *p,
+ int this_cpu, struct cpumask *cpus)
 {
unsigned long load, min_load = ULONG_MAX;
unsigned int min_exit_latency = UINT_MAX;
@@ -5963,7 +5964,7 @@ find_idlest_group_cpu(struct sched_group *group, struct 
task_struct *p, int this
return cpumask_first(sched_group_span(group));
 
/* Traverse only the allowed CPUs */
-   for_each_cpu_and(i, sched_group_span(group), &p->cpus_allowed) {
+   for_each_cpu_and(i, sched_group_span(group), cpus) {
if (available_idle_cpu(i)) {
struct rq *rq = cpu_rq(i);
struct cpuidle_state *idle = idle_get_state(rq);
@@ -5999,7 +6000,8 @@ find_idlest_group_cpu(struct sched_group *group, struct 
task_struct *p, int this
 }
 
 static inline int find_idlest_cpu(struct sched_domain *sd, struct task_struct 
*p,
- int cpu, int prev_cpu, int sd_flag)
+ int cpu, int prev_cpu, int sd_flag,
+ struct cpumask *cpus)
 {
int new_cpu = cpu;
 
@@ -6023,13 +6025,14 @@ static inline int find_idlest_cpu(struct sched_domain 
*sd, struct task_struct *p
continue;
}
 
-   group = find_idlest_group(sd, p, cpu, sd_flag);
+   group = find_idlest_group(sd, p, cpu, sd_flag, cpus);
+
if (!group) {
sd = sd->child;
continue;
}
 
-   new_cpu = find_idlest_group_cpu(group, p, cpu);
+   new_cpu = find_idlest_group_cpu(group, p, cpu, cpus);
if (new_cpu == cpu) {
/* Now try balancing at a lower domain level of 'cpu': 
*/
sd = sd->child;
@@ -6104,6 +6107,27 @@ void __update_idle_core(struct rq *rq)
rcu_read_unlock();
 }
 
+static inline int
+scan_cpu_mask_for_idle_cores(struct cpumask *cpus, int target)
+{
+   int core, cpu;
+
+   for_each_cpu_wrap(core, cpus, target) {
+   bool idle = true;
+
+   for_each_cpu(cpu, cpu_smt_mask(core)) {
+   cpumask_clear_cpu(cpu, cpus);
+   if (!idle_cpu(cpu))
+   idle = false;
+   }
+
+   if (idle)
+   return core;
+   }
+
+   return -1;
+}
+
 /*
  * Scan the entire LLC domain for idle cores; this dynamically switches off if
  * there are no idle cores left in the system; tracked through
@@ -6112,7 +6136,7 @@ void __update_idle_core(struct rq *rq)
 static int select_idle_core(struct task_struct *p, struct sched_domain *sd, 
int target)
 {
struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mas

[PATCH v3 2/7] sched: introduce per-cpu var next_cpu to track search limit

2019-06-26 Thread subhra mazumdar
Introduce a per-cpu variable to track the limit up to which the idle cpu search
was done in select_idle_cpu(). This will help to start the search next time
from there. This is necessary for rotating the search window over the entire
LLC domain.

Signed-off-by: subhra mazumdar 
---
 kernel/sched/core.c  | 2 ++
 kernel/sched/sched.h | 1 +
 2 files changed, 3 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 874c427..80657fc 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -24,6 +24,7 @@
 #include 
 
 DEFINE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues);
+DEFINE_PER_CPU_SHARED_ALIGNED(int, next_cpu);
 
 #if defined(CONFIG_SCHED_DEBUG) && defined(CONFIG_JUMP_LABEL)
 /*
@@ -5966,6 +5967,7 @@ void __init sched_init(void)
for_each_possible_cpu(i) {
struct rq *rq;
 
+   per_cpu(next_cpu, i) = -1;
rq = cpu_rq(i);
raw_spin_lock_init(&rq->lock);
rq->nr_running = 0;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index b52ed1a..4cecfa2 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -994,6 +994,7 @@ static inline void update_idle_core(struct rq *rq) { }
 #endif
 
 DECLARE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues);
+DECLARE_PER_CPU_SHARED_ALIGNED(int, next_cpu);
 
 #define cpu_rq(cpu)(&per_cpu(runqueues, (cpu)))
 #define this_rq()  this_cpu_ptr(&runqueues)
-- 
2.9.3



[PATCH v3 4/7] sched: add sched feature to disable idle core search

2019-06-26 Thread subhra mazumdar
Add a new sched feature SIS_CORE to have an option to disable idle core
search (select_idle_core).

Signed-off-by: subhra mazumdar 
---
 kernel/sched/features.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index 858589b..de4d506 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -57,6 +57,7 @@ SCHED_FEAT(TTWU_QUEUE, true)
  */
 SCHED_FEAT(SIS_AVG_CPU, false)
 SCHED_FEAT(SIS_PROP, true)
+SCHED_FEAT(SIS_CORE, true)
 
 /*
  * Issue a WARN when we do multiple update_rq_clock() calls
-- 
2.9.3



[PATCH v3 5/7] sched: SIS_CORE to disable idle core search

2019-06-26 Thread subhra mazumdar
Use SIS_CORE to disable idle core search. For some workloads
select_idle_core becomes a scalability bottleneck, removing it improves
throughput. Also there are workloads where disabling it can hurt latency,
so need to have an option.

Signed-off-by: subhra mazumdar 
---
 kernel/sched/fair.c | 8 +---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c1ca88e..6a74808 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6280,9 +6280,11 @@ static int select_idle_sibling(struct task_struct *p, 
int prev, int target)
if (!sd)
return target;
 
-   i = select_idle_core(p, sd, target);
-   if ((unsigned)i < nr_cpumask_bits)
-   return i;
+   if (sched_feat(SIS_CORE)) {
+   i = select_idle_core(p, sd, target);
+   if ((unsigned)i < nr_cpumask_bits)
+   return i;
+   }
 
i = select_idle_cpu(p, sd, target);
if ((unsigned)i < nr_cpumask_bits)
-- 
2.9.3



[PATCH v3 6/7] x86/smpboot: introduce per-cpu variable for HT siblings

2019-06-26 Thread subhra mazumdar
Introduce a per-cpu variable to keep the number of HT siblings of a cpu.
This will be used for quick lookup in select_idle_cpu to determine the
limits of search. This patch does it only for x86.

Signed-off-by: subhra mazumdar 
---
 arch/x86/include/asm/smp.h  |  1 +
 arch/x86/include/asm/topology.h |  1 +
 arch/x86/kernel/smpboot.c   | 17 -
 include/linux/topology.h|  4 
 4 files changed, 22 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/smp.h b/arch/x86/include/asm/smp.h
index da545df..1e90cbd 100644
--- a/arch/x86/include/asm/smp.h
+++ b/arch/x86/include/asm/smp.h
@@ -22,6 +22,7 @@ extern int smp_num_siblings;
 extern unsigned int num_processors;
 
 DECLARE_PER_CPU_READ_MOSTLY(cpumask_var_t, cpu_sibling_map);
+DECLARE_PER_CPU_READ_MOSTLY(unsigned int, cpumask_weight_sibling);
 DECLARE_PER_CPU_READ_MOSTLY(cpumask_var_t, cpu_core_map);
 /* cpus sharing the last level cache: */
 DECLARE_PER_CPU_READ_MOSTLY(cpumask_var_t, cpu_llc_shared_map);
diff --git a/arch/x86/include/asm/topology.h b/arch/x86/include/asm/topology.h
index 453cf38..dd19c71 100644
--- a/arch/x86/include/asm/topology.h
+++ b/arch/x86/include/asm/topology.h
@@ -111,6 +111,7 @@ extern const struct cpumask *cpu_coregroup_mask(int cpu);
 #ifdef CONFIG_SMP
 #define topology_core_cpumask(cpu) (per_cpu(cpu_core_map, cpu))
 #define topology_sibling_cpumask(cpu)  (per_cpu(cpu_sibling_map, cpu))
+#define topology_sibling_weight(cpu)   (per_cpu(cpumask_weight_sibling, cpu))
 
 extern unsigned int __max_logical_packages;
 #define topology_max_packages()(__max_logical_packages)
diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index 362dd89..20bf676 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -85,6 +85,10 @@
 DEFINE_PER_CPU_READ_MOSTLY(cpumask_var_t, cpu_sibling_map);
 EXPORT_PER_CPU_SYMBOL(cpu_sibling_map);
 
+/* representing number of HT siblings of each CPU */
+DEFINE_PER_CPU_READ_MOSTLY(unsigned int, cpumask_weight_sibling);
+EXPORT_PER_CPU_SYMBOL(cpumask_weight_sibling);
+
 /* representing HT and core siblings of each logical CPU */
 DEFINE_PER_CPU_READ_MOSTLY(cpumask_var_t, cpu_core_map);
 EXPORT_PER_CPU_SYMBOL(cpu_core_map);
@@ -520,6 +524,8 @@ void set_cpu_sibling_map(int cpu)
 
if (!has_mp) {
cpumask_set_cpu(cpu, topology_sibling_cpumask(cpu));
+   per_cpu(cpumask_weight_sibling, cpu) =
+   cpumask_weight(topology_sibling_cpumask(cpu));
cpumask_set_cpu(cpu, cpu_llc_shared_mask(cpu));
cpumask_set_cpu(cpu, topology_core_cpumask(cpu));
c->booted_cores = 1;
@@ -529,8 +535,12 @@ void set_cpu_sibling_map(int cpu)
for_each_cpu(i, cpu_sibling_setup_mask) {
o = &cpu_data(i);
 
-   if ((i == cpu) || (has_smt && match_smt(c, o)))
+   if ((i == cpu) || (has_smt && match_smt(c, o))) {
link_mask(topology_sibling_cpumask, cpu, i);
+   threads = cpumask_weight(topology_sibling_cpumask(cpu));
+   per_cpu(cpumask_weight_sibling, cpu) = threads;
+   per_cpu(cpumask_weight_sibling, i) = threads;
+   }
 
if ((i == cpu) || (has_mp && match_llc(c, o)))
link_mask(cpu_llc_shared_mask, cpu, i);
@@ -1173,6 +1183,8 @@ static __init void disable_smp(void)
else
physid_set_mask_of_physid(0, &phys_cpu_present_map);
cpumask_set_cpu(0, topology_sibling_cpumask(0));
+   per_cpu(cpumask_weight_sibling, 0) =
+   cpumask_weight(topology_sibling_cpumask(0));
cpumask_set_cpu(0, topology_core_cpumask(0));
 }
 
@@ -1482,6 +1494,8 @@ static void remove_siblinginfo(int cpu)
 
for_each_cpu(sibling, topology_core_cpumask(cpu)) {
cpumask_clear_cpu(cpu, topology_core_cpumask(sibling));
+   per_cpu(cpumask_weight_sibling, sibling) =
+   cpumask_weight(topology_sibling_cpumask(sibling));
/*/
 * last thread sibling in this cpu core going down
 */
@@ -1495,6 +1509,7 @@ static void remove_siblinginfo(int cpu)
cpumask_clear_cpu(cpu, cpu_llc_shared_mask(sibling));
cpumask_clear(cpu_llc_shared_mask(cpu));
cpumask_clear(topology_sibling_cpumask(cpu));
+   per_cpu(cpumask_weight_sibling, cpu) = 0;
cpumask_clear(topology_core_cpumask(cpu));
c->cpu_core_id = 0;
c->booted_cores = 0;
diff --git a/include/linux/topology.h b/include/linux/topology.h
index cb0775e..a85aea1 100644
--- a/include/linux/topology.h
+++ b/include/linux/topology.h
@@ -190,6 +190,10 @@ static inline int cpu_to_mem(int cpu)
 #ifndef topology_sibling_cpumask
 #define topology_sibling_cpumask(cpu)  cpumask_of(cpu)
 #endif
+#ifndef topology_sibling_we

[PATCH v3 3/7] sched: rotate the cpu search window for better spread

2019-06-26 Thread subhra mazumdar
Rotate the cpu search window for better spread of threads. This will ensure
an idle cpu will quickly be found if one exists.

Signed-off-by: subhra mazumdar 
---
 kernel/sched/fair.c | 10 --
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b58f08f..c1ca88e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6188,7 +6188,7 @@ static int select_idle_cpu(struct task_struct *p, struct 
sched_domain *sd, int t
u64 avg_cost, avg_idle;
u64 time, cost;
s64 delta;
-   int cpu, limit, floor, nr = INT_MAX;
+   int cpu, limit, floor, target_tmp, nr = INT_MAX;
 
this_sd = rcu_dereference(*this_cpu_ptr(&sd_llc));
if (!this_sd)
@@ -6219,9 +6219,15 @@ static int select_idle_cpu(struct task_struct *p, struct 
sched_domain *sd, int t
}
}
 
+   if (per_cpu(next_cpu, target) != -1)
+   target_tmp = per_cpu(next_cpu, target);
+   else
+   target_tmp = target;
+
time = local_clock();
 
-   for_each_cpu_wrap(cpu, sched_domain_span(sd), target) {
+   for_each_cpu_wrap(cpu, sched_domain_span(sd), target_tmp) {
+   per_cpu(next_cpu, target) = cpu;
if (!--nr)
return -1;
if (!cpumask_test_cpu(cpu, &p->cpus_allowed))
-- 
2.9.3



[PATCH v3 1/7] sched: limit cpu search in select_idle_cpu

2019-06-26 Thread subhra mazumdar
Put upper and lower limits on the cpu search of select_idle_cpu. The lower
limit is the number of cpus in a core, while the upper limit is twice that.
This ensures that for any architecture we will usually search beyond a core.
The upper limit also helps in keeping the search cost low and constant.

Signed-off-by: subhra mazumdar 
---
 kernel/sched/fair.c | 15 +++
 1 file changed, 11 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f35930f..b58f08f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6188,7 +6188,7 @@ static int select_idle_cpu(struct task_struct *p, struct 
sched_domain *sd, int t
u64 avg_cost, avg_idle;
u64 time, cost;
s64 delta;
-   int cpu, nr = INT_MAX;
+   int cpu, limit, floor, nr = INT_MAX;
 
this_sd = rcu_dereference(*this_cpu_ptr(&sd_llc));
if (!this_sd)
@@ -6206,10 +6206,17 @@ static int select_idle_cpu(struct task_struct *p, 
struct sched_domain *sd, int t
 
if (sched_feat(SIS_PROP)) {
u64 span_avg = sd->span_weight * avg_idle;
-   if (span_avg > 4*avg_cost)
+   floor = cpumask_weight(topology_sibling_cpumask(target));
+   if (floor < 2)
+   floor = 2;
+   limit = floor << 1;
+   if (span_avg > floor*avg_cost) {
nr = div_u64(span_avg, avg_cost);
-   else
-   nr = 4;
+   if (nr > limit)
+   nr = limit;
+   } else {
+   nr = floor;
+   }
}
 
time = local_clock();
-- 
2.9.3



[RESEND PATCH v3 0/7] Improve scheduler scalability for fast path

2019-06-26 Thread subhra mazumdar
rease by having SIS_CORE
false.

Schbench on 2 socket, 44 core and 88 threads Intel x86 machine with 44
tasks (lower is better):
percentile  baseline   %stdev   patch               %stdev
50          94         2.82     93.33 (0.71%)       1.24
75          124        2.13     122.67 (1.08%)      1.7
90          152        1.74     149.33 (1.75%)      2.35
95          171        2.11     167 (2.34%)         2.74
99          512.67     104.96   206 (59.82%)        8.86
99.5        2296       82.55    3121.67 (-35.96%)   97.37
99.9        12517.33   2.38     12592 (-0.6%)       1.67

Changes from v2->v3:
-Use shift operator instead of multiplication to compute limit
-Use per-CPU variable to precompute the number of sibling SMTs for x86

subhra mazumdar (7):
  sched: limit cpu search in select_idle_cpu
  sched: introduce per-cpu var next_cpu to track search limit
  sched: rotate the cpu search window for better spread
  sched: add sched feature to disable idle core search
  sched: SIS_CORE to disable idle core search
  x86/smpboot: introduce per-cpu variable for HT siblings
  sched: use per-cpu variable cpumask_weight_sibling

 arch/x86/include/asm/smp.h  |  1 +
 arch/x86/include/asm/topology.h |  1 +
 arch/x86/kernel/smpboot.c   | 17 -
 include/linux/topology.h|  4 
 kernel/sched/core.c |  2 ++
 kernel/sched/fair.c | 31 +++
 kernel/sched/features.h |  1 +
 kernel/sched/sched.h|  1 +
 8 files changed, 49 insertions(+), 9 deletions(-)

-- 
2.9.3



[PATCH v3 7/7] sched: use per-cpu variable cpumask_weight_sibling

2019-06-26 Thread subhra mazumdar
Use the per-cpu var cpumask_weight_sibling for quick lookup in select_idle_cpu.
This is the fast path of the scheduler and every cycle is worth saving. Using
cpumask_weight there can result in iterating over the cpumask bits.

Signed-off-by: subhra mazumdar 
---
 kernel/sched/fair.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 6a74808..878f11c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6206,7 +6206,7 @@ static int select_idle_cpu(struct task_struct *p, struct 
sched_domain *sd, int t
 
if (sched_feat(SIS_PROP)) {
u64 span_avg = sd->span_weight * avg_idle;
-   floor = cpumask_weight(topology_sibling_cpumask(target));
+   floor = topology_sibling_weight(target);
if (floor < 2)
floor = 2;
limit = floor << 1;
-- 
2.9.3



Re: [PATCH V3 2/2] sched/fair: Fallback to sched-idle CPU if idle CPU isn't found

2019-07-02 Thread Subhra Mazumdar



On 7/2/19 1:35 AM, Peter Zijlstra wrote:

On Mon, Jul 01, 2019 at 03:08:41PM -0700, Subhra Mazumdar wrote:

On 7/1/19 1:03 AM, Viresh Kumar wrote:

On 28-06-19, 18:16, Subhra Mazumdar wrote:

On 6/25/19 10:06 PM, Viresh Kumar wrote:

@@ -5376,6 +5376,15 @@ static struct {
#endif /* CONFIG_NO_HZ_COMMON */
+/* CPU only has SCHED_IDLE tasks enqueued */
+static int sched_idle_cpu(int cpu)
+{
+   struct rq *rq = cpu_rq(cpu);
+
+   return unlikely(rq->nr_running == rq->cfs.idle_h_nr_running &&
+   rq->nr_running);
+}
+

Shouldn't this check if rq->curr is also sched idle?

Why wouldn't the current set of checks be enough to guarantee that ?

I thought nr_running does not include the on-cpu thread.

It very much does.


And why not drop the rq->nr_running non zero check?

Because CPU isn't sched-idle if nr_running and idle_h_nr_running are both 0,
i.e. it is an IDLE cpu in that case. And so I thought it is important to have
this check as well.


idle_cpu() not only checks nr_running is 0 but also rq->curr == rq->idle

idle_cpu() will try very hard to declare a CPU !idle. But I don't see
how that it relevant. sched_idle_cpu() will only return true if there
are only SCHED_IDLE tasks on the CPU. Viresh's test is simple and
straight forward.


OK makes sense.
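For reference, idle_cpu() is roughly the following (paraphrased from memory of
kernel/sched/core.c around this kernel version; details may differ), which is
why it also rejects a CPU whose current task is not the idle task:

int idle_cpu(int cpu)
{
	struct rq *rq = cpu_rq(cpu);

	if (rq->curr != rq->idle)
		return 0;

	if (rq->nr_running)
		return 0;

#ifdef CONFIG_SMP
	/* a pending remote wakeup also counts as !idle */
	if (!llist_empty(&rq->wake_list))
		return 0;
#endif

	return 1;
}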

Thanks,
Subhra



Re: [RESEND PATCH v3 0/7] Improve scheduler scalability for fast path

2019-07-02 Thread Subhra Mazumdar



On 7/2/19 1:54 AM, Patrick Bellasi wrote:

Wondering if searching and preempting needs will ever be conflicting?
I guess the winning point is that we don't commit behaviors to
userspace, but just abstract concepts which are turned into biases.

I don't see conflicts right now: if you are latency tolerant that
means you can spend more time to try finding a better CPU (e.g. we can
use the energy model to compare multiple CPUs) _and/or_ give the
current task a better chance to complete by delaying its preemption.

OK



Otherwise sounds like a good direction to me. For the searching aspect, can
we map latency nice values to the % of cores we search in select_idle_cpu?
Thus the search cost can be controlled by latency nice value.

I guess that's worth a try, only caveat I see is that it's turning the
bias into something very platform specific. Meaning, the same
latency-nice value on different machines can have very different
results.

Would not be better to try finding a more platform independent mapping?

Maybe something time bounded, e.g. the higher the latency-nice the more
time we can spend looking for CPUs?

The issue I see is: suppose we have a range of latency-nice values, then it
should cover the entire range of search (one core to all cores). As Peter
said, some workloads will want to search the LLC fully. If we use absolute
time, the mapping from the range of latency-nice values to it will be
arbitrary. If you have something in mind let me know, maybe I am thinking
differently.
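That said, one purely illustrative mapping could be linear between one core
and the full LLC span (sketch only; the function name, the [-20, 19] range and
the direction of the mapping are my assumptions, following Patrick's
definition that higher latency-nice means more time can be spent searching):

/* illustrative sketch only; not part of any posted series */
static int sis_nr_from_latency_nice(struct sched_domain *sd, int latency_nice,
				    int cpus_per_core)
{
	int span = sd->span_weight;

	/* -20 (least latency tolerant) -> one core, +19 -> whole LLC */
	int nr = cpus_per_core +
		 ((latency_nice + 20) * (span - cpus_per_core)) / 39;

	return max(nr, cpus_per_core);
}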



But the issue is if more latency tolerant workloads set to less
search, we still need some mechanism to achieve good spread of
threads.

I don't get this example: why more latency tolerant workloads should
require less search?

I guess I got the definition of "latency tolerant" backwards.



Can we keep the sliding window mechanism in that case?

Which one? Sorry did not went through the patches, can you briefly
resume the idea?

If a workload sets a low latency tolerance, then the search will be
shorter. That can lead to localization of threads on a few CPUs, as we are not
searching the entire LLC even if there are idle CPUs available. For this
I had introduced a per-CPU variable (for the target CPU) to track the
boundary of the search, so that the next search starts from that boundary,
thus sliding the window. So even if we search very little, the search window
keeps shifting and gives us a good spread. This is orthogonal to the
latency-nice thing.
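A minimal sketch of that sliding window (this mirrors patch 3/7 of the series
posted earlier in this thread archive):

	/* resume the LLC scan where the last wakeup on this target stopped */
	if (per_cpu(next_cpu, target) != -1)
		target_tmp = per_cpu(next_cpu, target);
	else
		target_tmp = target;

	for_each_cpu_wrap(cpu, sched_domain_span(sd), target_tmp) {
		per_cpu(next_cpu, target) = cpu;	/* new boundary */
		if (!--nr)
			return -1;			/* window exhausted */
		if (!cpumask_test_cpu(cpu, &p->cpus_allowed))
			continue;
		if (available_idle_cpu(cpu))
			break;
	}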



Also will latency nice do anything for select_idle_core and
select_idle_smt?

I guess principle the same bias can be used at different levels, maybe
with different mappings.

Doing it for select_idle_core will have the issue that the dynamic flag
(whether an idle core is present or not) can only be updated by threads
which are doing the full search.

Thanks,
Subhra


In the mobile world use-case we will likely use it only to switch from
select_idle_sibling to the energy aware slow path. And perhaps to see
if we can bias the wakeup preemption granularity.

Best,
Patrick



[PATCH v3 5/7] sched: SIS_CORE to disable idle core search

2019-06-08 Thread subhra mazumdar
Use SIS_CORE to disable idle core search. For some workloads
select_idle_core becomes a scalability bottleneck, removing it improves
throughput. Also there are workloads where disabling it can hurt latency,
so need to have an option.

Signed-off-by: subhra mazumdar 
---
 kernel/sched/fair.c | 8 +---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c1ca88e..6a74808 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6280,9 +6280,11 @@ static int select_idle_sibling(struct task_struct *p, 
int prev, int target)
if (!sd)
return target;
 
-   i = select_idle_core(p, sd, target);
-   if ((unsigned)i < nr_cpumask_bits)
-   return i;
+   if (sched_feat(SIS_CORE)) {
+   i = select_idle_core(p, sd, target);
+   if ((unsigned)i < nr_cpumask_bits)
+   return i;
+   }
 
i = select_idle_cpu(p, sd, target);
if ((unsigned)i < nr_cpumask_bits)
-- 
2.9.3



[PATCH v3 1/7] sched: limit cpu search in select_idle_cpu

2019-06-08 Thread subhra mazumdar
Put upper and lower limits on the cpu search of select_idle_cpu. The lower
limit is the number of cpus in a core, while the upper limit is twice that.
This ensures that for any architecture we will usually search beyond a core.
The upper limit also helps in keeping the search cost low and constant.

Signed-off-by: subhra mazumdar 
---
 kernel/sched/fair.c | 15 +++
 1 file changed, 11 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f35930f..b58f08f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6188,7 +6188,7 @@ static int select_idle_cpu(struct task_struct *p, struct 
sched_domain *sd, int t
u64 avg_cost, avg_idle;
u64 time, cost;
s64 delta;
-   int cpu, nr = INT_MAX;
+   int cpu, limit, floor, nr = INT_MAX;
 
this_sd = rcu_dereference(*this_cpu_ptr(&sd_llc));
if (!this_sd)
@@ -6206,10 +6206,17 @@ static int select_idle_cpu(struct task_struct *p, 
struct sched_domain *sd, int t
 
if (sched_feat(SIS_PROP)) {
u64 span_avg = sd->span_weight * avg_idle;
-   if (span_avg > 4*avg_cost)
+   floor = cpumask_weight(topology_sibling_cpumask(target));
+   if (floor < 2)
+   floor = 2;
+   limit = floor << 1;
+   if (span_avg > floor*avg_cost) {
nr = div_u64(span_avg, avg_cost);
-   else
-   nr = 4;
+   if (nr > limit)
+   nr = limit;
+   } else {
+   nr = floor;
+   }
}
 
time = local_clock();
-- 
2.9.3



[PATCH v3 7/7] sched: use per-cpu variable cpumask_weight_sibling

2019-06-08 Thread subhra mazumdar
Use the per-cpu variable cpumask_weight_sibling for quick lookup in
select_idle_cpu. This is the fast path of the scheduler and every cycle is
worth saving. Using cpumask_weight can result in iterating over the words
of the cpu bitmask.

Signed-off-by: subhra mazumdar 
---
 kernel/sched/fair.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 6a74808..878f11c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6206,7 +6206,7 @@ static int select_idle_cpu(struct task_struct *p, struct 
sched_domain *sd, int t
 
if (sched_feat(SIS_PROP)) {
u64 span_avg = sd->span_weight * avg_idle;
-   floor = cpumask_weight(topology_sibling_cpumask(target));
+   floor = topology_sibling_weight(target);
if (floor < 2)
floor = 2;
limit = floor << 1;
-- 
2.9.3



[PATCH v3 2/7] sched: introduce per-cpu var next_cpu to track search limit

2019-06-08 Thread subhra mazumdar
Introduce a per-cpu variable to track the limit up to which the idle cpu
search was done in select_idle_cpu(). This helps the next search start from
there. This is necessary for rotating the search window over the entire
LLC domain.

Signed-off-by: subhra mazumdar 
---
 kernel/sched/core.c  | 2 ++
 kernel/sched/sched.h | 1 +
 2 files changed, 3 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 874c427..80657fc 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -24,6 +24,7 @@
 #include 
 
 DEFINE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues);
+DEFINE_PER_CPU_SHARED_ALIGNED(int, next_cpu);
 
 #if defined(CONFIG_SCHED_DEBUG) && defined(CONFIG_JUMP_LABEL)
 /*
@@ -5966,6 +5967,7 @@ void __init sched_init(void)
for_each_possible_cpu(i) {
struct rq *rq;
 
+   per_cpu(next_cpu, i) = -1;
rq = cpu_rq(i);
raw_spin_lock_init(&rq->lock);
rq->nr_running = 0;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index b52ed1a..4cecfa2 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -994,6 +994,7 @@ static inline void update_idle_core(struct rq *rq) { }
 #endif
 
 DECLARE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues);
+DECLARE_PER_CPU_SHARED_ALIGNED(int, next_cpu);
 
 #define cpu_rq(cpu)(&per_cpu(runqueues, (cpu)))
 #define this_rq()  this_cpu_ptr(&runqueues)
-- 
2.9.3



[PATCH v3 3/7] sched: rotate the cpu search window for better spread

2019-06-08 Thread subhra mazumdar
Rotate the cpu search window for better spread of threads. This will ensure
an idle cpu will quickly be found if one exists.

Signed-off-by: subhra mazumdar 
---
 kernel/sched/fair.c | 10 --
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b58f08f..c1ca88e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6188,7 +6188,7 @@ static int select_idle_cpu(struct task_struct *p, struct 
sched_domain *sd, int t
u64 avg_cost, avg_idle;
u64 time, cost;
s64 delta;
-   int cpu, limit, floor, nr = INT_MAX;
+   int cpu, limit, floor, target_tmp, nr = INT_MAX;
 
this_sd = rcu_dereference(*this_cpu_ptr(&sd_llc));
if (!this_sd)
@@ -6219,9 +6219,15 @@ static int select_idle_cpu(struct task_struct *p, struct 
sched_domain *sd, int t
}
}
 
+   if (per_cpu(next_cpu, target) != -1)
+   target_tmp = per_cpu(next_cpu, target);
+   else
+   target_tmp = target;
+
time = local_clock();
 
-   for_each_cpu_wrap(cpu, sched_domain_span(sd), target) {
+   for_each_cpu_wrap(cpu, sched_domain_span(sd), target_tmp) {
+   per_cpu(next_cpu, target) = cpu;
if (!--nr)
return -1;
if (!cpumask_test_cpu(cpu, &p->cpus_allowed))
-- 
2.9.3



[PATCH v3 0/7] Improve scheduler scalability for fast path

2019-06-08 Thread subhra mazumdar
        94        2.82     93.33 (0.71%)       1.24
75      124       2.13     122.67 (1.08%)      1.7
90      152       1.74     149.33 (1.75%)      2.35
95      171       2.11     167 (2.34%)         2.74
99      512.67    104.96   206 (59.82%)        8.86
99.5    2296      82.55    3121.67 (-35.96%)   97.37
99.9    12517.33  2.38     12592 (-0.6%)       1.67

Changes from v2->v3:
-Use shift operator instead of multiplication to compute limit
-Use per-CPU variable to precompute the number of sibling SMTs for x86

subhra mazumdar (7):
  sched: limit cpu search in select_idle_cpu
  sched: introduce per-cpu var next_cpu to track search limit
  sched: rotate the cpu search window for better spread
  sched: add sched feature to disable idle core search
  sched: SIS_CORE to disable idle core search
  x86/smpboot: introduce per-cpu variable for HT siblings
  sched: use per-cpu variable cpumask_weight_sibling

 arch/x86/include/asm/smp.h  |  1 +
 arch/x86/include/asm/topology.h |  1 +
 arch/x86/kernel/smpboot.c   | 17 -
 include/linux/topology.h|  4 
 kernel/sched/core.c |  2 ++
 kernel/sched/fair.c | 31 +++
 kernel/sched/features.h |  1 +
 kernel/sched/sched.h|  1 +
 8 files changed, 49 insertions(+), 9 deletions(-)

-- 
2.9.3



[PATCH v3 6/7] x86/smpboot: introduce per-cpu variable for HT siblings

2019-06-08 Thread subhra mazumdar
Introduce a per-cpu variable to keep the number of HT siblings of a cpu.
This will be used for quick lookup in select_idle_cpu to determine the
limits of search. This patch does it only for x86.

Signed-off-by: subhra mazumdar 
---
 arch/x86/include/asm/smp.h  |  1 +
 arch/x86/include/asm/topology.h |  1 +
 arch/x86/kernel/smpboot.c   | 17 -
 include/linux/topology.h|  4 
 4 files changed, 22 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/smp.h b/arch/x86/include/asm/smp.h
index da545df..1e90cbd 100644
--- a/arch/x86/include/asm/smp.h
+++ b/arch/x86/include/asm/smp.h
@@ -22,6 +22,7 @@ extern int smp_num_siblings;
 extern unsigned int num_processors;
 
 DECLARE_PER_CPU_READ_MOSTLY(cpumask_var_t, cpu_sibling_map);
+DECLARE_PER_CPU_READ_MOSTLY(unsigned int, cpumask_weight_sibling);
 DECLARE_PER_CPU_READ_MOSTLY(cpumask_var_t, cpu_core_map);
 /* cpus sharing the last level cache: */
 DECLARE_PER_CPU_READ_MOSTLY(cpumask_var_t, cpu_llc_shared_map);
diff --git a/arch/x86/include/asm/topology.h b/arch/x86/include/asm/topology.h
index 453cf38..dd19c71 100644
--- a/arch/x86/include/asm/topology.h
+++ b/arch/x86/include/asm/topology.h
@@ -111,6 +111,7 @@ extern const struct cpumask *cpu_coregroup_mask(int cpu);
 #ifdef CONFIG_SMP
 #define topology_core_cpumask(cpu) (per_cpu(cpu_core_map, cpu))
 #define topology_sibling_cpumask(cpu)  (per_cpu(cpu_sibling_map, cpu))
+#define topology_sibling_weight(cpu)   (per_cpu(cpumask_weight_sibling, cpu))
 
 extern unsigned int __max_logical_packages;
 #define topology_max_packages()(__max_logical_packages)
diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index 362dd89..20bf676 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -85,6 +85,10 @@
 DEFINE_PER_CPU_READ_MOSTLY(cpumask_var_t, cpu_sibling_map);
 EXPORT_PER_CPU_SYMBOL(cpu_sibling_map);
 
+/* representing number of HT siblings of each CPU */
+DEFINE_PER_CPU_READ_MOSTLY(unsigned int, cpumask_weight_sibling);
+EXPORT_PER_CPU_SYMBOL(cpumask_weight_sibling);
+
 /* representing HT and core siblings of each logical CPU */
 DEFINE_PER_CPU_READ_MOSTLY(cpumask_var_t, cpu_core_map);
 EXPORT_PER_CPU_SYMBOL(cpu_core_map);
@@ -520,6 +524,8 @@ void set_cpu_sibling_map(int cpu)
 
if (!has_mp) {
cpumask_set_cpu(cpu, topology_sibling_cpumask(cpu));
+   per_cpu(cpumask_weight_sibling, cpu) =
+   cpumask_weight(topology_sibling_cpumask(cpu));
cpumask_set_cpu(cpu, cpu_llc_shared_mask(cpu));
cpumask_set_cpu(cpu, topology_core_cpumask(cpu));
c->booted_cores = 1;
@@ -529,8 +535,12 @@ void set_cpu_sibling_map(int cpu)
for_each_cpu(i, cpu_sibling_setup_mask) {
o = &cpu_data(i);
 
-   if ((i == cpu) || (has_smt && match_smt(c, o)))
+   if ((i == cpu) || (has_smt && match_smt(c, o))) {
link_mask(topology_sibling_cpumask, cpu, i);
+   threads = cpumask_weight(topology_sibling_cpumask(cpu));
+   per_cpu(cpumask_weight_sibling, cpu) = threads;
+   per_cpu(cpumask_weight_sibling, i) = threads;
+   }
 
if ((i == cpu) || (has_mp && match_llc(c, o)))
link_mask(cpu_llc_shared_mask, cpu, i);
@@ -1173,6 +1183,8 @@ static __init void disable_smp(void)
else
physid_set_mask_of_physid(0, &phys_cpu_present_map);
cpumask_set_cpu(0, topology_sibling_cpumask(0));
+   per_cpu(cpumask_weight_sibling, 0) =
+   cpumask_weight(topology_sibling_cpumask(0));
cpumask_set_cpu(0, topology_core_cpumask(0));
 }
 
@@ -1482,6 +1494,8 @@ static void remove_siblinginfo(int cpu)
 
for_each_cpu(sibling, topology_core_cpumask(cpu)) {
cpumask_clear_cpu(cpu, topology_core_cpumask(sibling));
+   per_cpu(cpumask_weight_sibling, sibling) =
+   cpumask_weight(topology_sibling_cpumask(sibling));
/*/
 * last thread sibling in this cpu core going down
 */
@@ -1495,6 +1509,7 @@ static void remove_siblinginfo(int cpu)
cpumask_clear_cpu(cpu, cpu_llc_shared_mask(sibling));
cpumask_clear(cpu_llc_shared_mask(cpu));
cpumask_clear(topology_sibling_cpumask(cpu));
+   per_cpu(cpumask_weight_sibling, cpu) = 0;
cpumask_clear(topology_core_cpumask(cpu));
c->cpu_core_id = 0;
c->booted_cores = 0;
diff --git a/include/linux/topology.h b/include/linux/topology.h
index cb0775e..a85aea1 100644
--- a/include/linux/topology.h
+++ b/include/linux/topology.h
@@ -190,6 +190,10 @@ static inline int cpu_to_mem(int cpu)
 #ifndef topology_sibling_cpumask
 #define topology_sibling_cpumask(cpu)  cpumask_of(cpu)
 #endif
+#ifndef topology_sibling_weight
+#define topology_sibling_weight(cpu)   \
+               cpumask_weight(topology_sibling_cpumask(cpu))
+#endif
-- 
2.9.3

[PATCH v3 4/7] sched: add sched feature to disable idle core search

2019-06-08 Thread subhra mazumdar
Add a new sched feature SIS_CORE to have an option to disable idle core
search (select_idle_core).

Signed-off-by: subhra mazumdar 
---
 kernel/sched/features.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index 858589b..de4d506 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -57,6 +57,7 @@ SCHED_FEAT(TTWU_QUEUE, true)
  */
 SCHED_FEAT(SIS_AVG_CPU, false)
 SCHED_FEAT(SIS_PROP, true)
+SCHED_FEAT(SIS_CORE, true)
 
 /*
  * Issue a WARN when we do multiple update_rq_clock() calls
-- 
2.9.3



Re: [PATCH v3 6/7] x86/smpboot: introduce per-cpu variable for HT siblings

2019-06-27 Thread Subhra Mazumdar



On 6/26/19 11:51 PM, Thomas Gleixner wrote:

On Wed, 26 Jun 2019, subhra mazumdar wrote:


Introduce a per-cpu variable to keep the number of HT siblings of a cpu.
This will be used for quick lookup in select_idle_cpu to determine the
limits of search.

Why? The number of siblings is constant at least today unless you play
silly cpu hotplug games. A bit more justification for adding yet another
random storage would be appreciated.

Using cpumask_weight every time in select_idle_cpu to compute the number of
SMT siblings can be costly, as cpumask_weight may not be O(1) for systems
with a large number of CPUs (e.g. 8 sockets, each socket having lots of
cores). Beyond 512 CPUs the bitmask spans multiple cache lines, and touching
multiple cache lines in the fast path of the scheduler can cost more than we
save from this optimization. Even within a single cache line it loops over
longs. We want to touch O(1) cache lines and do O(1) operations, hence we
pre-compute it in a per-CPU variable.
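
For reference, cpumask_weight() reduces to a population count over the
words of the bitmap, roughly the following (simplified from the kernel's
__bitmap_weight; a 512-bit mask already fills a whole 64-byte cache line):

/* simplified sketch of __bitmap_weight(): one hweight per bitmap word */
static unsigned int bitmap_weight_sketch(const unsigned long *bitmap,
					 unsigned int bits)
{
	unsigned int k, w = 0, lim = bits / BITS_PER_LONG;

	for (k = 0; k < lim; k++)
		w += hweight_long(bitmap[k]);

	if (bits % BITS_PER_LONG)
		w += hweight_long(bitmap[k] & BITMAP_LAST_WORD_MASK(bits));

	return w;
}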



This patch does it only for x86.

# grep 'This patch' Documentation/process/submitting-patches.rst

IOW, we all know already that this is a patch and from the subject prefix
and the diffstat it's pretty obvious that this is x86 only.

So instead of documenting the obvious, please add proper context to justify
the change.

OK. The extra per-CPU optimization was done only for x86 as we cared about
it the most and wanted to make it future proof. I will add it for other
architectures.
  

+/* representing number of HT siblings of each CPU */
+DEFINE_PER_CPU_READ_MOSTLY(unsigned int, cpumask_weight_sibling);
+EXPORT_PER_CPU_SYMBOL(cpumask_weight_sibling);

Why does this need an export? No module has any reason to access this.

I will remove it



  /* representing HT and core siblings of each logical CPU */
  DEFINE_PER_CPU_READ_MOSTLY(cpumask_var_t, cpu_core_map);
  EXPORT_PER_CPU_SYMBOL(cpu_core_map);
@@ -520,6 +524,8 @@ void set_cpu_sibling_map(int cpu)
  
  	if (!has_mp) {

cpumask_set_cpu(cpu, topology_sibling_cpumask(cpu));
+   per_cpu(cpumask_weight_sibling, cpu) =
+   cpumask_weight(topology_sibling_cpumask(cpu));
cpumask_set_cpu(cpu, cpu_llc_shared_mask(cpu));
cpumask_set_cpu(cpu, topology_core_cpumask(cpu));
c->booted_cores = 1;
@@ -529,8 +535,12 @@ void set_cpu_sibling_map(int cpu)
for_each_cpu(i, cpu_sibling_setup_mask) {
o = &cpu_data(i);
  
-		if ((i == cpu) || (has_smt && match_smt(c, o)))

+   if ((i == cpu) || (has_smt && match_smt(c, o))) {
link_mask(topology_sibling_cpumask, cpu, i);
+   threads = cpumask_weight(topology_sibling_cpumask(cpu));
+   per_cpu(cpumask_weight_sibling, cpu) = threads;
+   per_cpu(cpumask_weight_sibling, i) = threads;

This only works for SMT=2, but fails to update the rest for SMT=4.


I guess I assumed that x86 will always be SMT2; I will fix this.
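
A possible shape for that fix (a sketch only, not posted code; it assumes a
local 'int sibling'): recompute the weight for every CPU already in the
sibling mask whenever the mask grows, so SMT4/SMT8 stay correct as well:

	if ((i == cpu) || (has_smt && match_smt(c, o))) {
		link_mask(topology_sibling_cpumask, cpu, i);
		threads = cpumask_weight(topology_sibling_cpumask(cpu));
		/* refresh every member of the mask, not just cpu and i */
		for_each_cpu(sibling, topology_sibling_cpumask(cpu))
			per_cpu(cpumask_weight_sibling, sibling) = threads;
	}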

Thanks,
Subhra




@@ -1482,6 +1494,8 @@ static void remove_siblinginfo(int cpu)
  
  	for_each_cpu(sibling, topology_core_cpumask(cpu)) {

cpumask_clear_cpu(cpu, topology_core_cpumask(sibling));
+   per_cpu(cpumask_weight_sibling, sibling) =
+   cpumask_weight(topology_sibling_cpumask(sibling));

While remove does the right thing.

Thanks,

tglx


Re: [PATCH v3 6/7] x86/smpboot: introduce per-cpu variable for HT siblings

2019-06-27 Thread Subhra Mazumdar



On 6/26/19 11:54 PM, Thomas Gleixner wrote:

On Thu, 27 Jun 2019, Thomas Gleixner wrote:


On Wed, 26 Jun 2019, subhra mazumdar wrote:


Introduce a per-cpu variable to keep the number of HT siblings of a cpu.
This will be used for quick lookup in select_idle_cpu to determine the
limits of search.

Why? The number of siblings is constant at least today unless you play
silly cpu hotplug games. A bit more justification for adding yet another
random storage would be appreciated.


This patch does it only for x86.

# grep 'This patch' Documentation/process/submitting-patches.rst

IOW, we all know already that this is a patch and from the subject prefix
and the diffstat it's pretty obvious that this is x86 only.

So instead of documenting the obvious, please add proper context to justify
the change.

Aside of that the right ordering is to introduce the default fallback in a
separate patch, which explains the reasoning and then in the next one add
the x86 optimized version.

OK. I will also add the extra optimization for other architectures.

Thanks,
Subhra


Thanks,

tglx


Re: [PATCH v3 3/7] sched: rotate the cpu search window for better spread

2019-06-28 Thread Subhra Mazumdar



On 6/28/19 11:36 AM, Parth Shah wrote:

Hi Subhra,

I ran your patch series on IBM POWER systems and this is what I have observed.

On 6/27/19 6:59 AM, subhra mazumdar wrote:

Rotate the cpu search window for better spread of threads. This will ensure
an idle cpu will quickly be found if one exists.

Signed-off-by: subhra mazumdar 
---
  kernel/sched/fair.c | 10 --
  1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b58f08f..c1ca88e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6188,7 +6188,7 @@ static int select_idle_cpu(struct task_struct *p, struct 
sched_domain *sd, int t
u64 avg_cost, avg_idle;
u64 time, cost;
s64 delta;
-   int cpu, limit, floor, nr = INT_MAX;
+   int cpu, limit, floor, target_tmp, nr = INT_MAX;

this_sd = rcu_dereference(*this_cpu_ptr(&sd_llc));
if (!this_sd)
@@ -6219,9 +6219,15 @@ static int select_idle_cpu(struct task_struct *p, struct 
sched_domain *sd, int t
}
}

+   if (per_cpu(next_cpu, target) != -1)
+   target_tmp = per_cpu(next_cpu, target);
+   else
+   target_tmp = target;
+
time = local_clock();

-   for_each_cpu_wrap(cpu, sched_domain_span(sd), target) {
+   for_each_cpu_wrap(cpu, sched_domain_span(sd), target_tmp) {
+   per_cpu(next_cpu, target) = cpu;

This leads to a problem of cache hotness.
AFAIU, in most cases, `target = prev_cpu` of the task being woken up, and
selecting an idle CPU nearest to prev_cpu is favorable. But since this
doesn't keep track of the last idle cpu per task, it fails to find the
nearest possible idle CPU when the task is woken up after another task has
been scheduled.


I had tested hackbench on SPARC SMT8 (see numbers in the cover letter) and
it showed improvement with this. Firstly, it is a tradeoff between cache
effects and time spent searching for an idle CPU, and both the x86 and SPARC
numbers showed the tradeoff is worth it. Secondly, there is a lot of cache
affinity logic at the beginning of select_idle_sibling. If select_idle_cpu
is still called, that means we are past that and want any idle cpu. I don't
think waking up close to the prev cpu is the intention of starting the
search from there; rather it is to spread threads across all cpus so that no
cpu gets victimized, as there is no atomicity. Prev cpu just acts as a good
seed to do the spreading.

Thanks,
Subhra


Re: [PATCH v3 1/7] sched: limit cpu search in select_idle_cpu

2019-06-28 Thread Subhra Mazumdar



On 6/28/19 11:47 AM, Parth Shah wrote:


On 6/27/19 6:59 AM, subhra mazumdar wrote:

Put upper and lower limits on the cpu search of select_idle_cpu. The lower
limit is the number of cpus in a core while the upper limit is twice that.
This ensures that for any architecture we will usually search beyond a core.
The upper limit also helps in keeping the search cost low and constant.

Signed-off-by: subhra mazumdar 
---
  kernel/sched/fair.c | 15 +++
  1 file changed, 11 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f35930f..b58f08f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6188,7 +6188,7 @@ static int select_idle_cpu(struct task_struct *p, struct 
sched_domain *sd, int t
u64 avg_cost, avg_idle;
u64 time, cost;
s64 delta;
-   int cpu, nr = INT_MAX;
+   int cpu, limit, floor, nr = INT_MAX;

this_sd = rcu_dereference(*this_cpu_ptr(&sd_llc));
if (!this_sd)
@@ -6206,10 +6206,17 @@ static int select_idle_cpu(struct task_struct *p, 
struct sched_domain *sd, int t

if (sched_feat(SIS_PROP)) {
u64 span_avg = sd->span_weight * avg_idle;
-   if (span_avg > 4*avg_cost)
+   floor = cpumask_weight(topology_sibling_cpumask(target));
+   if (floor < 2)
+   floor = 2;
+   limit = floor << 1;

Is the upper limit an experimental value only, or does it have any
arch-specific significance?
Because, AFAIU, systems like POWER9 might benefit from searching 4 cores
due to their different cache model. So it could then be tuned for
arch-specific builds.

The lower bound and upper bound are 1 core and 2 cores respectively. That
is done so as to search beyond one core and at the same time not search
too much. It is a heuristic that seemed to work well on all archs, coupled
with the moving window mechanism. Does 4 vs 2 make any difference on your
POWER9? AFAIR it didn't on SPARC SMT8.
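
To make the bounds concrete (taken directly from the posted patch: floor =
SMT sibling count clamped to at least 2, limit = floor << 1):

	SMT2 (x86):     floor = 2, limit = 4    i.e. search 1 to 2 cores
	SMT4 (POWER9):  floor = 4, limit = 8    i.e. search 1 to 2 cores
	SMT8 (SPARC):   floor = 8, limit = 16   i.e. search 1 to 2 cores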


Also variable names can be changed for better readability.
floor -> weight_clamp_min
limit -> weight_clamp_max
or something similar

OK.

Thanks,
Subhra




+   if (span_avg > floor*avg_cost) {
nr = div_u64(span_avg, avg_cost);
-   else
-   nr = 4;
+   if (nr > limit)
+   nr = limit;
+   } else {
+   nr = floor;
+   }
}

time = local_clock();



Best,
Parth



Re: [PATCH v3 5/7] sched: SIS_CORE to disable idle core search

2019-06-28 Thread Subhra Mazumdar



On 6/28/19 12:01 PM, Parth Shah wrote:


On 6/27/19 6:59 AM, subhra mazumdar wrote:

Use SIS_CORE to disable idle core search. For some workloads
select_idle_core becomes a scalability bottleneck, and removing it improves
throughput. There are also workloads where disabling it can hurt latency,
so we need to have an option.

Signed-off-by: subhra mazumdar 
---
  kernel/sched/fair.c | 8 +---
  1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c1ca88e..6a74808 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6280,9 +6280,11 @@ static int select_idle_sibling(struct task_struct *p, 
int prev, int target)
if (!sd)
return target;

-   i = select_idle_core(p, sd, target);
-   if ((unsigned)i < nr_cpumask_bits)
-   return i;
+   if (sched_feat(SIS_CORE)) {
+   i = select_idle_core(p, sd, target);
+   if ((unsigned)i < nr_cpumask_bits)
+   return i;
+   }

This can cause a significant performance loss if disabled. select_idle_core
spreads workloads quickly across the cores, so disabling it leaves much of
that work to be offloaded to the load balancer to move tasks across the
cores. Latency sensitive and long running multi-threaded workloads should
see a regression under these conditions.

Yes in case of SPARC SMT8 I did notice that (see cover letter). That's why
it is a feature that is ON by default, but can be turned OFF for specific
workloads on x86 SMT2 that can benefit from it.

Also, systems like POWER9 have sd_llc as a pair of cores only. So they
won't benefit from the limits, and hence hiding your code in select_idle_cpu
behind static keys would be much preferred.

If it doesn't hurt then I don't see the point.

Thanks,
Subhra



Re: [PATCH v3 3/7] sched: rotate the cpu search window for better spread

2019-06-28 Thread Subhra Mazumdar



On 6/28/19 4:54 AM, Srikar Dronamraju wrote:

* subhra mazumdar  [2019-06-26 18:29:15]:


Rotate the cpu search window for better spread of threads. This will ensure
an idle cpu will quickly be found if one exists.

While rotating the cpu search window is good, I am not sure if this can find
an idle cpu quickly. The probability of finding an idle cpu should still
remain the same. No?


Signed-off-by: subhra mazumdar 
---
  kernel/sched/fair.c | 10 --
  1 file changed, 8 insertions(+), 2 deletions(-)

@@ -6219,9 +6219,15 @@ static int select_idle_cpu(struct task_struct *p, struct 
sched_domain *sd, int t
}
}
  
+	if (per_cpu(next_cpu, target) != -1)

+   target_tmp = per_cpu(next_cpu, target);
+   else
+   target_tmp = target;
+
time = local_clock();
  
-	for_each_cpu_wrap(cpu, sched_domain_span(sd), target) {

+   for_each_cpu_wrap(cpu, sched_domain_span(sd), target_tmp) {
+   per_cpu(next_cpu, target) = cpu;

Shouldn't this assignment be outside the for loop?
With the current code,
1. We keep reassigning multiple times.
2. The last assignment happens for an idle cpu, and sometimes the
assignment is for a non-idle cpu.

We want the last assignment irrespective of whether it was an idle cpu or
not, since in both cases we want to track the boundary of the search.

Thanks,
Subhra


Re: [PATCH V3 2/2] sched/fair: Fallback to sched-idle CPU if idle CPU isn't found

2019-06-28 Thread Subhra Mazumdar



On 6/25/19 10:06 PM, Viresh Kumar wrote:

We try to find an idle CPU to run the next task, but in case we don't
find an idle CPU it is better to pick a CPU which will run the task the
soonest, for performance reason.

A CPU which isn't idle but has only SCHED_IDLE activity queued on it
should be a good target based on this criteria as any normal fair task
will most likely preempt the currently running SCHED_IDLE task
immediately. In fact, choosing a SCHED_IDLE CPU over a fully idle one
shall give better results as it should be able to run the task sooner
than an idle CPU (which requires to be woken up from an idle state).

This patch updates both fast and slow paths with this optimization.

Signed-off-by: Viresh Kumar 
---
  kernel/sched/fair.c | 43 +--
  1 file changed, 33 insertions(+), 10 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 1277adc3e7ed..2e0527fd468c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5376,6 +5376,15 @@ static struct {
  
  #endif /* CONFIG_NO_HZ_COMMON */
  
+/* CPU only has SCHED_IDLE tasks enqueued */

+static int sched_idle_cpu(int cpu)
+{
+   struct rq *rq = cpu_rq(cpu);
+
+   return unlikely(rq->nr_running == rq->cfs.idle_h_nr_running &&
+   rq->nr_running);
+}
+

Shouldn't this check whether rq->curr is also SCHED_IDLE? And why not drop
the rq->nr_running non-zero check?
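
For clarity, a sketch of the first variant being asked about (not a posted
patch; it assumes the existing task_has_idle_policy() helper):

/* sketch: additionally require the currently running task to be SCHED_IDLE */
static int sched_idle_cpu_strict(int cpu)
{
	struct rq *rq = cpu_rq(cpu);

	return unlikely(rq->nr_running &&
			rq->nr_running == rq->cfs.idle_h_nr_running &&
			task_has_idle_policy(rq->curr));
}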


Re: [PATCH v3 5/7] sched: SIS_CORE to disable idle core search

2019-07-01 Thread Subhra Mazumdar




Also, systems like POWER9 have sd_llc as a pair of cores only. So they
won't benefit from the limits, and hence hiding your code in select_idle_cpu
behind static keys would be much preferred.

If it doesn't hurt then I don't see the point.


So these is the result from POWER9 system with your patches:
System configuration: 2 Socket, 44 cores, 176 CPUs

Experiment setup:
===
=> Setup 1:
- 44 tasks doing just while(1), this is to make select_idle_core return -1 most 
times
- perf bench sched messaging -g 1 -l 100
+----------+--------+---------------+--------+
| Baseline | stddev |     Patch     | stddev |
+----------+--------+---------------+--------+
|   135    |  3.21  | 158 (-17.03%) |  4.69  |
+----------+--------+---------------+--------+

=> Setup 2:
- schbench -m44 -t 1
+=======+==========+=========+=========+==========+
| %ile  | Baseline | stddev  |  patch  |  stddev  |
+=======+==========+=========+=========+==========+
|  50   |    10    |  3.49   |   10    |   2.29   |
+-------+----------+---------+---------+----------+
|  95   |   467    |  4.47   |   469   |   0.81   |
+-------+----------+---------+---------+----------+
|  99   |   571    |  21.32  |   584   |  18.69   |
+-------+----------+---------+---------+----------+
|  99.5 |   629    |  30.05  |   641   |  20.95   |
+-------+----------+---------+---------+----------+
|  99.9 |   780    |  40.38  |   773   |  44.2    |
+-------+----------+---------+---------+----------+

I guess it doesn't make much difference in schbench results but hackbench (perf 
bench)
seems to have an observable regression.


Best,
Parth


If POWER9 sd_llc has only 2 cores, the behavior shouldn't change much with
the select_idle_cpu changes, as the limits are 1 and 2 cores. Previously the
lower bound was 4 cpus and the upper bound was calculated by the
proportionality logic (SIS_PROP). Now the lower bound is 1 core (4 cpus on
SMT4) and the upper bound is 2 cores. Could it be the extra computation of
cpumask_weight causing the regression rather than the sliding window itself
(one way to check this would be to hardcode 4 in place of
topology_sibling_weight)? Or is it the L1 cache coherency? I am a bit
surprised because SPARC SMT8, which has more cores in sd_llc and an L1
cache per core, showed improvement with hackbench.
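
Something along these lines, just for the experiment (illustrative only,
not a proposed change):

	/* debug experiment: take the per-CPU sibling-weight lookup out */
	floor = 4;	/* instead of topology_sibling_weight(target) */
	if (floor < 2)
		floor = 2;
	limit = floor << 1;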

Thanks,
Subhra

