Re: [RFC PATCH V2] sched: Improve scalability of select_idle_sibling using SMT balance
On 01/09/2018 06:50 AM, Steven Sistare wrote:
On 1/8/2018 5:18 PM, Peter Zijlstra wrote:
On Mon, Jan 08, 2018 at 02:12:37PM -0800, subhra mazumdar wrote:

@@ -2751,6 +2763,31 @@ context_switch(struct rq *rq, struct task_struct *prev,
	       struct task_struct *next, struct rq_flags *rf)
 {
 	struct mm_struct *mm, *oldmm;
+	int this_cpu = rq->cpu;
+	struct sched_domain *sd;
+	int prev_busy, next_busy;
+
+	if (rq->curr_util == UTIL_UNINITIALIZED)
+		prev_busy = 0;
+	else
+		prev_busy = (prev != rq->idle);
+	next_busy = (next != rq->idle);
+
+	/*
+	 * From sd_llc downward update the SMT utilization.
+	 * Skip the lowest level 0.
+	 */
+	sd = rcu_dereference_sched(per_cpu(sd_llc, this_cpu));
+	if (next_busy != prev_busy) {
+		for_each_lower_domain(sd) {
+			if (sd->level == 0)
+				break;
+			sd_context_switch(sd, rq, next_busy - prev_busy);
+		}
+	}
+

No, we're not going to be adding atomic ops here. We've been arguing over adding a single memory barrier to this path; atomics are just not going to happen. Also, this is entirely the wrong way to do it: we already have code paths that _know_ if they're going into or coming out of idle.

Yes, it would be more efficient to adjust the busy-cpu count of each level of the hierarchy in pick_next_task_idle and put_prev_task_idle.

- Steve

OK, I have moved it to pick_next_task_idle/put_prev_task_idle. Will send out the v3.

Thanks,
Subhra
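For context, a minimal sketch of what hooking the accounting into the idle-task pick/put paths (as Steve suggests and as v3 does) could look like. This is not the posted v3 code; it assumes the smt_util() helper whose truncated prototype appears later in the thread (taking rq, prev_busy, next_busy) and reuses the pick_next_task_idle()/put_prev_task_idle() signatures of that kernel era:

/*
 * Hypothetical sketch only -- not the posted v3 patch. Assumes a helper
 * smt_util(rq, prev_busy, next_busy) that walks the domains below sd_llc
 * and adds (next_busy - prev_busy) to the per-core busy count.
 */
static struct task_struct *
pick_next_task_idle(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
{
	put_prev_task(rq, prev);
	/* This CPU is going idle: drop its contribution to the core's count. */
	smt_util(rq, 1, 0);
	schedstat_inc(rq->sched_goidle);
	return rq->idle;
}

static void put_prev_task_idle(struct rq *rq, struct task_struct *prev)
{
	/* The idle task is being switched out: this CPU becomes busy again. */
	smt_util(rq, 0, 1);
}

The point of moving the update here is that these two paths already know the CPU is entering or leaving idle, so no extra state comparison is needed in context_switch().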
[RFC PATCH V3] sched: Improve scalability of select_idle_sibling using SMT balance
Current select_idle_sibling first tries to find a fully idle core using select_idle_core which can potentially search all cores and if it fails it finds any idle cpu using select_idle_cpu. select_idle_cpu can potentially search all cpus in the llc domain. This doesn't scale for large llc domains and will only get worse with more cores in future.

This patch solves the scalability problem of potentially searching all cores or cpus by using a randomized approach called SMT balance. It maintains a utilization of the SMTs per scheduling group based on the number of busy CPUs in the group (henceforth referred to as SMT utilization). This is accounted at the time of context switch. The SMT utilization is maintained only for levels below the LLC domain, so only for cores. During context switch each cpu of a core atomically increments or decrements the SMT utilization variable of that core depending on whether the cpu is going busy or idle. Since the atomic variable is per core, it will be updated only by the hyperthreads in a core with minimal contention.

In the fast path of wakeup the scheduler compares the target cpu group of select_idle_sibling with a random group in the LLC domain w.r.t SMT utilization to determine a better core to schedule. It chooses the core which has more SMT capacity (idle cpus) left. The SMT capacity is computed simply by subtracting SMT utilization from group weight. This comparison can be done in O(1). Finally it does an idle cpu search only in that core starting from a random cpu index. The random number generation needs to be fast and uses a per cpu pseudo random number generator.

Following are the numbers with various benchmarks on a x86 2 socket system with 22 cores per socket and 2 hyperthreads per core:

hackbench process:
groups  baseline-rc6(avg)  %stdev  patch(avg)       %stdev
1       0.4797             15.75   0.4324 (+9.86%)  2.23
2       0.4877             9.99    0.4535 (+7.01%)  3.36
4       0.8603             1.09    0.8376 (+2.64%)  0.95
8       1.496              0.60    1.4516 (+2.97%)  1.38
16      2.6642             0.37    2.5857 (+2.95%)  0.68
32      4.6715             0.40    4.5158 (+3.33%)  0.67

uperf pingpong throughput with loopback interface and message size = 8k:
threads  baseline-rc6(avg)  %stdev  patch(avg)        %stdev
8        49.47              0.35    51.16 (+3.42%)    0.53
16       95.28              0.77    101.02 (+6.03%)   0.43
32       156.77             1.17    181.52 (+15.79%)  0.96
48       193.24             0.22    212.90 (+10.17%)  0.45
64       216.21             9.33    264.14 (+22.17%)  0.69
128      379.62             10.29   416.36 (+9.68%)   1.04

Oracle DB TPC-C throughput normalized to baseline:
users  baseline-rc6 norm(avg)  %stdev  patch norm(avg)  %stdev
20     1                       0.94    1.0071 (+0.71%)  1.03
40     1                       0.82    1.0126 (+1.26%)  0.65
60     1                       1.10    0.9928 (-0.72%)  0.67
80     1                       0.63    1.003 (+0.30%)   0.64
100    1                       0.82    0.9957 (-0.43%)  0.15
120    1                       0.46    1.0034 (+0.34%)  1.74
140    1                       1.44    1.0247 (+2.47%)  0.15
160    1                       0.85    1.0445 (+4.45%)  0.81
180    1                       0.19    1.0382 (+3.82%)  0.57
200    1                       1.40    1.0295 (+2.95%)  0.94
220    1                       1.02    1.0242 (+2.42%)  0.85

Following is the cost (in us) of select_idle_sibling() with hackbench 16 groups:
function               baseline-rc6  %stdev  patch            %stdev
select_idle_sibling()  0.556         1.72    0.263 (-52.70%)  0.78

Signed-off-by: subhra mazumdar
---
 include/linux/sched/topology.h |   2 +
 kernel/sched/core.c            |  43 +++
 kernel/sched/fair.c            | 251 -
 kernel/sched/idle_task.c       |   3 +-
 kernel/sched/sched.h           |  28 ++---
 kernel/sched/topology.c        |  35 +-
 6 files changed, 208 insertions(+), 154 deletions(-)

diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index 7d065ab..cd1f129 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -146,6 +146,8 @@ struct sched_domain {
 	struct sched_domain_shared *shared;
 	unsigned int span_weight;
+	struct sched_group **sg_array;
+	int sg_num;
 	/*
 	 * Span of all CPUs in this domain.
 	 *
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index d17c5da..8e0f6bb 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2743,6 +2743,48 @@ asmlinkage __visible void schedule_tail(struct task_struct *prev)
 		put_user(task_pid_vnr(current), current->set_child_tid);
 }

+#ifdef CONFIG_SCHED_SMT
+
+/*
+ * From sd_llc downward update the SMT utilization.
+ * Skip the lowest level 0.
+ */
+void smt_util(struct rq *rq, int prev_busy, int n
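To make the O(1) comparison step described above concrete, here is a small standalone sketch in plain C. All names in it are hypothetical; the actual patch compares per-sched_group atomic busy counts inside the wakeup path and uses its own per-cpu pseudo random number generator:

/*
 * Standalone illustration of the SMT-balance pick described above.
 * Not kernel code; the real patch works on struct sched_group data.
 */
#include <stdio.h>
#include <stdint.h>

struct core {
	int weight;	/* number of SMT siblings in the core */
	int busy;	/* CPUs currently running a task (atomic in the patch) */
};

/* Cheap xorshift PRNG, standing in for the per-cpu PRNG in the patch. */
static uint32_t prng_state = 2463534242u;

static uint32_t prng_next(void)
{
	prng_state ^= prng_state << 13;
	prng_state ^= prng_state >> 17;
	prng_state ^= prng_state << 5;
	return prng_state;
}

/* O(1) comparison: prefer the core with more idle SMT capacity. */
static const struct core *pick_core(const struct core *target,
				    const struct core *random_peer)
{
	int cap_t = target->weight - target->busy;
	int cap_r = random_peer->weight - random_peer->busy;

	return (cap_r > cap_t) ? random_peer : target;
}

int main(void)
{
	struct core cores[4] = {
		{ .weight = 2, .busy = 2 },
		{ .weight = 2, .busy = 1 },
		{ .weight = 2, .busy = 0 },
		{ .weight = 2, .busy = 2 },
	};
	const struct core *target = &cores[0];
	/* Pick one random peer core instead of scanning the whole LLC. */
	const struct core *peer = &cores[prng_next() % 4];
	const struct core *best = pick_core(target, peer);

	printf("chose core with %d idle cpus\n", best->weight - best->busy);
	return 0;
}

The idle-cpu scan that follows in the real patch is then confined to the chosen core, which is what keeps the cost independent of the LLC size.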
[RESEND RFC PATCH V3] sched: Improve scalability of select_idle_sibling using SMT balance
Current select_idle_sibling first tries to find a fully idle core using select_idle_core which can potentially search all cores and if it fails it finds any idle cpu using select_idle_cpu. select_idle_cpu can potentially search all cpus in the llc domain. This doesn't scale for large llc domains and will only get worse with more cores in future.

This patch solves the scalability problem of potentially searching all cores or cpus by using a randomized approach called SMT balance. It maintains a utilization of the SMTs per scheduling group based on the number of busy CPUs in the group (henceforth referred to as SMT utilization). This is accounted at the time of context switch. The SMT utilization is maintained only for levels below the LLC domain, so only for cores. During context switch each cpu of a core atomically increments or decrements the SMT utilization variable of that core depending on whether the cpu is going busy or idle. Since the atomic variable is per core, it will be updated only by the hyperthreads in a core with minimal contention.

In the fast path of wakeup the scheduler compares the target cpu group of select_idle_sibling with a random group in the LLC domain w.r.t SMT utilization to determine a better core to schedule. It chooses the core which has more SMT capacity (idle cpus) left. The SMT capacity is computed simply by subtracting SMT utilization from group weight. This comparison can be done in O(1). Finally it does an idle cpu search only in that core starting from a random cpu index. The random number generation needs to be fast and uses a per cpu pseudo random number generator.

Following are the numbers with various benchmarks on a x86 2 socket system with 22 cores per socket and 2 hyperthreads per core:

hackbench process:
groups  baseline-rc6(avg)  %stdev  patch(avg)       %stdev
1       0.4797             15.75   0.4324 (+9.86%)  2.23
2       0.4877             9.99    0.4535 (+7.01%)  3.36
4       0.8603             1.09    0.8376 (+2.64%)  0.95
8       1.496              0.60    1.4516 (+2.97%)  1.38
16      2.6642             0.37    2.5857 (+2.95%)  0.68
32      4.6715             0.40    4.5158 (+3.33%)  0.67

uperf pingpong throughput with loopback interface and message size = 8k:
threads  baseline-rc6(avg)  %stdev  patch(avg)        %stdev
8        49.47              0.35    51.16 (+3.42%)    0.53
16       95.28              0.77    101.02 (+6.03%)   0.43
32       156.77             1.17    181.52 (+15.79%)  0.96
48       193.24             0.22    212.90 (+10.17%)  0.45
64       216.21             9.33    264.14 (+22.17%)  0.69
128      379.62             10.29   416.36 (+9.68%)   1.04

Oracle DB TPC-C throughput normalized to baseline:
users  baseline-rc6 norm(avg)  %stdev  patch norm(avg)  %stdev
20     1                       0.94    1.0071 (+0.71%)  1.03
40     1                       0.82    1.0126 (+1.26%)  0.65
60     1                       1.10    0.9928 (-0.72%)  0.67
80     1                       0.63    1.003 (+0.30%)   0.64
100    1                       0.82    0.9957 (-0.43%)  0.15
120    1                       0.46    1.0034 (+0.34%)  1.74
140    1                       1.44    1.0247 (+2.47%)  0.15
160    1                       0.85    1.0445 (+4.45%)  0.81
180    1                       0.19    1.0382 (+3.82%)  0.57
200    1                       1.40    1.0295 (+2.95%)  0.94
220    1                       1.02    1.0242 (+2.42%)  0.85

Following is the cost (in us) of select_idle_sibling() with hackbench 16 groups:
function               baseline-rc6  %stdev  patch            %stdev
select_idle_sibling()  0.556         1.72    0.263 (-52.70%)  0.78

Signed-off-by: subhra mazumdar
---
 include/linux/sched/topology.h |   2 +
 kernel/sched/core.c            |  43 +++
 kernel/sched/fair.c            | 247 -
 kernel/sched/idle_task.c       |   3 +-
 kernel/sched/sched.h           |  28 ++---
 kernel/sched/topology.c        |  35 +-
 6 files changed, 206 insertions(+), 152 deletions(-)

diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index cf257c2..e63e4fb 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -147,6 +147,8 @@ struct sched_domain {
 	struct sched_domain_shared *shared;
 	unsigned int span_weight;
+	struct sched_group **sg_array;
+	int sg_num;
 	/*
 	 * Span of all CPUs in this domain.
 	 *
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index a7bf32a..58f8684 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2752,6 +2752,48 @@ asmlinkage __visible void schedule_tail(struct task_struct *prev)
 		put_user(task_pid_vnr(current), current->set_child_tid);
 }

+#ifdef CONFIG_SCHED_SMT
+
+/*
+ * From sd_llc downward update the SMT utilization.
+ * Skip the lowest level 0.
+ */
+void smt_util(struct rq *rq, int prev_busy, int n
[RFC PATCH V2] sched: Improve scalability of select_idle_sibling using SMT balance
Current select_idle_sibling first tries to find a fully idle core using select_idle_core which can potentially search all cores and if it fails it finds any idle cpu using select_idle_cpu. select_idle_cpu can potentially search all cpus in the llc domain. This doesn't scale for large llc domains and will only get worse with more cores in future.

This patch solves the scalability problem of potentially searching all cores or cpus by using a randomized approach called SMT balance. It maintains a utilization of the SMTs per scheduling group based on the number of busy CPUs in the group (henceforth referred to as SMT utilization). This is accounted at the time of context switch. The SMT utilization is maintained only for levels below the LLC domain, so only for cores. During context switch each cpu of a core atomically increments or decrements the SMT utilization variable of that core depending on whether the cpu is going busy or idle. Since the atomic variable is per core, it will be updated only by the hyperthreads in a core with minimal contention. The cost of context_switch() function below with this patch confirms that the overhead is minimal and mostly within noise margin.

In the fast path of wakeup the scheduler compares the target cpu group of select_idle_sibling with a random group in the LLC domain w.r.t SMT utilization to determine a better core to schedule. It chooses the core which has more SMT capacity (idle cpus) left. The SMT capacity is computed simply by subtracting SMT utilization from group weight. This comparison can be done in O(1). Finally it does an idle cpu search only in that core starting from a random cpu index. The random number generation needs to be fast and uses a per cpu pseudo random number generator.

Following are the numbers with various benchmarks on a x86 2 socket system with 22 cores per socket and 2 hyperthreads per core:

hackbench process:
groups  baseline-rc6(avg)  %stdev  patch(avg)       %stdev
1       0.4797             15.75   0.4324 (+9.86%)  2.23
2       0.4877             9.99    0.4535 (+7.01%)  3.36
4       0.8603             1.09    0.8376 (+2.64%)  0.95
8       1.496              0.60    1.4516 (+2.97%)  1.38
16      2.6642             0.37    2.5857 (+2.95%)  0.68
32      4.6715             0.40    4.5158 (+3.33%)  0.67

uperf pingpong throughput with loopback interface and message size = 8k:
threads  baseline-rc6(avg)  %stdev  patch(avg)        %stdev
8        49.47              0.35    51.16 (+3.42%)    0.53
16       95.28              0.77    101.02 (+6.03%)   0.43
32       156.77             1.17    181.52 (+15.79%)  0.96
48       193.24             0.22    212.90 (+10.17%)  0.45
64       216.21             9.33    264.14 (+22.17%)  0.69
128      379.62             10.29   416.36 (+9.68%)   1.04

Oracle DB TPC-C throughput normalized to baseline:
users  baseline-rc6 norm(avg)  %stdev  patch norm(avg)  %stdev
20     1                       0.94    1.0071 (+0.71%)  1.03
40     1                       0.82    1.0126 (+1.26%)  0.65
60     1                       1.10    0.9928 (-0.72%)  0.67
80     1                       0.63    1.003 (+0.30%)   0.64
100    1                       0.82    0.9957 (-0.43%)  0.15
120    1                       0.46    1.0034 (+0.34%)  1.74
140    1                       1.44    1.0247 (+2.47%)  0.15
160    1                       0.85    1.0445 (+4.45%)  0.81
180    1                       0.19    1.0382 (+3.82%)  0.57
200    1                       1.40    1.0295 (+2.95%)  0.94
220    1                       1.02    1.0242 (+2.42%)  0.85

Following are the cost (in us) of context_switch() and select_idle_sibling() with hackbench 16 groups:
function               baseline-rc6  %stdev  patch              %stdev
context_switch()       663.8799      4.46    687.4068 (+3.54%)  2.85
select_idle_sibling()  0.556         1.72    0.263 (-52.70%)    0.78

Signed-off-by: subhra mazumdar
---
 include/linux/sched/topology.h |   2 +
 kernel/sched/core.c            |  38 +++
 kernel/sched/fair.c            | 245 -
 kernel/sched/idle_task.c       |   1 -
 kernel/sched/sched.h           |  26 ++---
 kernel/sched/topology.c        |  35 +-
 6 files changed, 197 insertions(+), 150 deletions(-)

diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index 7d065ab..cd1f129 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -146,6 +146,8 @@ struct sched_domain {
 	struct sched_domain_shared *shared;
 	unsigned int span_weight;
+	struct sched_group **sg_array;
+	int sg_num;
 	/*
 	 * Span of all CPUs in this domain.
 	 *
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index d17c5da..805451b 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2743,6 +2743,18 @@ asmlinkage __visible void schedule_tail(struct task_struct *prev)
 		put_user(task_pi
Re: [RFC 00/60] Coscheduling for Linux
On 09/24/2018 08:43 AM, Jan H. Schönherr wrote:
On 09/19/2018 11:53 PM, Subhra Mazumdar wrote:

Can we have a more generic interface, like specifying a set of task ids to be co-scheduled with a particular level, rather than tying this with cgroups? KVMs may not always run with cgroups, and there might be other use cases where we might want co-scheduling that doesn't relate to cgroups.

Currently: no. At this point the implementation is tightly coupled to the cpu cgroup controller. This *might* change if the task group optimizations mentioned in other parts of this e-mail thread are done, as I think that would decouple the various mechanisms.

That said, what if you were able to disable the "group-based fairness" aspect of the cpu cgroup controller? Then you would be able to control just the coscheduling aspects on their own. Would that satisfy the use case you have in mind?

Regards
Jan

Yes, that would satisfy the use case. We wish to experiment at some point with co-scheduling of certain worker threads in DB parallel query and see if there is any benefit.

Thanks,
Subhra
Re: [RFC 00/60] Coscheduling for Linux
On 09/26/2018 02:58 AM, Jan H. Schönherr wrote:
On 09/17/2018 02:25 PM, Peter Zijlstra wrote:
On Fri, Sep 14, 2018 at 06:25:44PM +0200, Jan H. Schönherr wrote:

Assuming there is a cgroup-less solution that can prevent simultaneous execution of tasks on a core when they're not supposed to, how would you tell the scheduler which tasks these are?

Specifically for L1TF I hooked into/extended KVM's preempt_notifier registration interface, which tells us which tasks are VCPUs and to which VM they belong. But if we want to actually expose this to userspace, we can either do a prctl() or extend struct sched_attr.

Both Peter and Subhra seem to prefer an interface different from cgroups to specify what to coschedule. Can you provide some extra motivation for me, why you feel that way? (Ignoring the current scalability issues with the cpu group controller.)

After all, cgroups were designed to create arbitrary groups of tasks and to attach functionality to those groups. If we were to introduce a different interface to control that, we'd need to introduce a whole new group concept, so that you make tasks part of some group while at the same time preventing unauthorized tasks from joining a group. I currently don't see any wins, just a loss in flexibility.

Regards
Jan

I think cgroups will get the job done for any use case. But we have, e.g., affinity control via both sched_setaffinity and cgroup cpusets. It would be good to have an alternative way to specify co-scheduling too, for those who don't want to use cgroups for some reason. It can be added later on, though; only how one will override the other will need to be sorted out.
[PATCH 5/5] sched: SIS_CORE to disable idle core search
Use SIS_CORE to disable idle core search. For some workloads select_idle_core becomes a scalability bottleneck, removing it improves throughput. Also there are workloads where disabling it can hurt latency, so need to have an option. Signed-off-by: subhra mazumdar --- kernel/sched/fair.c | 8 +--- 1 file changed, 5 insertions(+), 3 deletions(-) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 574cb14..33fbc47 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -6464,9 +6464,11 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target) if (!sd) return target; - i = select_idle_core(p, sd, target); - if ((unsigned)i < nr_cpumask_bits) - return i; + if (sched_feat(SIS_CORE)) { + i = select_idle_core(p, sd, target); + if ((unsigned)i < nr_cpumask_bits) + return i; + } i = select_idle_cpu(p, sd, target); if ((unsigned)i < nr_cpumask_bits) -- 2.9.3
[PATCH 4/5] sched: add sched feature to disable idle core search
Add a new sched feature SIS_CORE to have an option to disable idle core search (select_idle_core). Signed-off-by: subhra mazumdar --- kernel/sched/features.h | 1 + 1 file changed, 1 insertion(+) diff --git a/kernel/sched/features.h b/kernel/sched/features.h index 85ae848..de15733 100644 --- a/kernel/sched/features.h +++ b/kernel/sched/features.h @@ -57,6 +57,7 @@ SCHED_FEAT(TTWU_QUEUE, true) */ SCHED_FEAT(SIS_AVG_CPU, false) SCHED_FEAT(SIS_PROP, true) +SCHED_FEAT(SIS_CORE, true) /* * Issue a WARN when we do multiple update_rq_clock() calls -- 2.9.3
[PATCH 1/5] sched: limit cpu search in select_idle_cpu
Put upper and lower limit on cpu search of select_idle_cpu. The lower limit is amount of cpus in a core while upper limit is twice that. This ensures for any architecture we will usually search beyond a core. The upper limit also helps in keeping the search cost low and constant. Signed-off-by: subhra mazumdar --- kernel/sched/fair.c | 15 +++ 1 file changed, 11 insertions(+), 4 deletions(-) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index e497c05..7243146 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -6372,7 +6372,7 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t u64 avg_cost, avg_idle; u64 time, cost; s64 delta; - int cpu, nr = INT_MAX; + int cpu, limit, floor, nr = INT_MAX; this_sd = rcu_dereference(*this_cpu_ptr(&sd_llc)); if (!this_sd) @@ -6390,10 +6390,17 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t if (sched_feat(SIS_PROP)) { u64 span_avg = sd->span_weight * avg_idle; - if (span_avg > 4*avg_cost) + floor = cpumask_weight(topology_sibling_cpumask(target)); + if (floor < 2) + floor = 2; + limit = 2*floor; + if (span_avg > floor*avg_cost) { nr = div_u64(span_avg, avg_cost); - else - nr = 4; + if (nr > limit) + nr = limit; + } else { + nr = floor; + } } time = local_clock(); -- 2.9.3
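To illustrate the clamping arithmetic in the patch above with concrete numbers (a standalone sketch, not kernel code): on an SMT-2 part the floor is 2 and the limit is 4, so the scan budget always stays between one and two cores' worth of CPUs.

/* Sketch of the nr clamping done in select_idle_cpu() by the patch above. */
#include <stdio.h>
#include <stdint.h>

static int scan_budget(uint64_t span_avg, uint64_t avg_cost, int smt_weight)
{
	int floor = smt_weight < 2 ? 2 : smt_weight;
	int limit = 2 * floor;
	int nr;

	if (span_avg > (uint64_t)floor * avg_cost) {
		nr = (int)(span_avg / avg_cost);
		if (nr > limit)
			nr = limit;
	} else {
		nr = floor;
	}
	return nr;
}

int main(void)
{
	/* SMT-2 core: the budget is clamped to the range [2, 4] cpus. */
	printf("%d\n", scan_budget(1000, 100, 2));	/* ratio 10, clamped to 4 */
	printf("%d\n", scan_budget(150, 100, 2));	/* below floor*cost, so 2 */
	return 0;
}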
[PATCH 3/5] sched: rotate the cpu search window for better spread
Rotate the cpu search window for better spread of threads. This will ensure an idle cpu will quickly be found if one exists. Signed-off-by: subhra mazumdar --- kernel/sched/fair.c | 10 -- 1 file changed, 8 insertions(+), 2 deletions(-) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 7243146..574cb14 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -6372,7 +6372,7 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t u64 avg_cost, avg_idle; u64 time, cost; s64 delta; - int cpu, limit, floor, nr = INT_MAX; + int cpu, limit, floor, target_tmp, nr = INT_MAX; this_sd = rcu_dereference(*this_cpu_ptr(&sd_llc)); if (!this_sd) @@ -6403,9 +6403,15 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t } } + if (per_cpu(next_cpu, target) != -1) + target_tmp = per_cpu(next_cpu, target); + else + target_tmp = target; + time = local_clock(); - for_each_cpu_wrap(cpu, sched_domain_span(sd), target) { + for_each_cpu_wrap(cpu, sched_domain_span(sd), target_tmp) { + per_cpu(next_cpu, target) = cpu; if (!--nr) return -1; if (!cpumask_test_cpu(cpu, &p->cpus_allowed)) -- 2.9.3
[PATCH 2/5] sched: introduce per-cpu var next_cpu to track search limit
Introduce a per-cpu variable to track the limit upto which idle cpu search was done in select_idle_cpu(). This will help to start the search next time from there. This is necessary for rotating the search window over entire LLC domain. Signed-off-by: subhra mazumdar --- kernel/sched/core.c | 2 ++ kernel/sched/sched.h | 1 + 2 files changed, 3 insertions(+) diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 8d59b25..b3e4ec1 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -20,6 +20,7 @@ #include DEFINE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues); +DEFINE_PER_CPU_SHARED_ALIGNED(int, next_cpu); #if defined(CONFIG_SCHED_DEBUG) && defined(HAVE_JUMP_LABEL) /* @@ -5996,6 +5997,7 @@ void __init sched_init(void) for_each_possible_cpu(i) { struct rq *rq; + per_cpu(next_cpu, i) = -1; rq = cpu_rq(i); raw_spin_lock_init(&rq->lock); rq->nr_running = 0; diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index 67702b4..eb12b50 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -912,6 +912,7 @@ static inline void update_idle_core(struct rq *rq) { } #endif DECLARE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues); +DECLARE_PER_CPU_SHARED_ALIGNED(int, next_cpu); #define cpu_rq(cpu)(&per_cpu(runqueues, (cpu))) #define this_rq() this_cpu_ptr(&runqueues) -- 2.9.3
[RESEND RFC/RFT V2 PATCH 0/5] Improve scheduler scalability for fast path
(1.07%) 1.05

Following are the schbench performance numbers with SIS_CORE false and SIS_PROP false. This recovers the latency increase seen with SIS_CORE false alone.

Schbench on 2 socket, 44 core and 88 threads Intel x86 machine with 44 tasks (lower is better):
percentile  baseline  %stdev  patch              %stdev
50          94        2.82    93.33 (0.71%)      1.24
75          124       2.13    122.67 (1.08%)     1.7
90          152       1.74    149.33 (1.75%)     2.35
95          171       2.11    167 (2.34%)        2.74
99          512.67    104.96  206 (59.82%)       8.86
99.5        2296      82.55   3121.67 (-35.96%)  97.37
99.9        12517.33  2.38    12592 (-0.6%)      1.67

Changes since v1
- Compute the upper and lower limit based on number of cpus in a core
- Split up the search limit and search window rotation into separate patches
- Add new sched feature to have option of disabling idle core search

subhra mazumdar (5):
  sched: limit cpu search in select_idle_cpu
  sched: introduce per-cpu var next_cpu to track search limit
  sched: rotate the cpu search window for better spread
  sched: add sched feature to disable idle core search
  sched: SIS_CORE to disable idle core search

 kernel/sched/core.c     |  2 ++
 kernel/sched/fair.c     | 31 +++
 kernel/sched/features.h |  1 +
 kernel/sched/sched.h    |  1 +
 4 files changed, 27 insertions(+), 8 deletions(-)

--
2.9.3
Re: [PATCH 1/3] sched: remove select_idle_core() for scalability
On 05/29/2018 02:36 PM, Peter Zijlstra wrote:
On Wed, May 02, 2018 at 02:58:42PM -0700, Subhra Mazumdar wrote:

I re-ran the test after fixing that bug but still get similar regressions for hackbench.

Hackbench process on 2 socket, 44 core and 88 threads Intel x86 machine (lower is better):
groups  baseline  %stdev  patch             %stdev
1       0.5742    21.13   0.5131 (10.64%)   4.11
2       0.5776    7.87    0.5387 (6.73%)    2.39
4       0.9578    1.12    1.0549 (-10.14%)  0.85
8       1.7018    1.35    1.8516 (-8.8%)    1.56
16      2.9955    1.36    3.2466 (-8.38%)   0.42
32      5.4354    0.59    5.7738 (-6.23%)   0.38

On my IVB-EP (2 socket, 10 core/socket, 2 threads/core):

bench:
  perf stat --null --repeat 10 -- perf bench sched messaging -g $i -t -l 1 2>&1 | grep "seconds time elapsed"

config + results:

ORIG (SIS_PROP, shift=9)
 1:  0.557325175 seconds time elapsed ( +- 0.83% )
 2:  0.620646551 seconds time elapsed ( +- 1.46% )
 5:  2.313514786 seconds time elapsed ( +- 2.11% )
10:  3.796233615 seconds time elapsed ( +- 1.57% )
20:  6.319403172 seconds time elapsed ( +- 1.61% )
40:  9.313219134 seconds time elapsed ( +- 1.03% )

PROP+AGE+ONCE shift=0
 1:  0.559497993 seconds time elapsed ( +- 0.55% )
 2:  0.631549599 seconds time elapsed ( +- 1.73% )
 5:  2.195464815 seconds time elapsed ( +- 1.77% )
10:  3.703455811 seconds time elapsed ( +- 1.30% )
20:  6.440869566 seconds time elapsed ( +- 1.23% )
40:  9.537849253 seconds time elapsed ( +- 2.00% )

FOLD+AGE+ONCE+PONIES shift=0
 1:  0.558893325 seconds time elapsed ( +- 0.98% )
 2:  0.617426276 seconds time elapsed ( +- 1.07% )
 5:  2.342727231 seconds time elapsed ( +- 1.34% )
10:  3.850449091 seconds time elapsed ( +- 1.07% )
20:  6.622412262 seconds time elapsed ( +- 0.85% )
40:  9.487138039 seconds time elapsed ( +- 2.88% )

FOLD+AGE+ONCE+PONIES+PONIES2 shift=0
10:  3.695294317 seconds time elapsed ( +- 1.21% )

Which seems to not hurt anymore.. can you confirm? Also, I didn't run anything other than hackbench on it so far. (Sorry, the code is a right mess; it's what I ended up with after a day of poking with no cleanups.)

I tested with FOLD+AGE+ONCE+PONIES+PONIES2 shift=0 vs baseline but see some regression for hackbench and uperf:

hackbench
                BL      stdev%  test    stdev%  %gain
1(40 tasks)     0.5816  8.94    0.5607  2.89    3.593535
2(80 tasks)     0.6428  10.64   0.5984  3.38    6.907280
4(160 tasks)    1.0152  1.99    1.0036  2.03    1.142631
8(320 tasks)    1.8128  1.40    1.7931  0.97    1.086716
16(640 tasks)   3.1666  0.80    3.2332  0.48    -2.103207
32(1280 tasks)  5.6084  0.83    5.8489  0.56    -4.288210

Uperf
             BL      stdev%  test    stdev%  %gain
8 threads    45.36   0.43    45.16   0.49    -0.433536
16 threads   87.81   0.82    88.6    0.38    0.899669
32 threads   151.18  0.01    149.98  0.04    -0.795925
48 threads   190.19  0.21    184.77  0.23    -2.849681
64 threads   190.42  0.35    183.78  0.08    -3.485217
128 threads  323.85  0.27    266.32  0.68    -17.766089

sysbench
             BL        stdev%  test      stdev%  %gain
8 threads    2095.44   1.82    2102.63   0.29    0.343006
16 threads   4218.44   0.06    4179.82   0.49    -0.915413
32 threads   7531.36   0.48    7744.72   0.13    2.832912
48 threads   10206.42  0.20    10144.65  0.19    -0.605163
64 threads   12053.72  0.09    11784.38  0.32    -2.234547
128 threads  14810.33  0.04    14741.78  0.16    -0.462867

I have a patch which is much smaller but seems to work well so far for both x86 and SPARC across the benchmarks I have run. It keeps the idle cpu search between one core and two cores' worth of cpus and also adds a new sched feature to enable or disable the idle core search. It can be on by default, but for workloads (like Oracle DB on x86) we can turn it off. I plan to send that after some more testing.
[PATCH 5/5] sched: SIS_CORE to disable idle core search
Use SIS_CORE to disable idle core search. For some workloads select_idle_core becomes a scalability bottleneck, removing it improves throughput. Also there are workloads where disabling it can hurt latency, so need to have an option. Signed-off-by: subhra mazumdar --- kernel/sched/fair.c | 8 +--- 1 file changed, 5 insertions(+), 3 deletions(-) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 849c7c8..35a076e 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -6464,9 +6464,11 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target) if (!sd) return target; - i = select_idle_core(p, sd, target); - if ((unsigned)i < nr_cpumask_bits) - return i; + if (sched_feat(SIS_CORE)) { + i = select_idle_core(p, sd, target); + if ((unsigned)i < nr_cpumask_bits) + return i; + } i = select_idle_cpu(p, sd, target); if ((unsigned)i < nr_cpumask_bits) -- 2.9.3
[RFC/RFT V2 PATCH 0/5] Improve scheduler scalability for fast path
(1.07%) 1.05

Following are the schbench performance numbers with SIS_CORE false and SIS_PROP false. This recovers the latency increase seen with SIS_CORE false alone.

Schbench on 2 socket, 44 core and 88 threads Intel x86 machine with 44 tasks (lower is better):
percentile  baseline  %stdev  patch              %stdev
50          94        2.82    93.33 (0.71%)      1.24
75          124       2.13    122.67 (1.08%)     1.7
90          152       1.74    149.33 (1.75%)     2.35
95          171       2.11    167 (2.34%)        2.74
99          512.67    104.96  206 (59.82%)       8.86
99.5        2296      82.55   3121.67 (-35.96%)  97.37
99.9        12517.33  2.38    12592 (-0.6%)      1.67

Changes since v1
- Compute the upper and lower limit based on number of cpus in a core
- Split up the search limit and search window rotation into separate patches
- Add new sched feature to have option of disabling idle core search

subhra mazumdar (5):
  sched: limit cpu search in select_idle_cpu
  sched: introduce per-cpu var next_cpu to track search limit
  sched: rotate the cpu search window for better spread
  sched: add sched feature to disable idle core search
  sched: SIS_CORE to disable idle core search

 kernel/sched/core.c     |  2 ++
 kernel/sched/fair.c     | 31 +++
 kernel/sched/features.h |  1 +
 kernel/sched/sched.h    |  1 +
 4 files changed, 27 insertions(+), 8 deletions(-)

--
2.9.3
[PATCH 1/5] sched: limit cpu search in select_idle_cpu
Put upper and lower limit on cpu search of select_idle_cpu. The lower limit is amount of cpus in a core while upper limit is twice that. This ensures for any architecture we will usually search beyond a core. The upper limit also helps in keeping the search cost low and constant. Signed-off-by: subhra mazumdar --- kernel/sched/fair.c | 15 +++ 1 file changed, 11 insertions(+), 4 deletions(-) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index e497c05..9a6d28d 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -6372,7 +6372,7 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t u64 avg_cost, avg_idle; u64 time, cost; s64 delta; - int cpu, nr = INT_MAX; + int cpu, limit, floor, nr = INT_MAX; this_sd = rcu_dereference(*this_cpu_ptr(&sd_llc)); if (!this_sd) @@ -6389,11 +6389,18 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t return -1; if (sched_feat(SIS_PROP)) { + floor = cpumask_weight(topology_sibling_cpumask(target)); + if (floor < 2) + floor = 2; + limit = 2*floor; u64 span_avg = sd->span_weight * avg_idle; - if (span_avg > 4*avg_cost) + if (span_avg > floor*avg_cost) { nr = div_u64(span_avg, avg_cost); - else - nr = 4; + if (nr > limit) + nr = limit; + } else { + nr = floor; + } } time = local_clock(); -- 2.9.3
[PATCH 2/5] sched: introduce per-cpu var next_cpu to track search limit
Introduce a per-cpu variable to track the limit upto which idle cpu search was done in select_idle_cpu(). This will help to start the search next time from there. This is necessary for rotating the search window over entire LLC domain. Signed-off-by: subhra mazumdar --- kernel/sched/core.c | 2 ++ kernel/sched/sched.h | 1 + 2 files changed, 3 insertions(+) diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 8d59b25..b3e4ec1 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -20,6 +20,7 @@ #include DEFINE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues); +DEFINE_PER_CPU_SHARED_ALIGNED(int, next_cpu); #if defined(CONFIG_SCHED_DEBUG) && defined(HAVE_JUMP_LABEL) /* @@ -5996,6 +5997,7 @@ void __init sched_init(void) for_each_possible_cpu(i) { struct rq *rq; + per_cpu(next_cpu, i) = -1; rq = cpu_rq(i); raw_spin_lock_init(&rq->lock); rq->nr_running = 0; diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index 67702b4..eb12b50 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -912,6 +912,7 @@ static inline void update_idle_core(struct rq *rq) { } #endif DECLARE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues); +DECLARE_PER_CPU_SHARED_ALIGNED(int, next_cpu); #define cpu_rq(cpu)(&per_cpu(runqueues, (cpu))) #define this_rq() this_cpu_ptr(&runqueues) -- 2.9.3
[PATCH 3/5] sched: rotate the cpu search window for better spread
Rotate the cpu search window for better spread of threads. This will ensure an idle cpu will quickly be found if one exists. Signed-off-by: subhra mazumdar --- kernel/sched/fair.c | 10 -- 1 file changed, 8 insertions(+), 2 deletions(-) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 9a6d28d..849c7c8 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -6372,7 +6372,7 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t u64 avg_cost, avg_idle; u64 time, cost; s64 delta; - int cpu, limit, floor, nr = INT_MAX; + int cpu, limit, floor, target_tmp, nr = INT_MAX; this_sd = rcu_dereference(*this_cpu_ptr(&sd_llc)); if (!this_sd) @@ -6403,9 +6403,15 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t } } + if (per_cpu(next_cpu, target) != -1) + target_tmp = per_cpu(next_cpu, target); + else + target_tmp = target; + time = local_clock(); - for_each_cpu_wrap(cpu, sched_domain_span(sd), target) { + for_each_cpu_wrap(cpu, sched_domain_span(sd), target_tmp) { + per_cpu(next_cpu, target) = cpu; if (!--nr) return -1; if (!cpumask_test_cpu(cpu, &p->cpus_allowed)) -- 2.9.3
[PATCH 4/5] sched: add sched feature to disable idle core search
Add a new sched feature SIS_CORE to have an option to disable idle core search (select_idle_core). Signed-off-by: subhra mazumdar --- kernel/sched/features.h | 1 + 1 file changed, 1 insertion(+) diff --git a/kernel/sched/features.h b/kernel/sched/features.h index 85ae848..de15733 100644 --- a/kernel/sched/features.h +++ b/kernel/sched/features.h @@ -57,6 +57,7 @@ SCHED_FEAT(TTWU_QUEUE, true) */ SCHED_FEAT(SIS_AVG_CPU, false) SCHED_FEAT(SIS_PROP, true) +SCHED_FEAT(SIS_CORE, true) /* * Issue a WARN when we do multiple update_rq_clock() calls -- 2.9.3
Re: [PATCH 1/5] sched: limit cpu search in select_idle_cpu
On 06/12/2018 01:33 PM, kbuild test robot wrote: Hi subhra, Thank you for the patch! Perhaps something to improve: [auto build test WARNING on tip/sched/core] [also build test WARNING on v4.17 next-20180612] [if your patch is applied to the wrong git tree, please drop us a note to help improve the system] url: https://github.com/0day-ci/linux/commits/subhra-mazumdar/Improve-scheduler-scalability-for-fast-path/20180613-015158 config: i386-randconfig-x070-201823 (attached as .config) compiler: gcc-7 (Debian 7.3.0-16) 7.3.0 reproduce: # save the attached .config to linux build tree make ARCH=i386 All warnings (new ones prefixed by >>): kernel/sched/fair.c: In function 'select_idle_cpu': kernel/sched/fair.c:6396:3: warning: ISO C90 forbids mixed declarations and code [-Wdeclaration-after-statement] u64 span_avg = sd->span_weight * avg_idle; ^~~ I fixed this patch, please try the following ---8<--- diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index e497c05..7243146 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -6372,7 +6372,7 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t u64 avg_cost, avg_idle; u64 time, cost; s64 delta; - int cpu, nr = INT_MAX; + int cpu, limit, floor, nr = INT_MAX; this_sd = rcu_dereference(*this_cpu_ptr(&sd_llc)); if (!this_sd) @@ -6390,10 +6390,17 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t if (sched_feat(SIS_PROP)) { u64 span_avg = sd->span_weight * avg_idle; - if (span_avg > 4*avg_cost) + floor = cpumask_weight(topology_sibling_cpumask(target)); + if (floor < 2) + floor = 2; + limit = 2*floor; + if (span_avg > floor*avg_cost) { nr = div_u64(span_avg, avg_cost); - else - nr = 4; + if (nr > limit) + nr = limit; + } else { + nr = floor; + } } time = local_clock();
Gang scheduling
Hi,

I was following the Coscheduling patch discussion on lkml and Peter mentioned he had a patch series. I found the following on github:

https://github.com/pdxChen/gang/commits/sched_1.23-loadbal

I would like to test this with KVMs. Are the commits from 38d5acb to f019876 sufficient? Also, is there any documentation on how to use it (any knobs I need to turn on for gang scheduling to happen?) or is it enabled by default for KVMs?

Thanks,
Subhra
Re: [RFC 00/60] Coscheduling for Linux
On 10/26/18 4:44 PM, Jan H. Schönherr wrote: On 19/10/2018 02.26, Subhra Mazumdar wrote: Hi Jan, Hi. Sorry for the delay. On 9/7/18 2:39 PM, Jan H. Schönherr wrote: The collective context switch from one coscheduled set of tasks to another -- while fast -- is not atomic. If a use-case needs the absolute guarantee that all tasks of the previous set have stopped executing before any task of the next set starts executing, an additional hand-shake/barrier needs to be added. Do you know how much is the delay? i.e what is overlap time when a thread of new group starts executing on one HT while there is still thread of another group running on the other HT? The delay is roughly equivalent to the IPI latency, if we're just talking about coscheduling at SMT level: one sibling decides to schedule another group, sends an IPI to the other sibling(s), and may already start executing a task of that other group, before the IPI is received on the other end. Can you point to where the leader is sending the IPI to other siblings? I did some experiment and delay seems to be sub microsec. I ran 2 threads that are just looping in one cosched group and affinitized to the 2 HTs of a core. And another thread in a different cosched group starts running affinitized to the first HT of the same core. I time stamped just before context_switch() in __schedule() for the threads switching from one to another and one to idle. Following is what I get on cpu 1 and 45 that are siblings, cpu 1 is where the other thread preempts: [ 403.216625] cpu:45 sub1->idle:403216624579 [ 403.238623] cpu:1 sub1->sub2:403238621585 [ 403.238624] cpu:45 sub1->idle:403238621787 [ 403.260619] cpu:1 sub1->sub2:403260619182 [ 403.260620] cpu:45 sub1->idle:403260619413 [ 403.282617] cpu:1 sub1->sub2:403282617157 [ 403.282618] cpu:45 sub1->idle:403282617317 .. Not sure why the first switch on cpu to idle happened. But then onwards the difference in timestamps is less than a microsec. This is just a crude way to get a sense of the delay, may not be exact. Thanks, Subhra Now, there are some things that may delay processing an IPI, but in those cases the target CPU isn't executing user code. I've yet to produce some current numbers for SMT-only coscheduling. An older ballpark number I have is about 2 microseconds for the collective context switch of one hierarchy level, but take that with a grain of salt. Regards Jan
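For what it's worth, the overlap implied by the second pair of timestamps above can be checked directly (a trivial arithmetic check, not part of the original mail):

/* Quick check of the overlap window implied by the timestamps above. */
#include <stdio.h>

int main(void)
{
	unsigned long long cpu1_switch = 403238621585ULL;	/* cpu 1: sub1->sub2 */
	unsigned long long cpu45_idle  = 403238621787ULL;	/* cpu 45: sub1->idle */

	/* 403238621787 - 403238621585 = 202 ns, well under a microsecond. */
	printf("overlap window: %llu ns\n", cpu45_idle - cpu1_switch);
	return 0;
}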
Re: Gang scheduling
On 10/12/2018 11:01 AM, Tim Chen wrote:
On 10/10/2018 05:09 PM, Subhra Mazumdar wrote:

Hi, I was following the Coscheduling patch discussion on lkml and Peter mentioned he had a patch series. I found the following on github: https://github.com/pdxChen/gang/commits/sched_1.23-loadbal I would like to test this with KVMs. Are the commits from 38d5acb to f019876 sufficient? Also, is there any documentation on how to use it (any knobs I need to turn on for gang scheduling to happen?) or is it enabled by default for KVMs? Thanks, Subhra

I would suggest you try https://github.com/pdxChen/gang/tree/sched_1.23-base without the load balancing part of gang scheduling. It is enabled by default for KVMs. Due to the constant change in gang scheduling status of the QEMU thread depending on whether a vcpu is loaded or unloaded, the load balancing part of the code doesn't work very well.

Thanks. Does this mean each vcpu thread needs to be affinitized to a cpu?

The current version of the code needs to be optimized further. Right now the QEMU thread constantly does vcpu load and unload during VM enter and exit. We gang schedule only after vcpu load and register the thread to be gang scheduled. When we do vcpu unload, the thread is removed from the set to be gang scheduled. Each time there's a synchronization with the sibling thread, which is expensive. However, for QEMU there's a one-to-one correspondence between the QEMU thread and the vcpu, so we don't have to change the gang scheduling status for such a thread, avoiding the churn and the sync with the sibling. That should be helpful for VMs with lots of I/O causing constant VM exits. We're still working on this optimization, and the load balancing should be better after this change.

Tim

Also, FYI, I get the following error while building sched_1.23-base:

ERROR: "sched_ttwu_pending" [arch/x86/kvm/kvm-intel.ko] undefined!
scripts/Makefile.modpost:92: recipe for target '__modpost' failed

Adding the following fixed it:

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 46807dc..302b77d 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -21,6 +21,7 @@
 #include

 DEFINE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues);
+EXPORT_SYMBOL_GPL(sched_ttwu_pending);

 #if defined(CONFIG_SCHED_DEBUG) && defined(HAVE_JUMP_LABEL)
 /*
Re: [RFC PATCH v2 1/1] pipe: busy wait for pipe
On 11/5/18 2:08 AM, Mel Gorman wrote: Adding Al Viro as per get_maintainers.pl. On Tue, Sep 25, 2018 at 04:32:40PM -0700, subhra mazumdar wrote: Introduce pipe_ll_usec field for pipes that indicates the amount of micro seconds a thread should spin if pipe is empty or full before sleeping. This is similar to network sockets. Can you point to what pattern from network sockets you are duplicated exactly? One would assume it's busy_loop_current_time and busy_loop_timeout but it should be in the changelog because there are differences in polling depending on where you are in the network subsystem (which I'm not very familiar with). I was referring to the sk_busy_loop_timeout() that uses sk_ll_usec. By similar I meant having a similar mechanism for pipes to busy wait Workloads like hackbench in pipe mode benefits significantly from this by avoiding the sleep and wakeup overhead. Other similar usecases can benefit. A tunable pipe_busy_poll is introduced to enable or disable busy waiting via /proc. The value of it specifies the amount of spin in microseconds. Default value is 0 indicating no spin. Your lead mail indicates the spin was set to a "suitable spin time". How should an administrator select this spin time? What works for hackbench might not be suitable for another workload. What if the spin time selected happens to be just slightly longer than the time it takes the reader to respond? In such a case, the result would be "all spin, no gain". While networking potentially suffers the same problem, it appears to be opt-in per socket so it's up to the application not to shoot itself in the foot. Even for network, sk_ll_usec is assigned the value of the tunable sysctl_net_busy_read in sock_init_data() for all sockets initialized by default. There is way for per socket setting using sock_setsockopt(), for pipes that can be added later if needed in case of different apps running in one system. But there are cases where only one app runs (e.g big DBs) and one tunable will suffice. It can be set to a value that is tested to be beneficial under the operating conditions. Signed-off-by: subhra mazumdar --- fs/pipe.c | 12 include/linux/pipe_fs_i.h | 2 ++ kernel/sysctl.c | 7 +++ 3 files changed, 21 insertions(+) diff --git a/fs/pipe.c b/fs/pipe.c index bdc5d3c..35d805b 100644 --- a/fs/pipe.c +++ b/fs/pipe.c @@ -26,6 +26,7 @@ #include #include +#include #include "internal.h" @@ -40,6 +41,7 @@ unsigned int pipe_max_size = 1048576; */ unsigned long pipe_user_pages_hard; unsigned long pipe_user_pages_soft = PIPE_DEF_BUFFERS * INR_OPEN_CUR; +unsigned int pipe_busy_poll; /* * We use a start+len construction, which provides full use of the @@ -106,6 +108,7 @@ void pipe_double_lock(struct pipe_inode_info *pipe1, void pipe_wait(struct pipe_inode_info *pipe) { DEFINE_WAIT(wait); + u64 start; /* * Pipes are system-local resources, so sleeping on them @@ -113,6 +116,10 @@ void pipe_wait(struct pipe_inode_info *pipe) */ prepare_to_wait(&pipe->wait, &wait, TASK_INTERRUPTIBLE); pipe_unlock(pipe); + start = local_clock(); + while (current->state != TASK_RUNNING && + ((local_clock() - start) >> 10) < pipe->pipe_ll_usec) + cpu_relax(); schedule(); finish_wait(&pipe->wait, &wait); pipe_lock(pipe); Networking breaks this out better in terms of options instead of hard-coding. This does not handle need_resched or signal delivery properly where as networking does for example. I don't disable preemption, so don't think checking need_resched is needed. 
Can you point to what you mean by handling signal delivery in case of networking? Not sure what I am missing. My initial version broke it out like networking, but after Peter's suggestion I clubbed it. I don't feel strongly either way.

@@ -825,6 +832,7 @@ static int do_pipe2(int __user *fildes, int flags)
 	struct file *files[2];
 	int fd[2];
 	int error;
+	struct pipe_inode_info *pipe;

 	error = __do_pipe_flags(fd, files, flags);
 	if (!error) {
@@ -838,6 +846,10 @@ static int do_pipe2(int __user *fildes, int flags)
 			fd_install(fd[0], files[0]);
 			fd_install(fd[1], files[1]);
 		}
+		pipe = files[0]->private_data;
+		pipe->pipe_ll_usec = pipe_busy_poll;
+		pipe = files[1]->private_data;
+		pipe->pipe_ll_usec = pipe_busy_poll;
 	}
 	return error;
 }

You add a pipe field, but the value is always based on the sysctl, so the information is redundant (barring a race condition on one pipe write per sysctl update, which is an irrelevant corner case). In comparison, the network subsystem appears to
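For comparison with the networking pattern discussed above, a pipe analogue of the busy-poll timeout check (along the lines of the pipe_busy_loop_timeout() helper mentioned in the earlier RFC) might look roughly like the sketch below. It assumes the pipe_ll_usec field added by the patch; it is not code from the posted series:

/*
 * Sketch only: mirrors the shape of the net busy-poll timeout check,
 * using the per-pipe pipe_ll_usec spin budget (in microseconds).
 */
static inline bool pipe_busy_loop_timeout(struct pipe_inode_info *pipe,
					  u64 start_time)
{
	unsigned long timeout_usec = READ_ONCE(pipe->pipe_ll_usec);

	if (timeout_usec) {
		u64 end_time = start_time + timeout_usec * NSEC_PER_USEC;

		/* Stop spinning once the configured budget has elapsed. */
		return time_after64(local_clock(), end_time);
	}
	return true;
}

Keeping the timeout logic in a small helper like this would also make a later per-pipe opt-in (analogous to the per-socket setting) easier to add without touching pipe_wait() again.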
Re: [PATCH 00/10] steal tasks to improve CPU utilization
On 10/22/18 7:59 AM, Steve Sistare wrote: When a CPU has no more CFS tasks to run, and idle_balance() fails to find a task, then attempt to steal a task from an overloaded CPU in the same LLC. Maintain and use a bitmap of overloaded CPUs to efficiently identify candidates. To minimize search time, steal the first migratable task that is found when the bitmap is traversed. For fairness, search for migratable tasks on an overloaded CPU in order of next to run. This simple stealing yields a higher CPU utilization than idle_balance() alone, because the search is cheap, so it may be called every time the CPU is about to go idle. idle_balance() does more work because it searches widely for the busiest queue, so to limit its CPU consumption, it declines to search if the system is too busy. Simple stealing does not offload the globally busiest queue, but it is much better than running nothing at all. The bitmap of overloaded CPUs is a new type of sparse bitmap, designed to reduce cache contention vs the usual bitmap when many threads concurrently set, clear, and visit elements. Is the bitmask saving much? I tried a simple stealing that just starts searching the domain from the current cpu and steals a thread from the first cpu that has more than one runnable thread. It seems to perform similar to your patch. hackbench on X6-2: 2 sockets * 22 cores * 2 hyperthreads = 88 CPUs baseline %stdev patch %stdev 1(40 tasks) 0.5524 2.36 0.5522 (0.045%) 3.82 2(80 tasks) 0.6482 11.4 0.7241 (-11.7%) 20.34 4(160 tasks) 0.9756 0.95 0.8276 (15.1%) 5.8 8(320 tasks) 1.7699 1.62 1.6655 (5.9%) 1.57 16(640 tasks) 3.1018 0.77 2.9858 (3.74%) 1.4 32(1280 tasks) 5.565 0.62 5.3388 (4.1%) 0.72 X6-2: 2 sockets * 22 cores * 2 hyperthreads = 88 CPUs Oracle database OLTP, logging _enabled_ Users %speedup 20 1.2 40 -0.41 60 0.83 80 2.37 100 1.54 120 3.0 140 2.24 160 1.82 180 1.94 200 2.23 220 1.49 Below is the patch (not in best shape) --->8 diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index f808ddf..1690451 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -3540,6 +3540,7 @@ static inline unsigned long cfs_rq_load_avg(struct cfs_rq *cfs_rq) } static int idle_balance(struct rq *this_rq, struct rq_flags *rf); +static int my_idle_balance(struct rq *this_rq, struct rq_flags *rf); static inline unsigned long task_util(struct task_struct *p) { @@ -6619,6 +6620,8 @@ done: __maybe_unused; idle: new_tasks = idle_balance(rq, rf); + if (new_tasks == 0) + new_tasks = my_idle_balance(rq, rf); /* * Because idle_balance() releases (and re-acquires) rq->lock, it is @@ -8434,6 +8437,75 @@ static int should_we_balance(struct lb_env *env) return balance_cpu == env->dst_cpu; } +int get_best_cpu(int this_cpu, struct sched_domain *sd) +{ + struct rq *this_rq, *rq; + int i; + int best_cpu = -1; + + this_rq = cpu_rq(this_cpu); + for_each_cpu_wrap(i, sched_domain_span(sd), this_cpu) { + if (this_rq->nr_running > 0) + return (-1); + if (i == this_cpu) + continue; + rq = cpu_rq(i); + if (rq->nr_running <= 1) + continue; + best_cpu = i; + break; + } + return (best_cpu); +} +static int my_load_balance(int this_cpu, struct rq *this_rq, + struct sched_domain *sd, enum cpu_idle_type idle) +{ + int ld_moved = 0; + struct rq *busiest; + unsigned long flags; + struct task_struct *p = NULL; + struct cpumask *cpus = this_cpu_cpumask_var_ptr(load_balance_mask); + int best_cpu; + + struct lb_env env = { + .sd = sd, + .dst_cpu = this_cpu, + .dst_rq = this_rq, + .dst_grpmask = sched_group_span(sd->groups), + .idle = idle, + .cpus = cpus, + .tasks 
= LIST_HEAD_INIT(env.tasks), + }; + + if (idle == CPU_NEWLY_IDLE) + env.dst_grpmask = NULL; + + cpumask_and(cpus, sched_domain_span(sd), cpu_active_mask); + + best_cpu = get_best_cpu(this_cpu, sd); + + if (best_cpu >= 0) + busiest = cpu_rq(best_cpu); + else + goto out; + + env.src_cpu = busiest->cpu; + env.src_rq = busiest; + raw_spin_lock_irqsave(&busiest->lock, flags); + + p = detach_one_task(&env); + raw_spin_unlock(&busiest->lock); + if (p) { + attach_one_task(this_rq, p); + ld_moved++; + } + local_irq_restore(flags); + +out: + return ld_moved; +} + /* * Check this_cpu to ensure it is balanced within domain. Attempt to move
Re: [RFC 00/60] Coscheduling for Linux
D) What can I *not* do with this? - Besides the missing load-balancing within coscheduled task-groups, this implementation has the following properties, which might be considered short-comings. This particular implementation focuses on SCHED_OTHER tasks managed by CFS and allows coscheduling them. Interrupts as well as tasks in higher scheduling classes are currently out-of-scope: they are assumed to be negligible interruptions as far as coscheduling is concerned and they do *not* cause a preemption of a whole group. This implementation could be extended to cover higher scheduling classes. Interrupts, however, are an orthogonal issue. The collective context switch from one coscheduled set of tasks to another -- while fast -- is not atomic. If a use-case needs the absolute guarantee that all tasks of the previous set have stopped executing before any task of the next set starts executing, an additional hand-shake/barrier needs to be added. The leader doesn't kick the other cpus _immediately_ to switch to a different cosched group. So threads from previous cosched group will keep running in other HTs till their sched_slice is over (in worst case). This can still keep the window of L1TF vulnerability open?
Re: [RFC 00/60] Coscheduling for Linux
Hi Jan, On 9/7/18 2:39 PM, Jan H. Schönherr wrote: The collective context switch from one coscheduled set of tasks to another -- while fast -- is not atomic. If a use-case needs the absolute guarantee that all tasks of the previous set have stopped executing before any task of the next set starts executing, an additional hand-shake/barrier needs to be added. Do you know how much is the delay? i.e what is overlap time when a thread of new group starts executing on one HT while there is still thread of another group running on the other HT? Thanks, Subhra
Re: [RFC PATCH 2/2] pipe: use pipe busy wait
On 09/07/2018 05:25 AM, Peter Zijlstra wrote: On Thu, Aug 30, 2018 at 01:24:58PM -0700, subhra mazumdar wrote: +void pipe_busy_wait(struct pipe_inode_info *pipe) +{ + unsigned long wait_flag = pipe->pipe_wait_flag; + unsigned long start_time = pipe_busy_loop_current_time(); + + pipe_unlock(pipe); + preempt_disable(); + for (;;) { + if (pipe->pipe_wait_flag > wait_flag) { + preempt_enable(); + pipe_lock(pipe); + return; + } + if (pipe_busy_loop_timeout(pipe, start_time)) + break; + cpu_relax(); + } + preempt_enable(); + pipe_lock(pipe); + if (pipe->pipe_wait_flag > wait_flag) + return; + pipe_wait(pipe); +} + +void wake_up_busy_poll(struct pipe_inode_info *pipe) +{ + pipe->pipe_wait_flag++; +} Why not just busy wait on current->state ? A little something like: diff --git a/fs/pipe.c b/fs/pipe.c index bdc5d3c0977d..8d9f1c95ff99 100644 --- a/fs/pipe.c +++ b/fs/pipe.c @@ -106,6 +106,7 @@ void pipe_double_lock(struct pipe_inode_info *pipe1, void pipe_wait(struct pipe_inode_info *pipe) { DEFINE_WAIT(wait); + u64 start; /* * Pipes are system-local resources, so sleeping on them @@ -113,7 +114,15 @@ void pipe_wait(struct pipe_inode_info *pipe) */ prepare_to_wait(&pipe->wait, &wait, TASK_INTERRUPTIBLE); pipe_unlock(pipe); - schedule(); + + preempt_disable(); + start = local_clock(); + while (!need_resched() && current->state != TASK_RUNNING && + (local_clock() - start) < pipe->poll_usec) + cpu_relax(); + schedule_preempt_disabled(); + preempt_enable(); + finish_wait(&pipe->wait, &wait); pipe_lock(pipe); } This will make the current thread always spin and block as it itself does the state change to TASK_RUNNING in finish_wait.
Re: [RFC 00/60] Coscheduling for Linux
On 09/07/2018 02:39 PM, Jan H. Schönherr wrote: This patch series extends CFS with support for coscheduling. The implementation is versatile enough to cover many different coscheduling use-cases, while at the same time being non-intrusive, so that behavior of legacy workloads does not change. Peter Zijlstra once called coscheduling a "scalability nightmare waiting to happen". Well, with this patch series, coscheduling certainly happened. However, I disagree on the scalability nightmare. :) In the remainder of this email, you will find: A) Quickstart guide for the impatient. B) Why would I want this? C) How does it work? D) What can I *not* do with this? E) What's the overhead? F) High-level overview of the patches in this series. Regards Jan A) Quickstart guide for the impatient. -- Here is a quickstart guide to set up coscheduling at core-level for selected tasks on an SMT-capable system: 1. Apply the patch series to v4.19-rc2. 2. Compile with "CONFIG_COSCHEDULING=y". 3. Boot into the newly built kernel with an additional kernel command line argument "cosched_max_level=1" to enable coscheduling up to core-level. 4. Create one or more cgroups and set their "cpu.scheduled" to "1". 5. Put tasks into the created cgroups and set their affinity explicitly. 6. Enjoy tasks of the same group and on the same core executing simultaneously, whenever they are executed. You are not restricted to coscheduling at core-level. Just select higher numbers in steps 3 and 4. See also further below for more information, esp. when you want to try higher numbers on larger systems. Setting affinity explicitly for tasks within coscheduled cgroups is currently necessary, as the load balancing portion is still missing in this series. I don't get the affinity part. If I create two cgroups by giving them only cpu shares (no cpuset) and set their cpu.scheduled=1, will this ensure co-scheduling of each group on core level for all cores in the system? Thanks, Subhra
Re: [RFC PATCH 2/2] pipe: use pipe busy wait
On 09/17/2018 03:43 PM, Peter Zijlstra wrote: On Mon, Sep 17, 2018 at 02:05:40PM -0700, Subhra Mazumdar wrote: On 09/07/2018 05:25 AM, Peter Zijlstra wrote: Why not just busy wait on current->state ? A little something like: diff --git a/fs/pipe.c b/fs/pipe.c index bdc5d3c0977d..8d9f1c95ff99 100644 --- a/fs/pipe.c +++ b/fs/pipe.c @@ -106,6 +106,7 @@ void pipe_double_lock(struct pipe_inode_info *pipe1, void pipe_wait(struct pipe_inode_info *pipe) { DEFINE_WAIT(wait); + u64 start; /* * Pipes are system-local resources, so sleeping on them @@ -113,7 +114,15 @@ void pipe_wait(struct pipe_inode_info *pipe) */ prepare_to_wait(&pipe->wait, &wait, TASK_INTERRUPTIBLE); pipe_unlock(pipe); - schedule(); + + preempt_disable(); + start = local_clock(); + while (!need_resched() && current->state != TASK_RUNNING && + (local_clock() - start) < pipe->poll_usec) + cpu_relax(); + schedule_preempt_disabled(); + preempt_enable(); + finish_wait(&pipe->wait, &wait); pipe_lock(pipe); } This will make the current thread always spin and block as it itself does the state change to TASK_RUNNING in finish_wait. Nah, the actual wakeup will also do that state change. The one in finish_wait() is for the case where the wait condition became true without wakeup, such that we don't 'leak' the INTERRUPTIBLE state. Ok, it works. I see similar improvements with hackbench as the original patch.
Re: [PATCH 1/3] sched: remove select_idle_core() for scalability
On 04/25/2018 10:49 AM, Peter Zijlstra wrote:
On Tue, Apr 24, 2018 at 02:45:50PM -0700, Subhra Mazumdar wrote:

So what you said makes sense in theory but is not borne out by real world results. This indicates that threads of these benchmarks care more about running immediately on any idle cpu rather than spending time to find a fully idle core to run on.

But you only ran on Intel, which enumerates siblings far apart in the cpuid space. Which is not something we should rely on. So by only doing a linear scan on CPU number you will actually fill cores instead of equally spreading across cores. Worse still, by limiting the scan to _4_ you only barely even get onto a next core for SMT4 hardware, never mind SMT8.

Again, this doesn't matter for the benchmarks I ran. Most are happy to make the tradeoff on x86 (SMT2). Limiting the scan is mitigated by the fact that the scan window is rotated over all cpus, so idle cpus will be found soon.

You've not been reading well. The Intel machine you tested this on most likely doesn't suffer that problem because of the way it happens to iterate SMT threads. How does Sparc iterate its SMT siblings in cpuid space?

SPARC does sequential enumeration of siblings first, although whether the non-sequential enumeration on x86 is the reason for the improvements still needs to be confirmed by tests. I don't have a SPARC test system handy now.

Also, your benchmarks chose an unfortunate nr of threads vs topology. The 2^n thing chosen never hits the 100% core case (6,22 resp.).

So while I'm not averse to limiting the empty core search, I do feel it is important to have. Overloading cores when you don't have to is not good.

Can we have a config or a way for enabling/disabling select_idle_core?

I like Rohit's suggestion of folding select_idle_core and select_idle_cpu much better; then it stays SMT aware. Something like the completely untested patch below.

I tried both the patches you suggested, the first with merging of select_idle_core and select_idle_cpu, the second with the new way of calculating avg_idle, and finally both combined. I ran the following benchmarks for each; the merge-only patch seems to give similar improvements as my original patch for the Uperf and Oracle DB tests, but it regresses for hackbench. If we can fix this I am OK with it. I can do a run of other benchmarks after that.

I also noticed a possible bug later in the merge code. Shouldn't it be:

	if (busy < best_busy) {
		best_busy = busy;
		best_cpu = first_idle;
	}

Unfortunately I noticed it after all the runs.
merge:

Hackbench process on 2 socket, 44 core and 88 threads Intel x86 machine (lower is better):
groups  baseline  %stdev  patch             %stdev
1       0.5742    21.13   0.5099 (11.2%)    2.24
2       0.5776    7.87    0.5385 (6.77%)    3.38
4       0.9578    1.12    1.0626 (-10.94%)  1.35
8       1.7018    1.35    1.8615 (-9.38%)   0.73
16      2.9955    1.36    3.2424 (-8.24%)   0.66
32      5.4354    0.59    5.749 (-5.77%)    0.55

Uperf pingpong on 2 socket, 44 core and 88 threads Intel x86 machine with message size = 8k (higher is better):
threads baseline  %stdev  patch             %stdev
8       49.47     0.35    49.98 (1.03%)     1.36
16      95.28     0.77    97.46 (2.29%)     0.11
32      156.77    1.17    167.03 (6.54%)    1.98
48      193.24    0.22    230.96 (19.52%)   2.44
64      216.21    9.33    299.55 (38.54%)   4
128     379.62    10.29   357.87 (-5.73%)   0.85

Oracle DB on 2 socket, 44 core and 88 threads Intel x86 machine (normalized, higher is better):
users   baseline  %stdev  patch             %stdev
20      1         1.35    0.9919 (-0.81%)   0.14
40      1         0.42    0.9959 (-0.41%)   0.72
60      1         1.54    0.9872 (-1.28%)   1.27
80      1         0.58    0.9925 (-0.75%)   0.5
100     1         0.77    1.0145 (1.45%)    1.29
120     1         0.35    1.0136 (1.36%)    1.15
140     1         0.19    1.0404 (4.04%)    0.91
160     1         0.09    1.0317 (3.17%)    1.41
180     1         0.99    1.0322 (3.22%)    0.51
200     1         1.03    1.0245 (2.45%)    0.95
220     1         1.69    1.0296 (2.96%)    2.83

new avg_idle:

Hackbench process on 2 socket, 44 core and 88 threads Intel x86 machine (lower is better):
groups  baseline  %stdev  patch             %stdev
1       0.5742    21.13   0.5241 (8.73%)    8.26
2       0.5776    7.87    0.5436 (5.89%)    8.53
4       0.9578    1.12    0.989 (-3.26%)    1.9
8       1.7018    1.35    1.7568 (-3.23%)   1.22
16      2.9955    1.36    3.1119 (-3.89%)   0.92
32      5.4354    0.59    5.5889 (-2.82%)   0.64

Uperf pingpong on 2 socket, 44 core and 88 threads Intel x86 machine with message size = 8k (higher is better):
threads baseline  %stdev  patch             %stdev
8       49.47     0.35    48.11 (-2.75%)    0.29
16      95.28     0.77    93.67 (-1.68%)    0.68
32      156.77
Re: [PATCH 1/3] sched: remove select_idle_core() for scalability
On 05/01/2018 11:03 AM, Peter Zijlstra wrote: On Mon, Apr 30, 2018 at 04:38:42PM -0700, Subhra Mazumdar wrote: I also noticed a possible bug later in the merge code. Shouldn't it be:

if (busy < best_busy) {
	best_busy = busy;
	best_cpu = first_idle;
}

Uhh, quite. I did say it was completely untested, but yes.. /me dons the brown paper bag. I re-ran the test after fixing that bug but still get similar regressions for hackbench, while getting similar improvements on Uperf. I didn't re-run the Oracle DB tests but my guess is it will show similar improvement.

merge:

Hackbench process on 2 socket, 44 core and 88 threads Intel x86 machine (lower is better):
groups  baseline  %stdev  patch             %stdev
1       0.5742    21.13   0.5131 (10.64%)   4.11
2       0.5776    7.87    0.5387 (6.73%)    2.39
4       0.9578    1.12    1.0549 (-10.14%)  0.85
8       1.7018    1.35    1.8516 (-8.8%)    1.56
16      2.9955    1.36    3.2466 (-8.38%)   0.42
32      5.4354    0.59    5.7738 (-6.23%)   0.38

Uperf pingpong on 2 socket, 44 core and 88 threads Intel x86 machine with message size = 8k (higher is better):
threads baseline  %stdev  patch             %stdev
8       49.47     0.35    51.1 (3.29%)      0.13
16      95.28     0.77    98.45 (3.33%)     0.61
32      156.77    1.17    170.97 (9.06%)    5.62
48      193.24    0.22    245.89 (27.25%)   7.26
64      216.21    9.33    316.43 (46.35%)   0.37
128     379.62    10.29   337.85 (-11%)     3.68

I tried using the next_cpu technique with the merge but it didn't help. I am open to suggestions.

merge + next_cpu:

Hackbench process on 2 socket, 44 core and 88 threads Intel x86 machine (lower is better):
groups  baseline  %stdev  patch             %stdev
1       0.5742    21.13   0.5107 (11.06%)   6.35
2       0.5776    7.87    0.5917 (-2.44%)   11.16
4       0.9578    1.12    1.0761 (-12.35%)  1.1
8       1.7018    1.35    1.8748 (-10.17%)  0.8
16      2.9955    1.36    3.2419 (-8.23%)   0.43
32      5.4354    0.59    5.6958 (-4.79%)   0.58

Uperf pingpong on 2 socket, 44 core and 88 threads Intel x86 machine with message size = 8k (higher is better):
threads baseline  %stdev  patch             %stdev
8       49.47     0.35    51.65 (4.41%)     0.26
16      95.28     0.77    99.8 (4.75%)      1.1
32      156.77    1.17    168.37 (7.4%)     0.6
48      193.24    0.22    228.8 (18.4%)     1.75
64      216.21    9.33    287.11 (32.79%)   10.82
128     379.62    10.29   346.22 (-8.8%)    4.7

Finally, there was an earlier suggestion by Peter to transpose the cpu offset in select_task_rq_fair, which I had tried earlier but it also regressed on hackbench. Just wanted to mention that so we have closure on it.

transpose cpu offset in select_task_rq_fair:

Hackbench process on 2 socket, 44 core and 88 threads Intel x86 machine (lower is better):
groups  baseline  %stdev  patch             %stdev
1       0.5742    21.13   0.5251 (8.55%)    2.57
2       0.5776    7.87    0.5471 (5.28%)    11
4       0.9578    1.12    1.0148 (-5.95%)   1.97
8       1.7018    1.35    1.798 (-5.65%)    0.97
16      2.9955    1.36    3.088 (-3.09%)    2.7
32      5.4354    0.59    5.2815 (2.8%)     1.26
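Since Peter's merge patch itself is not quoted in this thread, the following is a rough sketch of the kind of combined core/CPU scan being discussed, written from scratch for illustration. The function name and structure are made up (it also redundantly revisits cores rather than skipping siblings of cores already seen); its only purpose is to show where the best_busy/best_cpu assignment mentioned above sits.

/*
 * Illustrative sketch, not the actual patch: scan the LLC once, remember the
 * least loaded core seen so far together with an idle CPU in it, and return
 * early when a fully idle core is found.
 */
static int select_idle_core_cpu_sketch(struct task_struct *p,
				       struct sched_domain *sd, int target)
{
	int best_busy = INT_MAX, best_cpu = -1;
	int core, cpu;

	for_each_cpu_wrap(core, sched_domain_span(sd), target) {
		int busy = 0, first_idle = -1;

		for_each_cpu(cpu, topology_sibling_cpumask(core)) {
			if (!idle_cpu(cpu))
				busy++;
			else if (first_idle < 0 &&
				 cpumask_test_cpu(cpu, &p->cpus_allowed))
				first_idle = cpu;
		}

		/* a fully idle core is still the best possible answer */
		if (!busy && first_idle >= 0)
			return first_idle;

		/* the assignment order discussed above */
		if (first_idle >= 0 && busy < best_busy) {
			best_busy = busy;
			best_cpu = first_idle;
		}
	}

	return best_cpu;
}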
Re: [PATCH 1/3] sched: remove select_idle_core() for scalability
On 04/24/2018 05:46 AM, Peter Zijlstra wrote: On Mon, Apr 23, 2018 at 05:41:14PM -0700, subhra mazumdar wrote: select_idle_core() can potentially search all cpus to find the fully idle core even if there is one such core. Removing this is necessary to achieve scalability in the fast path. So this removes the whole core awareness from the wakeup path; this needs far more justification. In general running on pure cores is much faster than running on threads. If you plot performance numbers there's almost always a fairly significant drop in slope at the moment when we run out of cores and start using threads. The only justification I have is that the benchmarks I ran almost all improved, most importantly our internal Oracle DB tests which we care about a lot. So what you said makes sense in theory but is not borne out by real world results. This indicates that threads of these benchmarks care more about running immediately on any idle cpu rather than spending time to find a fully idle core to run on. Also, depending on cpu enumeration, your next patch might not even leave the core scanning for idle CPUs. Now, typically on Intel systems, we first enumerate cores and then siblings, but I've seen Intel systems that don't do this and enumerate all threads together. Also other architectures are known to iterate full cores together, both s390 and Power for example do this. So by only doing a linear scan on CPU number you will actually fill cores instead of equally spreading across cores. Worse still, by limiting the scan to _4_ you only barely even get onto a next core for SMT4 hardware, never mind SMT8. Again this doesn't matter for the benchmarks I ran. Most are happy to make the tradeoff on x86 (SMT2). Limiting the scan is mitigated by the fact that the scan window is rotated over all cpus, so idle cpus will be found soon. There is also stealing by idle cpus. Also, this was an RFT, so I request this to be tested on other architectures like SMT4/SMT8. So while I'm not averse to limiting the empty core search; I do feel it is important to have. Overloading cores when you don't have to is not good. Can we have a config or a way for enabling/disabling select_idle_core?
Re: [PATCH 2/3] sched: introduce per-cpu var next_cpu to track search limit
On 04/24/2018 05:47 AM, Peter Zijlstra wrote: On Mon, Apr 23, 2018 at 05:41:15PM -0700, subhra mazumdar wrote: @@ -17,6 +17,7 @@ #include DEFINE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues); +DEFINE_PER_CPU_SHARED_ALIGNED(int, next_cpu); #if defined(CONFIG_SCHED_DEBUG) && defined(HAVE_JUMP_LABEL) /* @@ -6018,6 +6019,7 @@ void __init sched_init(void) struct rq *rq; rq = cpu_rq(i); + per_cpu(next_cpu, i) = -1; If you leave it uninitialized it'll be 0, and we can avoid that extra branch in the next patch, no? 0 can be a valid cpu id. I wanted to distinguish the first time. The branch predictor will be fully trained so will not have any cost.
Re: [PATCH 3/3] sched: limit cpu search and rotate search window for scalability
On 04/24/2018 05:48 AM, Peter Zijlstra wrote: On Mon, Apr 23, 2018 at 05:41:16PM -0700, subhra mazumdar wrote:

+	if (per_cpu(next_cpu, target) != -1)
+		target_tmp = per_cpu(next_cpu, target);
+	else
+		target_tmp = target;
+

This one; what's the point here? I want to start the search from target the first time and from next_cpu from then onwards. If this doesn't make any difference in performance I can change it. That will require re-running all the tests.
Re: [PATCH 3/3] sched: limit cpu search and rotate search window for scalability
On 04/24/2018 05:48 AM, Peter Zijlstra wrote: On Mon, Apr 23, 2018 at 05:41:16PM -0700, subhra mazumdar wrote: Lower the lower limit of the idle cpu search in select_idle_cpu() and also put an upper limit. This helps the scalability of the search by restricting the search window. Also rotating the search window with the help of next_cpu ensures any idle cpu is eventually found in case of high load. So this patch does 2 (possibly 3) things, that's not good. During testing I did first try only restricting the search window. That alone wasn't enough to give the full benefit; rotating the search window was essential to get the best results. I will break this up in the next version.
Re: [PATCH 3/3] sched: limit cpu search and rotate search window for scalability
On 04/24/2018 05:53 AM, Peter Zijlstra wrote: On Mon, Apr 23, 2018 at 05:41:16PM -0700, subhra mazumdar wrote: Lower the lower limit of the idle cpu search in select_idle_cpu() and also put an upper limit. This helps the scalability of the search by restricting the search window.

@@ -6297,15 +6297,24 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
 	if (sched_feat(SIS_PROP)) {
 		u64 span_avg = sd->span_weight * avg_idle;
-		if (span_avg > 4*avg_cost)
+		if (span_avg > 2*avg_cost) {
 			nr = div_u64(span_avg, avg_cost);
-		else
-			nr = 4;
+			if (nr > 4)
+				nr = 4;
+		} else {
+			nr = 2;
+		}
 	}

Why do you need to put a max on? Why isn't the proportional thing working as is? (is the average no good because of big variance or what) Firstly, the choice of 512 seems arbitrary. Secondly, the logic here is that the enqueuing cpu should search for up to the time in which it could get work itself. Why is that the optimal amount to search? Again, why do you need to lower the min; what's wrong with 4? The reason I picked 4 is that many laptops have 4 CPUs and desktops really want to avoid queueing if at all possible. To find the optimum upper and lower limit I varied them over many combinations. 4 and 2 gave the best results across most benchmarks.
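For reference, a minimal standalone sketch of the clamped window computed by the hunk above, with purely illustrative numbers in the comment; the helper name is made up and this is not how the patch structures the code.

/*
 * Illustrative helper, not part of the patch: the proportional search
 * window from SIS_PROP with the proposed 4/2 upper and lower bounds.
 */
static inline int sis_prop_nr_bounded(u64 span_weight, u64 avg_idle, u64 avg_cost)
{
	u64 span_avg = span_weight * avg_idle;
	int nr;

	if (span_avg > 2 * avg_cost) {
		nr = div_u64(span_avg, avg_cost);
		if (nr > 4)
			nr = 4;		/* upper limit */
	} else {
		nr = 2;			/* lower limit */
	}

	/*
	 * e.g. span_weight = 44, avg_idle = 500, avg_cost = 400:
	 * span_avg = 22000 > 800, nr = 55, clamped down to 4 CPUs scanned.
	 */
	return nr;
}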
Re: [RFC][PATCH 00/16] sched: Core scheduling
On 2/18/19 8:56 AM, Peter Zijlstra wrote: A much 'demanded' feature: core-scheduling :-( I still hate it with a passion, and that is part of why it took a little longer than 'promised'. While this one doesn't have all the 'features' of the previous (never published) version and isn't L1TF 'complete', I tend to like the structure better (relatively speaking: I hate it slightly less). This one is sched class agnostic and therefore, in principle, doesn't horribly wreck RT (in fact, RT could 'ab'use this by setting 'task->core_cookie = task' to force-idle siblings). Now, as hinted by that, there are semi sane reasons for actually having this. Various hardware features like Intel RDT - Memory Bandwidth Allocation, work per core (due to SMT fundamentally sharing caches) and therefore grouping related tasks on a core makes it more reliable. However; whichever way around you turn this cookie; it is expensive and nasty. I am seeing the following hard lockup frequently now. Following is full kernel output: [ 5846.412296] drop_caches (8657): drop_caches: 3 [ 5846.624823] drop_caches (8658): drop_caches: 3 [ 5850.604641] hugetlbfs: oracle (8671): Using mlock ulimits for SHM_HUGETL B is deprecated [ 5962.930812] NMI watchdog: Watchdog detected hard LOCKUP on cpu 32 [ 5962.930814] Modules linked in: drbd lru_cache autofs4 cpufreq_powersave ipv6 crc_ccitt mxm_wmi iTCO_wdt iTCO_vendor_support btrfs raid6_pq zstd_compress zstd_decompress xor pcspkr i2c_i801 lpc_ich mfd_core ioatdma ixgbe dca mdio sg ipmi_ssif i2c_core ipmi_si ipmi_msghandler wmi pcc_cpufreq acpi_pad ext4 fscrypto jbd2 mbcache sd_mod ahci libahci nvme nvme_core megaraid_sas dm_mirror dm_region_hash dm_log dm_mod [ 5962.930828] CPU: 32 PID: 10333 Comm: oracle_10333_tp Not tainted 5.0.0-rc7core_sched #1 [ 5962.930828] Hardware name: Oracle Corporation ORACLE SERVER X6-2L/ASM,MOBO TRAY,2U, BIOS 39050100 08/30/2016 [ 5962.930829] RIP: 0010:try_to_wake_up+0x98/0x470 [ 5962.930830] Code: 5b 5d 41 5c 41 5d 41 5e 41 5f c3 0f 1f 44 00 00 8b 43 3c 8b 73 60 85 f6 0f 85 a6 01 00 00 8b 43 38 85 c0 74 09 f3 90 8b 43 38 <85> c0 75 f7 48 8b 43 10 a8 02 b8 00 00 00 00 0f 85 d5 01 00 00 0f [ 5962.930831] RSP: 0018:c9000f4dbcb8 EFLAGS: 0002 [ 5962.930832] RAX: 0001 RBX: 88dfb4af1680 RCX: 0041 [ 5962.930832] RDX: 0001 RSI: RDI: 88dfb4af214c [ 5962.930833] RBP: R08: 0001 R09: c9000f4dbd80 [ 5962.930833] R10: 8880 R11: ea00f0003d80 R12: 88dfb4af214c [ 5962.930834] R13: 0001 R14: 0046 R15: 0001 [ 5962.930834] FS: 7ff4fabd9ae0() GS:88dfbe28() knlGS: [ 5962.930834] CS: 0010 DS: ES: CR0: 80050033 [ 5962.930835] CR2: 000f4cc84000 CR3: 003b93d36002 CR4: 003606e0 [ 5962.930835] DR0: DR1: DR2: [ 5962.930836] DR3: DR6: fffe0ff0 DR7: 0400 [ 5962.930836] Call Trace: [ 5962.930837] ? __switch_to_asm+0x34/0x70 [ 5962.930837] ? __switch_to_asm+0x40/0x70 [ 5962.930838] ? __switch_to_asm+0x34/0x70 [ 5962.930838] autoremove_wake_function+0x11/0x50 [ 5962.930838] __wake_up_common+0x8f/0x160 [ 5962.930839] ? 
__switch_to_asm+0x40/0x70 [ 5962.930839] __wake_up_common_lock+0x7c/0xc0 [ 5962.930840] pipe_write+0x24e/0x3f0 [ 5962.930840] __vfs_write+0x127/0x1b0 [ 5962.930840] vfs_write+0xb3/0x1b0 [ 5962.930841] ksys_write+0x52/0xc0 [ 5962.930841] do_syscall_64+0x5b/0x170 [ 5962.930842] entry_SYSCALL_64_after_hwframe+0x44/0xa9 [ 5962.930842] RIP: 0033:0x3b5900e7b0 [ 5962.930843] Code: 97 20 00 31 d2 48 29 c2 64 89 11 48 83 c8 ff eb ea 90 90 90 90 90 90 90 90 90 83 3d f1 db 20 00 00 75 10 b8 01 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 31 c3 48 83 ec 08 e8 5e fa ff ff 48 89 04 24 [ 5962.930843] RSP: 002b:7ffedbcd93a8 EFLAGS: 0246 ORIG_RAX: 0001 [ 5962.930844] RAX: ffda RBX: 7ff4faa86e24 RCX: 003b5900e7b0 [ 5962.930845] RDX: 028f RSI: 7ff4faa9688e RDI: 000a [ 5962.930845] RBP: 7ffedbcd93c0 R08: 7ffedbcd9458 R09: 0020 [ 5962.930846] R10: R11: 0246 R12: 7ffedbcd9458 [ 5962.930847] R13: 7ff4faa9688e R14: 7ff4faa89cc8 R15: 7ff4faa86bd0 [ 5962.930847] Kernel panic - not syncing: Hard LOCKUP [ 5962.930848] CPU: 32 PID: 10333 Comm: oracle_10333_tp Not tainted 5.0.0-rc7core_sched #1 [ 5962.930848] Hardware name: Oracle Corporation ORACLE SERVER X6-2L/ASM,MOBO TRAY,2U, BIOS 39050100 08/30/2016 [ 5962.930849] Call Trace: [ 5962.930849] [ 5962.930849] dump_stack+0x5c/0x7b [ 5962.930850] panic+0xfe/0x2b2 [ 5962.930850] nmi_panic+0x35/0x40 [ 5962.930851] watchdog_overflow_callback+0xef/0x100 [ 5962.930851] __perf_event_overflow+0x5a/0xe0 [ 5962.930852] handle_pmi_common+0x1d1/0x280 [ 5962.930852] ? __set_pte_vaddr+0x32/0x50 [ 5962.930852] ? __set
Re: [RFC][PATCH 03/16] sched: Wrap rq::lock access
On 3/22/19 5:06 PM, Subhra Mazumdar wrote: On 3/21/19 2:20 PM, Julien Desfossez wrote: On Tue, Mar 19, 2019 at 10:31 PM Subhra Mazumdar wrote: On 3/18/19 8:41 AM, Julien Desfossez wrote: On further investigation, we could see that the contention is mostly in the way rq locks are taken. With this patchset, we lock the whole core if cpu.tag is set for at least one cgroup. Due to this, __schedule() is more or less serialized for the core and that contributes to the performance loss that we are seeing. We also saw that newidle_balance() takes a considerably long time in load_balance() due to the rq spinlock contention. Do you think it would help if the core-wide locking was only performed when absolutely needed ? Is the core wide lock primarily responsible for the regression? I ran up to patch 12 which also has the core wide lock for tagged cgroups and also calls newidle_balance() from pick_next_task(). I don't see any regression. Of course the core sched version of pick_next_task() may be doing more, but comparing with __pick_next_task() it doesn't look too horrible. I gathered some data with only 1 DB instance running (which also has a 52% slowdown). Following are the numbers of pick_next_task() calls and their avg cost for patch 12 and patch 15. The total number of calls seems to be similar but the avg cost (in us) has more than doubled. For both patches I had put the DB instance into a cpu tagged cgroup.

                          patch12         patch15
count pick_next_task      62317898        58925395
avg cost pick_next_task   0.6566323209    1.4223810108
Re: [RFC][PATCH 03/16] sched: Wrap rq::lock access
On 3/29/19 3:23 PM, Subhra Mazumdar wrote: On 3/29/19 6:35 AM, Julien Desfossez wrote: On Fri, Mar 22, 2019 at 8:09 PM Subhra Mazumdar wrote: Is the core wide lock primarily responsible for the regression? I ran up to patch 12 which also has the core wide lock for tagged cgroups and also calls newidle_balance() from pick_next_task(). I don't see any regression. Of course the core sched version of pick_next_task() may be doing more, but comparing with __pick_next_task() it doesn't look too horrible. On further testing and investigation, we also agree that spinlock contention is not the major cause of the regression, but we feel that it should be one of the major contributing factors to this performance loss. I finally did some code bisection and found the following lines are basically responsible for the regression. Commenting them out, I don't see the regressions. Can you confirm? I have yet to figure out if this is needed for the correctness of core scheduling and, if so, whether we can do this better.

-------->8--------
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index fe3918c..3b3388a 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3741,8 +3741,8 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 				 * If there weren't no cookies; we don't need
 				 * to bother with the other siblings.
 				 */
-				if (i == cpu && !rq->core->core_cookie)
-					goto next_class;
+				//if (i == cpu && !rq->core->core_cookie)
+				//	goto next_class;
 
 				continue;
 			}

AFAICT this condition is not needed for correctness as cookie matching will still be enforced. Peter, any thoughts? I get the following numbers with 1 DB and 2 DB instances.

1 DB instance
users   baseline  %idle   core_sched  %idle
16      1         84      -5.5%       84
24      1         76      -5%         76
32      1         69      -0.45%      69

2 DB instance
users   baseline  %idle   core_sched  %idle
16      1         66      -23.8%      69
24      1         54      -3.1%       57
32      1         42      -21.1%      48
Re: [RFC][PATCH 13/16] sched: Add core wide task selection and scheduling.
On 4/9/19 11:38 AM, Julien Desfossez wrote: We found the source of the major performance regression we discussed previously. It turns out there was a pattern where a task (a kworker in this case) could be woken up, but the core could still end up idle before that task had a chance to run. Example sequence, cpu0 and cpu1 are siblings on the same core, task1 and task2 are in the same cgroup with the tag enabled (each following line happens in increasing order of time):
- task1 running on cpu0, task2 running on cpu1
- sched_waking(kworker/0, target_cpu=cpu0)
- task1 scheduled out of cpu0
- kworker/0 cannot run on cpu0 because task2 is still running on cpu1, so cpu0 is idle
- task2 scheduled out of cpu1
- cpu1 doesn't select kworker/0 for cpu0, because the optimization path ends the task selection if core_cookie is NULL for the currently selected process and cpu1's runqueue.
- cpu1 is idle
--> both siblings are idle but kworker/0 is still in the run queue of cpu0. Cpu0 may stay idle for longer if it goes into deep idle.

With the fix below, we make sure to send an IPI to the sibling if it is idle and has tasks waiting in its runqueue. This fixes the performance issue we were seeing. Now here is what we can measure with a disk write-intensive benchmark:
- no performance impact when enabling core scheduling without any tagged task,
- 5% overhead if one tagged task is competing with an untagged task,
- 10% overhead if 2 tasks tagged with different tags are competing against each other.
We are starting more scaling tests, but this is very encouraging !

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index e1fa10561279..02c862a5e973 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3779,7 +3779,22 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 				trace_printk("unconstrained pick: %s/%d %lx\n",
 					     next->comm, next->pid, next->core_cookie);
+				rq->core_pick = NULL;
+				/*
+				 * If the sibling is idling, we might want to wake it
+				 * so that it can check for any runnable but blocked tasks
+				 * due to previous task matching.
+				 */
+				for_each_cpu(j, smt_mask) {
+					struct rq *rq_j = cpu_rq(j);
+					rq_j->core_pick = NULL;
+					if (j != cpu && is_idle_task(rq_j->curr) && rq_j->nr_running) {
+						resched_curr(rq_j);
+						trace_printk("IPI(%d->%d[%d]) idle preempt\n",
+							     cpu, j, rq_j->nr_running);
+					}
+				}
 				goto done;
 			}

I see similar improvement with this patch as with removing the condition I mentioned earlier. So that's not needed. I also included the patch for the priority fix. For 2 DB instances, HT disabling stands at -22% for 32 users (from earlier emails).

1 DB instance
users   baseline  %idle   core_sched  %idle
16      1         84      -4.9%       84
24      1         76      -6.7%       75
32      1         69      -2.4%       69

2 DB instance
users   baseline  %idle   core_sched  %idle
16      1         66      -19.5%      69
24      1         54      -9.8%       57
32      1         42      -27.2%      48
Re: [RFC][PATCH 03/16] sched: Wrap rq::lock access
On 3/18/19 8:41 AM, Julien Desfossez wrote: The case where we try to acquire the lock on 2 runqueues belonging to 2 different cores requires the rq_lockp wrapper as well, otherwise we frequently deadlock in there. This fixes the crash reported in 1552577311-8218-1-git-send-email-jdesfos...@digitalocean.com

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 76fee56..71bb71f 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2078,7 +2078,7 @@ static inline void double_rq_lock(struct rq *rq1, struct rq *rq2)
 		raw_spin_lock(rq_lockp(rq1));
 		__acquire(rq2->lock);	/* Fake it out ;) */
 	} else {
-		if (rq1 < rq2) {
+		if (rq_lockp(rq1) < rq_lockp(rq2)) {
 			raw_spin_lock(rq_lockp(rq1));
 			raw_spin_lock_nested(rq_lockp(rq2), SINGLE_DEPTH_NESTING);
 		} else {

With this fix and my previous NULL pointer fix my stress tests are surviving. I re-ran my 2 DB instance setup on a 44-core, 2-socket system, putting each DB instance in a separate core scheduling group. The numbers look much worse now.

users   baseline  %stdev  %idle   core_sched  %stdev  %idle
16      1         0.3     66      -73.4%      136.8   82
24      1         1.6     54      -95.8%      133.2   81
32      1         1.5     42      -97.5%      124.3   89

I also notice that if I enable a bunch of debug configs related to mutexes, spin locks, lockdep etc. (which I did earlier to debug the deadlock), it opens up a can of worms with multiple crashes.
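To make the reasoning behind the one-liner above a bit more concrete, here is a small sketch (illustrative only, not the actual core scheduling code): once rq_lockp() can return the same lock for two different struct rq on one core, ordering the two acquisitions by the rq pointers no longer gives a single global lock order, so the comparison has to be done on the lock pointers themselves and the shared-lock case taken only once.

/*
 * Illustrative sketch: acquire two (possibly identical) runqueue locks in a
 * consistent global order to avoid ABBA deadlock.
 */
static void double_lock_sketch(raw_spinlock_t *a, raw_spinlock_t *b)
{
	if (a == b) {
		raw_spin_lock(a);	/* same underlying lock: take it once */
	} else if (a < b) {
		raw_spin_lock(a);
		raw_spin_lock_nested(b, SINGLE_DEPTH_NESTING);
	} else {
		raw_spin_lock(b);
		raw_spin_lock_nested(a, SINGLE_DEPTH_NESTING);
	}
}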
[RFC PATCH 4/9] sched: SIS_CORE to disable idle core search
Use SIS_CORE to disable idle core search. For some workloads select_idle_core becomes a scalability bottleneck; removing it improves throughput. Also there are workloads where disabling it can hurt latency, so we need to have an option.

Signed-off-by: subhra mazumdar
---
 kernel/sched/fair.c | 8 +++++---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c31082d..23ec9c6 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6268,9 +6268,11 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
 	if (!sd)
 		return target;
 
-	i = select_idle_core(p, sd, target);
-	if ((unsigned)i < nr_cpumask_bits)
-		return i;
+	if (sched_feat(SIS_CORE)) {
+		i = select_idle_core(p, sd, target);
+		if ((unsigned)i < nr_cpumask_bits)
+			return i;
+	}
 
 	i = select_idle_cpu(p, sd, target);
 	if ((unsigned)i < nr_cpumask_bits)
-- 
2.9.3
[RFC PATCH 2/9] sched: add search limit as per latency-nice
Put an upper and lower limit on the CPU search in select_idle_cpu. The lower limit is set to the number of CPUs in a core while the upper limit is derived from the latency-nice of the thread. This ensures that for any architecture we will usually search beyond a core. Changing the latency-nice value by the user will change the search cost, making it appropriate for the given workload.

Signed-off-by: subhra mazumdar
---
 kernel/sched/fair.c | 13 +++++++------
 1 file changed, 7 insertions(+), 6 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b08d00c..c31082d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6188,7 +6188,7 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
 	u64 avg_cost, avg_idle;
 	u64 time, cost;
 	s64 delta;
-	int cpu, nr = INT_MAX;
+	int cpu, floor, nr = INT_MAX;
 
 	this_sd = rcu_dereference(*this_cpu_ptr(&sd_llc));
 	if (!this_sd)
@@ -6205,11 +6205,12 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
 		return -1;
 
 	if (sched_feat(SIS_PROP)) {
-		u64 span_avg = sd->span_weight * avg_idle;
-		if (span_avg > 4*avg_cost)
-			nr = div_u64(span_avg, avg_cost);
-		else
-			nr = 4;
+		floor = cpumask_weight(topology_sibling_cpumask(target));
+		if (floor < 2)
+			floor = 2;
+		nr = (p->latency_nice * sd->span_weight) / LATENCY_NICE_MAX;
+		if (nr < floor)
+			nr = floor;
 	}
 
 	time = local_clock();
-- 
2.9.3
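To make the effect of the limits concrete, here is a worked example with illustrative numbers, assuming the LLC domain is one socket of the 44-core/88-thread test box (sd->span_weight = 44) and SMT2 (floor = 2):

  latency_nice = 5 (default):  nr = 5   * 44 / 100 = 2  -> clamped to floor, scan 2 CPUs
  latency_nice = 50:           nr = 50  * 44 / 100 = 22 -> scan up to 22 CPUs
  latency_nice = 100:          nr = 100 * 44 / 100 = 44 -> scan the whole LLC domain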
[RFC PATCH 1/9] sched,cgroup: Add interface for latency-nice
Add Cgroup interface for latency-nice. Each CPU Cgroup adds a new file "latency-nice" which is shared by all the threads in that Cgroup. Signed-off-by: subhra mazumdar --- include/linux/sched.h | 1 + kernel/sched/core.c | 40 kernel/sched/fair.c | 1 + kernel/sched/sched.h | 8 4 files changed, 50 insertions(+) diff --git a/include/linux/sched.h b/include/linux/sched.h index 1183741..b4a79c3 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -631,6 +631,7 @@ struct task_struct { int static_prio; int normal_prio; unsigned intrt_priority; + u64 latency_nice; const struct sched_class*sched_class; struct sched_entity se; diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 874c427..47969bc 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -5976,6 +5976,7 @@ void __init sched_init(void) init_dl_rq(&rq->dl); #ifdef CONFIG_FAIR_GROUP_SCHED root_task_group.shares = ROOT_TASK_GROUP_LOAD; + root_task_group.latency_nice = LATENCY_NICE_DEFAULT; INIT_LIST_HEAD(&rq->leaf_cfs_rq_list); rq->tmp_alone_branch = &rq->leaf_cfs_rq_list; /* @@ -6345,6 +6346,7 @@ static void sched_change_group(struct task_struct *tsk, int type) */ tg = container_of(task_css_check(tsk, cpu_cgrp_id, true), struct task_group, css); + tsk->latency_nice = tg->latency_nice; tg = autogroup_task_group(tsk, tg); tsk->sched_task_group = tg; @@ -6812,6 +6814,34 @@ static u64 cpu_rt_period_read_uint(struct cgroup_subsys_state *css, } #endif /* CONFIG_RT_GROUP_SCHED */ +static u64 cpu_latency_nice_read_u64(struct cgroup_subsys_state *css, +struct cftype *cft) +{ + struct task_group *tg = css_tg(css); + + return tg->latency_nice; +} + +static int cpu_latency_nice_write_u64(struct cgroup_subsys_state *css, + struct cftype *cft, u64 latency_nice) +{ + struct task_group *tg = css_tg(css); + struct css_task_iter it; + struct task_struct *p; + + if (latency_nice < LATENCY_NICE_MIN || latency_nice > LATENCY_NICE_MAX) + return -ERANGE; + + tg->latency_nice = latency_nice; + + css_task_iter_start(css, 0, &it); + while ((p = css_task_iter_next(&it))) + p->latency_nice = latency_nice; + css_task_iter_end(&it); + + return 0; +} + static struct cftype cpu_legacy_files[] = { #ifdef CONFIG_FAIR_GROUP_SCHED { @@ -6848,6 +6878,11 @@ static struct cftype cpu_legacy_files[] = { .write_u64 = cpu_rt_period_write_uint, }, #endif + { + .name = "latency-nice", + .read_u64 = cpu_latency_nice_read_u64, + .write_u64 = cpu_latency_nice_write_u64, + }, { } /* Terminate */ }; @@ -7015,6 +7050,11 @@ static struct cftype cpu_files[] = { .write = cpu_max_write, }, #endif + { + .name = "latency-nice", + .read_u64 = cpu_latency_nice_read_u64, + .write_u64 = cpu_latency_nice_write_u64, + }, { } /* terminate */ }; diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index f35930f..b08d00c 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -10479,6 +10479,7 @@ int alloc_fair_sched_group(struct task_group *tg, struct task_group *parent) goto err; tg->shares = NICE_0_LOAD; + tg->latency_nice = LATENCY_NICE_DEFAULT; init_cfs_bandwidth(tg_cfs_bandwidth(tg)); diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index b52ed1a..365c928 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -143,6 +143,13 @@ static inline void cpu_load_update_active(struct rq *this_rq) { } #define NICE_0_LOAD(1L << NICE_0_LOAD_SHIFT) /* + * Latency-nice default value + */ +#defineLATENCY_NICE_DEFAULT5 +#defineLATENCY_NICE_MIN1 +#defineLATENCY_NICE_MAX100 + +/* * Single value that decides SCHED_DEADLINE internal math precision. 
* 10 -> just above 1us * 9 -> just above 0.5us @@ -362,6 +369,7 @@ struct cfs_bandwidth { /* Task group related information */ struct task_group { struct cgroup_subsys_state css; + u64 latency_nice; #ifdef CONFIG_FAIR_GROUP_SCHED /* schedulable entities of this group on each CPU */ -- 2.9.3
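As a usage note, the per-cgroup file added above can be driven from userspace like any other cpu controller knob. A minimal sketch follows; the cgroup name and mount point are assumptions for illustration, not something defined by the patch.

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	/*
	 * Assumed layout: cgroup v1 cpu controller mounted at
	 * /sys/fs/cgroup/cpu with an existing cgroup named "oltp".
	 * Valid values are 1..100 per the range check in
	 * cpu_latency_nice_write_u64() above.
	 */
	int fd = open("/sys/fs/cgroup/cpu/oltp/cpu.latency-nice", O_WRONLY);

	if (fd < 0) {
		perror("open");
		return 1;
	}
	if (write(fd, "5", 1) != 1) {
		perror("write");
		close(fd);
		return 1;
	}
	return close(fd);
}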
[RFC PATCH 9/9] sched: rotate the cpu search window for better spread
Rotate the cpu search window for better spread of threads. This will ensure an idle cpu will quickly be found if one exists. Signed-off-by: subhra mazumdar --- kernel/sched/fair.c | 10 -- 1 file changed, 8 insertions(+), 2 deletions(-) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 94dd4a32..7419b47 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -6188,7 +6188,7 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t u64 avg_cost, avg_idle; u64 time, cost; s64 delta; - int cpu, floor, nr = INT_MAX; + int cpu, floor, target_tmp, nr = INT_MAX; this_sd = rcu_dereference(*this_cpu_ptr(&sd_llc)); if (!this_sd) @@ -6213,9 +6213,15 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t nr = floor; } + if (per_cpu(next_cpu, target) != -1) + target_tmp = per_cpu(next_cpu, target); + else + target_tmp = target; + time = local_clock(); - for_each_cpu_wrap(cpu, sched_domain_span(sd), target) { + for_each_cpu_wrap(cpu, sched_domain_span(sd), target_tmp) { + per_cpu(next_cpu, target) = cpu; if (!--nr) return -1; if (!cpumask_test_cpu(cpu, &p->cpus_allowed)) -- 2.9.3
[RFC PATCH 8/9] sched: introduce per-cpu var next_cpu to track search limit
Introduce a per-cpu variable to track the limit upto which idle cpu search was done in select_idle_cpu(). This will help to start the search next time from there. This is necessary for rotating the search window over entire LLC domain. Signed-off-by: subhra mazumdar --- kernel/sched/core.c | 2 ++ kernel/sched/sched.h | 1 + 2 files changed, 3 insertions(+) diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 47969bc..5862d54 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -24,6 +24,7 @@ #include DEFINE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues); +DEFINE_PER_CPU_SHARED_ALIGNED(int, next_cpu); #if defined(CONFIG_SCHED_DEBUG) && defined(CONFIG_JUMP_LABEL) /* @@ -5966,6 +5967,7 @@ void __init sched_init(void) for_each_possible_cpu(i) { struct rq *rq; + per_cpu(next_cpu, i) = -1; rq = cpu_rq(i); raw_spin_lock_init(&rq->lock); rq->nr_running = 0; diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index 365c928..cca2b09 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -1002,6 +1002,7 @@ static inline void update_idle_core(struct rq *rq) { } #endif DECLARE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues); +DECLARE_PER_CPU_SHARED_ALIGNED(int, next_cpu); #define cpu_rq(cpu)(&per_cpu(runqueues, (cpu))) #define this_rq() this_cpu_ptr(&runqueues) -- 2.9.3
[RFC PATCH 7/9] sched: search SMT before LLC domain
Search the SMT siblings before all CPUs in the LLC domain for an idle CPU. This helps with L1 cache locality.
---
 kernel/sched/fair.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 8856503..94dd4a32 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6274,11 +6274,11 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
 		return i;
 	}
 
-	i = select_idle_cpu(p, sd, target);
+	i = select_idle_smt(p, target);
 	if ((unsigned)i < nr_cpumask_bits)
 		return i;
 
-	i = select_idle_smt(p, target);
+	i = select_idle_cpu(p, sd, target);
 	if ((unsigned)i < nr_cpumask_bits)
 		return i;
-- 
2.9.3
[RFC PATCH 0/9] Task latency-nice
Introduce a new per-task property, latency-nice, for controlling scalability in the scheduler idle CPU search path. Valid latency-nice values are from 1 to 100, indicating 1% to 100% search of the LLC domain in select_idle_cpu. A new CPU cgroup file cpu.latency-nice is added as an interface to set and get it. All tasks in the same cgroup share the same latency-nice value. Using a lower latency-nice value can help latency intolerant tasks, e.g. very short running OLTP threads where the full LLC search cost can be significant compared to the run time of the threads. The default latency-nice value is 5.

In addition to latency-nice, it also adds a new sched feature SIS_CORE to be able to disable idle core search altogether, which is costly and hurts more than it helps for short running workloads. Finally, it also introduces a new per-cpu variable next_cpu to track the limit of the search so that every time the search starts from where it ended. This rotating search window over the cpus in the LLC domain ensures that idle cpus are eventually found in case of high load.

Uperf pingpong on 2 socket, 44 core and 88 threads Intel x86 machine with message size = 8k (higher is better):
threads  baseline  latency-nice=5,SIS_CORE  latency-nice=5,NO_SIS_CORE
8        64.66     64.38 (-0.43%)           64.79 (0.2%)
16       123.34    122.88 (-0.37%)          125.87 (2.05%)
32       215.18    215.55 (0.17%)           247.77 (15.15%)
48       278.56    321.6 (15.45%)           321.2 (15.3%)
64       259.99    319.45 (22.87%)          333.95 (28.44%)
128      431.1     437.69 (1.53%)           431.09 (0%)

subhra mazumdar (9):
  sched,cgroup: Add interface for latency-nice
  sched: add search limit as per latency-nice
  sched: add sched feature to disable idle core search
  sched: SIS_CORE to disable idle core search
  sched: Define macro for number of CPUs in core
  x86/smpboot: Optimize cpumask_weight_sibling macro for x86
  sched: search SMT before LLC domain
  sched: introduce per-cpu var next_cpu to track search limit
  sched: rotate the cpu search window for better spread

 arch/x86/include/asm/smp.h      |  1 +
 arch/x86/include/asm/topology.h |  1 +
 arch/x86/kernel/smpboot.c       | 17 -
 include/linux/sched.h           |  1 +
 include/linux/topology.h        |  4
 kernel/sched/core.c             | 42 +
 kernel/sched/fair.c             | 34 +
 kernel/sched/features.h         |  1 +
 kernel/sched/sched.h            |  9 +
 9 files changed, 97 insertions(+), 13 deletions(-)
-- 
2.9.3
[RFC PATCH 5/9] sched: Define macro for number of CPUs in core
Introduce macro topology_sibling_weight for number of sibling CPUs in a core and use in select_idle_cpu Signed-off-by: subhra mazumdar --- include/linux/topology.h | 4 kernel/sched/fair.c | 2 +- 2 files changed, 5 insertions(+), 1 deletion(-) diff --git a/include/linux/topology.h b/include/linux/topology.h index cb0775e..a85aea1 100644 --- a/include/linux/topology.h +++ b/include/linux/topology.h @@ -190,6 +190,10 @@ static inline int cpu_to_mem(int cpu) #ifndef topology_sibling_cpumask #define topology_sibling_cpumask(cpu) cpumask_of(cpu) #endif +#ifndef topology_sibling_weight +#define topology_sibling_weight(cpu) \ + cpumask_weight(topology_sibling_cpumask(cpu)) +#endif #ifndef topology_core_cpumask #define topology_core_cpumask(cpu) cpumask_of(cpu) #endif diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 23ec9c6..8856503 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -6205,7 +6205,7 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t return -1; if (sched_feat(SIS_PROP)) { - floor = cpumask_weight(topology_sibling_cpumask(target)); + floor = topology_sibling_weight(target); if (floor < 2) floor = 2; nr = (p->latency_nice * sd->span_weight) / LATENCY_NICE_MAX; -- 2.9.3
[RFC PATCH 6/9] x86/smpboot: Optimize cpumask_weight_sibling macro for x86
Use per-CPU variable for cpumask_weight_sibling macro in case of x86 for fast lookup in select_idle_cpu. This avoids reading multiple cache lines in case of systems with large numbers of CPUs where bitmask can span multiple cache lines. Even if bitmask spans only one cache line this avoids looping through it to find the number of bits and gets it in O(1). Signed-off-by: subhra mazumdar --- arch/x86/include/asm/smp.h | 1 + arch/x86/include/asm/topology.h | 1 + arch/x86/kernel/smpboot.c | 17 - 3 files changed, 18 insertions(+), 1 deletion(-) diff --git a/arch/x86/include/asm/smp.h b/arch/x86/include/asm/smp.h index da545df..1e90cbd 100644 --- a/arch/x86/include/asm/smp.h +++ b/arch/x86/include/asm/smp.h @@ -22,6 +22,7 @@ extern int smp_num_siblings; extern unsigned int num_processors; DECLARE_PER_CPU_READ_MOSTLY(cpumask_var_t, cpu_sibling_map); +DECLARE_PER_CPU_READ_MOSTLY(unsigned int, cpumask_weight_sibling); DECLARE_PER_CPU_READ_MOSTLY(cpumask_var_t, cpu_core_map); /* cpus sharing the last level cache: */ DECLARE_PER_CPU_READ_MOSTLY(cpumask_var_t, cpu_llc_shared_map); diff --git a/arch/x86/include/asm/topology.h b/arch/x86/include/asm/topology.h index 453cf38..dd19c71 100644 --- a/arch/x86/include/asm/topology.h +++ b/arch/x86/include/asm/topology.h @@ -111,6 +111,7 @@ extern const struct cpumask *cpu_coregroup_mask(int cpu); #ifdef CONFIG_SMP #define topology_core_cpumask(cpu) (per_cpu(cpu_core_map, cpu)) #define topology_sibling_cpumask(cpu) (per_cpu(cpu_sibling_map, cpu)) +#define topology_sibling_weight(cpu) (per_cpu(cpumask_weight_sibling, cpu)) extern unsigned int __max_logical_packages; #define topology_max_packages()(__max_logical_packages) diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c index 362dd89..57ad88d 100644 --- a/arch/x86/kernel/smpboot.c +++ b/arch/x86/kernel/smpboot.c @@ -85,6 +85,9 @@ DEFINE_PER_CPU_READ_MOSTLY(cpumask_var_t, cpu_sibling_map); EXPORT_PER_CPU_SYMBOL(cpu_sibling_map); +/* representing number of HT siblings of each CPU */ +DEFINE_PER_CPU_READ_MOSTLY(unsigned int, cpumask_weight_sibling); + /* representing HT and core siblings of each logical CPU */ DEFINE_PER_CPU_READ_MOSTLY(cpumask_var_t, cpu_core_map); EXPORT_PER_CPU_SYMBOL(cpu_core_map); @@ -520,6 +523,8 @@ void set_cpu_sibling_map(int cpu) if (!has_mp) { cpumask_set_cpu(cpu, topology_sibling_cpumask(cpu)); + per_cpu(cpumask_weight_sibling, cpu) = + cpumask_weight(topology_sibling_cpumask(cpu)); cpumask_set_cpu(cpu, cpu_llc_shared_mask(cpu)); cpumask_set_cpu(cpu, topology_core_cpumask(cpu)); c->booted_cores = 1; @@ -529,8 +534,13 @@ void set_cpu_sibling_map(int cpu) for_each_cpu(i, cpu_sibling_setup_mask) { o = &cpu_data(i); - if ((i == cpu) || (has_smt && match_smt(c, o))) + if ((i == cpu) || (has_smt && match_smt(c, o))) { link_mask(topology_sibling_cpumask, cpu, i); + per_cpu(cpumask_weight_sibling, cpu) = + cpumask_weight(topology_sibling_cpumask(cpu)); + per_cpu(cpumask_weight_sibling, i) = + cpumask_weight(topology_sibling_cpumask(i)); + } if ((i == cpu) || (has_mp && match_llc(c, o))) link_mask(cpu_llc_shared_mask, cpu, i); @@ -1173,6 +1183,8 @@ static __init void disable_smp(void) else physid_set_mask_of_physid(0, &phys_cpu_present_map); cpumask_set_cpu(0, topology_sibling_cpumask(0)); + per_cpu(cpumask_weight_sibling, 0) = + cpumask_weight(topology_sibling_cpumask(0)); cpumask_set_cpu(0, topology_core_cpumask(0)); } @@ -1482,6 +1494,8 @@ static void remove_siblinginfo(int cpu) for_each_cpu(sibling, topology_core_cpumask(cpu)) { cpumask_clear_cpu(cpu, 
topology_core_cpumask(sibling)); + per_cpu(cpumask_weight_sibling, sibling) = + cpumask_weight(topology_sibling_cpumask(sibling)); /*/ * last thread sibling in this cpu core going down */ @@ -1495,6 +1509,7 @@ static void remove_siblinginfo(int cpu) cpumask_clear_cpu(cpu, cpu_llc_shared_mask(sibling)); cpumask_clear(cpu_llc_shared_mask(cpu)); cpumask_clear(topology_sibling_cpumask(cpu)); + per_cpu(cpumask_weight_sibling, cpu) = 0; cpumask_clear(topology_core_cpumask(cpu)); c->cpu_core_id = 0; c->booted_cores = 0; -- 2.9.3
[RFC PATCH 3/9] sched: add sched feature to disable idle core search
Add a new sched feature SIS_CORE to have an option to disable idle core search (select_idle_core). Signed-off-by: subhra mazumdar --- kernel/sched/features.h | 1 + 1 file changed, 1 insertion(+) diff --git a/kernel/sched/features.h b/kernel/sched/features.h index 858589b..de4d506 100644 --- a/kernel/sched/features.h +++ b/kernel/sched/features.h @@ -57,6 +57,7 @@ SCHED_FEAT(TTWU_QUEUE, true) */ SCHED_FEAT(SIS_AVG_CPU, false) SCHED_FEAT(SIS_PROP, true) +SCHED_FEAT(SIS_CORE, true) /* * Issue a WARN when we do multiple update_rq_clock() calls -- 2.9.3
Panic on v5.3-rc4
I am getting the following panic during boot of tag v5.3-rc4 of git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git. I don't see the panic on tag v5.2 on same rig. Is it a bug or something legitimately changed? Thanks, Subhra [ 147.184948] dracut Warning: No root device "block:/dev/mapper/vg_paex623-lv_root" found [ 147.282665] dracut Warning: LVM vg_paex623/lv_swap not found [ 147.354854] dracut Warning: LVM vg_paex623/lv_root not found [ 147.432099] dracut Warning: Boot has failed. To debug this issue add "rdshell" to the kernel command line. [ 147.549737] dracut Warning: Signal caught! [ 147.600145] dracut Warning: LVM vg_paex623/lv_swap not found [ 147.670692] dracut Warning: LVM vg_paex623/lv_root not found [ 147.738593] dracut Warning: Boot has failed. To debug this issue add "rdshell" to the kernel command line. [ 147.856206] Kernel panic - not syncing: Attempted to kill init! exitcode=0x0100 [ 147.947859] CPU: 3 PID: 1 Comm: init Not tainted 5.3.0-rc4latency_nice_BL #3 [ 148.032225] Hardware name: Oracle Corporation ORACLE SERVER X6-2L/ASM,MOBO TRAY,2U, BIOS 39050100 08/30/2016 [ 148.149879] Call Trace: [ 148.179117] dump_stack+0x5c/0x7b [ 148.218760] panic+0xfe/0x2e2 [ 148.254242] do_exit+0xbd8/0xbe0 [ 148.292842] do_group_exit+0x3a/0xa0 [ 148.335597] __x64_sys_exit_group+0x14/0x20 [ 148.385644] do_syscall_64+0x5b/0x1d0 [ 148.429448] entry_SYSCALL_64_after_hwframe+0x44/0xa9 [ 148.489900] RIP: 0033:0x3a5ccad018 [ 148.530582] Code: Bad RIP value. [ 148.569178] RSP: 002b:7ffc147a5b48 EFLAGS: 0246 ORIG_RAX: 00e7 [ 148.659791] RAX: ffda RBX: 0004 RCX: 003a5ccad018 [ 148.745197] RDX: 0001 RSI: 003c RDI: 0001 [ 148.830605] RBP: R08: 00e7 R09: ffa8 [ 148.916011] R10: 0001 R11: 0246 R12: 00401fb0 [ 149.001414] R13: 7ffc147a5e20 R14: R15: [ 149.086929] Kernel Offset: disabled [ 149.132815] ---[ end Kernel panic - not syncing: Attempted to kill init! exitcode=0x0100 ]---
Re: [RFC PATCH 7/9] sched: search SMT before LLC domain
On 9/5/19 2:31 AM, Peter Zijlstra wrote: On Fri, Aug 30, 2019 at 10:49:42AM -0700, subhra mazumdar wrote: Search SMT siblings before all CPUs in LLC domain for idle CPU. This helps in L1 cache locality. --- kernel/sched/fair.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 8856503..94dd4a32 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -6274,11 +6274,11 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target) return i; } - i = select_idle_cpu(p, sd, target); + i = select_idle_smt(p, target); if ((unsigned)i < nr_cpumask_bits) return i; - i = select_idle_smt(p, target); + i = select_idle_cpu(p, sd, target); if ((unsigned)i < nr_cpumask_bits) return i; But it is absolutely conceptually wrong. An idle core is a much better target than an idle sibling. This is select_idle_smt not select_idle_core.
Re: [RFC PATCH 3/9] sched: add sched feature to disable idle core search
On 9/5/19 3:17 AM, Patrick Bellasi wrote: On Fri, Aug 30, 2019 at 18:49:38 +0100, subhra mazumdar wrote... Add a new sched feature SIS_CORE to have an option to disable idle core search (select_idle_core).

Signed-off-by: subhra mazumdar
---
 kernel/sched/features.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index 858589b..de4d506 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -57,6 +57,7 @@ SCHED_FEAT(TTWU_QUEUE, true)
  */
 SCHED_FEAT(SIS_AVG_CPU, false)
 SCHED_FEAT(SIS_PROP, true)
+SCHED_FEAT(SIS_CORE, true)

Why do we need a sched_feature? If you think there are systems in which the usage of latency-nice does not make sense in "Select Idle Sibling", then we should probably better add a new Kconfig option. This is not for latency-nice but to be able to disable a different aspect of the scheduler, i.e. searching for idle cores. This can be made part of latency-nice (i.e. not doing the idle core search if latency-nice is below a certain value) but even then having a feature to disable it doesn't hurt. If that's the case, you can probably use the init/Kconfig "Scheduler features" section, recently added by: commit 69842cba9ace ("sched/uclamp: Add CPU's clamp buckets refcounting")

 /*
  * Issue a WARN when we do multiple update_rq_clock() calls

Best, Patrick
Re: [RFC PATCH 2/3] sched: change scheduler to give preference to soft affinity CPUs
On 7/2/19 10:58 PM, Peter Zijlstra wrote: On Wed, Jun 26, 2019 at 03:47:17PM -0700, subhra mazumdar wrote: The soft affinity CPUs present in the cpumask cpus_preferred are used by the scheduler in two levels of search. The first is in determining wake affine, which chooses the LLC domain, and the second is while searching for idle CPUs in the LLC domain. In the first level it uses cpus_preferred to prune out the search space. In the second level it first searches cpus_preferred and then cpus_allowed. Using the affinity_unequal flag it breaks early to avoid any overhead in the scheduler fast path when soft affinity is not used. This only changes the wakeup path of the scheduler; the idle balancing is unchanged. Together they achieve the "softness" of scheduling. I really dislike this implementation. I thought the idea was to remain work conserving (in so far as that we're that anyway), so changing select_idle_sibling() doesn't make sense to me. If there is idle, we use it. Same for newidle; which you already retained. The scheduler is already not work conserving in many ways. Soft affinity is only for those who want to use it and has no side effects when not used. Also, given the way the scheduler is implemented, it may not be possible to do the first level of search in a work conserving way; I am open to ideas. This then leaves regular balancing, and for that we can fudge with can_migrate_task() and nr_balance_failed or something. Possibly, but I don't know if similar performance behavior can be achieved by the periodic load balancer. Do you want a performance comparison of the two approaches? And I also really don't want a second utilization tipping point; we already have the overloaded thing. The numbers in the cover letter show that a static tipping point will not work for all workloads. What soft affinity is doing is essentially trading off cache coherence for more CPU. The optimum tradeoff point will vary from workload to workload and with the system metrics of coherence overhead etc. If we just use the domain overload, that becomes a static definition of the tipping point; we need something tunable that captures this tradeoff. The ratio of CPU util seemed to work well and capture that. I also still dislike how you never looked into the numa balancer, which already has preferred_nid stuff. Not sure if you mean using the existing NUMA balancer or enhancing it. If the former, I have numbers in the cover letter that show the NUMA balancer is not making any difference. I allocated the memory of each DB instance to one NUMA node using numactl, but the NUMA balancer still migrated pages, so numactl only seems to control the initial allocation. Secondly, even though the NUMA balancer migrated pages, it had no performance benefit as compared to disabling it.
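Since the patch body is not quoted here, a minimal sketch of the two-pass wakeup search described above may help: preferred mask first, full allowed mask second, with the second pass skipped when no soft affinity is in use. Function and parameter names are illustrative, not the posted code, and the LLC-domain restriction and search-cost limiting of the real select_idle_cpu() are omitted for brevity.

/*
 * Illustrative sketch only: look for an idle CPU among the preferred CPUs
 * first and fall back to the remaining allowed CPUs, skipping the fallback
 * entirely when preferred == allowed (affinity_unequal is false).
 */
static int select_idle_cpu_soft(struct task_struct *p, int target,
				const struct cpumask *preferred,
				const struct cpumask *allowed,
				bool affinity_unequal)
{
	int cpu;

	for_each_cpu_wrap(cpu, preferred, target) {
		if (idle_cpu(cpu))
			return cpu;
	}

	if (!affinity_unequal)
		return -1;

	for_each_cpu_wrap(cpu, allowed, target) {
		if (cpumask_test_cpu(cpu, preferred))
			continue;	/* already checked above */
		if (idle_cpu(cpu))
			return cpu;
	}

	return -1;
}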
Re: [RFC][PATCH 00/16] sched: Core scheduling
On 2/20/19 1:42 AM, Peter Zijlstra wrote: A: Because it messes up the order in which people normally read text. Q: Why is top-posting such a bad thing? A: Top-posting. Q: What is the most annoying thing in e-mail? On Tue, Feb 19, 2019 at 02:07:01PM -0800, Greg Kerr wrote: Thanks for posting this patchset Peter. Based on the patch titled, "sched: A quick and dirty cgroup tagging interface," I believe cgroups are used to define co-scheduling groups in this implementation. Chrome OS engineers (kerr...@google.com, mpden...@google.com, and pal...@google.com) are considering an interface that is usable by unprivileged userspace apps. cgroups are a global resource that require privileged access. Have you considered an interface that is akin to namespaces? Consider the following strawperson API proposal (I understand prctl() is generally used for process specific actions, so we aren't married to using prctl()): I don't think we're anywhere near the point where I care about interfaces with this stuff. Interfaces are a trivial but tedious matter once the rest works to satisfaction. As it happens; there is actually a bug in that very cgroup patch that can cause undesired scheduling. Try spotting and fixing that. Another question is if we want to be L1TF complete (and how strict) or not, and if so, build the missing pieces (for instance we currently don't kick siblings on IRQ/trap/exception entry -- and yes that's nasty and horrible code and missing for that reason). I remember asking Paul about this and he mentioned he has a Address Space Isolation proposal to cover this. So it seems this is out of scope of core scheduling? So first; does this provide what we need? If that's sorted we can bike-shed on uapi/abi.
Re: [RFC][PATCH 00/16] sched: Core scheduling
On 2/18/19 9:49 AM, Linus Torvalds wrote: On Mon, Feb 18, 2019 at 9:40 AM Peter Zijlstra wrote: However; whichever way around you turn this cookie; it is expensive and nasty. Do you (or anybody else) have numbers for real loads? Because performance is all that matters. If performance is bad, then it's pointless, since just turning off SMT is the answer. Linus

I tested 2 Oracle DB instances running OLTP on a 2-socket, 44-core system. This is on bare metal, no virtualization. In all cases I put each DB instance in a separate cpu cgroup. Following are the avg throughput numbers of the 2 instances. %stdev is the standard deviation between the 2 instances. Baseline = build w/o CONFIG_SCHED_CORE, core_sched = build w/ CONFIG_SCHED_CORE, HT_disable = offlined sibling HT with baseline.

Users   Baseline  %stdev  core_sched        %stdev  HT_disable        %stdev
16      997768    3.28    808193 (-19%)     34      1053888 (+5.6%)   2.9
24      1157314   9.4     974555 (-15.8%)   40.5    1197904 (+3.5%)   4.6
32      1693644   6.4     1237195 (-27%)    42.8    1308180 (-22.8%)  5.3

The regressions are substantial. I also noticed one of the DB instances was having much less throughput than the other with core scheduling, which brought down the avg and is also reflected in the very high %stdev. Disabling HT has an effect at 32 users but is still better than core scheduling both in terms of avg and %stdev. There are some issues with the DB setup for which I couldn't go beyond 32 users.
Re: [RFC PATCH v2 11/17] sched: Basic tracking of matching tasks
On 5/8/19 8:49 AM, Aubrey Li wrote: Pawan ran an experiment setting up 2 VMs, with one VM doing a parallel kernel build and one VM doing sysbench, limiting both VMs to run on 16 cpu threads (8 physical cores), with 8 vcpu for each VM. Making the fix did improve kernel build time by 7%. I'm gonna agree with the patch below, but just wonder if the testing result is consistent, as I didn't see any improvement in my testing environment. IIUC, from the code behavior, especially for 2 VMs case(only 2 different cookies), the per-rq rb tree unlikely has nodes with different cookies, that is, all the nodes on this tree should have the same cookie, so: - if the parameter cookie is equal to the rb tree cookie, we meet a match and go the third branch - else, no matter we go left or right, we can't find a match, and we'll return idle thread finally. Please correct me if I was wrong. Thanks, -Aubrey This is searching in the per core rb tree (rq->core_tree) which can have 2 different cookies. But having said that, even I didn't see any improvement with the patch for my DB test case. But logically it is correct.
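For readers following along, here is a schematic of the kind of cookie-keyed rb-tree lookup the three branches above refer to. The field names (core_node, core_cookie) loosely follow the core scheduling patch set, but this is a simplified sketch, not the actual sched_core_find().

/*
 * Illustrative sketch: the rb tree is keyed by core_cookie, so the walk goes
 * left/right on cookie compare and only the "equal" branch can yield a match.
 */
static struct task_struct *sched_core_find_sketch(struct rb_root *core_tree,
						  unsigned long cookie)
{
	struct rb_node *node = core_tree->rb_node;

	while (node) {
		struct task_struct *p = rb_entry(node, struct task_struct,
						 core_node);

		if (cookie < p->core_cookie)
			node = node->rb_left;	/* smaller cookies to the left */
		else if (cookie > p->core_cookie)
			node = node->rb_right;	/* larger cookies to the right */
		else
			return p;		/* cookie match */
	}

	return NULL;	/* caller falls back to the idle task */
}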
Re: [RFC PATCH v2 11/17] sched: Basic tracking of matching tasks
On 5/8/19 11:19 AM, Subhra Mazumdar wrote: On 5/8/19 8:49 AM, Aubrey Li wrote: Pawan ran an experiment setting up 2 VMs, with one VM doing a parallel kernel build and one VM doing sysbench, limiting both VMs to run on 16 cpu threads (8 physical cores), with 8 vcpu for each VM. Making the fix did improve kernel build time by 7%. I'm gonna agree with the patch below, but just wonder if the testing result is consistent, as I didn't see any improvement in my testing environment. IIUC, from the code behavior, especially for 2 VMs case(only 2 different cookies), the per-rq rb tree unlikely has nodes with different cookies, that is, all the nodes on this tree should have the same cookie, so: - if the parameter cookie is equal to the rb tree cookie, we meet a match and go the third branch - else, no matter we go left or right, we can't find a match, and we'll return idle thread finally. Please correct me if I was wrong. Thanks, -Aubrey This is searching in the per core rb tree (rq->core_tree) which can have 2 different cookies. But having said that, even I didn't see any improvement with the patch for my DB test case. But logically it is correct. Ah, my bad. It is per rq. But still can have 2 different cookies. Not sure why you think it is unlikely?
Re: [RFC PATCH v2 11/17] sched: Basic tracking of matching tasks
On 5/8/19 5:01 PM, Aubrey Li wrote: On Thu, May 9, 2019 at 2:41 AM Subhra Mazumdar wrote: On 5/8/19 11:19 AM, Subhra Mazumdar wrote: On 5/8/19 8:49 AM, Aubrey Li wrote: Pawan ran an experiment setting up 2 VMs, with one VM doing a parallel kernel build and one VM doing sysbench, limiting both VMs to run on 16 cpu threads (8 physical cores), with 8 vcpu for each VM. Making the fix did improve kernel build time by 7%. I'm gonna agree with the patch below, but just wonder if the testing result is consistent, as I didn't see any improvement in my testing environment. IIUC, from the code behavior, especially for 2 VMs case(only 2 different cookies), the per-rq rb tree unlikely has nodes with different cookies, that is, all the nodes on this tree should have the same cookie, so: - if the parameter cookie is equal to the rb tree cookie, we meet a match and go the third branch - else, no matter we go left or right, we can't find a match, and we'll return idle thread finally. Please correct me if I was wrong. Thanks, -Aubrey This is searching in the per core rb tree (rq->core_tree) which can have 2 different cookies. But having said that, even I didn't see any improvement with the patch for my DB test case. But logically it is correct. Ah, my bad. It is per rq. But still can have 2 different cookies. Not sure why you think it is unlikely? Yeah, I meant 2 different cookies on the system, but unlikely 2 different cookies on one same rq. If I read the source correctly, for the sched_core_balance path, when try to steal cookie from another CPU, sched_core_find() uses dst's cookie to search if there is a cookie match in src's rq, and sched_core_find() returns idle or matched task, and later put this matched task onto dst's rq (activate_task() in sched_core_find()). At this moment, the nodes on the rq's rb tree should have same cookies. Thanks, -Aubrey Yes, but sched_core_find is also called from pick_task to find a local matching task. The enqueue side logic of the scheduler is unchanged with core scheduling, so it is possible tasks with different cookies are enqueued on the same rq. So while searching for a matching task locally doing it correctly should matter.
Re: [RFC PATCH v2 11/17] sched: Basic tracking of matching tasks
On 5/8/19 6:38 PM, Aubrey Li wrote: On Thu, May 9, 2019 at 8:29 AM Subhra Mazumdar wrote: On 5/8/19 5:01 PM, Aubrey Li wrote: On Thu, May 9, 2019 at 2:41 AM Subhra Mazumdar wrote: On 5/8/19 11:19 AM, Subhra Mazumdar wrote: On 5/8/19 8:49 AM, Aubrey Li wrote: Pawan ran an experiment setting up 2 VMs, with one VM doing a parallel kernel build and one VM doing sysbench, limiting both VMs to run on 16 cpu threads (8 physical cores), with 8 vcpu for each VM. Making the fix did improve kernel build time by 7%. I'm gonna agree with the patch below, but just wonder if the testing result is consistent, as I didn't see any improvement in my testing environment. IIUC, from the code behavior, especially for 2 VMs case(only 2 different cookies), the per-rq rb tree unlikely has nodes with different cookies, that is, all the nodes on this tree should have the same cookie, so: - if the parameter cookie is equal to the rb tree cookie, we meet a match and go the third branch - else, no matter we go left or right, we can't find a match, and we'll return idle thread finally. Please correct me if I was wrong. Thanks, -Aubrey This is searching in the per core rb tree (rq->core_tree) which can have 2 different cookies. But having said that, even I didn't see any improvement with the patch for my DB test case. But logically it is correct. Ah, my bad. It is per rq. But still can have 2 different cookies. Not sure why you think it is unlikely? Yeah, I meant 2 different cookies on the system, but unlikely 2 different cookies on one same rq. If I read the source correctly, for the sched_core_balance path, when try to steal cookie from another CPU, sched_core_find() uses dst's cookie to search if there is a cookie match in src's rq, and sched_core_find() returns idle or matched task, and later put this matched task onto dst's rq (activate_task() in sched_core_find()). At this moment, the nodes on the rq's rb tree should have same cookies. Thanks, -Aubrey Yes, but sched_core_find is also called from pick_task to find a local matching task. Can a local searching introduce a different cookies? Where is it from? No. I meant the local search uses the same binary search of sched_core_find so it has to be correct. The enqueue side logic of the scheduler is unchanged with core scheduling, But only the task with cookies is placed onto this rb tree? so it is possible tasks with different cookies are enqueued on the same rq. So while searching for a matching task locally doing it correctly should matter. May I know how exactly? select_task_rq_* seems to be unchanged. So the search logic to find a cpu to enqueue when a task becomes runnable is same as before and doesn't do any kind of cookie matching. Thanks, -Aubrey
Re: [RFC PATCH v2 11/17] sched: Basic tracking of matching tasks
select_task_rq_* seems to be unchanged. So the search logic to find a cpu to enqueue when a task becomes runnable is the same as before and doesn't do any kind of cookie matching. Okay, that's true in the task wakeup path, and load_balance also seems to pull tasks without checking the cookie. But my system is not overloaded when I test this patch, so there is at most one task in the rq and on the rq's rb tree, and this patch does not make a difference. I had the same hypothesis for my tests. The question is, should we do cookie checking when a task selects a CPU and when load balancing pulls a task? The basic issue is keeping the CPUs busy. In the case of an overloaded system, the trivial new idle balancer should be able to find a matching task in case of forced idle. More problematic is the lower-load scenario, when there aren't any matching tasks to be found but there are runnable tasks of other groups. Also, the wakeup code path tries to balance threads across cores first (select_idle_core), which is the opposite of what core scheduling wants. I will re-run my tests with select_idle_core disabled, but the issue is that on x86 Intel systems (my test rig) the CPU ids are interleaved across cores, so even select_idle_cpu will balance across cores first. Maybe others have some better ideas? Thanks, -Aubrey
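Purely to illustrate the "cookie checking in the wakeup path" question raised here (hypothetical, not from any posted patch): an otherwise idle candidate CPU could be skipped when its core is already committed to a different cookie, along the lines of:

/*
 * Hypothetical helper for illustration only; the core_cookie fields follow
 * the core-scheduling RFC, but this function and any call site in the
 * wakeup or load-balance path are assumptions, not posted code.
 */
static bool cookie_compatible_cpu(struct task_struct *p, int cpu)
{
	struct rq *rq = cpu_rq(cpu);

	/* An uncommitted core (no cookie) is compatible with any task. */
	if (!rq->core->core_cookie)
		return true;

	return rq->core->core_cookie == p->core_cookie;
}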
Re: [RFC][PATCH 03/16] sched: Wrap rq::lock access
We tried commenting out those lines and it doesn't seem to get rid of the performance regression we are seeing. Can you elaborate a bit more about the test you are performing, and what kind of resources it uses? I am running 1 and 2 Oracle DB instances, each running a TPC-C workload. The clients driving the instances also run in the same node. Each server-client pair is put in its own cpu group and tagged. Can you also try to replicate our test and see if you see the same problem?

cgcreate -g cpu,cpuset:set1
cat /sys/devices/system/cpu/cpu{0,2,4,6}/topology/thread_siblings_list
0,36
2,38
4,40
6,42
echo "0,2,4,6,36,38,40,42" | sudo tee /sys/fs/cgroup/cpuset/set1/cpuset.cpus
echo 0 | sudo tee /sys/fs/cgroup/cpuset/set1/cpuset.mems
echo 1 | sudo tee /sys/fs/cgroup/cpu,cpuacct/set1/cpu.tag
sysbench --test=fileio prepare
cgexec -g cpu,cpuset:set1 sysbench --threads=4 --test=fileio \
    --file-test-mode=seqwr run

The reason we create a cpuset is to narrow down the investigation to just 4 cores on a large, powerful machine. It might not be needed if testing on a smaller machine. With this sysbench test I am not seeing any improvement from removing the condition. Also, with hackbench I found it makes no difference, but that has a much lower regression to begin with (18%). Julien
Re: [RFC V2 2/2] sched/fair: Fallback to sched-idle CPU if idle CPU isn't found
On 5/14/19 9:03 AM, Steven Sistare wrote: On 5/13/2019 7:35 AM, Peter Zijlstra wrote: On Mon, May 13, 2019 at 03:04:18PM +0530, Viresh Kumar wrote: On 10-05-19, 09:21, Peter Zijlstra wrote: I don't hate his per se; but the whole select_idle_sibling() thing is something that needs looking at. There was the task stealing thing from Steve that looked interesting and that would render your apporach unfeasible. I am surely missing something as I don't see how that patchset will make this patchset perform badly, than what it already does. Nah; I just misremembered. I know Oracle has a patch set poking at select_idle_siblings() _somewhere_ (as do I), and I just found the wrong one. Basically everybody is complaining select_idle_sibling() is too expensive for checking the entire LLC domain, except for FB (and thus likely some other workloads too) that depend on it to kill their tail latency. But I suppose we could still do this, even if we scan only a subset of the LLC, just keep track of the last !idle CPU running only SCHED_IDLE tasks and pick that if you do not (in your limited scan) find a better candidate. Subhra posted a patch that incrementally searches for an idle CPU in the LLC, remembering the last CPU examined, and searching a fixed number of CPUs from there. That technique is compatible with the one that Viresh suggests; the incremental search would stop if a SCHED_IDLE cpu was found. This was the last version of patchset I sent: https://lkml.org/lkml/2018/6/28/810 Also select_idle_core is a net -ve for certain workloads like OLTP. So I had put a SCHED_FEAT to be able to disable it. Thanks, Subhra I also fiddled with select_idle_sibling, maintaining a per-LLC bitmap of idle CPUs, updated with atomic operations. Performance was basically unchanged for the workloads I tested, and I inserted timers around the idle search showing it was a very small fraction of time both before and after my changes. That led me to ignore the push side and optimize the pull side with task stealing. I would be very interested in hearing from folks that have workloads that demonstrate that select_idle_sibling is too expensive. - Steve
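As a rough illustration of the per-LLC idle bitmap Steve describes (a sketch of the idea only, not his actual patch; sd_llc_shared is a real per-CPU pointer in the scheduler, but the idle_cpus member added to sched_domain_shared here is an assumption):

/*
 * Sketch: keep a per-LLC mask of idle CPUs and update it with atomic
 * bitops at idle entry/exit. The idle_cpus field is an assumed addition
 * to struct sched_domain_shared for illustration purposes.
 */
static void set_cpu_llc_idle(int cpu, bool idle)
{
	struct sched_domain_shared *sds;

	sds = rcu_dereference(per_cpu(sd_llc_shared, cpu));
	if (!sds)
		return;

	if (idle)
		cpumask_set_cpu(cpu, sds->idle_cpus);	/* atomic set_bit */
	else
		cpumask_clear_cpu(cpu, sds->idle_cpus);	/* atomic clear_bit */
}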
Re: [RFC V2 2/2] sched/fair: Fallback to sched-idle CPU if idle CPU isn't found
On 5/14/19 10:27 AM, Subhra Mazumdar wrote: On 5/14/19 9:03 AM, Steven Sistare wrote: On 5/13/2019 7:35 AM, Peter Zijlstra wrote: On Mon, May 13, 2019 at 03:04:18PM +0530, Viresh Kumar wrote: On 10-05-19, 09:21, Peter Zijlstra wrote: I don't hate his per se; but the whole select_idle_sibling() thing is something that needs looking at. There was the task stealing thing from Steve that looked interesting and that would render your apporach unfeasible. I am surely missing something as I don't see how that patchset will make this patchset perform badly, than what it already does. Nah; I just misremembered. I know Oracle has a patch set poking at select_idle_siblings() _somewhere_ (as do I), and I just found the wrong one. Basically everybody is complaining select_idle_sibling() is too expensive for checking the entire LLC domain, except for FB (and thus likely some other workloads too) that depend on it to kill their tail latency. But I suppose we could still do this, even if we scan only a subset of the LLC, just keep track of the last !idle CPU running only SCHED_IDLE tasks and pick that if you do not (in your limited scan) find a better candidate. Subhra posted a patch that incrementally searches for an idle CPU in the LLC, remembering the last CPU examined, and searching a fixed number of CPUs from there. That technique is compatible with the one that Viresh suggests; the incremental search would stop if a SCHED_IDLE cpu was found. This was the last version of patchset I sent: https://lkml.org/lkml/2018/6/28/810 Also select_idle_core is a net -ve for certain workloads like OLTP. So I had put a SCHED_FEAT to be able to disable it. Forgot to add, the cpumask_weight computation may not be O(1) with large number of CPUs, so needs to be precomputed in a per-cpu variable to further optimize. That part is missing from the above patchset. Thanks, Subhra I also fiddled with select_idle_sibling, maintaining a per-LLC bitmap of idle CPUs, updated with atomic operations. Performance was basically unchanged for the workloads I tested, and I inserted timers around the idle search showing it was a very small fraction of time both before and after my changes. That led me to ignore the push side and optimize the pull side with task stealing. I would be very interested in hearing from folks that have workloads that demonstrate that select_idle_sibling is too expensive. - Steve
Re: [RFC][PATCH 03/16] sched: Wrap rq::lock access
On 3/29/19 6:35 AM, Julien Desfossez wrote: On Fri, Mar 22, 2019 at 8:09 PM Subhra Mazumdar wrote: Is the core wide lock primarily responsible for the regression? I ran up to patch 12, which also has the core wide lock for tagged cgroups and also calls newidle_balance() from pick_next_task(). I don't see any regression. Of course the core sched version of pick_next_task() may be doing more, but compared with __pick_next_task() it doesn't look too horrible. On further testing and investigation, we also agree that spinlock contention is not the major cause for the regression, but we feel that it should be one of the major contributing factors to this performance loss. I finally did some code bisection and found the following lines are basically responsible for the regression. Commenting them out, I don't see the regressions. Can you confirm? I am yet to figure out if this is needed for the correctness of core scheduling and, if so, whether we can do this better?

>8-
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index fe3918c..3b3388a 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3741,8 +3741,8 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 			 * If there weren't no cookies; we don't need
 			 * to bother with the other siblings.
 			 */
-			if (i == cpu && !rq->core->core_cookie)
-				goto next_class;
+			//if (i == cpu && !rq->core->core_cookie)
+			//	goto next_class;
 
 			continue;
 		}
Re: [RFC][PATCH 13/16] sched: Add core wide task selection and scheduling.
On 4/19/19 1:40 AM, Ingo Molnar wrote: * Subhra Mazumdar wrote: I see similar improvement with this patch as removing the condition I earlier mentioned. So that's not needed. I also included the patch for the priority fix. For 2 DB instances, HT disabling stands at -22% for 32 users (from earlier emails).

1 DB instance
users  baseline  %idle  core_sched  %idle
16     1         84     -4.9%       84
24     1         76     -6.7%       75
32     1         69     -2.4%       69

2 DB instance
users  baseline  %idle  core_sched  %idle
16     1         66     -19.5%      69
24     1         54     -9.8%       57
32     1         42     -27.2%      48

So HT disabling slows down the 2DB instance by -22%, while core-sched slows it down by -27.2%? Would it be possible to see all the results in two larger tables (1 DB instance and 2 DB instance) so that we can compare the performance of the 3 kernel variants with each other: - "vanilla +HT": Hyperthreading enabled, vanilla scheduler - "vanilla -HT": Hyperthreading disabled, vanilla scheduler - "core_sched": Hyperthreading enabled, core-scheduling enabled ? Thanks, Ingo

Following are the numbers. Disabling HT gives improvement in some cases.

1 DB instance
users  vanilla+HT  core_sched  vanilla-HT
16     1           -4.9%       -11.7%
24     1           -6.7%       +13.7%
32     1           -2.4%       +8%

2 DB instance
users  vanilla+HT  core_sched  vanilla-HT
16     1           -19.5%      +5.6%
24     1           -9.8%       +3.5%
32     1           -27.2%      -22.8%
Re: [RFC PATCH v2 00/17] Core scheduling v2
On 4/26/19 3:43 AM, Mel Gorman wrote: On Fri, Apr 26, 2019 at 10:42:22AM +0200, Ingo Molnar wrote: It should, but it's not perfect. For example, wake_affine_idle does not take sibling activity into account even though select_idle_sibling *may* take it into account. Even select_idle_sibling in its fast path may use an SMT sibling instead of searching. There are also potential side-effects with cpuidle. Some workloads migrate around the socket as they are communicating because of how the search for an idle CPU works. With SMT on, there is potentially a longer opportunity for a core to reach a deep c-state and incur a bigger wakeup latency. This is a very weak theory but I've seen cases where latency sensitive workloads with only two communicating tasks are affected by CPUs reaching low c-states due to migrations. Clearly it doesn't. It's more that it's a best effort to wake up quickly instead of being perfect by using an expensive search every time. Yeah, but your numbers suggest that for *most* not heavily interacting under-utilized CPU bound workloads we hurt in the 5-10% range compared to no-SMT - more in some cases. Indeed, it was higher than expected and we can't even use the excuse that more resources are available to a single logical CPU as the scheduler is meant to keep them apart. So we avoid a maybe 0.1% scheduler placement overhead but inflict 5-10% harm on the workload, and also blow up stddev by randomly co-scheduling two tasks on the same physical core? Not a good trade-off. I really think we should implement a relatively strict physical core placement policy in the under-utilized case, and resist any attempts to weaken this for special workloads that ping-pong quickly and benefit from sharing the same physical core. It's worth a shot at least. Changes should mostly be in the wake_affine path for most loads of interest. Doesn't select_idle_sibling already try to do that by calling select_idle_core? For our OLTP workload we in fact found the cost of select_idle_core was actually hurting more than it helped to find a fully idle core, so it was a net negative.
Re: [PATCH v3 5/7] sched: SIS_CORE to disable idle core search
On 7/4/19 6:04 PM, Parth Shah wrote: The same experiment with hackbench, analyzed with perf, shows an increase in the L1 cache miss rate with these patches (lower is better):

                       Baseline(%)  Patch(%)
Total cache miss rate  17.01        19 (-11%)
L1 icache miss rate    5.45         6.7 (-22%)

So is it possible for the idle_cpu search to try checking target_cpu first and then go to the sliding window if nothing is found? The below diff works as expected on an IBM POWER9 system and resolves the problem of far wakeups to a large extent.

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ff2e9b5c3ac5..fae035ce1162 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6161,6 +6161,7 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
 	u64 time, cost;
 	s64 delta;
 	int cpu, limit, floor, target_tmp, nr = INT_MAX;
+	struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
 
 	this_sd = rcu_dereference(*this_cpu_ptr(&sd_llc));
 	if (!this_sd)
@@ -6198,16 +6199,22 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
 
 	time = local_clock();
 
-	for_each_cpu_wrap(cpu, sched_domain_span(sd), target_tmp) {
+	cpumask_and(cpus, sched_domain_span(sd), &p->cpus_allowed);
+	for_each_cpu_wrap(cpu, cpu_smt_mask(target), target) {
+		__cpumask_clear_cpu(cpu, cpus);
+		if (available_idle_cpu(cpu))
+			goto idle_cpu_exit;
+	}
+
+	for_each_cpu_wrap(cpu, cpus, target_tmp) {
 		per_cpu(next_cpu, target) = cpu;
 		if (!--nr)
 			return -1;
-		if (!cpumask_test_cpu(cpu, &p->cpus_allowed))
-			continue;
 		if (available_idle_cpu(cpu))
 			break;
 	}
 
+idle_cpu_exit:
 	time = local_clock() - time;
 	cost = this_sd->avg_scan_cost;
 	delta = (s64)(time - cost) / 8;

Best, Parth

How about calling select_idle_smt before select_idle_cpu from select_idle_sibling? That should have the same effect.
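For clarity, the ordering Subhra suggests would look roughly like this at the tail of select_idle_sibling() (a sketch only; the exact argument lists of select_idle_smt()/select_idle_cpu() vary slightly across kernel versions and with this series applied):

	/*
	 * Sketch of trying the target's SMT siblings before the wider LLC
	 * scan, as suggested above. Not an actual patch; helper signatures
	 * are approximate.
	 */
	if (sched_feat(SIS_CORE)) {
		i = select_idle_core(p, sd, target);
		if ((unsigned)i < nr_cpumask_bits)
			return i;
	}

	/* Check the target core's siblings first: cheap and cache-hot. */
	i = select_idle_smt(p, sd, target);
	if ((unsigned)i < nr_cpumask_bits)
		return i;

	/* Only then fall back to the (limited, rotating) LLC-wide scan. */
	i = select_idle_cpu(p, sd, target);
	if ((unsigned)i < nr_cpumask_bits)
		return i;

	return target;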
[RFC PATCH 3/3] sched: introduce tunables to control soft affinity
For different workloads the optimal "softness" of soft affinity can be different. Introduce tunables sched_allowed and sched_preferred that can be tuned via /proc. This allows to chose at what utilization difference the scheduler will chose cpus_allowed over cpus_preferred in the first level of search. Depending on the extent of data sharing, cache coherency overhead of the system etc. the optimal point may vary. Signed-off-by: subhra mazumdar --- include/linux/sched/sysctl.h | 2 ++ kernel/sched/fair.c | 19 ++- kernel/sched/sched.h | 2 ++ kernel/sysctl.c | 14 ++ 4 files changed, 36 insertions(+), 1 deletion(-) diff --git a/include/linux/sched/sysctl.h b/include/linux/sched/sysctl.h index 99ce6d7..0e75602 100644 --- a/include/linux/sched/sysctl.h +++ b/include/linux/sched/sysctl.h @@ -41,6 +41,8 @@ extern unsigned int sysctl_numa_balancing_scan_size; #ifdef CONFIG_SCHED_DEBUG extern __read_mostly unsigned int sysctl_sched_migration_cost; extern __read_mostly unsigned int sysctl_sched_nr_migrate; +extern __read_mostly unsigned int sysctl_sched_preferred; +extern __read_mostly unsigned int sysctl_sched_allowed; int sched_proc_update_handler(struct ctl_table *table, int write, void __user *buffer, size_t *length, diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 53aa7f2..d222d78 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -85,6 +85,8 @@ unsigned int sysctl_sched_wakeup_granularity = 100UL; static unsigned int normalized_sysctl_sched_wakeup_granularity = 100UL; const_debug unsigned int sysctl_sched_migration_cost = 50UL; +const_debug unsigned int sysctl_sched_preferred= 1UL; +const_debug unsigned int sysctl_sched_allowed = 100UL; #ifdef CONFIG_SMP /* @@ -6739,7 +6741,22 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f int new_cpu = prev_cpu; int want_affine = 0; int sync = (wake_flags & WF_SYNC) && !(current->flags & PF_EXITING); - struct cpumask *cpus = &p->cpus_preferred; + int cpux, cpuy; + struct cpumask *cpus; + + if (!p->affinity_unequal) { + cpus = &p->cpus_allowed; + } else { + cpux = cpumask_any(&p->cpus_preferred); + cpus = this_cpu_cpumask_var_ptr(select_idle_mask); + cpumask_andnot(cpus, &p->cpus_allowed, &p->cpus_preferred); + cpuy = cpumask_any(cpus); + if (sysctl_sched_preferred * cpu_rq(cpux)->cfs.avg.util_avg > + sysctl_sched_allowed * cpu_rq(cpuy)->cfs.avg.util_avg) + cpus = &p->cpus_allowed; + else + cpus = &p->cpus_preferred; + } if (sd_flag & SD_BALANCE_WAKE) { record_wakee(p); diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index b52ed1a..f856bdb 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -1863,6 +1863,8 @@ extern void check_preempt_curr(struct rq *rq, struct task_struct *p, int flags); extern const_debug unsigned int sysctl_sched_nr_migrate; extern const_debug unsigned int sysctl_sched_migration_cost; +extern const_debug unsigned int sysctl_sched_preferred; +extern const_debug unsigned int sysctl_sched_allowed; #ifdef CONFIG_SCHED_HRTICK diff --git a/kernel/sysctl.c b/kernel/sysctl.c index 7d1008b..bdffb48 100644 --- a/kernel/sysctl.c +++ b/kernel/sysctl.c @@ -383,6 +383,20 @@ static struct ctl_table kern_table[] = { .mode = 0644, .proc_handler = proc_dointvec, }, + { + .procname = "sched_preferred", + .data = &sysctl_sched_preferred, + .maxlen = sizeof(unsigned int), + .mode = 0644, + .proc_handler = proc_dointvec, + }, + { + .procname = "sched_allowed", + .data = &sysctl_sched_allowed, + .maxlen = sizeof(unsigned int), + .mode = 0644, + .proc_handler = proc_dointvec, + }, 
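As a worked example of the comparison added in the select_task_rq_fair() hunk above: with the defaults sched_preferred=1 and sched_allowed=100, the wakeup path abandons cpus_preferred only when the sampled preferred CPU is more than 100x as utilized as the sampled allowed-only CPU; setting the two values closer together (say 2 and 3) makes the affinity "softer". A standalone restatement of that check, for illustration only (it simply mirrors the patch logic):

/*
 * Returns true when the wakeup should fall back to cpus_allowed, i.e. when
 *   sched_preferred * util(preferred cpu) > sched_allowed * util(allowed-only cpu)
 * meaning the preferred side is "too busy" relative to the tunable ratio.
 */
static bool prefer_allowed_mask(unsigned long util_preferred,
				unsigned long util_allowed_only)
{
	return sysctl_sched_preferred * util_preferred >
	       sysctl_sched_allowed * util_allowed_only;
}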
#ifdef CONFIG_SCHEDSTATS { .procname = "sched_schedstats", -- 2.9.3
[RFC PATCH 0/3] Scheduler Soft Affinity
ent optimal setting. (5:4)

Hackbench  %gain with soft affinity
2*4        1.43
2*8        1.36
2*16       1.01
2*32       1.45
1*4        -2.55
1*8        -5.06
1*16       -8
1*32       -7.32

DB         %gain with soft affinity
2*16       0.46
2*24       3.68
2*32       -3.34
1*16       0.08
1*24       1.6
1*32       -1.29

Finally I measured the overhead of soft affinity when it is NOT used by comparing it with the baseline kernel in the cases of no affinity and hard affinity with Hackbench. The following is the improvement of the soft affinity kernel w.r.t baseline, but really the numbers are within the noise margin. This shows soft affinity has no overhead when not used.

Hackbench  %diff of no affinity  %diff of hard affinity
2*4        0.11                  0.31
2*8        0.13                  0.55
2*16       0.61                  0.90
2*32       0.86                  1.01
1*4        0.48                  0.43
1*8        0.45                  0.33
1*16       0.61                  0.64
1*32       0.11                  0.63

A final set of experiments was done (numbers not shown) with the memory of each DB instance spread evenly across both NUMA nodes. This showed similar improvements with soft affinity for the 2 instance case, thus proving the improvement is due to saving LLC coherence overhead.

subhra mazumdar (3):
  sched: Introduce new interface for scheduler soft affinity
  sched: change scheduler to give preference to soft affinity CPUs
  sched: introduce tunables to control soft affinity

 arch/x86/entry/syscalls/syscall_64.tbl |   1 +
 include/linux/sched.h                  |   5 +-
 include/linux/sched/sysctl.h           |   2 +
 include/linux/syscalls.h               |   3 +
 include/uapi/asm-generic/unistd.h      |   4 +-
 include/uapi/linux/sched.h             |   3 +
 init/init_task.c                       |   2 +
 kernel/compat.c                        |   2 +-
 kernel/rcu/tree_plugin.h               |   3 +-
 kernel/sched/core.c                    | 167 -
 kernel/sched/fair.c                    | 154 ++
 kernel/sched/sched.h                   |   2 +
 kernel/sysctl.c                        |  14 +++
 13 files changed, 297 insertions(+), 65 deletions(-)

-- 
2.9.3
[RFC PATCH 1/3] sched: Introduce new interface for scheduler soft affinity
New system call sched_setaffinity2 is introduced for scheduler soft affinity. It takes an extra parameter to specify hard or soft affinity, where hard implies same as existing sched_setaffinity. New cpumask cpus_preferred is introduced for this purpose which is always a subset of cpus_allowed. A boolean affinity_unequal is used to store if they are unequal for fast lookup. Setting hard affinity resets soft affinity set to be equal to it. Soft affinity is only allowed for CFS class threads. Signed-off-by: subhra mazumdar --- arch/x86/entry/syscalls/syscall_64.tbl | 1 + include/linux/sched.h | 5 +- include/linux/syscalls.h | 3 + include/uapi/asm-generic/unistd.h | 4 +- include/uapi/linux/sched.h | 3 + init/init_task.c | 2 + kernel/compat.c| 2 +- kernel/rcu/tree_plugin.h | 3 +- kernel/sched/core.c| 167 - 9 files changed, 162 insertions(+), 28 deletions(-) diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl index b4e6f9e..1dccdd2 100644 --- a/arch/x86/entry/syscalls/syscall_64.tbl +++ b/arch/x86/entry/syscalls/syscall_64.tbl @@ -355,6 +355,7 @@ 431common fsconfig__x64_sys_fsconfig 432common fsmount __x64_sys_fsmount 433common fspick __x64_sys_fspick +434common sched_setaffinity2 __x64_sys_sched_setaffinity2 # # x32-specific system call numbers start at 512 to avoid cache impact diff --git a/include/linux/sched.h b/include/linux/sched.h index 1183741..b863fa8 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -652,6 +652,8 @@ struct task_struct { unsigned intpolicy; int nr_cpus_allowed; cpumask_t cpus_allowed; + cpumask_t cpus_preferred; + boolaffinity_unequal; #ifdef CONFIG_PREEMPT_RCU int rcu_read_lock_nesting; @@ -1784,7 +1786,8 @@ static inline void set_task_cpu(struct task_struct *p, unsigned int cpu) # define vcpu_is_preempted(cpu)false #endif -extern long sched_setaffinity(pid_t pid, const struct cpumask *new_mask); +extern long sched_setaffinity(pid_t pid, const struct cpumask *new_mask, + int flags); extern long sched_getaffinity(pid_t pid, struct cpumask *mask); #ifndef TASK_SIZE_OF diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h index e2870fe..147a4e5 100644 --- a/include/linux/syscalls.h +++ b/include/linux/syscalls.h @@ -669,6 +669,9 @@ asmlinkage long sys_sched_rr_get_interval(pid_t pid, struct __kernel_timespec __user *interval); asmlinkage long sys_sched_rr_get_interval_time32(pid_t pid, struct old_timespec32 __user *interval); +asmlinkage long sys_sched_setaffinity2(pid_t pid, unsigned int len, + unsigned long __user *user_mask_ptr, + int flags); /* kernel/signal.c */ asmlinkage long sys_restart_syscall(void); diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h index a87904d..d77b366 100644 --- a/include/uapi/asm-generic/unistd.h +++ b/include/uapi/asm-generic/unistd.h @@ -844,9 +844,11 @@ __SYSCALL(__NR_fsconfig, sys_fsconfig) __SYSCALL(__NR_fsmount, sys_fsmount) #define __NR_fspick 433 __SYSCALL(__NR_fspick, sys_fspick) +#define __NR_sched_setaffinity2 434 +__SYSCALL(__NR_sched_setaffinity2, sys_sched_setaffinity2) #undef __NR_syscalls -#define __NR_syscalls 434 +#define __NR_syscalls 435 /* * 32 bit systems traditionally used different diff --git a/include/uapi/linux/sched.h b/include/uapi/linux/sched.h index ed4ee17..f910cd5 100644 --- a/include/uapi/linux/sched.h +++ b/include/uapi/linux/sched.h @@ -52,6 +52,9 @@ #define SCHED_FLAG_RECLAIM 0x02 #define SCHED_FLAG_DL_OVERRUN 0x04 +#define SCHED_HARD_AFFINITY0 +#define SCHED_SOFT_AFFINITY1 + #define SCHED_FLAG_ALL 
(SCHED_FLAG_RESET_ON_FORK | \ SCHED_FLAG_RECLAIM | \ SCHED_FLAG_DL_OVERRUN) diff --git a/init/init_task.c b/init/init_task.c index c70ef65..aa226a3 100644 --- a/init/init_task.c +++ b/init/init_task.c @@ -73,6 +73,8 @@ struct task_struct init_task .normal_prio= MAX_PRIO - 20, .policy = SCHED_NORMAL, .cpus_allowed = CPU_MASK_ALL, + .cpus_preferred = CPU_MASK_ALL, + .affinity_unequal = false, .nr_cpus_allowed= NR_CPUS, .mm = NULL, .active_mm = &init_mm, diff --git a/kernel/compat.c b/kernel/compat.c index b5f7063..96621d7 100644 --- a/kernel/compat.c +++ b/ke
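For completeness, a hypothetical userspace use of the new interface, based only on the numbers and flags defined in this patch (there is no glibc wrapper, so a raw syscall is shown; __NR_sched_setaffinity2 = 434 and SCHED_SOFT_AFFINITY = 1 come from the hunks above and are not in any released UAPI header):

/* Illustrative only: set a soft affinity of CPUs 0-3 for the calling thread. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/syscall.h>

#ifndef __NR_sched_setaffinity2
#define __NR_sched_setaffinity2 434	/* x86-64 number from this patch */
#endif
#define SCHED_SOFT_AFFINITY 1		/* from the uapi/linux/sched.h hunk */

int main(void)
{
	cpu_set_t mask;
	int i;

	CPU_ZERO(&mask);
	for (i = 0; i < 4; i++)
		CPU_SET(i, &mask);

	/* pid 0 means the calling thread, as with sched_setaffinity(). */
	if (syscall(__NR_sched_setaffinity2, 0, sizeof(mask), &mask,
		    SCHED_SOFT_AFFINITY) != 0) {
		perror("sched_setaffinity2");
		return 1;
	}
	return 0;
}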
[RFC PATCH 2/3] sched: change scheduler to give preference to soft affinity CPUs
The soft affinity CPUs present in the cpumask cpus_preferred is used by the scheduler in two levels of search. First is in determining wake affine which choses the LLC domain and secondly while searching for idle CPUs in LLC domain. In the first level it uses cpus_preferred to prune out the search space. In the second level it first searches the cpus_preferred and then cpus_allowed. Using affinity_unequal flag it breaks early to avoid any overhead in the scheduler fast path when soft affinity is not used. This only changes the wake up path of the scheduler, the idle balancing is unchanged; together they achieve the "softness" of scheduling. Signed-off-by: subhra mazumdar --- kernel/sched/fair.c | 137 ++-- 1 file changed, 100 insertions(+), 37 deletions(-) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index f35930f..53aa7f2 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -5807,7 +5807,7 @@ static unsigned long capacity_spare_without(int cpu, struct task_struct *p) */ static struct sched_group * find_idlest_group(struct sched_domain *sd, struct task_struct *p, - int this_cpu, int sd_flag) + int this_cpu, int sd_flag, struct cpumask *cpus) { struct sched_group *idlest = NULL, *group = sd->groups; struct sched_group *most_spare_sg = NULL; @@ -5831,7 +5831,7 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, /* Skip over this group if it has no CPUs allowed */ if (!cpumask_intersects(sched_group_span(group), - &p->cpus_allowed)) + cpus)) continue; local_group = cpumask_test_cpu(this_cpu, @@ -5949,7 +5949,8 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, * find_idlest_group_cpu - find the idlest CPU among the CPUs in the group. */ static int -find_idlest_group_cpu(struct sched_group *group, struct task_struct *p, int this_cpu) +find_idlest_group_cpu(struct sched_group *group, struct task_struct *p, + int this_cpu, struct cpumask *cpus) { unsigned long load, min_load = ULONG_MAX; unsigned int min_exit_latency = UINT_MAX; @@ -5963,7 +5964,7 @@ find_idlest_group_cpu(struct sched_group *group, struct task_struct *p, int this return cpumask_first(sched_group_span(group)); /* Traverse only the allowed CPUs */ - for_each_cpu_and(i, sched_group_span(group), &p->cpus_allowed) { + for_each_cpu_and(i, sched_group_span(group), cpus) { if (available_idle_cpu(i)) { struct rq *rq = cpu_rq(i); struct cpuidle_state *idle = idle_get_state(rq); @@ -5999,7 +6000,8 @@ find_idlest_group_cpu(struct sched_group *group, struct task_struct *p, int this } static inline int find_idlest_cpu(struct sched_domain *sd, struct task_struct *p, - int cpu, int prev_cpu, int sd_flag) + int cpu, int prev_cpu, int sd_flag, + struct cpumask *cpus) { int new_cpu = cpu; @@ -6023,13 +6025,14 @@ static inline int find_idlest_cpu(struct sched_domain *sd, struct task_struct *p continue; } - group = find_idlest_group(sd, p, cpu, sd_flag); + group = find_idlest_group(sd, p, cpu, sd_flag, cpus); + if (!group) { sd = sd->child; continue; } - new_cpu = find_idlest_group_cpu(group, p, cpu); + new_cpu = find_idlest_group_cpu(group, p, cpu, cpus); if (new_cpu == cpu) { /* Now try balancing at a lower domain level of 'cpu': */ sd = sd->child; @@ -6104,6 +6107,27 @@ void __update_idle_core(struct rq *rq) rcu_read_unlock(); } +static inline int +scan_cpu_mask_for_idle_cores(struct cpumask *cpus, int target) +{ + int core, cpu; + + for_each_cpu_wrap(core, cpus, target) { + bool idle = true; + + for_each_cpu(cpu, cpu_smt_mask(core)) { + cpumask_clear_cpu(cpu, cpus); + if (!idle_cpu(cpu)) + 
idle = false; + } + + if (idle) + return core; + } + + return -1; +} + /* * Scan the entire LLC domain for idle cores; this dynamically switches off if * there are no idle cores left in the system; tracked through @@ -6112,7 +6136,7 @@ void __update_idle_core(struct rq *rq) static int select_idle_core(struct task_struct *p, struct sched_domain *sd, int target) { struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mas
[PATCH v3 2/7] sched: introduce per-cpu var next_cpu to track search limit
Introduce a per-cpu variable to track the limit upto which idle cpu search was done in select_idle_cpu(). This will help to start the search next time from there. This is necessary for rotating the search window over entire LLC domain. Signed-off-by: subhra mazumdar --- kernel/sched/core.c | 2 ++ kernel/sched/sched.h | 1 + 2 files changed, 3 insertions(+) diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 874c427..80657fc 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -24,6 +24,7 @@ #include DEFINE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues); +DEFINE_PER_CPU_SHARED_ALIGNED(int, next_cpu); #if defined(CONFIG_SCHED_DEBUG) && defined(CONFIG_JUMP_LABEL) /* @@ -5966,6 +5967,7 @@ void __init sched_init(void) for_each_possible_cpu(i) { struct rq *rq; + per_cpu(next_cpu, i) = -1; rq = cpu_rq(i); raw_spin_lock_init(&rq->lock); rq->nr_running = 0; diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index b52ed1a..4cecfa2 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -994,6 +994,7 @@ static inline void update_idle_core(struct rq *rq) { } #endif DECLARE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues); +DECLARE_PER_CPU_SHARED_ALIGNED(int, next_cpu); #define cpu_rq(cpu)(&per_cpu(runqueues, (cpu))) #define this_rq() this_cpu_ptr(&runqueues) -- 2.9.3
[PATCH v3 4/7] sched: add sched feature to disable idle core search
Add a new sched feature SIS_CORE to have an option to disable idle core search (select_idle_core). Signed-off-by: subhra mazumdar --- kernel/sched/features.h | 1 + 1 file changed, 1 insertion(+) diff --git a/kernel/sched/features.h b/kernel/sched/features.h index 858589b..de4d506 100644 --- a/kernel/sched/features.h +++ b/kernel/sched/features.h @@ -57,6 +57,7 @@ SCHED_FEAT(TTWU_QUEUE, true) */ SCHED_FEAT(SIS_AVG_CPU, false) SCHED_FEAT(SIS_PROP, true) +SCHED_FEAT(SIS_CORE, true) /* * Issue a WARN when we do multiple update_rq_clock() calls -- 2.9.3
[PATCH v3 5/7] sched: SIS_CORE to disable idle core search
Use SIS_CORE to disable idle core search. For some workloads select_idle_core becomes a scalability bottleneck, removing it improves throughput. Also there are workloads where disabling it can hurt latency, so need to have an option. Signed-off-by: subhra mazumdar --- kernel/sched/fair.c | 8 +--- 1 file changed, 5 insertions(+), 3 deletions(-) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index c1ca88e..6a74808 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -6280,9 +6280,11 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target) if (!sd) return target; - i = select_idle_core(p, sd, target); - if ((unsigned)i < nr_cpumask_bits) - return i; + if (sched_feat(SIS_CORE)) { + i = select_idle_core(p, sd, target); + if ((unsigned)i < nr_cpumask_bits) + return i; + } i = select_idle_cpu(p, sd, target); if ((unsigned)i < nr_cpumask_bits) -- 2.9.3
[PATCH v3 6/7] x86/smpboot: introduce per-cpu variable for HT siblings
Introduce a per-cpu variable to keep the number of HT siblings of a cpu. This will be used for quick lookup in select_idle_cpu to determine the limits of search. This patch does it only for x86. Signed-off-by: subhra mazumdar --- arch/x86/include/asm/smp.h | 1 + arch/x86/include/asm/topology.h | 1 + arch/x86/kernel/smpboot.c | 17 - include/linux/topology.h| 4 4 files changed, 22 insertions(+), 1 deletion(-) diff --git a/arch/x86/include/asm/smp.h b/arch/x86/include/asm/smp.h index da545df..1e90cbd 100644 --- a/arch/x86/include/asm/smp.h +++ b/arch/x86/include/asm/smp.h @@ -22,6 +22,7 @@ extern int smp_num_siblings; extern unsigned int num_processors; DECLARE_PER_CPU_READ_MOSTLY(cpumask_var_t, cpu_sibling_map); +DECLARE_PER_CPU_READ_MOSTLY(unsigned int, cpumask_weight_sibling); DECLARE_PER_CPU_READ_MOSTLY(cpumask_var_t, cpu_core_map); /* cpus sharing the last level cache: */ DECLARE_PER_CPU_READ_MOSTLY(cpumask_var_t, cpu_llc_shared_map); diff --git a/arch/x86/include/asm/topology.h b/arch/x86/include/asm/topology.h index 453cf38..dd19c71 100644 --- a/arch/x86/include/asm/topology.h +++ b/arch/x86/include/asm/topology.h @@ -111,6 +111,7 @@ extern const struct cpumask *cpu_coregroup_mask(int cpu); #ifdef CONFIG_SMP #define topology_core_cpumask(cpu) (per_cpu(cpu_core_map, cpu)) #define topology_sibling_cpumask(cpu) (per_cpu(cpu_sibling_map, cpu)) +#define topology_sibling_weight(cpu) (per_cpu(cpumask_weight_sibling, cpu)) extern unsigned int __max_logical_packages; #define topology_max_packages()(__max_logical_packages) diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c index 362dd89..20bf676 100644 --- a/arch/x86/kernel/smpboot.c +++ b/arch/x86/kernel/smpboot.c @@ -85,6 +85,10 @@ DEFINE_PER_CPU_READ_MOSTLY(cpumask_var_t, cpu_sibling_map); EXPORT_PER_CPU_SYMBOL(cpu_sibling_map); +/* representing number of HT siblings of each CPU */ +DEFINE_PER_CPU_READ_MOSTLY(unsigned int, cpumask_weight_sibling); +EXPORT_PER_CPU_SYMBOL(cpumask_weight_sibling); + /* representing HT and core siblings of each logical CPU */ DEFINE_PER_CPU_READ_MOSTLY(cpumask_var_t, cpu_core_map); EXPORT_PER_CPU_SYMBOL(cpu_core_map); @@ -520,6 +524,8 @@ void set_cpu_sibling_map(int cpu) if (!has_mp) { cpumask_set_cpu(cpu, topology_sibling_cpumask(cpu)); + per_cpu(cpumask_weight_sibling, cpu) = + cpumask_weight(topology_sibling_cpumask(cpu)); cpumask_set_cpu(cpu, cpu_llc_shared_mask(cpu)); cpumask_set_cpu(cpu, topology_core_cpumask(cpu)); c->booted_cores = 1; @@ -529,8 +535,12 @@ void set_cpu_sibling_map(int cpu) for_each_cpu(i, cpu_sibling_setup_mask) { o = &cpu_data(i); - if ((i == cpu) || (has_smt && match_smt(c, o))) + if ((i == cpu) || (has_smt && match_smt(c, o))) { link_mask(topology_sibling_cpumask, cpu, i); + threads = cpumask_weight(topology_sibling_cpumask(cpu)); + per_cpu(cpumask_weight_sibling, cpu) = threads; + per_cpu(cpumask_weight_sibling, i) = threads; + } if ((i == cpu) || (has_mp && match_llc(c, o))) link_mask(cpu_llc_shared_mask, cpu, i); @@ -1173,6 +1183,8 @@ static __init void disable_smp(void) else physid_set_mask_of_physid(0, &phys_cpu_present_map); cpumask_set_cpu(0, topology_sibling_cpumask(0)); + per_cpu(cpumask_weight_sibling, 0) = + cpumask_weight(topology_sibling_cpumask(0)); cpumask_set_cpu(0, topology_core_cpumask(0)); } @@ -1482,6 +1494,8 @@ static void remove_siblinginfo(int cpu) for_each_cpu(sibling, topology_core_cpumask(cpu)) { cpumask_clear_cpu(cpu, topology_core_cpumask(sibling)); + per_cpu(cpumask_weight_sibling, sibling) = + 
cpumask_weight(topology_sibling_cpumask(sibling)); /*/ * last thread sibling in this cpu core going down */ @@ -1495,6 +1509,7 @@ static void remove_siblinginfo(int cpu) cpumask_clear_cpu(cpu, cpu_llc_shared_mask(sibling)); cpumask_clear(cpu_llc_shared_mask(cpu)); cpumask_clear(topology_sibling_cpumask(cpu)); + per_cpu(cpumask_weight_sibling, cpu) = 0; cpumask_clear(topology_core_cpumask(cpu)); c->cpu_core_id = 0; c->booted_cores = 0; diff --git a/include/linux/topology.h b/include/linux/topology.h index cb0775e..a85aea1 100644 --- a/include/linux/topology.h +++ b/include/linux/topology.h @@ -190,6 +190,10 @@ static inline int cpu_to_mem(int cpu) #ifndef topology_sibling_cpumask #define topology_sibling_cpumask(cpu) cpumask_of(cpu) #endif +#ifndef topology_sibling_we
[PATCH v3 3/7] sched: rotate the cpu search window for better spread
Rotate the cpu search window for better spread of threads. This will ensure an idle cpu will quickly be found if one exists. Signed-off-by: subhra mazumdar --- kernel/sched/fair.c | 10 -- 1 file changed, 8 insertions(+), 2 deletions(-) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index b58f08f..c1ca88e 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -6188,7 +6188,7 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t u64 avg_cost, avg_idle; u64 time, cost; s64 delta; - int cpu, limit, floor, nr = INT_MAX; + int cpu, limit, floor, target_tmp, nr = INT_MAX; this_sd = rcu_dereference(*this_cpu_ptr(&sd_llc)); if (!this_sd) @@ -6219,9 +6219,15 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t } } + if (per_cpu(next_cpu, target) != -1) + target_tmp = per_cpu(next_cpu, target); + else + target_tmp = target; + time = local_clock(); - for_each_cpu_wrap(cpu, sched_domain_span(sd), target) { + for_each_cpu_wrap(cpu, sched_domain_span(sd), target_tmp) { + per_cpu(next_cpu, target) = cpu; if (!--nr) return -1; if (!cpumask_test_cpu(cpu, &p->cpus_allowed)) -- 2.9.3
[PATCH v3 1/7] sched: limit cpu search in select_idle_cpu
Put upper and lower limit on cpu search of select_idle_cpu. The lower limit is amount of cpus in a core while upper limit is twice that. This ensures for any architecture we will usually search beyond a core. The upper limit also helps in keeping the search cost low and constant. Signed-off-by: subhra mazumdar --- kernel/sched/fair.c | 15 +++ 1 file changed, 11 insertions(+), 4 deletions(-) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index f35930f..b58f08f 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -6188,7 +6188,7 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t u64 avg_cost, avg_idle; u64 time, cost; s64 delta; - int cpu, nr = INT_MAX; + int cpu, limit, floor, nr = INT_MAX; this_sd = rcu_dereference(*this_cpu_ptr(&sd_llc)); if (!this_sd) @@ -6206,10 +6206,17 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t if (sched_feat(SIS_PROP)) { u64 span_avg = sd->span_weight * avg_idle; - if (span_avg > 4*avg_cost) + floor = cpumask_weight(topology_sibling_cpumask(target)); + if (floor < 2) + floor = 2; + limit = floor << 1; + if (span_avg > floor*avg_cost) { nr = div_u64(span_avg, avg_cost); - else - nr = 4; + if (nr > limit) + nr = limit; + } else { + nr = floor; + } } time = local_clock(); -- 2.9.3
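To make the new bounds concrete: on an SMT-2 part floor = 2 and limit = 4, on an SMT-8 part floor = 8 and limit = 16, so the proportional estimate span_avg/avg_cost is clamped into [floor, limit]. An equivalent standalone formulation of the clamping, for illustration only (the patch open-codes it inside select_idle_cpu()):

/*
 * Equivalent to the SIS_PROP hunk above: clamp the number of CPUs to scan
 * between one core's worth of SMT siblings (minimum 2) and twice that.
 */
static int sis_nr_limit(u64 span_avg, u64 avg_cost, int smt_weight)
{
	int floor = max(smt_weight, 2);
	int limit = floor << 1;
	int nr;

	if (span_avg <= floor * avg_cost)
		return floor;

	nr = div_u64(span_avg, avg_cost);
	return min(nr, limit);
}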
[RESEND PATCH v3 0/7] Improve scheduler scalability for fast path
rease by having SIS_CORE false. Schbench on 2 socket, 44 core and 88 threads Intel x86 machine with 44 tasks (lower is better):

percentile  baseline  %stdev  patch              %stdev
50          94        2.82    93.33 (0.71%)      1.24
75          124       2.13    122.67 (1.08%)     1.7
90          152       1.74    149.33 (1.75%)     2.35
95          171       2.11    167 (2.34%)        2.74
99          512.67    104.96  206 (59.82%)       8.86
99.5        2296      82.55   3121.67 (-35.96%)  97.37
99.9        12517.33  2.38    12592 (-0.6%)      1.67

Changes from v2->v3:
-Use shift operator instead of multiplication to compute limit
-Use per-CPU variable to precompute the number of sibling SMTs for x86

subhra mazumdar (7):
  sched: limit cpu search in select_idle_cpu
  sched: introduce per-cpu var next_cpu to track search limit
  sched: rotate the cpu search window for better spread
  sched: add sched feature to disable idle core search
  sched: SIS_CORE to disable idle core search
  x86/smpboot: introduce per-cpu variable for HT siblings
  sched: use per-cpu variable cpumask_weight_sibling

 arch/x86/include/asm/smp.h      |  1 +
 arch/x86/include/asm/topology.h |  1 +
 arch/x86/kernel/smpboot.c       | 17 -
 include/linux/topology.h        |  4
 kernel/sched/core.c             |  2 ++
 kernel/sched/fair.c             | 31 +++
 kernel/sched/features.h         |  1 +
 kernel/sched/sched.h            |  1 +
 8 files changed, 49 insertions(+), 9 deletions(-)

-- 
2.9.3
[PATCH v3 7/7] sched: use per-cpu variable cpumask_weight_sibling
Use per-cpu var cpumask_weight_sibling for quick lookup in select_idle_cpu. This is the fast path of scheduler and every cycle is worth saving. Usage of cpumask_weight can result in iterations. Signed-off-by: subhra mazumdar --- kernel/sched/fair.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 6a74808..878f11c 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -6206,7 +6206,7 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t if (sched_feat(SIS_PROP)) { u64 span_avg = sd->span_weight * avg_idle; - floor = cpumask_weight(topology_sibling_cpumask(target)); + floor = topology_sibling_weight(target); if (floor < 2) floor = 2; limit = floor << 1; -- 2.9.3
Re: [PATCH V3 2/2] sched/fair: Fallback to sched-idle CPU if idle CPU isn't found
On 7/2/19 1:35 AM, Peter Zijlstra wrote: On Mon, Jul 01, 2019 at 03:08:41PM -0700, Subhra Mazumdar wrote: On 7/1/19 1:03 AM, Viresh Kumar wrote: On 28-06-19, 18:16, Subhra Mazumdar wrote: On 6/25/19 10:06 PM, Viresh Kumar wrote: @@ -5376,6 +5376,15 @@ static struct { #endif /* CONFIG_NO_HZ_COMMON */ +/* CPU only has SCHED_IDLE tasks enqueued */ +static int sched_idle_cpu(int cpu) +{ + struct rq *rq = cpu_rq(cpu); + + return unlikely(rq->nr_running == rq->cfs.idle_h_nr_running && + rq->nr_running); +} + Shouldn't this check if rq->curr is also sched idle? Why wouldn't the current set of checks be enough to guarantee that ? I thought nr_running does not include the on-cpu thread. It very much does. And why not drop the rq->nr_running non zero check? Because CPU isn't sched-idle if nr_running and idle_h_nr_running are both 0, i.e. it is an IDLE cpu in that case. And so I thought it is important to have this check as well. idle_cpu() not only checks nr_running is 0 but also rq->curr == rq->idle idle_cpu() will try very hard to declare a CPU !idle. But I don't see how that it relevant. sched_idle_cpu() will only return true if there are only SCHED_IDLE tasks on the CPU. Viresh's test is simple and straight forward. OK makes sense. Thanks, Subhra
Re: [RESEND PATCH v3 0/7] Improve scheduler scalability for fast path
On 7/2/19 1:54 AM, Patrick Bellasi wrote: Wondering if searching and preempting needs will ever be conflicting? I guess the winning point is that we don't commit behaviors to userspace, but just abstract concepts which are turned into biases. I don't see conflicts right now: if you are latency tolerant that means you can spend more time to try finding a better CPU (e.g. we can use the energy model to compare multiple CPUs) _and/or_ give the current task a better chance to complete by delaying its preemption. OK Otherwise sounds like a good direction to me. For the searching aspect, can we map latency nice values to the % of cores we search in select_idle_cpu? Thus the search cost can be controlled by latency nice value. I guess that's worth a try, only caveat I see is that it's turning the bias into something very platform specific. Meaning, the same latency-nice value on different machines can have very different results. Would not be better to try finding a more platform independent mapping? Maybe something time bounded, e.g. the higher the latency-nice the more time we can spend looking for CPUs? The issue I see is suppose we have a range of latency-nice values, then it should cover the entire range of search (one core to all cores). As Peter said some workloads will want to search the LLC fully. If we have absolute time, the map of latency-nice values range to them will be arbitrary. If you have something in mind let me know, may be I am thinking differently. But the issue is if more latency tolerant workloads set to less search, we still need some mechanism to achieve good spread of threads. I don't get this example: why more latency tolerant workloads should require less search? I guess I got the definition of "latency tolerant" backwards. Can we keep the sliding window mechanism in that case? Which one? Sorry did not went through the patches, can you briefly resume the idea? If a workload has set it to low latency tolerant, then the search will be less. That can lead to localization of threads on a few CPUs as we are not searching the entire LLC even if there are idle CPUs available. For this I had introduced a per-CPU variable (for the target CPU) to track the boundary of search so that every time it will start from the boundary, thus sliding the window. So even if we are searching very little the search window keeps shifting and gives us a good spread. This is orthogonal to the latency-nice thing. Also will latency nice do anything for select_idle_core and select_idle_smt? I guess principle the same bias can be used at different levels, maybe with different mappings. Doing it for select_idle_core will have the issue that the dynamic flag (whether an idle core is present or not) can only be updated by threads which are doing the full search. Thanks, Subhra In the mobile world use-case we will likely use it only to switch from select_idle_sibling to the energy aware slow path. And perhaps to see if we can bias the wakeup preemption granularity. Best, Patrick
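Purely to illustrate the "map latency-nice to a fraction of the LLC scanned" idea being discussed (hypothetical; there is no agreed interface, and the latency_nice range and field assumed here are not from any posted patch):

/*
 * Hypothetical mapping only, not a proposal from this thread's patches:
 * scale the select_idle_cpu() scan limit by a per-task latency_nice in
 * [-20, 19], so a latency-sensitive task (-20) scans the whole LLC while
 * a latency-tolerant one (19) scans roughly one core's worth of CPUs.
 */
static int sis_scan_limit(int llc_weight, int smt_weight, int latency_nice)
{
	/* fraction of the LLC we are willing to scan: 100% down to ~3% */
	int pct = 100 - ((latency_nice + 20) * 100) / 40;
	int nr = (llc_weight * pct) / 100;

	/* never scan less than one core's worth of CPUs */
	return max(nr, smt_weight);
}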
[PATCH v3 5/7] sched: SIS_CORE to disable idle core search
Use SIS_CORE to disable idle core search. For some workloads select_idle_core becomes a scalability bottleneck, removing it improves throughput. Also there are workloads where disabling it can hurt latency, so need to have an option. Signed-off-by: subhra mazumdar --- kernel/sched/fair.c | 8 +--- 1 file changed, 5 insertions(+), 3 deletions(-) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index c1ca88e..6a74808 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -6280,9 +6280,11 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target) if (!sd) return target; - i = select_idle_core(p, sd, target); - if ((unsigned)i < nr_cpumask_bits) - return i; + if (sched_feat(SIS_CORE)) { + i = select_idle_core(p, sd, target); + if ((unsigned)i < nr_cpumask_bits) + return i; + } i = select_idle_cpu(p, sd, target); if ((unsigned)i < nr_cpumask_bits) -- 2.9.3
[PATCH v3 1/7] sched: limit cpu search in select_idle_cpu
Put upper and lower limit on cpu search of select_idle_cpu. The lower limit is amount of cpus in a core while upper limit is twice that. This ensures for any architecture we will usually search beyond a core. The upper limit also helps in keeping the search cost low and constant. Signed-off-by: subhra mazumdar --- kernel/sched/fair.c | 15 +++ 1 file changed, 11 insertions(+), 4 deletions(-) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index f35930f..b58f08f 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -6188,7 +6188,7 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t u64 avg_cost, avg_idle; u64 time, cost; s64 delta; - int cpu, nr = INT_MAX; + int cpu, limit, floor, nr = INT_MAX; this_sd = rcu_dereference(*this_cpu_ptr(&sd_llc)); if (!this_sd) @@ -6206,10 +6206,17 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t if (sched_feat(SIS_PROP)) { u64 span_avg = sd->span_weight * avg_idle; - if (span_avg > 4*avg_cost) + floor = cpumask_weight(topology_sibling_cpumask(target)); + if (floor < 2) + floor = 2; + limit = floor << 1; + if (span_avg > floor*avg_cost) { nr = div_u64(span_avg, avg_cost); - else - nr = 4; + if (nr > limit) + nr = limit; + } else { + nr = floor; + } } time = local_clock(); -- 2.9.3
[PATCH v3 7/7] sched: use per-cpu variable cpumask_weight_sibling
Use per-cpu var cpumask_weight_sibling for quick lookup in select_idle_cpu. This is the fast path of scheduler and every cycle is worth saving. Usage of cpumask_weight can result in iterations. Signed-off-by: subhra mazumdar --- kernel/sched/fair.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 6a74808..878f11c 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -6206,7 +6206,7 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t if (sched_feat(SIS_PROP)) { u64 span_avg = sd->span_weight * avg_idle; - floor = cpumask_weight(topology_sibling_cpumask(target)); + floor = topology_sibling_weight(target); if (floor < 2) floor = 2; limit = floor << 1; -- 2.9.3
[PATCH v3 2/7] sched: introduce per-cpu var next_cpu to track search limit
Introduce a per-cpu variable to track the limit upto which idle cpu search was done in select_idle_cpu(). This will help to start the search next time from there. This is necessary for rotating the search window over entire LLC domain. Signed-off-by: subhra mazumdar --- kernel/sched/core.c | 2 ++ kernel/sched/sched.h | 1 + 2 files changed, 3 insertions(+) diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 874c427..80657fc 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -24,6 +24,7 @@ #include DEFINE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues); +DEFINE_PER_CPU_SHARED_ALIGNED(int, next_cpu); #if defined(CONFIG_SCHED_DEBUG) && defined(CONFIG_JUMP_LABEL) /* @@ -5966,6 +5967,7 @@ void __init sched_init(void) for_each_possible_cpu(i) { struct rq *rq; + per_cpu(next_cpu, i) = -1; rq = cpu_rq(i); raw_spin_lock_init(&rq->lock); rq->nr_running = 0; diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index b52ed1a..4cecfa2 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -994,6 +994,7 @@ static inline void update_idle_core(struct rq *rq) { } #endif DECLARE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues); +DECLARE_PER_CPU_SHARED_ALIGNED(int, next_cpu); #define cpu_rq(cpu)(&per_cpu(runqueues, (cpu))) #define this_rq() this_cpu_ptr(&runqueues) -- 2.9.3
[PATCH v3 3/7] sched: rotate the cpu search window for better spread
Rotate the cpu search window for better spread of threads. This will ensure an idle cpu will quickly be found if one exists. Signed-off-by: subhra mazumdar --- kernel/sched/fair.c | 10 -- 1 file changed, 8 insertions(+), 2 deletions(-) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index b58f08f..c1ca88e 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -6188,7 +6188,7 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t u64 avg_cost, avg_idle; u64 time, cost; s64 delta; - int cpu, limit, floor, nr = INT_MAX; + int cpu, limit, floor, target_tmp, nr = INT_MAX; this_sd = rcu_dereference(*this_cpu_ptr(&sd_llc)); if (!this_sd) @@ -6219,9 +6219,15 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t } } + if (per_cpu(next_cpu, target) != -1) + target_tmp = per_cpu(next_cpu, target); + else + target_tmp = target; + time = local_clock(); - for_each_cpu_wrap(cpu, sched_domain_span(sd), target) { + for_each_cpu_wrap(cpu, sched_domain_span(sd), target_tmp) { + per_cpu(next_cpu, target) = cpu; if (!--nr) return -1; if (!cpumask_test_cpu(cpu, &p->cpus_allowed)) -- 2.9.3
[PATCH v3 0/7] Improve scheduler scalability for fast path
942.82 93.33 (0.71%) 1.24 75 124 2.13 122.67 (1.08%) 1.7 90 152 1.74 149.33 (1.75%) 2.35 95 171 2.11 167 (2.34%) 2.74 99 512.67104.96 206 (59.82%)8.86 99.52296 82.553121.67 (-35.96%) 97.37 99.912517.33 2.38 12592 (-0.6%) 1.67 Changes from v2->v3: -Use shift operator instead of multiplication to compute limit -Use per-CPU variable to precompute the number of sibling SMTs for x86 subhra mazumdar (7): sched: limit cpu search in select_idle_cpu sched: introduce per-cpu var next_cpu to track search limit sched: rotate the cpu search window for better spread sched: add sched feature to disable idle core search sched: SIS_CORE to disable idle core search x86/smpboot: introduce per-cpu variable for HT siblings sched: use per-cpu variable cpumask_weight_sibling arch/x86/include/asm/smp.h | 1 + arch/x86/include/asm/topology.h | 1 + arch/x86/kernel/smpboot.c | 17 - include/linux/topology.h| 4 kernel/sched/core.c | 2 ++ kernel/sched/fair.c | 31 +++ kernel/sched/features.h | 1 + kernel/sched/sched.h| 1 + 8 files changed, 49 insertions(+), 9 deletions(-) -- 2.9.3
[PATCH v3 6/7] x86/smpboot: introduce per-cpu variable for HT siblings
Introduce a per-cpu variable to keep the number of HT siblings of a cpu. This will be used for quick lookup in select_idle_cpu to determine the limits of search. This patch does it only for x86. Signed-off-by: subhra mazumdar --- arch/x86/include/asm/smp.h | 1 + arch/x86/include/asm/topology.h | 1 + arch/x86/kernel/smpboot.c | 17 - include/linux/topology.h| 4 4 files changed, 22 insertions(+), 1 deletion(-) diff --git a/arch/x86/include/asm/smp.h b/arch/x86/include/asm/smp.h index da545df..1e90cbd 100644 --- a/arch/x86/include/asm/smp.h +++ b/arch/x86/include/asm/smp.h @@ -22,6 +22,7 @@ extern int smp_num_siblings; extern unsigned int num_processors; DECLARE_PER_CPU_READ_MOSTLY(cpumask_var_t, cpu_sibling_map); +DECLARE_PER_CPU_READ_MOSTLY(unsigned int, cpumask_weight_sibling); DECLARE_PER_CPU_READ_MOSTLY(cpumask_var_t, cpu_core_map); /* cpus sharing the last level cache: */ DECLARE_PER_CPU_READ_MOSTLY(cpumask_var_t, cpu_llc_shared_map); diff --git a/arch/x86/include/asm/topology.h b/arch/x86/include/asm/topology.h index 453cf38..dd19c71 100644 --- a/arch/x86/include/asm/topology.h +++ b/arch/x86/include/asm/topology.h @@ -111,6 +111,7 @@ extern const struct cpumask *cpu_coregroup_mask(int cpu); #ifdef CONFIG_SMP #define topology_core_cpumask(cpu) (per_cpu(cpu_core_map, cpu)) #define topology_sibling_cpumask(cpu) (per_cpu(cpu_sibling_map, cpu)) +#define topology_sibling_weight(cpu) (per_cpu(cpumask_weight_sibling, cpu)) extern unsigned int __max_logical_packages; #define topology_max_packages()(__max_logical_packages) diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c index 362dd89..20bf676 100644 --- a/arch/x86/kernel/smpboot.c +++ b/arch/x86/kernel/smpboot.c @@ -85,6 +85,10 @@ DEFINE_PER_CPU_READ_MOSTLY(cpumask_var_t, cpu_sibling_map); EXPORT_PER_CPU_SYMBOL(cpu_sibling_map); +/* representing number of HT siblings of each CPU */ +DEFINE_PER_CPU_READ_MOSTLY(unsigned int, cpumask_weight_sibling); +EXPORT_PER_CPU_SYMBOL(cpumask_weight_sibling); + /* representing HT and core siblings of each logical CPU */ DEFINE_PER_CPU_READ_MOSTLY(cpumask_var_t, cpu_core_map); EXPORT_PER_CPU_SYMBOL(cpu_core_map); @@ -520,6 +524,8 @@ void set_cpu_sibling_map(int cpu) if (!has_mp) { cpumask_set_cpu(cpu, topology_sibling_cpumask(cpu)); + per_cpu(cpumask_weight_sibling, cpu) = + cpumask_weight(topology_sibling_cpumask(cpu)); cpumask_set_cpu(cpu, cpu_llc_shared_mask(cpu)); cpumask_set_cpu(cpu, topology_core_cpumask(cpu)); c->booted_cores = 1; @@ -529,8 +535,12 @@ void set_cpu_sibling_map(int cpu) for_each_cpu(i, cpu_sibling_setup_mask) { o = &cpu_data(i); - if ((i == cpu) || (has_smt && match_smt(c, o))) + if ((i == cpu) || (has_smt && match_smt(c, o))) { link_mask(topology_sibling_cpumask, cpu, i); + threads = cpumask_weight(topology_sibling_cpumask(cpu)); + per_cpu(cpumask_weight_sibling, cpu) = threads; + per_cpu(cpumask_weight_sibling, i) = threads; + } if ((i == cpu) || (has_mp && match_llc(c, o))) link_mask(cpu_llc_shared_mask, cpu, i); @@ -1173,6 +1183,8 @@ static __init void disable_smp(void) else physid_set_mask_of_physid(0, &phys_cpu_present_map); cpumask_set_cpu(0, topology_sibling_cpumask(0)); + per_cpu(cpumask_weight_sibling, 0) = + cpumask_weight(topology_sibling_cpumask(0)); cpumask_set_cpu(0, topology_core_cpumask(0)); } @@ -1482,6 +1494,8 @@ static void remove_siblinginfo(int cpu) for_each_cpu(sibling, topology_core_cpumask(cpu)) { cpumask_clear_cpu(cpu, topology_core_cpumask(sibling)); + per_cpu(cpumask_weight_sibling, sibling) = + 
cpumask_weight(topology_sibling_cpumask(sibling)); /*/ * last thread sibling in this cpu core going down */ @@ -1495,6 +1509,7 @@ static void remove_siblinginfo(int cpu) cpumask_clear_cpu(cpu, cpu_llc_shared_mask(sibling)); cpumask_clear(cpu_llc_shared_mask(cpu)); cpumask_clear(topology_sibling_cpumask(cpu)); + per_cpu(cpumask_weight_sibling, cpu) = 0; cpumask_clear(topology_core_cpumask(cpu)); c->cpu_core_id = 0; c->booted_cores = 0; diff --git a/include/linux/topology.h b/include/linux/topology.h index cb0775e..a85aea1 100644 --- a/include/linux/topology.h +++ b/include/linux/topology.h @@ -190,6 +190,10 @@ static inline int cpu_to_mem(int cpu) #ifndef topology_sibling_cpumask #define topology_sibling_cpumask(cpu) cpumask_of(cpu) #endif +#ifndef topology_sibling_we
[PATCH v3 4/7] sched: add sched feature to disable idle core search
Add a new sched feature SIS_CORE to have an option to disable idle core search (select_idle_core). Signed-off-by: subhra mazumdar --- kernel/sched/features.h | 1 + 1 file changed, 1 insertion(+) diff --git a/kernel/sched/features.h b/kernel/sched/features.h index 858589b..de4d506 100644 --- a/kernel/sched/features.h +++ b/kernel/sched/features.h @@ -57,6 +57,7 @@ SCHED_FEAT(TTWU_QUEUE, true) */ SCHED_FEAT(SIS_AVG_CPU, false) SCHED_FEAT(SIS_PROP, true) +SCHED_FEAT(SIS_CORE, true) /* * Issue a WARN when we do multiple update_rq_clock() calls -- 2.9.3
Re: [PATCH v3 6/7] x86/smpboot: introduce per-cpu variable for HT siblings
On 6/26/19 11:51 PM, Thomas Gleixner wrote: On Wed, 26 Jun 2019, subhra mazumdar wrote: Introduce a per-cpu variable to keep the number of HT siblings of a cpu. This will be used for quick lookup in select_idle_cpu to determine the limits of search. Why? The number of siblings is constant at least today unless you play silly cpu hotplug games. A bit more justification for adding yet another random storage would be appreciated. Using cpumask_weight every time in select_idle_cpu to compute the no. of SMT siblings can be costly as cpumask_weight may not be O(1) for systems with large no. of CPUs (e.g 8 socket, each socket having lots of cores). Over 512 CPUs the bitmask will span multiple cache lines and touching multiple cache lines in the fast path of scheduler can cost more than we save from this optimization. Even in single cache line it loops in longs. We want to touch O(1) cache lines and do O(1) operations, hence pre-compute it in per-CPU variable. This patch does it only for x86. # grep 'This patch' Documentation/process/submitting-patches.rst IOW, we all know already that this is a patch and from the subject prefix and the diffstat it's pretty obvious that this is x86 only. So instead of documenting the obvious, please add proper context to justify the change. Ok. The extra per-CPU optimization was done only for x86 as we cared about it the most and make it future proof. I will add for other architectures. +/* representing number of HT siblings of each CPU */ +DEFINE_PER_CPU_READ_MOSTLY(unsigned int, cpumask_weight_sibling); +EXPORT_PER_CPU_SYMBOL(cpumask_weight_sibling); Why does this need an export? No module has any reason to access this. I will remove it /* representing HT and core siblings of each logical CPU */ DEFINE_PER_CPU_READ_MOSTLY(cpumask_var_t, cpu_core_map); EXPORT_PER_CPU_SYMBOL(cpu_core_map); @@ -520,6 +524,8 @@ void set_cpu_sibling_map(int cpu) if (!has_mp) { cpumask_set_cpu(cpu, topology_sibling_cpumask(cpu)); + per_cpu(cpumask_weight_sibling, cpu) = + cpumask_weight(topology_sibling_cpumask(cpu)); cpumask_set_cpu(cpu, cpu_llc_shared_mask(cpu)); cpumask_set_cpu(cpu, topology_core_cpumask(cpu)); c->booted_cores = 1; @@ -529,8 +535,12 @@ void set_cpu_sibling_map(int cpu) for_each_cpu(i, cpu_sibling_setup_mask) { o = &cpu_data(i); - if ((i == cpu) || (has_smt && match_smt(c, o))) + if ((i == cpu) || (has_smt && match_smt(c, o))) { link_mask(topology_sibling_cpumask, cpu, i); + threads = cpumask_weight(topology_sibling_cpumask(cpu)); + per_cpu(cpumask_weight_sibling, cpu) = threads; + per_cpu(cpumask_weight_sibling, i) = threads; This only works for SMT=2, but fails to update the rest for SMT=4. I guess I assumed that x86 will always be SMT2, will fix this. Thanks, Subhra @@ -1482,6 +1494,8 @@ static void remove_siblinginfo(int cpu) for_each_cpu(sibling, topology_core_cpumask(cpu)) { cpumask_clear_cpu(cpu, topology_core_cpumask(sibling)); + per_cpu(cpumask_weight_sibling, sibling) = + cpumask_weight(topology_sibling_cpumask(sibling)); While remove does the right thing. Thanks, tglx
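To make the cost argument above concrete: cpumask_weight() ends up in bitmap_weight(), which is roughly the loop below (paraphrased from the generic lib/bitmap implementation, not the exact upstream source), i.e. one popcount per word of the cpumask, versus a single per-CPU load for the precomputed sibling count:

/* Roughly what cpumask_weight(mask) boils down to: */
static int approx_bitmap_weight(const unsigned long *bits, unsigned int nbits)
{
	unsigned int k;
	int w = 0;

	for (k = 0; k < nbits / BITS_PER_LONG; k++)
		w += hweight_long(bits[k]);		/* one popcount per word */

	if (nbits % BITS_PER_LONG)
		w += hweight_long(bits[k] & BITMAP_LAST_WORD_MASK(nbits));

	return w;
}

/* versus the O(1) lookup this series adds:
 *	floor = per_cpu(cpumask_weight_sibling, target);
 */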
Re: [PATCH v3 6/7] x86/smpboot: introduce per-cpu variable for HT siblings
On 6/26/19 11:54 PM, Thomas Gleixner wrote: On Thu, 27 Jun 2019, Thomas Gleixner wrote: On Wed, 26 Jun 2019, subhra mazumdar wrote: Introduce a per-cpu variable to keep the number of HT siblings of a cpu. This will be used for quick lookup in select_idle_cpu to determine the limits of search. Why? The number of siblings is constant at least today unless you play silly cpu hotplug games. A bit more justification for adding yet another random storage would be appreciated. This patch does it only for x86. # grep 'This patch' Documentation/process/submitting-patches.rst IOW, we all know already that this is a patch and from the subject prefix and the diffstat it's pretty obvious that this is x86 only. So instead of documenting the obvious, please add proper context to justify the change. Aside of that the right ordering is to introduce the default fallback in a separate patch, which explains the reasoning and then in the next one add the x86 optimized version. OK. I will also add the extra optimization for other architectures. Thanks, Subhra Thanks, tglx
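The "default fallback first, arch optimization second" ordering could look roughly like the sketch below: a generic helper every architecture gets for free, which x86 then overrides with the pre-computed per-CPU value. The helper name matches the topology_sibling_weight mentioned later in this thread, but the exact form here is illustrative, not the posted patches.

/* include/linux/topology.h -- illustrative generic fallback */
#ifndef topology_sibling_weight
#define topology_sibling_weight(cpu) \
        cpumask_weight(topology_sibling_cpumask(cpu))
#endif

/* arch/x86/include/asm/topology.h -- illustrative x86 override */
#define topology_sibling_weight(cpu) \
        per_cpu(cpumask_weight_sibling, cpu)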
Re: [PATCH v3 3/7] sched: rotate the cpu search window for better spread
On 6/28/19 11:36 AM, Parth Shah wrote: Hi Subhra, I ran your patch series on IBM POWER systems and this is what I have observed. On 6/27/19 6:59 AM, subhra mazumdar wrote: Rotate the cpu search window for better spread of threads. This will ensure an idle cpu will quickly be found if one exists. Signed-off-by: subhra mazumdar --- kernel/sched/fair.c | 10 -- 1 file changed, 8 insertions(+), 2 deletions(-) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index b58f08f..c1ca88e 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -6188,7 +6188,7 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t u64 avg_cost, avg_idle; u64 time, cost; s64 delta; - int cpu, limit, floor, nr = INT_MAX; + int cpu, limit, floor, target_tmp, nr = INT_MAX; this_sd = rcu_dereference(*this_cpu_ptr(&sd_llc)); if (!this_sd) @@ -6219,9 +6219,15 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t } } + if (per_cpu(next_cpu, target) != -1) + target_tmp = per_cpu(next_cpu, target); + else + target_tmp = target; + time = local_clock(); - for_each_cpu_wrap(cpu, sched_domain_span(sd), target) { + for_each_cpu_wrap(cpu, sched_domain_span(sd), target_tmp) { + per_cpu(next_cpu, target) = cpu; This leads to a problem of cache hotness. AFAIU, in most cases, `target = prev_cpu` of the task being woken up, and selecting an idle CPU nearest to the prev_cpu is favorable. But since this doesn't keep track of the last idle cpu per task, it fails to find the nearest possible idle CPU in cases where the task is woken up after other tasks have been scheduled. I had tested hackbench on SPARC SMT8 (see numbers in the cover letter) and it showed improvement with this. Firstly, it's a tradeoff between cache effects and the time spent searching for an idle CPU, and both the x86 and SPARC numbers showed the tradeoff is worth it. Secondly, there is a lot of cache-affinity logic at the beginning of select_idle_sibling. If select_idle_cpu is still called, that means we are past that and want any idle cpu. I don't think waking up close to the prev cpu is the intention of starting the search from there; rather it is to spread threads across all cpus so that no cpu gets victimized, as there is no atomicity. The prev cpu just acts as a good seed for the spreading. Thanks, Subhra
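For readers skimming the thread, the mechanism under discussion condenses to roughly the following. This is a sketch of the posted hunks with the loop body elided; the per-CPU initialization is shown inline for brevity and may be done differently in the series.

DEFINE_PER_CPU(int, next_cpu) = -1;     /* per-CPU search cursor, -1 = not set yet */

static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int target)
{
        int cpu, target_tmp;

        /* Resume the wrapped walk where the previous search that started from
         * 'target' left off, instead of always restarting at 'target'. */
        target_tmp = per_cpu(next_cpu, target);
        if (target_tmp == -1)
                target_tmp = target;

        for_each_cpu_wrap(cpu, sched_domain_span(sd), target_tmp) {
                per_cpu(next_cpu, target) = cpu;   /* remember how far we got */
                /* ... existing budget and idleness checks ... */
        }
        /* ... */
}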
Re: [PATCH v3 1/7] sched: limit cpu search in select_idle_cpu
On 6/28/19 11:47 AM, Parth Shah wrote: On 6/27/19 6:59 AM, subhra mazumdar wrote: Put upper and lower limits on the cpu search of select_idle_cpu. The lower limit is the number of cpus in a core while the upper limit is twice that. This ensures that for any architecture we will usually search beyond a core. The upper limit also helps in keeping the search cost low and constant. Signed-off-by: subhra mazumdar --- kernel/sched/fair.c | 15 +++ 1 file changed, 11 insertions(+), 4 deletions(-) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index f35930f..b58f08f 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -6188,7 +6188,7 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t u64 avg_cost, avg_idle; u64 time, cost; s64 delta; - int cpu, nr = INT_MAX; + int cpu, limit, floor, nr = INT_MAX; this_sd = rcu_dereference(*this_cpu_ptr(&sd_llc)); if (!this_sd) @@ -6206,10 +6206,17 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t if (sched_feat(SIS_PROP)) { u64 span_avg = sd->span_weight * avg_idle; - if (span_avg > 4*avg_cost) + floor = cpumask_weight(topology_sibling_cpumask(target)); + if (floor < 2) + floor = 2; + limit = floor << 1; Is the upper limit an experimental value only, or does it have any arch-specific significance? Because, AFAIU, systems like POWER9 might benefit from searching across 4 cores due to their different cache model. So it could be tuned for arch-specific builds then. The lower bound and upper bound were 1 core and 2 cores respectively. That is done so as to search beyond one core and at the same time not search too much. It is a heuristic that seemed to work well on all archs, coupled with the moving-window mechanism. Does 4 vs 2 make any difference on your POWER9? AFAIR it didn't on SPARC SMT8. Also, variable names can be changed for better readability: floor -> weight_clamp_min, limit -> weight_clamp_max, or something similar. OK. Thanks, Subhra + if (span_avg > floor*avg_cost) { nr = div_u64(span_avg, avg_cost); - else - nr = 4; + if (nr > limit) + nr = limit; + } else { + nr = floor; + } } time = local_clock(); Best, Parth
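Condensed, the clamping being discussed reduces to the following. This is a sketch of the posted hunk, not a drop-in replacement; min_t() is used here purely for compactness.

/* Search at least one core's worth of CPUs and at most two cores' worth
 * (2..4 CPUs on SMT2, 8..16 CPUs on SMT8). */
floor = cpumask_weight(topology_sibling_cpumask(target));
if (floor < 2)
        floor = 2;              /* non-SMT: still look at a couple of CPUs */
limit = floor << 1;

if (span_avg > floor * avg_cost)
        nr = min_t(u64, div_u64(span_avg, avg_cost), limit);
else
        nr = floor;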
Re: [PATCH v3 5/7] sched: SIS_CORE to disable idle core search
On 6/28/19 12:01 PM, Parth Shah wrote: On 6/27/19 6:59 AM, subhra mazumdar wrote: Use SIS_CORE to disable idle core search. For some workloads select_idle_core becomes a scalability bottleneck, and removing it improves throughput. There are also workloads where disabling it can hurt latency, so we need to have an option. Signed-off-by: subhra mazumdar --- kernel/sched/fair.c | 8 +--- 1 file changed, 5 insertions(+), 3 deletions(-) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index c1ca88e..6a74808 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -6280,9 +6280,11 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target) if (!sd) return target; - i = select_idle_core(p, sd, target); - if ((unsigned)i < nr_cpumask_bits) - return i; + if (sched_feat(SIS_CORE)) { + i = select_idle_core(p, sd, target); + if ((unsigned)i < nr_cpumask_bits) + return i; + } This can cause significant performance loss if disabled. select_idle_core spreads workloads quickly across the cores; disabling it leaves much of that work to the load balancer, which has to move tasks across cores. Latency-sensitive and long-running multi-threaded workloads should see a regression under these conditions. Yes, in the case of SPARC SMT8 I did notice that (see cover letter). That's why it is a feature that is ON by default, but can be turned OFF for specific workloads on x86 SMT2 that can benefit from it. Also, systems like POWER9 have sd_llc as a pair of cores only. So they won't benefit from the limits, and hence hiding your code in select_idle_cpu behind static keys would be much preferred. If it doesn't hurt, then I don't see the point. Thanks, Subhra
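For context, the static-key alternative alluded to here would look roughly like the sketch below. The key name and placement are hypothetical, not something posted in this series.

/* Hypothetical static-key gate (illustrative only, not the posted patch). */
DEFINE_STATIC_KEY_TRUE(sched_sis_core);

/* ... in select_idle_sibling() ... */
if (static_branch_likely(&sched_sis_core)) {
        i = select_idle_core(p, sd, target);
        if ((unsigned)i < nr_cpumask_bits)
                return i;
}

That said, in many configurations sched_feat() is itself backed by jump labels, so the practical difference between the two forms may be small.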
Re: [PATCH v3 3/7] sched: rotate the cpu search window for better spread
On 6/28/19 4:54 AM, Srikar Dronamraju wrote: * subhra mazumdar [2019-06-26 18:29:15]: Rotate the cpu search window for better spread of threads. This will ensure an idle cpu will quickly be found if one exists. While rotating the cpu search window is good, I am not sure if this can find an idle cpu quickly. The probability of finding an idle cpu should still remain the same. No? Signed-off-by: subhra mazumdar --- kernel/sched/fair.c | 10 -- 1 file changed, 8 insertions(+), 2 deletions(-) @@ -6219,9 +6219,15 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t } } + if (per_cpu(next_cpu, target) != -1) + target_tmp = per_cpu(next_cpu, target); + else + target_tmp = target; + time = local_clock(); - for_each_cpu_wrap(cpu, sched_domain_span(sd), target) { + for_each_cpu_wrap(cpu, sched_domain_span(sd), target_tmp) { + per_cpu(next_cpu, target) = cpu; Shouldn't this assignment be outside the for loop? With the current code, 1. We keep reassigning multiple times. 2. The last assignment happens for an idle cpu, and sometimes the assignment is for a non-idle cpu. We want the last assignment irrespective of whether it was an idle cpu or not, since in both cases we want to track the boundary of the search. Thanks, Subhra
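To make the boundary-tracking argument concrete, here is a hedged sketch; the loop body paraphrases the mainline select_idle_cpu() of that era plus the posted store, and is not an exact hunk from this series.

for_each_cpu_wrap(cpu, sched_domain_span(sd), target_tmp) {
        per_cpu(next_cpu, target) = cpu;   /* record how far the walk got */
        if (!--nr)
                return -1;                 /* budget exhausted: this early return is
                                            * why a single post-loop store would miss
                                            * the boundary */
        if (available_idle_cpu(cpu))
                break;                     /* found an idle cpu; boundary already saved */
}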
Re: [PATCH V3 2/2] sched/fair: Fallback to sched-idle CPU if idle CPU isn't found
On 6/25/19 10:06 PM, Viresh Kumar wrote: We try to find an idle CPU to run the next task, but in case we don't find an idle CPU it is better to pick a CPU which will run the task the soonest, for performance reasons. A CPU which isn't idle but has only SCHED_IDLE activity queued on it should be a good target based on this criterion, as any normal fair task will most likely preempt the currently running SCHED_IDLE task immediately. In fact, choosing a SCHED_IDLE CPU over a fully idle one shall give better results as it should be able to run the task sooner than an idle CPU (which needs to be woken up from an idle state). This patch updates both fast and slow paths with this optimization. Signed-off-by: Viresh Kumar --- kernel/sched/fair.c | 43 +-- 1 file changed, 33 insertions(+), 10 deletions(-) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 1277adc3e7ed..2e0527fd468c 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -5376,6 +5376,15 @@ static struct { #endif /* CONFIG_NO_HZ_COMMON */ +/* CPU only has SCHED_IDLE tasks enqueued */ +static int sched_idle_cpu(int cpu) +{ + struct rq *rq = cpu_rq(cpu); + + return unlikely(rq->nr_running == rq->cfs.idle_h_nr_running && + rq->nr_running); +} + Shouldn't this check if rq->curr is also sched idle? And why not drop the rq->nr_running non-zero check?
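To illustrate what dropping the non-zero check would mean (my reading of the question, not something stated in the thread): on a fully idle CPU both counters are zero, so the bare equality would also classify idle CPUs as "sched-idle". A sketch of that variant:

/* Illustrative variant only, not from the posted patch: with the
 * rq->nr_running check dropped, a fully idle CPU (0 == 0) also passes,
 * so the helper would mean "idle, or busy only with SCHED_IDLE tasks"
 * rather than "busy, but only with SCHED_IDLE tasks". */
static int sched_idle_or_idle_cpu(int cpu)
{
        struct rq *rq = cpu_rq(cpu);

        return unlikely(rq->nr_running == rq->cfs.idle_h_nr_running);
}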
Re: [PATCH v3 5/7] sched: SIS_CORE to disable idle core search
Also, systems like POWER9 have sd_llc as a pair of cores only. So they won't benefit from the limits, and hence hiding your code in select_idle_cpu behind static keys would be much preferred. If it doesn't hurt, then I don't see the point. So these are the results from a POWER9 system with your patches:

System configuration: 2 sockets, 44 cores, 176 CPUs

Experiment setup:
=================
=> Setup 1:
- 44 tasks doing just while(1), this is to make select_idle_core return -1 most times
- perf bench sched messaging -g 1 -l 100

+----------+--------+---------------+--------+
| Baseline | stddev | Patch         | stddev |
+----------+--------+---------------+--------+
| 135      | 3.21   | 158 (-17.03%) | 4.69   |
+----------+--------+---------------+--------+

=> Setup 2:
- schbench -m44 -t 1

+=======+==========+========+=======+========+
| %ile  | Baseline | stddev | patch | stddev |
+=======+==========+========+=======+========+
| 50    | 10       | 3.49   | 10    | 2.29   |
+-------+----------+--------+-------+--------+
| 95    | 467      | 4.47   | 469   | 0.81   |
+-------+----------+--------+-------+--------+
| 99    | 571      | 21.32  | 584   | 18.69  |
+-------+----------+--------+-------+--------+
| 99.5  | 629      | 30.05  | 641   | 20.95  |
+-------+----------+--------+-------+--------+
| 99.9  | 780      | 40.38  | 773   | 44.2   |
+-------+----------+--------+-------+--------+

I guess it doesn't make much difference in the schbench results, but hackbench (perf bench) seems to have an observable regression. Best, Parth If POWER9 sd_llc has only 2 cores, the behavior shouldn't change much with the select_idle_cpu changes, as the limits are 1 and 2 cores. Previously the lower bound was 4 cpus and the upper bound was calculated by the SIS_PROP proportional logic; now it is 1 core (4 cpus on SMT4) and the upper bound is 2 cores. Could it be the extra computation of cpumask_weight causing the regression rather than the sliding window itself (one way to check would be to hardcode 4 in place of topology_sibling_weight)? Or is it the L1 cache coherency? I am a bit surprised, because SPARC SMT8, which has more cores in sd_llc and an L1 cache per core, showed improvement with hackbench. Thanks, Subhra
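The suggested diagnostic could be as small as the throwaway change sketched below. It is illustrative only and assumes the sibling-weight lookup from this series is what feeds the floor calculation in select_idle_cpu().

/* Throwaway diagnostic, not a real patch: hardcode the POWER9 SMT4 width so
 * the sibling-weight computation drops out of the picture, separating the
 * cost of the weight lookup from the effect of the rotating window. */
floor = 4;      /* was: cpumask_weight(topology_sibling_cpumask(target)) */
if (floor < 2)
        floor = 2;
limit = floor << 1;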