Re: [RFC PATCH v2] sched/fair: select idle cpu from idle cpumask in sched domain
On 2020/9/26 0:45, Vincent Guittot wrote:
> On Friday, 25 Sept 2020 at 17:21:46 (+0800), Li, Aubrey wrote:
>> Hi Vincent,
>>
>> On 2020/9/24 21:09, Vincent Guittot wrote:
>>
>> Would you mind sharing the uperf (netperf load) result on your side? That's
>> the workload where I have seen the most benefit from this patch under heavy
>> load.
>>
>>>>> with uperf, i've got the same kind of result as sched pipe
>>>>> tip/sched/core: Throughput 24.83Mb/s (+/- 0.09%)
>>>>> with this patch: Throughput 19.02Mb/s (+/- 0.71%) which is a 23%
>>>>> regression as for sched pipe
>>>>>
>>>> In case this is caused by the logic error in this patch (sorry again), did
>>>> you see any improvement with patch v2? Though it does not help for the
>>>> nohz=off case, I just want to know whether it helps at all on the arm
>>>> platform.
>>>
>>> With the v2, which rate limits the update of the cpumask (but doesn't
>>> support sched_idle tasks), I don't see any performance impact:
>>
>> I agree we should go the way with the cpumask update rate limited.
>>
>> And I think no performance impact for sched-pipe is expected, as this
>> workload has only 2 threads and the platform has 8 cores, so mostly the
>> previous cpu is returned, and even if select_idle_sibling is called,
>> select_idle_core is hit and select_idle_cpu is rarely called.
>
> my platform is not smt so select_idle_core is nop. Nevertheless
> select_idle_cpu is almost never called because prev is idle and selected
> before calling it in our case
>
>> But I'm more curious why there is a 23% performance penalty. So for this
>> patch, if you revert this change but keep the cpumask updated, is the 23%
>> penalty still there?
>>
>> -	cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
>> +	cpumask_and(cpus, sds_idle_cpus(sd->shared), p->cpus_ptr);
>
> I was about to say that reverting this line should not change anything
> because we never reach this point, but in fact it does.
> And after looking at a trace, I can see that the 2 threads of perf bench
> sched pipe are on the same CPU and that sds_idle_cpus(sd->shared) is
> always empty. In fact, rq->curr is not yet idle and still points to the
> cfs task when you call update_idle_cpumask().
> This means that once cleared, the bit will never be set.
> You can remove the test in update_idle_cpumask(), which is called either
> when entering idle or when there are only sched_idle tasks runnable.
>
> @@ -6044,8 +6044,7 @@ void update_idle_cpumask(struct rq *rq)
> 	sd = rcu_dereference(per_cpu(sd_llc, cpu));
> 	if (!sd || !sd->shared)
> 		goto unlock;
> -	if (!available_idle_cpu(cpu) || !sched_idle_cpu(cpu))
> -		goto unlock;
> +
> 	cpumask_set_cpu(cpu, sds_idle_cpus(sd->shared));
> unlock:
> 	rcu_read_unlock();
>
> With this fix, the performance decrease is only 2%
>
>> I just wonder if it's caused by the atomic ops, as you have two cache
>> domains with sd_llc(?). Do you have an x86 machine to make a comparison?
>> It's hard for me to find an ARM machine but I'll try.
>>
>> Also, for the uperf (task thread num = cpu num) workload, how is it on
>> patch v2? No performance impact at all?
>
> with v2 : Throughput 24.97Mb/s (+/- 0.07%) so there is no perf regression
>
Thanks Vincent, let me try to refine this patch.

-Aubrey
Re: [RFC PATCH v2] sched/fair: select idle cpu from idle cpumask in sched domain
On Friday, 25 Sept 2020 at 17:21:46 (+0800), Li, Aubrey wrote:
> Hi Vincent,
>
> On 2020/9/24 21:09, Vincent Guittot wrote:
>
> Would you mind sharing the uperf (netperf load) result on your side? That's
> the workload where I have seen the most benefit from this patch under heavy
> load.
>
>>>> with uperf, i've got the same kind of result as sched pipe
>>>> tip/sched/core: Throughput 24.83Mb/s (+/- 0.09%)
>>>> with this patch: Throughput 19.02Mb/s (+/- 0.71%) which is a 23%
>>>> regression as for sched pipe
>>>>
>>> In case this is caused by the logic error in this patch (sorry again), did
>>> you see any improvement with patch v2? Though it does not help for the
>>> nohz=off case, I just want to know whether it helps at all on the arm
>>> platform.
>>
>> With the v2, which rate limits the update of the cpumask (but doesn't
>> support sched_idle tasks), I don't see any performance impact:
>
> I agree we should go the way with the cpumask update rate limited.
>
> And I think no performance impact for sched-pipe is expected, as this
> workload has only 2 threads and the platform has 8 cores, so mostly the
> previous cpu is returned, and even if select_idle_sibling is called,
> select_idle_core is hit and select_idle_cpu is rarely called.

my platform is not smt so select_idle_core is nop. Nevertheless
select_idle_cpu is almost never called because prev is idle and selected
before calling it in our case

> But I'm more curious why there is a 23% performance penalty. So for this
> patch, if you revert this change but keep the cpumask updated, is the 23%
> penalty still there?
>
> -	cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
> +	cpumask_and(cpus, sds_idle_cpus(sd->shared), p->cpus_ptr);

I was about to say that reverting this line should not change anything
because we never reach this point, but in fact it does.

And after looking at a trace, I can see that the 2 threads of perf bench
sched pipe are on the same CPU and that sds_idle_cpus(sd->shared) is
always empty. In fact, rq->curr is not yet idle and still points to the
cfs task when you call update_idle_cpumask().
This means that once cleared, the bit will never be set.
You can remove the test in update_idle_cpumask(), which is called either
when entering idle or when there are only sched_idle tasks runnable.

@@ -6044,8 +6044,7 @@ void update_idle_cpumask(struct rq *rq)
	sd = rcu_dereference(per_cpu(sd_llc, cpu));
	if (!sd || !sd->shared)
		goto unlock;
-	if (!available_idle_cpu(cpu) || !sched_idle_cpu(cpu))
-		goto unlock;
+
	cpumask_set_cpu(cpu, sds_idle_cpus(sd->shared));
unlock:
	rcu_read_unlock();

With this fix, the performance decrease is only 2%

> I just wonder if it's caused by the atomic ops, as you have two cache
> domains with sd_llc(?). Do you have an x86 machine to make a comparison?
> It's hard for me to find an ARM machine but I'll try.
>
> Also, for the uperf (task thread num = cpu num) workload, how is it on
> patch v2? No performance impact at all?

with v2 : Throughput 24.97Mb/s (+/- 0.07%) so there is no perf regression

> Thanks,
> -Aubrey
Re: [RFC PATCH v2] sched/fair: select idle cpu from idle cpumask in sched domain
Hi Vincent,

On 2020/9/24 21:09, Vincent Guittot wrote:

Would you mind sharing the uperf (netperf load) result on your side? That's
the workload where I have seen the most benefit from this patch under heavy
load.

>>> with uperf, i've got the same kind of result as sched pipe
>>> tip/sched/core: Throughput 24.83Mb/s (+/- 0.09%)
>>> with this patch: Throughput 19.02Mb/s (+/- 0.71%) which is a 23%
>>> regression as for sched pipe
>>>
>> In case this is caused by the logic error in this patch (sorry again), did
>> you see any improvement with patch v2? Though it does not help for the
>> nohz=off case, I just want to know whether it helps at all on the arm
>> platform.
>
> With the v2, which rate limits the update of the cpumask (but doesn't
> support sched_idle tasks), I don't see any performance impact:

I agree we should go the way with the cpumask update rate limited.

And I think no performance impact for sched-pipe is expected, as this
workload has only 2 threads and the platform has 8 cores, so mostly the
previous cpu is returned, and even if select_idle_sibling is called,
select_idle_core is hit and select_idle_cpu is rarely called.

But I'm more curious why there is a 23% performance penalty. So for this
patch, if you revert this change but keep the cpumask updated, is the 23%
penalty still there?

-	cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
+	cpumask_and(cpus, sds_idle_cpus(sd->shared), p->cpus_ptr);

I just wonder if it's caused by the atomic ops, as you have two cache
domains with sd_llc(?). Do you have an x86 machine to make a comparison?
It's hard for me to find an ARM machine but I'll try.

Also, for the uperf (task thread num = cpu num) workload, how is it on
patch v2? No performance impact at all?

Thanks,
-Aubrey
Re: [RFC PATCH v2] sched/fair: select idle cpu from idle cpumask in sched domain
Hi Tim,

On Thu, 24 Sep 2020 at 18:37, Tim Chen wrote:
>
> On 9/22/20 12:14 AM, Vincent Guittot wrote:
>
>> And a quick test with hackbench on my octo cores arm64 gives for 12
>
> Vincent,
>
> Is it octo (=10) or octa (=8) cores on a single socket for your system?

it's 8 cores, and the cores are split in 2 cache domains

> The L2 is per core or there are multiple L2s shared among groups of cores?
>
> Wonder if placing the threads within an L2 or not within
> an L2 could cause differences seen with Aubrey's test.

I haven't checked recently but the 2 tasks involved in sched pipe run on
CPUs which belong to the same cache domain

Vincent

> Tim
Re: [RFC PATCH v2] sched/fair: select idle cpu from idle cpumask in sched domain
On Thu, Sep 24, 2020 at 10:43:12AM -0700, Tim Chen wrote:
>
> On 9/24/20 10:13 AM, Phil Auld wrote:
>> On Thu, Sep 24, 2020 at 09:37:33AM -0700, Tim Chen wrote:
>>>
>>> On 9/22/20 12:14 AM, Vincent Guittot wrote:
>>>
>>>> And a quick test with hackbench on my octo cores arm64 gives for 12
>>>
>>> Vincent,
>>>
>>> Is it octo (=10) or octa (=8) cores on a single socket for your system?
>>
>> In what Romance language does octo mean 10? :)
>
> Got confused by October, the tenth month. :)

It used to be the eighth month ;)

> Tim
>
>>> The L2 is per core or there are multiple L2s shared among groups of cores?
>>>
>>> Wonder if placing the threads within an L2 or not within
>>> an L2 could cause differences seen with Aubrey's test.
>>>
>>> Tim

--
Re: [RFC PATCH v2] sched/fair: select idle cpu from idle cpumask in sched domain
On 9/24/20 10:13 AM, Phil Auld wrote:
> On Thu, Sep 24, 2020 at 09:37:33AM -0700, Tim Chen wrote:
>>
>> On 9/22/20 12:14 AM, Vincent Guittot wrote:
>>
>>> And a quick test with hackbench on my octo cores arm64 gives for 12
>>
>> Vincent,
>>
>> Is it octo (=10) or octa (=8) cores on a single socket for your system?
>
> In what Romance language does octo mean 10? :)

Got confused by October, the tenth month. :)

Tim

>> The L2 is per core or there are multiple L2s shared among groups of cores?
>>
>> Wonder if placing the threads within an L2 or not within
>> an L2 could cause differences seen with Aubrey's test.
>>
>> Tim
Re: [RFC PATCH v2] sched/fair: select idle cpu from idle cpumask in sched domain
On Thu, Sep 24, 2020 at 09:37:33AM -0700, Tim Chen wrote:
>
> On 9/22/20 12:14 AM, Vincent Guittot wrote:
>
>> And a quick test with hackbench on my octo cores arm64 gives for 12
>
> Vincent,
>
> Is it octo (=10) or octa (=8) cores on a single socket for your system?

In what Romance language does octo mean 10? :)

> The L2 is per core or there are multiple L2s shared among groups of cores?
>
> Wonder if placing the threads within an L2 or not within
> an L2 could cause differences seen with Aubrey's test.
>
> Tim

--
Re: [RFC PATCH v2] sched/fair: select idle cpu from idle cpumask in sched domain
On 9/22/20 12:14 AM, Vincent Guittot wrote:

> And a quick test with hackbench on my octo cores arm64 gives for 12

Vincent,

Is it octo (=10) or octa (=8) cores on a single socket for your system?

The L2 is per core or there are multiple L2s shared among groups of cores?

Wonder if placing the threads within an L2 or not within
an L2 could cause differences seen with Aubrey's test.

Tim
Re: [RFC PATCH v2] sched/fair: select idle cpu from idle cpumask in sched domain
On Thu, 24 Sep 2020 at 05:04, Li, Aubrey wrote:
>
> On 2020/9/23 16:50, Vincent Guittot wrote:
>> On Wed, 23 Sep 2020 at 04:59, Li, Aubrey wrote:
>>>
>>> Hi Vincent,
>>>
>>> On 2020/9/22 15:14, Vincent Guittot wrote:
>>>> On Tue, 22 Sep 2020 at 05:33, Li, Aubrey wrote:
>>>>>
>>>>> On 2020/9/21 23:21, Vincent Guittot wrote:
>>>>>> On Mon, 21 Sep 2020 at 17:14, Vincent Guittot wrote:
>>>>>>>
>>>>>>> On Thu, 17 Sep 2020 at 11:21, Li, Aubrey wrote:
>>>>>>>>
>>>>>>>> On 2020/9/16 19:00, Mel Gorman wrote:
>>>>>>>>> On Wed, Sep 16, 2020 at 12:31:03PM +0800, Aubrey Li wrote:
>>>>>>>>>> Added idle cpumask to track idle cpus in sched domain. When a CPU
>>>>>>>>>> enters idle, its corresponding bit in the idle cpumask will be set,
>>>>>>>>>> and when the CPU exits idle, its bit will be cleared.
>>>>>>>>>>
>>>>>>>>>> When a task wakes up to select an idle cpu, scanning the idle
>>>>>>>>>> cpumask has lower cost than scanning all the cpus in the last
>>>>>>>>>> level cache domain, especially when the system is heavily loaded.
>>>>>>>>>>
>>>>>>>>>> The following benchmarks were tested on an x86 4 socket system with
>>>>>>>>>> 24 cores per socket and 2 hyperthreads per core, total 192 CPUs:
>>>>>>>>>>
>>>>>>>>> This still appears to be tied to turning the tick off. An idle CPU
>>>>>>>>> available for computation does not necessarily have the tick turned
>>>>>>>>> off if it's idle only for short periods of time. When nohz is
>>>>>>>>> disabled or a machine is active enough that CPUs are not disabling
>>>>>>>>> the tick, select_idle_cpu may fail to select an idle CPU and
>>>>>>>>> instead stack tasks on the old CPU.
>>>>>>>>>
>>>>>>>>> The other subtlety is that select_idle_sibling() currently allows a
>>>>>>>>> SCHED_IDLE cpu to be used as a wakeup target. The CPU is not really
>>>>>>>>> idle as such, it's simply running a low priority task that is
>>>>>>>>> suitable for preemption. I suspect this patch breaks that.
>>>>>>>>>
>>>>>>>> Thanks!
>>>>>>>>
>>>>>>>> I shall post a v3 with performance data. I made a quick uperf test
>>>>>>>> and found the benefit is still there. So I posted the patch here,
>>>>>>>> looking forward to your comments before I start the benchmarks.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> -Aubrey
>>>>>>>>
>>>>>>>> ---
>>>>>>>> diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
>>>>>>>> index fb11091129b3..43a641d26154 100644
>>>>>>>> --- a/include/linux/sched/topology.h
>>>>>>>> +++ b/include/linux/sched/topology.h
>>>>>>>> @@ -65,8 +65,21 @@ struct sched_domain_shared {
>>>>>>>> 	atomic_t	ref;
>>>>>>>> 	atomic_t	nr_busy_cpus;
>>>>>>>> 	int		has_idle_cores;
>>>>>>>> +	/*
>>>>>>>> +	 * Span of all idle CPUs in this domain.
>>>>>>>> +	 *
>>>>>>>> +	 * NOTE: this field is variable length. (Allocated dynamically
>>>>>>>> +	 * by attaching extra space to the end of the structure,
>>>>>>>> +	 * depending on how many CPUs the kernel has booted up with)
>>>>>>>> +	 */
>>>>>>>> +	unsigned long	idle_cpus_span[];
>>>>>>>> };
>>>>>>>>
>>>>>>>> +static inline struct cpumask *sds_idle_cpus(struct sched_domain_shared *sds)
>>>>>>>> +{
>>>>>>>> +	return to_cpumask(sds->idle_cpus_span);
>>>>>>>> +}
>>>>>>>> +
>>>>>>>> struct sched_domain {
>>>>>>>> 	/* These fields must be setup */
>>>>>>>> 	struct sched_domain __rcu *parent; /* top domain must be null terminated */
>>>>>>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>>>>>>> index 6b3b59cc51d6..9a3c82645472 100644
>>>>>>>> --- a/kernel/sched/fair.c
>>>>>>>> +++ b/kernel/sched/fair.c
>>>>>>>> @@ -6023,6 +6023,26 @@ void __update_idle_core(struct rq *rq)
>>>>>>>> 	rcu_read_unlock();
>>>>>>>> }
>>>>>>>>
>>>>>>>> +/*
>>>>>>>> + * Update cpu idle state and record this information
>>>>>>>> + * in sd_llc_shared->idle_cpus_span.
>>>>>>>> + */
>>>>>>>> +void update_idle_cpumask(struct rq *rq)
>>>>>>>> +{
>>>>>>>> +	struct sched_domain *sd;
>>>>>>>> +	int cpu = cpu_of(rq);
>>>>>>>> +
>>>>>>>> +	rcu_read_lock();
>>>>>>>> +	sd = rcu_dereference(per_cpu(sd_llc, cpu));
>>>>>>>> +	if (!sd || !sd->shared)
>>>>>>>> +		goto unlock;
>>>>>>>> +	if (!available_idle_cpu(cpu) || !sched_idle_cpu(cpu))
>>>>>>>> +		goto unlock;
>
> Oops, I realized I didn't send an update out to fix this while I fixed
> it locally. It should be
>
> 	if (!available_idle_cpu(cpu) && !sched_idle_cpu(cpu))

the fix doesn't change the perf results

> Sorry for this, Vincent, :(
>
>>>>>>>> +	cpumask_set_cpu(cpu, sds_idle_cpus(sd->shared));
>>>>>>>> +unlock:
>>>>>>>> +	rcu_read_unlock();
Re: [RFC PATCH v2] sched/fair: select idle cpu from idle cpumask in sched domain
On Wed, 23 Sep 2020 at 04:59, Li, Aubrey wrote:
>
> Hi Vincent,
>
> On 2020/9/22 15:14, Vincent Guittot wrote:
>> On Tue, 22 Sep 2020 at 05:33, Li, Aubrey wrote:
>>>
>>> On 2020/9/21 23:21, Vincent Guittot wrote:
>>>> On Mon, 21 Sep 2020 at 17:14, Vincent Guittot wrote:
>>>>>
>>>>> On Thu, 17 Sep 2020 at 11:21, Li, Aubrey wrote:
>>>>>>
>>>>>> On 2020/9/16 19:00, Mel Gorman wrote:
>>>>>>> On Wed, Sep 16, 2020 at 12:31:03PM +0800, Aubrey Li wrote:
>>>>>>>> Added idle cpumask to track idle cpus in sched domain. When a CPU
>>>>>>>> enters idle, its corresponding bit in the idle cpumask will be set,
>>>>>>>> and when the CPU exits idle, its bit will be cleared.
>>>>>>>>
>>>>>>>> When a task wakes up to select an idle cpu, scanning the idle
>>>>>>>> cpumask has lower cost than scanning all the cpus in the last
>>>>>>>> level cache domain, especially when the system is heavily loaded.
>>>>>>>>
>>>>>>>> The following benchmarks were tested on an x86 4 socket system with
>>>>>>>> 24 cores per socket and 2 hyperthreads per core, total 192 CPUs:
>>>>>>>>
>>>>>>> This still appears to be tied to turning the tick off. An idle CPU
>>>>>>> available for computation does not necessarily have the tick turned
>>>>>>> off if it's idle only for short periods of time. When nohz is
>>>>>>> disabled or a machine is active enough that CPUs are not disabling
>>>>>>> the tick, select_idle_cpu may fail to select an idle CPU and
>>>>>>> instead stack tasks on the old CPU.
>>>>>>>
>>>>>>> The other subtlety is that select_idle_sibling() currently allows a
>>>>>>> SCHED_IDLE cpu to be used as a wakeup target. The CPU is not really
>>>>>>> idle as such, it's simply running a low priority task that is
>>>>>>> suitable for preemption. I suspect this patch breaks that.
>>>>>>>
>>>>>> Thanks!
>>>>>>
>>>>>> I shall post a v3 with performance data. I made a quick uperf test
>>>>>> and found the benefit is still there. So I posted the patch here,
>>>>>> looking forward to your comments before I start the benchmarks.
>>>>>>
>>>>>> Thanks,
>>>>>> -Aubrey
>>>>>>
>>>>>> ---
>>>>>> diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
>>>>>> index fb11091129b3..43a641d26154 100644
>>>>>> --- a/include/linux/sched/topology.h
>>>>>> +++ b/include/linux/sched/topology.h
>>>>>> @@ -65,8 +65,21 @@ struct sched_domain_shared {
>>>>>> 	atomic_t	ref;
>>>>>> 	atomic_t	nr_busy_cpus;
>>>>>> 	int		has_idle_cores;
>>>>>> +	/*
>>>>>> +	 * Span of all idle CPUs in this domain.
>>>>>> +	 *
>>>>>> +	 * NOTE: this field is variable length. (Allocated dynamically
>>>>>> +	 * by attaching extra space to the end of the structure,
>>>>>> +	 * depending on how many CPUs the kernel has booted up with)
>>>>>> +	 */
>>>>>> +	unsigned long	idle_cpus_span[];
>>>>>> };
>>>>>>
>>>>>> +static inline struct cpumask *sds_idle_cpus(struct sched_domain_shared *sds)
>>>>>> +{
>>>>>> +	return to_cpumask(sds->idle_cpus_span);
>>>>>> +}
>>>>>> +
>>>>>> struct sched_domain {
>>>>>> 	/* These fields must be setup */
>>>>>> 	struct sched_domain __rcu *parent; /* top domain must be null terminated */
>>>>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>>>>> index 6b3b59cc51d6..9a3c82645472 100644
>>>>>> --- a/kernel/sched/fair.c
>>>>>> +++ b/kernel/sched/fair.c
>>>>>> @@ -6023,6 +6023,26 @@ void __update_idle_core(struct rq *rq)
>>>>>> 	rcu_read_unlock();
>>>>>> }
>>>>>>
>>>>>> +/*
>>>>>> + * Update cpu idle state and record this information
>>>>>> + * in sd_llc_shared->idle_cpus_span.
>>>>>> + */
>>>>>> +void update_idle_cpumask(struct rq *rq)
>>>>>> +{
>>>>>> +	struct sched_domain *sd;
>>>>>> +	int cpu = cpu_of(rq);
>>>>>> +
>>>>>> +	rcu_read_lock();
>>>>>> +	sd = rcu_dereference(per_cpu(sd_llc, cpu));
>>>>>> +	if (!sd || !sd->shared)
>>>>>> +		goto unlock;
>>>>>> +	if (!available_idle_cpu(cpu) || !sched_idle_cpu(cpu))
>>>>>> +		goto unlock;
>>>>>> +	cpumask_set_cpu(cpu, sds_idle_cpus(sd->shared));
>>>>>> +unlock:
>>>>>> +	rcu_read_unlock();
>>>>>> +}
>>>>>> +
>>>>>> /*
>>>>>>  * Scan the entire LLC domain for idle cores; this dynamically switches off if
>>>>>>  * there are no idle cores left in the system; tracked through
>>>>>> @@ -6136,7 +6156,12 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
>>>>>>
>>>>>> 	time = cpu_clock(this);
>>>>>>
>>>>>> -	cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
>>>>>> +	/*
>>>>>> +	 * sched_domain_shared is set only at shared cache level,
>>>>>> +	 * this works only because select_idle_cpu is called with
Re: [RFC PATCH v2] sched/fair: select idle cpu from idle cpumask in sched domain
On Tue, 22 Sep 2020 at 05:33, Li, Aubrey wrote:
>
> On 2020/9/21 23:21, Vincent Guittot wrote:
>> On Mon, 21 Sep 2020 at 17:14, Vincent Guittot wrote:
>>>
>>> On Thu, 17 Sep 2020 at 11:21, Li, Aubrey wrote:
>>>>
>>>> On 2020/9/16 19:00, Mel Gorman wrote:
>>>>> On Wed, Sep 16, 2020 at 12:31:03PM +0800, Aubrey Li wrote:
>>>>>> Added idle cpumask to track idle cpus in sched domain. When a CPU
>>>>>> enters idle, its corresponding bit in the idle cpumask will be set,
>>>>>> and when the CPU exits idle, its bit will be cleared.
>>>>>>
>>>>>> When a task wakes up to select an idle cpu, scanning the idle
>>>>>> cpumask has lower cost than scanning all the cpus in the last
>>>>>> level cache domain, especially when the system is heavily loaded.
>>>>>>
>>>>>> The following benchmarks were tested on an x86 4 socket system with
>>>>>> 24 cores per socket and 2 hyperthreads per core, total 192 CPUs:
>>>>>>
>>>>> This still appears to be tied to turning the tick off. An idle CPU
>>>>> available for computation does not necessarily have the tick turned
>>>>> off if it's idle only for short periods of time. When nohz is disabled
>>>>> or a machine is active enough that CPUs are not disabling the tick,
>>>>> select_idle_cpu may fail to select an idle CPU and instead stack tasks
>>>>> on the old CPU.
>>>>>
>>>>> The other subtlety is that select_idle_sibling() currently allows a
>>>>> SCHED_IDLE cpu to be used as a wakeup target. The CPU is not really
>>>>> idle as such, it's simply running a low priority task that is suitable
>>>>> for preemption. I suspect this patch breaks that.
>>>>>
>>>> Thanks!
>>>>
>>>> I shall post a v3 with performance data. I made a quick uperf test and
>>>> found the benefit is still there. So I posted the patch here, looking
>>>> forward to your comments before I start the benchmarks.
>>>>
>>>> Thanks,
>>>> -Aubrey
>>>>
>>>> ---
>>>> diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
>>>> index fb11091129b3..43a641d26154 100644
>>>> --- a/include/linux/sched/topology.h
>>>> +++ b/include/linux/sched/topology.h
>>>> @@ -65,8 +65,21 @@ struct sched_domain_shared {
>>>> 	atomic_t	ref;
>>>> 	atomic_t	nr_busy_cpus;
>>>> 	int		has_idle_cores;
>>>> +	/*
>>>> +	 * Span of all idle CPUs in this domain.
>>>> +	 *
>>>> +	 * NOTE: this field is variable length. (Allocated dynamically
>>>> +	 * by attaching extra space to the end of the structure,
>>>> +	 * depending on how many CPUs the kernel has booted up with)
>>>> +	 */
>>>> +	unsigned long	idle_cpus_span[];
>>>> };
>>>>
>>>> +static inline struct cpumask *sds_idle_cpus(struct sched_domain_shared *sds)
>>>> +{
>>>> +	return to_cpumask(sds->idle_cpus_span);
>>>> +}
>>>> +
>>>> struct sched_domain {
>>>> 	/* These fields must be setup */
>>>> 	struct sched_domain __rcu *parent; /* top domain must be null terminated */
>>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>>> index 6b3b59cc51d6..9a3c82645472 100644
>>>> --- a/kernel/sched/fair.c
>>>> +++ b/kernel/sched/fair.c
>>>> @@ -6023,6 +6023,26 @@ void __update_idle_core(struct rq *rq)
>>>> 	rcu_read_unlock();
>>>> }
>>>>
>>>> +/*
>>>> + * Update cpu idle state and record this information
>>>> + * in sd_llc_shared->idle_cpus_span.
>>>> + */
>>>> +void update_idle_cpumask(struct rq *rq)
>>>> +{
>>>> +	struct sched_domain *sd;
>>>> +	int cpu = cpu_of(rq);
>>>> +
>>>> +	rcu_read_lock();
>>>> +	sd = rcu_dereference(per_cpu(sd_llc, cpu));
>>>> +	if (!sd || !sd->shared)
>>>> +		goto unlock;
>>>> +	if (!available_idle_cpu(cpu) || !sched_idle_cpu(cpu))
>>>> +		goto unlock;
>>>> +	cpumask_set_cpu(cpu, sds_idle_cpus(sd->shared));
>>>> +unlock:
>>>> +	rcu_read_unlock();
>>>> +}
>>>> +
>>>> /*
>>>>  * Scan the entire LLC domain for idle cores; this dynamically switches off if
>>>>  * there are no idle cores left in the system; tracked through
>>>> @@ -6136,7 +6156,12 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
>>>>
>>>> 	time = cpu_clock(this);
>>>>
>>>> -	cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
>>>> +	/*
>>>> +	 * sched_domain_shared is set only at shared cache level,
>>>> +	 * this works only because select_idle_cpu is called with
>>>> +	 * sd_llc.
>>>> +	 */
>>>> +	cpumask_and(cpus, sds_idle_cpus(sd->shared), p->cpus_ptr);
>>>>
>>>> 	for_each_cpu_wrap(cpu, cpus, target) {
>>>> 		if (!--nr)
>>>> @@ -6712,6 +6737,10 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
>>>>
>>>> 	if (want_affine)
Re: [RFC PATCH v2] sched/fair: select idle cpu from idle cpumask in sched domain
On Mon, 21 Sep 2020 at 17:14, Vincent Guittot wrote:
>
> On Thu, 17 Sep 2020 at 11:21, Li, Aubrey wrote:
>>
>> On 2020/9/16 19:00, Mel Gorman wrote:
>>> On Wed, Sep 16, 2020 at 12:31:03PM +0800, Aubrey Li wrote:
>>>> Added idle cpumask to track idle cpus in sched domain. When a CPU
>>>> enters idle, its corresponding bit in the idle cpumask will be set,
>>>> and when the CPU exits idle, its bit will be cleared.
>>>>
>>>> When a task wakes up to select an idle cpu, scanning the idle cpumask
>>>> has lower cost than scanning all the cpus in the last level cache
>>>> domain, especially when the system is heavily loaded.
>>>>
>>>> The following benchmarks were tested on an x86 4 socket system with
>>>> 24 cores per socket and 2 hyperthreads per core, total 192 CPUs:
>>>>
>>> This still appears to be tied to turning the tick off. An idle CPU
>>> available for computation does not necessarily have the tick turned off
>>> if it's idle only for short periods of time. When nohz is disabled or a
>>> machine is active enough that CPUs are not disabling the tick,
>>> select_idle_cpu may fail to select an idle CPU and instead stack tasks
>>> on the old CPU.
>>>
>>> The other subtlety is that select_idle_sibling() currently allows a
>>> SCHED_IDLE cpu to be used as a wakeup target. The CPU is not really
>>> idle as such, it's simply running a low priority task that is suitable
>>> for preemption. I suspect this patch breaks that.
>>>
>> Thanks!
>>
>> I shall post a v3 with performance data. I made a quick uperf test and
>> found the benefit is still there. So I posted the patch here, looking
>> forward to your comments before I start the benchmarks.
>>
>> Thanks,
>> -Aubrey
>>
>> ---
>> diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
>> index fb11091129b3..43a641d26154 100644
>> --- a/include/linux/sched/topology.h
>> +++ b/include/linux/sched/topology.h
>> @@ -65,8 +65,21 @@ struct sched_domain_shared {
>> 	atomic_t	ref;
>> 	atomic_t	nr_busy_cpus;
>> 	int		has_idle_cores;
>> +	/*
>> +	 * Span of all idle CPUs in this domain.
>> +	 *
>> +	 * NOTE: this field is variable length. (Allocated dynamically
>> +	 * by attaching extra space to the end of the structure,
>> +	 * depending on how many CPUs the kernel has booted up with)
>> +	 */
>> +	unsigned long	idle_cpus_span[];
>> };
>>
>> +static inline struct cpumask *sds_idle_cpus(struct sched_domain_shared *sds)
>> +{
>> +	return to_cpumask(sds->idle_cpus_span);
>> +}
>> +
>> struct sched_domain {
>> 	/* These fields must be setup */
>> 	struct sched_domain __rcu *parent; /* top domain must be null terminated */
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 6b3b59cc51d6..9a3c82645472 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -6023,6 +6023,26 @@ void __update_idle_core(struct rq *rq)
>> 	rcu_read_unlock();
>> }
>>
>> +/*
>> + * Update cpu idle state and record this information
>> + * in sd_llc_shared->idle_cpus_span.
>> + */
>> +void update_idle_cpumask(struct rq *rq)
>> +{
>> +	struct sched_domain *sd;
>> +	int cpu = cpu_of(rq);
>> +
>> +	rcu_read_lock();
>> +	sd = rcu_dereference(per_cpu(sd_llc, cpu));
>> +	if (!sd || !sd->shared)
>> +		goto unlock;
>> +	if (!available_idle_cpu(cpu) || !sched_idle_cpu(cpu))
>> +		goto unlock;
>> +	cpumask_set_cpu(cpu, sds_idle_cpus(sd->shared));
>> +unlock:
>> +	rcu_read_unlock();
>> +}
>> +
>> /*
>>  * Scan the entire LLC domain for idle cores; this dynamically switches off if
>>  * there are no idle cores left in the system; tracked through
>> @@ -6136,7 +6156,12 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
>>
>> 	time = cpu_clock(this);
>>
>> -	cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
>> +	/*
>> +	 * sched_domain_shared is set only at shared cache level,
>> +	 * this works only because select_idle_cpu is called with
>> +	 * sd_llc.
>> +	 */
>> +	cpumask_and(cpus, sds_idle_cpus(sd->shared), p->cpus_ptr);
>>
>> 	for_each_cpu_wrap(cpu, cpus, target) {
>> 		if (!--nr)
>> @@ -6712,6 +6737,10 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
>>
>> 	if (want_affine)
>> 		current->recent_used_cpu = cpu;
>> +
>> +	sd = rcu_dereference(per_cpu(sd_llc, new_cpu));
>> +	if (sd && sd->shared)
>> +		cpumask_clear_cpu(new_cpu, sds_idle_cpus(sd->shared));

> Why are you clearing the bit only for the fast path?
Re: [RFC PATCH v2] sched/fair: select idle cpu from idle cpumask in sched domain
On Thu, 17 Sep 2020 at 11:21, Li, Aubrey wrote:
>
> On 2020/9/16 19:00, Mel Gorman wrote:
> > On Wed, Sep 16, 2020 at 12:31:03PM +0800, Aubrey Li wrote:
> >> Added idle cpumask to track idle cpus in sched domain. When a CPU
> >> enters idle, its corresponding bit in the idle cpumask will be set,
> >> and when the CPU exits idle, its bit will be cleared.
> >>
> >> When a task wakes up to select an idle cpu, scanning idle cpumask
> >> has lower cost than scanning all the cpus in last level cache domain,
> >> especially when the system is heavily loaded.
> >>
> >> The following benchmarks were tested on a x86 4 socket system with
> >> 24 cores per socket and 2 hyperthreads per core, total 192 CPUs:
> >>
> >
> > This still appears to be tied to turning the tick off. An idle CPU
> > available for computation does not necessarily have the tick turned off
> > if it's for short periods of time. When nohz is disabled or a machine is
> > active enough that CPUs are not disabling the tick, select_idle_cpu may
> > fail to select an idle CPU and instead stack tasks on the old CPU.
> >
> > The other subtlety is that select_idle_sibling() currently allows a
> > SCHED_IDLE cpu to be used as a wakeup target. The CPU is not really
> > idle as such, it's simply running a low priority task that is suitable
> > for preemption. I suspect this patch breaks that.
> >
> Thanks!
>
> I shall post a v3 with performance data. I made a quick uperf test and
> found the benefit is still there, so I posted the patch here and am
> looking forward to your comments before I start the benchmarks.
>
> Thanks,
> -Aubrey
>
> ---
> diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
> index fb11091129b3..43a641d26154 100644
> --- a/include/linux/sched/topology.h
> +++ b/include/linux/sched/topology.h
> @@ -65,8 +65,21 @@ struct sched_domain_shared {
> 	atomic_t	ref;
> 	atomic_t	nr_busy_cpus;
> 	int		has_idle_cores;
> +	/*
> +	 * Span of all idle CPUs in this domain.
> +	 *
> +	 * NOTE: this field is variable length. (Allocated dynamically
> +	 * by attaching extra space to the end of the structure,
> +	 * depending on how many CPUs the kernel has booted up with)
> +	 */
> +	unsigned long	idle_cpus_span[];
> };
>
> +static inline struct cpumask *sds_idle_cpus(struct sched_domain_shared *sds)
> +{
> +	return to_cpumask(sds->idle_cpus_span);
> +}
> +
> struct sched_domain {
> 	/* These fields must be setup */
> 	struct sched_domain __rcu *parent;	/* top domain must be null terminated */
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 6b3b59cc51d6..9a3c82645472 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6023,6 +6023,26 @@ void __update_idle_core(struct rq *rq)
> 	rcu_read_unlock();
> }
>
> +/*
> + * Update cpu idle state and record this information
> + * in sd_llc_shared->idle_cpus_span.
> + */
> +void update_idle_cpumask(struct rq *rq)
> +{
> +	struct sched_domain *sd;
> +	int cpu = cpu_of(rq);
> +
> +	rcu_read_lock();
> +	sd = rcu_dereference(per_cpu(sd_llc, cpu));
> +	if (!sd || !sd->shared)
> +		goto unlock;
> +	if (!available_idle_cpu(cpu) || !sched_idle_cpu(cpu))
> +		goto unlock;
> +	cpumask_set_cpu(cpu, sds_idle_cpus(sd->shared));
> +unlock:
> +	rcu_read_unlock();
> +}
> +
> /*
>  * Scan the entire LLC domain for idle cores; this dynamically switches off if
>  * there are no idle cores left in the system; tracked through
> @@ -6136,7 +6156,12 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
>
> 	time = cpu_clock(this);
>
> -	cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
> +	/*
> +	 * sched_domain_shared is set only at shared cache level,
> +	 * this works only because select_idle_cpu is called with
> +	 * sd_llc.
> +	 */
> +	cpumask_and(cpus, sds_idle_cpus(sd->shared), p->cpus_ptr);
>
> 	for_each_cpu_wrap(cpu, cpus, target) {
> 		if (!--nr)
> @@ -6712,6 +6737,10 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
>
> 	if (want_affine)
> 		current->recent_used_cpu = cpu;
> +
> +	sd = rcu_dereference(per_cpu(sd_llc, new_cpu));
> +	if (sd && sd->shared)
> +		cpumask_clear_cpu(new_cpu, sds_idle_cpus(sd->shared));

Why are you clearing the bit only for the fast path? The slow path can
also select an idle CPU.

Then, I'm afraid that updating a cpumask at each and every task wakeup
will be far too expensive. That's why we are not updating
nohz.idle_cpus_mask at each and every idle enter/exit but only once per
tick.

And a quick test with hackbench on my octo cores arm64 give
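The rate-limiting Vincent refers to can be modeled outside the kernel. The sketch below is a minimal userspace approximation (the struct names, the atomic word standing in for a cpumask, and the `TICK_NS` constant are all illustrative assumptions, not kernel API): each CPU records when it last touched the shared idle mask, and idle enter/exit events within the same tick skip the atomic read-modify-write entirely.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical userspace model: one shared "idle CPUs" word per LLC,
 * updated atomically, but rate-limited to once per tick per CPU. */
#define TICK_NS 4000000ULL  /* 4 ms, i.e. HZ=250 (illustrative) */

struct llc_shared {
	_Atomic uint64_t idle_cpus;   /* bit n set => CPU n is idle */
};

struct cpu_state {
	uint64_t last_update_ns;      /* when we last wrote the shared mask */
};

/* Called on idle enter/exit; touches the shared mask at most once per
 * tick, so a flurry of sub-tick transitions costs nothing shared. */
static bool update_idle_bit(struct llc_shared *sds, struct cpu_state *cs,
			    int cpu, bool idle, uint64_t now_ns)
{
	if (now_ns - cs->last_update_ns < TICK_NS)
		return false;         /* rate limited: skip the atomic RMW */
	cs->last_update_ns = now_ns;
	if (idle)
		atomic_fetch_or(&sds->idle_cpus, 1ULL << cpu);
	else
		atomic_fetch_and(&sds->idle_cpus, ~(1ULL << cpu));
	return true;
}
```

The trade-off is the one discussed in the thread: the mask may be stale for up to one tick, so any consumer still has to re-check a candidate CPU before trusting it.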
Re: [RFC PATCH v2] sched/fair: select idle cpu from idle cpumask in sched domain
On 2020/9/16 19:00, Mel Gorman wrote:
> On Wed, Sep 16, 2020 at 12:31:03PM +0800, Aubrey Li wrote:
>> Added idle cpumask to track idle cpus in sched domain. When a CPU
>> enters idle, its corresponding bit in the idle cpumask will be set,
>> and when the CPU exits idle, its bit will be cleared.
>>
>> When a task wakes up to select an idle cpu, scanning idle cpumask
>> has lower cost than scanning all the cpus in last level cache domain,
>> especially when the system is heavily loaded.
>>
>> The following benchmarks were tested on a x86 4 socket system with
>> 24 cores per socket and 2 hyperthreads per core, total 192 CPUs:
>>
>
> This still appears to be tied to turning the tick off. An idle CPU
> available for computation does not necessarily have the tick turned off
> if it's for short periods of time. When nohz is disabled or a machine is
> active enough that CPUs are not disabling the tick, select_idle_cpu may
> fail to select an idle CPU and instead stack tasks on the old CPU.
>
> The other subtlety is that select_idle_sibling() currently allows a
> SCHED_IDLE cpu to be used as a wakeup target. The CPU is not really
> idle as such, it's simply running a low priority task that is suitable
> for preemption. I suspect this patch breaks that.
>
Thanks!

I shall post a v3 with performance data. I made a quick uperf test and
found the benefit is still there, so I posted the patch here and am
looking forward to your comments before I start the benchmarks.

Thanks,
-Aubrey

---
diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index fb11091129b3..43a641d26154 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -65,8 +65,21 @@ struct sched_domain_shared {
 	atomic_t	ref;
 	atomic_t	nr_busy_cpus;
 	int		has_idle_cores;
+	/*
+	 * Span of all idle CPUs in this domain.
+	 *
+	 * NOTE: this field is variable length. (Allocated dynamically
+	 * by attaching extra space to the end of the structure,
+	 * depending on how many CPUs the kernel has booted up with)
+	 */
+	unsigned long	idle_cpus_span[];
 };

+static inline struct cpumask *sds_idle_cpus(struct sched_domain_shared *sds)
+{
+	return to_cpumask(sds->idle_cpus_span);
+}
+
 struct sched_domain {
 	/* These fields must be setup */
 	struct sched_domain __rcu *parent;	/* top domain must be null terminated */
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 6b3b59cc51d6..9a3c82645472 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6023,6 +6023,26 @@ void __update_idle_core(struct rq *rq)
 	rcu_read_unlock();
 }

+/*
+ * Update cpu idle state and record this information
+ * in sd_llc_shared->idle_cpus_span.
+ */
+void update_idle_cpumask(struct rq *rq)
+{
+	struct sched_domain *sd;
+	int cpu = cpu_of(rq);
+
+	rcu_read_lock();
+	sd = rcu_dereference(per_cpu(sd_llc, cpu));
+	if (!sd || !sd->shared)
+		goto unlock;
+	if (!available_idle_cpu(cpu) || !sched_idle_cpu(cpu))
+		goto unlock;
+	cpumask_set_cpu(cpu, sds_idle_cpus(sd->shared));
+unlock:
+	rcu_read_unlock();
+}
+
 /*
  * Scan the entire LLC domain for idle cores; this dynamically switches off if
  * there are no idle cores left in the system; tracked through
@@ -6136,7 +6156,12 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t

 	time = cpu_clock(this);

-	cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
+	/*
+	 * sched_domain_shared is set only at shared cache level,
+	 * this works only because select_idle_cpu is called with
+	 * sd_llc.
+	 */
+	cpumask_and(cpus, sds_idle_cpus(sd->shared), p->cpus_ptr);

 	for_each_cpu_wrap(cpu, cpus, target) {
 		if (!--nr)
@@ -6712,6 +6737,10 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f

 	if (want_affine)
 		current->recent_used_cpu = cpu;
+
+	sd = rcu_dereference(per_cpu(sd_llc, new_cpu));
+	if (sd && sd->shared)
+		cpumask_clear_cpu(new_cpu, sds_idle_cpus(sd->shared));
 	}
 	rcu_read_unlock();

@@ -10871,6 +10900,9 @@ static void set_next_task_fair(struct rq *rq, struct task_struct *p, bool first)
 		/* ensure bandwidth has been allocated on our new cfs_rq */
 		account_cfs_rq_runtime(cfs_rq, 0);
 	}
+	/* Update idle cpumask if task has idle policy */
+	if (unlikely(task_has_idle_policy(p)))
+		update_idle_cpumask(rq);
 }

 void init_cfs_rq(struct cfs_rq *cfs_rq)
diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index 1ae95b9150d3..876dfdfe35bb 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -405,6 +40
Re: [RFC PATCH v2] sched/fair: select idle cpu from idle cpumask in sched domain
On 16/09/20 12:00, Mel Gorman wrote:
> On Wed, Sep 16, 2020 at 12:31:03PM +0800, Aubrey Li wrote:
>> Added idle cpumask to track idle cpus in sched domain. When a CPU
>> enters idle, its corresponding bit in the idle cpumask will be set,
>> and when the CPU exits idle, its bit will be cleared.
>>
>> When a task wakes up to select an idle cpu, scanning idle cpumask
>> has lower cost than scanning all the cpus in last level cache domain,
>> especially when the system is heavily loaded.
>>
>> The following benchmarks were tested on a x86 4 socket system with
>> 24 cores per socket and 2 hyperthreads per core, total 192 CPUs:
>>
>
> This still appears to be tied to turning the tick off. An idle CPU
> available for computation does not necessarily have the tick turned off
> if it's for short periods of time. When nohz is disabled or a machine is
> active enough that CPUs are not disabling the tick, select_idle_cpu may
> fail to select an idle CPU and instead stack tasks on the old CPU.
>

Vincent was pointing out in v1 that we ratelimit nohz_balance_exit_idle()
by having it happen on a tick, to prevent being hammered by a flurry of
idle enters/exits at sub-tick granularity. I'm afraid flipping bits of
this cpumask on every idle enter/exit might be too brutal.

> The other subtlety is that select_idle_sibling() currently allows a
> SCHED_IDLE cpu to be used as a wakeup target. The CPU is not really
> idle as such, it's simply running a low priority task that is suitable
> for preemption. I suspect this patch breaks that.

I think you're spot on. An alternative I see here would be to move this
into its own select_idle_foo() function: if that mask is empty or none
of the tagged CPUs actually pass available_idle_cpu(), we fall through
to the usual idle searches.

That's far from perfect; you could wake a truly idle CPU instead of
preempting a SCHED_IDLE task on a warm and busy CPU. I'm not sure a
proliferation of cpumasks really is the answer to that...
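The fallthrough Valentin sketches in words can be written down as a small userspace model. This is only an illustration of the control flow, not kernel code: `select_idle_cpu_fallthrough`, the `truly_idle[]` array standing in for `available_idle_cpu()`, and the 64-bit words standing in for cpumasks are all hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

#define NR_CPUS 8

/* Stand-in for available_idle_cpu(): the authoritative per-CPU state,
 * which the tagged mask may lag behind. */
static int truly_idle[NR_CPUS];

/* Return the first CPU in `mask` that is confirmed idle, or -1. */
static int select_from_mask(uint64_t mask)
{
	for (int cpu = 0; cpu < NR_CPUS; cpu++)
		if (((mask >> cpu) & 1) && truly_idle[cpu])
			return cpu;   /* tagged and confirmed idle */
	return -1;
}

/* Try the tagged idle mask first; if it is empty or only contains
 * stale entries, fall through to scanning the whole domain span. */
static int select_idle_cpu_fallthrough(uint64_t tagged_idle,
				       uint64_t domain_span)
{
	int cpu = select_from_mask(tagged_idle & domain_span);
	if (cpu >= 0)
		return cpu;                       /* fast path paid off */
	return select_from_mask(domain_span);    /* usual full scan */
}
```

The re-check inside the fast path is what makes a stale tag harmless: a bit that was never cleared just costs one wasted probe before falling through.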
Re: [RFC PATCH v2] sched/fair: select idle cpu from idle cpumask in sched domain
On Wed, Sep 16, 2020 at 12:31:03PM +0800, Aubrey Li wrote:
> Added idle cpumask to track idle cpus in sched domain. When a CPU
> enters idle, its corresponding bit in the idle cpumask will be set,
> and when the CPU exits idle, its bit will be cleared.
>
> When a task wakes up to select an idle cpu, scanning idle cpumask
> has lower cost than scanning all the cpus in last level cache domain,
> especially when the system is heavily loaded.
>
> The following benchmarks were tested on a x86 4 socket system with
> 24 cores per socket and 2 hyperthreads per core, total 192 CPUs:
>

This still appears to be tied to turning the tick off. An idle CPU
available for computation does not necessarily have the tick turned off
if it's idle for short periods of time. When nohz is disabled, or a
machine is active enough that CPUs are not disabling the tick,
select_idle_cpu may fail to select an idle CPU and instead stack tasks
on the old CPU.

The other subtlety is that select_idle_sibling() currently allows a
SCHED_IDLE cpu to be used as a wakeup target. The CPU is not really
idle as such, it's simply running a low priority task that is suitable
for preemption. I suspect this patch breaks that.

--
Mel Gorman
SUSE Labs
Re: [RFC PATCH v2] sched/fair: select idle cpu from idle cpumask in sched domain
On Wed, 16 Sep 2020 at 13:00, Mel Gorman wrote:
>
> On Wed, Sep 16, 2020 at 12:31:03PM +0800, Aubrey Li wrote:
> > Added idle cpumask to track idle cpus in sched domain. When a CPU
> > enters idle, its corresponding bit in the idle cpumask will be set,
> > and when the CPU exits idle, its bit will be cleared.
> >
> > When a task wakes up to select an idle cpu, scanning idle cpumask
> > has lower cost than scanning all the cpus in last level cache domain,
> > especially when the system is heavily loaded.
> >
> > The following benchmarks were tested on a x86 4 socket system with
> > 24 cores per socket and 2 hyperthreads per core, total 192 CPUs:
> >
>
> This still appears to be tied to turning the tick off. An idle CPU
> available for computation does not necessarily have the tick turned off
> if it's for short periods of time. When nohz is disabled or a machine is
> active enough that CPUs are not disabling the tick, select_idle_cpu may
> fail to select an idle CPU and instead stack tasks on the old CPU.
>
> The other subtlety is that select_idle_sibling() currently allows a
> SCHED_IDLE cpu to be used as a wakeup target. The CPU is not really
> idle as such, it's simply running a low priority task that is suitable
> for preemption. I suspect this patch breaks that.

Yes, good point. I completely missed this.

>
> --
> Mel Gorman
> SUSE Labs
[RFC PATCH v2] sched/fair: select idle cpu from idle cpumask in sched domain
Added idle cpumask to track idle cpus in sched domain. When a CPU
enters idle, its corresponding bit in the idle cpumask will be set,
and when the CPU exits idle, its bit will be cleared.

When a task wakes up to select an idle cpu, scanning idle cpumask
has lower cost than scanning all the cpus in the last level cache
domain, especially when the system is heavily loaded.

The following benchmarks were tested on an x86 4-socket system with
24 cores per socket and 2 hyperthreads per core, 192 CPUs in total:

uperf throughput: netperf workload, tcp_nodelay, r/w size = 90

  threads	baseline-avg	%std	patch-avg	%std
  96		1		1.24	0.98		2.76
  144		1		1.13	1.35		4.01
  192		1		0.58	1.67		3.25
  240		1		2.49	1.68		3.55

hackbench: process mode, 10 loops, 40 file descriptors per group

  group		baseline-avg	%std	patch-avg	%std
  2(80)	1		12.05	0.97		9.88
  3(120)	1		12.48	0.95		11.62
  4(160)	1		13.83	0.97		13.22
  5(200)	1		2.76	1.01		2.94

schbench: 99th percentile latency, 16 workers per message thread

  mthread	baseline-avg	%std	patch-avg	%std
  6(96)	1		1.24	0.993		1.73
  9(144)	1		0.38	0.998		0.39
  12(192)	1		1.58	0.995		1.64
  15(240)	1		51.71	0.606		37.41

sysbench mysql throughput: read/write, table size = 10,000,000

  thread	baseline-avg	%std	patch-avg	%std
  96		1		1.77	1.015		1.71
  144		1		3.39	0.998		4.05
  192		1		2.88	1.002		2.81
  240		1		2.07	1.011		2.09

kbuild: kexec reboot every time

  baseline-avg	patch-avg
  1		1

v1->v2:
- idle cpumask is updated in the nohz routines; by initializing the
  idle cpumask with sched_domain_span(sd), the nohz=off case retains
  the original behavior.
Cc: Qais Yousef
Cc: Valentin Schneider
Cc: Jiang Biao
Cc: Tim Chen
Signed-off-by: Aubrey Li
---
 include/linux/sched/topology.h | 13 +
 kernel/sched/fair.c            |  9 -
 kernel/sched/topology.c        |  3 ++-
 3 files changed, 23 insertions(+), 2 deletions(-)

diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index fb11091129b3..43a641d26154 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -65,8 +65,21 @@ struct sched_domain_shared {
 	atomic_t	ref;
 	atomic_t	nr_busy_cpus;
 	int		has_idle_cores;
+	/*
+	 * Span of all idle CPUs in this domain.
+	 *
+	 * NOTE: this field is variable length. (Allocated dynamically
+	 * by attaching extra space to the end of the structure,
+	 * depending on how many CPUs the kernel has booted up with)
+	 */
+	unsigned long	idle_cpus_span[];
 };

+static inline struct cpumask *sds_idle_cpus(struct sched_domain_shared *sds)
+{
+	return to_cpumask(sds->idle_cpus_span);
+}
+
 struct sched_domain {
 	/* These fields must be setup */
 	struct sched_domain __rcu *parent;	/* top domain must be null terminated */
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 6b3b59cc51d6..cfe78fcf69da 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6136,7 +6136,12 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t

 	time = cpu_clock(this);

-	cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
+	/*
+	 * sched_domain_shared is set only at shared cache level,
+	 * this works only because select_idle_cpu is called with
+	 * sd_llc.
+	 */
+	cpumask_and(cpus, sds_idle_cpus(sd->shared), p->cpus_ptr);

 	for_each_cpu_wrap(cpu, cpus, target) {
 		if (!--nr)
@@ -10182,6 +10187,7 @@ static void set_cpu_sd_state_busy(int cpu)

 	sd->nohz_idle = 0;
 	atomic_inc(&sd->shared->nr_busy_cpus);
+	cpumask_clear_cpu(cpu, sds_idle_cpus(sd->shared));
unlock:
 	rcu_read_unlock();
 }
@@ -10212,6 +10218,7 @@ static void set_cpu_sd_state_idle(int cpu)

 	sd->nohz_idle = 1;
 	atomic_dec(&sd->shared->nr_busy_cpus);
+	cpumask_set_cpu(cpu, sds_idle_cpus(sd->shared));
unlock:
 	rcu_read_unlock();
 }
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 9079d865a935..f14a6ef4de57 100644
@@ -1407,6 +1407,7 @@ sd_init(struct sched_domain_topology_level *tl,
 		sd->shared = *per_cpu_ptr(sdd->sds, sd_id);
 		atomic_inc(&sd->shared->ref);
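For readers following the thread outside the kernel tree, the core idea of the patch can be modeled in a few lines of userspace C. This is a minimal sketch under stated assumptions, not the kernel implementation: a 64-bit word stands in for a cpumask, and `pick_idle_cpu`, `cpu_enter_idle`, and `cpu_exit_idle` are illustrative names for the roles played by `select_idle_cpu()` and the nohz enter/exit hooks.

```c
#include <assert.h>
#include <stdint.h>

/* Per-LLC state: all CPUs in the domain, and the subset currently idle. */
struct llc {
	uint64_t span;       /* all CPUs in this last-level-cache domain */
	uint64_t idle_cpus;  /* subset of span currently idle */
};

/* Maintained on idle transitions, like set_cpu_sd_state_idle/busy. */
static void cpu_enter_idle(struct llc *llc, int cpu)
{
	llc->idle_cpus |= 1ULL << cpu;
}

static void cpu_exit_idle(struct llc *llc, int cpu)
{
	llc->idle_cpus &= ~(1ULL << cpu);
}

/* Wakeup path: AND the idle mask with the task's allowed CPUs instead
 * of scanning the whole span; return the first candidate, or -1. */
static int pick_idle_cpu(const struct llc *llc, uint64_t cpus_allowed)
{
	uint64_t candidates = llc->idle_cpus & cpus_allowed;

	if (!candidates)
		return -1;
	return __builtin_ctzll(candidates);  /* index of lowest set bit */
}
```

Under heavy load `idle_cpus` is nearly empty, so the wakeup-side scan touches only the few set bits rather than every CPU in the LLC, which is where the claimed benefit comes from; the cost moves to the idle enter/exit side, which is exactly the trade-off debated above.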