Re: [RFC PATCH v2] sched/fair: select idle cpu from idle cpumask in sched domain

2020-09-27 Thread Li, Aubrey
On 2020/9/26 0:45, Vincent Guittot wrote:
> On Friday, 25 Sept 2020 at 17:21:46 (+0800), Li, Aubrey wrote:
>> Hi Vincent,
>>
>> On 2020/9/24 21:09, Vincent Guittot wrote:
>>
>> Would you mind sharing your uperf (netperf load) result? That's the
>> workload where I have seen this patch contribute the most benefit under
>> heavy load.
>
> with uperf, I've got the same kind of result as with sched pipe
> tip/sched/core: Throughput 24.83Mb/s (+/- 0.09%)
> with this patch: Throughput 19.02Mb/s (+/- 0.71%), which is a 23%
> regression, as for sched pipe
>
 In case this is caused by the logic error in this patch (sorry again), did
 you see any improvement with patch v2? Though it does not help for the
 nohz=off case, I just want to know whether it helps at all on the ARM
 platform.
>>>
>>> With the v2, which rate-limits the update of the cpumask (but doesn't
>>> support sched_idle tasks), I don't see any performance impact:
>>
>> I agree we should go with the rate-limited cpumask update.
>>
>> And I think no performance impact for sched-pipe is expected, as this
>> workload has only 2 threads and the platform has 8 cores, so mostly the
>> previous cpu is returned, and even if select_idle_sibling is called,
>> select_idle_core is hit and select_idle_cpu is rarely called.
> 
> my platform is not SMT, so select_idle_core is a nop. Nevertheless,
> select_idle_cpu is almost never called in our case because prev is idle
> and selected before we reach it.
> 
>>
>> But I'm more curious why there is a 23% performance penalty. For this
>> patch, if you revert this change but keep the cpumask updated, is the
>> 23% penalty still there?
>>
>> -   cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
>> +   cpumask_and(cpus, sds_idle_cpus(sd->shared), p->cpus_ptr);
> 
> I was about to say that reverting this line should not change anything because
> we never reach this point, but in fact it does. After looking at a trace,
> I can see that the 2 threads of perf bench sched pipe are on the same CPU and
> that sds_idle_cpus(sd->shared) is always empty. In fact, rq->curr is
> not yet idle and still points to the cfs task when you call
> update_idle_cpumask(). This means that once cleared, the bit will never
> be set again. You can remove the test in update_idle_cpumask(), which is
> called either when entering idle or when only sched_idle tasks are runnable.
> 
> @@ -6044,8 +6044,7 @@ void update_idle_cpumask(struct rq *rq)
> sd = rcu_dereference(per_cpu(sd_llc, cpu));
> if (!sd || !sd->shared)
> goto unlock;
> -   if (!available_idle_cpu(cpu) || !sched_idle_cpu(cpu))
> -   goto unlock;
> +
> cpumask_set_cpu(cpu, sds_idle_cpus(sd->shared));
>  unlock:
> rcu_read_unlock();
> 
> With this fix, the performance decrease is only 2%
> 
>>
>> I just wonder if it's caused by the atomic ops, as you have two cache
>> domains with sd_llc(?). Do you have an x86 machine to make a comparison?
>> It's hard for me to find an ARM machine, but I'll try.
>>
>> Also, for the uperf (task thread num = cpu num) workload, how is it with
>> patch v2? Any performance impact?
> 
> with v2: Throughput 24.97Mb/s (+/- 0.07%), so there is no perf regression
> 

Thanks Vincent, let me try to refine this patch.

-Aubrey


Re: [RFC PATCH v2] sched/fair: select idle cpu from idle cpumask in sched domain

2020-09-25 Thread Vincent Guittot
On Friday, 25 Sept 2020 at 17:21:46 (+0800), Li, Aubrey wrote:
> Hi Vincent,
> 
> On 2020/9/24 21:09, Vincent Guittot wrote:
> 
>  Would you mind sharing your uperf (netperf load) result? That's the
>  workload where I have seen this patch contribute the most benefit under
>  heavy load.
> >>>
> >>> with uperf, I've got the same kind of result as with sched pipe
> >>> tip/sched/core: Throughput 24.83Mb/s (+/- 0.09%)
> >>> with this patch: Throughput 19.02Mb/s (+/- 0.71%), which is a 23%
> >>> regression, as for sched pipe
> >>>
> >> In case this is caused by the logic error in this patch (sorry again), did
> >> you see any improvement with patch v2? Though it does not help for the
> >> nohz=off case, I just want to know whether it helps at all on the ARM
> >> platform.
> > 
> > With the v2, which rate-limits the update of the cpumask (but doesn't
> > support sched_idle tasks), I don't see any performance impact:
> 
> I agree we should go with the rate-limited cpumask update.
> 
> And I think no performance impact for sched-pipe is expected, as this workload
> has only 2 threads and the platform has 8 cores, so mostly the previous cpu is
> returned, and even if select_idle_sibling is called, select_idle_core is hit
> and select_idle_cpu is rarely called.

my platform is not SMT, so select_idle_core is a nop. Nevertheless, select_idle_cpu
is almost never called in our case because prev is idle and selected before we
reach it.

> 
> But I'm more curious why there is a 23% performance penalty. For this patch,
> if you revert this change but keep the cpumask updated, is the 23% penalty
> still there?
> 
> -   cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
> +   cpumask_and(cpus, sds_idle_cpus(sd->shared), p->cpus_ptr);

I was about to say that reverting this line should not change anything because
we never reach this point, but in fact it does. After looking at a trace,
I can see that the 2 threads of perf bench sched pipe are on the same CPU and
that sds_idle_cpus(sd->shared) is always empty. In fact, rq->curr is not yet
idle and still points to the cfs task when you call update_idle_cpumask().
This means that once cleared, the bit will never be set again.
You can remove the test in update_idle_cpumask(), which is called either when
entering idle or when only sched_idle tasks are runnable.

@@ -6044,8 +6044,7 @@ void update_idle_cpumask(struct rq *rq)
sd = rcu_dereference(per_cpu(sd_llc, cpu));
if (!sd || !sd->shared)
goto unlock;
-   if (!available_idle_cpu(cpu) || !sched_idle_cpu(cpu))
-   goto unlock;
+
cpumask_set_cpu(cpu, sds_idle_cpus(sd->shared));
 unlock:
rcu_read_unlock();

With this fix, the performance decrease is only 2%

> 
> I just wonder if it's caused by the atomic ops, as you have two cache domains
> with sd_llc(?). Do you have an x86 machine to make a comparison? It's hard for
> me to find an ARM machine, but I'll try.
> 
> Also, for the uperf (task thread num = cpu num) workload, how is it with
> patch v2? Any performance impact?

with v2: Throughput 24.97Mb/s (+/- 0.07%), so there is no perf regression

>
> 
> Thanks,
> -Aubrey


Re: [RFC PATCH v2] sched/fair: select idle cpu from idle cpumask in sched domain

2020-09-25 Thread Li, Aubrey
Hi Vincent,

On 2020/9/24 21:09, Vincent Guittot wrote:

 Would you mind sharing your uperf (netperf load) result? That's the
 workload where I have seen this patch contribute the most benefit under
 heavy load.
>>>
>>> with uperf, I've got the same kind of result as with sched pipe
>>> tip/sched/core: Throughput 24.83Mb/s (+/- 0.09%)
>>> with this patch: Throughput 19.02Mb/s (+/- 0.71%), which is a 23%
>>> regression, as for sched pipe
>>>
>> In case this is caused by the logic error in this patch (sorry again), did
>> you see any improvement with patch v2? Though it does not help for the
>> nohz=off case, I just want to know whether it helps at all on the ARM platform.
> 
> With the v2, which rate-limits the update of the cpumask (but doesn't
> support sched_idle tasks), I don't see any performance impact:

I agree we should go with the rate-limited cpumask update.

And I think no performance impact for sched-pipe is expected, as this workload
has only 2 threads and the platform has 8 cores, so mostly the previous cpu is
returned, and even if select_idle_sibling is called, select_idle_core is hit
and select_idle_cpu is rarely called.

But I'm more curious why there is a 23% performance penalty. For this patch,
if you revert this change but keep the cpumask updated, is the 23% penalty
still there?

-   cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
+   cpumask_and(cpus, sds_idle_cpus(sd->shared), p->cpus_ptr);

I just wonder if it's caused by the atomic ops, as you have two cache domains
with sd_llc(?). Do you have an x86 machine to make a comparison? It's hard for
me to find an ARM machine, but I'll try.

Also, for the uperf (task thread num = cpu num) workload, how is it with
patch v2? Any performance impact?

Thanks,
-Aubrey


Re: [RFC PATCH v2] sched/fair: select idle cpu from idle cpumask in sched domain

2020-09-24 Thread Vincent Guittot
Hi Tim

On Thu, 24 Sep 2020 at 18:37, Tim Chen  wrote:
>
>
>
> On 9/22/20 12:14 AM, Vincent Guittot wrote:
>
> >>
> 
>  And a quick test with hackbench on my octo cores arm64 gives for 12
>
> Vincent,
>
> Is it octo (=10) or octa (=8) cores on a single socket for your system?

it's 8 cores, and the cores are split into 2 cache domains

> Is the L2 per core, or are there multiple L2s shared among groups of cores?
>
> Wonder if placing the threads within an L2 or not could cause the
> differences seen with Aubrey's test.

I haven't checked recently, but the 2 tasks involved in sched pipe run
on CPUs which belong to the same cache domain.

Vincent

>
> Tim
>


Re: [RFC PATCH v2] sched/fair: select idle cpu from idle cpumask in sched domain

2020-09-24 Thread Phil Auld
On Thu, Sep 24, 2020 at 10:43:12AM -0700 Tim Chen wrote:
> 
> 
> On 9/24/20 10:13 AM, Phil Auld wrote:
> > On Thu, Sep 24, 2020 at 09:37:33AM -0700 Tim Chen wrote:
> >>
> >>
> >> On 9/22/20 12:14 AM, Vincent Guittot wrote:
> >>
> 
> >>
> >> And a quick test with hackbench on my octo cores arm64 gives for 12
> >>
> >> Vincent,
> >>
> >> Is it octo (=10) or octa (=8) cores on a single socket for your system?
> > 
> > In what Romance language does octo mean 10?  :)
> > 
> 
> Got confused by october, the tenth month. :)

It used to be the eighth month ;)

> 
> Tim
> 
> > 
> >> Is the L2 per core, or are there multiple L2s shared among groups of cores?
> >>
> >> Wonder if placing the threads within an L2 or not could cause the
> >> differences seen with Aubrey's test.
> >>
> >> Tim
> >>
> > 
> 




Re: [RFC PATCH v2] sched/fair: select idle cpu from idle cpumask in sched domain

2020-09-24 Thread Tim Chen



On 9/24/20 10:13 AM, Phil Auld wrote:
> On Thu, Sep 24, 2020 at 09:37:33AM -0700 Tim Chen wrote:
>>
>>
>> On 9/22/20 12:14 AM, Vincent Guittot wrote:
>>

>>
>> And a quick test with hackbench on my octo cores arm64 gives for 12
>>
>> Vincent,
>>
>> Is it octo (=10) or octa (=8) cores on a single socket for your system?
> 
> In what Romance language does octo mean 10?  :)
> 

Got confused by october, the tenth month. :)

Tim

> 
>> Is the L2 per core, or are there multiple L2s shared among groups of cores?
>>
>> Wonder if placing the threads within an L2 or not could cause the
>> differences seen with Aubrey's test.
>>
>> Tim
>>
> 


Re: [RFC PATCH v2] sched/fair: select idle cpu from idle cpumask in sched domain

2020-09-24 Thread Phil Auld
On Thu, Sep 24, 2020 at 09:37:33AM -0700 Tim Chen wrote:
> 
> 
> On 9/22/20 12:14 AM, Vincent Guittot wrote:
> 
> >>
> 
>  And a quick test with hackbench on my octo cores arm64 gives for 12
> 
> Vincent,
> 
> Is it octo (=10) or octa (=8) cores on a single socket for your system?

In what Romance language does octo mean 10?  :)


> Is the L2 per core, or are there multiple L2s shared among groups of cores?
> 
> Wonder if placing the threads within an L2 or not could cause the
> differences seen with Aubrey's test.
> 
> Tim
> 




Re: [RFC PATCH v2] sched/fair: select idle cpu from idle cpumask in sched domain

2020-09-24 Thread Tim Chen



On 9/22/20 12:14 AM, Vincent Guittot wrote:

>>

 And a quick test with hackbench on my octo cores arm64 gives for 12

Vincent,

Is it octo (=10) or octa (=8) cores on a single socket for your system?
Is the L2 per core, or are there multiple L2s shared among groups of cores?

Wonder if placing the threads within an L2 or not could cause the
differences seen with Aubrey's test.

Tim



Re: [RFC PATCH v2] sched/fair: select idle cpu from idle cpumask in sched domain

2020-09-24 Thread Vincent Guittot
On Thu, 24 Sep 2020 at 05:04, Li, Aubrey  wrote:
>
> On 2020/9/23 16:50, Vincent Guittot wrote:
> > On Wed, 23 Sep 2020 at 04:59, Li, Aubrey  wrote:
> >>
> >> Hi Vincent,
> >>
> >> On 2020/9/22 15:14, Vincent Guittot wrote:
> >>> On Tue, 22 Sep 2020 at 05:33, Li, Aubrey  
> >>> wrote:
> 
>  On 2020/9/21 23:21, Vincent Guittot wrote:
> > On Mon, 21 Sep 2020 at 17:14, Vincent Guittot
> >  wrote:
> >>
> >> On Thu, 17 Sep 2020 at 11:21, Li, Aubrey  
> >> wrote:
> >>>
> >>> On 2020/9/16 19:00, Mel Gorman wrote:
>  On Wed, Sep 16, 2020 at 12:31:03PM +0800, Aubrey Li wrote:
> > Added idle cpumask to track idle cpus in sched domain. When a CPU
> > enters idle, its corresponding bit in the idle cpumask will be set,
> > and when the CPU exits idle, its bit will be cleared.
> >
> > When a task wakes up to select an idle cpu, scanning the idle cpumask
> > has lower cost than scanning all the cpus in the last level cache domain,
> > especially when the system is heavily loaded.
> >
> > The following benchmarks were tested on an x86 4-socket system with
> > 24 cores per socket and 2 hyperthreads per core, total 192 CPUs:
> >
> 
>  This still appears to be tied to turning the tick off. An idle CPU
>  available for computation does not necessarily have the tick turned off
>  if it's idle for short periods of time. When nohz is disabled or a machine is
>  active enough that CPUs are not disabling the tick, select_idle_cpu may
>  fail to select an idle CPU and instead stack tasks on the old CPU.
> 
>  The other subtlety is that select_idle_sibling() currently allows a
>  SCHED_IDLE cpu to be used as a wakeup target. The CPU is not really
>  idle as such; it's simply running a low-priority task that is suitable
>  for preemption. I suspect this patch breaks that.
> 
> >>> Thanks!
> >>>
>  I shall post a v3 with performance data. I did a quick uperf test and
>  found the benefit is still there. So I posted the patch here, looking
>  forward to your comments before I start the benchmarks.
> >>>
> >>> Thanks,
> >>> -Aubrey
> >>>
> >>> ---
> >>> diff --git a/include/linux/sched/topology.h 
> >>> b/include/linux/sched/topology.h
> >>> index fb11091129b3..43a641d26154 100644
> >>> --- a/include/linux/sched/topology.h
> >>> +++ b/include/linux/sched/topology.h
> >>> @@ -65,8 +65,21 @@ struct sched_domain_shared {
> >>> atomic_tref;
> >>> atomic_tnr_busy_cpus;
> >>> int has_idle_cores;
> >>> +   /*
> >>> +* Span of all idle CPUs in this domain.
> >>> +*
> >>> +* NOTE: this field is variable length. (Allocated dynamically
> >>> +* by attaching extra space to the end of the structure,
> >>> +* depending on how many CPUs the kernel has booted up with)
> >>> +*/
> >>> +   unsigned long   idle_cpus_span[];
> >>>  };
> >>>
> >>> +static inline struct cpumask *sds_idle_cpus(struct 
> >>> sched_domain_shared *sds)
> >>> +{
> >>> +   return to_cpumask(sds->idle_cpus_span);
> >>> +}
> >>> +
> >>>  struct sched_domain {
> >>> /* These fields must be setup */
> >>> struct sched_domain __rcu *parent;  /* top domain must be 
> >>> null terminated */
> >>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> >>> index 6b3b59cc51d6..9a3c82645472 100644
> >>> --- a/kernel/sched/fair.c
> >>> +++ b/kernel/sched/fair.c
> >>> @@ -6023,6 +6023,26 @@ void __update_idle_core(struct rq *rq)
> >>> rcu_read_unlock();
> >>>  }
> >>>
> >>> +/*
> >>> + * Update cpu idle state and record this information
> >>> + * in sd_llc_shared->idle_cpus_span.
> >>> + */
> >>> +void update_idle_cpumask(struct rq *rq)
> >>> +{
> >>> +   struct sched_domain *sd;
> >>> +   int cpu = cpu_of(rq);
> >>> +
> >>> +   rcu_read_lock();
> >>> +   sd = rcu_dereference(per_cpu(sd_llc, cpu));
> >>> +   if (!sd || !sd->shared)
> >>> +   goto unlock;
> >>> +   if (!available_idle_cpu(cpu) || !sched_idle_cpu(cpu))
> >>> +   goto unlock;
>
> Oops, I realized I didn't send an update out with this fix, though I fixed
> it locally. It should be:
>
> if (!available_idle_cpu(cpu) && !sched_idle_cpu(cpu))

the fix doesn't change the perf results

>
> Sorry for this, Vincent, :(
>
> >>> +   cpumask_set_cpu(cpu, sds_idle_cpus(sd->shared));
> >>> +unlock:
> >>> +   rcu_read_unlock();
> >>

Re: [RFC PATCH v2] sched/fair: select idle cpu from idle cpumask in sched domain

2020-09-23 Thread Vincent Guittot
On Wed, 23 Sep 2020 at 04:59, Li, Aubrey  wrote:
>
> Hi Vincent,
>
> On 2020/9/22 15:14, Vincent Guittot wrote:
> > On Tue, 22 Sep 2020 at 05:33, Li, Aubrey  wrote:
> >>
> >> On 2020/9/21 23:21, Vincent Guittot wrote:
> >>> On Mon, 21 Sep 2020 at 17:14, Vincent Guittot
> >>>  wrote:
> 
>  On Thu, 17 Sep 2020 at 11:21, Li, Aubrey  
>  wrote:
> >
> > On 2020/9/16 19:00, Mel Gorman wrote:
> >> On Wed, Sep 16, 2020 at 12:31:03PM +0800, Aubrey Li wrote:
> >>> Added idle cpumask to track idle cpus in sched domain. When a CPU
> >>> enters idle, its corresponding bit in the idle cpumask will be set,
> >>> and when the CPU exits idle, its bit will be cleared.
> >>>
> >>> When a task wakes up to select an idle cpu, scanning the idle cpumask
> >>> has lower cost than scanning all the cpus in the last level cache domain,
> >>> especially when the system is heavily loaded.
> >>>
> >>> The following benchmarks were tested on an x86 4-socket system with
> >>> 24 cores per socket and 2 hyperthreads per core, total 192 CPUs:
> >>>
> >>
> >> This still appears to be tied to turning the tick off. An idle CPU
> >> available for computation does not necessarily have the tick turned off
> >> if it's idle for short periods of time. When nohz is disabled or a machine is
> >> active enough that CPUs are not disabling the tick, select_idle_cpu may
> >> fail to select an idle CPU and instead stack tasks on the old CPU.
> >>
> >> The other subtlety is that select_idle_sibling() currently allows a
> >> SCHED_IDLE cpu to be used as a wakeup target. The CPU is not really
> >> idle as such; it's simply running a low-priority task that is suitable
> >> for preemption. I suspect this patch breaks that.
> >>
> > Thanks!
> >
> > I shall post a v3 with performance data. I did a quick uperf test and
> > found the benefit is still there. So I posted the patch here, looking
> > forward to your comments before I start the benchmarks.
> >
> > Thanks,
> > -Aubrey
> >
> > ---
> > diff --git a/include/linux/sched/topology.h 
> > b/include/linux/sched/topology.h
> > index fb11091129b3..43a641d26154 100644
> > --- a/include/linux/sched/topology.h
> > +++ b/include/linux/sched/topology.h
> > @@ -65,8 +65,21 @@ struct sched_domain_shared {
> > atomic_tref;
> > atomic_tnr_busy_cpus;
> > int has_idle_cores;
> > +   /*
> > +* Span of all idle CPUs in this domain.
> > +*
> > +* NOTE: this field is variable length. (Allocated dynamically
> > +* by attaching extra space to the end of the structure,
> > +* depending on how many CPUs the kernel has booted up with)
> > +*/
> > +   unsigned long   idle_cpus_span[];
> >  };
> >
> > +static inline struct cpumask *sds_idle_cpus(struct sched_domain_shared 
> > *sds)
> > +{
> > +   return to_cpumask(sds->idle_cpus_span);
> > +}
> > +
> >  struct sched_domain {
> > /* These fields must be setup */
> > struct sched_domain __rcu *parent;  /* top domain must be 
> > null terminated */
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 6b3b59cc51d6..9a3c82645472 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -6023,6 +6023,26 @@ void __update_idle_core(struct rq *rq)
> > rcu_read_unlock();
> >  }
> >
> > +/*
> > + * Update cpu idle state and record this information
> > + * in sd_llc_shared->idle_cpus_span.
> > + */
> > +void update_idle_cpumask(struct rq *rq)
> > +{
> > +   struct sched_domain *sd;
> > +   int cpu = cpu_of(rq);
> > +
> > +   rcu_read_lock();
> > +   sd = rcu_dereference(per_cpu(sd_llc, cpu));
> > +   if (!sd || !sd->shared)
> > +   goto unlock;
> > +   if (!available_idle_cpu(cpu) || !sched_idle_cpu(cpu))
> > +   goto unlock;
> > +   cpumask_set_cpu(cpu, sds_idle_cpus(sd->shared));
> > +unlock:
> > +   rcu_read_unlock();
> > +}
> > +
> >  /*
> >   * Scan the entire LLC domain for idle cores; this dynamically 
> > switches off if
> >   * there are no idle cores left in the system; tracked through
> > @@ -6136,7 +6156,12 @@ static int select_idle_cpu(struct task_struct 
> > *p, struct sched_domain *sd, int t
> >
> > time = cpu_clock(this);
> >
> > -   cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
> > +   /*
> > +* sched_domain_shared is set only at shared cache level,
> > +* this works only because select_idle_cpu is called with

Re: [RFC PATCH v2] sched/fair: select idle cpu from idle cpumask in sched domain

2020-09-22 Thread Vincent Guittot
On Tue, 22 Sep 2020 at 05:33, Li, Aubrey  wrote:
>
> On 2020/9/21 23:21, Vincent Guittot wrote:
> > On Mon, 21 Sep 2020 at 17:14, Vincent Guittot
> >  wrote:
> >>
> >> On Thu, 17 Sep 2020 at 11:21, Li, Aubrey  wrote:
> >>>
> >>> On 2020/9/16 19:00, Mel Gorman wrote:
>  On Wed, Sep 16, 2020 at 12:31:03PM +0800, Aubrey Li wrote:
> > Added idle cpumask to track idle cpus in sched domain. When a CPU
> > enters idle, its corresponding bit in the idle cpumask will be set,
> > and when the CPU exits idle, its bit will be cleared.
> >
> > When a task wakes up to select an idle cpu, scanning the idle cpumask
> > has lower cost than scanning all the cpus in the last level cache domain,
> > especially when the system is heavily loaded.
> >
> > The following benchmarks were tested on an x86 4-socket system with
> > 24 cores per socket and 2 hyperthreads per core, total 192 CPUs:
> >
> 
>  This still appears to be tied to turning the tick off. An idle CPU
>  available for computation does not necessarily have the tick turned off
>  if it's idle for short periods of time. When nohz is disabled or a machine is
>  active enough that CPUs are not disabling the tick, select_idle_cpu may
>  fail to select an idle CPU and instead stack tasks on the old CPU.
> 
>  The other subtlety is that select_idle_sibling() currently allows a
>  SCHED_IDLE cpu to be used as a wakeup target. The CPU is not really
>  idle as such; it's simply running a low-priority task that is suitable
>  for preemption. I suspect this patch breaks that.
> 
> >>> Thanks!
> >>>
> >>> I shall post a v3 with performance data. I did a quick uperf test and
> >>> found the benefit is still there. So I posted the patch here, looking
> >>> forward to your comments before I start the benchmarks.
> >>>
> >>> Thanks,
> >>> -Aubrey
> >>>
> >>> ---
> >>> diff --git a/include/linux/sched/topology.h 
> >>> b/include/linux/sched/topology.h
> >>> index fb11091129b3..43a641d26154 100644
> >>> --- a/include/linux/sched/topology.h
> >>> +++ b/include/linux/sched/topology.h
> >>> @@ -65,8 +65,21 @@ struct sched_domain_shared {
> >>> atomic_tref;
> >>> atomic_tnr_busy_cpus;
> >>> int has_idle_cores;
> >>> +   /*
> >>> +* Span of all idle CPUs in this domain.
> >>> +*
> >>> +* NOTE: this field is variable length. (Allocated dynamically
> >>> +* by attaching extra space to the end of the structure,
> >>> +* depending on how many CPUs the kernel has booted up with)
> >>> +*/
> >>> +   unsigned long   idle_cpus_span[];
> >>>  };
> >>>
> >>> +static inline struct cpumask *sds_idle_cpus(struct sched_domain_shared 
> >>> *sds)
> >>> +{
> >>> +   return to_cpumask(sds->idle_cpus_span);
> >>> +}
> >>> +
> >>>  struct sched_domain {
> >>> /* These fields must be setup */
> >>> struct sched_domain __rcu *parent;  /* top domain must be 
> >>> null terminated */
> >>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> >>> index 6b3b59cc51d6..9a3c82645472 100644
> >>> --- a/kernel/sched/fair.c
> >>> +++ b/kernel/sched/fair.c
> >>> @@ -6023,6 +6023,26 @@ void __update_idle_core(struct rq *rq)
> >>> rcu_read_unlock();
> >>>  }
> >>>
> >>> +/*
> >>> + * Update cpu idle state and record this information
> >>> + * in sd_llc_shared->idle_cpus_span.
> >>> + */
> >>> +void update_idle_cpumask(struct rq *rq)
> >>> +{
> >>> +   struct sched_domain *sd;
> >>> +   int cpu = cpu_of(rq);
> >>> +
> >>> +   rcu_read_lock();
> >>> +   sd = rcu_dereference(per_cpu(sd_llc, cpu));
> >>> +   if (!sd || !sd->shared)
> >>> +   goto unlock;
> >>> +   if (!available_idle_cpu(cpu) || !sched_idle_cpu(cpu))
> >>> +   goto unlock;
> >>> +   cpumask_set_cpu(cpu, sds_idle_cpus(sd->shared));
> >>> +unlock:
> >>> +   rcu_read_unlock();
> >>> +}
> >>> +
> >>>  /*
> >>>   * Scan the entire LLC domain for idle cores; this dynamically switches 
> >>> off if
> >>>   * there are no idle cores left in the system; tracked through
> >>> @@ -6136,7 +6156,12 @@ static int select_idle_cpu(struct task_struct *p, 
> >>> struct sched_domain *sd, int t
> >>>
> >>> time = cpu_clock(this);
> >>>
> >>> -   cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
> >>> +   /*
> >>> +* sched_domain_shared is set only at shared cache level,
> >>> +* this works only because select_idle_cpu is called with
> >>> +* sd_llc.
> >>> +*/
> >>> +   cpumask_and(cpus, sds_idle_cpus(sd->shared), p->cpus_ptr);
> >>>
> >>> for_each_cpu_wrap(cpu, cpus, target) {
> >>> if (!--nr)
> >>> @@ -6712,6 +6737,10 @@ select_task_rq_fair(struct task_struct *p, int 
> >>> prev_cpu, int sd_flag, int wake_f
> >>>
> >>> if (want_affine)

Re: [RFC PATCH v2] sched/fair: select idle cpu from idle cpumask in sched domain

2020-09-21 Thread Vincent Guittot
On Mon, 21 Sep 2020 at 17:14, Vincent Guittot
 wrote:
>
> On Thu, 17 Sep 2020 at 11:21, Li, Aubrey  wrote:
> >
> > On 2020/9/16 19:00, Mel Gorman wrote:
> > > On Wed, Sep 16, 2020 at 12:31:03PM +0800, Aubrey Li wrote:
> > >> Added idle cpumask to track idle cpus in sched domain. When a CPU
> > >> enters idle, its corresponding bit in the idle cpumask will be set,
> > >> and when the CPU exits idle, its bit will be cleared.
> > >>
> > >> When a task wakes up to select an idle cpu, scanning the idle cpumask
> > >> has lower cost than scanning all the cpus in the last level cache domain,
> > >> especially when the system is heavily loaded.
> > >>
> > >> The following benchmarks were tested on an x86 4-socket system with
> > >> 24 cores per socket and 2 hyperthreads per core, total 192 CPUs:
> > >>
> > >
> > > This still appears to be tied to turning the tick off. An idle CPU
> > > available for computation does not necessarily have the tick turned off
> > > if it's idle for short periods of time. When nohz is disabled or a machine is
> > > active enough that CPUs are not disabling the tick, select_idle_cpu may
> > > fail to select an idle CPU and instead stack tasks on the old CPU.
> > >
> > > The other subtlety is that select_idle_sibling() currently allows a
> > > SCHED_IDLE cpu to be used as a wakeup target. The CPU is not really
> > > idle as such; it's simply running a low-priority task that is suitable
> > > for preemption. I suspect this patch breaks that.
> > >
> > Thanks!
> >
> > I shall post a v3 with performance data. I did a quick uperf test and
> > found the benefit is still there. So I posted the patch here, looking
> > forward to your comments before I start the benchmarks.
> >
> > Thanks,
> > -Aubrey
> >
> > ---
> > diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
> > index fb11091129b3..43a641d26154 100644
> > --- a/include/linux/sched/topology.h
> > +++ b/include/linux/sched/topology.h
> > @@ -65,8 +65,21 @@ struct sched_domain_shared {
> > atomic_tref;
> > atomic_tnr_busy_cpus;
> > int has_idle_cores;
> > +   /*
> > +* Span of all idle CPUs in this domain.
> > +*
> > +* NOTE: this field is variable length. (Allocated dynamically
> > +* by attaching extra space to the end of the structure,
> > +* depending on how many CPUs the kernel has booted up with)
> > +*/
> > +   unsigned long   idle_cpus_span[];
> >  };
> >
> > +static inline struct cpumask *sds_idle_cpus(struct sched_domain_shared 
> > *sds)
> > +{
> > +   return to_cpumask(sds->idle_cpus_span);
> > +}
> > +
> >  struct sched_domain {
> > /* These fields must be setup */
> > struct sched_domain __rcu *parent;  /* top domain must be null 
> > terminated */
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 6b3b59cc51d6..9a3c82645472 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -6023,6 +6023,26 @@ void __update_idle_core(struct rq *rq)
> > rcu_read_unlock();
> >  }
> >
> > +/*
> > + * Update cpu idle state and record this information
> > + * in sd_llc_shared->idle_cpus_span.
> > + */
> > +void update_idle_cpumask(struct rq *rq)
> > +{
> > +   struct sched_domain *sd;
> > +   int cpu = cpu_of(rq);
> > +
> > +   rcu_read_lock();
> > +   sd = rcu_dereference(per_cpu(sd_llc, cpu));
> > +   if (!sd || !sd->shared)
> > +   goto unlock;
> > +   if (!available_idle_cpu(cpu) || !sched_idle_cpu(cpu))
> > +   goto unlock;
> > +   cpumask_set_cpu(cpu, sds_idle_cpus(sd->shared));
> > +unlock:
> > +   rcu_read_unlock();
> > +}
> > +
> >  /*
> >   * Scan the entire LLC domain for idle cores; this dynamically switches 
> > off if
> >   * there are no idle cores left in the system; tracked through
> > @@ -6136,7 +6156,12 @@ static int select_idle_cpu(struct task_struct *p, 
> > struct sched_domain *sd, int t
> >
> > time = cpu_clock(this);
> >
> > -   cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
> > +   /*
> > +* sched_domain_shared is set only at shared cache level,
> > +* this works only because select_idle_cpu is called with
> > +* sd_llc.
> > +*/
> > +   cpumask_and(cpus, sds_idle_cpus(sd->shared), p->cpus_ptr);
> >
> > for_each_cpu_wrap(cpu, cpus, target) {
> > if (!--nr)
> > @@ -6712,6 +6737,10 @@ select_task_rq_fair(struct task_struct *p, int 
> > prev_cpu, int sd_flag, int wake_f
> >
> > if (want_affine)
> > current->recent_used_cpu = cpu;
> > +
> > +   sd = rcu_dereference(per_cpu(sd_llc, new_cpu));
> > +   if (sd && sd->shared)
> > +   cpumask_clear_cpu(new_cpu, 
> > sds_idle_cpus(sd->shared));
>
> Why are you clearing the bit only for the fast path?

Re: [RFC PATCH v2] sched/fair: select idle cpu from idle cpumask in sched domain

2020-09-21 Thread Vincent Guittot
On Thu, 17 Sep 2020 at 11:21, Li, Aubrey  wrote:
>
> On 2020/9/16 19:00, Mel Gorman wrote:
> > On Wed, Sep 16, 2020 at 12:31:03PM +0800, Aubrey Li wrote:
> >> Added idle cpumask to track idle cpus in sched domain. When a CPU
> >> enters idle, its corresponding bit in the idle cpumask will be set,
> >> and when the CPU exits idle, its bit will be cleared.
> >>
> >> When a task wakes up to select an idle cpu, scanning the idle cpumask
> >> has lower cost than scanning all the cpus in the last level cache domain,
> >> especially when the system is heavily loaded.
> >>
> >> The following benchmarks were tested on an x86 4-socket system with
> >> 24 cores per socket and 2 hyperthreads per core, total 192 CPUs:
> >>
> >
> > This still appears to be tied to turning the tick off. An idle CPU
> > available for computation does not necessarily have the tick turned off
> > if it's idle for short periods of time. When nohz is disabled or a machine is
> > active enough that CPUs are not disabling the tick, select_idle_cpu may
> > fail to select an idle CPU and instead stack tasks on the old CPU.
> >
> > The other subtlety is that select_idle_sibling() currently allows a
> > SCHED_IDLE cpu to be used as a wakeup target. The CPU is not really
> > idle as such; it's simply running a low-priority task that is suitable
> > for preemption. I suspect this patch breaks that.
> >
> Thanks!
>
> I shall post a v3 with performance data. I did a quick uperf test and
> found the benefit is still there. So I posted the patch here, looking
> forward to your comments before I start the benchmarks.
>
> Thanks,
> -Aubrey
>
> ---
> diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
> index fb11091129b3..43a641d26154 100644
> --- a/include/linux/sched/topology.h
> +++ b/include/linux/sched/topology.h
> @@ -65,8 +65,21 @@ struct sched_domain_shared {
> atomic_t ref;
> atomic_t nr_busy_cpus;
> int has_idle_cores;
> +   /*
> +* Span of all idle CPUs in this domain.
> +*
> +* NOTE: this field is variable length. (Allocated dynamically
> +* by attaching extra space to the end of the structure,
> +* depending on how many CPUs the kernel has booted up with)
> +*/
> +   unsigned long   idle_cpus_span[];
>  };
>
> +static inline struct cpumask *sds_idle_cpus(struct sched_domain_shared *sds)
> +{
> +   return to_cpumask(sds->idle_cpus_span);
> +}
> +
>  struct sched_domain {
> /* These fields must be setup */
> struct sched_domain __rcu *parent;  /* top domain must be null terminated */
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 6b3b59cc51d6..9a3c82645472 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6023,6 +6023,26 @@ void __update_idle_core(struct rq *rq)
> rcu_read_unlock();
>  }
>
> +/*
> + * Update cpu idle state and record this information
> + * in sd_llc_shared->idle_cpus_span.
> + */
> +void update_idle_cpumask(struct rq *rq)
> +{
> +   struct sched_domain *sd;
> +   int cpu = cpu_of(rq);
> +
> +   rcu_read_lock();
> +   sd = rcu_dereference(per_cpu(sd_llc, cpu));
> +   if (!sd || !sd->shared)
> +   goto unlock;
> +   if (!available_idle_cpu(cpu) || !sched_idle_cpu(cpu))
> +   goto unlock;
> +   cpumask_set_cpu(cpu, sds_idle_cpus(sd->shared));
> +unlock:
> +   rcu_read_unlock();
> +}
> +
>  /*
>   * Scan the entire LLC domain for idle cores; this dynamically switches off if
>   * there are no idle cores left in the system; tracked through
> @@ -6136,7 +6156,12 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
>
> time = cpu_clock(this);
>
> -   cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
> +   /*
> +* sched_domain_shared is set only at shared cache level,
> +* this works only because select_idle_cpu is called with
> +* sd_llc.
> +*/
> +   cpumask_and(cpus, sds_idle_cpus(sd->shared), p->cpus_ptr);
>
> for_each_cpu_wrap(cpu, cpus, target) {
> if (!--nr)
> @@ -6712,6 +6737,10 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
>
> if (want_affine)
> current->recent_used_cpu = cpu;
> +
> +   sd = rcu_dereference(per_cpu(sd_llc, new_cpu));
> +   if (sd && sd->shared)
> +   cpumask_clear_cpu(new_cpu, sds_idle_cpus(sd->shared));

Why are you clearing the bit only for the fast path? The slow path
can also select an idle CPU.

Then, I'm afraid that updating a cpumask at each and every task wakeup
will be far too expensive. That's why we are not updating
nohz.idle_cpus_mask at each and every idle enter/exit but only once
per tick.

And a quick test with hackbench on my octo-core arm64 gives

Re: [RFC PATCH v2] sched/fair: select idle cpu from idle cpumask in sched domain

2020-09-17 Thread Li, Aubrey
On 2020/9/16 19:00, Mel Gorman wrote:
> On Wed, Sep 16, 2020 at 12:31:03PM +0800, Aubrey Li wrote:
>> Added idle cpumask to track idle cpus in sched domain. When a CPU
>> enters idle, its corresponding bit in the idle cpumask will be set,
>> and when the CPU exits idle, its bit will be cleared.
>>
>> When a task wakes up to select an idle cpu, scanning idle cpumask
>> has lower cost than scanning all the cpus in the last level cache domain,
>> especially when the system is heavily loaded.
>>
>> The following benchmarks were tested on an x86 4-socket system with
>> 24 cores per socket and 2 hyperthreads per core, total 192 CPUs:
>>
> 
> This still appears to be tied to turning the tick off. An idle CPU
> available for computation does not necessarily have the tick turned off
> if it's for short periods of time. When nohz is disabled or a machine is
> active enough that CPUs are not disabling the tick, select_idle_cpu may
> fail to select an idle CPU and instead stack tasks on the old CPU.
> 
> The other subtlety is that select_idle_sibling() currently allows a
> SCHED_IDLE cpu to be used as a wakeup target. The CPU is not really
> idle as such, it's simply running a low priority task that is suitable
> for preemption. I suspect this patch breaks that.
> 
Thanks!

I shall post a v3 with performance data. I made a quick uperf test and
found the benefit is still there, so I'm posting the patch here and looking
forward to your comments before I start the benchmarks.

Thanks,
-Aubrey

---
diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index fb11091129b3..43a641d26154 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -65,8 +65,21 @@ struct sched_domain_shared {
atomic_t ref;
atomic_t nr_busy_cpus;
int has_idle_cores;
+   /*
+* Span of all idle CPUs in this domain.
+*
+* NOTE: this field is variable length. (Allocated dynamically
+* by attaching extra space to the end of the structure,
+* depending on how many CPUs the kernel has booted up with)
+*/
+   unsigned long   idle_cpus_span[];
 };
 
+static inline struct cpumask *sds_idle_cpus(struct sched_domain_shared *sds)
+{
+   return to_cpumask(sds->idle_cpus_span);
+}
+
 struct sched_domain {
/* These fields must be setup */
struct sched_domain __rcu *parent;  /* top domain must be null terminated */
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 6b3b59cc51d6..9a3c82645472 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6023,6 +6023,26 @@ void __update_idle_core(struct rq *rq)
rcu_read_unlock();
 }
 
+/*
+ * Update cpu idle state and record this information
+ * in sd_llc_shared->idle_cpus_span.
+ */
+void update_idle_cpumask(struct rq *rq)
+{
+   struct sched_domain *sd;
+   int cpu = cpu_of(rq);
+
+   rcu_read_lock();
+   sd = rcu_dereference(per_cpu(sd_llc, cpu));
+   if (!sd || !sd->shared)
+   goto unlock;
+   if (!available_idle_cpu(cpu) || !sched_idle_cpu(cpu))
+   goto unlock;
+   cpumask_set_cpu(cpu, sds_idle_cpus(sd->shared));
+unlock:
+   rcu_read_unlock();
+}
+
 /*
  * Scan the entire LLC domain for idle cores; this dynamically switches off if
  * there are no idle cores left in the system; tracked through
@@ -6136,7 +6156,12 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
 
time = cpu_clock(this);
 
-   cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
+   /*
+* sched_domain_shared is set only at shared cache level,
+* this works only because select_idle_cpu is called with
+* sd_llc.
+*/
+   cpumask_and(cpus, sds_idle_cpus(sd->shared), p->cpus_ptr);
 
for_each_cpu_wrap(cpu, cpus, target) {
if (!--nr)
@@ -6712,6 +6737,10 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
 
if (want_affine)
current->recent_used_cpu = cpu;
+
+   sd = rcu_dereference(per_cpu(sd_llc, new_cpu));
+   if (sd && sd->shared)
+   cpumask_clear_cpu(new_cpu, sds_idle_cpus(sd->shared));
}
rcu_read_unlock();
 
@@ -10871,6 +10900,9 @@ static void set_next_task_fair(struct rq *rq, struct task_struct *p, bool first)
/* ensure bandwidth has been allocated on our new cfs_rq */
account_cfs_rq_runtime(cfs_rq, 0);
}
+   /* Update idle cpumask if task has idle policy */
+   if (unlikely(task_has_idle_policy(p)))
+   update_idle_cpumask(rq);
 }
 
 void init_cfs_rq(struct cfs_rq *cfs_rq)
diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index 1ae95b9150d3..876dfdfe35bb 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -405,6 +40

Re: [RFC PATCH v2] sched/fair: select idle cpu from idle cpumask in sched domain

2020-09-16 Thread Valentin Schneider


On 16/09/20 12:00, Mel Gorman wrote:
> On Wed, Sep 16, 2020 at 12:31:03PM +0800, Aubrey Li wrote:
>> Added idle cpumask to track idle cpus in sched domain. When a CPU
>> enters idle, its corresponding bit in the idle cpumask will be set,
>> and when the CPU exits idle, its bit will be cleared.
>>
>> When a task wakes up to select an idle cpu, scanning idle cpumask
>> has lower cost than scanning all the cpus in the last level cache domain,
>> especially when the system is heavily loaded.
>>
>> The following benchmarks were tested on an x86 4-socket system with
>> 24 cores per socket and 2 hyperthreads per core, total 192 CPUs:
>>
>
> This still appears to be tied to turning the tick off. An idle CPU
> available for computation does not necessarily have the tick turned off
> if it's for short periods of time. When nohz is disabled or a machine is
> active enough that CPUs are not disabling the tick, select_idle_cpu may
> fail to select an idle CPU and instead stack tasks on the old CPU.
>

Vincent was pointing out in v1 that we ratelimit nohz_balance_exit_idle()
by having it happen on a tick, to prevent being hammered by a flurry of
sub-tick idle enters/exits. I'm afraid flipping bits of this
cpumask on idle enter/exit might be too brutal.

> The other subtlety is that select_idle_sibling() currently allows a
> SCHED_IDLE cpu to be used as a wakeup target. The CPU is not really
> idle as such, it's simply running a low priority task that is suitable
> for preemption. I suspect this patch breaks that.

I think you're spot on.

An alternative I see here would be to move this into its own
select_idle_foo() function. If that mask is empty or none of the tagged
CPUs actually pass available_idle_cpu(), we fall-through to the usual idle
searches.

That's far from perfect; you could wake a truly idle CPU instead of
preempting a SCHED_IDLE task on a warm and busy CPU. I'm not sure a
proliferation of cpumasks really is the answer to that...


Re: [RFC PATCH v2] sched/fair: select idle cpu from idle cpumask in sched domain

2020-09-16 Thread Mel Gorman
On Wed, Sep 16, 2020 at 12:31:03PM +0800, Aubrey Li wrote:
> Added idle cpumask to track idle cpus in sched domain. When a CPU
> enters idle, its corresponding bit in the idle cpumask will be set,
> and when the CPU exits idle, its bit will be cleared.
> 
> When a task wakes up to select an idle cpu, scanning idle cpumask
> has lower cost than scanning all the cpus in the last level cache domain,
> especially when the system is heavily loaded.
> 
> The following benchmarks were tested on an x86 4-socket system with
> 24 cores per socket and 2 hyperthreads per core, total 192 CPUs:
> 

This still appears to be tied to turning the tick off. An idle CPU
available for computation does not necessarily have the tick turned off
if it's for short periods of time. When nohz is disabled or a machine is
active enough that CPUs are not disabling the tick, select_idle_cpu may
fail to select an idle CPU and instead stack tasks on the old CPU.

The other subtlety is that select_idle_sibling() currently allows a
SCHED_IDLE cpu to be used as a wakeup target. The CPU is not really
idle as such, it's simply running a low priority task that is suitable
for preemption. I suspect this patch breaks that.

-- 
Mel Gorman
SUSE Labs


Re: [RFC PATCH v2] sched/fair: select idle cpu from idle cpumask in sched domain

2020-09-16 Thread Vincent Guittot
On Wed, 16 Sep 2020 at 13:00, Mel Gorman  wrote:
>
> On Wed, Sep 16, 2020 at 12:31:03PM +0800, Aubrey Li wrote:
> > Added idle cpumask to track idle cpus in sched domain. When a CPU
> > enters idle, its corresponding bit in the idle cpumask will be set,
> > and when the CPU exits idle, its bit will be cleared.
> >
> > When a task wakes up to select an idle cpu, scanning idle cpumask
> > has lower cost than scanning all the cpus in the last level cache domain,
> > especially when the system is heavily loaded.
> >
> > The following benchmarks were tested on an x86 4-socket system with
> > 24 cores per socket and 2 hyperthreads per core, total 192 CPUs:
> >
>
> This still appears to be tied to turning the tick off. An idle CPU
> available for computation does not necessarily have the tick turned off
> if it's for short periods of time. When nohz is disabled or a machine is
> active enough that CPUs are not disabling the tick, select_idle_cpu may
> fail to select an idle CPU and instead stack tasks on the old CPU.
>
> The other subtlety is that select_idle_sibling() currently allows a
> SCHED_IDLE cpu to be used as a wakeup target. The CPU is not really
> idle as such, it's simply running a low priority task that is suitable
> for preemption. I suspect this patch breaks that.

Yes, good point. I completely missed this

>
> --
> Mel Gorman
> SUSE Labs


[RFC PATCH v2] sched/fair: select idle cpu from idle cpumask in sched domain

2020-09-15 Thread Aubrey Li
Added idle cpumask to track idle cpus in sched domain. When a CPU
enters idle, its corresponding bit in the idle cpumask will be set,
and when the CPU exits idle, its bit will be cleared.

When a task wakes up to select an idle cpu, scanning idle cpumask
has lower cost than scanning all the cpus in the last level cache domain,
especially when the system is heavily loaded.

The following benchmarks were tested on an x86 4-socket system with
24 cores per socket and 2 hyperthreads per core, total 192 CPUs:

uperf throughput: netperf workload, tcp_nodelay, r/w size = 90

  threads   baseline-avg    %std    patch-avg   %std
  96        1               1.24    0.98        2.76
  144       1               1.13    1.35        4.01
  192       1               0.58    1.67        3.25
  240       1               2.49    1.68        3.55

hackbench: process mode, 10 loops, 40 file descriptors per group

  group     baseline-avg    %std    patch-avg   %std
  2(80)     1               12.05   0.97        9.88
  3(120)    1               12.48   0.95        11.62
  4(160)    1               13.83   0.97        13.22
  5(200)    1               2.76    1.01        2.94

schbench: 99th percentile latency, 16 workers per message thread

  mthread   baseline-avg    %std    patch-avg   %std
  6(96)     1               1.24    0.993       1.73
  9(144)    1               0.38    0.998       0.39
  12(192)   1               1.58    0.995       1.64
  15(240)   1               51.71   0.606       37.41

sysbench mysql throughput: read/write, table size = 10,000,000

  thread    baseline-avg    %std    patch-avg   %std
  96        1               1.77    1.015       1.71
  144       1               3.39    0.998       4.05
  192       1               2.88    1.002       2.81
  240       1               2.07    1.011       2.09

kbuild: kexec reboot every time

  baseline-avg      patch-avg
  1                 1

v1->v2:
- idle cpumask is now updated in the nohz routines; by initializing the idle
  cpumask with sched_domain_span(sd), the nohz=off case retains the original
  behavior.

Cc: Qais Yousef 
Cc: Valentin Schneider 
Cc: Jiang Biao 
Cc: Tim Chen 
Signed-off-by: Aubrey Li 
---
 include/linux/sched/topology.h | 13 +
 kernel/sched/fair.c|  9 -
 kernel/sched/topology.c|  3 ++-
 3 files changed, 23 insertions(+), 2 deletions(-)

diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index fb11091129b3..43a641d26154 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -65,8 +65,21 @@ struct sched_domain_shared {
atomic_t ref;
atomic_t nr_busy_cpus;
int has_idle_cores;
+   /*
+* Span of all idle CPUs in this domain.
+*
+* NOTE: this field is variable length. (Allocated dynamically
+* by attaching extra space to the end of the structure,
+* depending on how many CPUs the kernel has booted up with)
+*/
+   unsigned long   idle_cpus_span[];
 };
 
+static inline struct cpumask *sds_idle_cpus(struct sched_domain_shared *sds)
+{
+   return to_cpumask(sds->idle_cpus_span);
+}
+
 struct sched_domain {
/* These fields must be setup */
struct sched_domain __rcu *parent;  /* top domain must be null terminated */
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 6b3b59cc51d6..cfe78fcf69da 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6136,7 +6136,12 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
 
time = cpu_clock(this);
 
-   cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
+   /*
+* sched_domain_shared is set only at shared cache level,
+* this works only because select_idle_cpu is called with
+* sd_llc.
+*/
+   cpumask_and(cpus, sds_idle_cpus(sd->shared), p->cpus_ptr);
 
for_each_cpu_wrap(cpu, cpus, target) {
if (!--nr)
@@ -10182,6 +10187,7 @@ static void set_cpu_sd_state_busy(int cpu)
sd->nohz_idle = 0;
 
atomic_inc(&sd->shared->nr_busy_cpus);
+   cpumask_clear_cpu(cpu, sds_idle_cpus(sd->shared));
 unlock:
rcu_read_unlock();
 }
@@ -10212,6 +10218,7 @@ static void set_cpu_sd_state_idle(int cpu)
sd->nohz_idle = 1;
 
atomic_dec(&sd->shared->nr_busy_cpus);
+   cpumask_set_cpu(cpu, sds_idle_cpus(sd->shared));
 unlock:
rcu_read_unlock();
 }
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 9079d865a935..f14a6ef4de57 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1407,6 +1407,7 @@ sd_init(struct sched_domain_topology_level *tl,
sd->shared = *per_cpu_ptr(sdd->sds, sd_id);
atomic_inc(&sd->shared->ref);