Re: [PATCH] sched/fair: Fix negative energy delta in find_energy_efficient_cpu()

2021-04-20 Thread Dietmar Eggemann
On 20/04/2021 14:56, pierre.gond...@arm.com wrote:
> From: Pierre Gondois 
> 
> find_energy_efficient_cpu() (feec()) searches the best energy CPU
> to place a task on. To do so, compute_energy() estimates the energy
> impact of placing the task on a CPU, based on CPU and task utilization
> signals.
> 
> Utilization signals can be concurrently updated while evaluating a
> perf_domain. In some cases, this leads to having a 'negative delta',
> i.e. placing the task in the perf_domain is seen as an energy gain.
> Thus, any further energy comparison is biased.
> 
> In case of a 'negative delta', return prev_cpu since:
> 1. a 'negative delta' happens in less than 0.5% of feec() calls,
>on a Juno with 6 CPUs (4 little, 2 big)
> 2. it is unlikely to have two consecutive 'negative delta' for
>a task, so if the first call fails, feec() will correctly
>place the task in the next feec() call
> 3. EAS current behavior tends to select prev_cpu if the task
>doesn't raise the OPP of its current perf_domain. prev_cpu
>is EAS's generic decision
> 4. prev_cpu should be preferred to returning an error code.
>In the latter case, select_idle_sibling() would do the placement,
>selecting a big (and not energy efficient) CPU. As 3., the task
>would potentially reside on the big CPU for a long time
> 
> The patch also:
> a. groups the compute_energy() calls to lower the chances of having
>concurrent updates in between the calls
> b. skips the base_energy_pd computation if no CPU is available in a
>perf_domain
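The wrap-around behind such a 'negative delta' can be seen in isolation with a
few lines of standalone C; the energy numbers below are made up and only the
unsigned arithmetic matters:

#include <limits.h>
#include <stdio.h>

int main(void)
{
        /*
         * Illustrative values only: the energy estimated with the task
         * placed in the pd came out lower than the base energy because
         * the utilization signals changed between the two
         * compute_energy() calls.
         */
        unsigned long base_energy_pd = 1000;
        unsigned long cur_energy = 990;
        unsigned long best_delta = 50;

        /* The 'negative' delta wraps around to a huge positive value ... */
        unsigned long cur_delta = cur_energy - base_energy_pd;

        printf("cur_delta = %lu (ULONG_MAX = %lu)\n", cur_delta, ULONG_MAX);

        /* ... so any comparison against a sane best_delta is meaningless. */
        printf("cur_delta < best_delta: %d\n", cur_delta < best_delta);

        return 0;
}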

Did you run some tests to make sure you didn't regress on energy
consumption? You could run EAS' Energy tests w/ and w/o the patch
depicted in:

https://lkml.kernel.org/r/20181203095628.11858-1-quentin.per...@arm.com

> Fixes: eb92692b2544d ("sched/fair: Speed-up energy-aware wake-up")
> Reported-by: Xuewen Yan 
> Suggested-by: Xuewen Yan 
> Signed-off-by: Pierre Gondois 
> ---
>  kernel/sched/fair.c | 69 +
>  1 file changed, 39 insertions(+), 30 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 0dba0ebc3657..577482aa8919 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6594,8 +6594,8 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
>  {
>   unsigned long prev_delta = ULONG_MAX, best_delta = ULONG_MAX;
>   struct root_domain *rd = cpu_rq(smp_processor_id())->rd;
> + int cpu, best_energy_cpu = prev_cpu, target = -1;
>   unsigned long cpu_cap, util, base_energy = 0;
> - int cpu, best_energy_cpu = prev_cpu;
>   struct sched_domain *sd;
>   struct perf_domain *pd;
>  
> @@ -6614,19 +6614,18 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
>   if (!sd)
>   goto fail;
>  
> + target = prev_cpu;
> +
>   sync_entity_load_avg(&p->se);
>   if (!task_util_est(p))
> - goto unlock;
> + goto fail;
>  
>   for (; pd; pd = pd->next) {
>   unsigned long cur_delta, spare_cap, max_spare_cap = 0;
> + bool compute_prev_delta = false;
>   unsigned long base_energy_pd;
>   int max_spare_cap_cpu = -1;
>  
> - /* Compute the 'base' energy of the pd, without @p */
> - base_energy_pd = compute_energy(p, -1, pd);
> - base_energy += base_energy_pd;
> -
>   for_each_cpu_and(cpu, perf_domain_span(pd), sched_domain_span(sd)) {
>   if (!cpumask_test_cpu(cpu, p->cpus_ptr))
>   continue;
> @@ -6647,26 +6646,41 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
>   if (!fits_capacity(util, cpu_cap))
>   continue;
>  
> - /* Always use prev_cpu as a candidate. */
>   if (cpu == prev_cpu) {
> - prev_delta = compute_energy(p, prev_cpu, pd);
> - prev_delta -= base_energy_pd;
> - best_delta = min(best_delta, prev_delta);
> - }
> -
> - /*
> -  * Find the CPU with the maximum spare capacity in
> -  * the performance domain
> -  */
> - if (spare_cap > max_spare_cap) {
> + /* Always use prev_cpu as a candidate. */
> + compute_prev_delta = true;
> + } else if (spare_cap > max_spare_cap) {
> + /*
> +  * Find the CPU with the maximum spare capacity
> +  * in the performance domain.
> +  */
>   max_spare_cap = spare_cap;
>   max_spare_cap_cpu = cpu;
>   }
>   }
>  
> +   

Re: [PATCH v2 0/1] arm: topology: parse the topology from the dt

2021-04-19 Thread Dietmar Eggemann
On 19/04/2021 04:55, Ruifeng Zhang wrote:
> Dietmar Eggemann  wrote on Sat, 17 Apr 2021 at 01:00:
>>
>> On 16/04/2021 13:04, Ruifeng Zhang wrote:
>>> Dietmar Eggemann  wrote on Fri, 16 Apr 2021 at 18:39:
>>>>
>>>> On 16/04/2021 11:32, Valentin Schneider wrote:
>>>>> On 16/04/21 15:47, Ruifeng Zhang wrote:

[...]

>> I'm afraid that this is now a much weaker case to get this into
>> mainline.
> 
> But it's still a problem, and it doesn't break the original logic (parse
> topology from MPIDR or parse capacity); it only adds support for parsing
> the topology from DT.
> I think it should still be merged into mainline. If it isn't, DynamIQ
> SoCs have issues in sched and cpufreq.

IMHO, not necessarily. Your DynamIQ SoC is one cluster with 8 CPUs. It's
subdivided into 2 Frequency Domains (FDs).

CFS Energy-Aware-Scheduling (EAS, find_energy_efficient_cpu()) and
Capacity-Aware-Scheduling (CAS, select_idle_sibling() ->
select_idle_capacity()) work correctly even in case you only have an MC
sched domain (sd).
No matter which sd (MC, DIE) sd_asym_cpucapacity points to, we always
iterate over all CPUs: per performance domain (i.e. FD) in EAS and over
sched_domain_span(sd) in CAS.

CFS load-balancing (in case your system is `over-utilized`) might work
slightly different due to the missing DIE sd but not inevitably worse.

Do you have benchmarks or testcases in mind which convince you that
Phantom Domains are something you would need? BTW, they are called
Phantom since they let you use uarch and/or max CPU frequency domain to
fake real topology (like LLC) boundaries.

[...]

> Why do you keep the logic of topology_parse_cpu_capacity in the arm
> get_coretype_capacity function? The capacity-dmips-mhz will be parsed
> by drivers/base/arch_topology.c as follows:
> parse_dt_topology()
>   parse_cluster()
>     parse_core()
>       get_cpu_for_node()
>         topology_parse_cpu_capacity()

I think we still need it for systems out there w/o cpu-map in dt, like
my arm32 TC2 with mainline vexpress-v2p-ca15_a7.dts.

It's called twice on each CPU in case I add the cpu-map dt entry though.


Re: [PATCH v2 0/1] arm: topology: parse the topology from the dt

2021-04-16 Thread Dietmar Eggemann
On 16/04/2021 13:04, Ruifeng Zhang wrote:
> Dietmar Eggemann  wrote on Fri, 16 Apr 2021 at 18:39:
>>
>> On 16/04/2021 11:32, Valentin Schneider wrote:
>>> On 16/04/21 15:47, Ruifeng Zhang wrote:

[...]

>> I'm confused. Do you have the MT bit set to 1 then? So the issue that
>> the mpidr handling in arm32's store_cpu_topology() is not correct does
>> not exist?
> 
> I have reconfirmed it, the MT bit has been set to 1. I am sorry for
> the previous messages.
> The mpidr parse by store_cpu_topology is correct, at least for the sc9863a.

Nice! This is sorted then.

[...]

>> Is this what you need for your arm32 kernel system? Adding the
>> possibility to parse cpu-map to create Phantom Domains?
> 
> Yes, I need to parse the DT cpu-map to create different Phantom Domains.
> With it, the dts should be changed to:
> cpu-map {
> cluster0 {
> core0 {
> cpu = <>;
> };
> core1 {
> cpu = <>;
> };
> core2 {
> cpu = <>;
> };
> core3 {
> cpu = <>;
> };
> };
> 
> cluster1 {
> core0 {
> cpu = <>;
> };
> core1 {
> cpu = <>;
> };
> core2 {
> cpu = <>;
> };
> core3 {
> cpu = <>;
> };
> };
> };
> 

I'm afraid that this is now a much weaker case to get this into
mainline.

I'm able to run with an extra cpu-map entry:

diff --git a/arch/arm/boot/dts/vexpress-v2p-ca15_a7.dts b/arch/arm/boot/dts/vexpress-v2p-ca15_a7.dts
index 012d40a7228c..f60d9b448253 100644
--- a/arch/arm/boot/dts/vexpress-v2p-ca15_a7.dts
+++ b/arch/arm/boot/dts/vexpress-v2p-ca15_a7.dts
@@ -35,6 +35,29 @@ cpus {
#address-cells = <1>;
#size-cells = <0>;
 
+   cpu-map {
+   cluster0 {
+   core0 {
+   cpu = <>;
+   };
+   core1 {
+   cpu = <>;
+   };
+   };
+
+   cluster1 {
+   core0 {
+   cpu = <>;
+   };
+   core1 {
+   cpu = <>;
+   };
+   core2 {
+   cpu = <>;
+   };
+   };
+   };
+
cpu0: cpu@0 {
 
a condensed version (see below) of your patch on my Arm32 TC2.
The move of update_cpu_capacity() in store_cpu_topology() is only
necessary when I use the old clock-frequency based cpu_efficiency
approach for asymmetric CPU capacity (TC2 is a15/a7):

diff --git a/arch/arm/boot/dts/vexpress-v2p-ca15_a7.dts b/arch/arm/boot/dts/vexpress-v2p-ca15_a7.dts
index f60d9b448253..e0679cca40ed 100644
--- a/arch/arm/boot/dts/vexpress-v2p-ca15_a7.dts
+++ b/arch/arm/boot/dts/vexpress-v2p-ca15_a7.dts
@@ -64,7 +64,7 @@ cpu0: cpu@0 {
reg = <0>;
cci-control-port = <_control1>;
cpu-idle-states = <_SLEEP_BIG>;
-   capacity-dmips-mhz = <1024>;
+   clock-frequency = <10>;
dynamic-power-coefficient = <990>;
};
 
@@ -74,7 +74,7 @@ cpu1: cpu@1 {
reg = <1>;
cci-control-port = <_control1>;
cpu-idle-states = <_SLEEP_BIG>;
-   capacity-dmips-mhz = <1024>;
+   clock-frequency = <10>;
dynamic-power-coefficient = <990>;
};
 
@@ -84,7 +84,7 @@ cpu2: cpu@2 {
reg = <0x100>;
cci-con

Re: [PATCH v2 0/1] arm: topology: parse the topology from the dt

2021-04-16 Thread Dietmar Eggemann
On 16/04/2021 11:32, Valentin Schneider wrote:
> On 16/04/21 15:47, Ruifeng Zhang wrote:
>> For more requirements, if all cores in one physical cluster, the
>> {aff2} of all cores are the same value.
>> i.e. the sc9863a,
>> core0: 8100
>> core1: 81000100
>> core2: 81000200
>> core3: 81000300
>> core4: 81000400
>> core5: 81000500
>> core6: 81000600
>> core7: 81000700
>>
>> According to MPIDR all cores will parse to the one cluster, but it's
>> the big.LITTLE system, it's need two logic cluster for schedule or
>> cpufreq.
>> So I think it's better to add the logic of parse topology from DT.
> 
> Ah, so it's a slightly different issue, but still one that requires a
> different means of specifying topology.

I'm confused. Do you have the MT bit set to 1 then? So the issue that
the mpidr handling in arm32's store_cpu_topology() is not correct does
not exist?

With DynamIQ you have only *one* cluster, you should also be able to run
your big.LITTLE system with only an MC sched domain.

# cat /proc/schedstat
cpu0 
domain0 ff ... <- MC
...

You can introduce a cpu-map to create what we called Phantom Domains in
Android products.

# cat /proc/schedstat

cpu0 
domain0 0f ... <- MC
domain1 ff ... <- DIE

Is this what you need for your arm32 kernel system? Adding the
possibility to parse cpu-map to create Phantom Domains?


Re: [PATCH v2 0/1] arm: topology: parse the topology from the dt

2021-04-15 Thread Dietmar Eggemann
On 15/04/2021 20:09, Valentin Schneider wrote:
> On 14/04/21 20:23, Ruifeng Zhang wrote:
>> From: Ruifeng Zhang 
>>
>> In Unisoc, the sc9863a SoC, which uses Cortex-A55, has two software
>> versions; one of them runs the kernel on EL1 in aarch32.
>>             user(EL0)  kernel(EL1)
>> sc9863a_go  aarch32    aarch32
>> sc9863a     aarch64    aarch64
>>
>> When the kernel runs on EL1 in aarch32, the topology is parsed wrongly.
>> For example,
>> The MPIDR has been written to the chip register in armv8.2 format.
>> For example,
>> core0: 8000
>> core1: 8100
>> core2: 8200
>> ...
>>
>> It will parse to:
>> |       | aff2 | packageid | coreid |
>> |-------+------+-----------+--------|
>> | Core0 |    0 |         0 |      0 |
>> | Core1 |    0 |         1 |      0 |
>> | Core2 |    0 |         2 |      0 |
>> |  ...  |      |           |        |
>>
>> The wrong topology is that all of the coreid are 0 and unexpected
>> packageid.
>>
>> The reason is the MPIDR format is different between armv7 and armv8.2.
>> armv7 (A7) mpidr is:
>> [11:8]   [7:2]     [1:0]
>> cluster  reserved  cpu
>> The cortex-a7 spec DDI0464F 4.3.5
>> https://developer.arm.com/documentation/ddi0464/f/?lang=en
>>
>> armv8.2 (A55) mpidr is:
>> [23:16] [15:8]  [7:0]
>> cluster cpu thread
>>
> 
> What I had understood from our conversation was that there *isn't* a format
> difference (at least for the bottom 32 bits) - arm64/kernel/topopology.c
> would parse it the same, except that MPIDR parsing has been deprecated for
> arm64.
> 
> The problem is that those MPIDR values don't match the actual topology. If
> they had the MT bit set, i.e.
> 
>   core0: 8100
>   core1: 81000100
>   core2: 81000200
> 
> then it would be parsed as:
> 
>   |       | package_id | core_id | thread_id |
>   |-------+------------+---------+-----------|
>   | Core0 |          0 |       0 |         0 |
>   | Core1 |          0 |       1 |         0 |
>   | Core2 |          0 |       2 |         0 |
> 
> which would make more sense (wrt the actual, physical topology).

... and this would be in sync with
https://developer.arm.com/documentation/100442/0200/register-descriptions/aarch32-system-registers/mpidr--multiprocessor-affinity-register

MT, [24]

   0b1 ...

There is no 0b0 for MT.



Re: [PATCH v2 7/9] sched,debug: Convert sysctl sched_domains to debugfs

2021-04-15 Thread Dietmar Eggemann
On 15/04/2021 11:06, Peter Zijlstra wrote:
> On Tue, Apr 13, 2021 at 03:55:15PM +0100, Valentin Schneider wrote:
>> On 12/04/21 12:14, Peter Zijlstra wrote:
>>> Stop polluting sysctl, move to debugfs for SCHED_DEBUG stuff.
>>>
>>> Signed-off-by: Peter Zijlstra (Intel) 
>>> Reviewed-by: Dietmar Eggemann 
>>
>> On my Juno (2+4 big.LITTLE), sys/kernel/debug/sched/domains/ is now empty.
>>
>> I think that's because of unregister_sched_domain_sysctl() -
>> debugfs_remove() is recursive, and I do get a case where we rebuild the
>> domains but no CPU has been added or removed (we rebuild the domains when
>> cpufreq kicks in, it's part of the big.LITTLE ponies).
>>
>> Do we actually still need that unregister? From a brief glance it looks
>> like we could throw it out.
> 
> Yeah, I can't think of anything either. AFAICT it hasn't done anything
> useful since that cpumask optimization. Consider it gone.
> 
> I'll let it soak for another day or so, but then I was planning on
> merging this series.
> 
> Updated patch has been in queue.git/sched/debug since yesterday.

Had to check since v1 was working fine on Juno. So it was this
__cpumask_clear_cpu() in register_sched_domain_sysctl() introduced in v2
which let the files under /sys/kernel/debug/sched/domains disappear.

With {un,}register_sched_domain_sysctl() removed from
partition_sched_domains_locked() they're there again.

Looks like now register_sched_domain_sysctl() can be made static in
kernel/sched/debug.c.

[...]


Re: [PATCH 1/1] arm: topology: parse the topology from the dt

2021-04-14 Thread Dietmar Eggemann
On 14/04/2021 13:26, Ruifeng Zhang wrote:
> Dietmar Eggemann  wrote on Wed, 14 Apr 2021 at 17:43:
>>
>> On 13/04/2021 15:26, Ruifeng Zhang wrote:
>>> Thanks for your review. Patch-v2 that solve the capacity issue will be
>>> uploaded as soon as possible. : )
>>>
>>> Valentin Schneider  wrote on Tue, 13 Apr 2021 at 19:40:
>>>>
>>>> On 13/04/21 14:13, Ruifeng Zhang wrote:
>>>>> Valentin Schneider  wrote on Mon, 12 Apr 2021 at 23:33:

[...]

>> Looks like sc9863a has two frequency domains (1.6 and 1.2GHz). So
>> technically it's a big.LITTLE system (based only on max CPU frequency
>> (not on uarch) differences).
>> But the dts file doesn't contain any `capacity-dmips-mhz` entries? So
>> asymmetric CPU capacity (even only based on max CPU frequency) detection
>> won't kick in. Since you don't have any uarch diffs, you would have to
>> specify `capacity-dmips-mhz = <1024>` for each CPU.
> 
> Yes, for capacity, the DT should have a capacity-dmips-MHz entry or a
> clock-frequency entry (for A7 and A15 only).
> The sc9863a dts is a vendor file; in my opinion it is not appropriate to
> update it with this series.
> What do you think if I independently update the sc9863a dts file later?
> 

Yes, this is a separate thing. Just wanted to mention it here since this
allows you to test asymmetric CPU capacity on your platform w/ and w/o
your patch on arm64 and arm.

No need to add `clock-frequency` entries, just `capacity-dmips-mhz`
entries should do.


Re: [PATCH 1/1] arm: topology: parse the topology from the dt

2021-04-14 Thread Dietmar Eggemann
On 13/04/2021 15:26, Ruifeng Zhang wrote:
> Thanks for your review. Patch-v2 that solve the capacity issue will be
> uploaded as soon as possible. : )
> 
> Valentin Schneider  wrote on Tue, 13 Apr 2021 at 19:40:
>>
>> On 13/04/21 14:13, Ruifeng Zhang wrote:
>>> Valentin Schneider  wrote on Mon, 12 Apr 2021 at 23:33:
 I'm not fluent at all in armv7 (or most aarch32 compat mode stuff), but
 I couldn't find anything about MPIDR format differences:

   DDI 0487G.a G8.2.113
   """
   AArch32 System register MPIDR bits [31:0] are architecturally mapped to
   AArch64 System register MPIDR_EL1[31:0].
   """

 Peeking at some armv7 doc and arm/kernel/topology.c the layout really looks
 just the same, i.e. for both of them, with your example of:
>>>
>>> The cortex-a7 spec DDI0464F 4.3.5
>>> https://developer.arm.com/documentation/ddi0464/f/?lang=en
>>>
>>
>> Ah, so that's where the core_id=bit[1:0] comes from. That does still
>> conform to the MPIDR format, and as you point out below that's being parsed
>> the same (aff2, aff1, aff0) == mpidr([23:16][15:8][7:0])
>>
>>> The current arch/arm/kernel/topology code parses the MPIDR with an armv7
>>> format.
>>> the parse code is:
>>> void store_cpu_topology(unsigned int cpuid)
>>> {
>>> ...
>>> cpuid_topo->thread_id = -1;
>>> cpuid_topo->core_id = MPIDR_AFFINITY_LEVEL(mpidr, 0);
>>> cpuid_topo->package_id = MPIDR_AFFINITY_LEVEL(mpidr, 1);
>>> ...
>>> }

   core0: 8000
   core1: 8100
   core2: 8200
   ...

 we'll get:

   |       | aff2 | aff1 | aff0 |
   |-------+------+------+------|
   | Core0 |    0 |    0 |    0 |
   | Core1 |    0 |    1 |    0 |
   | Core2 |    0 |    2 |    0 |
   ...

 Now, arm64 doesn't fallback to MPIDR for topology information anymore since

   3102bc0e6ac7 ("arm64: topology: Stop using MPIDR for topology 
 information")

 so without DT we would get:
   |       | package_id | core_id |
   |-------+------------+---------|
   | Core0 |          0 |       0 |
   | Core1 |          0 |       1 |
   | Core2 |          0 |       2 |

 Whereas with an arm kernel we'll end up parsing MPIDR as:
   |       | package_id | core_id |
   |-------+------------+---------|
   | Core0 |          0 |       0 |
   | Core1 |          1 |       0 |
   | Core2 |          2 |       0 |

 Did I get this right? Is this what you're observing?
>>>
>>> Yes, this is a problem if an armv8.2 or above cpu is running a 32-bit
>>> kernel on EL1.
>>
>>
>> With the above MPIDR(_EL1) values, you would have the same problem in
>> aarch64 mode on any kernel predating
>>
>>   3102bc0e6ac7 ("arm64: topology: Stop using MPIDR for topology information")
>>
>> since all Aff0 values are 0. Arguably those MPIDR(_EL1) values don't
>> make much sense (cores in the same cluster should have different Aff0
>> values, unless SMT), but in arm64 that's usually "corrected" by DT.
>>
>> As you pointed out, arm doesn't currently leverage the cpu-map DT entry. I
>> don't see any obvious problem with adding support for it, so if you can fix
>> the capacity issue Dietmar reported, I think we could consider it.

Coming back to your original patch. You want to use parse_dt_topology()
from drivers/base/arch_topology.c to be able to detect a cpu-map in dt
and so bypass the read of mpidr in store_cpu_topology()?

Looks like sc9863a has two frequency domains (1.6 and 1.2GHz). So
technically it's a big.LITTLE system (based only on max CPU frequency
(not on uarch) differences).
But the dts file doesn't contain any `capacity-dmips-mhz` entries? So
asymmetric CPU capacity (even only based on max CPU frequency) detection
won't kick in. Since you don't have any uarch diffs, you would have to
specify `capacity-dmips-mhz = <1024>` for each CPU.


Re: [PATCH] sched: remove the redundant comments

2021-04-13 Thread Dietmar Eggemann
On 13/04/2021 18:28, Steven Rostedt wrote:
> On Tue, 13 Apr 2021 10:36:07 +0200
> Dietmar Eggemann  wrote:

[...]

>> Add a
>>
>>   Fixes: 55627e3cd22c ("sched/core: Remove rq->cpu_load[]")
>>
>> line.
> 
> It's just removing a comment. Should it really need a "Fixes" tag, which
> will cause many people to look at it to determine if it should be
> backported to stable?

I see, in this case let's skip it.


Re: [PATCH] sched: remove the redundant comments

2021-04-13 Thread Dietmar Eggemann
On 12/04/2021 09:39, Hui Su wrote:
> Since the commit 55627e3cd22c ("sched/core: Remove rq->cpu_load[]"),
> we don't need this any more.
> 
> Signed-off-by: Hui Su 
> ---
>  kernel/sched/sched.h | 5 -
>  1 file changed, 5 deletions(-)
> 
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 10a1522b1e30..2232022d8561 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -897,11 +897,6 @@ DECLARE_STATIC_KEY_FALSE(sched_uclamp_used);
>  struct rq {
>   /* runqueue lock: */
>   raw_spinlock_t  lock;
> -
> - /*
> -  * nr_running and cpu_load should be in the same cacheline because
> -  * remote CPUs use both these fields when doing load calculation.
> -  */
>   unsigned intnr_running;
>  #ifdef CONFIG_NUMA_BALANCING
>   unsigned intnr_numa_running;

I forgot to remove this snippet back then. LGTM.

Add a

  Fixes: 55627e3cd22c ("sched/core: Remove rq->cpu_load[]")

line.

Reviewed-by: Dietmar Eggemann 


Re: [PATCH 1/1] arm: topology: parse the topology from the dt

2021-04-12 Thread Dietmar Eggemann
On 12/04/2021 14:20, Ruifeng Zhang wrote:
> Valentin Schneider  wrote on Mon, 12 Apr 2021 at 19:32:
>>
>>
>> Hi,
>>
>> On 12/04/21 15:08, Ruifeng Zhang wrote:
>>> From: Ruifeng Zhang 
>>>
>>> The arm topology is still parsed from the MPIDR, but it is incomplete.
>>> When an armv8.3 cpu runs in aarch32 mode, it will parse out the wrong
>>> topology.
>>>
>>> armv7 (A7) mpidr is:
>>> [11:8]   [7:2]     [1:0]
>>> cluster  reserved  cpu
>>>
>>> armv8.3 (A55) mpidr is:
>>> [23:16] [15:8]  [7:0]
>>> cluster cpu thread
>>>
>>> For compatibility, to keep getting the capacity from the default
>>> cputype, the arm parse_dt_topology is renamed to get_cputype_capacity
>>> and the related DT parsing logic is deleted.
>>> Arm then uses the same parse_dt_topology function as arm64.
>>>
>>> At boot, arm first looks up the default cputype and gets the cpu
>>> capacity. Then it parses the topology and capacity from dt to replace
>>> the default values.
>>>
>>
>> I'm afraid I don't get it.
>>
>> CONFIG_COMPAT lets you run 32-bit stuff at EL0, but the kernel is still
>> arm64. So if you take your armv8.3 system, the topology parsed by the
>> kernel will be the same regardless of CONFIG_COMPAT.
>>
>> Could you elaborate on what problem you are trying to fix here?
> 
> There is an armv8.3 cpu which should work normally both on aarch64 and aarch32.
> The MPIDR has been written to the chip register in armv8.3 format.
> For example,
> core0: 8000
> core1: 8100
> core2: 8200
> ...
> 
> Its cpu topology can be parsed normally in aarch64 mode (both
> userspace and kernel work on arm64).
> 
> The problem is when it works in aarch32 mode (both userspace and
> kernel work on arm 32-bit): the cpu topology is parsed wrongly because
> the format is different between armv7 and armv8.3.
> The arm 32-bit driver, arch/arm/kernel/topology, parses the MPIDR
> and stores it to the topology with the armv7 layout,
> and the result is that every cpu's core_id is 0, from bit[1:0] of the
> armv7 MPIDR format.
> 
> In addition, I think arm should also allow customers to configure cpu
> topologies via DT.
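For reference, the wrong result can be reproduced with a standalone sketch of
that parsing (affinity-level extraction as done by arm32's
store_cpu_topology(); the full 32-bit MPIDR values are assumed from the
truncated ones quoted above):

#include <stdio.h>

/*
 * Affinity level extraction as done by the kernel's MPIDR_AFFINITY_LEVEL():
 * (mpidr >> (8 * level)) & 0xff
 */
#define MPIDR_AFFINITY_LEVEL(mpidr, level) (((mpidr) >> ((level) * 8)) & 0xff)

int main(void)
{
        /* Assumed full forms of the MPIDR values quoted above (MT bit clear) */
        unsigned int mpidr[] = { 0x80000000, 0x80000100, 0x80000200, 0x80000300 };

        for (int i = 0; i < 4; i++) {
                /* arm32 store_cpu_topology(): core_id = Aff0, package_id = Aff1 */
                unsigned int core_id    = MPIDR_AFFINITY_LEVEL(mpidr[i], 0);
                unsigned int package_id = MPIDR_AFFINITY_LEVEL(mpidr[i], 1);

                printf("core%d: core_id=%u package_id=%u\n",
                       i, core_id, package_id);
        }

        /*
         * Every core_id comes out as 0 and package_id differs per core,
         * i.e. the broken topology described above.
         */
        return 0;
}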

This patch ruins the CPU capacity detection based on capacity-dmips-mhz
(Documentation/devicetree/bindings/arm/cpu-capacity.txt) on my TC2 [L B
B L L] (armv7).

tip/sched/core with *mainline* multi_v7_defconfig:

root@linaro-nano:~# cat /sys/devices/system/cpu/cpu*/cpu_capacity
516
1024
1024
516
516

your patch with mainline multi_v7_defconfig:

root@linaro-nano:~#  cat /sys/devices/system/cpu/cpu*/cpu_capacity
1024
1024
1024
1024
1024


There are 2 capacity detection mechanism in arch/arm/kernel/topology.c:

(1) cpu_efficiency (only for armv7 a15 and a7) based, relies on
clock-frequency dt property

(2) capacity-dmips-mhz dt property based

I currently don't see how this different MPIDR layout leads to your code
changes.




Re: [PATCH v4 3/3] sched/fair: Introduce a CPU capacity comparison helper

2021-04-06 Thread Dietmar Eggemann
On 01/04/2021 21:30, Valentin Schneider wrote:
> During load-balance, groups classified as group_misfit_task are filtered
> out if they do not pass
> 
>   group_smaller_max_cpu_capacity(<candidate group>, <local group>);
> 
> which itself employs fits_capacity() to compare the sgc->max_capacity of
> both groups.
> 
> Due to the underlying margin, fits_capacity(X, 1024) will return false for
> any X > 819. Tough luck, the capacity_orig's on e.g. the Pixel 4 are
> {261, 871, 1024}. If a CPU-bound task ends up on one of those "medium"
> CPUs, misfit migration will never intentionally upmigrate it to a CPU of
> higher capacity due to the aforementioned margin.
> 
> One may argue the 20% margin of fits_capacity() is excessive in the advent
> of counter-enhanced load tracking (APERF/MPERF, AMUs), but one point here
> is that fits_capacity() is meant to compare a utilization value to a
> capacity value, whereas here it is being used to compare two capacity
> values. As CPU capacity and task utilization have different dynamics, a
> sensible approach here would be to add a new helper dedicated to comparing
> CPU capacities.
> 
> While at it, replace group_smaller_{min, max}_cpu_capacity() with
> comparisons of the source group's min/max capacity and the destination
> CPU's capacity.
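The 819 cutoff follows directly from fits_capacity()'s ~20% margin
(cap * 1280 < max * 1024 in the fair.c of that time); a quick standalone check:

#include <stdio.h>

/* fits_capacity() as in kernel/sched/fair.c: ~20% headroom required */
#define fits_capacity(cap, max) ((cap) * 1280UL < (max) * 1024UL)

int main(void)
{
        unsigned long caps[] = { 261, 819, 820, 871 };

        for (unsigned int i = 0; i < sizeof(caps) / sizeof(caps[0]); i++)
                printf("fits_capacity(%lu, 1024) = %d\n",
                       caps[i], fits_capacity(caps[i], 1024) ? 1 : 0);

        /*
         * 871 (the Pixel 4 'medium' CPUs) does not 'fit' 1024 with this
         * margin, so the mediums are never seen as smaller-capacity than
         * the bigs and misfit upmigration is blocked.
         */
        return 0;
}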

IMHO, you haven't mentioned why you replace local sched group with dst
CPU. I can see that only the capacity of the dst CPU makes really sense
here. Might be worth mentioning in the patch header why. There is some
of it in v3 6/7 but that's a different change.

Reviewed-by: Dietmar Eggemann 

[...]


Re: [PATCH v4 2/3] sched/fair: Clean up active balance nr_balance_failed trickery

2021-04-06 Thread Dietmar Eggemann
On 01/04/2021 21:30, Valentin Schneider wrote:
> When triggering an active load balance, sd->nr_balance_failed is set to
> such a value that any further can_migrate_task() using said sd will ignore
> the output of task_hot().
> 
> This behaviour makes sense, as active load balance intentionally preempts a
> rq's running task to migrate it right away, but this asynchronous write is
> a bit shoddy, as the stopper thread might run active_load_balance_cpu_stop
> before the sd->nr_balance_failed write either becomes visible to the
> stopper's CPU or even happens on the CPU that appended the stopper work.
> 
> Add a struct lb_env flag to denote active balancing, and use it in
> can_migrate_task(). Remove the sd->nr_balance_failed write that served the
> same purpose. Cleanup the LBF_DST_PINNED active balance special case.
> 
> Signed-off-by: Valentin Schneider 

Reviewed-by: Dietmar Eggemann 

[...]


Re: [PATCH v4 1/3] sched/fair: Ignore percpu threads for imbalance pulls

2021-04-06 Thread Dietmar Eggemann
On 01/04/2021 21:30, Valentin Schneider wrote:
> From: Lingutla Chandrasekhar 
> 
> During load balance, LBF_SOME_PINNED will bet set if any candidate task

nitpick; s/bet/be ?

[...]

Reviewed-by: Dietmar Eggemann 


Re: [PATCH] sched/cpupri: fix the task priority BUG_ON checks

2021-04-06 Thread Dietmar Eggemann
On 26/03/2021 04:15, Zeng Tao wrote:
> The BUG_ON checks are intended to check if the task priotity is valid,
> but in the function convert_prio, if the task priority is not valid, it
> will be converted to an uninitialized stack variable. Fix it by moving
> the BUG_ON checks to the default branch of convert_prio.
> 
> Fixes: 934fc3314b39 ("sched/cpupri: Remap CPUPRI_NORMAL to MAX_RT_PRIO-1")
> Signed-off-by: Zeng Tao 
> ---
>  kernel/sched/cpupri.c | 6 ++
>  1 file changed, 2 insertions(+), 4 deletions(-)
> 
> diff --git a/kernel/sched/cpupri.c b/kernel/sched/cpupri.c
> index ec9be78..c5a0e6e 100644
> --- a/kernel/sched/cpupri.c
> +++ b/kernel/sched/cpupri.c
> @@ -60,6 +60,8 @@ static int convert_prio(int prio)
>   case MAX_RT_PRIO:
>   cpupri = CPUPRI_HIGHER; /* 100 */
>   break;
> + default:
> + BUG();
>   }
>  
>   return cpupri;
> @@ -148,8 +150,6 @@ int cpupri_find_fitness(struct cpupri *cp, struct task_struct *p,
>   int task_pri = convert_prio(p->prio);
>   int idx, cpu;
>  
> - BUG_ON(task_pri >= CPUPRI_NR_PRIORITIES);
> -
>   for (idx = 0; idx < task_pri; idx++) {
>  
>   if (!__cpupri_find(cp, p, lowest_mask, idx))
> @@ -215,8 +215,6 @@ void cpupri_set(struct cpupri *cp, int cpu, int newpri)
>  
>   newpri = convert_prio(newpri);
>  
> - BUG_ON(newpri >= CPUPRI_NR_PRIORITIES);
> -
>   if (newpri == oldpri)
>   return;
>  
> 

LGTM. Currently, convert_prio() is never called with prio <
CPUPRI_INVALID (-1) or prio > MAX_RT_PRIO (100), but if it ever were,
this would now be caught much more easily with this patch.
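A userspace sketch of the resulting mapping (CPUPRI_*/MAX_RT_PRIO values
assumed from cpupri.h/prio.h; the default branch here just reports the
invalid prio instead of calling BUG()):

#include <stdio.h>

#define MAX_RT_PRIO     100     /* assumed, as in include/linux/sched/prio.h */
#define CPUPRI_INVALID  -1
#define CPUPRI_NORMAL   0
#define CPUPRI_HIGHER   100

static int convert_prio(int prio)
{
        switch (prio) {
        case CPUPRI_INVALID:
                return CPUPRI_INVALID;          /* -1 */
        case 0 ... MAX_RT_PRIO - 2:             /* GCC case-range extension */
                return MAX_RT_PRIO - 1 - prio;  /* 99 ... 1 */
        case MAX_RT_PRIO - 1:
                return CPUPRI_NORMAL;           /*  0 */
        case MAX_RT_PRIO:
                return CPUPRI_HIGHER;           /* 100 */
        default:
                fprintf(stderr, "invalid prio %d\n", prio);
                return CPUPRI_INVALID;
        }
}

int main(void)
{
        int prios[] = { -1, 0, 50, 98, 99, 100, 101 };

        for (unsigned int i = 0; i < sizeof(prios) / sizeof(prios[0]); i++)
                printf("prio %4d -> cpupri %4d\n", prios[i],
                       convert_prio(prios[i]));

        return 0;
}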

Reviewed-by: Dietmar Eggemann 


Re: [PATCH] sched/fair: use signed long when compute energy delta in eas

2021-04-01 Thread Dietmar Eggemann
+cc: Pierre Gondois 

On 30/03/2021 11:45, Quentin Perret wrote:
> Hi,
> 
> On Tuesday 30 Mar 2021 at 13:21:54 (+0800), Xuewen Yan wrote:
>> From: Xuewen Yan 
>>
>> Currently the energy delta is computed as follows:
>>
>> base_energy_pd = compute_energy(p, -1, pd);
>>     ---> traverses all CPUs in pd
>>     ---> em_pd_energy()
>>                                                        \
>> search for the max_spare_cap_cpu                        | search time
>>                                                         |
>> cur_delta = compute_energy(p, max_spare_cap_cpu, pd);   |
>>     ---> traverses all CPUs in pd                       /
>>     ---> em_pd_energy()
>> cur_delta -= base_energy_pd;
>>
>> During this search time, or when calculating cpu_util in
>> compute_energy(), a task dequeue or a cpu_util change may occur. This
>> can make cur_energy < base_energy_pd, so cur_delta would be negative.
>> But since cur_delta is an unsigned long, it then always ends up bigger
>> than the best_delta of the previous pd.
>>
>> Change the vars to signed long.
> 
> Is that really helping though?
> 
> Yes you will not overflow, but the decision is still 'wrong' if the util
> values are not stable for the entire wake-up. I think folks on the Arm
> side had patches to try and cache the util values upfront, and then use
> them throughout feec() and compute_energy(), which I think would be a
> better fix.
> 
> Dietmar, wdyt?

Yes, we have some patches from Pierre Gondois which introduce a pd cache
to store the CPU utilization values so they can be reused for 'cpu !=
dst_cpu' calculations within find_energy_efficient_cpu() (feec()).

We did run them in our Jan 2021 EAS integration:

https://gitlab.arm.com/linux-arm/linux-power/-/commits/eas/next/integration-20210129

  sched: Allocate pd_cache when EAS is enabled
  sched/fair: Use pd_cache to speed up find_energy_efficient_cpu()

We haven't posted them since we're still looking for a story to justify
the extra complexity. The experiments on Arm64 Juno (2 big, 4 little
CPUs) showed 1-2% failure due to changes of CPU utilization values
during feec(). There was a 5% (big CPU)-10% (little CPU) runtime
reduction for feec() with the patches.
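For illustration only, the idea can be sketched in plain C; the struct and
function names, the toy energy model and the utilization numbers below are
all made up:

#include <stdio.h>

#define NR_CPUS 6

/*
 * Hypothetical per-pd cache: snapshot each CPU's utilization once before
 * evaluating the perf_domain, so that the base energy and the candidate
 * energies are computed from the same values even if the live signals
 * change concurrently.
 */
struct pd_cache {
        unsigned long util[NR_CPUS];
};

/* Stand-in for the live PELT signals, which can change between calls. */
static unsigned long cpu_util_live(int cpu)
{
        static const unsigned long util[NR_CPUS] = { 120, 300, 80, 640, 512, 90 };
        return util[cpu];
}

static void pd_cache_snapshot(struct pd_cache *pdc)
{
        for (int cpu = 0; cpu < NR_CPUS; cpu++)
                pdc->util[cpu] = cpu_util_live(cpu);
}

/* Toy energy model: energy proportional to the summed utilization. */
static unsigned long pd_energy(const struct pd_cache *pdc, int dst_cpu,
                               unsigned long task_util)
{
        unsigned long sum = 0;

        for (int cpu = 0; cpu < NR_CPUS; cpu++)
                sum += pdc->util[cpu] + (cpu == dst_cpu ? task_util : 0);

        return sum;
}

int main(void)
{
        struct pd_cache pdc;
        unsigned long base, with_task;

        pd_cache_snapshot(&pdc);

        base = pd_energy(&pdc, -1, 100);        /* without the task */
        with_task = pd_energy(&pdc, 2, 100);    /* task placed on CPU2 */

        /* Both estimates use the same snapshot: the delta cannot go negative. */
        printf("base=%lu with_task=%lu delta=%lu\n",
               base, with_task, with_task - base);

        return 0;
}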


Re: [PATCH 7/9] sched,debug: Convert sysctl sched_domains to debugfs

2021-03-26 Thread Dietmar Eggemann
On 26/03/2021 11:33, Peter Zijlstra wrote:
> Stop polluting sysctl, move to debugfs for SCHED_DEBUG stuff.
> 
> Signed-off-by: Peter Zijlstra (Intel) 
> ---
>  kernel/sched/debug.c|  255 
> ++--
>  kernel/sched/sched.h|2 
>  kernel/sched/topology.c |1 
>  3 files changed, 59 insertions(+), 199 deletions(-)

Reviewed-by: Dietmar Eggemann 

[...]

> +#define SDM(type, mode, member)  \
> + debugfs_create_##type(#member, mode, parent, >member)
>  
> - WARN_ON(sd_ctl_dir[0].child);
> - sd_ctl_dir[0].child = cpu_entries;
> - }
> + SDM(ulong, 0644, min_interval);
> + SDM(ulong, 0644, max_interval);
> + SDM(u64,   0644, max_newidle_lb_cost);
> + SDM(u32,   0644, busy_factor);
> + SDM(u32,   0644, imbalance_pct);
> + SDM(u32,   0644, cache_nice_tries);
> +//   SDM(x32,   0444, flags);

Can be removed.

> + SDM(str,   0444, name);
>  
> - if (!cpu_idx) {
> - struct ctl_table *e = cpu_entries;
> +#undef SDM
>  
> - cpu_idx = kcalloc(nr_cpu_ids, sizeof(struct ctl_table*), 
> GFP_KERNEL);
> - if (!cpu_idx)
> - return;
> + debugfs_create_file("flags", 0444, parent, >flags, _flags_fops);
> +}

[...]


Re: [PATCH 9/9] sched,fair: Alternative sched_slice()

2021-03-26 Thread Dietmar Eggemann
On 26/03/2021 11:34, Peter Zijlstra wrote:
> The current sched_slice() seems to have issues; there's two possible
> things that could be improved:
> 
>  - the 'nr_running' used for __sched_period() is daft when cgroups are
>considered. Using the RQ wide h_nr_running seems like a much more
>consistent number.
> 
>  - (esp) cgroups can slice it real fine, which makes for easy
>over-scheduling, ensure min_gran is what the name says.

So ALT_PERIOD considers all runnable CFS tasks now and BASE_SLICE
guarantees min_gran as a floor for cgroup (hierarchies) with small
weight value(s)?

> 
> Signed-off-by: Peter Zijlstra (Intel) 
> ---
>  kernel/sched/fair.c |   15 ++-
>  kernel/sched/features.h |3 +++
>  2 files changed, 17 insertions(+), 1 deletion(-)
> 
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -680,7 +680,16 @@ static u64 __sched_period(unsigned long
>   */
>  static u64 sched_slice(struct cfs_rq *cfs_rq, struct sched_entity *se)
>  {
> - u64 slice = __sched_period(cfs_rq->nr_running + !se->on_rq);
> + unsigned int nr_running = cfs_rq->nr_running;
> + u64 slice;
> +
> + if (sched_feat(ALT_PERIOD))
> + nr_running = rq_of(cfs_rq)->cfs.h_nr_running;
> +
> + slice = __sched_period(nr_running + !se->on_rq);
> +
> + if (sched_feat(BASE_SLICE))
> + slice -= sysctl_sched_min_granularity;
>  
>   for_each_sched_entity(se) {
>   struct load_weight *load;
> @@ -697,6 +706,10 @@ static u64 sched_slice(struct cfs_rq *cf
>   }
>   slice = __calc_delta(slice, se->load.weight, load);
>   }
> +
> + if (sched_feat(BASE_SLICE))
> + slice += sysctl_sched_min_granularity;
> +
>   return slice;
>  }
>  
> --- a/kernel/sched/features.h
> +++ b/kernel/sched/features.h
> @@ -90,3 +90,6 @@ SCHED_FEAT(WA_BIAS, true)
>   */
>  SCHED_FEAT(UTIL_EST, true)
>  SCHED_FEAT(UTIL_EST_FASTUP, true)
> +
> +SCHED_FEAT(ALT_PERIOD, true)
> +SCHED_FEAT(BASE_SLICE, true)
> 
> 
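To make the BASE_SLICE part concrete, here is a userspace sketch with the
default (unscaled) tunables, a single-level hierarchy and __calc_delta()
approximated by a plain weight/load scaling; all numbers are illustrative:

#include <stdio.h>

/* Default (unscaled) tunables, in ns, assumed for this sketch */
#define SCHED_LATENCY_NS        6000000ULL     /* sysctl_sched_latency */
#define SCHED_MIN_GRAN_NS        750000ULL     /* sysctl_sched_min_granularity */
#define SCHED_NR_LATENCY        8

static unsigned long long sched_period(unsigned long nr_running)
{
        if (nr_running > SCHED_NR_LATENCY)
                return nr_running * SCHED_MIN_GRAN_NS;
        return SCHED_LATENCY_NS;
}

/* __calc_delta(slice, weight, load) approximated as slice * weight / load */
static unsigned long long calc_delta(unsigned long long slice,
                                     unsigned long weight, unsigned long load)
{
        return slice * weight / load;
}

static unsigned long long sched_slice(unsigned long nr_running,
                                      unsigned long weight,
                                      unsigned long cfs_rq_load,
                                      int base_slice)
{
        unsigned long long slice = sched_period(nr_running);

        if (base_slice)
                slice -= SCHED_MIN_GRAN_NS;

        /* Single-level hierarchy: one entity with 'weight' on a cfs_rq
         * whose total load is 'cfs_rq_load'. */
        slice = calc_delta(slice, weight, cfs_rq_load);

        if (base_slice)
                slice += SCHED_MIN_GRAN_NS;

        return slice;
}

int main(void)
{
        /* An entity with a tiny weight (2) next to a nice-0 weight (1024):
         * without BASE_SLICE the slice collapses far below min_granularity,
         * with BASE_SLICE it is floored at it. */
        printf("no BASE_SLICE: %llu ns\n", sched_slice(4, 2, 1026, 0));
        printf("   BASE_SLICE: %llu ns\n", sched_slice(4, 2, 1026, 1));

        return 0;
}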



Re: [PATCH v3 2/7] sched/fair: Clean up active balance nr_balance_failed trickery

2021-03-17 Thread Dietmar Eggemann
On 11/03/2021 13:05, Valentin Schneider wrote:

[...]

> @@ -9952,7 +9954,8 @@ static int active_load_balance_cpu_stop(void *data)
>* @dst_grpmask we need to make that test go away with 
> lying
>* about DST_PINNED.
>*/
> - .flags  = LBF_DST_PINNED,
> + .flags  = LBF_DST_PINNED |
> +   LBF_ACTIVE_LB,

Since you now have a dedicated LBF flag for active balancing you could remove 
the
LBF_DST_PINNED here and adapt the if condition in can_migrate_task():

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f50a902bdf24..9f7feb512016 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7583,10 +7583,13 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
                 * meet load balance goals by pulling other tasks on src_cpu.
                 *
                 * Avoid computing new_dst_cpu for NEWLY_IDLE or if we have
-                * already computed one in current iteration.
+                * already computed one in current iteration or if it is an
+                * active balance.
                 */
-               if (env->idle == CPU_NEWLY_IDLE || (env->flags & LBF_DST_PINNED))
+               if (env->idle == CPU_NEWLY_IDLE ||
+                   (env->flags & (LBF_DST_PINNED | LBF_ACTIVE_LB))) {
                        return 0;
+               }
 
                /* Prevent to re-select dst_cpu via env's CPUs: */
                for_each_cpu_and(cpu, env->dst_grpmask, env->cpus) {
@@ -9948,14 +9951,7 @@ static int active_load_balance_cpu_stop(void *data)
                .src_cpu        = busiest_rq->cpu,
                .src_rq         = busiest_rq,
                .idle           = CPU_IDLE,
-               /*
-                * can_migrate_task() doesn't need to compute new_dst_cpu
-                * for active balancing. Since we have CPU_IDLE, but no
-                * @dst_grpmask we need to make that test go away with lying
-                * about DST_PINNED.
-                */
-               .flags          = LBF_DST_PINNED |
-                                 LBF_ACTIVE_LB,
+               .flags          = LBF_ACTIVE_LB,
        };


Re: [PATCH v3 1/7] sched/fair: Ignore percpu threads for imbalance pulls

2021-03-16 Thread Dietmar Eggemann
On 16/03/2021 18:31, Valentin Schneider wrote:
> On 16/03/21 16:49, Dietmar Eggemann wrote:
>> On 11/03/2021 13:05, Valentin Schneider wrote:
>>> From: Lingutla Chandrasekhar 
>>>
>>> In load balancing, when the balancing group is unable to pull a task
>>> from the busy group due to ->cpus_ptr constraints, it sets
>>> LBF_SOME_PINNED in the lb env flags. As a consequence, sgc->imbalance
>>> is set at its parent domain level, which makes the group classified
>>> as imbalanced in order to get help from another balancing cpu.
>>>
>>> Consider a 4-CPU big.LITTLE system with CPUs 0-1 as LITTLEs and
>>
>> Does it have to be a big.LITTLE system? I assume this issue also happens
>> on an SMP system.
>>
> 
> Aye, though the consequences are "worse" on asym CPU capacity systems.

I can only think of the higher group_type 'group_imbalanced' eclipsing
'group_misfit_task' here?

> 
>>> CPUs 2-3 as Bigs with below scenario:
>>> - CPU0 doing newly_idle balancing
>>> - CPU1 running percpu kworker and RT task (small tasks)
>>
>> What's the role of the small RT task here in the story?
>>
> 
> I don't think it matters much here.

Chandra already mentioned that this is part of the story, namely to
start trying to move task on lb MC CPU1->CPU0 (if (busiest->nr_running >
1)).

[...]

>> This sentence mentioned per-cpu threads (and so does the patch name) but
>> the implementation (only) deals with per-cpu kernel threads. IMHO, it
>> would be good to align this.
>>
> 
> Tell you what, I'll go for:
> 1) how can pcpu kthreads cause LBF_SOME_PINNED
> 2) why we may not want this, but still ignore !kthread pcpu tasks
> 3) why this is even more important for big.LITTLE

LGTM.



Re: [PATCH v3 1/7] sched/fair: Ignore percpu threads for imbalance pulls

2021-03-16 Thread Dietmar Eggemann
Hi Chandra,

On 16/03/2021 17:03, Chandra Sekhar Lingutla wrote:
> Hi Dietmar,
> 
> On 3/16/2021 9:19 PM, Dietmar Eggemann wrote:
>> On 11/03/2021 13:05, Valentin Schneider wrote:
>>> From: Lingutla Chandrasekhar 

[...]

>>> CPUs 2-3 as Bigs with below scenario:
>>> - CPU0 doing newly_idle balancing
>>> - CPU1 running percpu kworker and RT task (small tasks)
>> What's the role of the small RT task here in the story?
> This is to satisfy 'busiest->nr_running > 1' checks.

Ah, I see. Forgot about this bit of the story, the 'if
(busiest->nr_running > 1)' in load_balance().

[...]


Re: [PATCH v3 1/7] sched/fair: Ignore percpu threads for imbalance pulls

2021-03-16 Thread Dietmar Eggemann
On 11/03/2021 13:05, Valentin Schneider wrote:
> From: Lingutla Chandrasekhar 
> 
> In load balancing, when the balancing group is unable to pull a task
> from the busy group due to ->cpus_ptr constraints, it sets
> LBF_SOME_PINNED in the lb env flags. As a consequence, sgc->imbalance
> is set at its parent domain level, which makes the group classified
> as imbalanced in order to get help from another balancing cpu.
> 
> Consider a 4-CPU big.LITTLE system with CPUs 0-1 as LITTLEs and

Does it have to be a big.LITTLE system? I assume this issue also happens
on an SMP system.

> CPUs 2-3 as Bigs with below scenario:
> - CPU0 doing newly_idle balancing
> - CPU1 running percpu kworker and RT task (small tasks)

What's the role of the small RT task here in the story?

> - CPU2 running 2 big tasks
> - CPU3 running 1 medium task
> 
> While CPU0 is doing newly_idle load balance at MC level, it fails to
> pull the percpu kworker from CPU1, sets LBF_SOME_PINNED in the lb env
> flags and sets sgc->imbalance at the DIE level domain. As LBF_ALL_PINNED
> is not cleared, it tries to redo the balancing by clearing CPU1 in the
> env cpus, but it doesn't find another busiest_group, so CPU0 stops
> balancing at MC level without clearing 'sgc->imbalance' and restarts the
> load balancing at DIE level.
> 
> There, CPU0 (the balancing cpu) finds the LITTLEs' group as busiest_group
> with group type imbalance, and the Bigs, classified with a group type
> below imbalance, are ignored when picking the busiest group, so the
> balancing is aborted without pulling any tasks (by that time, CPU1 might
> not have any running tasks).
> 
> It is a suboptimal decision to classify the group as imbalance due to
> percpu threads. So don't use LBF_SOME_PINNED for per cpu threads.

This sentence mentioned per-cpu threads (and so does the patch name) but
the implementation (only) deals with per-cpu kernel threads. IMHO, it
would be good to align this.

> 
> Signed-off-by: Lingutla Chandrasekhar 
> [Use kthread_is_per_cpu() rather than p->nr_cpus_allowed]
> Signed-off-by: Valentin Schneider 
> ---
>  kernel/sched/fair.c | 4 
>  1 file changed, 4 insertions(+)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 2e2ab1e00ef9..83aea97fbf22 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -7565,6 +7565,10 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
>   if (throttled_lb_pair(task_group(p), env->src_cpu, env->dst_cpu))
>   return 0;
>  
> + /* Disregard pcpu kthreads; they are where they need to be. */
> + if ((p->flags & PF_KTHREAD) && kthread_is_per_cpu(p))
> + return 0;
> +
>   if (!cpumask_test_cpu(env->dst_cpu, p->cpus_ptr)) {
>   int cpu;
>  
> 



Re: [RFC PATCH] sched: Simplify leaf_cfs_rq_list maintenance

2021-03-08 Thread Dietmar Eggemann
Hi Michal,

On 25/02/2021 17:27, Michal Koutný wrote:
> leaf_cfs_rq_list should contain cfs_rqs that have runnable entities in
> them.  When we're operating under a throttled hierarchy we always update
> the leaf_cfs_rq_list in order not to break the list_add_leaf_cfs_rq invariant
> of adding complete branches.

So this patch replaces in enqueue_entity() the unconditional
list_add_leaf_cfs_rq(cfs_rq) in case cfs->nr_running > 1
(parent had been throttled) 

- if (cfs_rq->nr_running == 1 || cfs_bandwidth_used())
  ^^^
with

+ if (cfs_rq->nr_running == 1 && !throttled_hierarchy(cfs_rq))
  ^^^

and removes the leaf_cfs_rq_list maintenance code after the
xxx_throttle label in enqueue_task_fair() and unthrottle_cfs_rq
from commit f6783319737f ("sched/fair: Fix insertion in
rq->leaf_cfs_rq_list") and fe61468b2cb ("sched/fair: Fix
enqueue_task_fair warning").

> This caused troubles when an entity became runnable (enqueue_entity)
> under a throttled hierarchy (see commit b34cb07dde7c ("sched/fair: Fix
> enqueue_task_fair() warning some more")). While it proved well, this
> new change ignores updating leaf_cfs_rq_list when we're operating under
> the throttled hierarchy and defers the leaf_cfs_rq_list update to the
> point when whole hiearchy is unthrottled (tg_unthrottle_up).

IMHO, f6783319737f gives the explanation why throttled cfs_rq's have to
be added to rq->leaf_cfs_rq_list.

IIRC, fe61468b2cb was fixing a use case in which a cfs-rq with
on_list=0 and nr_running > 1 within the cgroup hierarchy wasn't
added back to rq->leaf_cfs_rq_list:

https://lkml.kernel.org/r/15252de5-9a2d-19ae-607a-594ee88d1...@de.ibm.com

In enqueue_task_fair() just before the assert_list_leaf_cfs_rq(rq),
iterate through the se hierarchy of p=[CPU 2/KVM 2662] in case the
assert will hit:

...
CPU23 
path=/machine.slice/machine-test.slice/machine-qemu\x2d18\x2dtest10.scope/vcpuX 
on_list=1 nr_running=1 throttled=0 p=[CPU 2/KVM 2662]
CPU23 path=/machine.slice/machine-test.slice/machine-qemu\x2d18\x2dtest10.scope 
  on_list=0 nr_running=3 throttled=0 p=[CPU 2/KVM 2662]

  ^ 
CPU23 path=/machine.slice/machine-test.slice
  on_list=1 nr_running=1 throttled=1 p=[CPU 2/KVM 2662]
CPU23 path=/machine.slice   
  on_list=1 nr_running=0 throttled=0 p=[CPU 2/KVM 2662]
CPU23 path=/
  on_list=1 nr_running=1 throttled=0 p=[CPU 2/KVM 2662]
...

> The code is now simpler and leaf_cfs_rq_list contains only cfs_rqs that
> are truly runnable.
> 
> Why is this RFC?
> - Primarily, I'm not sure I interpreted the purpose of leaf_cfs_rq_list
>   right. The removal of throttled cfs_rqs from it would exclude them
>   from __update_blocked_fair() calculation and I can't see past it now.

The leaf_cfs_rq_list should contain all cfs_rqs with
cfs_rq->avg.load/runnable/util_avg != 0 so that in case there are no
runnable entities on them anymore this (blocked) load 
cfs_rq->avg.xxx_avg can be decayed. IMHO, the "leaf_" is misleading
here since it can also contain non-leaf cfs_rqs.

>   If it's wrong assumption, I'd like this to help clarify what the
>   proper definition of leaf_cfs_rq_list would be.
> - Additionally, I didn't check thoroughly for corner cases when
>   se->on_rq => cfs_rq_of(se)->on_list wouldn't hold, so the patch
>   certainly isn't finished.

[...]


Re: [PATCH] sched/fair: Prefer idle CPU to cache affinity

2021-03-02 Thread Dietmar Eggemann
On 26/02/2021 17:40, Srikar Dronamraju wrote:

[...]

> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 8a8bd7b13634..d49bfcdc4a19 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5869,6 +5869,36 @@ wake_affine_weight(struct sched_domain *sd, struct task_struct *p,
>   return this_eff_load < prev_eff_load ? this_cpu : nr_cpumask_bits;
>  }
> 
> +static int prefer_idler_llc(int this_cpu, int prev_cpu, int sync)
> +{
> + struct sched_domain_shared *tsds, *psds;
> + int pnr_busy, pllc_size, tnr_busy, tllc_size, diff;
> +
> + tsds = rcu_dereference(per_cpu(sd_llc_shared, this_cpu));
> + tnr_busy = atomic_read(&tsds->nr_busy_cpus);
> + tllc_size = per_cpu(sd_llc_size, this_cpu);
> +
> + psds = rcu_dereference(per_cpu(sd_llc_shared, prev_cpu));
> + pnr_busy = atomic_read(&psds->nr_busy_cpus);
> + pllc_size = per_cpu(sd_llc_size, prev_cpu);
> +
> + /* No need to compare, if both LLCs are fully loaded */
> + if (pnr_busy == pllc_size && tnr_busy == pllc_size)

 ^
   shouldn't this be tllc_size ?

> + return nr_cpumask_bits;
> +
> + if (sched_feat(WA_WAKER) && tnr_busy < tllc_size)
> + return this_cpu;
> +
> + /* For better wakeup latency, prefer idler LLC to cache affinity */
> + diff = tnr_busy * pllc_size - sync - pnr_busy * tllc_size;
> + if (!diff)
> + return nr_cpumask_bits;
> + if (diff < 0)
> + return this_cpu;
> +
> + return prev_cpu;
> +}
> +
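For reference, the diff line compares the busy ratios tnr_busy/tllc_size and
pnr_busy/pllc_size by cross-multiplying (avoiding a division); a standalone
sketch with made-up numbers, leaving the suspected tllc_size/pllc_size
mix-up aside:

#include <stdio.h>

/*
 * Returns 1 if this LLC looks idler, -1 if prev LLC looks idler, 0 on a
 * tie (where the real code falls back to nr_cpumask_bits).
 */
static int idler_llc(int tnr_busy, int tllc_size,
                     int pnr_busy, int pllc_size, int sync)
{
        /* tnr_busy/tllc_size vs pnr_busy/pllc_size, done without division */
        int diff = tnr_busy * pllc_size - sync - pnr_busy * tllc_size;

        if (!diff)
                return 0;

        return diff < 0 ? 1 : -1;
}

int main(void)
{
        /* this LLC: 3 of 8 CPUs busy, prev LLC: 6 of 8 CPUs busy */
        printf("%d\n", idler_llc(3, 8, 6, 8, 0));       /* 1: this LLC */
        /* this LLC: 7 of 8 CPUs busy, prev LLC: 2 of 8 CPUs busy */
        printf("%d\n", idler_llc(7, 8, 2, 8, 0));       /* -1: prev LLC */

        return 0;
}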


Re: [PATCH v2] sched/pelt: Fix task util_est update filtering

2021-03-01 Thread Dietmar Eggemann
On 26/02/2021 09:41, Peter Zijlstra wrote:
> On Thu, Feb 25, 2021 at 04:58:20PM +, Vincent Donnefort wrote:
>> +#define UTIL_EST_MARGIN (SCHED_CAPACITY_SCALE / 100)
>> +
>>  /*
>> - * Check if a (signed) value is within a specified (unsigned) margin,
>> + * Check if a (signed) value is within the (unsigned) util_est margin,
>>   * based on the observation that:
>>   *
>>   * abs(x) < y := (unsigned)(x + y - 1) < (2 * y - 1)
>>   *
>> - * NOTE: this only works when value + maring < INT_MAX.
>> + * NOTE: this only works when value + UTIL_EST_MARGIN < INT_MAX.
>>   */
>> -static inline bool within_margin(int value, int margin)
>> +static inline bool util_est_within_margin(int value)
>>  {
>> -return ((unsigned int)(value + margin - 1) < (2 * margin - 1));
>> +return ((unsigned int)(value + UTIL_EST_MARGIN - 1) <
>> +(2 * UTIL_EST_MARGIN - 1));
>>  }
> 
>> -if (within_margin(last_ewma_diff, (SCHED_CAPACITY_SCALE / 100)))
>> +if (util_est_within_margin(last_ewma_diff)) {
> 
> What was the purpose of this change? What was a generic helper is now
> super specific.

I guess because it was only ever used in util_est for last_ewma_diff.

It's now used for last_ewma_diff and last_enqueued_diff, still only for
util_est though and both times with the same margin
(SCHED_CAPACITY_SCALE / 100)).
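The single-comparison trick itself is easy to verify standalone:

#include <stdio.h>
#include <stdlib.h>

#define UTIL_EST_MARGIN (1024 / 100)    /* SCHED_CAPACITY_SCALE / 100 */

/* abs(x) < y  <=>  (unsigned)(x + y - 1) < (2 * y - 1), for x + y < INT_MAX */
static int within_margin(int value, int margin)
{
        return (unsigned int)(value + margin - 1) < (unsigned int)(2 * margin - 1);
}

int main(void)
{
        for (int x = -12; x <= 12; x += 3)
                printf("x=%4d  abs(x)<%d: %d  trick: %d\n",
                       x, UTIL_EST_MARGIN,
                       abs(x) < UTIL_EST_MARGIN,
                       within_margin(x, UTIL_EST_MARGIN));

        return 0;
}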

Vincent D. should be back on Wed from hols.


Re: [PATCH v2 2/2] sched/fair: use lsub_positive in cpu_util_next()

2021-02-25 Thread Dietmar Eggemann
On 25/02/2021 09:36, vincent.donnef...@arm.com wrote:
> From: Vincent Donnefort 
> 
> The sub_positive local version is saving an explicit load-store and is
> enough for the cpu_util_next() usage.
> 
> Signed-off-by: Vincent Donnefort 
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 146ac9fec4b6..1364f8b95214 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6525,7 +6525,7 @@ static unsigned long cpu_util_next(int cpu, struct task_struct *p, int dst_cpu)
>* util_avg should already be correct.
>*/
>   if (task_cpu(p) == cpu && dst_cpu != cpu)
> - sub_positive(&util, task_util(p));
> + lsub_positive(&util, task_util(p));
>   else if (task_cpu(p) != cpu && dst_cpu == cpu)
>   util += task_util(p);

Reviewed-by: Dietmar Eggemann 



Re: [PATCH v2 1/2] sched/fair: Fix task utilization accountability in compute_energy()

2021-02-25 Thread Dietmar Eggemann
On 25/02/2021 09:36, vincent.donnef...@arm.com wrote:
> From: Vincent Donnefort 

[...]

> cpu_util_next() estimates the CPU utilization that would happen if the
> task was placed on dst_cpu as follows:
> 
>   max(cpu_util + task_util, cpu_util_est + _task_util_est)
> 
> The task contribution to the energy delta can then be either:
> 
>   (1) _task_util_est, on a mostly idle CPU, where cpu_util is close to 0
>   and _task_util_est > cpu_util.
>   (2) task_util, on a mostly busy CPU, where cpu_util > _task_util_est.
> 
>   (cpu_util_est doesn't appear here. It is 0 when a CPU is idle and
>otherwise must be small enough so that feec() takes the CPU as a
>potential target for the task placement)

I still don't quite get the reasoning for (2) why task_util is used as
task contribution.

So we use 'cpu_util + task_util' instead of 'cpu_util_est +
_task_util_est' in (2).

I.e. since _task_util_est is always >= task_util during wakeup, cpu_util
must be > cpu_util_est (by more than _task_util_est - task_util).

I can see it for a CPU whose cpu_util has a fair amount of contributions
from blocked tasks which cpu_util_est wouldn't have.
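A made-up numerical example of the cpu_util_next() expression above and the
util_running expression introduced further down in the patch, with values
chosen so that blocked-task contribution dominates cpu_util while the waking
task's util_est exceeds its util_avg:

#include <stdio.h>

static unsigned long max_ul(unsigned long a, unsigned long b)
{
        return a > b ? a : b;
}

int main(void)
{
        /* Illustrative values only */
        unsigned long cpu_util = 400, cpu_util_est = 150;
        unsigned long task_util = 50, task_util_est = 200;

        /* What cpu_util_next() computes for dst_cpu == cpu */
        unsigned long next = max_ul(cpu_util + task_util,
                                    cpu_util_est + task_util_est);

        /* What the patch uses as the running-time (energy) contribution */
        unsigned long running = max_ul(cpu_util, cpu_util_est) +
                                max_ul(task_util, task_util_est);

        printf("cpu_util_next: %lu, util_running: %lu\n", next, running);

        return 0;
}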

[...]

> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 7043bb0f2621..146ac9fec4b6 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6573,8 +6573,24 @@ compute_energy(struct task_struct *p, int dst_cpu, struct perf_domain *pd)
>* its pd list and will not be accounted by compute_energy().
>*/
>   for_each_cpu_and(cpu, pd_mask, cpu_online_mask) {
> - unsigned long cpu_util, util_cfs = cpu_util_next(cpu, p, dst_cpu);
> - struct task_struct *tsk = cpu == dst_cpu ? p : NULL;
> + unsigned long util_freq = cpu_util_next(cpu, p, dst_cpu);
> + unsigned long cpu_util, util_running = util_freq;
> + struct task_struct *tsk = NULL;
> +
> + /*
> +  * When @p is placed on @cpu:
> +  *
> +  * util_running = max(cpu_util, cpu_util_est) +
> +  *max(task_util, _task_util_est)
> +  *
> +  * while cpu_util_next is: max(cpu_util + task_util,
> +  * cpu_util_est + _task_util_est)
> +  */

Nit pick:

s/on @cpu/on @dst_cpu ?

s/while cpu_util_next is/while cpu_util_next(cpu, p, cpu) would be

If dst_cpu != cpu (including dst_cpu == -1) task_util and _task_util_est
are not added to util resp. util_est.

Not sure if this is clear from the source code here?

[...]

Reviewed-by: Dietmar Eggemann 


Re: [PATCH] sched/fair: Fix task utilization accountability in cpu_util_next()

2021-02-23 Thread Dietmar Eggemann
On 22/02/2021 17:23, Quentin Perret wrote:
> On Monday 22 Feb 2021 at 15:58:56 (+), Quentin Perret wrote:
>> But in any case, if we're going to address this, I'm still not sure this
>> patch will be what we want. As per my first comment we need to keep the
>> frequency estimation right.
> 
> Totally untested, but I think in principle you would like something like
> the snippet below. Would that work?
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 04a3ce20da67..6594d875c6ac 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6534,8 +6534,13 @@ compute_energy(struct task_struct *p, int dst_cpu, struct perf_domain *pd)
>  * its pd list and will not be accounted by compute_energy().
>  */
> for_each_cpu_and(cpu, pd_mask, cpu_online_mask) {
> -   unsigned long cpu_util, util_cfs = cpu_util_next(cpu, p, dst_cpu);
> +   unsigned long util_freq = cpu_util_next(cpu, p, dst_cpu);
> +   unsigned long util_running = cpu_util_without(cpu, p);

Wouldn't this be the same as:

 unsigned long util_running = cpu_util_next(cpu, p, -1);

except some different handling of !last_update_time and
'task_on_rq_queued(p) || current == p)' in cpu_util_without() which
shouldn't happen in EAS.

We have quite a lot of util related functions!

[...]


Re: [PATCH] sched/fair: Fix task utilization accountability in cpu_util_next()

2021-02-23 Thread Dietmar Eggemann
On 22/02/2021 10:54, vincent.donnef...@arm.com wrote:
> From: Vincent Donnefort 

[...]

> Also, replace sub_positive with lsub_positive which saves one explicit
> load-store.

IMHO, in case you're going to fix this now in compute_energy(), this
optimization could still be done. Maybe in an extra patch?
cpu_util_without() is using lsub_positive() to remove task util from cpu
util as well.

[...]


Re: [PATCH] sched/pelt: Fix task util_est update filtering

2021-02-19 Thread Dietmar Eggemann
On 16/02/2021 17:39, vincent.donnef...@arm.com wrote:
> From: Vincent Donnefort 
> 
> Being called for each dequeue, util_est reduces the number of its updates
> by filtering out when the EWMA signal is different from the task util_avg
> by less than 1%. It is a problem for a sudden util_avg ramp-up. Due to the
> decay from a previous high util_avg, EWMA might now be close enough to
> the new util_avg. No update would then happen while it would leave
> ue.enqueued with an out-of-date value.

(1) enqueued[x-1] < ewma[x-1]

(2) diff(enqueued[x], ewma[x]) < 1024/100 && enqueued[x] < ewma[x] (*)

with ewma[x-1] == ewma[x]

(*) enqueued[x] must still be less than ewma[x] w/ default
UTIL_EST_FASTUP. Otherwise we would already 'goto done' (write the new
util_est) via the previous if condition.

> 
> Taking into consideration the two util_est members, EWMA and enqueued for
> the filtering, ensures, for both, an up-to-date value.
> 
> This is for now an issue only for the trace probe that might return the
> stale value. Functional-wise, it isn't (yet) a problem, as the value is
> always accessed through max(enqueued, ewma).

Yeah, I remember that the ue.enqueued plots looked weird in these
sections with stale ue.enqueued values.

> This problem has been observed using LISA's UtilConvergence:test_means on
> the sd845c board.

I ran the test a couple of times on my juno board and I never hit this
path (util_est_within_margin(last_ewma_diff) &&
!util_est_within_margin(last_enqueued_diff)) for a test task.

I can't see how this issue can be board specific? Does it happen
reliably on sd845c or is it just that it happens very, very occasionally?

I saw it a couple of times but always with (non-test) tasks migrating
from one CPU to another.

> Signed-off-by: Vincent Donnefort 

Reviewed-by: Dietmar Eggemann 

[...]


[tip: sched/core] sched: Remove USER_PRIO, TASK_USER_PRIO and MAX_USER_PRIO

2021-02-17 Thread tip-bot2 for Dietmar Eggemann
The following commit has been merged into the sched/core branch of tip:

Commit-ID: 9d061ba6bc170045857f3efe0bba5def30188d4d
Gitweb:
https://git.kernel.org/tip/9d061ba6bc170045857f3efe0bba5def30188d4d
Author:Dietmar Eggemann 
AuthorDate:Thu, 28 Jan 2021 14:10:39 +01:00
Committer: Ingo Molnar 
CommitterDate: Wed, 17 Feb 2021 14:08:17 +01:00

sched: Remove USER_PRIO, TASK_USER_PRIO and MAX_USER_PRIO

The only remaining use of MAX_USER_PRIO (and USER_PRIO) is the
SCALE_PRIO() definition in the PowerPC Cell architecture's Synergistic
Processor Unit (SPU) scheduler. TASK_USER_PRIO isn't used anymore.

Commit fe443ef2ac42 ("[POWERPC] spusched: Dynamic timeslicing for
SCHED_OTHER") copied SCALE_PRIO() from the task scheduler in v2.6.23.

Commit a4ec24b48dde ("sched: tidy up SCHED_RR") removed it from the task
scheduler in v2.6.24.

Commit 3ee237dddcd8 ("sched/prio: Add 3 macros of MAX_NICE, MIN_NICE and
NICE_WIDTH in prio.h") introduced NICE_WIDTH much later.

With:

  MAX_USER_PRIO = USER_PRIO(MAX_PRIO)
                = MAX_PRIO - MAX_RT_PRIO

       MAX_PRIO = MAX_RT_PRIO + NICE_WIDTH

  MAX_USER_PRIO = MAX_RT_PRIO + NICE_WIDTH - MAX_RT_PRIO
  MAX_USER_PRIO = NICE_WIDTH

MAX_USER_PRIO can be replaced by NICE_WIDTH to be able to remove all the
{*_}USER_PRIO defines.

Signed-off-by: Dietmar Eggemann 
Signed-off-by: Peter Zijlstra (Intel) 
Signed-off-by: Ingo Molnar 
Link: https://lkml.kernel.org/r/20210128131040.296856-3-dietmar.eggem...@arm.com
---
 arch/powerpc/platforms/cell/spufs/sched.c |  2 +-
 include/linux/sched/prio.h|  9 -
 kernel/sched/sched.h  |  2 +-
 3 files changed, 2 insertions(+), 11 deletions(-)

diff --git a/arch/powerpc/platforms/cell/spufs/sched.c b/arch/powerpc/platforms/cell/spufs/sched.c
index f18d506..aeb7f39 100644
--- a/arch/powerpc/platforms/cell/spufs/sched.c
+++ b/arch/powerpc/platforms/cell/spufs/sched.c
@@ -72,7 +72,7 @@ static struct timer_list spuloadavg_timer;
 #define DEF_SPU_TIMESLICE  (100 * HZ / (1000 * SPUSCHED_TICK))
 
 #define SCALE_PRIO(x, prio) \
-   max(x * (MAX_PRIO - prio) / (MAX_USER_PRIO / 2), MIN_SPU_TIMESLICE)
+   max(x * (MAX_PRIO - prio) / (NICE_WIDTH / 2), MIN_SPU_TIMESLICE)
 
 /*
  * scale user-nice values [ -20 ... 0 ... 19 ] to time slice values:
diff --git a/include/linux/sched/prio.h b/include/linux/sched/prio.h
index d111f2f..ab83d85 100644
--- a/include/linux/sched/prio.h
+++ b/include/linux/sched/prio.h
@@ -27,15 +27,6 @@
 #define PRIO_TO_NICE(prio) ((prio) - DEFAULT_PRIO)
 
 /*
- * 'User priority' is the nice value converted to something we
- * can work with better when scaling various scheduler parameters,
- * it's a [ 0 ... 39 ] range.
- */
-#define USER_PRIO(p)   ((p)-MAX_RT_PRIO)
-#define TASK_USER_PRIO(p)  USER_PRIO((p)->static_prio)
-#define MAX_USER_PRIO  (USER_PRIO(MAX_PRIO))
-
-/*
  * Convert nice value [19,-20] to rlimit style value [1,40].
  */
 static inline long nice_to_rlimit(long nice)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index f519aba..2185b3b 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -140,7 +140,7 @@ extern void call_trace_sched_update_nr_running(struct rq 
*rq, int count);
  * scale_load() and scale_load_down(w) to convert between them. The
  * following must be true:
  *
- *  scale_load(sched_prio_to_weight[USER_PRIO(NICE_TO_PRIO(0))]) == NICE_0_LOAD
+ *  scale_load(sched_prio_to_weight[NICE_TO_PRIO(0)-MAX_RT_PRIO]) == 
NICE_0_LOAD
  *
  */
 #define NICE_0_LOAD		(1L << NICE_0_LOAD_SHIFT)
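
For reference (an illustrative check, not part of the commit message or
diff above), plugging in the current values makes the identity concrete:

	MAX_RT_PRIO   = 100, NICE_WIDTH = MAX_NICE - MIN_NICE + 1 = 40
	MAX_PRIO      = MAX_RT_PRIO + NICE_WIDTH = 140
	MAX_USER_PRIO = MAX_PRIO - MAX_RT_PRIO   = 40 = NICE_WIDTH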


[tip: sched/core] sched/topology: Fix sched_domain_topology_level alloc in sched_init_numa()

2021-02-17 Thread tip-bot2 for Dietmar Eggemann
The following commit has been merged into the sched/core branch of tip:

Commit-ID: 71e5f6644fb2f3304fcb310145ded234a37e7cc1
Gitweb:
https://git.kernel.org/tip/71e5f6644fb2f3304fcb310145ded234a37e7cc1
Author:Dietmar Eggemann 
AuthorDate:Mon, 01 Feb 2021 10:53:53 +01:00
Committer: Ingo Molnar 
CommitterDate: Wed, 17 Feb 2021 14:08:05 +01:00

sched/topology: Fix sched_domain_topology_level alloc in sched_init_numa()

Commit "sched/topology: Make sched_init_numa() use a set for the
deduplicating sort" allocates 'i + nr_levels (level)' instead of
'i + nr_levels + 1' sched_domain_topology_level.

This led to an Oops (on Arm64 juno with CONFIG_SCHED_DEBUG):

sched_init_domains
  build_sched_domains()
__free_domain_allocs()
  __sdt_free() {
...
for_each_sd_topology(tl)
  ...
  sd = *per_cpu_ptr(sdd->sd, j); <--
  ...
  }

Signed-off-by: Dietmar Eggemann 
Signed-off-by: Peter Zijlstra (Intel) 
Signed-off-by: Ingo Molnar 
Tested-by: Vincent Guittot 
Tested-by: Barry Song 
Link: https://lkml.kernel.org/r/6000e39e-7d28-c360-9cd6-8798fd22a...@arm.com
---
 kernel/sched/topology.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index bf5c9bd..09d3504 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1702,7 +1702,7 @@ void sched_init_numa(void)
/* Compute default topology size */
for (i = 0; sched_domain_topology[i].mask; i++);
 
-   tl = kzalloc((i + nr_levels) *
+   tl = kzalloc((i + nr_levels + 1) *
sizeof(struct sched_domain_topology_level), GFP_KERNEL);
if (!tl)
return;


[tip: sched/core] sched: Remove MAX_USER_RT_PRIO

2021-02-17 Thread tip-bot2 for Dietmar Eggemann
The following commit has been merged into the sched/core branch of tip:

Commit-ID: ae18ad281e825993d190073d0ae2ea35dee27ee1
Gitweb:
https://git.kernel.org/tip/ae18ad281e825993d190073d0ae2ea35dee27ee1
Author:Dietmar Eggemann 
AuthorDate:Thu, 28 Jan 2021 14:10:38 +01:00
Committer: Ingo Molnar 
CommitterDate: Wed, 17 Feb 2021 14:08:11 +01:00

sched: Remove MAX_USER_RT_PRIO

Commit d46523ea32a7 ("[PATCH] fix MAX_USER_RT_PRIO and MAX_RT_PRIO")
was introduced due to a small time period in which the realtime patch
set was using different values for MAX_USER_RT_PRIO and MAX_RT_PRIO.

This is no longer true, i.e. now MAX_RT_PRIO == MAX_USER_RT_PRIO.

Get rid of MAX_USER_RT_PRIO and make everything use MAX_RT_PRIO
instead.

Signed-off-by: Dietmar Eggemann 
Signed-off-by: Peter Zijlstra (Intel) 
Signed-off-by: Ingo Molnar 
Link: https://lkml.kernel.org/r/20210128131040.296856-2-dietmar.eggem...@arm.com
---
 include/linux/sched/prio.h |  9 +
 kernel/sched/core.c|  7 +++
 2 files changed, 4 insertions(+), 12 deletions(-)

diff --git a/include/linux/sched/prio.h b/include/linux/sched/prio.h
index 7d64fea..d111f2f 100644
--- a/include/linux/sched/prio.h
+++ b/include/linux/sched/prio.h
@@ -11,16 +11,9 @@
  * priority is 0..MAX_RT_PRIO-1, and SCHED_NORMAL/SCHED_BATCH
  * tasks are in the range MAX_RT_PRIO..MAX_PRIO-1. Priority
  * values are inverted: lower p->prio value means higher priority.
- *
- * The MAX_USER_RT_PRIO value allows the actual maximum
- * RT priority to be separate from the value exported to
- * user-space.  This allows kernel threads to set their
- * priority to a value higher than any user task. Note:
- * MAX_RT_PRIO must not be smaller than MAX_USER_RT_PRIO.
  */
 
-#define MAX_USER_RT_PRIO   100
-#define MAX_RT_PRIO		MAX_USER_RT_PRIO
+#define MAX_RT_PRIO		100
 
 #define MAX_PRIO   (MAX_RT_PRIO + NICE_WIDTH)
 #define DEFAULT_PRIO   (MAX_RT_PRIO + NICE_WIDTH / 2)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 6c789dc..f0b0b67 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5911,11 +5911,10 @@ recheck:
 
/*
 * Valid priorities for SCHED_FIFO and SCHED_RR are
-* 1..MAX_USER_RT_PRIO-1, valid priority for SCHED_NORMAL,
+* 1..MAX_RT_PRIO-1, valid priority for SCHED_NORMAL,
 * SCHED_BATCH and SCHED_IDLE is 0.
 */
-   if ((p->mm && attr->sched_priority > MAX_USER_RT_PRIO-1) ||
-   (!p->mm && attr->sched_priority > MAX_RT_PRIO-1))
+   if (attr->sched_priority > MAX_RT_PRIO-1)
return -EINVAL;
if ((dl_policy(policy) && !__checkparam_dl(attr)) ||
(rt_policy(policy) != (attr->sched_priority != 0)))
@@ -6983,7 +6982,7 @@ SYSCALL_DEFINE1(sched_get_priority_max, int, policy)
switch (policy) {
case SCHED_FIFO:
case SCHED_RR:
-   ret = MAX_USER_RT_PRIO-1;
+   ret = MAX_RT_PRIO-1;
break;
case SCHED_DEADLINE:
case SCHED_NORMAL:


[tip: sched/core] sched/core: Update task_prio() function header

2021-02-17 Thread tip-bot2 for Dietmar Eggemann
The following commit has been merged into the sched/core branch of tip:

Commit-ID: c541bb7835a306cdbbe8abbdf4e4df507e0ca27a
Gitweb:
https://git.kernel.org/tip/c541bb7835a306cdbbe8abbdf4e4df507e0ca27a
Author:Dietmar Eggemann 
AuthorDate:Thu, 28 Jan 2021 14:10:40 +01:00
Committer: Ingo Molnar 
CommitterDate: Wed, 17 Feb 2021 14:08:30 +01:00

sched/core: Update task_prio() function header

The description of the RT offset and the values for 'normal' tasks needs
update. Moreover there are DL tasks now.
task_prio() has to stay like it is to guarantee compatibility with the
/proc//stat priority field:

  # cat /proc//stat | awk '{ print $18; }'

Signed-off-by: Dietmar Eggemann 
Signed-off-by: Peter Zijlstra (Intel) 
Signed-off-by: Ingo Molnar 
Link: https://lkml.kernel.org/r/20210128131040.296856-4-dietmar.eggem...@arm.com
---
 kernel/sched/core.c | 8 ++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index f0b0b67..4afbdd2 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5616,8 +5616,12 @@ SYSCALL_DEFINE1(nice, int, increment)
  * @p: the task in question.
  *
  * Return: The priority value as seen by users in /proc.
- * RT tasks are offset by -200. Normal tasks are centered
- * around 0, value goes from -16 to +15.
+ *
+ * sched policy         return value   kernel prio    user prio/nice
+ *
+ * normal, batch, idle     [0 ... 39]  [100 ... 139]          0/[-20 ... 19]
+ * fifo, rr             [-2 ... -100]     [98 ... 0]  [1 ... 99]
+ * deadline                     -101             -1           0
  */
 int task_prio(const struct task_struct *p)
 {
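
As a worked example of the table above (using the standard priority
values, not part of the commit itself): a SCHED_NORMAL task at nice 0
has p->prio = DEFAULT_PRIO = 120, so task_prio() returns
120 - MAX_RT_PRIO = 20, which is the value reported in field 18 of
/proc/<pid>/stat.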


[tip: sched/core] sched/deadline: Reduce rq lock contention in dl_add_task_root_domain()

2021-02-17 Thread tip-bot2 for Dietmar Eggemann
The following commit has been merged into the sched/core branch of tip:

Commit-ID: de40f33e788b0c016bfde512ace2f76339ef7ddb
Gitweb:
https://git.kernel.org/tip/de40f33e788b0c016bfde512ace2f76339ef7ddb
Author:Dietmar Eggemann 
AuthorDate:Tue, 19 Jan 2021 09:35:42 +01:00
Committer: Ingo Molnar 
CommitterDate: Wed, 17 Feb 2021 14:12:42 +01:00

sched/deadline: Reduce rq lock contention in dl_add_task_root_domain()

dl_add_task_root_domain() is called during sched domain rebuild:

  rebuild_sched_domains_locked()
partition_and_rebuild_sched_domains()
  rebuild_root_domains()
 for all top_cpuset descendants:
   update_tasks_root_domain()
 for all tasks of cpuset:
   dl_add_task_root_domain()

Change it so that only the task pi lock is taken to check if the task
has a SCHED_DEADLINE (DL) policy. In case that p is a DL task take the
rq lock as well to be able to safely de-reference root domain's DL
bandwidth structure.

Most of the tasks will have another policy (namely SCHED_NORMAL) and
can now bail without taking the rq lock.

One thing to note here: Even in case that there aren't any DL user
tasks, a slow frequency switching system with cpufreq gov schedutil has
a DL task (sugov) per frequency domain running which participates in DL
bandwidth management.

Signed-off-by: Dietmar Eggemann 
Signed-off-by: Peter Zijlstra (Intel) 
Signed-off-by: Ingo Molnar 
Reviewed-by: Quentin Perret 
Reviewed-by: Valentin Schneider 
Reviewed-by: Daniel Bristot de Oliveira 
Acked-by: Juri Lelli 
Link: https://lkml.kernel.org/r/20210119083542.19856-1-dietmar.eggem...@arm.com
---
 kernel/sched/deadline.c | 11 +++
 1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 1508d12..6f37796 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -2388,9 +2388,13 @@ void dl_add_task_root_domain(struct task_struct *p)
struct rq *rq;
struct dl_bw *dl_b;
 
-   rq = task_rq_lock(p, &rf);
-   if (!dl_task(p))
-   goto unlock;
+   raw_spin_lock_irqsave(&p->pi_lock, rf.flags);
+   if (!dl_task(p)) {
+   raw_spin_unlock_irqrestore(&p->pi_lock, rf.flags);
+   return;
+   }
+
+   rq = __task_rq_lock(p, &rf);
 
	dl_b = &rq->rd->dl_bw;
	raw_spin_lock(&dl_b->lock);
@@ -2399,7 +2403,6 @@ void dl_add_task_root_domain(struct task_struct *p)
 
	raw_spin_unlock(&dl_b->lock);
 
-unlock:
	task_rq_unlock(rq, p, &rf);
 }
 


[tip: sched/core] sched/deadline: Reduce rq lock contention in dl_add_task_root_domain()

2021-02-10 Thread tip-bot2 for Dietmar Eggemann
The following commit has been merged into the sched/core branch of tip:

Commit-ID: 3096b6fe494b7b4e45d20cb77aa6b715a3efe344
Gitweb:
https://git.kernel.org/tip/3096b6fe494b7b4e45d20cb77aa6b715a3efe344
Author:Dietmar Eggemann 
AuthorDate:Tue, 19 Jan 2021 09:35:42 +01:00
Committer: Peter Zijlstra 
CommitterDate: Wed, 10 Feb 2021 14:44:48 +01:00

sched/deadline: Reduce rq lock contention in dl_add_task_root_domain()

dl_add_task_root_domain() is called during sched domain rebuild:

  rebuild_sched_domains_locked()
partition_and_rebuild_sched_domains()
  rebuild_root_domains()
 for all top_cpuset descendants:
   update_tasks_root_domain()
 for all tasks of cpuset:
   dl_add_task_root_domain()

Change it so that only the task pi lock is taken to check if the task
has a SCHED_DEADLINE (DL) policy. In case that p is a DL task take the
rq lock as well to be able to safely de-reference root domain's DL
bandwidth structure.

Most of the tasks will have another policy (namely SCHED_NORMAL) and
can now bail without taking the rq lock.

One thing to note here: Even in case that there aren't any DL user
tasks, a slow frequency switching system with cpufreq gov schedutil has
a DL task (sugov) per frequency domain running which participates in DL
bandwidth management.

Signed-off-by: Dietmar Eggemann 
Signed-off-by: Peter Zijlstra (Intel) 
Reviewed-by: Quentin Perret 
Reviewed-by: Valentin Schneider 
Reviewed-by: Daniel Bristot de Oliveira 
Acked-by: Juri Lelli 
Link: https://lkml.kernel.org/r/20210119083542.19856-1-dietmar.eggem...@arm.com
---
 kernel/sched/deadline.c | 11 +++
 1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 1508d12..6f37796 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -2388,9 +2388,13 @@ void dl_add_task_root_domain(struct task_struct *p)
struct rq *rq;
struct dl_bw *dl_b;
 
-   rq = task_rq_lock(p, &rf);
-   if (!dl_task(p))
-   goto unlock;
+   raw_spin_lock_irqsave(&p->pi_lock, rf.flags);
+   if (!dl_task(p)) {
+   raw_spin_unlock_irqrestore(&p->pi_lock, rf.flags);
+   return;
+   }
+
+   rq = __task_rq_lock(p, &rf);
 
	dl_b = &rq->rd->dl_bw;
	raw_spin_lock(&dl_b->lock);
@@ -2399,7 +2403,6 @@ void dl_add_task_root_domain(struct task_struct *p)
 
	raw_spin_unlock(&dl_b->lock);
 
-unlock:
	task_rq_unlock(rq, p, &rf);
 }
 


Re: [RFC PATCH 5/6] sched/fair: trigger the update of blocked load on newly idle cpu

2021-02-09 Thread Dietmar Eggemann
On 05/02/2021 12:48, Vincent Guittot wrote:
> Instead of waking up a random and already idle CPU, we can take advantage
> of this_cpu being about to enter idle to run the ILB and update the
> blocked load.
> 
> Signed-off-by: Vincent Guittot 
> ---
>  include/linux/sched/nohz.h |  2 ++
>  kernel/sched/fair.c| 11 ---
>  kernel/sched/idle.c|  6 ++
>  3 files changed, 16 insertions(+), 3 deletions(-)
> 
> diff --git a/include/linux/sched/nohz.h b/include/linux/sched/nohz.h
> index 6d67e9a5af6b..74cdc4e87310 100644
> --- a/include/linux/sched/nohz.h
> +++ b/include/linux/sched/nohz.h
> @@ -9,8 +9,10 @@
>  #if defined(CONFIG_SMP) && defined(CONFIG_NO_HZ_COMMON)
>  extern void nohz_balance_enter_idle(int cpu);
>  extern int get_nohz_timer_target(void);
> +extern void nohz_run_idle_balance(int cpu);
>  #else
>  static inline void nohz_balance_enter_idle(int cpu) { }
> +static inline void nohz_run_idle_balance(int cpu) { }
>  #endif

(1) Since nohz_run_idle_balance() would be an interface one sched class
(fair) exports to another (idle) I wonder if kernel/sched/sched.h would
be the more appropriate include file to export/define it?

nohz_balance_exit_idle() is exported via kernel/sched/sched.h (used only
within the scheduler) whereas nohz_balance_enter_idle() is exported via
include/linux/sched/nohz.h (used in kernel/time/tick-sched.c).

Isn't include/linux/sched/nohz.h the interface between kernel/sched/ and
kernel/time?

There is one exception already though: calc_load_nohz_remote() defined
in kernel/sched/loadavg.c and (only) used in kernel/sched/core.c.


(2) Is there a need for an extra function nohz_run_idle_balance()?
do_idle() could call nohz_idle_balance() directly in case it would be
exported instead.
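
Purely to illustrate the question, a rough sketch (the helper name and
its placement are taken from this patch, the internal call mirrors the
existing _nohz_idle_balance() usage in fair.c; this is an illustration,
not the actual series) of what such a wrapper exported via
kernel/sched/sched.h could look like:

	/* kernel/sched/fair.c -- illustrative sketch only */
	void nohz_run_idle_balance(int cpu)
	{
		/*
		 * Update the blocked load of nohz-idle CPUs locally on the
		 * CPU that is about to enter idle, instead of kicking a
		 * random idle CPU to do it.
		 */
		_nohz_idle_balance(cpu_rq(cpu), NOHZ_STATS_KICK, CPU_NEWLY_IDLE);
	}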

[...]


Re: [PATCH 4/6] sched/fair: reorder newidle_balance pulled_task test

2021-02-09 Thread Dietmar Eggemann
On 05/02/2021 12:48, Vincent Guittot wrote:
> Reorder the tests and skip the useless ones when no load balance has
> been performed.

LGTM.

But IMHO the reason why those two if conditions can be skipped for the
'goto out' path is that we don't release the rq lock there, rather than
that no actual lb was performed. Might be worth saying this in the patch
header? It's already mentioned on top of the first if condition though.

> Signed-off-by: Vincent Guittot 
> ---
>  kernel/sched/fair.c | 10 +-
>  1 file changed, 5 insertions(+), 5 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index c587af230010..935594cd5430 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -10592,7 +10592,6 @@ static int newidle_balance(struct rq *this_rq, struct 
> rq_flags *rf)
>   if (curr_cost > this_rq->max_idle_balance_cost)
>   this_rq->max_idle_balance_cost = curr_cost;
>  
> -out:
>   /*
>* While browsing the domains, we released the rq lock, a task could
>* have been enqueued in the meantime. Since we're not going idle,
> @@ -10601,14 +10600,15 @@ static int newidle_balance(struct rq *this_rq, 
> struct rq_flags *rf)
>   if (this_rq->cfs.h_nr_running && !pulled_task)
>   pulled_task = 1;
>  
> - /* Move the next balance forward */
> - if (time_after(this_rq->next_balance, next_balance))
> - this_rq->next_balance = next_balance;
> -
>   /* Is there a task of a high priority class? */
>   if (this_rq->nr_running != this_rq->cfs.h_nr_running)
>   pulled_task = -1;
>  
> +out:
> + /* Move the next balance forward */
> + if (time_after(this_rq->next_balance, next_balance))
> + this_rq->next_balance = next_balance;
> +
>   if (pulled_task)
>   this_rq->idle_stamp = 0;
>   else
> 



Re: [PATCH 1/6] sched/fair: remove update of blocked load from newidle_balance

2021-02-09 Thread Dietmar Eggemann
On 05/02/2021 12:48, Vincent Guittot wrote:
> newidle_balance runs with both preempt and irq disabled which prevent
> local irq to run during this period. The duration for updating of the
> blocked load of CPUs varies according to the number of cgroups and

Maybe s/number of cgroups/number of CPU cgroups with non-decayed
cfs_rq's (i.e. cfs_rq within the leaf cfs_rq list)

> extends this critical period to an uncontrolled level.
> 
> Remove the update from newidle_balance and trigger a normal ILB that
> will take care of the update instead.
> 
> Signed-off-by: Vincent Guittot 

otherwise, LGTM.

[...]


Re: [PATCH 2/6] sched/fair: remove unused parameter of update_nohz_stats

2021-02-09 Thread Dietmar Eggemann
On 05/02/2021 12:48, Vincent Guittot wrote:
> idle load balance is the only user of update_nohz_stats and doesn't use
> force parameter. Remove it

Wasn't the 'force=true' from the ilb bypassing the jiffy-resolution rate
limiting '!time_after(jiffies, rq->last_blocked_load_update_tick)' of
update_blocked_averages()?

So IMHO this has the (maybe intended) side effect that the formerly
forced updates are now rate limited to a one-jiffy resolution too.

> 
> Signed-off-by: Vincent Guittot 
> ---
>  kernel/sched/fair.c | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index bfe1e235fe01..60b8c1c68ab9 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -8352,7 +8352,7 @@ group_type group_classify(unsigned int imbalance_pct,
>   return group_has_spare;
>  }
>  
> -static bool update_nohz_stats(struct rq *rq, bool force)
> +static bool update_nohz_stats(struct rq *rq)
>  {
>  #ifdef CONFIG_NO_HZ_COMMON
>   unsigned int cpu = rq->cpu;
> @@ -8363,7 +8363,7 @@ static bool update_nohz_stats(struct rq *rq, bool force)
>   if (!cpumask_test_cpu(cpu, nohz.idle_cpus_mask))
>   return false;
>  
> - if (!force && !time_after(jiffies, rq->last_blocked_load_update_tick))
> + if (!time_after(jiffies, rq->last_blocked_load_update_tick))
>   return true;
>  
>   update_blocked_averages(cpu);
> @@ -10404,7 +10404,7 @@ static bool _nohz_idle_balance(struct rq *this_rq, 
> unsigned int flags,
>  
>   rq = cpu_rq(balance_cpu);
>  
> - has_blocked_load |= update_nohz_stats(rq, true);
> + has_blocked_load |= update_nohz_stats(rq);
>  
>   /*
>* If time for next balance is due,
> 



Re: [PATCH 5/8] sched/fair: Make check_misfit_status() only compare dynamic capacities

2021-02-04 Thread Dietmar Eggemann
On 04/02/2021 12:34, Valentin Schneider wrote:
> On 04/02/21 11:49, Dietmar Eggemann wrote:
>> On 03/02/2021 16:15, Qais Yousef wrote:
>>> On 01/28/21 18:31, Valentin Schneider wrote:

[...]

> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 21bd71f58c06..ea7f0155e268 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -1482,6 +1482,33 @@ DECLARE_PER_CPU(struct sched_domain __rcu *, 
> sd_asym_packing);
>  DECLARE_PER_CPU(struct sched_domain __rcu *, sd_asym_cpucapacity);
>  extern struct static_key_false sched_asym_cpucapacity;
>  
> +/*
> + * Note that the static key is system-wide, but the visibility of
> + * SD_ASYM_CPUCAPACITY isn't. Thus the static key being enabled does not
> + * imply all CPUs can see asymmetry.
> + *
> + * Consider an asymmetric CPU capacity system such as:
> + *
> + * MC [   ]
> + * 0 1 2 3 4 5
> + * L L L L B B
> + *
> + * w/ arch_scale_cpu_capacity(L) < arch_scale_cpu_capacity(B)
> + *
> + * By default, booting this system will enable the sched_asym_cpucapacity
> + * static key, and all CPUs will see SD_ASYM_CPUCAPACITY set at their MC
> + * sched_domain.
> + *
> + * Further consider exclusive cpusets creating a "symmetric island":
> + *
> + * MC [   ][  ]
> + * 0 1  2 3 4 5
> + * L L  L L B B
> + *
> + * Again, booting this will enable the static key, but CPUs 0-1 will *not* 
> have
> + * SD_ASYM_CPUCAPACITY set in any of their sched_domain. This is the 
> intending

s/intending/intended

> + * behaviour, as CPUs 0-1 should be treated as a regular, isolated SMP 
> system.
> + */

LGTM.


Re: [PATCH 0/8] sched/fair: misfit task load-balance tweaks

2021-02-04 Thread Dietmar Eggemann
On 03/02/2021 19:43, Valentin Schneider wrote:
> Hi Qais,
> 
> On 03/02/21 15:14, Qais Yousef wrote:
>> On 01/28/21 18:31, Valentin Schneider wrote:
>>> Hi folks,
>>>
>>> Here is this year's series of misfit changes. On the menu:
>>>
>>> o Patch 1 is an independent active balance cleanup
>>> o Patch 2 adds some more sched_asym_cpucapacity static branches
>>> o Patch 3 introduces yet another margin for capacity to capacity
>>>   comparisons
>>> o Patches 4-6 build on top of patch 3 and change capacity comparisons
>>>   throughout misfit load balancing  
>>> o Patches 7-8 fix some extra misfit issues I've been seeing on "real"
>>>   workloads.
>>>
>>> IMO the somewhat controversial bit is patch 3, because it attempts to solve
>>> margin issues by... Adding another margin. This does solve issues on
>>> existing platforms (e.g. Pixel4), but we'll be back to square one the day
>>> some "clever" folks spin a platform with two different CPU capacities less 
>>> than
>>> 5% apart.
>>
>> One more margin is a cause of apprehension to me. But in this case I think it
>> is the appropriate thing to do now. I can't think of a scenario where this
>> could hurt.
>>
> 
> Thanks for having a look!
> 
>> Thanks
>>
>> --
>> Qais Yousef

How did you verify the benefit of these changes?

It's clear that you need a platform with capacity_orig diffs <20%
between CPU types (like Pixel4 - SD855 (4x261, 3x871, 1x1024) or QC's
RB5 platform - SD865 (4x284, 3x871, 1x1024)) but which
benchmark/testcase did you use?


Re: [PATCH 7/8] sched/fair: Attempt misfit active balance when migration_type != migrate_misfit

2021-02-04 Thread Dietmar Eggemann
On 03/02/2021 19:43, Valentin Schneider wrote:
> On 03/02/21 15:16, Qais Yousef wrote:
>> On 01/28/21 18:31, Valentin Schneider wrote:
>>> Giving group_misfit_task a higher group_classify() priority than
>>> group_imbalance doesn't seem like the right thing to do. Instead, make
>>> need_active_balance() return true for any migration_type when the
>>
>> s/need_active_balance()/voluntary_active_balance()/?
>>
>>> destination CPU is idle and the source CPU has a misfit task.
>>>
>>> While at it, add an sd_has_asym_cpucapacity() guard in
>>> need_active_balance().
>>
>> ditto.
>>
> 
> Myes, clearly this has been left to ferment for too long!

Wasn't the migrate_misfit condition moved from
voluntary_active_balance() into need_active_balance() by commit
("sched/fair: Reduce cases for active balance")?


Re: [PATCH 5/8] sched/fair: Make check_misfit_status() only compare dynamic capacities

2021-02-04 Thread Dietmar Eggemann
On 03/02/2021 16:15, Qais Yousef wrote:
> On 01/28/21 18:31, Valentin Schneider wrote:

[...]

>> @@ -10238,7 +10236,7 @@ static void nohz_balancer_kick(struct rq *rq)
>>   * When ASYM_CPUCAPACITY; see if there's a higher capacity CPU
>>   * to run the misfit task on.
>>   */
>> -if (check_misfit_status(rq, sd)) {
>> +if (check_misfit_status(rq)) {

Since check_misfit_status() doesn't need sd anymore, it looks like the
rcu_dereference(per_cpu(sd_asym_cpucapacity, cpu)) could be replaced by
static_branch_unlikely(&sched_asym_cpucapacity) in nohz_balancer_kick().

But as you mentioned in an earlier conversation we do need to check sd
because of asymmetric CPU capacity systems w/ exclusive cpusets which
could create symmetric islands (unique capacity_orig among CPUs).

Maybe worth putting a comment here (similar to the one in sis()) so
people don't try to optimize?
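
Something along the lines of the following sketch (based on the hunk
quoted above; the comment wording is mine, not the actual patch):

	sd = rcu_dereference(per_cpu(sd_asym_cpucapacity, cpu));
	if (sd) {
		/*
		 * Keep the sd check even though check_misfit_status() no
		 * longer needs sd: exclusive cpusets can create symmetric
		 * islands whose CPUs have no SD_ASYM_CPUCAPACITY domain
		 * while the sched_asym_cpucapacity static key stays enabled
		 * system-wide.
		 */
		if (check_misfit_status(rq)) {
			flags = NOHZ_KICK_MASK;
			goto unlock;
		}
	}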


Re: [PATCH] sched/fair: Rate limit calls to update_blocked_averages() for NOHZ

2021-02-04 Thread Dietmar Eggemann
On 03/02/2021 14:12, Vincent Guittot wrote:
> On Wed, 3 Feb 2021 at 12:54, Dietmar Eggemann  
> wrote:
>>
>> On 29/01/2021 18:27, Vincent Guittot wrote:
>>> On Friday 29 Jan 2021 at 11:33:00 (+0100), Vincent Guittot wrote:
>>>> On Thu, 28 Jan 2021 at 16:09, Joel Fernandes  
>>>> wrote:
>>>>>
>>>>> Hi Vincent,
>>>>>
>>>>> On Thu, Jan 28, 2021 at 8:57 AM Vincent Guittot
>>>>>  wrote:
>>>>>>> On Mon, Jan 25, 2021 at 03:42:41PM +0100, Vincent Guittot wrote:
>>>>>>>> On Fri, 22 Jan 2021 at 20:10, Joel Fernandes  
>>>>>>>> wrote:
>>>>>>>>> On Fri, Jan 22, 2021 at 05:56:22PM +0100, Vincent Guittot wrote:
>>>>>>>>>> On Fri, 22 Jan 2021 at 16:46, Joel Fernandes (Google)
>>>>>>>>>>  wrote:
>>
>> [...]
>>
>>>> The only point that I agree with, is that running
>>>> update_blocked_averages with preempt and irq off is not a good thing
>>>> because we don't manage the number of cfs_rq to update and I'm going
>>>> to provide a patchset for this
>>>
>>> The patch below moves the update of the blocked load of CPUs outside 
>>> newidle_balance().
>>>
>>> Instead, the update is done with the usual idle load balance update. I'm 
>>> working on an
>>> additional patch that will select this cpu that is about to become idle, 
>>> instead of a
>>> random idle cpu, but this 1st step fixes the problem of a lot of updates in 
>>> newly idle.
>>
>> I'm trying to understand the need for this extra patch.
>>
>> The patch below moves away from doing update_blocked_averages() (1) for
>> all CPUs in sched groups of the sched domain:
>>
>> newidle_balance()->load_balance()->
>> find_busiest_group()->update_sd_lb_stats()->update_sg_lb_stats()
>>
>> to:
>>
>> calling (1) for CPUs in nohz.idle_cpus_mask in _nohz_idle_balance() via
>> update_nohz_stats() and for the ilb CPU.
>>
>> newidle_balance() calls (1) for newidle CPU already.
>>
>> What would be the benefit to choose newidle CPU as ilb CPU?
> 
> To prevent waking up another idle cpu to run the update whereas
> newidle cpu is already woken up and about to be idle so the best
> candidate.
> All the aim of the removed code was to prevent waking up an idle cpu
> for doing something that could be done by the newidle cpu before it
> enters idle state

Ah OK, makes sense. So the only diff would be that newidle CPU will call
(1) on nohz.idle_cpus_mask rather than on sd->span.

[...]


Re: [PATCH] sched/fair: Rate limit calls to update_blocked_averages() for NOHZ

2021-02-03 Thread Dietmar Eggemann
On 29/01/2021 18:27, Vincent Guittot wrote:
> On Friday 29 Jan 2021 at 11:33:00 (+0100), Vincent Guittot wrote:
>> On Thu, 28 Jan 2021 at 16:09, Joel Fernandes  wrote:
>>>
>>> Hi Vincent,
>>>
>>> On Thu, Jan 28, 2021 at 8:57 AM Vincent Guittot
>>>  wrote:
> On Mon, Jan 25, 2021 at 03:42:41PM +0100, Vincent Guittot wrote:
>> On Fri, 22 Jan 2021 at 20:10, Joel Fernandes  
>> wrote:
>>> On Fri, Jan 22, 2021 at 05:56:22PM +0100, Vincent Guittot wrote:
 On Fri, 22 Jan 2021 at 16:46, Joel Fernandes (Google)
  wrote:

[...]

>> The only point that I agree with, is that running
>> update_blocked_averages with preempt and irq off is not a good thing
>> because we don't manage the number of cfs_rq to update and I'm going
>> to provide a patchset for this
> 
> The patch below moves the update of the blocked load of CPUs outside 
> newidle_balance().
> 
> Instead, the update is done with the usual idle load balance update. I'm 
> working on an
> additional patch that will select this cpu that is about to become idle, 
> instead of a
> random idle cpu, but this 1st step fixes the problem of a lot of updates in newly 
> idle.

I'm trying to understand the need for this extra patch.

The patch below moves away from doing update_blocked_averages() (1) for
all CPUs in sched groups of the sched domain:

newidle_balance()->load_balance()->
find_busiest_group()->update_sd_lb_stats()->update_sg_lb_stats()

to:

calling (1) for CPUs in nohz.idle_cpus_mask in _nohz_idle_balance() via
update_nohz_stats() and for the ilb CPU.

newidle_balance() calls (1) for newidle CPU already.

What would be the benefit to choose newidle CPU as ilb CPU?

> Signed-off-by: Vincent Guittot 
> ---
>  kernel/sched/fair.c | 32 +++-
>  1 file changed, 3 insertions(+), 29 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 197a51473e0c..8200b1d4df3d 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -7421,8 +7421,6 @@ enum migration_type {
>  #define LBF_NEED_BREAK   0x02
>  #define LBF_DST_PINNED  0x04
>  #define LBF_SOME_PINNED  0x08
> -#define LBF_NOHZ_STATS   0x10
> -#define LBF_NOHZ_AGAIN   0x20
>  
>  struct lb_env {
>   struct sched_domain *sd;
> @@ -8426,9 +8424,6 @@ static inline void update_sg_lb_stats(struct lb_env 
> *env,
>   for_each_cpu_and(i, sched_group_span(group), env->cpus) {
>   struct rq *rq = cpu_rq(i);
>  
> - if ((env->flags & LBF_NOHZ_STATS) && update_nohz_stats(rq, 
> false))
> - env->flags |= LBF_NOHZ_AGAIN;
> -
>   sgs->group_load += cpu_load(rq);
>   sgs->group_util += cpu_util(i);
>   sgs->group_runnable += cpu_runnable(rq);
> @@ -8969,11 +8964,6 @@ static inline void update_sd_lb_stats(struct lb_env 
> *env, struct sd_lb_stats *sd
>   struct sg_lb_stats tmp_sgs;
>   int sg_status = 0;
>  
> -#ifdef CONFIG_NO_HZ_COMMON
> - if (env->idle == CPU_NEWLY_IDLE && READ_ONCE(nohz.has_blocked))
> - env->flags |= LBF_NOHZ_STATS;
> -#endif
> -
>   do {
>   struct sg_lb_stats *sgs = &tmp_sgs;
>   int local_group;
> @@ -9010,15 +9000,6 @@ static inline void update_sd_lb_stats(struct lb_env 
> *env, struct sd_lb_stats *sd
>   /* Tag domain that child domain prefers tasks go to siblings first */
>   sds->prefer_sibling = child && child->flags & SD_PREFER_SIBLING;
>  
> -#ifdef CONFIG_NO_HZ_COMMON
> - if ((env->flags & LBF_NOHZ_AGAIN) &&
> - cpumask_subset(nohz.idle_cpus_mask, sched_domain_span(env->sd))) {
> -
> - WRITE_ONCE(nohz.next_blocked,
> -jiffies + msecs_to_jiffies(LOAD_AVG_PERIOD));
> - }
> -#endif
> -
>   if (env->sd->flags & SD_NUMA)
>   env->fbq_type = fbq_classify_group(&sds->busiest_stat);
>  
> @@ -10547,14 +10528,7 @@ static void nohz_newidle_balance(struct rq *this_rq)
>   return;
>  
>   raw_spin_unlock(&this_rq->lock);
> - /*
> -  * This CPU is going to be idle and blocked load of idle CPUs
> -  * need to be updated. Run the ilb locally as it is a good
> -  * candidate for ilb instead of waking up another idle CPU.
> -  * Kick an normal ilb if we failed to do the update.
> -  */
> - if (!_nohz_idle_balance(this_rq, NOHZ_STATS_KICK, CPU_NEWLY_IDLE))
> - kick_ilb(NOHZ_STATS_KICK);
> + kick_ilb(NOHZ_STATS_KICK);
>   raw_spin_lock(&this_rq->lock);
>  }
>  
> @@ -10616,8 +10590,6 @@ static int newidle_balance(struct rq *this_rq, struct 
> rq_flags *rf)
>   update_next_balance(sd, _balance);
>   rcu_read_unlock();
>  
> - nohz_newidle_balance(this_rq);
> -
>   goto out;
>   }
>  
> @@ -10683,6 +10655,8 @@ static int newidle_balance(struct rq *this_rq, struct 
> rq_flags *rf)
>  
>   if (pulled_task)
>   this_rq->idle_stamp = 0;
> + else
> + 

[tip: sched/core] sched: Remove USER_PRIO, TASK_USER_PRIO and MAX_USER_PRIO

2021-02-02 Thread tip-bot2 for Dietmar Eggemann
The following commit has been merged into the sched/core branch of tip:

Commit-ID: c18b4a67cc459fb8389f6a89ce28e404aafe562c
Gitweb:
https://git.kernel.org/tip/c18b4a67cc459fb8389f6a89ce28e404aafe562c
Author:Dietmar Eggemann 
AuthorDate:Thu, 28 Jan 2021 14:10:39 +01:00
Committer: Peter Zijlstra 
CommitterDate: Mon, 01 Feb 2021 15:31:39 +01:00

sched: Remove USER_PRIO, TASK_USER_PRIO and MAX_USER_PRIO

The only remaining use of MAX_USER_PRIO (and USER_PRIO) is the
SCALE_PRIO() definition in the PowerPC Cell architecture's Synergistic
Processor Unit (SPU) scheduler. TASK_USER_PRIO isn't used anymore.

Commit fe443ef2ac42 ("[POWERPC] spusched: Dynamic timeslicing for
SCHED_OTHER") copied SCALE_PRIO() from the task scheduler in v2.6.23.

Commit a4ec24b48dde ("sched: tidy up SCHED_RR") removed it from the task
scheduler in v2.6.24.

Commit 3ee237dddcd8 ("sched/prio: Add 3 macros of MAX_NICE, MIN_NICE and
NICE_WIDTH in prio.h") introduced NICE_WIDTH much later.

With:

  MAX_USER_PRIO = USER_PRIO(MAX_PRIO)

= MAX_PRIO - MAX_RT_PRIO

   MAX_PRIO = MAX_RT_PRIO + NICE_WIDTH

  MAX_USER_PRIO = MAX_RT_PRIO + NICE_WIDTH - MAX_RT_PRIO

  MAX_USER_PRIO = NICE_WIDTH

MAX_USER_PRIO can be replaced by NICE_WIDTH to be able to remove all the
{*_}USER_PRIO defines.

Signed-off-by: Dietmar Eggemann 
Signed-off-by: Peter Zijlstra (Intel) 
Link: https://lkml.kernel.org/r/20210128131040.296856-3-dietmar.eggem...@arm.com
---
 arch/powerpc/platforms/cell/spufs/sched.c |  2 +-
 include/linux/sched/prio.h|  9 -
 kernel/sched/sched.h  |  2 +-
 3 files changed, 2 insertions(+), 11 deletions(-)

diff --git a/arch/powerpc/platforms/cell/spufs/sched.c 
b/arch/powerpc/platforms/cell/spufs/sched.c
index f18d506..aeb7f39 100644
--- a/arch/powerpc/platforms/cell/spufs/sched.c
+++ b/arch/powerpc/platforms/cell/spufs/sched.c
@@ -72,7 +72,7 @@ static struct timer_list spuloadavg_timer;
 #define DEF_SPU_TIMESLICE  (100 * HZ / (1000 * SPUSCHED_TICK))
 
 #define SCALE_PRIO(x, prio) \
-   max(x * (MAX_PRIO - prio) / (MAX_USER_PRIO / 2), MIN_SPU_TIMESLICE)
+   max(x * (MAX_PRIO - prio) / (NICE_WIDTH / 2), MIN_SPU_TIMESLICE)
 
 /*
  * scale user-nice values [ -20 ... 0 ... 19 ] to time slice values:
diff --git a/include/linux/sched/prio.h b/include/linux/sched/prio.h
index d111f2f..ab83d85 100644
--- a/include/linux/sched/prio.h
+++ b/include/linux/sched/prio.h
@@ -27,15 +27,6 @@
 #define PRIO_TO_NICE(prio) ((prio) - DEFAULT_PRIO)
 
 /*
- * 'User priority' is the nice value converted to something we
- * can work with better when scaling various scheduler parameters,
- * it's a [ 0 ... 39 ] range.
- */
-#define USER_PRIO(p)   ((p)-MAX_RT_PRIO)
-#define TASK_USER_PRIO(p)  USER_PRIO((p)->static_prio)
-#define MAX_USER_PRIO  (USER_PRIO(MAX_PRIO))
-
-/*
  * Convert nice value [19,-20] to rlimit style value [1,40].
  */
 static inline long nice_to_rlimit(long nice)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 045b010..6edc67d 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -140,7 +140,7 @@ extern void call_trace_sched_update_nr_running(struct rq 
*rq, int count);
  * scale_load() and scale_load_down(w) to convert between them. The
  * following must be true:
  *
- *  scale_load(sched_prio_to_weight[USER_PRIO(NICE_TO_PRIO(0))]) == NICE_0_LOAD
+ *  scale_load(sched_prio_to_weight[NICE_TO_PRIO(0)-MAX_RT_PRIO]) == 
NICE_0_LOAD
  *
  */
 #define NICE_0_LOAD		(1L << NICE_0_LOAD_SHIFT)


[tip: sched/core] sched/topology: Fix sched_domain_topology_level alloc in sched_init_numa()

2021-02-02 Thread tip-bot2 for Dietmar Eggemann
The following commit has been merged into the sched/core branch of tip:

Commit-ID: e972d92d52a1f691498add14feb2ee5902d02404
Gitweb:
https://git.kernel.org/tip/e972d92d52a1f691498add14feb2ee5902d02404
Author:Dietmar Eggemann 
AuthorDate:Mon, 01 Feb 2021 10:53:53 +01:00
Committer: Peter Zijlstra 
CommitterDate: Mon, 01 Feb 2021 15:31:39 +01:00

sched/topology: Fix sched_domain_topology_level alloc in sched_init_numa()

Commit "sched/topology: Make sched_init_numa() use a set for the
deduplicating sort" allocates 'i + nr_levels (level)' instead of
'i + nr_levels + 1' sched_domain_topology_level.

This led to an Oops (on Arm64 juno with CONFIG_SCHED_DEBUG):

sched_init_domains
  build_sched_domains()
__free_domain_allocs()
  __sdt_free() {
...
for_each_sd_topology(tl)
  ...
  sd = *per_cpu_ptr(sdd->sd, j); <--
  ...
  }

Signed-off-by: Dietmar Eggemann 
Signed-off-by: Peter Zijlstra (Intel) 
Tested-by: Vincent Guittot 
Tested-by: Barry Song 
Link: https://lkml.kernel.org/r/6000e39e-7d28-c360-9cd6-8798fd22a...@arm.com
---
 kernel/sched/topology.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index bf5c9bd..09d3504 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1702,7 +1702,7 @@ void sched_init_numa(void)
/* Compute default topology size */
for (i = 0; sched_domain_topology[i].mask; i++);
 
-   tl = kzalloc((i + nr_levels) *
+   tl = kzalloc((i + nr_levels + 1) *
sizeof(struct sched_domain_topology_level), GFP_KERNEL);
if (!tl)
return;


[tip: sched/core] sched: Remove MAX_USER_RT_PRIO

2021-02-02 Thread tip-bot2 for Dietmar Eggemann
The following commit has been merged into the sched/core branch of tip:

Commit-ID: 4d38ea6a6d93115113fb4c023d5bb15e8ce1589c
Gitweb:
https://git.kernel.org/tip/4d38ea6a6d93115113fb4c023d5bb15e8ce1589c
Author:Dietmar Eggemann 
AuthorDate:Thu, 28 Jan 2021 14:10:38 +01:00
Committer: Peter Zijlstra 
CommitterDate: Mon, 01 Feb 2021 15:31:39 +01:00

sched: Remove MAX_USER_RT_PRIO

Commit d46523ea32a7 ("[PATCH] fix MAX_USER_RT_PRIO and MAX_RT_PRIO")
was introduced due to a small time period in which the realtime patch
set was using different values for MAX_USER_RT_PRIO and MAX_RT_PRIO.

This is no longer true, i.e. now MAX_RT_PRIO == MAX_USER_RT_PRIO.

Get rid of MAX_USER_RT_PRIO and make everything use MAX_RT_PRIO
instead.

Signed-off-by: Dietmar Eggemann 
Signed-off-by: Peter Zijlstra (Intel) 
Link: https://lkml.kernel.org/r/20210128131040.296856-2-dietmar.eggem...@arm.com
---
 include/linux/sched/prio.h |  9 +
 kernel/sched/core.c|  7 +++
 2 files changed, 4 insertions(+), 12 deletions(-)

diff --git a/include/linux/sched/prio.h b/include/linux/sched/prio.h
index 7d64fea..d111f2f 100644
--- a/include/linux/sched/prio.h
+++ b/include/linux/sched/prio.h
@@ -11,16 +11,9 @@
  * priority is 0..MAX_RT_PRIO-1, and SCHED_NORMAL/SCHED_BATCH
  * tasks are in the range MAX_RT_PRIO..MAX_PRIO-1. Priority
  * values are inverted: lower p->prio value means higher priority.
- *
- * The MAX_USER_RT_PRIO value allows the actual maximum
- * RT priority to be separate from the value exported to
- * user-space.  This allows kernel threads to set their
- * priority to a value higher than any user task. Note:
- * MAX_RT_PRIO must not be smaller than MAX_USER_RT_PRIO.
  */
 
-#define MAX_USER_RT_PRIO   100
-#define MAX_RT_PRIO		MAX_USER_RT_PRIO
+#define MAX_RT_PRIO		100
 
 #define MAX_PRIO   (MAX_RT_PRIO + NICE_WIDTH)
 #define DEFAULT_PRIO   (MAX_RT_PRIO + NICE_WIDTH / 2)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 06b4499..625ec1e 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5897,11 +5897,10 @@ recheck:
 
/*
 * Valid priorities for SCHED_FIFO and SCHED_RR are
-* 1..MAX_USER_RT_PRIO-1, valid priority for SCHED_NORMAL,
+* 1..MAX_RT_PRIO-1, valid priority for SCHED_NORMAL,
 * SCHED_BATCH and SCHED_IDLE is 0.
 */
-   if ((p->mm && attr->sched_priority > MAX_USER_RT_PRIO-1) ||
-   (!p->mm && attr->sched_priority > MAX_RT_PRIO-1))
+   if (attr->sched_priority > MAX_RT_PRIO-1)
return -EINVAL;
if ((dl_policy(policy) && !__checkparam_dl(attr)) ||
(rt_policy(policy) != (attr->sched_priority != 0)))
@@ -6969,7 +6968,7 @@ SYSCALL_DEFINE1(sched_get_priority_max, int, policy)
switch (policy) {
case SCHED_FIFO:
case SCHED_RR:
-   ret = MAX_USER_RT_PRIO-1;
+   ret = MAX_RT_PRIO-1;
break;
case SCHED_DEADLINE:
case SCHED_NORMAL:


[tip: sched/core] sched/core: Update task_prio() function header

2021-02-02 Thread tip-bot2 for Dietmar Eggemann
The following commit has been merged into the sched/core branch of tip:

Commit-ID: 075a28439d0c8eb6d3c799e1eed24bb9bc7750cd
Gitweb:
https://git.kernel.org/tip/075a28439d0c8eb6d3c799e1eed24bb9bc7750cd
Author:Dietmar Eggemann 
AuthorDate:Thu, 28 Jan 2021 14:10:40 +01:00
Committer: Peter Zijlstra 
CommitterDate: Mon, 01 Feb 2021 15:31:39 +01:00

sched/core: Update task_prio() function header

The description of the RT offset and the values for 'normal' tasks needs
update. Moreover there are DL tasks now.
task_prio() has to stay like it is to guarantee compatibility with the
/proc//stat priority field:

  # cat /proc//stat | awk '{ print $18; }'

Signed-off-by: Dietmar Eggemann 
Signed-off-by: Peter Zijlstra (Intel) 
Link: https://lkml.kernel.org/r/20210128131040.296856-4-dietmar.eggem...@arm.com
---
 kernel/sched/core.c | 8 ++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 625ec1e..be3a956 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5602,8 +5602,12 @@ SYSCALL_DEFINE1(nice, int, increment)
  * @p: the task in question.
  *
  * Return: The priority value as seen by users in /proc.
- * RT tasks are offset by -200. Normal tasks are centered
- * around 0, value goes from -16 to +15.
+ *
+ * sched policy         return value   kernel prio    user prio/nice
+ *
+ * normal, batch, idle     [0 ... 39]  [100 ... 139]          0/[-20 ... 19]
+ * fifo, rr             [-2 ... -100]     [98 ... 0]  [1 ... 99]
+ * deadline                     -101             -1           0
  */
 int task_prio(const struct task_struct *p)
 {


Re: [LKP] [sched/topology] 620a6dc407: BUG:KASAN:slab-out-of-bounds_in_build_sched_domains

2021-02-01 Thread Dietmar Eggemann
On 01/02/2021 08:49, kernel test robot wrote:
> 
> Greeting,
> 
> FYI, we noticed the following commit (built with gcc-9):
> 
> commit: 620a6dc40754dc218f5b6389b5d335e9a107fd29 ("sched/topology: Make 
> sched_init_numa() use a set for the deduplicating sort")
> https://git.kernel.org/cgit/linux/kernel/git/tip/tip.git sched/core
>

We also saw an issue with this patch during sched domain build which got
fixed by:

https://lkml.kernel.org/r/6000e39e-7d28-c360-9cd6-8798fd22a...@arm.com


Re: [PATCH 1/1] sched/topology: Make sched_init_numa() use a set for the deduplicating sort

2021-02-01 Thread Dietmar Eggemann
On 22/01/2021 13:39, Valentin Schneider wrote:

[...]

> @@ -1705,7 +1702,7 @@ void sched_init_numa(void)
>   /* Compute default topology size */
>   for (i = 0; sched_domain_topology[i].mask; i++);
>  
> - tl = kzalloc((i + level + 1) *
> + tl = kzalloc((i + nr_levels) *
>   sizeof(struct sched_domain_topology_level), GFP_KERNEL);
>   if (!tl)
>   return;

This hunk creates issues during startup on my Arm64 juno board on 
tip/sched/core.

---8<---

From: Dietmar Eggemann 
Date: Mon, 1 Feb 2021 09:58:04 +0100
Subject: [PATCH] sched/topology: Fix sched_domain_topology_level alloc in
 sched_init_numa

Commit "sched/topology: Make sched_init_numa() use a set for the
deduplicating sort" allocates 'i + nr_levels (level)' instead of
'i + nr_levels + 1' sched_domain_topology_level.

This led to an Oops (on Arm64 juno with CONFIG_SCHED_DEBUG):

sched_init_domains
  build_sched_domains()
__free_domain_allocs()
  __sdt_free() {
...
for_each_sd_topology(tl)
  ...
  sd = *per_cpu_ptr(sdd->sd, j); <--
  ...
  }

Signed-off-by: Dietmar Eggemann 
---

./scripts/decode_stacktrace.sh ./vmlinux < input.log

[0.495693] Internal error: Oops: 9605 [#1] PREEMPT SMP
[0.501280] Modules linked in:
[0.504342] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 
5.11.0-rc2-00025-g7a976f77bb96 #95
[0.512455] Hardware name: ARM Juno development board (r0) (DT)
[0.518386] pstate: 8005 (Nzcv daif -PAN -UAO -TCO BTYPE=--)
[0.524409] pc : build_sched_domains (kernel/sched/topology.c:1872 
kernel/sched/topology.c:1288 kernel/sched/topology.c:2120)
[0.529132] lr : build_sched_domains (kernel/sched/topology.c:1868 
kernel/sched/topology.c:1288 kernel/sched/topology.c:2120)
[0.533847] sp : 800011efbd00
[0.537165] x29: 800011efbd00 x28: 800011b69a38
[0.542496] x27:  x26: 
[0.547827] x25: 000800191600 x24: 
[0.553157] x23: 800011b69a40 x22: 0001
[0.558487] x21: 000800056e00 x20: 00080016f380
[0.563818] x19: 800011b6a410 x18: 
[0.569149] x17:  x16: 
[0.574478] x15: 0030 x14: 
[0.579809] x13: 800011b82d38 x12: 0189
[0.585139] x11: 0083 x10: 800011bdad38
[0.590469] x9 : f000 x8 : 800011b82d38
[0.595800] x7 : 00e0 x6 : 003f
[0.601130] x5 :  x4 : 
[0.606460] x3 :  x2 : 80096d857000
[0.611790] x1 : 0001 x0 : 0001
[0.617120] Call trace:
[0.619566] build_sched_domains (kernel/sched/topology.c:1872 
kernel/sched/topology.c:1288 kernel/sched/topology.c:2120)
[0.623934] sched_init_domains (kernel/sched/topology.c:2194)
[0.627954] sched_init_smp (kernel/sched/core.c:7727)
[0.631629] kernel_init_freeable (init/main.c:1528)
[0.635914] kernel_init (init/main.c:1417)
[0.639416] ret_from_fork (arch/arm64/kernel/entry.S:959)
[ 0.643008] Code: b4000360 93407f1b aa0003e1 f87b7ae2 (f8626821)
All code

   0:   60  (bad)
   1:   03 00   add(%rax),%eax
   3:   b4 1b   mov$0x1b,%ah
   5:   7f 40   jg 0x47
   7:   93  xchg   %eax,%ebx
   8:   e1 03   loope  0xd
   a:   00 aa e2 7a 7b f8   add%ch,-0x784851e(%rdx)
  10:*  21 68 62and%ebp,0x62(%rax)  <-- trapping 
instruction
  13:   f8  clc

Code starting with the faulting instruction
===
   0:   21 68 62and%ebp,0x62(%rax)
   3:   f8  clc
[0.649129] ---[ end trace 4453f742a7781011 ]---
[0.653772] Kernel panic - not syncing: Attempted to kill init! 
exitcode=0x000b
[0.661449] SMP: stopping secondary CPUs
[0.665389] ---[ end Kernel panic - not syncing: Attempted to kill init! 
exitcode=0x000b ]---

 kernel/sched/topology.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index bf5c9bd10bfe..09d35044bd88 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1702,7 +1702,7 @@ void sched_init_numa(void)
/* Compute default topology size */
for (i = 0; sched_domain_topology[i].mask; i++);
 
-   tl = kzalloc((i + nr_levels) *
+   tl = kzalloc((i + nr_levels + 1) *
sizeof(struct sched_domain_topology_level), GFP_KERNEL);
if (!tl)
return;
-- 
2.25.1


[PATCH 2/3] sched: Remove USER_PRIO, TASK_USER_PRIO and MAX_USER_PRIO

2021-01-28 Thread Dietmar Eggemann
The only remaining use of MAX_USER_PRIO (and USER_PRIO) is the
SCALE_PRIO() definition in the PowerPC Cell architecture's Synergistic
Processor Unit (SPU) scheduler. TASK_USER_PRIO isn't used anymore.

Commit fe443ef2ac42 ("[POWERPC] spusched: Dynamic timeslicing for
SCHED_OTHER") copied SCALE_PRIO() from the task scheduler in v2.6.23.

Commit a4ec24b48dde ("sched: tidy up SCHED_RR") removed it from the task
scheduler in v2.6.24.

Commit 3ee237dddcd8 ("sched/prio: Add 3 macros of MAX_NICE, MIN_NICE and
NICE_WIDTH in prio.h") introduced NICE_WIDTH much later.

With:

  MAX_USER_PRIO = USER_PRIO(MAX_PRIO)

= MAX_PRIO - MAX_RT_PRIO

   MAX_PRIO = MAX_RT_PRIO + NICE_WIDTH

  MAX_USER_PRIO = MAX_RT_PRIO + NICE_WIDTH - MAX_RT_PRIO

  MAX_USER_PRIO = NICE_WIDTH

MAX_USER_PRIO can be replaced by NICE_WIDTH to be able to remove all the
{*_}USER_PRIO defines.

Signed-off-by: Dietmar Eggemann 
---
 arch/powerpc/platforms/cell/spufs/sched.c | 2 +-
 include/linux/sched/prio.h| 9 -
 kernel/sched/sched.h  | 2 +-
 3 files changed, 2 insertions(+), 11 deletions(-)

diff --git a/arch/powerpc/platforms/cell/spufs/sched.c 
b/arch/powerpc/platforms/cell/spufs/sched.c
index f18d5067cd0f..aeb7f3922106 100644
--- a/arch/powerpc/platforms/cell/spufs/sched.c
+++ b/arch/powerpc/platforms/cell/spufs/sched.c
@@ -72,7 +72,7 @@ static struct timer_list spuloadavg_timer;
 #define DEF_SPU_TIMESLICE  (100 * HZ / (1000 * SPUSCHED_TICK))
 
 #define SCALE_PRIO(x, prio) \
-   max(x * (MAX_PRIO - prio) / (MAX_USER_PRIO / 2), MIN_SPU_TIMESLICE)
+   max(x * (MAX_PRIO - prio) / (NICE_WIDTH / 2), MIN_SPU_TIMESLICE)
 
 /*
  * scale user-nice values [ -20 ... 0 ... 19 ] to time slice values:
diff --git a/include/linux/sched/prio.h b/include/linux/sched/prio.h
index d111f2fd77ea..ab83d85e1183 100644
--- a/include/linux/sched/prio.h
+++ b/include/linux/sched/prio.h
@@ -26,15 +26,6 @@
 #define NICE_TO_PRIO(nice) ((nice) + DEFAULT_PRIO)
 #define PRIO_TO_NICE(prio) ((prio) - DEFAULT_PRIO)
 
-/*
- * 'User priority' is the nice value converted to something we
- * can work with better when scaling various scheduler parameters,
- * it's a [ 0 ... 39 ] range.
- */
-#define USER_PRIO(p)   ((p)-MAX_RT_PRIO)
-#define TASK_USER_PRIO(p)  USER_PRIO((p)->static_prio)
-#define MAX_USER_PRIO  (USER_PRIO(MAX_PRIO))
-
 /*
  * Convert nice value [19,-20] to rlimit style value [1,40].
  */
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 045b01064c1e..6edc67df3554 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -140,7 +140,7 @@ extern void call_trace_sched_update_nr_running(struct rq 
*rq, int count);
  * scale_load() and scale_load_down(w) to convert between them. The
  * following must be true:
  *
- *  scale_load(sched_prio_to_weight[USER_PRIO(NICE_TO_PRIO(0))]) == NICE_0_LOAD
+ *  scale_load(sched_prio_to_weight[NICE_TO_PRIO(0)-MAX_RT_PRIO]) == 
NICE_0_LOAD
  *
  */
 #define NICE_0_LOAD		(1L << NICE_0_LOAD_SHIFT)
-- 
2.25.1



[PATCH 1/3] sched: Remove MAX_USER_RT_PRIO

2021-01-28 Thread Dietmar Eggemann
Commit d46523ea32a7 ("[PATCH] fix MAX_USER_RT_PRIO and MAX_RT_PRIO")
was introduced due to a small time period in which the realtime patch
set was using different values for MAX_USER_RT_PRIO and MAX_RT_PRIO.

This is no longer true, i.e. now MAX_RT_PRIO == MAX_USER_RT_PRIO.

Get rid of MAX_USER_RT_PRIO and make everything use MAX_RT_PRIO
instead.

Signed-off-by: Dietmar Eggemann 
---
 include/linux/sched/prio.h | 9 +
 kernel/sched/core.c| 7 +++
 2 files changed, 4 insertions(+), 12 deletions(-)

diff --git a/include/linux/sched/prio.h b/include/linux/sched/prio.h
index 7d64feafc408..d111f2fd77ea 100644
--- a/include/linux/sched/prio.h
+++ b/include/linux/sched/prio.h
@@ -11,16 +11,9 @@
  * priority is 0..MAX_RT_PRIO-1, and SCHED_NORMAL/SCHED_BATCH
  * tasks are in the range MAX_RT_PRIO..MAX_PRIO-1. Priority
  * values are inverted: lower p->prio value means higher priority.
- *
- * The MAX_USER_RT_PRIO value allows the actual maximum
- * RT priority to be separate from the value exported to
- * user-space.  This allows kernel threads to set their
- * priority to a value higher than any user task. Note:
- * MAX_RT_PRIO must not be smaller than MAX_USER_RT_PRIO.
  */
 
-#define MAX_USER_RT_PRIO   100
-#define MAX_RT_PRIO		MAX_USER_RT_PRIO
+#define MAX_RT_PRIO		100
 
 #define MAX_PRIO   (MAX_RT_PRIO + NICE_WIDTH)
 #define DEFAULT_PRIO   (MAX_RT_PRIO + NICE_WIDTH / 2)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 06b449942adf..625ec1e12064 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5897,11 +5897,10 @@ static int __sched_setscheduler(struct task_struct *p,
 
/*
 * Valid priorities for SCHED_FIFO and SCHED_RR are
-* 1..MAX_USER_RT_PRIO-1, valid priority for SCHED_NORMAL,
+* 1..MAX_RT_PRIO-1, valid priority for SCHED_NORMAL,
 * SCHED_BATCH and SCHED_IDLE is 0.
 */
-   if ((p->mm && attr->sched_priority > MAX_USER_RT_PRIO-1) ||
-   (!p->mm && attr->sched_priority > MAX_RT_PRIO-1))
+   if (attr->sched_priority > MAX_RT_PRIO-1)
return -EINVAL;
if ((dl_policy(policy) && !__checkparam_dl(attr)) ||
(rt_policy(policy) != (attr->sched_priority != 0)))
@@ -6969,7 +6968,7 @@ SYSCALL_DEFINE1(sched_get_priority_max, int, policy)
switch (policy) {
case SCHED_FIFO:
case SCHED_RR:
-   ret = MAX_USER_RT_PRIO-1;
+   ret = MAX_RT_PRIO-1;
break;
case SCHED_DEADLINE:
case SCHED_NORMAL:
-- 
2.25.1



[PATCH 3/3] sched/core: Update task_prio() function header

2021-01-28 Thread Dietmar Eggemann
The description of the RT offset and the values for 'normal' tasks needs
update. Moreover there are DL tasks now.
task_prio() has to stay like it is to guarantee compatibility with the
/proc//stat priority field:

  # cat /proc//stat | awk '{ print $18; }'

Signed-off-by: Dietmar Eggemann 
---
 kernel/sched/core.c | 8 ++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 625ec1e12064..be3a956c2d23 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5602,8 +5602,12 @@ SYSCALL_DEFINE1(nice, int, increment)
  * @p: the task in question.
  *
  * Return: The priority value as seen by users in /proc.
- * RT tasks are offset by -200. Normal tasks are centered
- * around 0, value goes from -16 to +15.
+ *
+ * sched policy         return value   kernel prio    user prio/nice
+ *
+ * normal, batch, idle     [0 ... 39]  [100 ... 139]          0/[-20 ... 19]
+ * fifo, rr             [-2 ... -100]     [98 ... 0]  [1 ... 99]
+ * deadline                     -101             -1           0
  */
 int task_prio(const struct task_struct *p)
 {
-- 
2.25.1



[PATCH 0/3] sched: Task priority related cleanups

2021-01-28 Thread Dietmar Eggemann
(1) Removing MAX_USER_RT_PRIO was already discussed here in April 2020:

https://lkml.kernel.org/r/20200423094403.6f1d2...@gandalf.local.home

(2) USER_PRIO() and related macros are not used anymore except in one
case for powerpc where MAX_USER_PRIO can be replaced by NICE_WIDTH.
Set_load_weight(), task_prio(), cpu_weight_nice_write_s64(),
__update_max_tr() don't use USER_PRIO() but priority - MAX_RT_PRIO.

(3) The function header of task_prio() needs an update. It looks
ancient since it mentions a prio space [-16 ... 15] for normal
tasks. I can't figure out why this range is mentioned here? Maybe
the influence of the 'sleep-bonus interactivity' feature which was
removed by commit f3479f10c5d6 ("sched: remove the sleep-bonus
interactivity code")? 

Dietmar Eggemann (3):
  sched: Remove MAX_USER_RT_PRIO
  sched: Remove USER_PRIO, TASK_USER_PRIO and MAX_USER_PRIO
  sched/core: Update task_prio() function header

 arch/powerpc/platforms/cell/spufs/sched.c |  2 +-
 include/linux/sched/prio.h| 18 +-
 kernel/sched/core.c   | 15 +--
 kernel/sched/sched.h  |  2 +-
 4 files changed, 12 insertions(+), 25 deletions(-)

-- 
2.25.1



Re: [RFC PATCH v3 0/2] scheduler: expose the topology of clusters and add cluster scheduler

2021-01-26 Thread Dietmar Eggemann
On 25/01/2021 11:50, Song Bao Hua (Barry Song) wrote:
> 
>> -Original Message-
>> From: Dietmar Eggemann [mailto:dietmar.eggem...@arm.com]
>> Sent: Wednesday, January 13, 2021 12:00 AM
>> To: Morten Rasmussen ; Tim Chen
>> 
>> Cc: Song Bao Hua (Barry Song) ;
>> valentin.schnei...@arm.com; catalin.mari...@arm.com; w...@kernel.org;
>> r...@rjwysocki.net; vincent.guit...@linaro.org; l...@kernel.org;
>> gre...@linuxfoundation.org; Jonathan Cameron ;
>> mi...@redhat.com; pet...@infradead.org; juri.le...@redhat.com;
>> rost...@goodmis.org; bseg...@google.com; mgor...@suse.de;
>> mark.rutl...@arm.com; sudeep.ho...@arm.com; aubrey...@linux.intel.com;
>> linux-arm-ker...@lists.infradead.org; linux-kernel@vger.kernel.org;
>> linux-a...@vger.kernel.org; linux...@openeuler.org; xuwei (O)
>> ; Zengtao (B) ; tiantao (H)
>> 
>> Subject: Re: [RFC PATCH v3 0/2] scheduler: expose the topology of clusters 
>> and
>> add cluster scheduler
>>
>> On 11/01/2021 10:28, Morten Rasmussen wrote:
>>> On Fri, Jan 08, 2021 at 12:22:41PM -0800, Tim Chen wrote:
>>>>
>>>>
>>>> On 1/8/21 7:12 AM, Morten Rasmussen wrote:
>>>>> On Thu, Jan 07, 2021 at 03:16:47PM -0800, Tim Chen wrote:
>>>>>> On 1/6/21 12:30 AM, Barry Song wrote:

[...]

>> wake_wide() switches between packing (select_idle_sibling(), llc_size
>> CPUs) and spreading (find_idlest_cpu(), all CPUs).
>>
>> AFAICS, since none of the sched domains set SD_BALANCE_WAKE, currently
>> all wakeups are (llc-)packed.
> 
> Sorry for late response. I was struggling with some other topology
> issues recently.
> 
> For "all wakeups are (llc-)packed",
> it seems you mean current want_affine is only affecting the new_cpu,
> and for wake-up path, we will always go to select_idle_sibling() rather
> than find_idlest_cpu() since nobody sets SD_WAKE_BALANCE in any
> sched_domain ?
> 
>>
>>  select_task_rq_fair()
>>
>>for_each_domain(cpu, tmp)
>>
>>  if (tmp->flags & sd_flag)
>>sd = tmp;
>>
>>
>> In case we would like to further distinguish between llc-packing and
>> even narrower (cluster or MC-L2)-packing, we would introduce a 2. level
>> packing vs. spreading heuristic further down in sis().
> 
> I didn't get your point on "2 level packing". Would you like
> to describe more? It seems you mean we need to have separate
> calculation for avg_scan_cost and sched_feat(SIS_) for cluster
> (or MC-L2) since cluster and llc are not in the same level
> physically?

By '1. level packing' I meant going sis() (i.e. sd=per_cpu(sd_llc,
target)) instead of routing WF_TTWU through find_idlest_cpu() which uses
a broader sd span (in case all sd's (or at least up to an sd > llc)
would have SD_BALANCE_WAKE set).
wake_wide() (wakee/waker flip heuristic) is currently used to make this
decision. But since no sd sets SD_BALANCE_WAKE we always go sis() for
WF_TTWU.

'2. level packing' would be the decision between cluster- and
llc-packing. The question was which heuristic could be used here.

>> IMHO, Barry's current implementation doesn't do this right now. Instead
>> he's trying to pack on cluster first and if not successful look further
>> among the remaining llc CPUs for an idle CPU.
> 
> Yes. That is exactly what the current patch is doing.

And this will be favoring cluster- over llc-packing for each task instead.
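
To make the difference concrete, a hypothetical scan order for such
cluster-first packing inside the LLC (this is an illustration, not
Barry's actual patch; cluster_span()/llc_span() are placeholder names):

	int cpu;

	/* 1. try to pack on the wakee's cluster first */
	for_each_cpu_and(cpu, cluster_span(target), p->cpus_ptr) {
		if (available_idle_cpu(cpu))
			return cpu;
	}

	/* 2. only then scan the remaining CPUs of the LLC */
	for_each_cpu_and(cpu, llc_span(target), p->cpus_ptr) {
		if (cpumask_test_cpu(cpu, cluster_span(target)))
			continue;	/* already scanned above */
		if (available_idle_cpu(cpu))
			return cpu;
	}

	return -1;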


Re: [PATCH] sched/fair: Rate limit calls to update_blocked_averages() for NOHZ

2021-01-25 Thread Dietmar Eggemann
On 25/01/2021 18:30, Vincent Guittot wrote:
> On Mon, 25 Jan 2021 at 11:45, Dietmar Eggemann  
> wrote:
>>
>> On 22/01/2021 20:10, Joel Fernandes wrote:
>>> Hi Vincent,
>>>
>>> Thanks for reply. Please see the replies below:
>>>
>>> On Fri, Jan 22, 2021 at 05:56:22PM +0100, Vincent Guittot wrote:
>>>> On Fri, 22 Jan 2021 at 16:46, Joel Fernandes (Google)
>>>>  wrote:

[...]

>> If I understood you correctly, you want to avoid these frequent calls
>> to update_blocked_averages() here to further avoid invoking sched_util
>> via update_blocked_averages() -> cpufreq_update_util() (since 'decayed'
>> is set) very often in your setup.
> 
> So It's not clear if the problem that joel wants to raise, is about:
> - the running time of  update_blocked_averages
> - the running time of the cpufreq_update_util which is called because
> utilization has decayed during the update of blocked load
> - the wake up latency because of newly_idle lb

Pretty much so.

IIRC his interest is driven by the fact that he saw much less newly_idle
lb activity, and therefore fewer cpufreq_update_util() calls, on a system
with the same kernel and userspace but with fewer CPUs (i.e. also smaller
frequency domains) and fewer cgroups (with blocked load), and he started
wondering why.

I assume that since he now understands this environment much better, he
should be able to come up with better test numbers to show whether there
is a performance issue on his 2+6 DynamIQ system and, if yes, where
exactly in this code path.


Re: [PATCH] sched/fair: Rate limit calls to update_blocked_averages() for NOHZ

2021-01-25 Thread Dietmar Eggemann
On 22/01/2021 20:10, Joel Fernandes wrote:
> Hi Vincent,
> 
> Thanks for reply. Please see the replies below:
> 
> On Fri, Jan 22, 2021 at 05:56:22PM +0100, Vincent Guittot wrote:
>> On Fri, 22 Jan 2021 at 16:46, Joel Fernandes (Google)
>>  wrote:
>>>
>>> On an octacore ARM64 device running ChromeOS Linux kernel v5.4, I found
>>> that there are a lot of calls to update_blocked_averages(). This causes
>>> the schedule loop to slow down to taking upto 500 micro seconds at
>>> times (due to newidle load balance). I have also seen this manifest in
>>> the periodic balancer.
>>>
>>> Closer look shows that the problem is caused by the following
>>> ingredients:
>>> 1. If the system has a lot of inactive CGroups (thanks Dietmar for
>>> suggesting to inspect /proc/sched_debug for this), this can make
>>> __update_blocked_fair() take a long time.
>>
>> Inactive cgroups are removed from the list so they should not impact
>> the duration
> 
> I meant blocked CGroups. According to this code, a cfs_rq can be partially
> decayed and not have any tasks running on it but its load needs to be
> decayed, correct? That's what I meant by 'inactive'. I can reword it to
> 'blocked'.
> 
>   * There can be a lot of idle CPU cgroups.  Don't let fully
>   * decayed cfs_rqs linger on the list.
>   */
>  if (cfs_rq_is_decayed(cfs_rq))
>  list_del_leaf_cfs_rq(cfs_rq);
> 
>>> 2. The device has a lot of CPUs in a cluster which causes schedutil in a
>>> shared frequency domain configuration to be slower than usual. (the load
>>
>> What do you mean exactly by it causes schedutil to be slower than usual ?
> 
> sugov_next_freq_shared() is order number of CPUs in the a cluster. This
> system is a 6+2 system with 6 CPUs in a cluster. schedutil shared policy
> frequency update needs to go through utilization of other CPUs in the
> cluster. I believe this could be adding to the problem but is not really
> needed to optimize if we can rate limit the calls to update_blocked_averages
> to begin with.
> 
>>> average updates also try to update the frequency in schedutil).
>>>
>>> 3. The CPU is running at a low frequency causing the scheduler/schedutil
>>> code paths to take longer than when running at a high CPU frequency.
>>
>> Low frequency usually means low utilization so it should happen that much.
> 
> It happens a lot as can be seen with schbench. It is super easy to reproduce.
> 
> schedule() can result in new idle balance with the CFS pick call happening
> often. Here is a function graph trace.  The tracer shows
> update_blocked_averages taking a lot of time.
> 
>  sugov:0-2454  [002]  2657.992570: funcgraph_entry:   |  
> load_balance() {
>  sugov:0-2454  [002]  2657.992577: funcgraph_entry:   |   
>  update_group_capacity() {
>  sugov:0-2454  [002]  2657.992580: funcgraph_entry:2.656 us   |   
>__msecs_to_jiffies();
>  sugov:0-2454  [002]  2657.992585: funcgraph_entry:2.447 us   |   
>_raw_spin_lock_irqsave();
>  sugov:0-2454  [002]  2657.992591: funcgraph_entry:2.552 us   |   
>_raw_spin_unlock_irqrestore();
>  sugov:0-2454  [002]  2657.992595: funcgraph_exit:   + 17.448 us  |   
>  }
>  sugov:0-2454  [002]  2657.992597: funcgraph_entry:1.875 us   |   
>  update_nohz_stats();
>  sugov:0-2454  [002]  2657.992601: funcgraph_entry:1.667 us   |   
>  idle_cpu();
>  sugov:0-2454  [002]  2657.992605: funcgraph_entry:   |   
>  update_nohz_stats() {
>  sugov:0-2454  [002]  2657.992608: funcgraph_entry:  + 33.333 us  |   
>update_blocked_averages();
>  sugov:0-2454  [002]  2657.992643: funcgraph_exit:   + 38.073 us  |   
>  }
>  sugov:0-2454  [002]  2657.992645: funcgraph_entry:1.770 us   |   
>  idle_cpu();
>  sugov:0-2454  [002]  2657.992649: funcgraph_entry:   |   
>  update_nohz_stats() {
>  sugov:0-2454  [002]  2657.992651: funcgraph_entry:  + 41.823 us  |   
>update_blocked_averages();
>  sugov:0-2454  [002]  2657.992694: funcgraph_exit:   + 45.729 us  |   
>  }
>  sugov:0-2454  [002]  2657.992696: funcgraph_entry:1.823 us   |   
>  idle_cpu();
>  sugov:0-2454  [002]  2657.992700: funcgraph_entry:   |   
>  update_nohz_stats() {
>  sugov:0-2454  [002]  2657.992702: funcgraph_entry:  + 35.312 us  |   
>update_blocked_averages();
>  sugov:0-2454  [002]  2657.992740: funcgraph_exit:   + 39.792 us  |   
>  }
>  sugov:0-2454  [002]  2657.992742: funcgraph_entry:1.771 us   |   
>  idle_cpu();
>  sugov:0-2454  [002]  2657.992746: funcgraph_entry:   |   
>  update_nohz_stats() {
>  sugov:0-2454  [002]  2657.992748: funcgraph_entry:  + 33.438 us  |   
>update_blocked_averages();
>  sugov:0-2454  [002]  2657.992783: funcgraph_exit:   + 37.500 us  |   
> 

Re: 5.11-rc4+git: Shortest NUMA path spans too many nodes

2021-01-22 Thread Dietmar Eggemann
On 21/01/2021 22:17, Song Bao Hua (Barry Song) wrote:
> 
> 
>> -Original Message-----
>> From: Dietmar Eggemann [mailto:dietmar.eggem...@arm.com]
>> Sent: Friday, January 22, 2021 7:54 AM
>> To: Valentin Schneider ; Meelis Roos
>> ; LKML 
>> Cc: Peter Zijlstra ; Vincent Guittot
>> ; Song Bao Hua (Barry Song)
>> ; Mel Gorman 
>> Subject: Re: 5.11-rc4+git: Shortest NUMA path spans too many nodes
>>
>> On 21/01/2021 19:21, Valentin Schneider wrote:
>>> On 21/01/21 19:39, Meelis Roos wrote:

[...]

>> # cat /sys/devices/system/node/node*/distance
>> 10 12 12 14 14 14 14 16
>> 12 10 14 12 14 14 12 14
>> 12 14 10 14 12 12 14 14
>> 14 12 14 10 12 12 14 14
>> 14 14 12 12 10 14 12 14
>> 14 14 12 12 14 10 14 12
>> 14 12 14 14 12 14 10 12
>> 16 14 14 14 14 12 12 10
>>
>> The '16' seems to be the culprit. How does such a topo look like?

Maybe like this (drawing each pair of nodes with distance 12 as directly
connected):

  0 - 1, 2
  1 - 0, 3, 6
  2 - 0, 4, 5
  3 - 1, 4, 5
  4 - 2, 3, 6
  5 - 2, 3, 7
  6 - 1, 4, 7
  7 - 5, 6

> 
> Once we get a topology like this:
> 
> 
>  +--+ +--++---+   +--+
>  | node | |node  || node  |   |node  |
>  |  +-+  ++   +---+  |
>  +--+ +--++---+   +--+
> 
> We can reproduce this issue. 
> For example, every cpu with the below numa_distance can have 
> "groups don't span domain->span":
> node   0   1   2   3
>   0:  10  12  20  22
>   1:  12  10  22  24
>   2:  20  22  10  12
>   3:  22  24  12  10
                              12      20      12
So this should look like: 1 ------ 0 ------ 2 ------ 3


Re: 5.11-rc4+git: Shortest NUMA path spans too many nodes

2021-01-21 Thread Dietmar Eggemann
On 21/01/2021 19:21, Valentin Schneider wrote:
> On 21/01/21 19:39, Meelis Roos wrote:
>>> Could you paste the output of the below?
>>>
>>>$ cat /sys/devices/system/node/node*/distance
>>
>> 10 12 12 14 14 14 14 16
>> 12 10 14 12 14 14 12 14
>> 12 14 10 14 12 12 14 14
>> 14 12 14 10 12 12 14 14
>> 14 14 12 12 10 14 12 14
>> 14 14 12 12 14 10 14 12
>> 14 12 14 14 12 14 10 12
>> 16 14 14 14 14 12 12 10
>>
> 
> Thanks!
> 
>>
>>> Additionally, booting your system with CONFIG_SCHED_DEBUG=y and
>>> appending 'sched_debug' to your cmdline should yield some extra data.
>>
>> [0.00] Linux version 5.11.0-rc4-00015-g45dfb8a5659a (mroos@x4600m2) 
>> (gcc (Debian 10.2.1-6) 10.2.1 20210110, GNU ld (GNU Binutils for Debian) 
>> 2.35.1) #55 SMP Thu Jan 21 19:23:10 EET 2021
>> [0.00] Command line: 
>> BOOT_IMAGE=/boot/vmlinuz-5.11.0-rc4-00015-g45dfb8a5659a root=/dev/sda1 ro 
>> quiet
> 
> This is missing 'sched_debug' to get the extra topology debug prints (yes
> it needs an extra cmdline argument on top of having CONFIG_SCHED_DEBUG=y),
> but I should be able to generate those locally by feeding QEMU the above
> distance table.

Can be recreated with (simplified with only 1 CPU per node):

$ qemu-system-aarch64 -kernel /opt/git/kernel_org/arch/arm64/boot/Image -hda 
/opt/git/tools/qemu-imgs-manipulator/images/qemu-image-aarch64.img -append 
'root=/dev/vda console=ttyAMA0 loglevel=8 sched_debug' -nographic -machine 
virt,gic-version=max -smp cores=8 -m 512 -cpu cortex-a57 -numa 
node,cpus=0,nodeid=0 -numa node,cpus=1,nodeid=1, -numa node,cpus=2,nodeid=2, 
-numa node,cpus=3,nodeid=3, -numa node,cpus=4,nodeid=4, -numa 
node,cpus=5,nodeid=5, -numa node,cpus=6,nodeid=6, -numa node,cpus=7,nodeid=7, 
-numa dist,src=0,dst=1,val=12, -numa dist,src=0,dst=2,val=12, -numa 
dist,src=0,dst=3,val=14, -numa dist,src=0,dst=4,val=14, -numa 
dist,src=0,dst=5,val=14, -numa dist,src=0,dst=6,val=14, -numa 
dist,src=0,dst=7,val=16, -numa dist,src=1,dst=2,val=14, -numa 
dist,src=1,dst=3,val=12, -numa dist,src=1,dst=4,val=14, -numa 
dist,src=1,dst=5,val=14, -numa dist,src=1,dst=6,val=12, -numa 
dist,src=1,dst=7,val=14, -numa dist,src=2,dst=3,val=14, -numa 
dist,src=2,dst=4,val=12, -numa dist,src=2,dst=5,val=12, -numa 
dist,src=2,dst=6,val=14, -numa dist,src=2,dst=7,val=14, -numa 
dist,src=3,dst=4,val=12, -numa dist,src=3,dst=5,val=12, -numa 
dist,src=3,dst=6,val=14, -numa dist,src=3,dst=7,val=14, -numa 
dist,src=4,dst=5,val=14, -numa dist,src=4,dst=6,val=12, -numa 
dist,src=4,dst=7,val=14, -numa dist,src=5,dst=6,val=14, -numa 
dist,src=5,dst=7,val=12, -numa dist,src=6,dst=7,val=12

[0.206628] [ cut here ]
[0.206698] Shortest NUMA path spans too many nodes
[0.207119] WARNING: CPU: 0 PID: 1 at kernel/sched/topology.c:753 
cpu_attach_domain+0x42c/0x87c
[0.207176] Modules linked in:
[0.207373] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 
5.11.0-rc2-00010-g65bcf072e20e-dirty #81
[0.207458] Hardware name: linux,dummy-virt (DT)
[0.207584] pstate: 6005 (nZCv daif -PAN -UAO -TCO BTYPE=--)
[0.207618] pc : cpu_attach_domain+0x42c/0x87c
[0.207646] lr : cpu_attach_domain+0x42c/0x87c
[0.207665] sp : 800011fcbbf0
[0.207679] x29: 800011fcbbf0 x28: 024d8200 
[0.207735] x27: 1fef x26: 1917 
[0.207755] x25: 024d8000 x24: 1917 
[0.207772] x23:  x22: 800011b69a40 
[0.207789] x21: 024d8320 x20: 8000116fda80 
[0.207806] x19: 024d8000 x18:  
[0.207822] x17:  x16: bd30d762 
[0.207838] x15: 0030 x14:  
[0.207855] x13: 800011b82e08 x12: 01b9 
[0.207871] x11: 0093 x10: 800011bdae08 
[0.207887] x9 : f000 x8 : 800011b82e08 
[0.207922] x7 : 800011bdae08 x6 :  
[0.207939] x5 :  x4 :  
[0.207955] x3 :  x2 :  
[0.207972] x1 :  x0 : 1802 
[0.208125] Call trace:
[0.208230]  cpu_attach_domain+0x42c/0x87c
[0.208256]  build_sched_domains+0x1238/0x12f4
[0.208271]  sched_init_domains+0x80/0xb0
[0.208283]  sched_init_smp+0x30/0x80
[0.208299]  kernel_init_freeable+0xf4/0x238
[0.208313]  kernel_init+0x14/0x118
[0.208328]  ret_from_fork+0x10/0x34
[0.208507] ---[ end trace 75cafa7c7d1a3d7e ]---
[0.208706] CPU0 attaching sched-domain(s):
[0.208756]  domain-0: span=0-2 level=NUMA
[0.209001]   groups: 0:{ span=0 cap=1017 }, 1:{ span=1 cap=1016 }, 2:{ 
span=2 cap=1015 }
[0.209247]   domain-1: span=0-6 level=NUMA
[0.209280]groups: 0:{ span=0-2 mask=0 cap=3048 }, 3:{ span=1,3-5 mask=3 
cap=4073 }, 6:{ span=1,4,6-7 mask=6 cap=4084 }
[0.209693] ERROR: groups don't span domain->span
[0.209703]domain-2: span=0-7 level=NUMA
[0.209722] groups: 0:{ span=0-6 

Re: [PATCH 4/6] sched/deadline: Block DL tasks on non-exclusive cpuset if bandwitdh control is enable

2021-01-19 Thread Dietmar Eggemann
On 19/01/2021 10:41, Daniel Bristot de Oliveira wrote:
> On 1/14/21 4:51 PM, Dietmar Eggemann wrote:
>> On 12/01/2021 16:53, Daniel Bristot de Oliveira wrote:

[...]

>> with this patch:
>>
>> cgroupv1:
>>
>> root@juno:/sys/fs/cgroup/cpuset# chrt -d --sched-period 10
>> --sched-runtime 1 0 sleep 500 &
>> [1] 1668
>> root@juno:/sys/fs/cgroup/cpuset# PID1=$!
>>
>> root@juno:/sys/fs/cgroup/cpuset# chrt -d --sched-period 10
>> --sched-runtime 1 0 sleep 500 &
>> [2] 1669
>> root@juno:/sys/fs/cgroup/cpuset# PID2=$!
>>
>> root@juno:/sys/fs/cgroup/cpuset# mkdir A
>>
>> root@juno:/sys/fs/cgroup/cpuset# echo 0 > ./A/cpuset.mems
>> root@juno:/sys/fs/cgroup/cpuset# echo 0 > ./A/cpuset.cpus
>>
>> root@juno:/sys/fs/cgroup/cpuset# echo $PID2 > ./A/cgroup.procs
>> -bash: echo: write error: Device or resource busy
>>
>> root@juno:/sys/fs/cgroup/cpuset# echo 1 > ./A/cpuset.cpu_exclusive
>>
>> root@juno:/sys/fs/cgroup/cpuset# echo $PID2 > ./A/cgroup.procs
>>
>> root@juno:/sys/fs/cgroup/cpuset# cat /proc/$PID1/status | grep
>> Cpus_allowed_list | awk '{print $2}'
>> 0-5
>> root@juno:/sys/fs/cgroup/cpuset# cat /proc/$PID2/status | grep
>> Cpus_allowed_list | awk '{print $2}'
>> 0
> 
> On CPU v1 we also need to disable the load balance to create a root domain, 
> right?

IMHO, that's not necessary for this example. But yes, if we create 2
exclusive cpusets A and B, we want to turn off load-balancing at root
level. It also doesn't hurt to do this in this example. But then we end
up with no sched domain, since load-balancing is disabled at root and A
only contains CPU0.

root@juno:/sys/fs/cgroup/cpuset# echo 0 > cpuset.sched_load_balance

ls /proc/sys/kernel/sched_domain/cpu*/ doesn't show any (sched) domains.

>> cgroupv2:
> 
> Yeah, I see your point. I was seeing a different output because of Fedora
> default's behavior of adding the tasks to the system.slice/user.slice...
> 
> doing:
> 
>> root@juno:/sys/fs/cgroup# echo +cpuset > cgroup.subtree_control
> 
> # echo $$ > cgroup.procs

The current shell should already be in the root cgroup?

root@juno:/sys/fs/cgroup# echo $$
1644
root@juno:/sys/fs/cgroup# cat cgroup.procs | grep $$
1644

[...]



[PATCH] sched/deadline: Reduce rq lock contention in dl_add_task_root_domain()

2021-01-19 Thread Dietmar Eggemann
dl_add_task_root_domain() is called during sched domain rebuild:

  rebuild_sched_domains_locked()
partition_and_rebuild_sched_domains()
  rebuild_root_domains()
 for all top_cpuset descendants:
   update_tasks_root_domain()
 for all tasks of cpuset:
   dl_add_task_root_domain()

Change it so that only the task's pi lock is taken to check if the task
has a SCHED_DEADLINE (DL) policy. In case p is a DL task, take the
rq lock as well to be able to safely de-reference the root domain's DL
bandwidth structure.

Most tasks will have another policy (namely SCHED_NORMAL) and can now
bail without taking the rq lock.

One thing to note here: even if there aren't any DL user tasks, a slow
frequency switching system with the schedutil cpufreq governor has one
DL task (sugov) running per frequency domain which participates in DL
bandwidth management.

Reviewed-by: Quentin Perret 
Signed-off-by: Dietmar Eggemann 
---

The use case in which this makes a noticeable difference is Android's
'CPU pause' power management feature.

It uses CPU hotplug control to clear a CPU from active state to force
all threads which are not per-cpu kthreads away from this CPU.

Making DL bandwidth management faster during sched domain rebuild helps
to reduce the time to pause/un-pause a CPU.

 kernel/sched/deadline.c | 11 +++
 1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 5421782fe897..c7b1a63a053b 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -2409,9 +2409,13 @@ void dl_add_task_root_domain(struct task_struct *p)
struct rq *rq;
struct dl_bw *dl_b;
 
-	rq = task_rq_lock(p, &rf);
-	if (!dl_task(p))
-		goto unlock;
+	raw_spin_lock_irqsave(&p->pi_lock, rf.flags);
+	if (!dl_task(p)) {
+		raw_spin_unlock_irqrestore(&p->pi_lock, rf.flags);
+		return;
+	}
+
+	rq = __task_rq_lock(p, &rf);
 
 	dl_b = &rq->rd->dl_bw;
 	raw_spin_lock(&dl_b->lock);
@@ -2420,7 +2424,6 @@ void dl_add_task_root_domain(struct task_struct *p)
 
 	raw_spin_unlock(&dl_b->lock);
 
-unlock:
 	task_rq_unlock(rq, p, &rf);
 }
 
-- 
2.25.1



Re: [PATCH 6/6] sched/deadline: Fixes cpu/rd/dl_bw references for suspended tasks

2021-01-15 Thread Dietmar Eggemann
On 12/01/2021 16:53, Daniel Bristot de Oliveira wrote:

[...]

> - %< -
>   #!/bin/bash
>   # Enter on the cgroup directory
>   cd /sys/fs/cgroup/
> 
>   # Check it if is cgroup v2 and enable cpuset
>   if [ -e cgroup.subtree_control ]; then
>   # Enable cpuset controller on cgroup v2
>   echo +cpuset > cgroup.subtree_control
>   fi
> 
>   echo LOG: create an exclusive cpuset and assigned the CPU 0 to it
>   # Create cpuset groups
>   rmdir dl-group &> /dev/null
>   mkdir dl-group
> 
>   # Restrict the task to the CPU 0
>   echo 0 > dl-group/cpuset.mems
>   echo 0 > dl-group/cpuset.cpus
>   echo root >  dl-group/cpuset.cpus.partition
> 
>   echo LOG: dispatching a regular task
>   sleep 100 &
>   CPUSET_PID="$!"
> 
>   # let it settle down
>   sleep 1
> 
>   # Assign the second task to the cgroup

There is only one task 'CPUSET_PID' involved here?

>   echo LOG: moving the second DL task to the cpuset
>   echo "$CPUSET_PID" > dl-group/cgroup.procs 2> /dev/null
> 
>   CPUSET_ALLOWED=`cat /proc/$CPUSET_PID/status | grep Cpus_allowed_list | awk 
> '{print $2}'`
> 
>   chrt -p -d --sched-period 10 --sched-runtime 1 0 $CPUSET_PID
>   ACCEPTED=$?
> 
>   if [ $ACCEPTED == 0 ]; then
>   echo PASS: the task became DL
>   else
>   echo FAIL: the task was rejected as DL
>   fi

[...]

> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 5961a97541c2..3c2775e6869f 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -5905,15 +5905,15 @@ static int __sched_setscheduler(struct task_struct *p,
>  #ifdef CONFIG_SMP
>   if (dl_bandwidth_enabled() && dl_policy(policy) &&
>   !(attr->sched_flags & SCHED_FLAG_SUGOV)) {
> - cpumask_t *span = rq->rd->span;
> + struct root_domain *rd = dl_task_rd(p);
>  
>   /*
>* Don't allow tasks with an affinity mask smaller than
>* the entire root_domain to become SCHED_DEADLINE. We
>* will also fail if there's no bandwidth available.
>*/
> - if (!cpumask_subset(span, p->cpus_ptr) ||
> - rq->rd->dl_bw.bw == 0) {
> + if (!cpumask_subset(rd->span, p->cpus_ptr) ||
> + rd->dl_bw.bw == 0) {
>   retval = -EPERM;
>   goto unlock;
>   }
> diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
> index c221e14d5b86..1f6264cb8867 100644
> --- a/kernel/sched/deadline.c
> +++ b/kernel/sched/deadline.c
> @@ -2678,8 +2678,8 @@ int sched_dl_overflow(struct task_struct *p, int policy,
>   u64 period = attr->sched_period ?: attr->sched_deadline;
>   u64 runtime = attr->sched_runtime;
>   u64 new_bw = dl_policy(policy) ? to_ratio(period, runtime) : 0;
> - int cpus, err = -1, cpu = task_cpu(p);
> - struct dl_bw *dl_b = dl_bw_of(cpu);
> + int cpus, err = -1, cpu = dl_task_cpu(p);
> + struct dl_bw *dl_b = dl_task_root_bw(p);
>   unsigned long cap;
>  
>   if (attr->sched_flags & SCHED_FLAG_SUGOV)
> 

Wouldn't it be sufficient to just introduce dl_task_cpu() and use the
correct cpu to get rd->span or struct dl_bw in __sched_setscheduler()
-> sched_dl_overflow()?

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 5961a97541c2..0573f676696a 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5905,7 +5905,8 @@ static int __sched_setscheduler(struct task_struct *p,
 #ifdef CONFIG_SMP
if (dl_bandwidth_enabled() && dl_policy(policy) &&
!(attr->sched_flags & SCHED_FLAG_SUGOV)) {
-   cpumask_t *span = rq->rd->span;
+   int cpu = dl_task_cpu(p);
+   cpumask_t *span = cpu_rq(cpu)->rd->span;
 
/*
 * Don't allow tasks with an affinity mask smaller than
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index c221e14d5b86..308ecaaf3d28 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -2678,7 +2678,7 @@ int sched_dl_overflow(struct task_struct *p, int policy,
u64 period = attr->sched_period ?: attr->sched_deadline;
u64 runtime = attr->sched_runtime;
u64 new_bw = dl_policy(policy) ? to_ratio(period, runtime) : 0;
-   int cpus, err = -1, cpu = task_cpu(p);
+   int cpus, err = -1, cpu = dl_task_cpu(p);
struct dl_bw *dl_b = dl_bw_of(cpu);
unsigned long cap;


Re: [PATCH 4/6] sched/deadline: Block DL tasks on non-exclusive cpuset if bandwitdh control is enable

2021-01-14 Thread Dietmar Eggemann
On 12/01/2021 16:53, Daniel Bristot de Oliveira wrote:
> The current SCHED_DEADLINE design supports only global scheduler,
> or variants of it, i.e., clustered and partitioned, via cpuset config.
> To enable the partitioning of a system with clusters of CPUs, the
> documentation advises the usage of exclusive cpusets, creating an
> exclusive root_domain for the cpuset.
> 
> Attempts to change the cpu affinity of a thread to a cpu mask different
> from the root domain results in an error. For instance:

[...]

> diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
> index 788a391657a5..c221e14d5b86 100644
> --- a/kernel/sched/deadline.c
> +++ b/kernel/sched/deadline.c
> @@ -2878,6 +2878,13 @@ int dl_task_can_attach(struct task_struct *p,
>   if (cpumask_empty(cs_cpus_allowed))
>   return 0;
>  
> + /*
> +  * Do not allow moving tasks to non-exclusive cpusets
> +  * if bandwidth control is enabled.
> +  */
> + if (dl_bandwidth_enabled() && !exclusive)
> + return -EBUSY;
> +
>   /*
>* The task is not moving to another root domain, so it is
>* already accounted.
> 

But doesn't this mean you only have to set this cgroup exclusive/root to
run into the same issue:

with this patch:

cgroupv1:

root@juno:/sys/fs/cgroup/cpuset# chrt -d --sched-period 10
--sched-runtime 1 0 sleep 500 &
[1] 1668
root@juno:/sys/fs/cgroup/cpuset# PID1=$!

root@juno:/sys/fs/cgroup/cpuset# chrt -d --sched-period 10
--sched-runtime 1 0 sleep 500 &
[2] 1669
root@juno:/sys/fs/cgroup/cpuset# PID2=$!

root@juno:/sys/fs/cgroup/cpuset# mkdir A

root@juno:/sys/fs/cgroup/cpuset# echo 0 > ./A/cpuset.mems
root@juno:/sys/fs/cgroup/cpuset# echo 0 > ./A/cpuset.cpus

root@juno:/sys/fs/cgroup/cpuset# echo $PID2 > ./A/cgroup.procs
-bash: echo: write error: Device or resource busy

root@juno:/sys/fs/cgroup/cpuset# echo 1 > ./A/cpuset.cpu_exclusive

root@juno:/sys/fs/cgroup/cpuset# echo $PID2 > ./A/cgroup.procs

root@juno:/sys/fs/cgroup/cpuset# cat /proc/$PID1/status | grep
Cpus_allowed_list | awk '{print $2}'
0-5
root@juno:/sys/fs/cgroup/cpuset# cat /proc/$PID2/status | grep
Cpus_allowed_list | awk '{print $2}'
0

cgroupv2:

root@juno:/sys/fs/cgroup# echo +cpuset > cgroup.subtree_control

root@juno:/sys/fs/cgroup# chrt -d --sched-period 10
--sched-runtime 1 0 sleep 500 &
[1] 1687
root@juno:/sys/fs/cgroup# PID1=$!

root@juno:/sys/fs/cgroup# chrt -d --sched-period 10
--sched-runtime 1 0 sleep 500 &
[2] 1688
root@juno:/sys/fs/cgroup# PID2=$!

root@juno:/sys/fs/cgroup# mkdir A

root@juno:/sys/fs/cgroup# echo 0 > ./A/cpuset.mems
root@juno:/sys/fs/cgroup# echo 0 > ./A/cpuset.cpus

root@juno:/sys/fs/cgroup# echo $PID2 > ./A/cgroup.procs
-bash: echo: write error: Device or resource busy

root@juno:/sys/fs/cgroup# echo root > ./A/cpuset.cpus.partition

root@juno:/sys/fs/cgroup# echo $PID2 > ./A/cgroup.procs

root@juno:/sys/fs/cgroup# cat /proc/$PID1/status | grep
Cpus_allowed_list | awk '{print $2}'
0-5
root@juno:/sys/fs/cgroup# cat /proc/$PID2/status | grep
Cpus_allowed_list | awk '{print $2}'
0


Re: [PATCH 3/6] sched/deadline: Allow DL tasks on empty (cgroup v2) cpusets

2021-01-14 Thread Dietmar Eggemann
oot cpuset, hence its cpu mask will
> be the same as the root one. So, the bandwidth was already accounted,
> and the task can proceed.

LGTM.

After the '/sys/fs/cgroup# echo '+cpuset' > cgroup.subtree_control':

root's cpuset.cpus.effective == user.slice's cpuset.cpus.effective

> Signed-off-by: Daniel Bristot de Oliveira 
> Cc: Ingo Molnar 
> Cc: Peter Zijlstra 
> Cc: Juri Lelli 
> Cc: Vincent Guittot 
> Cc: Dietmar Eggemann 
> Cc: Steven Rostedt 
> Cc: Ben Segall 
> Cc: Mel Gorman 
> Cc: Daniel Bristot de Oliveira 
> Cc: Li Zefan 
> Cc: Tejun Heo 
> Cc: Johannes Weiner 
> Cc: Valentin Schneider 
> Cc: linux-kernel@vger.kernel.org
> Cc: cgro...@vger.kernel.org
> ---
>  kernel/sched/deadline.c | 7 +++
>  1 file changed, 7 insertions(+)
> 
> diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
> index 943aa32cc1bc..788a391657a5 100644
> --- a/kernel/sched/deadline.c
> +++ b/kernel/sched/deadline.c
> @@ -2871,6 +2871,13 @@ int dl_task_can_attach(struct task_struct *p,
>   bool overflow;
>   int ret;
>  
> + /*
> +  * The cpuset has no cpus assigned, so the thread will not
> +  * change its affinity.
> +  */
> + if (cpumask_empty(cs_cpus_allowed))
> + return 0;
> +
>   /*
>* The task is not moving to another root domain, so it is
>* already accounted.
> 


Re: [RFC PATCH v3 0/2] scheduler: expose the topology of clusters and add cluster scheduler

2021-01-12 Thread Dietmar Eggemann
On 08/01/2021 22:30, Song Bao Hua (Barry Song) wrote:
>  
>> -Original Message-
>> From: Morten Rasmussen [mailto:morten.rasmus...@arm.com]
>> Sent: Saturday, January 9, 2021 4:13 AM
>> To: Tim Chen 
>> Cc: Song Bao Hua (Barry Song) ;
>> valentin.schnei...@arm.com; catalin.mari...@arm.com; w...@kernel.org;
>> r...@rjwysocki.net; vincent.guit...@linaro.org; l...@kernel.org;
>> gre...@linuxfoundation.org; Jonathan Cameron ;
>> mi...@redhat.com; pet...@infradead.org; juri.le...@redhat.com;
>> dietmar.eggem...@arm.com; rost...@goodmis.org; bseg...@google.com;
>> mgor...@suse.de; mark.rutl...@arm.com; sudeep.ho...@arm.com;
>> aubrey...@linux.intel.com; linux-arm-ker...@lists.infradead.org;
>> linux-kernel@vger.kernel.org; linux-a...@vger.kernel.org;
>> linux...@openeuler.org; xuwei (O) ; Zengtao (B)
>> ; tiantao (H) 
>> Subject: Re: [RFC PATCH v3 0/2] scheduler: expose the topology of clusters 
>> and
>> add cluster scheduler
>>
>> On Thu, Jan 07, 2021 at 03:16:47PM -0800, Tim Chen wrote:
>>> On 1/6/21 12:30 AM, Barry Song wrote:
 ARM64 server chip Kunpeng 920 has 6 clusters in each NUMA node, and each
 cluster has 4 cpus. All clusters share L3 cache data while each cluster
 has local L3 tag. On the other hand, each cluster will share some
 internal system bus. This means cache is much more affine inside one 
 cluster
 than across clusters.
>>>
>>> There is a similar need for clustering in x86.  Some x86 cores could share
>> L2 caches that
>>> is similar to the cluster in Kupeng 920 (e.g. on Jacobsville there are 6 
>>> clusters
>>> of 4 Atom cores, each cluster sharing a separate L2, and 24 cores sharing
>> L3).
>>> Having a sched domain at the L2 cluster helps spread load among
>>> L2 domains.  This will reduce L2 cache contention and help with
>>> performance for low to moderate load scenarios.
>>
>> IIUC, you are arguing for the exact opposite behaviour, i.e. balancing
>> between L2 caches while Barry is after consolidating tasks within the
>> boundaries of a L3 tag cache. One helps cache utilization, the other
>> communication latency between tasks. Am I missing something?
> 
> Morten, this is not true.
> 
> we are both actually looking for the same behavior. My patch also
> has done the exact same behavior of spreading with Tim's patch.

That's the case for the load-balance path because of the extra Sched
Domain (SD) (CLS/MC_L2) below MC.

But in wakeup you add code which leads to a different packing strategy.

It looks like Tim's workload (SPECrate mcf) shows a performance boost
solely because of the changes the additional MC_L2 SD introduces in
load balance. The wakeup path is unchanged, i.e. llc-packing. IMHO we
have to carefully distinguish between packing vs. spreading in wakeup
and load-balance here.

> Considering the below two cases:
> Case 1. we have two tasks without any relationship running in a system with 2 
> clusters and 8 cpus.
> 
> Without the sched_domain of cluster, these two tasks might be put as below:
> +---++-+
> | ++   ++   || |
> | |task|   |task|   || |
> | |1   |   |2   |   || |
> | ++   ++   || |
> |   || |
> |   cluster1|| cluster2|
> +---++-+
> With the sched_domain of cluster, load balance will spread them as below:
> +---++-+
> | ++|| ++  |
> | |task||| |task|  |
> | |1   ||| |2   |  |
> | ++|| ++  |
> |   || |
> |   cluster1|| cluster2|
> +---++-+
> 
> Then task1 and tasks2 get more cache and decrease cache contention.
> They will get better performance.
> 
> That is what my original patch also can make. And tim's patch
> is also doing. Once we add a sched_domain, load balance will
> get involved.
> 
> 
> Case 2. we have 8 tasks, running in a system with 2 clusters and 8 cpus.
> But they are working in 4 groups:
> Task1 wakes up task4
> Task2 wakes up task5
> Task3 wakes up task6
> Task4 wakes up task7
> 
> With my changing in select_idle_sibling, the WAKE_AFFINE mechanism will
> try to put task1 and 4, task2 and 5, task3 and 6, task4 and 7 in same 
> clusters rather
> than putting all of them in the random one of the 8 cpus. However, the 8 tasks
> are still spreading among the 8 cpus with my change in select_idle_sibling
> as load balance is still working.
> 
> +---++--+
> | +++-+ || ++  +-+  |
> | |task||task | || |task|  |task |  |
> | |1   |

Re: [RFC PATCH v3 0/2] scheduler: expose the topology of clusters and add cluster scheduler

2021-01-12 Thread Dietmar Eggemann
On 11/01/2021 10:28, Morten Rasmussen wrote:
> On Fri, Jan 08, 2021 at 12:22:41PM -0800, Tim Chen wrote:
>>
>>
>> On 1/8/21 7:12 AM, Morten Rasmussen wrote:
>>> On Thu, Jan 07, 2021 at 03:16:47PM -0800, Tim Chen wrote:
 On 1/6/21 12:30 AM, Barry Song wrote:

[...]

>> I think it is going to depend on the workload.  If there are dependent
>> tasks that communicate with one another, putting them together
>> in the same cluster will be the right thing to do to reduce communication
>> costs.  On the other hand, if the tasks are independent, putting them 
>> together on the same cluster
>> will increase resource contention and spreading them out will be better.
> 
> Agree. That is exactly where I'm coming from. This is all about the task
> placement policy. We generally tend to spread tasks to avoid resource
> contention, SMT and caches, which seems to be what you are proposing to
> extend. I think that makes sense given it can produce significant
> benefits.
> 
>>
>> Any thoughts on what is the right clustering "tag" to use to clump
>> related tasks together?
>> Cgroup? Pid? Tasks with same mm?
> 
> I think this is the real question. I think the closest thing we have at
> the moment is the wakee/waker flip heuristic. This seems to be related.
> Perhaps the wake_affine tricks can serve as starting point?

wake_wide() switches between packing (select_idle_sibling(), llc_size
CPUs) and spreading (find_idlest_cpu(), all CPUs).

AFAICS, since none of the sched domains set SD_BALANCE_WAKE, currently
all wakeups are (llc-)packed.

 select_task_rq_fair()

   for_each_domain(cpu, tmp)

 if (tmp->flags & sd_flag)
   sd = tmp;


In case we would like to further distinguish between llc-packing and
even narrower (cluster or MC-L2)-packing, we would introduce a 2. level
packing vs. spreading heuristic further down in sis().

IMHO, Barry's current implementation doesn't do this right now. Instead
he's trying to pack on cluster first and if not successful look further
among the remaining llc CPUs for an idle CPU.


Re: [PATCH v2] fair/util_est: fix schedutil may use an old util_est.enqueued value at dequeue

2020-12-17 Thread Dietmar Eggemann
On 16/12/2020 05:50, Xuewen Yan wrote:
> From: Xuewen Yan 
> 
> when a task dequeued, it would update it's util and cfs_rq's util,
> the following calling relationship exists here:
> 
> update_load_avg() -> cfs_rq_util_change() -> cpufreq_update_util()->
> sugov_update_[shared\|single] -> sugov_get_util() -> cpu_util_cfs();
> 
> util = max {rq->cfs.avg.util_avg,rq->cfs.avg.util_est.enqueued };
> 
> For those tasks alternate long and short runs scenarios, the
> rq->cfs.avg.util_est.enqueued may be bigger than rq->cfs.avg.util_avg.
> but because the _task_util_est(p) didn't be subtracted, this would
> cause schedutil use an old util_est.enqueued value.
> 
> This could have an effect in task ramp-down and ramp-up scenarios.
> when the task dequeued, we hope the freq could be reduced as soon
> as possible. If the schedutil's value is the old util_est.enqueued, this
> may cause the cpu couldn't reduce it's freq.
> 
> separate the util_est_dequeue() into util_est_dequeue() and
> util_est_update(), and dequeue the _task_util_est(p) before update util.

IMHO we could still describe it a little bit differently:

---

sched/fair: Avoid stale CPU util_est value for schedutil in task dequeue


CPU (root cfs_rq) estimated utilization (util_est) is currently used in
dequeue_task_fair() to drive frequency selection before it is updated.


with:

CPU_util: rq->cfs.avg.util_avg
CPU_util_est: rq->cfs.avg.util_est
CPU_utilization : max(CPU_util, CPU_util_est)
task_util   : p->se.avg.util_avg
task_util_est   : p->se.avg.util_est

dequeue_task_fair()

/* (1) CPU_util and task_util update + inform schedutil about
   CPU_utilization changes */
for_each_sched_entity() /* 2 loops */
(dequeue_entity() ->) update_load_avg() -> cfs_rq_util_change()
 -> cpufreq_update_util() ->...-> sugov_update_[shared\|single]
 -> sugov_get_util() -> cpu_util_cfs()

/* (2) CPU_util_est and task_util_est update */
util_est_dequeue()


cpu_util_cfs() uses CPU_utilization which could lead to a false (too
high) utilization value for schedutil in task ramp-down or ramp-up
scenarios during task dequeue.

To mitigate the issue split the util_est update (2) into:

 (A) CPU_util_est update in util_est_dequeue()
 (B) task_util_est update in util_est_update()

Place (A) before (1) and keep (B) where (2) is. The latter is necessary
since (B) relies on task_util update in (1).

---

What do you think?
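For clarity, the resulting ordering in dequeue_task_fair() would then look
roughly like this (abbreviated sketch only, not the exact hunk of your
patch):

static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
{
	struct sched_entity *se = &p->se;
	int task_sleep = flags & DEQUEUE_SLEEP;

	util_est_dequeue(&rq->cfs, p);		/* (A) CPU_util_est update */

	for_each_sched_entity(se) {
		/* (1) CPU_util and task_util update; cfs_rq_util_change()
		 * -> cpufreq_update_util() informs schedutil from here */
		dequeue_entity(cfs_rq_of(se), se, flags);
		/* ... */
	}

	/* ... */

	util_est_update(&rq->cfs, p, task_sleep); /* (B) task_util_est update */
	hrtick_update(rq);
}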

> 
> Signed-off-by: Xuewen Yan 
> Signed-off-by: Dietmar Eggemann 

Please remove my Signed-off-by line. I'm just reviewing & testing it.

With the comments below addressed you can change it to a:

Reviewed-by: Dietmar Eggemann 

> ---
> Changes since v1:
> -change the util_est_dequeue/update to inline type
> -use unsigned int enqueued rather than util_est in util_est_dequeue
> -remove "cpu" var
> 
> ---
>  kernel/sched/fair.c | 37 +
>  1 file changed, 25 insertions(+), 12 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index ae7ceba..864d6b9 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -3945,22 +3945,31 @@ static inline bool within_margin(int value, int 
> margin)
>   return ((unsigned int)(value + margin - 1) < (2 * margin - 1));
>  }
>  
> -static void
> -util_est_dequeue(struct cfs_rq *cfs_rq, struct task_struct *p, bool 
> task_sleep)
> +static inline void util_est_dequeue(struct cfs_rq *cfs_rq,
> + struct task_struct *p)

Could you move util_est_dequeue() above within_margin(), since the latter
is only used in util_est_update()?

>  {
> - long last_ewma_diff;
> - struct util_est ue;
> - int cpu;
> + unsigned int enqueued;
>  
>   if (!sched_feat(UTIL_EST))
>   return;
>  
>   /* Update root cfs_rq's estimated utilization */
> - ue.enqueued  = cfs_rq->avg.util_est.enqueued;
> - ue.enqueued -= min_t(unsigned int, ue.enqueued, _task_util_est(p));
> - WRITE_ONCE(cfs_rq->avg.util_est.enqueued, ue.enqueued);
> + enqueued  = cfs_rq->avg.util_est.enqueued;
> + enqueued -= min_t(unsigned int, enqueued, _task_util_est(p));
> + WRITE_ONCE(cfs_rq->avg.util_est.enqueued, enqueued);
>  
>   trace_sched_util_est_cfs_tp(cfs_rq);
> +}
> +
> +static inline void util_est_update(struct cfs_rq *cfs_rq,
> + struct task_struct *p,
> + bool task_sleep)

Nitpick: util_est_enqueue() aligns its arguments with spaces whereas you
use tabs for util_est_dequeue() and util_est_update(). I know there is no
strict rule, but IMHO these util_est_xxx() functions should use the same
alignment.

[...]

During review I got sus

Re: [PATCH] fair/util_est: Separate util_est_dequeue() for cfs_rq_util_change

2020-12-15 Thread Dietmar Eggemann
On 09/12/2020 11:44, Xuewen Yan wrote:
> when a task dequeued, it will update it's util, and cfs_rq_util_change
> would check rq's util, if the cfs_rq->avg.util_est.enqueued is bigger
> than  cfs_rq->avg.util_avg, but because the cfs_rq->avg.util_est.enqueued
> didn't be decreased, this would cause bigger cfs_rq_util by mistake,
> as a result, cfs_rq_util_change may change freq unreasonablely.
> 
> separate the util_est_dequeue() into util_est_dequeue() and
> util_est_update(), and dequeue the _task_util_est(p) before update util.

I assume this patch header needs a little more substance so that less
involved folks understand the issue as well. Describing the testcase
which reveals the problem would help here too. 

> Signed-off-by: Xuewen Yan 
> ---
>  kernel/sched/fair.c | 24 +++-
>  1 file changed, 19 insertions(+), 5 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index ae7ceba..20ecfd5 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -3946,11 +3946,9 @@ static inline bool within_margin(int value, int margin)
>  }
>  
>  static void
> -util_est_dequeue(struct cfs_rq *cfs_rq, struct task_struct *p, bool 
> task_sleep)
> +util_est_dequeue(struct cfs_rq *cfs_rq, struct task_struct *p)

Not sure why util_est_enqueue is inline and util_est_dequeue() and
util_est_update() aren't?

>  {
> - long last_ewma_diff;
>   struct util_est ue;

You would just need an 'unsigned int enqueued' here, like in util_est_enqueue().

> - int cpu;
>  
>   if (!sched_feat(UTIL_EST))
>   return;
> @@ -3961,6 +3959,17 @@ static inline bool within_margin(int value, int margin)
>   WRITE_ONCE(cfs_rq->avg.util_est.enqueued, ue.enqueued);
>  
>   trace_sched_util_est_cfs_tp(cfs_rq);
> +}
> +
> +static void
> +util_est_update(struct cfs_rq *cfs_rq, struct task_struct *p, bool 
> task_sleep)
> +{
> + long last_ewma_diff;
> + struct util_est ue;
> + int cpu;

Nitpick: 'int cpu' not needed

---8<---

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c3685a743a76..53dfb20d101e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3956,28 +3956,28 @@ static inline bool within_margin(int value, int margin)
return ((unsigned int)(value + margin - 1) < (2 * margin - 1));
 }
 
-static void
-util_est_dequeue(struct cfs_rq *cfs_rq, struct task_struct *p)
+static inline void util_est_dequeue(struct cfs_rq *cfs_rq,
+   struct task_struct *p)
 {
-   struct util_est ue;
+   unsigned int enqueued;
 
if (!sched_feat(UTIL_EST))
return;
 
/* Update root cfs_rq's estimated utilization */
-   ue.enqueued  = cfs_rq->avg.util_est.enqueued;
-   ue.enqueued -= min_t(unsigned int, ue.enqueued, _task_util_est(p));
-   WRITE_ONCE(cfs_rq->avg.util_est.enqueued, ue.enqueued);
+   enqueued  = cfs_rq->avg.util_est.enqueued;
+   enqueued -= min_t(unsigned int, enqueued, _task_util_est(p));
+   WRITE_ONCE(cfs_rq->avg.util_est.enqueued, enqueued);
 
trace_sched_util_est_cfs_tp(cfs_rq);
 }
 
-static void
-util_est_update(struct cfs_rq *cfs_rq, struct task_struct *p, bool task_sleep)
+static inline void util_est_update(struct cfs_rq *cfs_rq,
+  struct task_struct *p,
+  bool task_sleep)
 {
long last_ewma_diff;
struct util_est ue;
-   int cpu;
 
if (!sched_feat(UTIL_EST))
return;
@@ -4021,8 +4021,7 @@ util_est_update(struct cfs_rq *cfs_rq, struct task_struct 
*p, bool task_sleep)
 * To avoid overestimation of actual task utilization, skip updates if
 * we cannot grant there is idle time in this CPU.
 */
-   cpu = cpu_of(rq_of(cfs_rq));
-   if (task_util(p) > capacity_orig_of(cpu))
+   if (task_util(p) > capacity_orig_of(cpu_of(rq_of(cfs_rq))))
return;
 
/*
-- 
2.17.1


Re: [PATCH] fair/util_est: Separate util_est_dequeue() for cfs_rq_util_change

2020-12-15 Thread Dietmar Eggemann
On 15/12/2020 13:56, Xuewen Yan wrote:
> On Tue, Dec 15, 2020 at 5:39 PM Vincent Guittot
>  wrote:
>>
>> On Mon, 14 Dec 2020 at 19:46, Dietmar Eggemann  
>> wrote:
>>>
>>> On 11/12/2020 13:03, Ryan Y wrote:

[...]

>>> It's true that 'sugov_get_util() -> cpu_util_cfs()' can use
>>> rq->cfs.avg.util_est.enqueued before _task_util_est(p) is subtracted
>>> from it.
>>>
>>> But isn't rq->cfs.avg.util_est.enqueued (in this case 601) always close
>>> to rq->cfs.avg.util_avg (597) since the task was just running?
>>> The cfs_rq utilization contains a blocked (sleeping) task.
>>
>> There will be a difference if the task alternates long and short runs
>> in which case util_avg is lower than util_est. But even in this case,
>> the freq will be update at next enqueue/dequeue/tick.
>> The only real case could be when cpu goes idle in shallow state (WFI)
>> which is impacted by the freq/voltage. But even in this case, the
>> situation should not be long
>>
>> That being said, I agree that the value used by schedutil is not
>> correct at dequeue

Ok, makes sense.

> Hi Dietmar,
> 
> as Vincent said, like the following scenario:
>running  sleep
> runningsleep
> |---|
>   ||
> 
>   ^^
> at the ^^ time, the util_avg is lower than util_est.
> we hope that the CPU frequency would be reduced as soon as possible after
> the task is dequeued. But the issue affects the sensitivity of cpu
> frequency reduce.
> worse, since the time, if there is no task enqueue or other scenarios where 
> the
> load is updated, the cpu may always maintain a high frequency.
> 
> if keep the util_est_dequeue as it is, are there other concerns,
> or this patch would affect other places of system?

I see. So essentially this could have an effect in task ramp-down and ramp-up 
scenarios.

periodic ramp-down task [8/16 -> 4/16 -> 1/16ms], 3 consecutive 
dequeue_task_fair():  

task0-0-1690  [000] 43.677788: sched_pelt_se:  cpu=0 path=(null) 
comm=task0-0 pid=1690 load=283 runnable=283 util=283 update_time=4963776
task0-0-1690  [000] 43.677792: sched_pelt_cfs: cpu=0 path=/ load=283 
runnable=283 util=283 update_time=4963776
task0-0-1690  [000] 43.677798: bprint: sugov_get_util: CPU0 
rq->cfs.avg.util_avg=283 rq->cfs.avg.util_est.enqueued=249
task0-0-1690  [000] 43.677803: sched_util_est_cfs: cpu=0 path=/ enqueued=0 
ewma=0 util=283
task0-0-1690  [000] 43.677805: sched_util_est_se:  cpu=0 path=(null) 
comm=task0-0 pid=1690 enqueued=283 ewma=283 util=283
task0-0-1690  [000] 43.677810: bprint: dequeue_task_fair: CPU0 
[task0-0 1690] rq->cfs.avg.util_avg=[270->283] 
rq->cfs.avg.util_est.enqueued=[249->0]

task0-0-1690  [000] 43.698764: sched_pelt_se:  cpu=0 path=(null) 
comm=task0-0 pid=1690 load=247 runnable=248 util=248 update_time=43363011584
task0-0-1690  [000] 43.698768: sched_pelt_cfs: cpu=0 path=/ load=248 
runnable=248 util=248 update_time=43363011584
--> task0-0-1690  [000] 43.698774: bprint: sugov_get_util: CPU0 
rq->cfs.avg.util_avg=248 rq->cfs.avg.util_est.enqueued=283 <-- !!!
task0-0-1690  [000] 43.698778: sched_util_est_cfs: cpu=0 path=/ enqueued=0 
ewma=0 util=248
task0-0-1690  [000] 43.698780: sched_util_est_se:  cpu=0 path=(null) 
comm=task0-0 pid=1690 enqueued=249 ewma=274 util=248
task0-0-1690  [000] 43.698785: bprint: dequeue_task_fair: CPU0 
[task0-0 1690] rq->cfs.avg.util_avg=[228->248] 
rq->cfs.avg.util_est.enqueued=[283->0]


task0-0-1690  [000] 43.714120: sched_pelt_se:  cpu=0 path=(null) 
comm=task0-0 pid=1690 load=183 runnable=183 util=183 update_time=43384443904
task0-0-1690  [000] 43.714125: sched_pelt_cfs: cpu=0 path=/ load=183 
runnable=183 util=183 update_time=43384443904
--> task0-0-1690  [000] 43.714131: bprint: sugov_get_util: CPU0 
rq->cfs.avg.util_avg=183 rq->cfs.avg.util_est.enqueued=275 <-- !!!
task0-0-1690  [000] 43.714135: sched_util_est_cfs: cpu=0 path=/ enqueued=0 
ewma=0 util=183
task0-0-1690  [000] 43.714138: sched_util_est_se:  cpu=0 path=(null) 
comm=task0-0 pid=1690 enqueued=183 ewma=251 util=183
task0-0-1690  [000] 43.714142: bprint: dequeue_task_fair: CPU0 
[task0-0 1690] rq->cfs.avg.util_avg=[163->183] 
rq->cfs.avg.util_est.enqueued=[275->0]


Re: [PATCH] fair/util_est: Separate util_est_dequeue() for cfs_rq_util_change

2020-12-14 Thread Dietmar Eggemann
On 11/12/2020 13:03, Ryan Y wrote:
> Hi Dietmar,
> 
> Yes! That's exactly what I meant.
> 
>> The issue is that sugov_update_[shared\|single] -> sugov_get_util() ->
>> cpu_util_cfs() operates on an old  cfs_rq->avg.util_est.enqueued value?
> 
> well, because of this, when the p dequeued, _task_util_est(p) should be
> subtracted before cfs_rq_util_change().
> however, the original util_est_dequeue() dequeue the util_est and update
> the
> p->se.avg.util_est together.
> so I separate the original util_est_dequeue() to deal with the issue.

OK, I see.

I ran a testcase '50% periodic task 'task0-0' (8ms/16ms)' with
PELT + proprietary trace events within dequeue_task_fair() call:

task0-0-1710 [002] 218.215535: sched_pelt_se:  cpu=2 path=(null) 
comm=task0-0 pid=1710 load=596 runnable=597 util=597 update_time=218123022336
task0-0-1710 [002] 218.215536: sched_pelt_cfs: cpu=2 path=/ load=597 
runnable=597 util=597 update_time=218123022336
task0-0-1710 [002] 218.215538: bprint: sugov_get_util: CPU2 
rq->cfs.avg.util_avg=597 rq->cfs.avg.util_est.enqueued=601
task0-0-1710 [002] 218.215540: sched_util_est_cfs: cpu=2 path=/ enqueued=0 
ewma=0 util=597
task0-0-1710 [002] 218.215542: bprint: dequeue_task_fair: CPU2 
[task0-0 1710] rq->cfs.avg.util_avg=[576->597] 
rq->cfs.avg.util_est.enqueued=[601->0]

It's true that 'sugov_get_util() -> cpu_util_cfs()' can use
rq->cfs.avg.util_est.enqueued before _task_util_est(p) is subtracted
from it.

But isn't rq->cfs.avg.util_est.enqueued (in this case 601) always close
to rq->cfs.avg.util_avg (597) since the task was just running?
The cfs_rq utilization contains a blocked (sleeping) task.

If I ran with your patch, cpu_util_cfs() would choose between 597 and 0,
whereas without it, it chooses between 597 and 601.

Do you have a specific use case in mind? Or even test results showing a benefit
of your patch?

> Dietmar Eggemann  于2020年12月11日周五 下午7:30写道:
> 
>> Hi Yan,
>>
>> On 09/12/2020 11:44, Xuewen Yan wrote:
>>> when a task dequeued, it will update it's util, and cfs_rq_util_change
>>> would check rq's util, if the cfs_rq->avg.util_est.enqueued is bigger
>>> than  cfs_rq->avg.util_avg, but because the cfs_rq->avg.util_est.enqueued
>>> didn't be decreased, this would cause bigger cfs_rq_util by mistake,
>>> as a result, cfs_rq_util_change may change freq unreasonablely.
>>>
>>> separate the util_est_dequeue() into util_est_dequeue() and
>>> util_est_update(), and dequeue the _task_util_est(p) before update util.
>>
>> The issue is that sugov_update_[shared\|single] -> sugov_get_util() ->
>> cpu_util_cfs() operates on an old  cfs_rq->avg.util_est.enqueued value?
>>
>> cpu_util_cfs()
>>
>> if (sched_feat(UTIL_EST))
>> util = max_t(util, READ_ONCE(rq->cfs.avg.util_est.enqueued))
>>  ^
>>
>> dequeue_task_fair() (w/ your patch, moving (1) before (2))
>>
>> /* (1) update cfs_rq->avg.util_est.enqueued */
>> util_est_dequeue()
>>
>> /* (2) potential p->se.avg.util_avg update */
>> /* 2 for loops */
>> for_each_sched_entity()
>>
>> /* this can only lead to a freq change for a root cfs_rq */
>> (dequeue_entity() ->) update_load_avg() -> cfs_rq_util_change()
>>  -> cpufreq_update_util() ->...-> sugov_update_[shared\|single]
>>
>> /* (3) potential update p->se.avg.util_est */
>> util_est_update()
>>
>>
>> We do need (3) after (2) because of:
>>
>> util_est_update()
>> ...
>> ue.enqueued = (task_util(p) | UTIL_AVG_UNCHANGED); task_util
>> ...   ^
>>   p->se.avg.util_avg
>>
>>
>> Did I get this right?
>>
>> [...]


Re: [PATCH] fair/util_est: Separate util_est_dequeue() for cfs_rq_util_change

2020-12-11 Thread Dietmar Eggemann
Hi Yan,

On 09/12/2020 11:44, Xuewen Yan wrote:
> when a task dequeued, it will update it's util, and cfs_rq_util_change
> would check rq's util, if the cfs_rq->avg.util_est.enqueued is bigger
> than  cfs_rq->avg.util_avg, but because the cfs_rq->avg.util_est.enqueued
> didn't be decreased, this would cause bigger cfs_rq_util by mistake,
> as a result, cfs_rq_util_change may change freq unreasonablely.
> 
> separate the util_est_dequeue() into util_est_dequeue() and
> util_est_update(), and dequeue the _task_util_est(p) before update util.

The issue is that sugov_update_[shared\|single] -> sugov_get_util() ->
cpu_util_cfs() operates on an old  cfs_rq->avg.util_est.enqueued value?

cpu_util_cfs()

if (sched_feat(UTIL_EST))
util = max_t(util, READ_ONCE(rq->cfs.avg.util_est.enqueued))
 ^

dequeue_task_fair() (w/ your patch, moving (1) before (2))

/* (1) update cfs_rq->avg.util_est.enqueued */
util_est_dequeue()

/* (2) potential p->se.avg.util_avg update */
/* 2 for loops */
for_each_sched_entity()

/* this can only lead to a freq change for a root cfs_rq */
(dequeue_entity() ->) update_load_avg() -> cfs_rq_util_change()
 -> cpufreq_update_util() ->...-> sugov_update_[shared\|single]

/* (3) potential update p->se.avg.util_est */
util_est_update()


We do need (3) after (2) because of:

util_est_update()
...
ue.enqueued = (task_util(p) | UTIL_AVG_UNCHANGED); task_util
...   ^
  p->se.avg.util_avg


Did I get this right?

[...]


Re: [PATCH V4 3/3] thermal: cpufreq_cooling: Reuse sched_cpu_util() for SMP platforms

2020-12-08 Thread Dietmar Eggemann
On 07/12/2020 13:17, Viresh Kumar wrote:
> On 03-12-20, 12:54, Dietmar Eggemann wrote:
>> On 24/11/2020 07:26, Viresh Kumar wrote:

[...]

>> When I ran schbench (-t 16 -r 5) on hikey960 I get multiple (~50)
>> instances of ~80ms task activity phase and then ~20ms idle phase on all
>> CPUs.
>>
>> So I assume that scenario 1 is at the beginning (but you mentioned the
>> task were sleeping?)
> 
> I am not able to find the exact values I used, but I did something
> like this to create a scenario where the old computations shall find
> the CPU as idle in the last IPA window:
> 
> - schbench -m 2 -t 4 -s 25000 -c 2 -r 60
> 
> - sampling rate of IPA to 10 ms
> 
> With this IPA wakes up many times and finds the CPUs to have been idle
> in the last IPA window (i.e. 10ms).

Ah, this makes sense.

So with this there are only 8 worker threads w/ 20ms runtime and 75ms
period (30ms message thread time (-C) and 25 latency (-c)).

So much more idle time between two invocations of the worker/message
threads and more IPA sampling.

[...]

>>>  Old: thermal_power_cpu_get_power: cpus=,00ff freq=120 
>>> total_load=800 load={{0x64,0x64,0x64,0x64,0x64,0x64,0x64,0x64}} 
>>> dynamic_power=5280
>>>  New: thermal_power_cpu_get_power: cpus=,00ff freq=120 
>>> total_load=708 load={{0x4d,0x5c,0x5c,0x5b,0x5c,0x5c,0x51,0x5b}} 
>>> dynamic_power=4672
>>>
>>> As can be seen, the idle time based load is 100% for all the CPUs as it
>>> took only the last window into account, but in reality the CPUs aren't
>>> that loaded as shown by the utilization numbers.
>>
>> Is this an IPA sampling at the end of the ~20ms idle phase?
> 
> This is during the phase where the CPUs were fully busy for the last
> period.

OK.



Re: [PATCH 1/4] sched/fair: Remove SIS_AVG_CPU

2020-12-08 Thread Dietmar Eggemann
On 07/12/2020 10:15, Mel Gorman wrote:
> SIS_AVG_CPU was introduced as a means of avoiding a search when the
> average search cost indicated that the search would likely fail. It
> was a blunt instrument and disabled by 4c77b18cf8b7 ("sched/fair: Make
> select_idle_cpu() more aggressive") and later replaced with a proportional
> search depth by 1ad3aaf3fcd2 ("sched/core: Implement new approach to
> scale select_idle_cpu()").
> 
> While there are corner cases where SIS_AVG_CPU is better, it has now been
> disabled for almost three years. As the intent of SIS_PROP is to reduce
> the time complexity of select_idle_cpu(), lets drop SIS_AVG_CPU and focus
> on SIS_PROP as a throttling mechanism.
> 
> Signed-off-by: Mel Gorman 
> ---
>  kernel/sched/fair.c | 3 ---
>  kernel/sched/features.h | 1 -
>  2 files changed, 4 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 98075f9ea9a8..23934dbac635 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6161,9 +6161,6 @@ static int select_idle_cpu(struct task_struct *p, 
> struct sched_domain *sd, int t
>   avg_idle = this_rq()->avg_idle / 512;
>   avg_cost = this_sd->avg_scan_cost + 1;
>  
> - if (sched_feat(SIS_AVG_CPU) && avg_idle < avg_cost)
> - return -1;
> -
>   if (sched_feat(SIS_PROP)) {
>   u64 span_avg = sd->span_weight * avg_idle;
>   if (span_avg > 4*avg_cost)

Nitpick:

Since now avg_cost and avg_idle are only used w/ SIS_PROP, they could go
completely into the SIS_PROP if condition.

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 09f6f0edead4..fce9457cccb9 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6121,7 +6121,6 @@ static int select_idle_cpu(struct task_struct *p, struct 
sched_domain *sd, int t
 {
struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
struct sched_domain *this_sd;
-   u64 avg_cost, avg_idle;
u64 time;
int this = smp_processor_id();
int cpu, nr = INT_MAX;
@@ -6130,14 +6129,13 @@ static int select_idle_cpu(struct task_struct *p, 
struct sched_domain *sd, int t
if (!this_sd)
return -1;
 
-   /*
-* Due to large variance we need a large fuzz factor; hackbench in
-* particularly is sensitive here.
-*/
-   avg_idle = this_rq()->avg_idle / 512;
-   avg_cost = this_sd->avg_scan_cost + 1;
-
if (sched_feat(SIS_PROP)) {
+   /*
+* Due to large variance we need a large fuzz factor; hackbench 
in
+* particularly is sensitive here.
+*/
+   u64 avg_idle = this_rq()->avg_idle / 512;
+   u64 avg_cost = this_sd->avg_scan_cost + 1;
u64 span_avg = sd->span_weight * avg_idle;
if (span_avg > 4*avg_cost)
nr = div_u64(span_avg, avg_cost);
-- 
2.17.1


Re: [PATCH 2/4] sched/fair: Do not replace recent_used_cpu with the new target

2020-12-08 Thread Dietmar Eggemann
On 07/12/2020 10:15, Mel Gorman wrote:
> After select_idle_sibling, p->recent_used_cpu is set to the
> new target. However on the next wakeup, prev will be the same as

I'm confused here. Isn't current->recent_used_cpu set to 'cpu =
smp_processor_id()' after sis()? Looking at v5.10-rc6.

[...]

> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 23934dbac635..01b38fc17bca 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6274,6 +6274,7 @@ static int select_idle_sibling(struct task_struct *p, 
> int prev, int target)
>  
>   /* Check a recently used CPU as a potential idle candidate: */
>   recent_used_cpu = p->recent_used_cpu;
> + p->recent_used_cpu = prev;
>   if (recent_used_cpu != prev &&
>   recent_used_cpu != target &&
>   cpus_share_cache(recent_used_cpu, target) &&

p->recent_used_cpu is already set to prev in this if condition.

asym_fits_capacity(task_util, recent_used_cpu)) {
/*
 * Replace recent_used_cpu with prev as it is a potential
 * candidate for the next wake:
 */
p->recent_used_cpu = prev;
return recent_used_cpu;

> @@ -6765,9 +6766,6 @@ select_task_rq_fair(struct task_struct *p, int 
> prev_cpu, int wake_flags)
>   } else if (wake_flags & WF_TTWU) { /* XXX always ? */
>   /* Fast path */
>   new_cpu = select_idle_sibling(p, prev_cpu, new_cpu);
> -
> - if (want_affine)
> - current->recent_used_cpu = cpu;
>   }
>   rcu_read_unlock();


Re: [PATCH V4 3/3] thermal: cpufreq_cooling: Reuse sched_cpu_util() for SMP platforms

2020-12-03 Thread Dietmar Eggemann
On 24/11/2020 07:26, Viresh Kumar wrote:
> Several parts of the kernel are already using the effective CPU
> utilization (as seen by the scheduler) to get the current load on the
> CPU, do the same here instead of depending on the idle time of the CPU,
> which isn't that accurate comparatively.
> 
> This is also the right thing to do as it makes the cpufreq governor
> (schedutil) align better with the cpufreq_cooling driver, as the power
> requested by cpufreq_cooling governor will exactly match the next
> frequency requested by the schedutil governor since they are both using
> the same metric to calculate load.
> 
> This was tested on ARM Hikey6220 platform with hackbench, sysbench and
> schbench. None of them showed any regression or significant
> improvements. Schbench is the most important ones out of these as it
> creates the scenario where the utilization numbers provide a better
> estimate of the future.
> 
> Scenario 1: The CPUs were mostly idle in the previous polling window of
> the IPA governor as the tasks were sleeping and here are the details
> from traces (load is in %):
> 
>  Old: thermal_power_cpu_get_power: cpus=,00ff freq=120 
> total_load=203 load={{0x35,0x1,0x0,0x31,0x0,0x0,0x64,0x0}} dynamic_power=1339
>  New: thermal_power_cpu_get_power: cpus=,00ff freq=120 
> total_load=600 load={{0x60,0x46,0x45,0x45,0x48,0x3b,0x61,0x44}} 
> dynamic_power=3960

When I ran schbench (-t 16 -r 5) on hikey960 I got multiple (~50)
instances of an ~80ms task activity phase followed by a ~20ms idle phase
on all CPUs.

So I assume that scenario 1 is at the beginning (but you mentioned the
tasks were sleeping?) and scenario 2 is somewhere in the middle of the
testrun?

IMHO, the util-based approach delivers clearly better results at the
beginning and at the end of the entire testrun. During the testrun, the
util-based and the idle-based approach deliver similar results.

It's a little bit tricky to compare test results since the IPA sampling
rate is 100ms and the load values you get depend on how the workload
pattern and the IPA sampling align.

> Here, the "Old" line gives the load and requested_power (dynamic_power
> here) numbers calculated using the idle time based implementation, while
> "New" is based on the CPU utilization from scheduler.
> 
> As can be clearly seen, the load and requested_power numbers are simply
> incorrect in the idle time based approach and the numbers collected from
> CPU's utilization are much closer to the reality.

I assume the IPA sampling is done after ~50ms of the first task activity
phase.

> Scenario 2: The CPUs were busy in the previous polling window of the IPA
> governor:
> 
>  Old: thermal_power_cpu_get_power: cpus=,00ff freq=120 
> total_load=800 load={{0x64,0x64,0x64,0x64,0x64,0x64,0x64,0x64}} 
> dynamic_power=5280
>  New: thermal_power_cpu_get_power: cpus=,00ff freq=120 
> total_load=708 load={{0x4d,0x5c,0x5c,0x5b,0x5c,0x5c,0x51,0x5b}} 
> dynamic_power=4672
> 
> As can be seen, the idle time based load is 100% for all the CPUs as it
> took only the last window into account, but in reality the CPUs aren't
> that loaded as shown by the utilization numbers.

Is this an IPA sampling at the end of the ~20ms idle phase?

[...]
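
To make the comparison concrete, a small illustrative sketch of the two load
definitions being contrasted here. The util-based variant mirrors the
(util * 100) / max computation used by this series' get_load(); the numbers
are made up:

#include <stdio.h>

/* Old scheme: percentage of non-idle time in the last IPA polling window. */
static unsigned int load_pct_idle_based(unsigned long busy_us, unsigned long window_us)
{
	return (unsigned int)(busy_us * 100UL / window_us);
}

/* New scheme: effective CPU utilization scaled against the CPU capacity,
 * as in get_load() -> (util * 100) / max. */
static unsigned int load_pct_util_based(unsigned long util, unsigned long capacity)
{
	return (unsigned int)(util * 100UL / capacity);
}

int main(void)
{
	/* A CPU busy for the whole 100ms window reports 100% ... */
	printf("idle-time based: %u%%\n", load_pct_idle_based(100000, 100000));
	/* ... while its PELT utilization may say it is only ~57% loaded. */
	printf("util based:      %u%%\n", load_pct_util_based(590, 1024));
	return 0;
}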


Re: [RFC PATCH v2 0/2] scheduler: expose the topology of clusters and add cluster scheduler

2020-12-01 Thread Dietmar Eggemann
On 01/12/2020 03:59, Barry Song wrote:

[...]

> Although I believe there is still a lot to do, sending a RFC to get feedbacks
> of community experts might be helpful for the next step.
> 
> Barry Song (1):
>   scheduler: add scheduler level for clusters
> 
> Jonathan Cameron (1):
>   topology: Represent clusters of CPUs within a die.

Just to make sure. Since this is v2, the v1 is
https://lkml.kernel.org/r//20201016152702.1513592-1-jonathan.came...@huawei.com

Might not be obvious to everyone since sender and subject have changed.


Re: [PATCH V4 2/3] sched/core: Rename schedutil_cpu_util() and allow rest of the kernel to use it

2020-11-30 Thread Dietmar Eggemann
On 24/11/2020 14:22, Viresh Kumar wrote:
> On 24-11-20, 09:10, Quentin Perret wrote:
>> Hey Viresh,
>>
>> On Tuesday 24 Nov 2020 at 11:56:15 (+0530), Viresh Kumar wrote:
>>> There is nothing schedutil specific in schedutil_cpu_util(), rename it
>>> to effective_cpu_util(). Also create and expose another wrapper
>>> sched_cpu_util() which can be used by other parts of the kernel, like
>>> thermal core (that will be done in a later commit).
>>>
>>> Signed-off-by: Viresh Kumar 
>>> ---
>>>  include/linux/sched.h| 21 +
>>>  kernel/sched/core.c  | 11 +--
>>>  kernel/sched/cpufreq_schedutil.c |  2 +-
>>>  kernel/sched/fair.c  |  6 +++---
>>>  kernel/sched/sched.h | 19 ++-
>>>  5 files changed, 36 insertions(+), 23 deletions(-)
>>>
>>> diff --git a/include/linux/sched.h b/include/linux/sched.h
>>> index 063cd120b459..926b944dae5e 100644
>>> --- a/include/linux/sched.h
>>> +++ b/include/linux/sched.h
>>> @@ -1926,6 +1926,27 @@ extern long sched_getaffinity(pid_t pid, struct 
>>> cpumask *mask);
>>>  #define TASK_SIZE_OF(tsk)  TASK_SIZE
>>>  #endif
>>>  
>>> +#ifdef CONFIG_SMP
>>> +/**
>>> + * enum cpu_util_type - CPU utilization type
>>> + * @FREQUENCY_UTIL:Utilization used to select frequency
>>> + * @ENERGY_UTIL:   Utilization used during energy calculation
>>> + *
>>> + * The utilization signals of all scheduling classes (CFS/RT/DL) and IRQ 
>>> time
>>> + * need to be aggregated differently depending on the usage made of them. 
>>> This
>>> + * enum is used within sched_cpu_util() to differentiate the types of
>>> + * utilization expected by the callers, and adjust the aggregation 
>>> accordingly.
>>> + */
>>> +enum cpu_util_type {
>>> +   FREQUENCY_UTIL,
>>> +   ENERGY_UTIL,
>>> +};
>>> +
>>> +/* Returns effective CPU utilization, as seen by the scheduler */
>>> +unsigned long sched_cpu_util(int cpu, enum cpu_util_type type,
>>> +unsigned long max);
>>
>> Are 'type' and 'max' useful to anybody outside of kernel/sched ?
>> If not then how about we hide them, keep the cpu_util_type enum in
>> kernel/sched/sched.h and evaluate arch_scale_cpu_capacity() in
>> sched_cpu_util() directly?
> 
> cpufreq_cooling uses 'max' (as can be seen in the next patch).

But the enum cpu_util_type type doesn't have to be exported outside the
task scheduler code?

--8<--

diff --git a/drivers/thermal/cpufreq_cooling.c 
b/drivers/thermal/cpufreq_cooling.c
index 5aff2ac4b77f..cd9d717654a8 100644
--- a/drivers/thermal/cpufreq_cooling.c
+++ b/drivers/thermal/cpufreq_cooling.c
@@ -149,7 +149,7 @@ static u32 get_load(struct cpufreq_cooling_device 
*cpufreq_cdev, int cpu,
unsigned long max = arch_scale_cpu_capacity(cpu);
unsigned long util;
 
-   util = sched_cpu_util(cpu, ENERGY_UTIL, max);
+   util = sched_cpu_util(cpu, max);
return (util * 100) / max;
 }
 
diff --git a/include/linux/sched.h b/include/linux/sched.h
index fcd70c075349..0511e4fb946f 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1954,24 +1954,8 @@ extern long sched_getaffinity(pid_t pid, struct cpumask 
*mask);
 #endif
 
 #ifdef CONFIG_SMP
-/**
- * enum cpu_util_type - CPU utilization type
- * @FREQUENCY_UTIL:Utilization used to select frequency
- * @ENERGY_UTIL:   Utilization used during energy calculation
- *
- * The utilization signals of all scheduling classes (CFS/RT/DL) and IRQ time
- * need to be aggregated differently depending on the usage made of them. This
- * enum is used within sched_cpu_util() to differentiate the types of
- * utilization expected by the callers, and adjust the aggregation accordingly.
- */
-enum cpu_util_type {
-   FREQUENCY_UTIL,
-   ENERGY_UTIL,
-};
-
 /* Returns effective CPU utilization, as seen by the scheduler */
-unsigned long sched_cpu_util(int cpu, enum cpu_util_type type,
-unsigned long max);
+unsigned long sched_cpu_util(int cpu, unsigned long max);
 #endif /* CONFIG_SMP */
 
 #ifdef CONFIG_RSEQ
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 7f20dacc2fa7..270f10e01ad2 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5757,11 +5757,10 @@ unsigned long effective_cpu_util(int cpu, unsigned long 
util_cfs,
return min(max, util);
 }
 
-unsigned long sched_cpu_util(int cpu, enum cpu_util_type type,
-unsigned long max)
+unsigned long sched_cpu_util(int cpu, unsigned long max)
 {
-   return effective_cpu_util(cpu, cpu_util_cfs(cpu_rq(cpu)), max, type,
- NULL);
+   return effective_cpu_util(cpu, cpu_util_cfs(cpu_rq(cpu)), max,
+ ENERGY_UTIL, NULL);
 }
 #endif /* CONFIG_SMP */
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index b7611020d5cf..a49d6c3e9147 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2561,6 +2561,21 @@ static inline unsigned long 

Re: [PATCH V3] PM / EM: Micro optimization in em_cpu_energy

2020-11-24 Thread Dietmar Eggemann
On 23/11/2020 11:35, Pavankumar Kondeti wrote:
> When the sum of the utilization of CPUs in a performance domain is
> zero, return the energy as 0 without doing any computations.
> 
> Signed-off-by: Pavankumar Kondeti 
> ---
> V3: %s/power/performance as corrected by Quentin
> V2: Fixed the function name in the commit message.
> 
>  include/linux/energy_model.h | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/include/linux/energy_model.h b/include/linux/energy_model.h
> index b67a51c..8810f1f 100644
> --- a/include/linux/energy_model.h
> +++ b/include/linux/energy_model.h
> @@ -103,6 +103,9 @@ static inline unsigned long em_cpu_energy(struct 
> em_perf_domain *pd,
>   struct em_perf_state *ps;
>   int i, cpu;
>  
> + if (!sum_util)
> + return 0;
> +
>   /*
>* In order to predict the performance state, map the utilization of
>* the most utilized CPU of the performance domain to a requested


LGTM.

Reviewed-by: Dietmar Eggemann 
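
Purely for illustration, a standalone sketch of the estimate em_cpu_energy()
performs and of the early return added above. The perf-state handling is
simplified and the cost formula follows the Energy Model documentation; this
is not the kernel code itself:

#include <stdio.h>

struct em_ps { unsigned long frequency; unsigned long cost; };

static unsigned long em_cpu_energy_sketch(const struct em_ps *table, int nr_states,
					  unsigned long max_util, unsigned long sum_util,
					  unsigned long scale_cpu)
{
	unsigned long freq, max_freq = table[nr_states - 1].frequency;
	int i;

	if (!sum_util)		/* the new early return: pd completely idle */
		return 0;

	/* Map the utilization of the most utilized CPU to a frequency request. */
	freq = max_freq * max_util / scale_cpu;

	/* Pick the lowest performance state at or above that frequency. */
	for (i = 0; i < nr_states - 1; i++) {
		if (table[i].frequency >= freq)
			break;
	}

	/* pd energy ~ cost of the chosen state scaled by the pd's total util. */
	return table[i].cost * sum_util / scale_cpu;
}

int main(void)
{
	const struct em_ps table[] = {
		{ .frequency =  500000, .cost = 100 },
		{ .frequency = 1000000, .cost = 300 },
		{ .frequency = 1500000, .cost = 700 },
	};

	printf("idle pd: %lu\n", em_cpu_energy_sketch(table, 3,   0,    0, 1024));
	printf("busy pd: %lu\n", em_cpu_energy_sketch(table, 3, 600, 1400, 1024));
	return 0;
}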


Re: [RFC] Documentation/scheduler/schedutil.txt

2020-11-23 Thread Dietmar Eggemann
On 23/11/2020 14:42, Vincent Guittot wrote:
> On Mon, 23 Nov 2020 at 12:27, Dietmar Eggemann  
> wrote:
>>
>> On 23/11/2020 11:05, Vincent Guittot wrote:
>>> On Mon, 23 Nov 2020 at 10:30, Dietmar Eggemann  
>>> wrote:
>>>>
>>>> On 20/11/2020 09:56, Peter Zijlstra wrote:
>>>>> On Fri, Nov 20, 2020 at 08:55:27AM +0100, Peter Zijlstra wrote:

[...]

>> I thought the question was whether 'runnable_avg = 1.5 x
>> SCHED_CAPACITY_SCALE' is a good threshold to decide to drive frequency
>> by runnable_avg or util_avg.
> 
> we can't use SCHED_CAPACITY_SCALE and must use cpu's capacity

To get some idle time on the LITTLE CPU I extended the time in which
task3 and then task2-3 run on CPU5 to 128ms.

The moment CPU4 has some idle time for the first time after the first
migration (~207ms), the runnable_avg drops from 1323 (task0: 316,
task1: 1020) to 1.
I can't see the dependency on the CPU capacity here.
Util_avg is also larger than the CPU capacity.

The steep fall from runnable=1323 to 1 is due to lost_idle_time update.

...
migration/4-29 [004] 60.34: sched_migrate_task: comm=task3-3 pid=1690 
prio=101 orig_cpu=4 dest_cpu=5
migration/4-29 [004] 60.46: sched_pelt_cfs: cpu=4 path=/ load=163618 
runnable=2296 util=748
...
migration/4-29 [004] 60.142088: sched_migrate_task: comm=task2-2 pid=1689 
prio=101 orig_cpu=4 dest_cpu=5
migration/4-29 [004] 60.142100: sched_pelt_cfs: cpu=4 path=/ load=93358 
runnable=1325 util=628
...
task0-0-1687   [004] 60.201385: sched_pelt_se:  cpu=4 path=(null) 
comm=task0-0 pid=1687 load=22317 runnable=316 util=316
task0-0-1687   [004] 60.201387: sched_pelt_cfs: cpu=4 path=/ load=93978 
runnable=1336 util=788
...
task1-1-1688   [004] 60.207225: sched_pelt_se:  cpu=4 path=(null) 
comm=task1-1 pid=1688 load=71610 runnable=1020 util=497
task1-1-1688   [004] 60.207227: sched_pelt_cfs: cpu=4 path=/ load=93017 
runnable=1323 util=800
 -0  [004] 60.207254: cpu_idle:   state=0 cpu_id=4
...
 -0  [004] 60.209397: sched_pelt_cfs: cpu=4 path=/ load=80 
runnable=1 util=0
...


Re: [RFC] Documentation/scheduler/schedutil.txt

2020-11-23 Thread Dietmar Eggemann
On 23/11/2020 11:05, Vincent Guittot wrote:
> On Mon, 23 Nov 2020 at 10:30, Dietmar Eggemann  
> wrote:
>>
>> On 20/11/2020 09:56, Peter Zijlstra wrote:
>>> On Fri, Nov 20, 2020 at 08:55:27AM +0100, Peter Zijlstra wrote:
>>>>  - In saturated scenarios task movement will cause some transient dips,
>>>>suppose we have a CPU saturated with 4 tasks, then when we migrate a 
>>>> task
>>>>to an idle CPU, the old CPU will have a 'running' value of 0.75 while 
>>>> the
>>>>new CPU will gain 0.25. This is inevitable and time progression will
>>>>correct this. XXX do we still guarantee f_max due to no idle-time?
>>>
>>> Do we want something like this? Is the 1.5 threshold sane? (it's been too
>>> long since I looked at actual numbers here)
>>
>> Did some tests on a big.little system:
>>
>>  (1) rt-app workload on big CPU:
>>
>>  - task0-3 (runtime/period=4000us/16000us, started with
>>4000us delay to each other) run on CPU1
>>  - then task3 migrates to CPU2 and runs there for 64ms
>>  - then task2 migrates to CPU2 too and both tasks run there
>>for another 64ms
>>
>> ...
>> task3-3-1684  [001]  3982.798729: sched_pelt_cfs:   cpu=1 path=/ 
>> load=232890 runnable=3260 util=1011
>> migration/1-14[001]  3982.798756: sched_migrate_task:   comm=task3-3 
>> pid=1684 prio=101 orig_cpu=1 dest_cpu=2*
>> migration/1-14[001]  3982.798767: sched_pelt_cfs:   cpu=1 path=/ 
>> load=161374 runnable=2263 util=*700* <-- util dip !!!
>> task1-1-1682  [001]  3982.799802: sched_pelt_cfs:   cpu=1 path=/ 
>> load=160988 runnable=2257 util=706
>> ...
>> task2-2-1683  [001]  3982.849123: sched_pelt_cfs:   cpu=1 path=/ 
>> load=161124 runnable=2284 util=904
>> task2-2-1683  [001]  3982.851960: sched_pelt_cfs:   cpu=1 path=/ 
>> load=160130 runnable=2271 util=911
>> migration/1-14[001]  3982.851984: sched_migrate_task:   comm=task2-2 
>> pid=1683 prio=101 orig_cpu=1 dest_cpu=2**
>> migration/1-14[001]  3982.851995: sched_pelt_cfs:   cpu=1 path=/ 
>> load=88672 runnable=*1257* util=512 <-- runnable below 1536
>> task1-1-1682  [001]  3982.852983: sched_pelt_cfs:   cpu=1 path=/ 
>> load=88321 runnable=1252 util=521
>> ...
>>
>>
>> *  task1,2,3 remain on CPU1 and still have to catch up, no idle
>>time on CPU1
>>
>> ** task 1,2 remain on CPU1, there is idle time on CPU1!
>>
>>
>> (2) rt-app workload on LITTLE CPU (orig cpu_capacity: 446)
>>
>>  - task0-3 (runtime/period=1742us/16000us, started with
>>4000us delay to each other) run on CPU4
>>  - then task3 migrates to CPU5 and runs there for 64ms
>>  - then task2 migrates to CPU5 too and both tasks run there
>>for another 64ms
>>
>> ...
>> task1-1-1777  [004]   789.443015: sched_pelt_cfs:   cpu=4 path=/ 
>> load=234718 runnable=3018 util=976
>> migration/4-29[004]   789.444718: sched_migrate_task:   comm=task3-3 
>> pid=1779 prio=101 orig_cpu=4 dest_cpu=5*
>> migration/4-29[004]   789.444739: sched_pelt_cfs:   cpu=4 path=/ 
>> load=163543 runnable=2114 util=*778* <--util dip !!!
>> task2-2-1778  [004]   789.447013: sched_pelt_cfs:   cpu=4 path=/ 
>> load=163392 runnable=2120 util=777
>> ...
>> task1-1-1777  [004]   789.507012: sched_pelt_cfs:   cpu=4 path=/ 
>> load=164482 runnable=2223 util=879
>> migration/4-29[004]   789.508023: sched_migrate_task:   comm=task2-2 
>> pid=1778 prio=101 orig_cpu=4 dest_cpu=5**
>> migration/4-29[004]   789.508044: sched_pelt_cfs:   cpu=4 path=/ 
>> load=94099 runnable=*1264* util=611 <-- runnable below 1536
>> task0-0-1776  [004]   789.511011: sched_pelt_cfs:   cpu=4 path=/ 
>> load=93898 runnable=1264 util=622
>> ...
>>
>> *  task1,2,3 remain on CPU1 and still have to catch up, no idle
>>time on CPU1
>>
>> ** task 1,2 remain on CPU1, no idle time on CPU1 yet.
>>
>> So for the big CPU, there is idle time and for the LITTLE there
>> isn't with runnable below the threshold.
> 
> I'm not sure to catch what you want to highlight with your tests ?

I thought the question was whether 'runnable_avg = 1.5 x
SCHED_CAPACITY_SCALE' is a good threshold to decide to drive frequency
by runnable_avg or util_avg.

[...]


Re: [RFC] Documentation/scheduler/schedutil.txt

2020-11-23 Thread Dietmar Eggemann
On 20/11/2020 09:56, Peter Zijlstra wrote:
> On Fri, Nov 20, 2020 at 08:55:27AM +0100, Peter Zijlstra wrote:
>>  - In saturated scenarios task movement will cause some transient dips,
>>suppose we have a CPU saturated with 4 tasks, then when we migrate a task
>>to an idle CPU, the old CPU will have a 'running' value of 0.75 while the
>>new CPU will gain 0.25. This is inevitable and time progression will
>>correct this. XXX do we still guarantee f_max due to no idle-time?
> 
> Do we want something like this? Is the 1.5 threshold sane? (it's been too
> long since I looked at actual numbers here)

Did some tests on a big.little system:

 (1) rt-app workload on big CPU:

 - task0-3 (runtime/period=4000us/16000us, started with
   4000us delay to each other) run on CPU1
 - then task3 migrates to CPU2 and runs there for 64ms
 - then task2 migrates to CPU2 too and both tasks run there
   for another 64ms

...
task3-3-1684  [001]  3982.798729: sched_pelt_cfs:   cpu=1 path=/ 
load=232890 runnable=3260 util=1011
migration/1-14[001]  3982.798756: sched_migrate_task:   comm=task3-3 
pid=1684 prio=101 orig_cpu=1 dest_cpu=2*
migration/1-14[001]  3982.798767: sched_pelt_cfs:   cpu=1 path=/ 
load=161374 runnable=2263 util=*700* <-- util dip !!!
task1-1-1682  [001]  3982.799802: sched_pelt_cfs:   cpu=1 path=/ 
load=160988 runnable=2257 util=706
...
task2-2-1683  [001]  3982.849123: sched_pelt_cfs:   cpu=1 path=/ 
load=161124 runnable=2284 util=904
task2-2-1683  [001]  3982.851960: sched_pelt_cfs:   cpu=1 path=/ 
load=160130 runnable=2271 util=911
migration/1-14[001]  3982.851984: sched_migrate_task:   comm=task2-2 
pid=1683 prio=101 orig_cpu=1 dest_cpu=2**
migration/1-14[001]  3982.851995: sched_pelt_cfs:   cpu=1 path=/ 
load=88672 runnable=*1257* util=512 <-- runnable below 1536
task1-1-1682  [001]  3982.852983: sched_pelt_cfs:   cpu=1 path=/ 
load=88321 runnable=1252 util=521
...


*  task1,2,3 remain on CPU1 and still have to catch up, no idle
   time on CPU1

** task 1,2 remain on CPU1, there is idle time on CPU1!


(2) rt-app workload on LITTLE CPU (orig cpu_capacity: 446)

 - task0-3 (runtime/period=1742us/16000us, started with
   4000us delay to each other) run on CPU4
 - then task3 migrates to CPU5 and runs there for 64ms
 - then task2 migrates to CPU5 too and both tasks run there
   for another 64ms

...
task1-1-1777  [004]   789.443015: sched_pelt_cfs:   cpu=4 path=/ 
load=234718 runnable=3018 util=976
migration/4-29[004]   789.444718: sched_migrate_task:   comm=task3-3 
pid=1779 prio=101 orig_cpu=4 dest_cpu=5*
migration/4-29[004]   789.444739: sched_pelt_cfs:   cpu=4 path=/ 
load=163543 runnable=2114 util=*778* <--util dip !!!
task2-2-1778  [004]   789.447013: sched_pelt_cfs:   cpu=4 path=/ 
load=163392 runnable=2120 util=777
...
task1-1-1777  [004]   789.507012: sched_pelt_cfs:   cpu=4 path=/ 
load=164482 runnable=2223 util=879
migration/4-29[004]   789.508023: sched_migrate_task:   comm=task2-2 
pid=1778 prio=101 orig_cpu=4 dest_cpu=5**
migration/4-29[004]   789.508044: sched_pelt_cfs:   cpu=4 path=/ 
load=94099 runnable=*1264* util=611 <-- runnable below 1536
task0-0-1776  [004]   789.511011: sched_pelt_cfs:   cpu=4 path=/ 
load=93898 runnable=1264 util=622
...

*  task1,2,3 remain on CPU1 and still have to catch up, no idle
   time on CPU1

** task 1,2 remain on CPU1, no idle time on CPU1 yet.

So for the big CPU, there is idle time and for the LITTLE there
isn't with runnable below the threshold. 

As Quentin pointed out, sugov_cpu_is_busy() (only implemented on
'single') tries to do something similar.

I assume that 'another utilization metric' mentioned in commit
b7eaf1aab9f8 ("cpufreq: schedutil: Avoid reducing frequency of
busy CPUs prematurely") is rq->cfs.avg.runnable_avg.
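
A minimal sketch of the heuristic under discussion, as an assumption rather
than merged code: drive the frequency from runnable_avg once the runqueue
looks saturated, with the questioned 1.5x threshold applied to the reference
capacity (whether that should be SCHED_CAPACITY_SCALE or the CPU's capacity
is part of this thread). The example values are taken from the traces above:

#include <stdio.h>

static unsigned long sugov_effective_util_sketch(unsigned long util_avg,
						 unsigned long runnable_avg,
						 unsigned long capacity)
{
	/* Saturated runqueue: no idle time expected, ask for f_max. */
	if (runnable_avg > capacity + capacity / 2)
		return capacity;

	/* Otherwise drive the frequency from util_avg as today. */
	return util_avg < capacity ? util_avg : capacity;
}

int main(void)
{
	/* util dip with 3 tasks still catching up: runnable above 1.5 * 1024 */
	printf("%lu\n", sugov_effective_util_sketch(700, 2263, 1024));	/* 1024 */
	/* 2 tasks left, idle time shows up: runnable below the threshold */
	printf("%lu\n", sugov_effective_util_sketch(512, 1257, 1024));	/* 512  */
	return 0;
}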


Re: [RFC] Documentation/scheduler/schedutil.txt

2020-11-23 Thread Dietmar Eggemann
On 20/11/2020 08:55, Peter Zijlstra wrote:

[...]

> PELT (Per Entity Load Tracking)
> ---

[...]

> Using this we track 2 key metrics: 'running' and 'runnable'. 'Running'
> reflects the time an entity spends on the CPU, while 'runnable' reflects the
> time an entity spends on the runqueue. When there is only a single task these
> two metrics are the same, but once there is contention for the CPU 'running'
> will decrease to reflect the fraction of time each task spends on the CPU
> while 'runnable' will increase to reflect the amount of contention.

People might find it confusing to map 'running' and 'runnable' into the 3
PELT signals (load_avg, runnable_avg and util_avg) being used in the
scheduler ... with load_avg being 'runnable' and 'weight' based.
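
As a toy illustration of that mapping (an approximation, not kernel code):
both signals decay with the PELT constant y (y^32 = 0.5), but
'running'/util_avg only accumulates while the task is on the CPU, whereas
'runnable'/runnable_avg accumulates while the task sits on the runqueue:

#include <stdio.h>

int main(void)
{
	const double y = 0.97857206;		/* y^32 ~= 0.5, as in PELT */
	double running = 0.0, runnable = 0.0;
	const int nr_tasks = 4;			/* 4 always-runnable tasks, 1 CPU */

	for (int ms = 0; ms < 1000; ms++) {
		int on_cpu = (ms % nr_tasks) == 0;	/* crude round-robin */

		/* 'running' (util_avg-like): accumulates only when on the CPU. */
		running = running * y + (on_cpu ? 1.0 - y : 0.0);
		/* 'runnable' (runnable_avg-like): accumulates while enqueued. */
		runnable = runnable * y + (1.0 - y);
	}

	printf("running  ~ %.2f (about 1/4 of the scale)\n", running);
	printf("runnable ~ %.2f (about the full scale)\n", runnable);
	return 0;
}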

> For more detail see: kernel/sched/pelt.c
> 
> 
> Frequency- / Heterogeneous Invariance
> -

We call 'Heterogeneous Invariance' CPU invariance in chapter 2.3
Documentation/scheduler/sched-capacity.rst.

[...]

> For more detail see:
> 
>  - kernel/sched/pelt.h:update_rq_clock_pelt()
>  - arch/x86/kernel/smpboot.c:"APERF/MPERF frequency ratio computation."

drivers/base/arch_topology.c:"f_cur/f_max ratio computation".

> UTIL_EST / UTIL_EST_FASTUP
> --

[...]

>   util_est := \Sum_t max( t_running, t_util_est_ewma )
> 
> For more detail see: kernel/sched/fair.h:util_est_dequeue()

s/fair.h/fair.c

> UCLAMP
> --
> 
> It is possible to set effective u_min and u_max clamps on each task; the

s/on each task/on each CFS or RT task

[...]


[tip: sched/core] sched/uclamp: Allow to reset a task uclamp constraint value

2020-11-20 Thread tip-bot2 for Dietmar Eggemann
The following commit has been merged into the sched/core branch of tip:

Commit-ID: 480a6ca2dc6ed82c783faf7e4a9644769b8397d8
Gitweb:
https://git.kernel.org/tip/480a6ca2dc6ed82c783faf7e4a9644769b8397d8
Author:Dietmar Eggemann 
AuthorDate:Fri, 13 Nov 2020 12:34:54 +01:00
Committer: Peter Zijlstra 
CommitterDate: Thu, 19 Nov 2020 11:25:47 +01:00

sched/uclamp: Allow to reset a task uclamp constraint value

In case the user wants to stop controlling a uclamp constraint value
for a task, use the magic value -1 in sched_util_{min,max} with the
appropriate sched_flags (SCHED_FLAG_UTIL_CLAMP_{MIN,MAX}) to indicate
the reset.

The advantage over the 'additional flag' approach (i.e. introducing
SCHED_FLAG_UTIL_CLAMP_RESET) is that no additional flag has to be
exported via uapi. This avoids the need to document how this new flag
has to be used in conjunction with the existing uclamp related flags.

The following subtle issue is fixed as well. When a uclamp constraint
value is set on a !user_defined uclamp_se it is currently first reset
and then set.
Fix this by AND'ing !user_defined with !SCHED_FLAG_UTIL_CLAMP which
stands for the 'sched class change' case.
The related condition 'if (uc_se->user_defined)' moved from
__setscheduler_uclamp() into uclamp_reset().

Signed-off-by: Dietmar Eggemann 
Signed-off-by: Peter Zijlstra (Intel) 
Reviewed-by: Yun Hsiang 
Link: https://lkml.kernel.org/r/20201113113454.25868-1-dietmar.eggem...@arm.com
---
 include/uapi/linux/sched/types.h |  2 +-
 kernel/sched/core.c  | 70 ++-
 2 files changed, 53 insertions(+), 19 deletions(-)

diff --git a/include/uapi/linux/sched/types.h b/include/uapi/linux/sched/types.h
index c852153..f2c4589 100644
--- a/include/uapi/linux/sched/types.h
+++ b/include/uapi/linux/sched/types.h
@@ -96,6 +96,8 @@ struct sched_param {
  * on a CPU with a capacity big enough to fit the specified value.
  * A task with a max utilization value smaller than 1024 is more likely
  * scheduled on a CPU with no more capacity than the specified value.
+ *
+ * A task utilization boundary can be reset by setting the attribute to -1.
  */
 struct sched_attr {
__u32 size;
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index a9e6d63..e6473ec 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1413,17 +1413,24 @@ done:
 static int uclamp_validate(struct task_struct *p,
   const struct sched_attr *attr)
 {
-   unsigned int lower_bound = p->uclamp_req[UCLAMP_MIN].value;
-   unsigned int upper_bound = p->uclamp_req[UCLAMP_MAX].value;
+   int util_min = p->uclamp_req[UCLAMP_MIN].value;
+   int util_max = p->uclamp_req[UCLAMP_MAX].value;
 
-   if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MIN)
-   lower_bound = attr->sched_util_min;
-   if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MAX)
-   upper_bound = attr->sched_util_max;
+   if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MIN) {
+   util_min = attr->sched_util_min;
 
-   if (lower_bound > upper_bound)
-   return -EINVAL;
-   if (upper_bound > SCHED_CAPACITY_SCALE)
+   if (util_min + 1 > SCHED_CAPACITY_SCALE + 1)
+   return -EINVAL;
+   }
+
+   if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MAX) {
+   util_max = attr->sched_util_max;
+
+   if (util_max + 1 > SCHED_CAPACITY_SCALE + 1)
+   return -EINVAL;
+   }
+
+   if (util_min != -1 && util_max != -1 && util_min > util_max)
return -EINVAL;
 
/*
@@ -1438,20 +1445,41 @@ static int uclamp_validate(struct task_struct *p,
return 0;
 }
 
+static bool uclamp_reset(const struct sched_attr *attr,
+enum uclamp_id clamp_id,
+struct uclamp_se *uc_se)
+{
+   /* Reset on sched class change for a non user-defined clamp value. */
+   if (likely(!(attr->sched_flags & SCHED_FLAG_UTIL_CLAMP)) &&
+   !uc_se->user_defined)
+   return true;
+
+   /* Reset on sched_util_{min,max} == -1. */
+   if (clamp_id == UCLAMP_MIN &&
+   attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MIN &&
+   attr->sched_util_min == -1) {
+   return true;
+   }
+
+   if (clamp_id == UCLAMP_MAX &&
+   attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MAX &&
+   attr->sched_util_max == -1) {
+   return true;
+   }
+
+   return false;
+}
+
 static void __setscheduler_uclamp(struct task_struct *p,
  const struct sched_attr *attr)
 {
enum uclamp_id clamp_id;
 
-   /*
-* On scheduling class change, reset to default clamps for tasks
-* without a task-specific 

Re: [PATCH v5 1/1] sched/uclamp: add SCHED_FLAG_UTIL_CLAMP_RESET flag to reset uclamp

2020-11-13 Thread Dietmar Eggemann
On 12/11/2020 17:01, Dietmar Eggemann wrote:
> On 12/11/2020 15:41, Qais Yousef wrote:
>> On 11/11/20 18:41, Dietmar Eggemann wrote:
>>> On 10/11/2020 13:21, Peter Zijlstra wrote:
>>>> On Tue, Nov 03, 2020 at 10:37:56AM +0800, Yun Hsiang wrote:

[...]

>> If you or Yun would still like to send the patch to protect
>> SCHED_FLAG_UTIL_CLAMP and SCHED_FLAG_ALL with __kernel__ that'd be great.
> 
> Ah yes! Can add an extra patch for this when sending out the next version.

On second thought, why should we risk a change in UAPI? Since we're now
not introducing a new flag the meaning of SCHED_FLAG_ALL or
SCHED_FLAG_UTIL_CLAMP won't change.


[PATCH v2] sched/uclamp: Allow to reset a task uclamp constraint value

2020-11-13 Thread Dietmar Eggemann
In case the user wants to stop controlling a uclamp constraint value
for a task, use the magic value -1 in sched_util_{min,max} with the
appropriate sched_flags (SCHED_FLAG_UTIL_CLAMP_{MIN,MAX}) to indicate
the reset.

The advantage over the 'additional flag' approach (i.e. introducing
SCHED_FLAG_UTIL_CLAMP_RESET) is that no additional flag has to be
exported via uapi. This avoids the need to document how this new flag
has to be used in conjunction with the existing uclamp related flags.

The following subtle issue is fixed as well. When a uclamp constraint
value is set on a !user_defined uclamp_se it is currently first reset
and then set.
Fix this by AND'ing !user_defined with !SCHED_FLAG_UTIL_CLAMP which
stands for the 'sched class change' case.
The related condition 'if (uc_se->user_defined)' moved from
__setscheduler_uclamp() into uclamp_reset().

Signed-off-by: Dietmar Eggemann 
---

v1 [1] -> v2:

1) Removed struct sched_attr (UAPI) change.

2) Use single branch range check in uclamp_validate().

[1] https://lkml.kernel.org/r/f3b59aad-3d5d-039b-205d-024308b60...@arm.com

 include/uapi/linux/sched/types.h |  2 +
 kernel/sched/core.c  | 70 +++-
 2 files changed, 53 insertions(+), 19 deletions(-)

diff --git a/include/uapi/linux/sched/types.h b/include/uapi/linux/sched/types.h
index c852153ddb0d..f2c4589d4dbf 100644
--- a/include/uapi/linux/sched/types.h
+++ b/include/uapi/linux/sched/types.h
@@ -96,6 +96,8 @@ struct sched_param {
  * on a CPU with a capacity big enough to fit the specified value.
  * A task with a max utilization value smaller than 1024 is more likely
  * scheduled on a CPU with no more capacity than the specified value.
+ *
+ * A task utilization boundary can be reset by setting the attribute to -1.
  */
 struct sched_attr {
__u32 size;
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 3dc415f58bd7..a4805747b304 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1413,17 +1413,24 @@ int sysctl_sched_uclamp_handler(struct ctl_table 
*table, int write,
 static int uclamp_validate(struct task_struct *p,
   const struct sched_attr *attr)
 {
-   unsigned int lower_bound = p->uclamp_req[UCLAMP_MIN].value;
-   unsigned int upper_bound = p->uclamp_req[UCLAMP_MAX].value;
+   int util_min = p->uclamp_req[UCLAMP_MIN].value;
+   int util_max = p->uclamp_req[UCLAMP_MAX].value;
 
-   if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MIN)
-   lower_bound = attr->sched_util_min;
-   if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MAX)
-   upper_bound = attr->sched_util_max;
+   if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MIN) {
+   util_min = attr->sched_util_min;
 
-   if (lower_bound > upper_bound)
-   return -EINVAL;
-   if (upper_bound > SCHED_CAPACITY_SCALE)
+   if (util_min + 1 > SCHED_CAPACITY_SCALE + 1)
+   return -EINVAL;
+   }
+
+   if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MAX) {
+   util_max = attr->sched_util_max;
+
+   if (util_max + 1 > SCHED_CAPACITY_SCALE + 1)
+   return -EINVAL;
+   }
+
+   if (util_min != -1 && util_max != -1 && util_min > util_max)
return -EINVAL;
 
/*
@@ -1438,20 +1445,41 @@ static int uclamp_validate(struct task_struct *p,
return 0;
 }
 
+static bool uclamp_reset(const struct sched_attr *attr,
+enum uclamp_id clamp_id,
+struct uclamp_se *uc_se)
+{
+   /* Reset on sched class change for a non user-defined clamp value. */
+   if (likely(!(attr->sched_flags & SCHED_FLAG_UTIL_CLAMP)) &&
+   !uc_se->user_defined)
+   return true;
+
+   /* Reset on sched_util_{min,max} == -1. */
+   if (clamp_id == UCLAMP_MIN &&
+   attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MIN &&
+   attr->sched_util_min == -1) {
+   return true;
+   }
+
+   if (clamp_id == UCLAMP_MAX &&
+   attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MAX &&
+   attr->sched_util_max == -1) {
+   return true;
+   }
+
+   return false;
+}
+
 static void __setscheduler_uclamp(struct task_struct *p,
  const struct sched_attr *attr)
 {
enum uclamp_id clamp_id;
 
-   /*
-* On scheduling class change, reset to default clamps for tasks
-* without a task-specific value.
-*/
for_each_clamp_id(clamp_id) {
struct uclamp_se *uc_se = >uclamp_req[clamp_id];
+   unsigned int value;
 
-   /* Keep using defined clamps across class changes */
-   if (uc_se->user_defined)
+   if (!uclamp

Re: [PATCH v5 1/1] sched/uclamp: add SCHED_FLAG_UTIL_CLAMP_RESET flag to reset uclamp

2020-11-12 Thread Dietmar Eggemann
On 12/11/2020 15:41, Qais Yousef wrote:
> On 11/11/20 18:41, Dietmar Eggemann wrote:
>> On 10/11/2020 13:21, Peter Zijlstra wrote:
>>> On Tue, Nov 03, 2020 at 10:37:56AM +0800, Yun Hsiang wrote:

[...]

> I assume we agree then that we don't want to explicitly document this quirky
> feature and keep it for advanced users?
> 
> I am wary of the UAPI change that is both explicit and implicit. It explicitly
> requests a reset, but implicitly requests a cgroup behavior change.
> 
> With this magic value at least we can more easily return an error if we 
> decided
> to deprecate it, which has been my main ask so far. I don't want us to end up
> not being able to easily modify this code in the future.

Another advantage of this 'magic value' approach.

> I don't have strong opinion too though.
> 
> If you or Yun would still like to send the patch to protect
> SCHED_FLAG_UTIL_CLAMP and SCHED_FLAG_ALL with __kernel__ that'd be great.

Ah yes! Can add an extra patch for this when sending out the next version.

[...]


Re: [PATCH v5 1/1] sched/uclamp: add SCHED_FLAG_UTIL_CLAMP_RESET flag to reset uclamp

2020-11-12 Thread Dietmar Eggemann
On 11/11/2020 19:04, Peter Zijlstra wrote:
> On Wed, Nov 11, 2020 at 06:41:07PM +0100, Dietmar Eggemann wrote:
>> diff --git a/include/uapi/linux/sched/types.h 
>> b/include/uapi/linux/sched/types.h
>> index c852153ddb0d..b9165f17dddc 100644
>> --- a/include/uapi/linux/sched/types.h
>> +++ b/include/uapi/linux/sched/types.h
>> @@ -115,8 +115,8 @@ struct sched_attr {
>>  __u64 sched_period;
>>  
>>  /* Utilization hints */
>> -__u32 sched_util_min;
>> -__u32 sched_util_max;
>> +__s32 sched_util_min;
>> +__s32 sched_util_max;
> 
> So that's UAPI, not sure we can change the type here.

Yes, will remove this chunk. Not needed.

I probably should add some documentation there:

diff --git a/include/uapi/linux/sched/types.h
b/include/uapi/linux/sched/types.h
index c852153ddb0d..f2c4589d4dbf 100644
--- a/include/uapi/linux/sched/types.h
+++ b/include/uapi/linux/sched/types.h
@@ -96,6 +96,8 @@ struct sched_param {
  * on a CPU with a capacity big enough to fit the specified value.
  * A task with a max utilization value smaller than 1024 is more likely
  * scheduled on a CPU with no more capacity than the specified value.
+ *
+ * A task utilization boundary can be reset by setting the attribute to -1.
  */
 struct sched_attr {
__u32 size;

[...]

>> +if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MAX) {
>> +util_max = attr->sched_util_max;
>> +
>> +if (util_max < -1 || util_max > SCHED_CAPACITY_SCALE)
>> +return -EINVAL;
>> +}
> 
> Luckily we can write that range as a single branch like:
> 
>   if (util_{min,max} + 1 > SCHED_CAPACITY_SCALE+1)
> 
> which assumes u32 :-)

Cool, will change it.

>> +
>> +if (util_min != -1 && util_max != -1 && util_min > util_max)
>>  return -EINVAL;
> 
> I think that will compile as is, otherwise write it like ~0u, which is
> the same bit pattern.

Yes, it compiles for me (arm64, gcc 9.2 and arm, gcc 8.3). Started a
0-Day build job to make sure.

Will do some more testing before sending out the updated version.


Re: [PATCH v5 1/1] sched/uclamp: add SCHED_FLAG_UTIL_CLAMP_RESET flag to reset uclamp

2020-11-11 Thread Dietmar Eggemann
On 10/11/2020 13:21, Peter Zijlstra wrote:
> On Tue, Nov 03, 2020 at 10:37:56AM +0800, Yun Hsiang wrote:
>> If the user wants to stop controlling uclamp and let the task inherit
>> the value from the group, we need a method to reset.
>>
>> Add SCHED_FLAG_UTIL_CLAMP_RESET flag to allow the user to reset uclamp via
>> sched_setattr syscall.
>>
>> The policy is
>> _CLAMP_RESET   => reset both min and max
>> _CLAMP_RESET | _CLAMP_MIN  => reset min value
>> _CLAMP_RESET | _CLAMP_MAX  => reset max value
>> _CLAMP_RESET | _CLAMP_MIN | _CLAMP_MAX => reset both min and max
>>
> 
> The obvious alternative would be to use a magic value in
> sched_util_{min,max} to indicate reset. After all, we strictly enforce
> the values are inside [0,1024], which leaves us with many unused values.
> 
> Specifically -1 comes to mind. It would allow doing this without an
> extra flag, OTOH the explicit flag is well, more explicit.
> 
> I don't have a strong preference either way, but I wanted to make sure
> it was considered, and perhaps we can record why this isn't as nice a
> solution, dunno.

IMHO the '-1' magic value approach is cleaner. Did some light testing on it.

From 2e6a64fac4f2f66a2c6246de33db22c467fa7d33 Mon Sep 17 00:00:00 2001
From: Dietmar Eggemann 
Date: Wed, 11 Nov 2020 01:14:33 +0100
Subject: [PATCH] sched/uclamp: Allow to reset a task uclamp constraint value

In case the user wants to stop controlling a uclamp constraint value
for a task, use the magic value -1 in sched_util_{min,max} with the
appropriate sched_flags (SCHED_FLAG_UTIL_CLAMP_{MIN,MAX}) to indicate
the reset.

The advantage over the 'additional flag' approach (i.e. introducing
SCHED_FLAG_UTIL_CLAMP_RESET) is that no additional flag has to be
exported via uapi. This avoids the need to document how this new flag
has to be used in conjunction with the existing uclamp related flags.

The following subtle issue is fixed as well. When a uclamp constraint
value is set on a !user_defined uclamp_se it is currently first reset
and then set.
Fix this by AND'ing !user_defined with !SCHED_FLAG_UTIL_CLAMP which
stands for the 'sched class change' case.
The related condition 'if (uc_se->user_defined)' moved from
__setscheduler_uclamp() into uclamp_reset().

Signed-off-by: Dietmar Eggemann 
---
 include/uapi/linux/sched/types.h |  4 +-
 kernel/sched/core.c  | 70 +++-
 2 files changed, 53 insertions(+), 21 deletions(-)

diff --git a/include/uapi/linux/sched/types.h b/include/uapi/linux/sched/types.h
index c852153ddb0d..b9165f17dddc 100644
--- a/include/uapi/linux/sched/types.h
+++ b/include/uapi/linux/sched/types.h
@@ -115,8 +115,8 @@ struct sched_attr {
__u64 sched_period;
 
/* Utilization hints */
-   __u32 sched_util_min;
-   __u32 sched_util_max;
+   __s32 sched_util_min;
+   __s32 sched_util_max;
 
 };
 
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 3dc415f58bd7..caaa2a8434b9 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1413,17 +1413,24 @@ int sysctl_sched_uclamp_handler(struct ctl_table 
*table, int write,
 static int uclamp_validate(struct task_struct *p,
   const struct sched_attr *attr)
 {
-   unsigned int lower_bound = p->uclamp_req[UCLAMP_MIN].value;
-   unsigned int upper_bound = p->uclamp_req[UCLAMP_MAX].value;
+   int util_min = p->uclamp_req[UCLAMP_MIN].value;
+   int util_max = p->uclamp_req[UCLAMP_MAX].value;
 
-   if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MIN)
-   lower_bound = attr->sched_util_min;
-   if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MAX)
-   upper_bound = attr->sched_util_max;
+   if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MIN) {
+   util_min = attr->sched_util_min;
 
-   if (lower_bound > upper_bound)
-   return -EINVAL;
-   if (upper_bound > SCHED_CAPACITY_SCALE)
+   if (util_min < -1 || util_min > SCHED_CAPACITY_SCALE)
+   return -EINVAL;
+   }
+
+   if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MAX) {
+   util_max = attr->sched_util_max;
+
+   if (util_max < -1 || util_max > SCHED_CAPACITY_SCALE)
+   return -EINVAL;
+   }
+
+   if (util_min != -1 && util_max != -1 && util_min > util_max)
return -EINVAL;
 
/*
@@ -1438,20 +1445,41 @@ static int uclamp_validate(struct task_struct *p,
return 0;
 }
 
+static bool uclamp_reset(const struct sched_attr *attr,
+enum uclamp_id clamp_id,
+struct uclamp_se *uc_se)
+{
+   /* Reset on sched class change for a non user-defined clamp value. */
+ 

Re: [PATCH v5 1/1] sched/uclamp: add SCHED_FLAG_UTIL_CLAMP_RESET flag to reset uclamp

2020-11-04 Thread Dietmar Eggemann
On 03/11/2020 14:48, Qais Yousef wrote:
> Oops, +Juri for real this time.
> 
> On 11/03/20 13:46, Qais Yousef wrote:
>> Hi Yun
>>
>> +Juri (A question for you below)
>>
>> On 11/03/20 10:37, Yun Hsiang wrote:

[...]

>>>  include/uapi/linux/sched.h |  7 +++--
>>>  kernel/sched/core.c| 59 --
>>>  2 files changed, 49 insertions(+), 17 deletions(-)
>>>
>>> diff --git a/include/uapi/linux/sched.h b/include/uapi/linux/sched.h
>>> index 3bac0a8ceab2..6c823ddb1a1e 100644
>>> --- a/include/uapi/linux/sched.h
>>> +++ b/include/uapi/linux/sched.h
>>> @@ -132,17 +132,20 @@ struct clone_args {
>>>  #define SCHED_FLAG_KEEP_PARAMS 0x10
>>>  #define SCHED_FLAG_UTIL_CLAMP_MIN  0x20
>>>  #define SCHED_FLAG_UTIL_CLAMP_MAX  0x40
>>> +#define SCHED_FLAG_UTIL_CLAMP_RESET0x80
>>
>> The new flag needs documentation about how it should be used. It has a none
>> obvious policy that we can't expect users to just get it.

See (1) further below.

>>>  #define SCHED_FLAG_KEEP_ALL(SCHED_FLAG_KEEP_POLICY | \
>>>  SCHED_FLAG_KEEP_PARAMS)
>>>  
>>>  #define SCHED_FLAG_UTIL_CLAMP  (SCHED_FLAG_UTIL_CLAMP_MIN | \
>>> -SCHED_FLAG_UTIL_CLAMP_MAX)
>>> +SCHED_FLAG_UTIL_CLAMP_MAX | \
>>> +SCHED_FLAG_UTIL_CLAMP_RESET)
>>
>> Either do this..
>>
>>>  
>>>  #define SCHED_FLAG_ALL (SCHED_FLAG_RESET_ON_FORK   | \
>>>  SCHED_FLAG_RECLAIM | \
>>>  SCHED_FLAG_DL_OVERRUN  | \
>>>  SCHED_FLAG_KEEP_ALL| \
>>> -SCHED_FLAG_UTIL_CLAMP)
>>> +SCHED_FLAG_UTIL_CLAMP  | \
>>> +SCHED_FLAG_UTIL_CLAMP_RESET)
>>
>> Or this.
>>
>> I checked glibc and it seems they don't use the sched.h from linux and more
>> surprisingly they don't seem to have a wrapper for sched_setattr(). bionic 
>> libc
>> from Android does take sched.h from linux, but didn't find any user. So we
>> might be okay with modifying these here.

Should be package linux-libc-dev. Debian 10 (buster-backports)
5.8.10-1~bpo10+1 contains the uclamp bits as well.

/usr/include/linux/sched/types.h
/usr/include/linux/sched.h

/usr/include/linux/sched.h contains SCHED_FLAG_UTIL_CLAMP and
SCHED_FLAG_ALL.

But there is no glibc wrapper for sched_[sg]etattr so syscall wrapping
is still needed.
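
Purely as illustration, a hedged userspace sketch of such syscall wrapping,
here raising a task's util_min request. The flag values and the struct layout
are redeclared locally from the exported headers for the sketch; whether
SYS_sched_setattr is available and whether this flag combination fits a given
use case depends on the libc and on the task's policy:

#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

#define SCHED_FLAG_KEEP_ALL		0x18	/* KEEP_POLICY | KEEP_PARAMS */
#define SCHED_FLAG_UTIL_CLAMP_MIN	0x20

/* Layout of struct sched_attr up to the utilization hints (VER1). */
struct sched_attr_v1 {
	uint32_t size;
	uint32_t sched_policy;
	uint64_t sched_flags;
	int32_t  sched_nice;
	uint32_t sched_priority;
	uint64_t sched_runtime;
	uint64_t sched_deadline;
	uint64_t sched_period;
	uint32_t sched_util_min;
	uint32_t sched_util_max;
};

int main(void)
{
	struct sched_attr_v1 attr;

	memset(&attr, 0, sizeof(attr));
	attr.size = sizeof(attr);
	/* Keep policy/params, only change the task's uclamp min request. */
	attr.sched_flags = SCHED_FLAG_KEEP_ALL | SCHED_FLAG_UTIL_CLAMP_MIN;
	attr.sched_util_min = 512;

	if (syscall(SYS_sched_setattr, 0 /* current task */, &attr, 0))
		perror("sched_setattr");

	return 0;
}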

>> I still would like us to document better what we expect from these defines.
>> Are they for internal kernel use or really for user space? If the former we
>> should take them out of here. If the latter, then adding the RESET is 
>> dangerous
>> as it'll cause an observable change in behavior; that is if an application 
>> was
>> using SCHED_FLAG_ALL or SCHED_FLAG_UTIL_CLAMP to update the UTIL_MIN/MAX of
>> a task, existing binaries will find out now that instead of modifying the 
>> value
>> they're actually resetting them.

I doubt that any application uses SCHED_FLAG_ALL so far since it already
mixes e.g. DL and UCLAMP stuff. AFAIK the only tools supporting uclamp
so far are rt-app and uclampset, which both use their own files for DL
and uclamp definition.

[..]

>> Add the policy part of the commit message as a documentation to this function
>> please.
>>
>> ie:
>>
>>  /*
>>   * The policy is
>>   * _CLAMP_RESET   => reset both min and max
>>   * _CLAMP_RESET | _CLAMP_MIN  => reset min value
>>   * _CLAMP_RESET | _CLAMP_MAX  => reset max value
>>   * _CLAMP_RESET | _CLAMP_MIN | _CLAMP_MAX => reset both min and max
>>   */
>>

(1) Related to documentation, wouldn't it be better to document in
include/uapi/linux/sched.h, i.e. where the flags are defined, so it gets
exported via linux-libc-dev?
Like it's done for struct sched_attr members in
include/uapi/linux/sched/types.h.

[...]

>>> -   if (likely(!(attr->sched_flags & SCHED_FLAG_UTIL_CLAMP)))
>>> +   if (likely(!(attr->sched_flags && SCHED_FLAG_UTIL_CLAMP)) ||
>>> +   attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_RESET)
>>> return;

Another multi-line statement, so the 'return' could go with braces.

>>>  
>>> -   if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MIN) {
>>> +   if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MIN)
>>> uclamp_se_set(>uclamp_req[UCLAMP_MIN],
>>> - attr->sched_util_min, true);
>>> -   }
>>> +   attr->sched_util_min, true);
>>>  
>>> -   if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MAX) {
>>> +   if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MAX)
>>> uclamp_se_set(>uclamp_req[UCLAMP_MAX],
>>> - attr->sched_util_max, true);
>>> -   }
>>> +   attr->sched_util_max, true);
>>
>> These two hunks seem unrelated too. Multi line statement should still have
>> braces AFAIK. Why change it?

+1


Re: [PATCH v3] sched/fair: prefer prev cpu in asymmetric wakeup path

2020-11-03 Thread Dietmar Eggemann
On 29/10/2020 17:18, Vincent Guittot wrote:

[...]

> - On hikey960 with performance governor (EAS disable)
> 
> ./perf bench sched pipe -T -l 5
>  mainline   w/ patch
> # migrations   999364  0
> ops/sec149313(+/-0.28%)   182587(+/- 0.40) +22%
> 
> - On hikey with performance governor
> 
> ./perf bench sched pipe -T -l 5
>  mainline   w/ patch
> # migrations0  0
> ops/sec 47721(+/-0.76%)47899(+/- 0.56) +0.4%

Tested on hikey960 (big cluster 0xf0) with perf gov on tip sched/core +
patch and defconfig plus:

# CONFIG_ARM_CPUIDLE is not set
# CONFIG_CPU_THERMAL is not set
# CONFIG_HISI_THERMAL is not set

and for 'w/ uclamp' tests:

CONFIG_UCLAMP_TASK=y
CONFIG_UCLAMP_BUCKETS_COUNT=5
CONFIG_UCLAMP_TASK_GROUP=y

(a) perf stat -n -r 20 taskset 0xf0 perf bench sched pipe -T -l 5

(b) perf stat -n -r 20 -- cgexec -g cpu:A/B taskset 0xf0 perf bench
sched pipe -T -l 5


(1) w/o uclamp

(a) w/o patch: 0.392850 +- 0.000289 seconds time elapsed  ( +-  0.07% )

w/  patch: 0.330786 +- 0.000401 seconds time elapsed  ( +-  0.12% )

(b) w/o patch: 0.414644 +- 0.000375 seconds time elapsed  ( +-  0.09% )

w/  patch: 0.353113 +- 0.000393 seconds time elapsed  ( +-  0.11% )

(2) w/ uclamp

(a) w/o patch: 0.393781 +- 0.000488 seconds time elapsed  ( +-  0.12% )

w/  patch: 0.342726 +- 0.000661 seconds time elapsed  ( +-  0.19% )

(b) w/o patch: 0.416645 +- 0.000520 seconds time elapsed  ( +-  0.12% )

w/  patch: 0.358098 +- 0.000577 seconds time elapsed  ( +-  0.16% )

Tested-by: Dietmar Eggemann 

> According to test on hikey, the patch doesn't impact symmetric system
> compared to current implementation (only tested on arm64)
> 
> Also read the uclamped value of task's utilization at most twice instead
> instead each time we compare task's utilization with cpu's capacity.

task_util could be passed into select_idle_capacity() avoiding the
second call to uclamp_task_util()?

With this I see a small improvement for (a)

(3) w/ uclamp and passing task_util into sic()

(a) w/  patch: 0.337032 +- 0.000564 seconds time elapsed  ( +-  0.17% )

(b) w/  patch: 0.358467 +- 0.000381 seconds time elapsed  ( +-  0.11% )
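
A small standalone illustration of that suggestion, with assumed helper
names: clamp the task utilization once in the caller and hand it down to
every per-CPU capacity check. The ~80% margin mirrors fits_capacity() in
fair.c and 446 stands for a LITTLE CPU capacity on this board:

#include <stdio.h>

/* ~80% margin, as fits_capacity() in fair.c. */
static int fits_capacity(unsigned long util, unsigned long capacity)
{
	return util * 1280 < capacity * 1024;
}

static unsigned long uclamp_task_util_once(unsigned long util,
					   unsigned long uc_min, unsigned long uc_max)
{
	if (util < uc_min)
		return uc_min;
	return util > uc_max ? uc_max : util;
}

static int asym_fits_capacity(unsigned long task_util, unsigned long capacity)
{
	/* task_util is already clamped by the caller: no second uclamp read. */
	return fits_capacity(task_util, capacity);
}

int main(void)
{
	unsigned long task_util = uclamp_task_util_once(400, 0, 1024);

	printf("fits big (1024):   %d\n", asym_fits_capacity(task_util, 1024));
	printf("fits LITTLE (446): %d\n", asym_fits_capacity(task_util, 446));
	return 0;
}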

[...]

> -symmetric:
> - if (available_idle_cpu(target) || sched_idle_cpu(target))
> + if ((available_idle_cpu(target) || sched_idle_cpu(target)) &&
> + asym_fits_capacity(task_util, target))
>   return target;

Braces because of the multi-line condition?

>   /*
>* If the previous CPU is cache affine and idle, don't be stupid:
>*/
>   if (prev != target && cpus_share_cache(prev, target) &&
> - (available_idle_cpu(prev) || sched_idle_cpu(prev)))
> + (available_idle_cpu(prev) || sched_idle_cpu(prev)) &&
> + asym_fits_capacity(task_util, prev))
>   return prev;

and here ...

[...]


Re: [PATCH v3 1/1] sched/uclamp: add SCHED_FLAG_UTIL_CLAMP_RESET flag to reset uclamp

2020-10-29 Thread Dietmar Eggemann
On 29/10/2020 14:06, Qais Yousef wrote:
> On 10/29/20 21:02, Yun Hsiang wrote:
>> Hi Qais,
>>
>> On Thu, Oct 29, 2020 at 11:08:18AM +, Qais Yousef wrote:
>>> Hi Yun
>>>
>>> Sorry for chipping in late.
>>>
>>> On 10/25/20 15:36, Yun Hsiang wrote:

[...]

  #define SCHED_FLAG_UTIL_CLAMP (SCHED_FLAG_UTIL_CLAMP_MIN | \
 -   SCHED_FLAG_UTIL_CLAMP_MAX)
 +   SCHED_FLAG_UTIL_CLAMP_MAX | \
 +   SCHED_FLAG_UTIL_CLAMP_RESET)
>>>
>>> Is it safe to change this define in a uapi header without a potential
>>> consequence?

AFAICS, there're 3 occurrences, besides the one in
__setscheduler_uclamp(), in which we use SCHED_FLAG_UTIL_CLAMP.

1) call uclamp_validate() in __sched_setscheduler()

2) jump to 'change' label in __sched_setscheduler()

3) check that the uattr->size is SCHED_ATTR_SIZE_VER1 in
   sched_copy_attr()

2) and 3) require SCHED_FLAG_UTIL_CLAMP_RESET to be part of
SCHED_FLAG_UTIL_CLAMP but IMHO 1) needs this change:

@@ -1413,8 +1413,14 @@ int sysctl_sched_uclamp_handler(struct ctl_table
*table, int write,
 static int uclamp_validate(struct task_struct *p,
   const struct sched_attr *attr)
 {
-   unsigned int lower_bound = p->uclamp_req[UCLAMP_MIN].value;
-   unsigned int upper_bound = p->uclamp_req[UCLAMP_MAX].value;
+   unsigned int lower_bound, upper_bound;
+
+   /* Do not check uclamp attributes values in reset case. */
+   if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_RESET)
+   return 0;
+
+   lower_bound = p->uclamp_req[UCLAMP_MIN].value;
+   upper_bound = p->uclamp_req[UCLAMP_MAX].value;

if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MIN)
lower_bound = attr->sched_util_min;

Otherwise a bogus sa.sched_util_min or sa.sched_util_max with
SCHED_FLAG_UTIL_CLAMP_RESET could return -EINVAL.

>>> FWIW I still have concerns about this approach. We're doing a reset to 
>>> control
>>> cgroup behavior, I don't see any correlation between the two. Besides the
>>> difference between RESET and setting uclamp_min=0 without RESET is not 
>>> obvious
>>> nor intuitive for someone who didn't look at the code.
>>>
>>> I propose something like the below which is more explicit about what is 
>>> being
>>> requested and delivered here. And if we decide to deprecate this behavior,
>>> it'd be much easier to just ignore this flag.
>>>
>>> You must set this flag with your uclamp request to retain the cgroup
>>> inheritance behavior. If the flag is not set, we automatically clear it.
>>
>> I think this behavior may not meet android requirement. Becasue in
>> android there is group like top-app. And we want to boost the
>> group by setting group uclamp_min. If group inheritance is explicit, we
>> need to set this flag for all the tasks in top-app. This might be
>> costly.
> 
> You will not have to set it for every task. It's on by default like it works
> now. This behavior doesn't change.
> 
> But if you change the uclamp value of a task but still want it to continue to
> inherit the cgroup values if it's attached to one, you must set this flag when
> changing the uclamp value.

I'm not really fond of this idea because:

(1) explicit cgroup(-behavior) related settings through a per-task user
interface.

(2) uclamp reset already makes sense in the !CONFIG_UCLAMP_TASK_GROUP
case. A task can reset its uclamp values here as well, and then
'inherit' the system defaults again. Already mentioned in
https://lkml.kernel.org/r/87362ihxvw.derkl...@matbug.net

[...]


Re: [PATCH v3 1/1] sched/uclamp: add SCHED_FLAG_UTIL_CLAMP_RESET flag to reset uclamp

2020-10-29 Thread Dietmar Eggemann
On 28/10/2020 19:41, Yun Hsiang wrote:
> Hi Patrick,
> 
> On Wed, Oct 28, 2020 at 11:11:07AM +0100, Patrick Bellasi wrote:

[...]

>> On Tue, Oct 27, 2020 at 16:58:13 +0100, Yun Hsiang  
>> wrote...
>>
>>> Hi Diet mar,
>>> On Mon, Oct 26, 2020 at 08:00:48PM +0100, Dietmar Eggemann wrote:
>>>> On 26/10/2020 16:45, Yun Hsiang wrote:

[...]

>>>> From: Dietmar Eggemann 

[...]

>>>> +static bool uclamp_reset(enum uclamp_id clamp_id, unsigned long flags)
>>>> +{
>>
>> Maybe we can add in some comments?
>>
> I'll add these comment.

Yeah, why not.

>> /* No _UCLAMP_RESET flag set: do not reset */
>>>> +  if (!(flags & SCHED_FLAG_UTIL_CLAMP_RESET))
>>>> +  return false;
>>>> +
>>
>> /* Only _UCLAMP_RESET flag set: reset both clamps */
>>>> +  if (!(flags & (SCHED_FLAG_UTIL_CLAMP_MIN | SCHED_FLAG_UTIL_CLAMP_MAX)))
>>>> +  return true;
>>>> +
>> /* Both _UCLAMP_RESET and _UCLAMP_MIN flags are set: reset only min 
>> */
>>>> +  if ((flags & SCHED_FLAG_UTIL_CLAMP_MIN) && clamp_id == UCLAMP_MIN)
>>>> +  return true;
>>>> +
>>
>> /* Both _UCLAMP_RESET and _UCLAMP_MAX flags are set: reset only max 
>> */
>>>> +  if ((flags & SCHED_FLAG_UTIL_CLAMP_MAX) && clamp_id == UCLAMP_MAX)
>>>> +  return true;
>>
>> Since the evaluation ordering is important, do we have to better
>> _always_ use a READ_ONCE() for all flags accesses above, to ensure it is
>> preserved?
>>
> 
> Is this mean that we want to use READ_ONCE to avoid compiler reordering these
> conditions?

Why would you need a READ_ONCE() on flags here?

[...]

>>>>/* Keep using defined clamps across class changes */
>>>> -  if (uc_se->user_defined)
>>>> +  if (!uclamp_reset(clamp_id, attr->sched_flags) &&
>>>> +  uc_se->user_defined) {
>>>>continue;
>>>> +  }
>>
>> I think we miss to reset the user_defined flag here.
>>
>> What about replacing the above chunk with:
>>
>> if (uclamp_reset(clamp_id, attr->sched_flags))
>> uc_se->user_defined = false;
>> if (uc-se->user_defined)
>> continue;
>>
>> ?
> 
> user_defined flag will be reset later by uclamp_se_set(uc_se, value,
> false). But I agree to split it to two condition because it seems
> clearer.

IMHO it's more elegant to use uclamp_reset() in the condition next to
uc_se->user_defined and let uclamp_se_set() set uc_se->user_defined to
false later.

>>>>/*
>>>> * RT by default have a 100% boost value that could be modified
>>>> * at runtime.
>>>> */
>>>>if (unlikely(rt_task(p) && clamp_id == UCLAMP_MIN))
>>>> -  __uclamp_update_util_min_rt_default(p);
>>>> +  value = sysctl_sched_uclamp_util_min_rt_default;
>>
>> By removing this usage of __uclamp_updadate_util_min_rt_default(p),
>> the only other usage remaining is the call from:
>>uclamp_udpate_util_min_rt_default().
>>
>> What about an additional cleanup by in-lining the only surviving usage?

Don't see why not.

>>>>else
>>>> -  uclamp_se_set(uc_se, uclamp_none(clamp_id), false);
>>>> +  value = uclamp_none(clamp_id);
>>>>  
>>>> +  uclamp_se_set(uc_se, value, false);
>>>>}
>>>>  
>>>> -  if (likely(!(attr->sched_flags & SCHED_FLAG_UTIL_CLAMP)))
>>>> +  if (likely(!(attr->sched_flags & SCHED_FLAG_UTIL_CLAMP)) ||
>>>> +  attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_RESET) {
>>
>> The likely() above should not wrap both conditions to be effective?
> 
> Got it.

I thought the likely() is for no uclamp activity, i.e. a plain policy change.
And a uclamp reset is different from a policy change. But is this likely too?


[tip: sched/core] sched/cpupri: Remove pri_to_cpu[CPUPRI_IDLE]

2020-10-29 Thread tip-bot2 for Dietmar Eggemann
The following commit has been merged into the sched/core branch of tip:

Commit-ID: 5e054bca44fe92323de5e9b71478d1904b8bb1b7
Gitweb:
https://git.kernel.org/tip/5e054bca44fe92323de5e9b71478d1904b8bb1b7
Author:Dietmar Eggemann 
AuthorDate:Tue, 22 Sep 2020 10:39:33 +02:00
Committer: Peter Zijlstra 
CommitterDate: Thu, 29 Oct 2020 11:00:29 +01:00

sched/cpupri: Remove pri_to_cpu[CPUPRI_IDLE]

pri_to_cpu[CPUPRI_IDLE=0] isn't used since cpupri_set(..., newpri) is
never called with newpri = MAX_PRIO (140).

Current mapping:

p->rt_priority   p->prio   newpri   cpupri

                               -1       -1 (CPUPRI_INVALID)

                              140        0 (CPUPRI_IDLE)

                              100        1 (CPUPRI_NORMAL)

             1        98       98        3
                              ...
            49        50       50       51
            50        49       49       52
                              ...
            99         0        0      101

Even when cpupri was introduced with commit 6e0534f27819 ("sched: use a
2-d bitmap for searching lowest-pri CPU") in v2.6.27, only

   (1) CPUPRI_INVALID (-1),
   (2) MAX_RT_PRIO (100),
   (3) an RT prio (RT1..RT99)

were used as newprio in cpupri_set(..., newpri) -> convert_prio(newpri).

MAX_RT_PRIO is used only in dec_rt_tasks() -> dec_rt_prio() ->
dec_rt_prio_smp() -> cpupri_set() in case of !rt_rq->rt_nr_running.
I.e. it stands for a non-rt task, including the IDLE task.

Commit 57785df5ac53 ("sched: Fix task priority bug") removed code in
v2.6.33 which did set the priority of the IDLE task to MAX_PRIO.
Although this happened after the introduction of cpupri, it didn't have
an effect on the values used for cpupri_set(..., newpri).

Remove CPUPRI_IDLE and adapt the cpupri implementation accordingly.
This will save a useless for loop with an atomic_read in
cpupri_find_fitness() calling __cpupri_find().

New mapping:

p->rt_priority   p->prio   newpri   cpupri

                               -1       -1 (CPUPRI_INVALID)

                              100        0 (CPUPRI_NORMAL)

             1        98       98        2
                              ...
            49        50       50       50
            50        49       49       51
                              ...
            99         0        0      100

Signed-off-by: Dietmar Eggemann 
Signed-off-by: Peter Zijlstra (Intel) 
Link: https://lkml.kernel.org/r/20200922083934.19275-2-dietmar.eggem...@arm.com
---
 kernel/sched/cpupri.c | 10 --
 kernel/sched/cpupri.h |  7 +++
 2 files changed, 7 insertions(+), 10 deletions(-)

diff --git a/kernel/sched/cpupri.c b/kernel/sched/cpupri.c
index 0033731..a5d14ed 100644
--- a/kernel/sched/cpupri.c
+++ b/kernel/sched/cpupri.c
@@ -11,7 +11,7 @@
  *  This code tracks the priority of each CPU so that global migration
  *  decisions are easy to calculate.  Each CPU can be in a state as follows:
  *
- * (INVALID), IDLE, NORMAL, RT1, ... RT99
+ * (INVALID), NORMAL, RT1, ... RT99
  *
  *  going from the lowest priority to the highest.  CPUs in the INVALID state
  *  are not eligible for routing.  The system maintains this state with
@@ -19,24 +19,22 @@
  *  in that class).  Therefore a typical application without affinity
  *  restrictions can find a suitable CPU with O(1) complexity (e.g. two bit
  *  searches).  For tasks with affinity restrictions, the algorithm has a
- *  worst case complexity of O(min(102, nr_domcpus)), though the scenario that
+ *  worst case complexity of O(min(101, nr_domcpus)), though the scenario that
  *  yields the worst case search is fairly contrived.
  */
 #include "sched.h"
 
-/* Convert between a 140 based task->prio, and our 102 based cpupri */
+/* Convert between a 140 based task->prio, and our 101 based cpupri */
 static int convert_prio(int prio)
 {
int cpupri;
 
if (prio == CPUPRI_INVALID)
cpupri = CPUPRI_INVALID;
-   else if (prio == MAX_PRIO)
-   cpupri = CPUPRI_IDLE;
else if (prio >= MAX_RT_PRIO)
cpupri = CPUPRI_NORMAL;
else
-   cpupri = MAX_RT_PRIO - prio + 1;
+   cpupri = MAX_RT_PRIO - prio;
 
return cpupri;
 }
diff --git a/kernel/sched/cpupri.h b/kernel/sched/cpupri.h
index efbb492..1a16236 100644
--- a/kernel/sched/cpupri.h
+++ b/kernel/sched/cpupri.h
@@ -1,11 +1,10 @@
 /* SPDX-License-Identifier: GPL-2.0 */
 
-#define CPUPRI_NR_PRIORITIES   (MAX_RT_PRIO + 2)
+#define CPUPRI_NR_PRIORITIES   (MAX_RT_PRIO + 1)
 
 #define CPUPRI_INVALID -1
-#define CPUPRI_IDLE 0
-#define CPUPRI_NORMAL   1
-/* values 2-101 are RT priorities 0-99 */
+#define CPUPRI_NORMAL   0
+/* values 2-100 are RT priorities 0-99 */
 
 struct cpupri_vec {
atomic_tcount;


[tip: sched/core] sched/cpupri: Remove pri_to_cpu[1]

2020-10-29 Thread tip-bot2 for Dietmar Eggemann
The following commit has been merged into the sched/core branch of tip:

Commit-ID: 1b08782ce31f612d98e11f3e3df9a147a67d
Gitweb:
https://git.kernel.org/tip/1b08782ce31f612d98e11f3e3df9a147a67d
Author:Dietmar Eggemann 
AuthorDate:Tue, 22 Sep 2020 10:39:34 +02:00
Committer: Peter Zijlstra 
CommitterDate: Thu, 29 Oct 2020 11:00:29 +01:00

sched/cpupri: Remove pri_to_cpu[1]

pri_to_cpu[1] isn't used since cpupri_set(..., newpri) is
never called with newpri = 99.

The valid RT priorities RT1..RT99 (p->rt_priority = [1..99]) map into
cpupri (idx of pri_to_cpu[]) = [2..100]

Current mapping:

p->rt_priority   p->prio   newpri   cpupri

                               -1       -1 (CPUPRI_INVALID)

                              100        0 (CPUPRI_NORMAL)

             1        98       98        2
                              ...
            49        50       50       50
            50        49       49       51
                              ...
            99         0        0      100

So cpupri = 1 isn't used.

Reduce the size of pri_to_cpu[] by 1 and adapt the cpupri
implementation accordingly. This will save a useless for loop with an
atomic_read in cpupri_find_fitness() calling __cpupri_find().

New mapping:

p->rt_priority   p->prio   newpri   cpupri

                               -1       -1 (CPUPRI_INVALID)

                              100        0 (CPUPRI_NORMAL)

             1        98       98        1
           ...
            49        50       50       49
            50        49       49       50
           ...
            99         0        0       99

Signed-off-by: Dietmar Eggemann 
Signed-off-by: Peter Zijlstra (Intel) 
Link: https://lkml.kernel.org/r/20200922083934.19275-3-dietmar.eggem...@arm.com
---
 kernel/sched/cpupri.c | 6 +++---
 kernel/sched/cpupri.h | 4 ++--
 2 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/kernel/sched/cpupri.c b/kernel/sched/cpupri.c
index a5d14ed..8d9952a 100644
--- a/kernel/sched/cpupri.c
+++ b/kernel/sched/cpupri.c
@@ -19,12 +19,12 @@
  *  in that class).  Therefore a typical application without affinity
  *  restrictions can find a suitable CPU with O(1) complexity (e.g. two bit
  *  searches).  For tasks with affinity restrictions, the algorithm has a
- *  worst case complexity of O(min(101, nr_domcpus)), though the scenario that
+ *  worst case complexity of O(min(100, nr_domcpus)), though the scenario that
  *  yields the worst case search is fairly contrived.
  */
 #include "sched.h"
 
-/* Convert between a 140 based task->prio, and our 101 based cpupri */
+/* Convert between a 140 based task->prio, and our 100 based cpupri */
 static int convert_prio(int prio)
 {
int cpupri;
@@ -34,7 +34,7 @@ static int convert_prio(int prio)
else if (prio >= MAX_RT_PRIO)
cpupri = CPUPRI_NORMAL;
else
-   cpupri = MAX_RT_PRIO - prio;
+   cpupri = MAX_RT_PRIO - prio - 1;
 
return cpupri;
 }
diff --git a/kernel/sched/cpupri.h b/kernel/sched/cpupri.h
index 1a16236..e28e1ed 100644
--- a/kernel/sched/cpupri.h
+++ b/kernel/sched/cpupri.h
@@ -1,10 +1,10 @@
 /* SPDX-License-Identifier: GPL-2.0 */
 
-#define CPUPRI_NR_PRIORITIES   (MAX_RT_PRIO + 1)
+#define CPUPRI_NR_PRIORITIES   MAX_RT_PRIO
 
 #define CPUPRI_INVALID -1
 #define CPUPRI_NORMAL   0
-/* values 2-100 are RT priorities 0-99 */
+/* values 1-99 are for RT1-RT99 priorities */
 
 struct cpupri_vec {
atomic_t                count;
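
As a quick sanity check of the final mapping, here is a stand-alone
user-space sketch (an editorial addition, assuming MAX_RT_PRIO = 100) that
mirrors convert_prio() after this patch and prints the sample rows from the
"New mapping" table above:

#include <stdio.h>

#define MAX_RT_PRIO	100
#define CPUPRI_INVALID	 -1
#define CPUPRI_NORMAL	  0

/* Mirrors convert_prio() after this patch. */
static int convert_prio(int prio)
{
	if (prio == CPUPRI_INVALID)
		return CPUPRI_INVALID;
	if (prio >= MAX_RT_PRIO)		/* non-RT tasks */
		return CPUPRI_NORMAL;
	return MAX_RT_PRIO - prio - 1;		/* RT1..RT99 -> 1..99 */
}

int main(void)
{
	/* p->prio samples: -1 (invalid), 100 (non-RT), 98 (RT1), 0 (RT99) */
	int prios[] = { -1, 100, 98, 50, 49, 0 };
	unsigned int i;

	for (i = 0; i < sizeof(prios) / sizeof(prios[0]); i++)
		printf("p->prio %4d -> cpupri %4d\n",
		       prios[i], convert_prio(prios[i]));

	return 0;
}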


Re: [PATCH v2] sched: sched_domain fix highest_flag_domain function

2020-10-27 Thread Dietmar Eggemann
On 27/10/2020 04:32, Xuewen Yan wrote:
> highest_flag_domain() is meant to return the highest sched_domain
> containing the given flag. However, if a lower sched_domain does not
> contain the flag while a higher sched_domain does, the function
> returns NULL instead of that higher sched_domain.
> 
> For example:
> the MC domain has no SD_ASYM_CPUCAPACITY flag;
> the DIE domain has the SD_ASYM_CPUCAPACITY flag;
> then "highest_flag_domain(cpu, SD_ASYM_CPUCAPACITY)" will return NULL.
> 
> Signed-off-by: Xuewen Yan 
> ---
>  kernel/sched/sched.h | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 28709f6..2c7c566 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -1427,7 +1427,7 @@ static inline struct sched_domain 
> *highest_flag_domain(int cpu, int flag)
>  
>   for_each_domain(cpu, sd) {
>   if (!(sd->flags & flag))
> - break;
> + continue;
>   hsd = sd;
>   }

We distinguish between SDF_SHARED_PARENT and SDF_SHARED_CHILD SD flags.

1 SD_FLAG(*SD_ASYM_CPUCAPACITY*, SDF_SHARED_PARENT | SDF_NEEDS_GROUPS)
2 SD_FLAG(SD_SERIALIZE, SDF_SHARED_PARENT | SDF_NEEDS_GROUPS)
3 SD_FLAG(SD_OVERLAP, SDF_SHARED_PARENT | SDF_NEEDS_GROUPS)
4 SD_FLAG(SD_NUMA, SDF_SHARED_PARENT | SDF_NEEDS_GROUPS)

1 SD_FLAG(SD_BALANCE_NEWIDLE, SDF_SHARED_CHILD | SDF_NEEDS_GROUPS)
2 SD_FLAG(SD_BALANCE_EXEC, SDF_SHARED_CHILD | SDF_NEEDS_GROUPS)
3 SD_FLAG(SD_BALANCE_FORK, SDF_SHARED_CHILD | SDF_NEEDS_GROUPS)
4 SD_FLAG(SD_BALANCE_WAKE, SDF_SHARED_CHILD | SDF_NEEDS_GROUPS)
5 SD_FLAG(SD_WAKE_AFFINE, SDF_SHARED_CHILD)
6 SD_FLAG(SD_SHARE_CPUCAPACITY, SDF_SHARED_CHILD | SDF_NEEDS_GROUPS)
7 SD_FLAG(SD_SHARE_PKG_RESOURCES, SDF_SHARED_CHILD | SDF_NEEDS_GROUPS)
8 SD_FLAG(SD_ASYM_PACKING, SDF_SHARED_CHILD | SDF_NEEDS_GROUPS)

We call lowest_flag_domain() on SDF_SHARED_PARENT and
highest_flag_domain() on SDF_SHARED_CHILD SD flags.

1 sd = lowest_flag_domain(cpu, SD_NUMA);
2 sd = lowest_flag_domain(cpu, *SD_ASYM_CPUCAPACITY*);

1 sd = highest_flag_domain(cpu, SD_SHARE_PKG_RESOURCES);
2 sd = highest_flag_domain(cpu, SD_ASYM_PACKING);
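
For reference, the two helpers in kernel/sched/sched.h look roughly like
this (reproduced here for context, check the actual tree):
highest_flag_domain() may stop at the first domain without the flag
precisely because SDF_SHARED_CHILD guarantees that a flag set on a domain
is also set on all of its children, while lowest_flag_domain() returns the
first (lowest) hit because SDF_SHARED_PARENT guarantees all parents have
it too.

/*
 * highest_flag_domain - Return highest sched_domain containing flag.
 * Only meaningful for SDF_SHARED_CHILD flags: once a domain (bottom-up)
 * lacks the flag, no parent can have it either.
 */
static inline struct sched_domain *highest_flag_domain(int cpu, int flag)
{
	struct sched_domain *sd, *hsd = NULL;

	for_each_domain(cpu, sd) {
		if (!(sd->flags & flag))
			break;
		hsd = sd;
	}

	return hsd;
}

/*
 * lowest_flag_domain - Return lowest sched_domain containing flag.
 * Meant for SDF_SHARED_PARENT flags: if a domain has the flag, all of
 * its parents have it too, so the first hit is the answer.
 */
static inline struct sched_domain *lowest_flag_domain(int cpu, int flag)
{
	struct sched_domain *sd;

	for_each_domain(cpu, sd) {
		if (sd->flags & flag)
			break;
	}

	return sd;
}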


Re: [PATCH v3 1/1] sched/uclamp: add SCHED_FLAG_UTIL_CLAMP_RESET flag to reset uclamp

2020-10-26 Thread Dietmar Eggemann
On 26/10/2020 16:45, Yun Hsiang wrote:
> Hi Dietmar,
> 
> On Mon, Oct 26, 2020 at 10:47:11AM +0100, Dietmar Eggemann wrote:
>> On 25/10/2020 08:36, Yun Hsiang wrote:
>>> If the user wants to stop controlling uclamp and let the task inherit
>>> the value from the group, we need a method to reset.
>>>
>>> Add SCHED_FLAG_UTIL_CLAMP_RESET flag to allow the user to reset uclamp via
>>> sched_setattr syscall.
>>>
>>> The policy is
>>> _CLAMP_RESET   => reset both min and max
>>> _CLAMP_RESET | _CLAMP_MIN  => reset min value
>>> _CLAMP_RESET | _CLAMP_MAX  => reset max value
>>> _CLAMP_RESET | _CLAMP_MIN | _CLAMP_MAX => reset both min and max
>>>
>>> Signed-off-by: Yun Hsiang 
>>
>> [...]
>>
>>> @@ -1451,7 +1464,8 @@ static void __setscheduler_uclamp(struct task_struct 
>>> *p,
>>> struct uclamp_se *uc_se = >uclamp_req[clamp_id];
>>>  
>>> /* Keep using defined clamps across class changes */
>>> -   if (uc_se->user_defined)
>>> +   if (flags != SCHED_FLAG_UTIL_CLAMP_RESET &&
>>> +   uc_se->user_defined)
>>> continue;
>>
>> With:
>>
>> (1) _CLAMP_RESET   => reset both min and max
>> (2) _CLAMP_RESET | _CLAMP_MIN  => reset min value
>> (3) _CLAMP_RESET | _CLAMP_MAX  => reset max value
>> (4) _CLAMP_RESET | _CLAMP_MIN | _CLAMP_MAX => reset both min and max
>>
>> If you reset an RT task with (1) you don't reset its uclamp.min value.
>>
>> __uclamp_update_util_min_rt_default(p) doesn't know about
>> SCHED_FLAG_UTIL_CLAMP_RESET. It only knows user_defined and will bail early.
>>
> 
> Sorry I didn't notice __uclamp_update_util_min_rt_default will return
> directly if user_defined is set, I'll fix it.
> 
>> [...]
>>
>>> -   if (likely(!(attr->sched_flags & SCHED_FLAG_UTIL_CLAMP)))
>>> +   if (likely(!flags || flags == SCHED_FLAG_UTIL_CLAMP_RESET))
>>> return;
>>>  
>>> -   if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MIN) {
>>> +   if (flags & SCHED_FLAG_UTIL_CLAMP_MIN) {
>>> +   if (reset) {
>>> +   clamp_value = __default_uclamp_value(p, UCLAMP_MIN);
>>> +   user_defined = false;
>>> +   } else {
>>> +   clamp_value = attr->sched_util_min;
>>> +   user_defined = true;
>>> +   }
>>
>> Why do you reset for (1) in the for_each_clamp_id(clamp_id) loop and for
>> (2)-(4) in the if (flags & SCHED_FLAG_UTIL_CLAMP_MXX) condition?
>>
>> You could reset (1)-(4) in the for_each_clamp_id(clamp_id) loop? In this
>> case you wouldn't need __default_uclamp_value().
> 
> Do you mean adding these code in for_each_clamp_id(clamp_id) loop?
> 
> if (clamp_id == UCLAMP_MIN) {
>   if (flags == SCHED_FLAG_UTIL_CLAMP_RESET ||
>   (reset && (flags & SCHED_FLAG_UTIL_CLAMP_MIN)) ||
>   !uc_se->user_defined) {
>   if (rt_task(p)) {
>   clamp_value = sysctl_sched_uclamp_util_min_rt_default;
>   } else {
>   clamp_value = uclamp_none(clamp_id);
>   }
>   } else
>   continue;
> }
> /* similar code for UCLAMP_MAX */
> ...
> uclamp_se_set(uc_se, clamp_value, false);
> 
> It seems more clear.
> If you think this one is better, I'll use this method and send patch V4.

I thought about something like this. Only lightly tested. 

---8<---

From: Dietmar Eggemann 
Date: Mon, 26 Oct 2020 13:52:23 +0100
Subject: [PATCH] SCHED_FLAG_UTIL_CLAMP_RESET

Signed-off-by: Dietmar Eggemann 
---
 include/uapi/linux/sched.h |  4 +++-
 kernel/sched/core.c| 31 +++
 2 files changed, 30 insertions(+), 5 deletions(-)

diff --git a/include/uapi/linux/sched.h b/include/uapi/linux/sched.h
index 3bac0a8ceab2..0dd890822751 100644
--- a/include/uapi/linux/sched.h
+++ b/include/uapi/linux/sched.h
@@ -132,12 +132,14 @@ struct clone_args {
 #define SCHED_FLAG_KEEP_PARAMS 0x10
 #define SCHED_FLAG_UTIL_CLAMP_MIN  0x20
 #define SCHED_FLAG_UTIL_CLAMP_MAX  0x40
+#define SCHED_FLAG_UTIL_CLAMP_RESET0x80
 
 #define SCHED_FLAG_KEEP_ALL(SCHED_FLAG_KEEP_POLICY | \
 SCHED_FLAG_KEEP_PARAMS)
 
 #define SCHED_FLAG_UTIL_CLAMP  (SCHED_FLAG_UTIL_CLAMP_MIN | \
-  

Re: [PATCH v3 1/1] sched/uclamp: add SCHED_FLAG_UTIL_CLAMP_RESET flag to reset uclamp

2020-10-26 Thread Dietmar Eggemann
On 25/10/2020 08:36, Yun Hsiang wrote:
> If the user wants to stop controlling uclamp and let the task inherit
> the value from the group, we need a method to reset.
> 
> Add SCHED_FLAG_UTIL_CLAMP_RESET flag to allow the user to reset uclamp via
> sched_setattr syscall.
> 
> The policy is
> _CLAMP_RESET   => reset both min and max
> _CLAMP_RESET | _CLAMP_MIN  => reset min value
> _CLAMP_RESET | _CLAMP_MAX  => reset max value
> _CLAMP_RESET | _CLAMP_MIN | _CLAMP_MAX => reset both min and max
> 
> Signed-off-by: Yun Hsiang 

[...]

> @@ -1451,7 +1464,8 @@ static void __setscheduler_uclamp(struct task_struct *p,
>   struct uclamp_se *uc_se = >uclamp_req[clamp_id];
>  
>   /* Keep using defined clamps across class changes */
> - if (uc_se->user_defined)
> + if (flags != SCHED_FLAG_UTIL_CLAMP_RESET &&
> + uc_se->user_defined)
>   continue;

With:

(1) _CLAMP_RESET   => reset both min and max
(2) _CLAMP_RESET | _CLAMP_MIN  => reset min value
(3) _CLAMP_RESET | _CLAMP_MAX  => reset max value
(4) _CLAMP_RESET | _CLAMP_MIN | _CLAMP_MAX => reset both min and max

If you reset an RT task with (1) you don't reset its uclamp.min value.

__uclamp_update_util_min_rt_default(p) doesn't know about
SCHED_FLAG_UTIL_CLAMP_RESET. It only knows user_defined and will bail early.

[...]

> - if (likely(!(attr->sched_flags & SCHED_FLAG_UTIL_CLAMP)))
> + if (likely(!flags || flags == SCHED_FLAG_UTIL_CLAMP_RESET))
>   return;
>  
> - if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MIN) {
> + if (flags & SCHED_FLAG_UTIL_CLAMP_MIN) {
> + if (reset) {
> + clamp_value = __default_uclamp_value(p, UCLAMP_MIN);
> + user_defined = false;
> + } else {
> + clamp_value = attr->sched_util_min;
> + user_defined = true;
> + }

Why do you reset for (1) in the for_each_clamp_id(clamp_id) loop and for
(2)-(4) in the if (flags & SCHED_FLAG_UTIL_CLAMP_MXX) condition?

You could reset (1)-(4) in the for_each_clamp_id(clamp_id) loop? In this
case you wouldn't need __default_uclamp_value().

[...]


Re: [PATCH] sched/fair: prefer prev cpu in asymmetric wakeup path

2020-10-23 Thread Dietmar Eggemann
On 22/10/2020 17:33, Vincent Guittot wrote:
> On Thu, 22 Oct 2020 at 16:53, Valentin Schneider
>  wrote:
>>
>>
>> Hi Vincent,
>>
>> On 22/10/20 14:43, Vincent Guittot wrote:

[...]

>>>  static int
>>> -select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int 
>>> target)
>>> +select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int 
>>> prev, int target)
>>>  {
>>>   unsigned long best_cap = 0;
>>>   int cpu, best_cpu = -1;
>>> @@ -6178,9 +6178,22 @@ select_idle_capacity(struct task_struct *p, struct 
>>> sched_domain *sd, int target)
>>>
>>>   sync_entity_load_avg(>se);
>>>
>>> + if ((available_idle_cpu(target) || sched_idle_cpu(target)) &&
>>> + task_fits_capacity(p, capacity_of(target)))
>>> + return target;
>>> +
>>
>> I think we still need to check for CPU affinity here.
> 
> yes good point

We don't check CPU affinity on target and prev in the symmetric case.

I always thought that since we:

(1) check 'want_affine = ... && cpumask_test_cpu(cpu, p->cpus_ptr);' in
select_task_rq_fair() and

(2) we have the select_fallback_rq() in select_task_rq() for prev

that this would be sufficient?
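
For reference, the two guards being relied on look roughly like this in the
v5.9-era code (quoted from memory, so treat it as a sketch):

/* (1) select_task_rq_fair(): the affine path, and hence target/prev as
 *     candidates, is only taken when the waking CPU is allowed for p.
 */
want_affine = !wake_wide(p) && cpumask_test_cpu(cpu, p->cpus_ptr);

/* (2) select_task_rq(): whatever CPU the class returned (possibly a
 *     no-longer-allowed prev_cpu) is re-validated afterwards.
 */
if (unlikely(!is_cpu_allowed(p, cpu)))
	cpu = select_fallback_rq(task_cpu(p), p);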

[...]

