[tip: sched/core] sched/fair: Reduce long-tail newly idle balance cost

2021-03-23 Thread tip-bot2 for Aubrey Li
The following commit has been merged into the sched/core branch of tip:

Commit-ID: acb4decc1e900468d51b33c5f1ee445278e716a7
Gitweb: https://git.kernel.org/tip/acb4decc1e900468d51b33c5f1ee445278e716a7
Author: Aubrey Li
AuthorDate: Wed, 24 Feb 2021 16:15:49 +08:00
Committer: Peter Zijlstra 
CommitterDate: Tue, 23 Mar 2021 16:01:59 +01:00

sched/fair: Reduce long-tail newly idle balance cost

A long-tail load balance cost is observed on the newly idle path;
it is caused by a race window between the first nr_running check
of the busiest runqueue and its nr_running recheck in detach_tasks.

Before the busiest runqueue is locked, its tasks could be pulled
by other CPUs, and nr_running of the busiest runqueue becomes 1 or
even 0 if the running task becomes idle. This causes detach_tasks
to break out with the LBF_ALL_PINNED flag set, and triggers a
load_balance redo at the same sched_domain level.

In order to find the new busiest sched_group and CPU, load balance will
recompute and update the various load statistics, which eventually leads
to the long-tail load balance cost.

This patch clears the LBF_ALL_PINNED flag for this race condition and
hence reduces the long-tail cost of newly idle balance.

Signed-off-by: Aubrey Li 
Signed-off-by: Peter Zijlstra (Intel) 
Reviewed-by: Vincent Guittot 
Link: https://lkml.kernel.org/r/1614154549-116078-1-git-send-email-aubrey...@intel.com
---
 kernel/sched/fair.c | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index aaa0dfa..6d73bdb 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7687,6 +7687,15 @@ static int detach_tasks(struct lb_env *env)
 
lockdep_assert_held(&env->src_rq->lock);
 
+   /*
+* Source run queue has been emptied by another CPU, clear
+* LBF_ALL_PINNED flag as we will not test any task.
+*/
+   if (env->src_rq->nr_running <= 1) {
+   env->flags &= ~LBF_ALL_PINNED;
+   return 0;
+   }
+
if (env->imbalance <= 0)
return 0;
 


Re: [PATCH 1/6] sched: migration changes for core scheduling

2021-03-22 Thread Li, Aubrey
On 2021/3/22 20:56, Peter Zijlstra wrote:
> On Mon, Mar 22, 2021 at 08:31:09PM +0800, Li, Aubrey wrote:
>> Please let me know if I put cookie match check at the right position
>> in task_hot(), if so, I'll obtain some performance data of it.
>>
>> Thanks,
>> -Aubrey
>>
>> ===
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 7f2fb08..d4bdcf9 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -1912,6 +1912,13 @@ static void task_numa_find_cpu(struct task_numa_env 
>> *env,
>>  if (!cpumask_test_cpu(cpu, env->p->cpus_ptr))
>>  continue;
>>  
>> +/*
>> + * Skip this cpu if source task's cookie does not match
>> + * with CPU's core cookie.
>> + */
>> +if (!sched_core_cookie_match(cpu_rq(cpu), env->p))
>> +continue;
>> +
>>  env->dst_cpu = cpu;
>>  if (task_numa_compare(env, taskimp, groupimp, maymove))
>>  break;
> 
> This one might need a little help too, I've not fully considered NUMA
> balancing though.
> 
I dropped this numa change for now as it may be too strong, too. I'll
do more experiments on this in the next iteration.

The following patch is rebased on top of the queue tree; the cookie check
is moved from can_migrate_task to task_hot.

Please let me know if there are any issues.

Thanks,
-Aubrey
==
From 70d0ed9bab658b0bad60fda73f81b747f20975f0 Mon Sep 17 00:00:00 2001
From: Aubrey Li 
Date: Tue, 23 Mar 2021 03:26:34 +
Subject: [PATCH] sched: migration changes for core scheduling

 - Don't migrate if there is a cookie mismatch
 Load balancing tries to move a task from the busiest CPU to the
 destination CPU. When core scheduling is enabled, if the task's
 cookie does not match the destination CPU's core cookie, the task
 may be skipped by this CPU. This mitigates the forced idle time
 on the destination CPU.

 - Select a cookie-matched idle CPU
 In the fast path of task wakeup, select the first cookie-matched
 idle CPU instead of the first idle CPU.

 - Find a cookie-matched idlest CPU
 In the slow path of task wakeup, find the idlest CPU whose core
 cookie matches the task's cookie.

Signed-off-by: Aubrey Li 
Signed-off-by: Tim Chen 
Signed-off-by: Vineeth Remanan Pillai 
Signed-off-by: Joel Fernandes (Google) 
---
 kernel/sched/fair.c  | 29 ++
 kernel/sched/sched.h | 73 
 2 files changed, 96 insertions(+), 6 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index efde8df2bc35..a74061484194 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5877,11 +5877,15 @@ find_idlest_group_cpu(struct sched_group *group, struct 
task_struct *p, int this
 
/* Traverse only the allowed CPUs */
for_each_cpu_and(i, sched_group_span(group), p->cpus_ptr) {
+   struct rq *rq = cpu_rq(i);
+
+   if (!sched_core_cookie_match(rq, p))
+   continue;
+
if (sched_idle_cpu(i))
return i;
 
if (available_idle_cpu(i)) {
-   struct rq *rq = cpu_rq(i);
struct cpuidle_state *idle = idle_get_state(rq);
if (idle && idle->exit_latency < min_exit_latency) {
/*
@@ -5967,9 +5971,10 @@ static inline int find_idlest_cpu(struct sched_domain 
*sd, struct task_struct *p
return new_cpu;
 }
 
-static inline int __select_idle_cpu(int cpu)
+static inline int __select_idle_cpu(int cpu, struct task_struct *p)
 {
-   if (available_idle_cpu(cpu) || sched_idle_cpu(cpu))
+   if ((available_idle_cpu(cpu) || sched_idle_cpu(cpu)) &&
+   sched_cpu_cookie_match(cpu_rq(cpu), p))
return cpu;
 
return -1;
@@ -6039,7 +6044,7 @@ static int select_idle_core(struct task_struct *p, int 
core, struct cpumask *cpu
int cpu;
 
if (!static_branch_likely(&sched_smt_present))
-   return __select_idle_cpu(core);
+   return __select_idle_cpu(core, p);
 
for_each_cpu(cpu, cpu_smt_mask(core)) {
if (!available_idle_cpu(cpu)) {
@@ -6077,7 +6082,7 @@ static inline bool test_idle_cores(int cpu, bool def)
 
 static inline int select_idle_core(struct task_struct *p, int core, struct 
cpumask *cpus, int *idle_cpu)
 {
-   return __select_idle_cpu(core);
+   return __select_idle_cpu(core, p);
 }
 
 #endif /* CONFIG_SCHED_SMT */
@@ -6130,7 +6135,7 @@ static int select_idle_cpu(struct task_struct *p, struct 
sched_domain *sd, int t
  

Re: [PATCH 1/6] sched: migration changes for core scheduling

2021-03-22 Thread Li, Aubrey
On 2021/3/22 20:56, Peter Zijlstra wrote:
> On Mon, Mar 22, 2021 at 08:31:09PM +0800, Li, Aubrey wrote:
>> Please let me know if I put cookie match check at the right position
>> in task_hot(), if so, I'll obtain some performance data of it.
>>
>> Thanks,
>> -Aubrey
>>
>> ===
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 7f2fb08..d4bdcf9 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -1912,6 +1912,13 @@ static void task_numa_find_cpu(struct task_numa_env 
>> *env,
>>  if (!cpumask_test_cpu(cpu, env->p->cpus_ptr))
>>  continue;
>>  
>> +/*
>> + * Skip this cpu if source task's cookie does not match
>> + * with CPU's core cookie.
>> + */
>> +if (!sched_core_cookie_match(cpu_rq(cpu), env->p))
>> +continue;
>> +
>>  env->dst_cpu = cpu;
>>  if (task_numa_compare(env, taskimp, groupimp, maymove))
>>  break;
> 
> This one might need a little help too, I've not fully considered NUMA
> balancing though.
> 
>> @@ -6109,7 +6120,9 @@ static int select_idle_cpu(struct task_struct *p, 
>> struct sched_domain *sd, int t
>>  for_each_cpu_wrap(cpu, cpus, target) {
>>  if (!--nr)
>>  return -1;
>> -if (available_idle_cpu(cpu) || sched_idle_cpu(cpu))
>> +
>> +if ((available_idle_cpu(cpu) || sched_idle_cpu(cpu)) &&
>> +sched_cpu_cookie_match(cpu_rq(cpu), p))
>>  break;
>>  }
>>  
> 
> This doesn't even apply... That code has changed.
> 
>> @@ -7427,6 +7440,14 @@ static int task_hot(struct task_struct *p, struct 
>> lb_env *env)
>>  
>>  if (sysctl_sched_migration_cost == -1)
>>  return 1;
>> +
>> +/*
>> + * Don't migrate task if the task's cookie does not match
>> + * with the destination CPU's core cookie.
>> + */
>> +if (!sched_core_cookie_match(cpu_rq(env->dst_cpu), p))
>> +return 1;
>> +
>>  if (sysctl_sched_migration_cost == 0)
>>  return 0;
>>  
> 
> Should work I think, but you've put it in a weird spot for breaking up
> that sysctl_sched_migration_cost thing. I'd have put it either in front
> or after that other SMT thing we have there.
> 

I did it on purpose.

If the migration cost is huge, the task should not migrate no matter
whether the cookie matches or not, so the check has to come after the
sysctl_sched_migration_cost == -1 test.

And if the migration cost is 0, or delta < migration cost, the task can be
migrated, but before migrating we need to check whether the cookie matches,
so the check has to come before the sysctl_sched_migration_cost == 0 test.
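
To illustrate the ordering, here is a minimal stand-alone sketch of the
resulting sequence of checks; the function and parameter names are stand-ins
for illustration, not the actual kernel code:

	/*
	 * Sketch of the ordering argued above: the "never migrate" case (-1)
	 * wins first, a cookie mismatch then reports the task as hot, the
	 * "always migratable" case (0) comes next, then the usual delta test.
	 */
	static int task_hot_order_sketch(long long migration_cost, int cookie_match,
					 long long delta)
	{
		if (migration_cost == -1)
			return 1;	/* never migrate, cookie is irrelevant */

		if (!cookie_match)
			return 1;	/* mismatch: treated as cache-hot, so migration is normally avoided */

		if (migration_cost == 0)
			return 0;	/* always migratable once the cookie matches */

		return delta < migration_cost;	/* normal cache-hot test */
	}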

Please correct me if I was wrong.

Thanks,
-Aubrey


Re: [PATCH 1/6] sched: migration changes for core scheduling

2021-03-22 Thread Li, Aubrey
On 2021/3/22 16:57, Peter Zijlstra wrote:

> 
>> Do you have any suggestions before we drop it?
> 
> Yeah, how about you make it part of task_hot() ? Have task_hot() refuse
> migration if the cookie doesn't match.
> 
> task_hot() is a hint and will get ignored when appropriate.
> 

Please let me know if I put cookie match check at the right position
in task_hot(), if so, I'll obtain some performance data of it.

Thanks,
-Aubrey

===
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 7f2fb08..d4bdcf9 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1912,6 +1912,13 @@ static void task_numa_find_cpu(struct task_numa_env *env,
if (!cpumask_test_cpu(cpu, env->p->cpus_ptr))
continue;
 
+   /*
+* Skip this cpu if source task's cookie does not match
+* with CPU's core cookie.
+*/
+   if (!sched_core_cookie_match(cpu_rq(cpu), env->p))
+   continue;
+
env->dst_cpu = cpu;
if (task_numa_compare(env, taskimp, groupimp, maymove))
break;
@@ -5847,11 +5854,15 @@ find_idlest_group_cpu(struct sched_group *group, struct 
task_struct *p, int this
 
/* Traverse only the allowed CPUs */
for_each_cpu_and(i, sched_group_span(group), p->cpus_ptr) {
+   struct rq *rq = cpu_rq(i);
+
+   if (!sched_core_cookie_match(rq, p))
+   continue;
+
if (sched_idle_cpu(i))
return i;
 
if (available_idle_cpu(i)) {
-   struct rq *rq = cpu_rq(i);
struct cpuidle_state *idle = idle_get_state(rq);
if (idle && idle->exit_latency < min_exit_latency) {
/*
@@ -6109,7 +6120,9 @@ static int select_idle_cpu(struct task_struct *p, struct 
sched_domain *sd, int t
for_each_cpu_wrap(cpu, cpus, target) {
if (!--nr)
return -1;
-   if (available_idle_cpu(cpu) || sched_idle_cpu(cpu))
+
+   if ((available_idle_cpu(cpu) || sched_idle_cpu(cpu)) &&
+   sched_cpu_cookie_match(cpu_rq(cpu), p))
break;
}
 
@@ -7427,6 +7440,14 @@ static int task_hot(struct task_struct *p, struct lb_env 
*env)
 
if (sysctl_sched_migration_cost == -1)
return 1;
+
+   /*
+* Don't migrate task if the task's cookie does not match
+* with the destination CPU's core cookie.
+*/
+   if (!sched_core_cookie_match(cpu_rq(env->dst_cpu), p))
+   return 1;
+
if (sysctl_sched_migration_cost == 0)
return 0;
 
@@ -8771,6 +8792,10 @@ find_idlest_group(struct sched_domain *sd, struct 
task_struct *p, int this_cpu)
p->cpus_ptr))
continue;
 
+   /* Skip over this group if no cookie matched */
+   if (!sched_group_cookie_match(cpu_rq(this_cpu), p, group))
+   continue;
+
local_group = cpumask_test_cpu(this_cpu,
   sched_group_span(group));
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index f094435..13254ea 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1093,6 +1093,7 @@ static inline int cpu_of(struct rq *rq)
 
 #ifdef CONFIG_SCHED_CORE
 DECLARE_STATIC_KEY_FALSE(__sched_core_enabled);
+static inline struct cpumask *sched_group_span(struct sched_group *sg);
 
 static inline bool sched_core_enabled(struct rq *rq)
 {
@@ -1109,6 +1110,61 @@ static inline raw_spinlock_t *rq_lockp(struct rq *rq)
 
 bool cfs_prio_less(struct task_struct *a, struct task_struct *b, bool fi);
 
+/*
+ * Helpers to check if the CPU's core cookie matches with the task's cookie
+ * when core scheduling is enabled.
+ * A special case is that the task's cookie always matches with CPU's core
+ * cookie if the CPU is in an idle core.
+ */
+static inline bool sched_cpu_cookie_match(struct rq *rq, struct task_struct *p)
+{
+   /* Ignore cookie match if core scheduler is not enabled on the CPU. */
+   if (!sched_core_enabled(rq))
+   return true;
+
+   return rq->core->core_cookie == p->core_cookie;
+}
+
+static inline bool sched_core_cookie_match(struct rq *rq, struct task_struct 
*p)
+{
+   bool idle_core = true;
+   int cpu;
+
+   /* Ignore cookie match if core scheduler is not enabled on the CPU. */
+   if (!sched_core_enabled(rq))
+   return true;
+
+   for_each_cpu(cpu, cpu_smt_mask(cpu_of(rq))) {
+   if (!available_idle_cpu(cpu)) {
+   idle_core = false;
+   break;
+
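
The archived message is truncated here. Judging from the helper's own comment
above (a CPU in an idle core always matches), the loop and the function
presumably finish along the following lines; this is an illustrative
completion, not the archived text:

+   }
+   }
+
+   /*
+    * A CPU in an idle core is treated as always matching, so any
+    * cookie is allowed to start there.
+    */
+   return idle_core || rq->core->core_cookie == p->core_cookie;
+}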

Re: [PATCH 1/6] sched: migration changes for core scheduling

2021-03-22 Thread Li, Aubrey
On 2021/3/22 15:48, Peter Zijlstra wrote:
> On Sun, Mar 21, 2021 at 09:34:00PM +0800, Li, Aubrey wrote:
>> Hi Peter,
>>
>> On 2021/3/20 23:34, Peter Zijlstra wrote:
>>> On Fri, Mar 19, 2021 at 04:32:48PM -0400, Joel Fernandes (Google) wrote:
>>>> @@ -7530,8 +7543,9 @@ int can_migrate_task(struct task_struct *p, struct 
>>>> lb_env *env)
>>>> * We do not migrate tasks that are:
>>>> * 1) throttled_lb_pair, or
>>>> * 2) cannot be migrated to this CPU due to cpus_ptr, or
>>>> -   * 3) running (obviously), or
>>>> -   * 4) are cache-hot on their current CPU.
>>>> +   * 3) task's cookie does not match with this CPU's core cookie
>>>> +   * 4) running (obviously), or
>>>> +   * 5) are cache-hot on their current CPU.
>>>> */
>>>>if (throttled_lb_pair(task_group(p), env->src_cpu, env->dst_cpu))
>>>>return 0;
>>>> @@ -7566,6 +7580,13 @@ int can_migrate_task(struct task_struct *p, struct 
>>>> lb_env *env)
>>>>return 0;
>>>>}
>>>>  
>>>> +  /*
>>>> +   * Don't migrate task if the task's cookie does not match
>>>> +   * with the destination CPU's core cookie.
>>>> +   */
>>>> +  if (!sched_core_cookie_match(cpu_rq(env->dst_cpu), p))
>>>> +  return 0;
>>>> +
>>>>/* Record that we found atleast one task that could run on dst_cpu */
>>>>env->flags &= ~LBF_ALL_PINNED;
>>>>  
>>>
>>> This one is too strong.. persistent imbalance should be able to override
>>> it.
>>>
>>
>> IIRC, this change can avoid the following scenario:
>>
>> One sysbench cpu thread(cookieA) and sysbench mysql thread(cookieB) running
>> on the two siblings of core_1, the other sysbench cpu thread(cookieA) and
>> sysbench mysql thread(cookieB) running on the two siblings of core2, which
>> causes 50% force idle.
>>
>> This is not an imbalance case.
> 
> But suppose there is an imbalance; then this cookie crud can forever
> stall balance.
> 
> Imagine this cpu running a while(1); with a unique cookie on, then it
> will _never_ accept other tasks == BAD.
> 

How about putting the following check in sched_core_cookie_match()?

+   /*
+* Ignore cookie match if there is a big imbalance between the src rq
+* and dst rq.
+*/
+   if ((src_rq->cfs.h_nr_running - rq->cfs.h_nr_running) > 1)
+   return true;

This change has a significant impact on my sysbench cpu+mysql colocation.

- with this change,
  sysbench cpu tput = 2796 events/s, sysbench mysql = 1315 events/s

- without it,
  sysbench cpu tput = 3513 events/s, sysbench mysql = 646 events/s

Do you have any suggestions before we drop it?

Thanks,
-Aubrey


Re: [PATCH] sched/fair: remove redundant test_idle_cores for non-smt

2021-03-21 Thread Li, Aubrey
Hi Barry,

On 2021/3/21 6:14, Barry Song wrote:
> update_idle_core() is only done for the case of sched_smt_present.
> but test_idle_cores() is done for all machines even those without
> smt.

The patch looks good to me.
May I know in what case we need to keep CONFIG_SCHED_SMT for non-SMT
machines?

Thanks,
-Aubrey


> this could contribute to up to 8%+ hackbench performance loss on a
> machine like kunpeng 920 which has no smt. this patch removes the
> redundant test_idle_cores() for non-smt machines.
> 
> we run the below hackbench with different -g parameter from 2 to
> 14, for each different g, we run the command 10 times and get the
> average time:
> $ numactl -N 0 hackbench -p -T -l 2 -g $1
> 
> hackbench will report the time which is needed to complete a certain
> number of messages transmissions between a certain number of tasks,
> for example:
> $ numactl -N 0 hackbench -p -T -l 2 -g 10
> Running in threaded mode with 10 groups using 40 file descriptors each
> (== 400 tasks)
> Each sender will pass 2 messages of 100 bytes
> 
> The below is the result of hackbench w/ and w/o this patch:
> g=2  4 6   8  10 12  14
> w/o: 1.8151 3.8499 5.5142 7.2491 9.0340 10.7345 12.0929
> w/ : 1.8428 3.7436 5.4501 6.9522 8.2882  9.9535 11.3367
>   +4.1%  +8.3%  +7.3%   +6.3%
> 
> Signed-off-by: Barry Song 
> ---
>  kernel/sched/fair.c | 8 +---
>  1 file changed, 5 insertions(+), 3 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 2e2ab1e..de42a32 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6038,9 +6038,11 @@ static inline bool test_idle_cores(int cpu, bool def)
>  {
>   struct sched_domain_shared *sds;
>  
> - sds = rcu_dereference(per_cpu(sd_llc_shared, cpu));
> - if (sds)
> - return READ_ONCE(sds->has_idle_cores);
> + if (static_branch_likely(&sched_smt_present)) {
> + sds = rcu_dereference(per_cpu(sd_llc_shared, cpu));
> + if (sds)
> + return READ_ONCE(sds->has_idle_cores);
> + }
>  
>   return def;
>  }
> 



Re: [PATCH 1/6] sched: migration changes for core scheduling

2021-03-21 Thread Li, Aubrey
Hi Peter,

On 2021/3/20 23:34, Peter Zijlstra wrote:
> On Fri, Mar 19, 2021 at 04:32:48PM -0400, Joel Fernandes (Google) wrote:
>> @@ -7530,8 +7543,9 @@ int can_migrate_task(struct task_struct *p, struct 
>> lb_env *env)
>>   * We do not migrate tasks that are:
>>   * 1) throttled_lb_pair, or
>>   * 2) cannot be migrated to this CPU due to cpus_ptr, or
>> - * 3) running (obviously), or
>> - * 4) are cache-hot on their current CPU.
>> + * 3) task's cookie does not match with this CPU's core cookie
>> + * 4) running (obviously), or
>> + * 5) are cache-hot on their current CPU.
>>   */
>>  if (throttled_lb_pair(task_group(p), env->src_cpu, env->dst_cpu))
>>  return 0;
>> @@ -7566,6 +7580,13 @@ int can_migrate_task(struct task_struct *p, struct 
>> lb_env *env)
>>  return 0;
>>  }
>>  
>> +/*
>> + * Don't migrate task if the task's cookie does not match
>> + * with the destination CPU's core cookie.
>> + */
>> +if (!sched_core_cookie_match(cpu_rq(env->dst_cpu), p))
>> +return 0;
>> +
>>  /* Record that we found atleast one task that could run on dst_cpu */
>>  env->flags &= ~LBF_ALL_PINNED;
>>  
> 
> This one is too strong.. persistent imbalance should be able to override
> it.
> 

IIRC, this change can avoid the following scenario:

One sysbench cpu thread(cookieA) and sysbench mysql thread(cookieB) running
on the two siblings of core_1, the other sysbench cpu thread(cookieA) and
sysbench mysql thread(cookieB) running on the two siblings of core2, which
causes 50% force idle.

This is not an imbalance case.

Thanks,
-Aubrey


[PATCH v10] sched/fair: select idle cpu from idle cpumask for task wakeup

2021-03-15 Thread Aubrey Li
From: Aubrey Li 

Add an idle cpumask to track idle cpus in the sched domain. Every time
a CPU enters idle, the CPU is set in the idle cpumask to be a wakeup
target. And if the CPU is not idle, the CPU is cleared from the idle
cpumask during the scheduler tick, to rate-limit idle cpumask updates.

When a task wakes up to select an idle cpu, scanning the idle cpumask
has a lower cost than scanning all the cpus in the last level cache
domain, especially when the system is heavily loaded.
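
As a rough illustration of the set/clear policy described above (the actual
update_idle_cpumask() body is truncated further down in this archive, so the
following is a sketch assuming only the declarations visible in the diff,
such as rq->last_idle_state and sds_idle_cpus(), not the patch itself):

	void update_idle_cpumask(int cpu, bool idle)
	{
		struct sched_domain_shared *sds;
		struct rq *rq = cpu_rq(cpu);

		/* Rate limit: nothing to do if the idle state did not change. */
		if (rq->last_idle_state == idle)
			return;

		rcu_read_lock();
		sds = rcu_dereference(per_cpu(sd_llc_shared, cpu));
		if (sds) {
			if (idle)
				cpumask_set_cpu(cpu, sds_idle_cpus(sds));
			else
				cpumask_clear_cpu(cpu, sds_idle_cpus(sds));
		}
		rcu_read_unlock();

		rq->last_idle_state = idle;
	}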

v9->v10:
- Update scan cost only when the idle cpumask is scanned, i.e, the
  idle cpumask is not empty

v8->v9:
- rebase on top of tip/sched/core, no code change

v7->v8:
- refine update_idle_cpumask, no functionality change
- fix a suspicious RCU usage warning with CONFIG_PROVE_RCU=y

v6->v7:
- place the whole idle cpumask mechanism under CONFIG_SMP

v5->v6:
- decouple idle cpumask update from stop_tick signal, set idle CPU
  in idle cpumask every time the CPU enters idle

v4->v5:
- add update_idle_cpumask for s2idle case
- keep the same ordering of tick_nohz_idle_stop_tick() and update_
  idle_cpumask() everywhere

v3->v4:
- change setting idle cpumask from every idle entry to tickless idle
  if cpu driver is available
- move clearing idle cpumask to scheduler_tick to decouple nohz mode

v2->v3:
- change setting idle cpumask to every idle entry, otherwise schbench
  has a regression of 99th percentile latency
- change clearing idle cpumask to nohz_balancer_kick(), so updating
  idle cpumask is ratelimited in the idle exiting path
- set SCHED_IDLE cpu in idle cpumask to allow it as a wakeup target

v1->v2:
- idle cpumask is updated in the nohz routines, by initializing idle
  cpumask with sched_domain_span(sd), nohz=off case remains the original
  behavior

Cc: Peter Zijlstra 
Cc: Mel Gorman 
Cc: Vincent Guittot 
Cc: Qais Yousef 
Cc: Valentin Schneider 
Cc: Jiang Biao 
Cc: Tim Chen 
Signed-off-by: Aubrey Li 
---
 include/linux/sched/topology.h | 13 
 kernel/sched/core.c|  2 ++
 kernel/sched/fair.c| 47 --
 kernel/sched/idle.c|  5 +
 kernel/sched/sched.h   |  4 
 kernel/sched/topology.c|  3 ++-
 6 files changed, 71 insertions(+), 3 deletions(-)

diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index 8f0f778..905e382 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -74,8 +74,21 @@ struct sched_domain_shared {
atomic_tref;
atomic_tnr_busy_cpus;
int has_idle_cores;
+   /*
+* Span of all idle CPUs in this domain.
+*
+* NOTE: this field is variable length. (Allocated dynamically
+* by attaching extra space to the end of the structure,
+* depending on how many CPUs the kernel has booted up with)
+*/
+   unsigned long   idle_cpus_span[];
 };
 
+static inline struct cpumask *sds_idle_cpus(struct sched_domain_shared *sds)
+{
+   return to_cpumask(sds->idle_cpus_span);
+}
+
 struct sched_domain {
/* These fields must be setup */
struct sched_domain __rcu *parent;  /* top domain must be null 
terminated */
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index ca2bb62..310bf9a 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4552,6 +4552,7 @@ void scheduler_tick(void)
 
 #ifdef CONFIG_SMP
rq->idle_balance = idle_cpu(cpu);
+   update_idle_cpumask(cpu, rq->idle_balance);
trigger_load_balance(rq);
 #endif
 }
@@ -8209,6 +8210,7 @@ void __init sched_init(void)
rq->idle_stamp = 0;
rq->avg_idle = 2*sysctl_sched_migration_cost;
rq->max_idle_balance_cost = sysctl_sched_migration_cost;
+   rq->last_idle_state = 1;
 
INIT_LIST_HEAD(&rq->cfs_tasks);
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 794c2cb..24384b4 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6134,7 +6134,12 @@ static int select_idle_cpu(struct task_struct *p, struct 
sched_domain *sd, int t
if (!this_sd)
return -1;
 
-   cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
+   /*
+* sched_domain_shared is set only at shared cache level,
+* this works only because select_idle_cpu is called with
+* sd_llc.
+*/
+   cpumask_and(cpus, sds_idle_cpus(sd->shared), p->cpus_ptr);
 
if (sched_feat(SIS_PROP) && !smt) {
u64 avg_cost, avg_idle, span_avg;
@@ -6173,7 +6178,7 @@ static int select_idle_cpu(struct task_struct *p, struct 
sched_domain *sd, int t
if (smt)
set_idle_cores(this, false);
 
-   if (sched_feat(SIS_PROP) && !smt) {
+   if (sched_feat(SIS_PROP) && !smt && (cpu < nr_cpumask_bits)) {
time = cpu_clock(this) - time;
  

Re: [PATCH v2] sched/fair: reduce long-tail newly idle balance cost

2021-03-15 Thread Li, Aubrey
On 2021/2/24 16:15, Aubrey Li wrote:
> A long-tail load balance cost is observed on the newly idle path,
> this is caused by a race window between the first nr_running check
> of the busiest runqueue and its nr_running recheck in detach_tasks.
> 
> Before the busiest runqueue is locked, the tasks on the busiest
> runqueue could be pulled by other CPUs and nr_running of the busiest
> runqueu becomes 1 or even 0 if the running task becomes idle, this
> causes detach_tasks breaks with LBF_ALL_PINNED flag set, and triggers
> load_balance redo at the same sched_domain level.
> 
> In order to find the new busiest sched_group and CPU, load balance will
> recompute and update the various load statistics, which eventually leads
> to the long-tail load balance cost.
> 
> This patch clears LBF_ALL_PINNED flag for this race condition, and hence
> reduces the long-tail cost of newly idle balance.

Ping...

> 
> Cc: Vincent Guittot 
> Cc: Mel Gorman 
> Cc: Andi Kleen 
> Cc: Tim Chen 
> Cc: Srinivas Pandruvada 
> Cc: Rafael J. Wysocki 
> Signed-off-by: Aubrey Li 
> ---
>  kernel/sched/fair.c | 9 +
>  1 file changed, 9 insertions(+)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 04a3ce2..5c67804 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -7675,6 +7675,15 @@ static int detach_tasks(struct lb_env *env)
>  
> 	lockdep_assert_held(&env->src_rq->lock);
>  
> + /*
> +  * Source run queue has been emptied by another CPU, clear
> +  * LBF_ALL_PINNED flag as we will not test any task.
> +  */
> + if (env->src_rq->nr_running <= 1) {
> + env->flags &= ~LBF_ALL_PINNED;
> + return 0;
> + }
> +
>   if (env->imbalance <= 0)
>   return 0;
>  
> 



[PATCH v9 1/2] sched/fair: select idle cpu from idle cpumask for task wakeup

2021-03-09 Thread Aubrey Li
From: Aubrey Li 

Add an idle cpumask to track idle cpus in the sched domain. Every time
a CPU enters idle, the CPU is set in the idle cpumask to be a wakeup
target. And if the CPU is not idle, the CPU is cleared from the idle
cpumask during the scheduler tick, to rate-limit idle cpumask updates.

When a task wakes up to select an idle cpu, scanning the idle cpumask
has a lower cost than scanning all the cpus in the last level cache
domain, especially when the system is heavily loaded.

v8->v9:
- rebase on top of tip/sched/core, no functionality change

v7->v8:
- refine update_idle_cpumask, no functionality change
- fix a suspicious RCU usage warning with CONFIG_PROVE_RCU=y

v6->v7:
- place the whole idle cpumask mechanism under CONFIG_SMP

v5->v6:
- decouple idle cpumask update from stop_tick signal, set idle CPU
  in idle cpumask every time the CPU enters idle

v4->v5:
- add update_idle_cpumask for s2idle case
- keep the same ordering of tick_nohz_idle_stop_tick() and update_
  idle_cpumask() everywhere

v3->v4:
- change setting idle cpumask from every idle entry to tickless idle
  if cpu driver is available
- move clearing idle cpumask to scheduler_tick to decouple nohz mode

v2->v3:
- change setting idle cpumask to every idle entry, otherwise schbench
  has a regression of 99th percentile latency
- change clearing idle cpumask to nohz_balancer_kick(), so updating
  idle cpumask is ratelimited in the idle exiting path
- set SCHED_IDLE cpu in idle cpumask to allow it as a wakeup target

v1->v2:
- idle cpumask is updated in the nohz routines, by initializing idle
  cpumask with sched_domain_span(sd), nohz=off case remains the original
  behavior

Cc: Peter Zijlstra 
Cc: Mel Gorman 
Cc: Vincent Guittot 
Cc: Qais Yousef 
Cc: Valentin Schneider 
Cc: Jiang Biao 
Cc: Tim Chen 
Signed-off-by: Aubrey Li 
---
 include/linux/sched/topology.h | 13 
 kernel/sched/core.c|  2 ++
 kernel/sched/fair.c| 45 +-
 kernel/sched/idle.c|  5 +
 kernel/sched/sched.h   |  4 
 kernel/sched/topology.c|  3 ++-
 6 files changed, 70 insertions(+), 2 deletions(-)

diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index 8f0f778..905e382 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -74,8 +74,21 @@ struct sched_domain_shared {
atomic_tref;
atomic_tnr_busy_cpus;
int has_idle_cores;
+   /*
+* Span of all idle CPUs in this domain.
+*
+* NOTE: this field is variable length. (Allocated dynamically
+* by attaching extra space to the end of the structure,
+* depending on how many CPUs the kernel has booted up with)
+*/
+   unsigned long   idle_cpus_span[];
 };
 
+static inline struct cpumask *sds_idle_cpus(struct sched_domain_shared *sds)
+{
+   return to_cpumask(sds->idle_cpus_span);
+}
+
 struct sched_domain {
/* These fields must be setup */
struct sched_domain __rcu *parent;  /* top domain must be null 
terminated */
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index ca2bb62..310bf9a 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4552,6 +4552,7 @@ void scheduler_tick(void)
 
 #ifdef CONFIG_SMP
rq->idle_balance = idle_cpu(cpu);
+   update_idle_cpumask(cpu, rq->idle_balance);
trigger_load_balance(rq);
 #endif
 }
@@ -8209,6 +8210,7 @@ void __init sched_init(void)
rq->idle_stamp = 0;
rq->avg_idle = 2*sysctl_sched_migration_cost;
rq->max_idle_balance_cost = sysctl_sched_migration_cost;
+   rq->last_idle_state = 1;
 
INIT_LIST_HEAD(&rq->cfs_tasks);
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 794c2cb..15d23d2 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6134,7 +6134,12 @@ static int select_idle_cpu(struct task_struct *p, struct 
sched_domain *sd, int t
if (!this_sd)
return -1;
 
-   cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
+   /*
+* sched_domain_shared is set only at shared cache level,
+* this works only because select_idle_cpu is called with
+* sd_llc.
+*/
+   cpumask_and(cpus, sds_idle_cpus(sd->shared), p->cpus_ptr);
 
if (sched_feat(SIS_PROP) && !smt) {
u64 avg_cost, avg_idle, span_avg;
@@ -6838,6 +6843,44 @@ balance_fair(struct rq *rq, struct task_struct *prev, 
struct rq_flags *rf)
 
return newidle_balance(rq, rf) != 0;
 }
+
+/*
+ * Update cpu idle state and record this information
+ * in sd_llc_shared->idle_cpus_span.
+ *
+ * This function is called with interrupts disabled.
+ */
+void update_idle_cpumask(int cpu, bool idle)
+{
+   struct sched_domain *sd;
+   struct rq *rq = cpu_rq(cpu);
+   int idle_state;
+
+   /*
+ 

[PATCH v9 2/2] sched/fair: Remove SIS_PROP

2021-03-09 Thread Aubrey Li
From: Aubrey Li 

Scanning for an idle cpu from the idle cpumask avoids superfluous scans
of the LLC domain, as the first bit in the idle cpumask is the target.
Since the selected target could become busy, the idle check is kept,
but the SIS_PROP feature becomes meaningless, so remove the
avg_scan_cost computation as well.

Cc: Peter Zijlstra 
Cc: Mel Gorman 
Cc: Vincent Guittot 
Cc: Qais Yousef 
Cc: Valentin Schneider 
Cc: Tim Chen 
Cc: Jiang Biao 
Signed-off-by: Aubrey Li 
---
 include/linux/sched/topology.h |  2 --
 kernel/sched/fair.c| 33 ++---
 kernel/sched/features.h|  5 -
 3 files changed, 2 insertions(+), 38 deletions(-)

diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index 905e382..2a37596 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -113,8 +113,6 @@ struct sched_domain {
u64 max_newidle_lb_cost;
unsigned long next_decay_max_lb_cost;
 
-   u64 avg_scan_cost;  /* select_idle_sibling */
-
 #ifdef CONFIG_SCHEDSTATS
/* load_balance() stats */
unsigned int lb_count[CPU_MAX_IDLE_TYPES];
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 15d23d2..6236822 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6117,18 +6117,15 @@ static inline int select_idle_core(struct task_struct 
*p, int core, struct cpuma
 #endif /* CONFIG_SCHED_SMT */
 
 /*
- * Scan the LLC domain for idle CPUs; this is dynamically regulated by
- * comparing the average scan cost (tracked in sd->avg_scan_cost) against the
- * average idle time for this rq (as found in rq->avg_idle).
+ * Scan idle cpumask in the LLC domain for idle CPUs
  */
 static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int 
target)
 {
struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
-   int i, cpu, idle_cpu = -1, nr = INT_MAX;
+   int i, cpu, idle_cpu = -1;
bool smt = test_idle_cores(target, false);
int this = smp_processor_id();
struct sched_domain *this_sd;
-   u64 time;
 
this_sd = rcu_dereference(*this_cpu_ptr(&sd_llc));
if (!this_sd)
@@ -6141,25 +6138,6 @@ static int select_idle_cpu(struct task_struct *p, struct 
sched_domain *sd, int t
 */
cpumask_and(cpus, sds_idle_cpus(sd->shared), p->cpus_ptr);
 
-   if (sched_feat(SIS_PROP) && !smt) {
-   u64 avg_cost, avg_idle, span_avg;
-
-   /*
-* Due to large variance we need a large fuzz factor;
-* hackbench in particularly is sensitive here.
-*/
-   avg_idle = this_rq()->avg_idle / 512;
-   avg_cost = this_sd->avg_scan_cost + 1;
-
-   span_avg = sd->span_weight * avg_idle;
-   if (span_avg > 4*avg_cost)
-   nr = div_u64(span_avg, avg_cost);
-   else
-   nr = 4;
-
-   time = cpu_clock(this);
-   }
-
for_each_cpu_wrap(cpu, cpus, target) {
if (smt) {
i = select_idle_core(p, cpu, cpus, &idle_cpu);
@@ -6167,8 +6145,6 @@ static int select_idle_cpu(struct task_struct *p, struct 
sched_domain *sd, int t
return i;
 
} else {
-   if (!--nr)
-   return -1;
idle_cpu = __select_idle_cpu(cpu);
if ((unsigned int)idle_cpu < nr_cpumask_bits)
break;
@@ -6178,11 +6154,6 @@ static int select_idle_cpu(struct task_struct *p, struct 
sched_domain *sd, int t
if (smt)
set_idle_cores(this, false);
 
-   if (sched_feat(SIS_PROP) && !smt) {
-   time = cpu_clock(this) - time;
-   update_avg(&this_sd->avg_scan_cost, time);
-   }
-
return idle_cpu;
 }
 
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index 1bc2b15..267aa774 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -53,11 +53,6 @@ SCHED_FEAT(NONTASK_CAPACITY, true)
 SCHED_FEAT(TTWU_QUEUE, true)
 
 /*
- * When doing wakeups, attempt to limit superfluous scans of the LLC domain.
- */
-SCHED_FEAT(SIS_PROP, true)
-
-/*
  * Issue a WARN when we do multiple update_rq_clock() calls
  * in a single rq->lock section. Default disabled because the
  * annotations are not complete.
-- 
2.7.4



Re: [RFC PATCH v8] sched/fair: select idle cpu from idle cpumask for task wakeup

2021-03-08 Thread Li, Aubrey
On 2021/3/8 19:30, Vincent Guittot wrote:
> Hi Aubrey,
> 
> On Thu, 4 Mar 2021 at 14:51, Li, Aubrey  wrote:
>>
>> Hi Peter,
>>
>> On 2020/12/11 23:07, Vincent Guittot wrote:
>>> On Thu, 10 Dec 2020 at 02:44, Aubrey Li  wrote:
>>>>
>>>> Add idle cpumask to track idle cpus in sched domain. Every time
>>>> a CPU enters idle, the CPU is set in idle cpumask to be a wakeup
>>>> target. And if the CPU is not in idle, the CPU is cleared in idle
>>>> cpumask during scheduler tick to ratelimit idle cpumask update.
>>>>
>>>> When a task wakes up to select an idle cpu, scanning idle cpumask
>>>> has lower cost than scanning all the cpus in last level cache domain,
>>>> especially when the system is heavily loaded.
>>>>
>>>> Benchmarks including hackbench, schbench, uperf, sysbench mysql and
>>>> kbuild have been tested on a x86 4 socket system with 24 cores per
>>>> socket and 2 hyperthreads per core, total 192 CPUs, no regression
>>>> found.
>>>>
>> snip
>>>>
>>>> Cc: Peter Zijlstra 
>>>> Cc: Mel Gorman 
>>>> Cc: Vincent Guittot 
>>>> Cc: Qais Yousef 
>>>> Cc: Valentin Schneider 
>>>> Cc: Jiang Biao 
>>>> Cc: Tim Chen 
>>>> Signed-off-by: Aubrey Li 
>>>
>>> This version looks good to me. I don't see regressions of v5 anymore
>>> and see some improvements on heavy cases
>>>
>>> Reviewed-by: Vincent Guittot 
>>
>> May I know your thoughts about this patch?
>> Is it cpumask operation potentially too expensive to be here?
> 
> Could you rebase your patch ? It doesn't apply anymore on
> tip/sched/core due to recent changes

Okay, I'll rebase it and send a v9 out soon, thanks Vincent.

> 
>>
>> Thanks,
>> -Aubrey
>>>
>>>> ---
>>>>  include/linux/sched/topology.h | 13 ++
>>>>  kernel/sched/core.c|  2 ++
>>>>  kernel/sched/fair.c| 45 +-
>>>>  kernel/sched/idle.c|  5 
>>>>  kernel/sched/sched.h   |  4 +++
>>>>  kernel/sched/topology.c|  3 ++-
>>>>  6 files changed, 70 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/include/linux/sched/topology.h 
>>>> b/include/linux/sched/topology.h
>>>> index 820511289857..b47b85163607 100644
>>>> --- a/include/linux/sched/topology.h
>>>> +++ b/include/linux/sched/topology.h
>>>> @@ -65,8 +65,21 @@ struct sched_domain_shared {
>>>> atomic_tref;
>>>> atomic_tnr_busy_cpus;
>>>> int has_idle_cores;
>>>> +   /*
>>>> +* Span of all idle CPUs in this domain.
>>>> +*
>>>> +* NOTE: this field is variable length. (Allocated dynamically
>>>> +* by attaching extra space to the end of the structure,
>>>> +* depending on how many CPUs the kernel has booted up with)
>>>> +*/
>>>> +   unsigned long   idle_cpus_span[];
>>>>  };
>>>>
>>>> +static inline struct cpumask *sds_idle_cpus(struct sched_domain_shared 
>>>> *sds)
>>>> +{
>>>> +   return to_cpumask(sds->idle_cpus_span);
>>>> +}
>>>> +
>>>>  struct sched_domain {
>>>> /* These fields must be setup */
>>>> struct sched_domain __rcu *parent;  /* top domain must be null 
>>>> terminated */
>>>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>>>> index c4da7e17b906..b136e2440ea4 100644
>>>> --- a/kernel/sched/core.c
>>>> +++ b/kernel/sched/core.c
>>>> @@ -4011,6 +4011,7 @@ void scheduler_tick(void)
>>>>
>>>>  #ifdef CONFIG_SMP
>>>> rq->idle_balance = idle_cpu(cpu);
>>>> +   update_idle_cpumask(cpu, rq->idle_balance);
>>>> trigger_load_balance(rq);
>>>>  #endif
>>>>  }
>>>> @@ -7186,6 +7187,7 @@ void __init sched_init(void)
>>>> rq->idle_stamp = 0;
>>>> rq->avg_idle = 2*sysctl_sched_migration_cost;
>>>> rq->max_idle_balance_cost = sysctl_sched_migration_cost;
>>>> +   rq->last_idle_state = 1;
>>>>
>>>>   

Re: [RFC PATCH v8] sched/fair: select idle cpu from idle cpumask for task wakeup

2021-03-04 Thread Li, Aubrey
Hi Peter,

On 2020/12/11 23:07, Vincent Guittot wrote:
> On Thu, 10 Dec 2020 at 02:44, Aubrey Li  wrote:
>>
>> Add idle cpumask to track idle cpus in sched domain. Every time
>> a CPU enters idle, the CPU is set in idle cpumask to be a wakeup
>> target. And if the CPU is not in idle, the CPU is cleared in idle
>> cpumask during scheduler tick to ratelimit idle cpumask update.
>>
>> When a task wakes up to select an idle cpu, scanning idle cpumask
>> has lower cost than scanning all the cpus in last level cache domain,
>> especially when the system is heavily loaded.
>>
>> Benchmarks including hackbench, schbench, uperf, sysbench mysql and
>> kbuild have been tested on a x86 4 socket system with 24 cores per
>> socket and 2 hyperthreads per core, total 192 CPUs, no regression
>> found.
>>
snip
>>
>> Cc: Peter Zijlstra 
>> Cc: Mel Gorman 
>> Cc: Vincent Guittot 
>> Cc: Qais Yousef 
>> Cc: Valentin Schneider 
>> Cc: Jiang Biao 
>> Cc: Tim Chen 
>> Signed-off-by: Aubrey Li 
> 
> This version looks good to me. I don't see regressions of v5 anymore
> and see some improvements on heavy cases
> 
> Reviewed-by: Vincent Guittot 

May I know your thoughts about this patch?
Is the cpumask operation potentially too expensive to be here?

Thanks,
-Aubrey
> 
>> ---
>>  include/linux/sched/topology.h | 13 ++
>>  kernel/sched/core.c|  2 ++
>>  kernel/sched/fair.c| 45 +-
>>  kernel/sched/idle.c|  5 
>>  kernel/sched/sched.h   |  4 +++
>>  kernel/sched/topology.c|  3 ++-
>>  6 files changed, 70 insertions(+), 2 deletions(-)
>>
>> diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
>> index 820511289857..b47b85163607 100644
>> --- a/include/linux/sched/topology.h
>> +++ b/include/linux/sched/topology.h
>> @@ -65,8 +65,21 @@ struct sched_domain_shared {
>> atomic_tref;
>> atomic_tnr_busy_cpus;
>> int has_idle_cores;
>> +   /*
>> +* Span of all idle CPUs in this domain.
>> +*
>> +* NOTE: this field is variable length. (Allocated dynamically
>> +* by attaching extra space to the end of the structure,
>> +* depending on how many CPUs the kernel has booted up with)
>> +*/
>> +   unsigned long   idle_cpus_span[];
>>  };
>>
>> +static inline struct cpumask *sds_idle_cpus(struct sched_domain_shared *sds)
>> +{
>> +   return to_cpumask(sds->idle_cpus_span);
>> +}
>> +
>>  struct sched_domain {
>> /* These fields must be setup */
>> struct sched_domain __rcu *parent;  /* top domain must be null 
>> terminated */
>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>> index c4da7e17b906..b136e2440ea4 100644
>> --- a/kernel/sched/core.c
>> +++ b/kernel/sched/core.c
>> @@ -4011,6 +4011,7 @@ void scheduler_tick(void)
>>
>>  #ifdef CONFIG_SMP
>> rq->idle_balance = idle_cpu(cpu);
>> +   update_idle_cpumask(cpu, rq->idle_balance);
>> trigger_load_balance(rq);
>>  #endif
>>  }
>> @@ -7186,6 +7187,7 @@ void __init sched_init(void)
>> rq->idle_stamp = 0;
>> rq->avg_idle = 2*sysctl_sched_migration_cost;
>> rq->max_idle_balance_cost = sysctl_sched_migration_cost;
>> +   rq->last_idle_state = 1;
>>
>> INIT_LIST_HEAD(&rq->cfs_tasks);
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index c0c4d9ad7da8..25f36ecfee54 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -6146,7 +6146,12 @@ static int select_idle_cpu(struct task_struct *p, 
>> struct sched_domain *sd, int t
>>
>> time = cpu_clock(this);
>>
>> -   cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
>> +   /*
>> +* sched_domain_shared is set only at shared cache level,
>> +* this works only because select_idle_cpu is called with
>> +* sd_llc.
>> +*/
>> +   cpumask_and(cpus, sds_idle_cpus(sd->shared), p->cpus_ptr);
>>
>> for_each_cpu_wrap(cpu, cpus, target) {
>> if (!--nr)
>> @@ -6806,6 +6811,44 @@ balance_fair(struct rq *rq, struct task_struct *prev, 
>> struct rq_flags *rf)
>>
>> return newidle_balance(rq, rf) != 0;
>>  }
>> +
>> +/*
>

[PATCH v2] sched/fair: reduce long-tail newly idle balance cost

2021-02-24 Thread Aubrey Li
A long-tail load balance cost is observed on the newly idle path;
it is caused by a race window between the first nr_running check
of the busiest runqueue and its nr_running recheck in detach_tasks.

Before the busiest runqueue is locked, its tasks could be pulled
by other CPUs, and nr_running of the busiest runqueue becomes 1 or
even 0 if the running task becomes idle. This causes detach_tasks
to break out with the LBF_ALL_PINNED flag set, and triggers a
load_balance redo at the same sched_domain level.

In order to find the new busiest sched_group and CPU, load balance will
recompute and update the various load statistics, which eventually leads
to the long-tail load balance cost.

This patch clears the LBF_ALL_PINNED flag for this race condition and
hence reduces the long-tail cost of newly idle balance.

Cc: Vincent Guittot 
Cc: Mel Gorman 
Cc: Andi Kleen 
Cc: Tim Chen 
Cc: Srinivas Pandruvada 
Cc: Rafael J. Wysocki 
Signed-off-by: Aubrey Li 
---
 kernel/sched/fair.c | 9 +
 1 file changed, 9 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 04a3ce2..5c67804 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7675,6 +7675,15 @@ static int detach_tasks(struct lb_env *env)
 
lockdep_assert_held(&env->src_rq->lock);
 
+   /*
+* Source run queue has been emptied by another CPU, clear
+* LBF_ALL_PINNED flag as we will not test any task.
+*/
+   if (env->src_rq->nr_running <= 1) {
+   env->flags &= ~LBF_ALL_PINNED;
+   return 0;
+   }
+
if (env->imbalance <= 0)
return 0;
 
-- 
2.7.4



Re: [RFC PATCH v1] sched/fair: limit load balance redo times at the same sched_domain level

2021-02-23 Thread Li, Aubrey
On 2021/2/24 1:33, Vincent Guittot wrote:
> On Tue, 23 Feb 2021 at 06:41, Li, Aubrey  wrote:
>>
>> Hi Vincent,
>>
>> Sorry for the delay, I just returned from Chinese New Year holiday.
>>
>> On 2021/1/25 22:51, Vincent Guittot wrote:
>>> On Mon, 25 Jan 2021 at 15:00, Li, Aubrey  wrote:
>>>>
>>>> On 2021/1/25 18:56, Vincent Guittot wrote:
>>>>> On Mon, 25 Jan 2021 at 06:50, Aubrey Li  wrote:
>>>>>>
>>>>>> A long-tail load balance cost is observed on the newly idle path,
>>>>>> this is caused by a race window between the first nr_running check
>>>>>> of the busiest runqueue and its nr_running recheck in detach_tasks.
>>>>>>
>>>>>> Before the busiest runqueue is locked, the tasks on the busiest
>>>>>> runqueue could be pulled by other CPUs and nr_running of the busiest
>>>>>> runqueu becomes 1, this causes detach_tasks breaks with LBF_ALL_PINNED
>>>>>
>>>>> We should better detect that when trying to detach task like below
>>>>
>>>> This should be a compromise from my understanding. If we give up load 
>>>> balance
>>>> this time due to the race condition, we do reduce the load balance cost on 
>>>> the
>>>> newly idle path, but if there is an imbalance indeed at the same 
>>>> sched_domain
>>>
>>> Redo path is there in case, LB has found an imbalance but it can't
>>> move some loads from this busiest rq to dest rq because of some cpu
>>> affinity. So it tries to fix the imbalance by moving load onto another
>>> rq of the group. In your case, the imbalance has disappeared because
>>> it has already been pulled by another rq so you don't have to try to
>>> find another imbalance. And I would even say you should not in order
>>> to let other level to take a chance to spread the load
>>>
>>>> level, we have to wait the next softirq entry to handle that imbalance. 
>>>> This
>>>> means the tasks on the second busiest runqueue have to stay longer, which 
>>>> could
>>>> introduce tail latency as well. That's why I introduced a variable to 
>>>> control
>>>> the redo loops. I'll send this to the benchmark queue to see if it makes 
>>>> any
>>>
>>> TBH, I don't like multiplying the number of knobs
>>
>> Sure, I can take your approach, :)
>>
>>>>>
>>>>> --- a/kernel/sched/fair.c
>>>>> +++ b/kernel/sched/fair.c
>>>>> @@ -7688,6 +7688,16 @@ static int detach_tasks(struct lb_env *env)
>>>>>
>>>>> lockdep_assert_held(&env->src_rq->lock);
>>>>>
>>>>> +   /*
>>>>> +* Another CPU has emptied this runqueue in the meantime.
>>>>> +* Just return and leave the load_balance properly.
>>>>> +*/
>>>>> +   if (env->src_rq->nr_running <= 1 && !env->loop) {
>>
>> May I know why !env->loop is needed here? IIUC, if detach_tasks is invoked
> 
> IIRC,  my point was to do the test only when trying to detach the 1st
> task. A lot of things can happen when a break is involved but TBH I
> can't remember a precise UC. It may be over cautious

When the break happens, the rq is unlocked and local irqs are restored, so it's
still possible that the rq is emptied by another CPU.

> 
>> from LBF_NEED_BREAK, env->loop could be non-zero, but as long as src_rq's
>> nr_running <=1, we should return immediately with LBF_ALL_PINNED flag 
>> cleared.
>>
>> How about the following change?
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 04a3ce20da67..1761d33accaa 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -7683,8 +7683,11 @@ static int detach_tasks(struct lb_env *env)
>>  * We don't want to steal all, otherwise we may be treated 
>> likewise,
>>      * which could at worst lead to a livelock crash.
>>  */
>> -   if (env->idle != CPU_NOT_IDLE && env->src_rq->nr_running <= 
>> 1)
>> +   if (env->idle != CPU_NOT_IDLE && env->src_rq->nr_running <= 
>> 1) {
> 
> IMO, we must do the test before:  while (!list_empty(tasks)) {
> 
> because src_rq might have become empty if waiting tasks have been
> pulled by another cpu and the running one became idle in the meantime

Okay, after the running one became idle, it still has LBF_ALL_PINNED, which
needs to be cleared as well. Thanks!

> 
>> +   /* Clear the flag as we will not test any task */
>> +   env->flags &= ~LBF_ALL_PINNED;
>> break;
>> +   }
>>
>> p = list_last_entry(tasks, struct task_struct, 
>> se.group_node);
>>
>> Thanks,
>> -Aubrey



Re: [RFC PATCH v1] sched/fair: limit load balance redo times at the same sched_domain level

2021-02-22 Thread Li, Aubrey
Hi Vincent,

Sorry for the delay, I just returned from Chinese New Year holiday.

On 2021/1/25 22:51, Vincent Guittot wrote:
> On Mon, 25 Jan 2021 at 15:00, Li, Aubrey  wrote:
>>
>> On 2021/1/25 18:56, Vincent Guittot wrote:
>>> On Mon, 25 Jan 2021 at 06:50, Aubrey Li  wrote:
>>>>
>>>> A long-tail load balance cost is observed on the newly idle path,
>>>> this is caused by a race window between the first nr_running check
>>>> of the busiest runqueue and its nr_running recheck in detach_tasks.
>>>>
>>>> Before the busiest runqueue is locked, the tasks on the busiest
>>>> runqueue could be pulled by other CPUs and nr_running of the busiest
>>>> runqueu becomes 1, this causes detach_tasks breaks with LBF_ALL_PINNED
>>>
>>> We should better detect that when trying to detach task like below
>>
>> This should be a compromise from my understanding. If we give up load balance
>> this time due to the race condition, we do reduce the load balance cost on 
>> the
>> newly idle path, but if there is an imbalance indeed at the same sched_domain
> 
> Redo path is there in case, LB has found an imbalance but it can't
> move some loads from this busiest rq to dest rq because of some cpu
> affinity. So it tries to fix the imbalance by moving load onto another
> rq of the group. In your case, the imbalance has disappeared because
> it has already been pulled by another rq so you don't have to try to
> find another imbalance. And I would even say you should not in order
> to let other level to take a chance to spread the load
> 
>> level, we have to wait the next softirq entry to handle that imbalance. This
>> means the tasks on the second busiest runqueue have to stay longer, which 
>> could
>> introduce tail latency as well. That's why I introduced a variable to control
>> the redo loops. I'll send this to the benchmark queue to see if it makes any
> 
> TBH, I don't like multiplying the number of knobs

Sure, I can take your approach, :)

>>>
>>> --- a/kernel/sched/fair.c
>>> +++ b/kernel/sched/fair.c
>>> @@ -7688,6 +7688,16 @@ static int detach_tasks(struct lb_env *env)
>>>
>>> lockdep_assert_held(&env->src_rq->lock);
>>>
>>> +   /*
>>> +* Another CPU has emptied this runqueue in the meantime.
>>> +* Just return and leave the load_balance properly.
>>> +*/
>>> +   if (env->src_rq->nr_running <= 1 && !env->loop) {

May I know why !env->loop is needed here? IIUC, if detach_tasks is invoked
from LBF_NEED_BREAK, env->loop could be non-zero, but as long as src_rq's
nr_running <=1, we should return immediately with LBF_ALL_PINNED flag cleared.

How about the following change?

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 04a3ce20da67..1761d33accaa 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7683,8 +7683,11 @@ static int detach_tasks(struct lb_env *env)
 * We don't want to steal all, otherwise we may be treated 
likewise,
 * which could at worst lead to a livelock crash.
 */
-   if (env->idle != CPU_NOT_IDLE && env->src_rq->nr_running <= 1)
+   if (env->idle != CPU_NOT_IDLE && env->src_rq->nr_running <= 1) {
+   /* Clear the flag as we will not test any task */
+   env->flags &= ~LBF_ALL_PINNED;
break;
+   }
 
p = list_last_entry(tasks, struct task_struct, se.group_node);
 
Thanks,
-Aubrey


Re: [PATCH v5 0/4] Scan for an idle sibling in a single pass

2021-01-31 Thread Li, Aubrey
On 2021/1/27 21:51, Mel Gorman wrote:
> Changelog since v4
> o Avoid use of intermediate variable during select_idle_cpu
> 
> Changelog since v3
> o Drop scanning based on cores, SMT4 results showed problems
> 
> Changelog since v2
> o Remove unnecessary parameters
> o Update nr during scan only when scanning for cpus
> 
> Changlog since v1
> o Move extern declaration to header for coding style
> o Remove unnecessary parameter from __select_idle_cpu
> 
> This series of 4 patches reposts three patches from Peter entitled
> "select_idle_sibling() wreckage". It only scans the runqueues in a single
> pass when searching for an idle sibling.
> 
> Three patches from Peter were dropped. The first patch altered how scan
> depth was calculated. Scan depth deletion is a random number generator
> with two major limitations. The avg_idle time is based on the time
> between a CPU going idle and being woken up clamped approximately by
> 2*sysctl_sched_migration_cost.  This is difficult to compare in a sensible
> fashion to avg_scan_cost. The second issue is that only the avg_scan_cost
> of scan failures is recorded and it does not decay.  This requires deeper
> surgery that would justify a patch on its own although Peter notes that
> https://lkml.kernel.org/r/20180530143105.977759...@infradead.org is
> potentially useful for an alternative avg_idle metric.
> 
> The second patch dropped scanned based on cores instead of CPUs as it
> rationalised the difference between core scanning and CPU scanning.
> Unfortunately, Vincent reported problems with SMT4 so it's dropped
> for now until depth searching can be fixed.
> 
> The third patch dropped converted the idle core scan throttling mechanism
> to SIS_PROP. While this would unify the throttling of core and CPU
> scanning, it was not free of regressions and has_idle_cores is a fairly
> effective throttling mechanism with the caveat that it can have a lot of
> false positives for workloads like hackbench.
> 
> Peter's series tried to solve three problems at once, this subset addresses
> one problem.
> 
>  kernel/sched/fair.c | 151 +++-
>  kernel/sched/features.h |   1 -
>  2 files changed, 70 insertions(+), 82 deletions(-)
> 

4 benchmarks were measured on an x86 4-socket system with 24 cores per
socket and 2 HTs per core, 192 CPUs in total.

The load level is [25%, 50%, 75%, 100%].

- hackbench almost has a universal win.
- netperf high load has notable changes, as well as tbench 50% load.

Details below:

hackbench: 10 iterations, 1 loops, 40 fds per group
==

- pipe process

group   base    %std    v5      %std
3       1       19.18   1.0266  9.06
6       1       9.17    0.987   13.03
9       1       7.11    1.0195  4.61
12      1       1.07    0.9927  1.43

- pipe thread

group   base    %std    v5      %std
3       1       11.14   0.9742  7.27
6       1       9.15    0.9572  7.48
9       1       2.95    0.986   4.05
12      1       1.75    0.9992  1.68

- socket process

group   base    %std    v5      %std
3       1       2.9     0.9586  2.39
6       1       0.68    0.9641  1.3
9       1       0.64    0.9388  0.76
12      1       0.56    0.9375  0.55

- socket thread

group   base    %std    v5      %std
3       1       3.82    0.9686  2.97
6       1       2.06    0.9667  1.91
9       1       0.44    0.9354  1.25
12      1       0.54    0.9362  0.6

netperf: 10 iterations x 100 seconds, transactions rate / sec
=

- tcp request/response performance

thread  base    %std    v4      %std
25%     1       5.34    1.0039  5.13
50%     1       4.97    1.0115  6.3
75%     1       5.09    0.9257  6.75
100%    1       4.53    0.908   4.83



- udp request/response performance

thread  base    %std    v4      %std
25%     1       6.18    0.9896  6.09
50%     1       5.88    1.0198  8.92
75%     1       24.38   0.9236  29.14
100%    1       26.16   0.9063  22.16

tbench: 10 iterations x 100 seconds, throughput / sec
=

thread  base    %std    v4      %std
25%     1       0.45    1.003   1.48
50%     1       1.71    0.9286  0.82
75%     1       0.84    0.9928  0.94
100%    1       0.76    0.9762  0.59

schbench: 10 iterations x 100 seconds, 99th percentile latency
==

mthread base    %std    v4      %std
25%     1       2.89    0.9884  7.34
50%     1       40.38   1.0055  38.37
75%     1       4.76    1.0095  4.62
100%    1       10.09   1.0083  8.03

Thanks,
-Aubrey


Re: [RFC PATCH v1] sched/fair: limit load balance redo times at the same sched_domain level

2021-01-26 Thread Li, Aubrey
On 2021/1/25 22:51, Vincent Guittot wrote:
> On Mon, 25 Jan 2021 at 15:00, Li, Aubrey  wrote:
>>
>> On 2021/1/25 18:56, Vincent Guittot wrote:
>>> On Mon, 25 Jan 2021 at 06:50, Aubrey Li  wrote:
>>>>
>>>> A long-tail load balance cost is observed on the newly idle path,
>>>> this is caused by a race window between the first nr_running check
>>>> of the busiest runqueue and its nr_running recheck in detach_tasks.
>>>>
>>>> Before the busiest runqueue is locked, the tasks on the busiest
>>>> runqueue could be pulled by other CPUs and nr_running of the busiest
>>>> runqueu becomes 1, this causes detach_tasks breaks with LBF_ALL_PINNED
>>>
>>> We should better detect that when trying to detach task like below
>>
>> This should be a compromise from my understanding. If we give up load balance
>> this time due to the race condition, we do reduce the load balance cost on 
>> the
>> newly idle path, but if there is an imbalance indeed at the same sched_domain
> 
> Redo path is there in case, LB has found an imbalance but it can't
> move some loads from this busiest rq to dest rq because of some cpu
> affinity. So it tries to fix the imbalance by moving load onto another
> rq of the group. In your case, the imbalance has disappeared because
> it has already been pulled by another rq so you don't have to try to
> find another imbalance. And I would even say you should not in order
> to let other level to take a chance to spread the load

Here is one simple case I have seen:
1) CPU_a becomes idle and invokes newly idle balance
2) Group_b is found to be the busiest group
3) CPU_b_1 is found to be the busiest CPU, nr_running = 5
4) detach_tasks() rechecks CPU_b_1's run queue, nr_running = 1, goto redo
5) Group_b is again found to be the busiest group
6) This time CPU_b_2 is found to be the busiest CPU, nr_running = 3
7) detach_tasks() succeeds, 2 tasks are moved

If we skipped the redo:
- CPU_a exits load balance and remains idle
- the tasks stay on CPU_b_2's runqueue and wait for the next load balancing

With the redo, the two tasks are moved to the idle CPU and get executed
immediately.
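
For readers without fair.c open, here is an abridged, annotated sketch of the
newly idle balance path being discussed (paraphrased from load_balance() in
kernel/sched/fair.c around v5.11; locking details, statistics and most error
handling are elided, so treat it as an illustration rather than a literal
excerpt):

redo:
        if (!should_we_balance(&env))
                goto out_balanced;

        group = find_busiest_group(&env);       /* recomputes all load statistics */
        if (!group)
                goto out_balanced;

        busiest = find_busiest_queue(&env, group);

        if (busiest->nr_running > 1) {          /* first nr_running check, unlocked */
                env.flags |= LBF_ALL_PINNED;
                rq_lock_irqsave(busiest, &rf);
                /*
                 * Race window: other CPUs may already have pulled these tasks
                 * before the lock was taken, so detach_tasks() finds nothing
                 * to move and returns 0 with LBF_ALL_PINNED still set.
                 */
                cur_ld_moved = detach_tasks(&env);
                rq_unlock(busiest, &rf);
        }

        if (unlikely(env.flags & LBF_ALL_PINNED)) {
                __cpumask_clear_cpu(cpu_of(busiest), cpus);
                if (!cpumask_subset(cpus, env.dst_grpmask))
                        goto redo;              /* the long-tail cost: everything above reruns */
                goto out_all_pinned;
        }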

> 
>> level, we have to wait the next softirq entry to handle that imbalance. This
>> means the tasks on the second busiest runqueue have to stay longer, which 
>> could
>> introduce tail latency as well. That's why I introduced a variable to control
>> the redo loops. I'll send this to the benchmark queue to see if it makes any
> 
> TBH, I don't like multiplying the number of knobs
> I see.

Thanks,
-Aubrey


Re: [RFC PATCH v1] sched/fair: limit load balance redo times at the same sched_domain level

2021-01-25 Thread Li, Aubrey
On 2021/1/25 18:56, Vincent Guittot wrote:
> On Mon, 25 Jan 2021 at 06:50, Aubrey Li  wrote:
>>
>> A long-tail load balance cost is observed on the newly idle path,
>> this is caused by a race window between the first nr_running check
>> of the busiest runqueue and its nr_running recheck in detach_tasks.
>>
>> Before the busiest runqueue is locked, the tasks on the busiest
>> runqueue could be pulled by other CPUs and nr_running of the busiest
>> runqueu becomes 1, this causes detach_tasks breaks with LBF_ALL_PINNED
> 
> We should better detect that when trying to detach task like below

From my understanding this is a trade-off. If we give up load balancing this
time because of the race condition, we do reduce the load balance cost on the
newly idle path; but if an imbalance really does exist at the same sched_domain
level, we have to wait for the next softirq entry to handle it. That means the
tasks on the second busiest runqueue stay there longer, which can introduce
tail latency as well. That's why I introduced a variable to control the redo
loops. I'll send this to the benchmark queue to see if it makes any
difference.

Thanks,
-Aubrey

> 
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -7688,6 +7688,16 @@ static int detach_tasks(struct lb_env *env)
> 
> lockdep_assert_held(&env->src_rq->lock);
> 
> +   /*
> +* Another CPU has emptied this runqueue in the meantime.
> +* Just return and leave the load_balance properly.
> +*/
> +   if (env->src_rq->nr_running <= 1 && !env->loop) {
> +   /* Clear the flag as we will not test any task */
> +   env->flags &= ~LBF_ALL_PINNED;
> +   return 0;
> +   }
> +
> if (env->imbalance <= 0)
> return 0;
> 
> 
>> flag set, and triggers load_balance redo at the same sched_domain level.
>>
>> In order to find the new busiest sched_group and CPU, load balance will
>> recompute and update the various load statistics, which eventually leads
>> to the long-tail load balance cost.
>>
>> This patch introduces a variable(sched_nr_lb_redo) to limit load balance
>> redo times, combined with sysctl_sched_nr_migrate, the max load balance
>> cost is reduced from 100+ us to 70+ us, measured on a 4s x86 system with
>> 192 logical CPUs.
>>
>> Cc: Andi Kleen 
>> Cc: Tim Chen 
>> Cc: Srinivas Pandruvada 
>> Cc: Rafael J. Wysocki 
>> Signed-off-by: Aubrey Li 
>> ---
>>  kernel/sched/fair.c | 7 ++-
>>  1 file changed, 6 insertions(+), 1 deletion(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index ae7ceba..b59f371 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -7407,6 +7407,8 @@ struct lb_env {
>> unsigned intloop;
>> unsigned intloop_break;
>> unsigned intloop_max;
>> +   unsigned intredo_cnt;
>> +   unsigned intredo_max;
>>
>> enum fbq_type   fbq_type;
>> enum migration_type migration_type;
>> @@ -9525,6 +9527,7 @@ static int should_we_balance(struct lb_env *env)
>> return group_balance_cpu(sg) == env->dst_cpu;
>>  }
>>
>> +static const unsigned int sched_nr_lb_redo = 1;
>>  /*
>>   * Check this_cpu to ensure it is balanced within domain. Attempt to move
>>   * tasks if there is an imbalance.
>> @@ -9547,6 +9550,7 @@ static int load_balance(int this_cpu, struct rq 
>> *this_rq,
>> .dst_grpmask= sched_group_span(sd->groups),
>> .idle   = idle,
>> .loop_break = sched_nr_migrate_break,
>> +   .redo_max   = sched_nr_lb_redo,
>> .cpus   = cpus,
>> .fbq_type   = all,
>> .tasks  = LIST_HEAD_INIT(env.tasks),
>> @@ -9682,7 +9686,8 @@ static int load_balance(int this_cpu, struct rq 
>> *this_rq,
>>  * destination group that is receiving any migrated
>>  * load.
>>  */
>> -   if (!cpumask_subset(cpus, env.dst_grpmask)) {
>> +   if (!cpumask_subset(cpus, env.dst_grpmask) &&
>> +   ++env.redo_cnt < env.redo_max) {
>> env.loop = 0;
>> env.loop_break = sched_nr_migrate_break;
>> goto redo;
>> --
>> 2.7.4
>>



Re: [PATCH v3 0/5] Scan for an idle sibling in a single pass

2021-01-25 Thread Li, Aubrey
On 2021/1/25 17:04, Mel Gorman wrote:
> On Mon, Jan 25, 2021 at 12:29:47PM +0800, Li, Aubrey wrote:
>>>>> hackbench -l 2560 -g 1 on 8 cores arm64
>>>>> v5.11-rc4 : 1.355 (+/- 7.96)
>>>>> + sis improvement : 1.923 (+/- 25%)
>>>>> + the patch below : 1.332 (+/- 4.95)
>>>>>
>>>>> hackbench -l 2560 -g 256 on 8 cores arm64
>>>>> v5.11-rc4 : 2.116 (+/- 4.62%)
>>>>> + sis improvement : 2.216 (+/- 3.84%)
>>>>> + the patch below : 2.113 (+/- 3.01%)
>>>>>
>>
>> 4 benchmarks reported out during weekend, with patch 3 on a x86 4s system
>> with 24 cores per socket and 2 HT per core, total 192 CPUs.
>>
>> It looks like mid-load has notable changes on my side:
>> - netperf 50% num of threads in TCP mode has 27.25% improved
>> - tbench 50% num of threads has 9.52% regression
>>
> 
> It's interesting that patch 3 would make any difference on x64 given that
> it's SMT2. The scan depth should have been similar. It's somewhat expected
> that it will not be a universal win, particularly once the utilisation
> is high enough to spill over in sched domains (25%, 50%, 75% utilisation
> being interesting on 4-socket systems). In such cases, double scanning can
> still show improvements for workloads that idle rapidly like tbench and
> hackbench even though it's expensive. The extra scanning gives more time
> for a CPU to go idle enough to be selected which can improve throughput
> but at the cost of wake-up latency,

Ah, sorry for the confusion. Since you and Vincent discussed dropping
patch 3, I just meant that I tested the 5 patches including patch 3, not
patch 3 alone.

> 
> Hopefully v4 can be tested as well which is now just a single scan.
> 

Sure, may I know the baseline of v4?

Thanks,
-Aubrey


Re: [RFC PATCH v1] sched/fair: limit load balance redo times at the same sched_domain level

2021-01-25 Thread Li, Aubrey
On 2021/1/25 17:06, Mel Gorman wrote:
> On Mon, Jan 25, 2021 at 02:02:58PM +0800, Aubrey Li wrote:
>> A long-tail load balance cost is observed on the newly idle path,
>> this is caused by a race window between the first nr_running check
>> of the busiest runqueue and its nr_running recheck in detach_tasks.
>>
>> Before the busiest runqueue is locked, the tasks on the busiest
>> runqueue could be pulled by other CPUs and nr_running of the busiest
>> runqueu becomes 1, this causes detach_tasks breaks with LBF_ALL_PINNED
>> flag set, and triggers load_balance redo at the same sched_domain level.
>>
>> In order to find the new busiest sched_group and CPU, load balance will
>> recompute and update the various load statistics, which eventually leads
>> to the long-tail load balance cost.
>>
>> This patch introduces a variable(sched_nr_lb_redo) to limit load balance
>> redo times, combined with sysctl_sched_nr_migrate, the max load balance
>> cost is reduced from 100+ us to 70+ us, measured on a 4s x86 system with
>> 192 logical CPUs.
>>
>> Cc: Andi Kleen 
>> Cc: Tim Chen 
>> Cc: Srinivas Pandruvada 
>> Cc: Rafael J. Wysocki 
>> Signed-off-by: Aubrey Li 
> 
> If redo_max is a constant, why is it not a #define instead of increasing
> the size of lb_env?
> 

I followed the existing variable sched_nr_migrate_break; I think this might
become a tunable as well.

Thanks,
-Aubrey


[RFC PATCH v1] sched/fair: limit load balance redo times at the same sched_domain level

2021-01-24 Thread Aubrey Li
A long-tail load balance cost is observed on the newly idle path,
this is caused by a race window between the first nr_running check
of the busiest runqueue and its nr_running recheck in detach_tasks.

Before the busiest runqueue is locked, the tasks on the busiest
runqueue could be pulled by other CPUs and nr_running of the busiest
runqueue becomes 1; this causes detach_tasks() to break out with the
LBF_ALL_PINNED flag set, which triggers a load_balance() redo at the same
sched_domain level.

In order to find the new busiest sched_group and CPU, load balance will
recompute and update the various load statistics, which eventually leads
to the long-tail load balance cost.

This patch introduces a variable (sched_nr_lb_redo) to limit the number of
load balance redos; combined with sysctl_sched_nr_migrate, the max load
balance cost is reduced from 100+ us to 70+ us, measured on a 4-socket x86
system with 192 logical CPUs.

Cc: Andi Kleen 
Cc: Tim Chen 
Cc: Srinivas Pandruvada 
Cc: Rafael J. Wysocki 
Signed-off-by: Aubrey Li 
---
 kernel/sched/fair.c | 7 ++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ae7ceba..b59f371 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7407,6 +7407,8 @@ struct lb_env {
	unsigned int	loop;
	unsigned int	loop_break;
	unsigned int	loop_max;
+	unsigned int	redo_cnt;
+	unsigned int	redo_max;
 
enum fbq_type   fbq_type;
enum migration_type migration_type;
@@ -9525,6 +9527,7 @@ static int should_we_balance(struct lb_env *env)
return group_balance_cpu(sg) == env->dst_cpu;
 }
 
+static const unsigned int sched_nr_lb_redo = 1;
 /*
  * Check this_cpu to ensure it is balanced within domain. Attempt to move
  * tasks if there is an imbalance.
@@ -9547,6 +9550,7 @@ static int load_balance(int this_cpu, struct rq *this_rq,
.dst_grpmask= sched_group_span(sd->groups),
.idle   = idle,
.loop_break = sched_nr_migrate_break,
+   .redo_max   = sched_nr_lb_redo,
.cpus   = cpus,
.fbq_type   = all,
.tasks  = LIST_HEAD_INIT(env.tasks),
@@ -9682,7 +9686,8 @@ static int load_balance(int this_cpu, struct rq *this_rq,
 * destination group that is receiving any migrated
 * load.
 */
-   if (!cpumask_subset(cpus, env.dst_grpmask)) {
+   if (!cpumask_subset(cpus, env.dst_grpmask) &&
+   ++env.redo_cnt < env.redo_max) {
env.loop = 0;
env.loop_break = sched_nr_migrate_break;
goto redo;
-- 
2.7.4



Re: [PATCH v3 0/5] Scan for an idle sibling in a single pass

2021-01-24 Thread Li, Aubrey
On 2021/1/22 21:22, Vincent Guittot wrote:
> On Fri, 22 Jan 2021 at 11:14, Mel Gorman  wrote:
>>
>> On Fri, Jan 22, 2021 at 10:30:52AM +0100, Vincent Guittot wrote:
>>> Hi Mel,
>>>
>>> On Tue, 19 Jan 2021 at 13:02, Mel Gorman  
>>> wrote:

 On Tue, Jan 19, 2021 at 12:33:04PM +0100, Vincent Guittot wrote:
> On Tue, 19 Jan 2021 at 12:22, Mel Gorman  
> wrote:
>>
>> Changelog since v2
>> o Remove unnecessary parameters
>> o Update nr during scan only when scanning for cpus
>
> Hi Mel,
>
> I haven't looked at your previous version mainly because I'm chasing a
> performance regression on v5.11-rcx which prevents me from testing the
> impact of your patchset on my !SMT2 system.
> Will do this as soon as this problem is fixed
>

 Thanks, that would be appreciated as I do not have access to a !SMT2
 system to do my own evaluation.
>>>
>>> I have been able to run tests with your patchset on both large arm64
>>> SMT4 system and small arm64 !SMT system and patch 3 is still a source
>>> of regression on both. Decreasing min number of loops to 2 instead of
>>> 4 and scaling it with smt weight doesn't seem to be a good option as
>>> regressions disappear when I remove them as I tested with the patch
>>> below
>>>
>>> hackbench -l 2560 -g 1 on 8 cores arm64
>>> v5.11-rc4 : 1.355 (+/- 7.96)
>>> + sis improvement : 1.923 (+/- 25%)
>>> + the patch below : 1.332 (+/- 4.95)
>>>
>>> hackbench -l 2560 -g 256 on 8 cores arm64
>>> v5.11-rc4 : 2.116 (+/- 4.62%)
>>> + sis improvement : 2.216 (+/- 3.84%)
>>> + the patch below : 2.113 (+/- 3.01%)
>>>

4 benchmark results came out over the weekend with patch 3, on an x86 4-socket
system with 24 cores per socket and 2 HT per core, 192 CPUs in total.

It looks like mid-load has notable changes on my side:
- netperf at 50% of the number of threads in TCP mode improved by 27.25%
- tbench at 50% of the number of threads regressed by 9.52%

Details below:

hackbench: 10 iterations, 1 loops, 40 fds per group
==

- pipe process

group   base    %std    patch   %std
6       1       5.27    1.0469  8.53
12      1       1.03    1.0398  1.44
24      1       2.36    1.0275  3.34

- pipe thread

group   base    %std    patch   %std
6       1       7.48    1.0747  5.25
12      1       0.97    1.0432  1.95
24      1       7.01    1.0299  6.81

- socket process

group   base    %std    patch   %std
6       1       1.01    0.9656  1.09
12      1       0.35    0.9853  0.49
24      1       1.33    0.9877  1.20

- socket thread

group   base    %std    patch   %std
6       1       2.52    0.9346  2.75
12      1       0.86    0.9830  0.66
24      1       1.17    0.9791  1.23

netperf: 10 iterations x 100 seconds, transactions rate / sec
=

- tcp request/response performance

thread  base    %std    patch   %std
50%     1       3.98    1.2725  7.52
100%    1       2.73    0.9446  2.86
200%    1       39.36   0.9955  29.45

- udp request/response performance

thread  base    %std    patch   %std
50%     1       6.18    1.0704  11.99
100%    1       47.85   0.9637  45.83
200%    1       45.74   1.0162  36.99

tbench: 10 iterations x 100 seconds, throughput / sec
=

thread  base    %std    patch   %std
50%     1       1.38    0.9048  2.46
100%    1       1.05    0.9640  0.68
200%    1       6.76    0.9886  2.86

schbench: 10 iterations x 100 seconds, 99th percentile latency
==

mthread base    %std    patch   %std
6       1       29.07   0.8714  25.73
12      1       15.32   1.      12.39
24      1       0.08    0.9996  0.01

>>> So starting with a min of 2 loops instead of 4 currently and scaling
>>> nr loop with smt weight doesn't seem to be a good option and we should
>>> remove it for now
>>>
>> Note that this is essentially reverting the patch. As you remove "nr *=
>> sched_smt_weight", the scan is no longer proportional to cores, it's
> 
> Yes. My goal above was to narrow the changes only to lines that
> generate the regressions but i agree that removing patch 3 is the
> right solution> 
>> proportial to logical CPUs and the rest of the patch and changelog becomes
>> meaningless. On that basis, I'll queue tests over the weekend that remove
>> this patch entirely and keep the CPU scan as a single pass.
>>
>> --
>> Mel Gorman
>> SUSE Labs



Re: [PATCH 5/5] sched/fair: Merge select_idle_core/cpu()

2021-01-18 Thread Li, Aubrey
On 2021/1/15 18:08, Mel Gorman wrote:
> From: Peter Zijlstra (Intel) 
> 
> Both select_idle_core() and select_idle_cpu() do a loop over the same
> cpumask. Observe that by clearing the already visited CPUs, we can
> fold the iteration and iterate a core at a time.
> 
> All we need to do is remember any non-idle CPU we encountered while
> scanning for an idle core. This way we'll only iterate every CPU once.
> 
> Signed-off-by: Peter Zijlstra (Intel) 
> Signed-off-by: Mel Gorman 
> ---
>  kernel/sched/fair.c | 97 +++--
>  1 file changed, 59 insertions(+), 38 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 12e08da90024..6c0f841e9e75 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6006,6 +6006,14 @@ static inline int find_idlest_cpu(struct sched_domain 
> *sd, struct task_struct *p
>   return new_cpu;
>  }
>  
> +static inline int __select_idle_cpu(struct task_struct *p, int core, struct 
> cpumask *cpus)

Sorry if I missed anything, but why are p and cpus needed here?

> +{
> + if (available_idle_cpu(core) || sched_idle_cpu(core))
> + return core;
> +
> + return -1;
> +}
> +
>  #ifdef CONFIG_SCHED_SMT
>  DEFINE_STATIC_KEY_FALSE(sched_smt_present);
>  EXPORT_SYMBOL_GPL(sched_smt_present);
> @@ -6066,40 +6074,34 @@ void __update_idle_core(struct rq *rq)
>   * there are no idle cores left in the system; tracked through
>   * sd_llc->shared->has_idle_cores and enabled through update_idle_core() 
> above.
>   */
> -static int select_idle_core(struct task_struct *p, struct sched_domain *sd, 
> int target)
> +static int select_idle_core(struct task_struct *p, int core, struct cpumask 
> *cpus, int *idle_cpu)
>  {
> - struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
> - int core, cpu;
> + bool idle = true;
> + int cpu;
>  
>   if (!static_branch_likely(&sched_smt_present))
> - return -1;
> -
> - if (!test_idle_cores(target, false))
> - return -1;
> -
> - cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
> + return __select_idle_cpu(p, core, cpus);
>  
> - for_each_cpu_wrap(core, cpus, target) {
> - bool idle = true;
> -
> - for_each_cpu(cpu, cpu_smt_mask(core)) {
> - if (!available_idle_cpu(cpu)) {
> - idle = false;
> - break;
> + for_each_cpu(cpu, cpu_smt_mask(core)) {
> + if (!available_idle_cpu(cpu)) {
> + idle = false;
> + if (*idle_cpu == -1) {
> + if (sched_idle_cpu(cpu) && 
> cpumask_test_cpu(cpu, p->cpus_ptr)) {
> + *idle_cpu = cpu;
> + break;
> + }
> + continue;
>   }
> + break;
>   }
> -
> - if (idle)
> - return core;
> -
> - cpumask_andnot(cpus, cpus, cpu_smt_mask(core));
> + if (*idle_cpu == -1 && cpumask_test_cpu(cpu, p->cpus_ptr))
> + *idle_cpu = cpu;
>   }
>  
> - /*
> -  * Failed to find an idle core; stop looking for one.
> -  */
> - set_idle_cores(target, 0);
> + if (idle)
> + return core;
>  
> + cpumask_andnot(cpus, cpus, cpu_smt_mask(core));
>   return -1;
>  }
>  
> @@ -6107,9 +6109,18 @@ static int select_idle_core(struct task_struct *p, 
> struct sched_domain *sd, int
>  
>  #define sched_smt_weight 1
>  
> -static inline int select_idle_core(struct task_struct *p, struct 
> sched_domain *sd, int target)
> +static inline void set_idle_cores(int cpu, int val)
>  {
> - return -1;
> +}
> +
> +static inline bool test_idle_cores(int cpu, bool def)
> +{
> + return def;
> +}
> +
> +static inline int select_idle_core(struct task_struct *p, int core, struct 
> cpumask *cpus, int *idle_cpu)
> +{
> + return __select_idle_cpu(p, core, cpus);
>  }
>  
>  #endif /* CONFIG_SCHED_SMT */
> @@ -6124,10 +6135,11 @@ static inline int select_idle_core(struct task_struct 
> *p, struct sched_domain *s
>  static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, 
> int target)
>  {
>   struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
> + int i, cpu, idle_cpu = -1, nr = INT_MAX;
> + bool smt = test_idle_cores(target, false);
> + int this = smp_processor_id();
>   struct sched_domain *this_sd;
>   u64 time;
> - int this = smp_processor_id();
> - int cpu, nr = INT_MAX;
>  
>   this_sd = rcu_dereference(*this_cpu_ptr(&sd_llc));
>   if (!this_sd)
> @@ -6135,7 +6147,7 @@ static int select_idle_cpu(struct task_struct *p, 
> struct sched_domain *sd, int t
>  
>   cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
>  
> - if (sched_feat(SIS_PROP)) {
> +

Re: [PATCH 3/5] sched/fair: Make select_idle_cpu() proportional to cores

2021-01-18 Thread Li, Aubrey
On 2021/1/15 18:08, Mel Gorman wrote:
> From: Peter Zijlstra (Intel) 
> 
> Instead of calculating how many (logical) CPUs to scan, compute how
> many cores to scan.
> 
> This changes behaviour for anything !SMT2.
> 
> Signed-off-by: Peter Zijlstra (Intel) 
> Signed-off-by: Mel Gorman 
> ---
>  kernel/sched/core.c  | 18 +-
>  kernel/sched/fair.c  | 12 ++--
>  kernel/sched/sched.h |  2 ++
>  3 files changed, 25 insertions(+), 7 deletions(-)
> 
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 15d2562118d1..ada8faac2e4d 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -7444,11 +7444,19 @@ int sched_cpu_activate(unsigned int cpu)
>   balance_push_set(cpu, false);
>  
>  #ifdef CONFIG_SCHED_SMT
> - /*
> -  * When going up, increment the number of cores with SMT present.
> -  */
> - if (cpumask_weight(cpu_smt_mask(cpu)) == 2)
> - static_branch_inc_cpuslocked(&sched_smt_present);
> + do {
> + int weight = cpumask_weight(cpu_smt_mask(cpu));
> +
> + if (weight > sched_smt_weight)
> + sched_smt_weight = weight;
> +
> + /*
> +  * When going up, increment the number of cores with SMT 
> present.
> +  */
> + if (weight == 2)
> +  static_branch_inc_cpuslocked(&sched_smt_present);
> +
> + } while (0);
>  #endif
>   set_cpu_active(cpu, true);
>  
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index c8d8e185cf3b..0811e2fe4f19 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6010,6 +6010,8 @@ static inline int find_idlest_cpu(struct sched_domain 
> *sd, struct task_struct *p
>  DEFINE_STATIC_KEY_FALSE(sched_smt_present);
>  EXPORT_SYMBOL_GPL(sched_smt_present);
>  
> +int sched_smt_weight __read_mostly = 1;
> +
>  static inline void set_idle_cores(int cpu, int val)
>  {
>   struct sched_domain_shared *sds;
> @@ -6124,6 +6126,8 @@ static int select_idle_smt(struct task_struct *p, 
> struct sched_domain *sd, int t
>  
>  #else /* CONFIG_SCHED_SMT */
>  
> +#define sched_smt_weight 1
> +
>  static inline int select_idle_core(struct task_struct *p, struct 
> sched_domain *sd, int target)
>  {
>   return -1;
> @@ -6136,6 +6140,8 @@ static inline int select_idle_smt(struct task_struct 
> *p, struct sched_domain *sd
>  
>  #endif /* CONFIG_SCHED_SMT */
>  
> +#define sis_min_cores2
> +
>  /*
>   * Scan the LLC domain for idle CPUs; this is dynamically regulated by
>   * comparing the average scan cost (tracked in sd->avg_scan_cost) against the
> @@ -6166,10 +6172,12 @@ static int select_idle_cpu(struct task_struct *p, 
> struct sched_domain *sd, int t
>   avg_cost = this_sd->avg_scan_cost + 1;
>  
>   span_avg = sd->span_weight * avg_idle;
> - if (span_avg > 4*avg_cost)
> + if (span_avg > sis_min_cores*avg_cost)
>   nr = div_u64(span_avg, avg_cost);
>   else
> - nr = 4;
> + nr = sis_min_cores;
> +
> + nr *= sched_smt_weight;

Would it be better to put this into an inline wrapper, to hide sched_smt_weight
when !CONFIG_SCHED_SMT?
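
Something along these lines is what I have in mind -- a hypothetical sketch
only; the wrapper name is invented here and this is not part of the posted
patch:

#ifdef CONFIG_SCHED_SMT
static inline unsigned int sis_cores_to_cpus(unsigned int nr)
{
        return nr * sched_smt_weight;
}
#else
static inline unsigned int sis_cores_to_cpus(unsigned int nr)
{
        return nr;
}
#endif

select_idle_cpu() would then just do nr = sis_cores_to_cpus(nr), and the
!CONFIG_SCHED_SMT build would no longer need the "#define sched_smt_weight 1"
fallback for this particular use.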

Thanks,
-Aubrey


[PATCH] cpuset: fix typos in comments

2021-01-12 Thread Aubrey Li
Change hierachy to hierarchy and congifured to configured, no functionality
changed.

Signed-off-by: Aubrey Li 
---
 kernel/cgroup/cpuset.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 57b5b5d..15f4300 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -98,7 +98,7 @@ struct cpuset {
 * and if it ends up empty, it will inherit the parent's mask.
 *
 *
-* On legacy hierachy:
+* On legacy hierarchy:
 *
 * The user-configured masks are always the same with effective masks.
 */
@@ -1286,10 +1286,10 @@ static int update_parent_subparts_cpumask(struct cpuset 
*cpuset, int cmd,
  * @cs:  the cpuset to consider
  * @tmp: temp variables for calculating effective_cpus & partition setup
  *
- * When congifured cpumask is changed, the effective cpumasks of this cpuset
+ * When configured cpumask is changed, the effective cpumasks of this cpuset
  * and all its descendants need to be updated.
  *
- * On legacy hierachy, effective_cpus will be the same with cpu_allowed.
+ * On legacy hierarchy, effective_cpus will be the same with cpu_allowed.
  *
  * Called with cpuset_mutex held
  */
-- 
2.7.4



Re: [RFC][PATCH 0/5] select_idle_sibling() wreckage

2020-12-16 Thread Li, Aubrey
Hi Peter,

On 2020/12/15 0:48, Peter Zijlstra wrote:
> Hai, here them patches Mel asked for. They've not (yet) been through the
> robots, so there might be some build fail for configs I've not used.
> 
> Benchmark time :-)
> 

Here is the data on my side, benchmarks were tested on a x86 4 sockets system
with 24 cores per socket and 2 hyperthreads per core, total 192 CPUs.

uperf throughput: netperf workload, tcp_nodelay, r/w size = 90

  threads   baseline-avg   %std    patch-avg   %std
  96        1              0.78    1.0072      1.09
  144       1              0.58    1.0204      0.83
  192       1              0.66    1.0151      0.52
  240       1              2.08    0.8990      0.75

hackbench: process mode, 25600 loops, 40 file descriptors per group

  group     baseline-avg   %std    patch-avg   %std
  2(80)     1              10.02   1.0339      9.94
  3(120)    1              6.69    1.0049      6.92
  4(160)    1              6.76    0.8663      8.74
  5(200)    1              2.96    0.9651      4.28

schbench: 99th percentile latency, 16 workers per message thread

  mthread   baseline-avg   %std    patch-avg   %std
  6(96)     1              0.88    1.0055      0.81
  9(144)    1              0.59    1.0007      0.37
  12(192)   1              0.61    0.9973      0.82
  15(240)   1              25.05   0.9251      18.36

sysbench mysql throughput: read/write, table size = 10,000,000

  thread    baseline-avg   %std    patch-avg   %std
  96        1              6.62    0.9668      4.04
  144       1              9.29    0.9579      6.53
  192       1              9.52    0.9503      5.35
  240       1              8.55    0.9657      3.34

It looks like:
- hackbench has a significant improvement at 4 groups
- uperf has a significant regression at 240 threads

Please let me know if there are any cases of interest you would like me to run/rerun.

Thanks,
-Aubrey


Re: [RFC PATCH v8] sched/fair: select idle cpu from idle cpumask for task wakeup

2020-12-15 Thread Li, Aubrey
Hi Bao Hua,

Sorry I almost missed this message, :(

On 2020/12/14 7:29, Song Bao Hua (Barry Song) wrote:
> 
> Hi Aubrey,
> 
> The patch looks great. But I didn't find any hackbench improvement
> on kunpeng 920 which has 24 cores for each llc span. Llc span is also
> one numa node. The topology is like:
> # numactl --hardware
> available: 4 nodes (0-3)
> node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
> node 0 size: 128669 MB
> node 0 free: 126995 MB
> node 1 cpus: 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42
> 43 44 45 46 47
> node 1 size: 128997 MB
> node 1 free: 127539 MB
> node 2 cpus: 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66
> 67 68 69 70 71
> node 2 size: 129021 MB
> node 2 free: 127106 MB
> node 3 cpus: 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90
> 91 92 93 94 95
> node 3 size: 127993 MB
> node 3 free: 126739 MB
> node distances:
> node   0   1   2   3
>   0:  10  12  20  22
>   1:  12  10  22  24
>   2:  20  22  10  12
>   3:  22  24  12  10
> 
> Benchmark command:
> numactl -N 0-1 hackbench -p -T -l 2 -g $1
> 
> for each g, I ran 10 times to get the average time. And I tested
> g from 1 to 10.
> 
> g 1  2  3  4  5  6   7 89   10
> w/o   1.4733 1.5992 1.9353 2.1563 2.8448 3.3305 3.9616 4.4870 5.0786 5.6983
> w/1.4709 1.6152 1.9474 2.1512 2.8298 3.2998 3.9472 4.4803 5.0462 5.6505
> 
> Is it because the core number is small in llc span in my test?

I guess it is SIS_PROP: when the system is very busy, the idle cpu scan
loop is throttled to nr = 4. In that case the patch only saves a scan of 4
CPUs, so the data change looks marginal. Vincent mentioned a notable change
here:

https://lkml.org/lkml/2020/12/14/109
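
For reference, this is the throttling being referred to -- a minimal sketch of
the SIS_PROP proportion calculation in select_idle_cpu(), simplified from
kernel/sched/fair.c of this era (cpumask setup and the scan loop are elided):

        if (sched_feat(SIS_PROP)) {
                u64 avg_cost, avg_idle, span_avg;

                /* scale avg_idle down; the numbers involved are large */
                avg_idle = this_rq()->avg_idle / 512;
                avg_cost = this_sd->avg_scan_cost + 1;

                span_avg = sd->span_weight * avg_idle;
                if (span_avg > 4*avg_cost)
                        nr = div_u64(span_avg, avg_cost);
                else
                        nr = 4; /* heavily loaded: at most 4 runqueues scanned */
        }

So once avg_idle collapses under heavy load, both the baseline and the patched
kernel look at no more than 4 runqueues, and the most the idle cpumask can save
in that regime is those 4 lookups.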

Maybe you can increase the group number to see if it can be reproduced on
your side.

Thanks,
-Aubrey


Re: [RFC][PATCH 1/5] sched/fair: Fix select_idle_cpu()s cost accounting

2020-12-15 Thread Li, Aubrey
On 2020/12/15 15:59, Peter Zijlstra wrote:
> On Tue, Dec 15, 2020 at 11:36:35AM +0800, Li, Aubrey wrote:
>> On 2020/12/15 0:48, Peter Zijlstra wrote:
>>> We compute the average cost of the total scan, but then use it as a
>>> per-cpu scan cost when computing the scan proportion. Fix this by
>>> properly computing a per-cpu scan cost.
>>>
>>> This also fixes a bug where we would terminate early (!--nr, case) and
>>> not account that cost at all.
>>
>> I'm a bit worried this may introduce a regression under heavy load.
>> The overhead of adding another cpu_clock() and calculation becomes 
>> significant when sis_scan is throttled by nr.
> 
> The thing is, the code as it exists today makes no sense what so ever.
> It's plain broken batshit.
> 
> We calculate the total scanning time (irrespective of how many CPUs we
> touched), and then use that calculate the number of cpus to scan. That's
> just daft.
> 
> After this patch we calculate the avg cost of scanning 1 cpu and use
> that to calculate how many cpus to scan. Which is coherent and sane.

I see and all of these make sense to me.

> 
> Maybe it can be improved, but that's a completely different thing.
> 

OK, I'll go through the workloads in hand and paste the data here.

Thanks,
-Aubrey


Re: [RFC][PATCH 1/5] sched/fair: Fix select_idle_cpu()s cost accounting

2020-12-14 Thread Li, Aubrey
On 2020/12/15 0:48, Peter Zijlstra wrote:
> We compute the average cost of the total scan, but then use it as a
> per-cpu scan cost when computing the scan proportion. Fix this by
> properly computing a per-cpu scan cost.
> 
> This also fixes a bug where we would terminate early (!--nr, case) and
> not account that cost at all.

I'm a bit worried this may introduce a regression under heavy load.
The overhead of the extra cpu_clock() call and the extra division becomes
significant when the scan is throttled by nr.

I'm not sure whether it would be a good idea to not account the scan cost
at all when the scan is throttled, that is, to remove the first cpu_clock()
as well. The avg scan cost would then keep the value it had when the system
was not very busy, and once the load comes down and span avg idle > span
avg cost, we start accounting the cost again. This should make
select_idle_cpu() a bit faster when the load is very high.
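
Roughly what I mean, as a hypothetical sketch (the "account" flag is my own
naming, untested): only touch the clock when the scan is not clamped to its
floor, so the heavily loaded path pays no accounting overhead at all:

        bool account = false;

        if (sched_feat(SIS_PROP)) {
                u64 avg_cost, avg_idle, span_avg;

                avg_idle = this_rq()->avg_idle / 512;
                avg_cost = this_sd->avg_scan_cost + 1;

                span_avg = sd->span_weight * avg_idle;
                if (span_avg > 4*avg_cost) {
                        nr = div_u64(span_avg, avg_cost);
                        account = true; /* not saturated: keep updating avg_scan_cost */
                } else {
                        nr = 4;         /* throttled: skip both cpu_clock() calls */
                }
        }

        if (account)
                time = cpu_clock(this);

        /* ... the existing scan loop ... */

        if (account) {
                time = cpu_clock(this) - time;
                update_avg(&this_sd->avg_scan_cost, time);
        }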

Thanks,
-Aubrey
> 
> Signed-off-by: Peter Zijlstra (Intel) 
> ---
>  kernel/sched/fair.c |   13 +
>  1 file changed, 9 insertions(+), 4 deletions(-)
> 
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6144,10 +6144,10 @@ static inline int select_idle_smt(struct
>  static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, 
> int target)
>  {
>   struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
> + int cpu, loops = 1, nr = INT_MAX;
> + int this = smp_processor_id();
>   struct sched_domain *this_sd;
>   u64 time;
> - int this = smp_processor_id();
> - int cpu, nr = INT_MAX;
>  
>   this_sd = rcu_dereference(*this_cpu_ptr(&sd_llc));
>   if (!this_sd)
> @@ -6175,14 +6175,19 @@ static int select_idle_cpu(struct task_s
>   }
>  
>   for_each_cpu_wrap(cpu, cpus, target) {
> - if (!--nr)
> - return -1;
>   if (available_idle_cpu(cpu) || sched_idle_cpu(cpu))
>   break;
> +
> + if (loops >= nr) {
> + cpu = -1;
> + break;
> + }
> + loops++;
>   }
>  
>   if (sched_feat(SIS_PROP)) {
>   time = cpu_clock(this) - time;
> + time = div_u64(time, loops);
>   update_avg(_sd->avg_scan_cost, time);
>   update_avg(&this_sd->avg_scan_cost, time);
>  
> 
> 



Re: [RFC PATCH v7] sched/fair: select idle cpu from idle cpumask for task wakeup

2020-12-13 Thread Li, Aubrey
On 2020/12/10 19:34, Mel Gorman wrote:
> On Thu, Dec 10, 2020 at 04:23:47PM +0800, Li, Aubrey wrote:
>>> I ran this patch with tbench on top of of the schedstat patches that
>>> track SIS efficiency. The tracking adds overhead so it's not a perfect
>>> performance comparison but the expectation would be that the patch reduces
>>> the number of runqueues that are scanned
>>
>> Thanks for the measurement! I don't play with tbench so may need a while
>> to digest the data.
>>
> 
> They key point is that it appears the idle mask was mostly equivalent to
> the full domain mask, at least for this test.
> 
>>>
>>> tbench4
>>>   5.10.0-rc6 5.10.0-rc6
>>>   schedstat-v1r1  idlemask-v7r1
>>> Hmean 1504.76 (   0.00%)  500.14 *  -0.91%*
>>> Hmean 2   1001.22 (   0.00%)  970.37 *  -3.08%*
>>> Hmean 4   1930.56 (   0.00%) 1880.96 *  -2.57%*
>>> Hmean 8   3688.05 (   0.00%) 3537.72 *  -4.08%*
>>> Hmean 16  6352.71 (   0.00%) 6439.53 *   1.37%*
>>> Hmean 32 10066.37 (   0.00%)10124.65 *   0.58%*


>>> Hmean 64 12846.32 (   0.00%)11627.27 *  -9.49%*

I focused on this case and ran it 5 times; here is the data on my side:
5 runs x 600s tbench, thread count 153 (80% of 192 hardware threads).

Hmean 153       v5.9.12         v5.9.12
                schedstat-v1    idlemask-v8 (with schedstat)
Round 1 15717.3 15608.1
Round 2 14856.9 15642.5
Round 3 14856.7 15782.1
Round 4 15408.9 15912.9
Round 5 15436.6 15927.7

From the tbench throughput data (bigger is better), it looks like idlemask wins.

And here is SIS_scanned data:

Hmean 153       v5.9.12         v5.9.12
                schedstat-v1    idlemask-v8 (with schedstat)
Round 1 22562490432 21894932302
Round 2 21288529957 21693722629
Round 3 20657521771 21268308377
Round 4 21868486414 22289128955
Round 5 21859614988 22214740417

From the SIS_scanned data (less is better), it looks like the default one is better.

But combined with the throughput data this can be explained: higher throughput
simply performs more SIS scans.

So at least there is no regression in this case.

Thanks,
-Aubrey


Re: [RFC PATCH v8] sched/fair: select idle cpu from idle cpumask for task wakeup

2020-12-11 Thread Li, Aubrey
On 2020/12/11 23:22, Vincent Guittot wrote:
> On Fri, 11 Dec 2020 at 16:19, Li, Aubrey  wrote:
>>
>> On 2020/12/11 23:07, Vincent Guittot wrote:
>>> On Thu, 10 Dec 2020 at 02:44, Aubrey Li  wrote:
>>>>
>>>> Add idle cpumask to track idle cpus in sched domain. Every time
>>>> a CPU enters idle, the CPU is set in idle cpumask to be a wakeup
>>>> target. And if the CPU is not in idle, the CPU is cleared in idle
>>>> cpumask during scheduler tick to ratelimit idle cpumask update.
>>>>
>>>> When a task wakes up to select an idle cpu, scanning idle cpumask
>>>> has lower cost than scanning all the cpus in last level cache domain,
>>>> especially when the system is heavily loaded.
>>>>
>>>> Benchmarks including hackbench, schbench, uperf, sysbench mysql and
>>>> kbuild have been tested on a x86 4 socket system with 24 cores per
>>>> socket and 2 hyperthreads per core, total 192 CPUs, no regression
>>>> found.
>>>>
>>>> v7->v8:
>>>> - refine update_idle_cpumask, no functionality change
>>>> - fix a suspicious RCU usage warning with CONFIG_PROVE_RCU=y
>>>>
>>>> v6->v7:
>>>> - place the whole idle cpumask mechanism under CONFIG_SMP
>>>>
>>>> v5->v6:
>>>> - decouple idle cpumask update from stop_tick signal, set idle CPU
>>>>   in idle cpumask every time the CPU enters idle
>>>>
>>>> v4->v5:
>>>> - add update_idle_cpumask for s2idle case
>>>> - keep the same ordering of tick_nohz_idle_stop_tick() and update_
>>>>   idle_cpumask() everywhere
>>>>
>>>> v3->v4:
>>>> - change setting idle cpumask from every idle entry to tickless idle
>>>>   if cpu driver is available
>>>> - move clearing idle cpumask to scheduler_tick to decouple nohz mode
>>>>
>>>> v2->v3:
>>>> - change setting idle cpumask to every idle entry, otherwise schbench
>>>>   has a regression of 99th percentile latency
>>>> - change clearing idle cpumask to nohz_balancer_kick(), so updating
>>>>   idle cpumask is ratelimited in the idle exiting path
>>>> - set SCHED_IDLE cpu in idle cpumask to allow it as a wakeup target
>>>>
>>>> v1->v2:
>>>> - idle cpumask is updated in the nohz routines, by initializing idle
>>>>   cpumask with sched_domain_span(sd), nohz=off case remains the original
>>>>   behavior
>>>>
>>>> Cc: Peter Zijlstra 
>>>> Cc: Mel Gorman 
>>>> Cc: Vincent Guittot 
>>>> Cc: Qais Yousef 
>>>> Cc: Valentin Schneider 
>>>> Cc: Jiang Biao 
>>>> Cc: Tim Chen 
>>>> Signed-off-by: Aubrey Li 
>>>
>>> This version looks good to me. I don't see regressions of v5 anymore
>>> and see some improvements on heavy cases
>>
>> v5 or v8?
> 
> the v8 looks good to me and I don't see the regressions that I have
> seen with the v5 anymore
> 
Sounds great, thanks, :)

> 
>>
>>>
>>> Reviewed-by: Vincent Guittot 
>>>
>>>> ---
>>>>  include/linux/sched/topology.h | 13 ++
>>>>  kernel/sched/core.c|  2 ++
>>>>  kernel/sched/fair.c| 45 +-
>>>>  kernel/sched/idle.c|  5 
>>>>  kernel/sched/sched.h   |  4 +++
>>>>  kernel/sched/topology.c|  3 ++-
>>>>  6 files changed, 70 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/include/linux/sched/topology.h 
>>>> b/include/linux/sched/topology.h
>>>> index 820511289857..b47b85163607 100644
>>>> --- a/include/linux/sched/topology.h
>>>> +++ b/include/linux/sched/topology.h
>>>> @@ -65,8 +65,21 @@ struct sched_domain_shared {
>>>> atomic_tref;
>>>> atomic_tnr_busy_cpus;
>>>> int has_idle_cores;
>>>> +   /*
>>>> +* Span of all idle CPUs in this domain.
>>>> +*
>>>> +* NOTE: this field is variable length. (Allocated dynamically
>>>> +* by attaching extra space to the end of the structure,
>>>> +* depending on how many CPUs the kernel has booted up with)
>>>> +*/
>>>> +   unsigned long   idle_cpus_span[];
>>>>  };
>>

Re: [RFC PATCH v8] sched/fair: select idle cpu from idle cpumask for task wakeup

2020-12-11 Thread Li, Aubrey
On 2020/12/11 23:07, Vincent Guittot wrote:
> On Thu, 10 Dec 2020 at 02:44, Aubrey Li  wrote:
>>
>> Add idle cpumask to track idle cpus in sched domain. Every time
>> a CPU enters idle, the CPU is set in idle cpumask to be a wakeup
>> target. And if the CPU is not in idle, the CPU is cleared in idle
>> cpumask during scheduler tick to ratelimit idle cpumask update.
>>
>> When a task wakes up to select an idle cpu, scanning idle cpumask
>> has lower cost than scanning all the cpus in last level cache domain,
>> especially when the system is heavily loaded.
>>
>> Benchmarks including hackbench, schbench, uperf, sysbench mysql and
>> kbuild have been tested on a x86 4 socket system with 24 cores per
>> socket and 2 hyperthreads per core, total 192 CPUs, no regression
>> found.
>>
>> v7->v8:
>> - refine update_idle_cpumask, no functionality change
>> - fix a suspicious RCU usage warning with CONFIG_PROVE_RCU=y
>>
>> v6->v7:
>> - place the whole idle cpumask mechanism under CONFIG_SMP
>>
>> v5->v6:
>> - decouple idle cpumask update from stop_tick signal, set idle CPU
>>   in idle cpumask every time the CPU enters idle
>>
>> v4->v5:
>> - add update_idle_cpumask for s2idle case
>> - keep the same ordering of tick_nohz_idle_stop_tick() and update_
>>   idle_cpumask() everywhere
>>
>> v3->v4:
>> - change setting idle cpumask from every idle entry to tickless idle
>>   if cpu driver is available
>> - move clearing idle cpumask to scheduler_tick to decouple nohz mode
>>
>> v2->v3:
>> - change setting idle cpumask to every idle entry, otherwise schbench
>>   has a regression of 99th percentile latency
>> - change clearing idle cpumask to nohz_balancer_kick(), so updating
>>   idle cpumask is ratelimited in the idle exiting path
>> - set SCHED_IDLE cpu in idle cpumask to allow it as a wakeup target
>>
>> v1->v2:
>> - idle cpumask is updated in the nohz routines, by initializing idle
>>   cpumask with sched_domain_span(sd), nohz=off case remains the original
>>   behavior
>>
>> Cc: Peter Zijlstra 
>> Cc: Mel Gorman 
>> Cc: Vincent Guittot 
>> Cc: Qais Yousef 
>> Cc: Valentin Schneider 
>> Cc: Jiang Biao 
>> Cc: Tim Chen 
>> Signed-off-by: Aubrey Li 
> 
> This version looks good to me. I don't see regressions of v5 anymore
> and see some improvements on heavy cases

v5 or v8?

> 
> Reviewed-by: Vincent Guittot 
> 
>> ---
>>  include/linux/sched/topology.h | 13 ++
>>  kernel/sched/core.c|  2 ++
>>  kernel/sched/fair.c| 45 +-
>>  kernel/sched/idle.c|  5 
>>  kernel/sched/sched.h   |  4 +++
>>  kernel/sched/topology.c|  3 ++-
>>  6 files changed, 70 insertions(+), 2 deletions(-)
>>
>> diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
>> index 820511289857..b47b85163607 100644
>> --- a/include/linux/sched/topology.h
>> +++ b/include/linux/sched/topology.h
>> @@ -65,8 +65,21 @@ struct sched_domain_shared {
>> atomic_tref;
>> atomic_tnr_busy_cpus;
>> int has_idle_cores;
>> +   /*
>> +* Span of all idle CPUs in this domain.
>> +*
>> +* NOTE: this field is variable length. (Allocated dynamically
>> +* by attaching extra space to the end of the structure,
>> +* depending on how many CPUs the kernel has booted up with)
>> +*/
>> +   unsigned long   idle_cpus_span[];
>>  };
>>
>> +static inline struct cpumask *sds_idle_cpus(struct sched_domain_shared *sds)
>> +{
>> +   return to_cpumask(sds->idle_cpus_span);
>> +}
>> +
>>  struct sched_domain {
>> /* These fields must be setup */
>> struct sched_domain __rcu *parent;  /* top domain must be null 
>> terminated */
>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>> index c4da7e17b906..b136e2440ea4 100644
>> --- a/kernel/sched/core.c
>> +++ b/kernel/sched/core.c
>> @@ -4011,6 +4011,7 @@ void scheduler_tick(void)
>>
>>  #ifdef CONFIG_SMP
>> rq->idle_balance = idle_cpu(cpu);
>> +   update_idle_cpumask(cpu, rq->idle_balance);
>> trigger_load_balance(rq);
>>  #endif
>>  }
>> @@ -7186,6 +7187,7 @@ void __init sched_init(void)
>> rq->idle_stamp = 0;
>> rq->avg_idle = 2*sysctl_sched_mig

Re: [RFC PATCH v7] sched/fair: select idle cpu from idle cpumask for task wakeup

2020-12-10 Thread Li, Aubrey
On 2020/12/10 19:34, Mel Gorman wrote:
> On Thu, Dec 10, 2020 at 04:23:47PM +0800, Li, Aubrey wrote:
>>> I ran this patch with tbench on top of of the schedstat patches that
>>> track SIS efficiency. The tracking adds overhead so it's not a perfect
>>> performance comparison but the expectation would be that the patch reduces
>>> the number of runqueues that are scanned
>>
>> Thanks for the measurement! I don't play with tbench so may need a while
>> to digest the data.
>>
> 
> They key point is that it appears the idle mask was mostly equivalent to
> the full domain mask, at least for this test.

I'm more interested in how tbench behaves under heavy load.
If the load is heavy enough that the idle thread has no chance to switch in,
the idle cpumask will be empty at the first scheduler tick and remain empty
until the load comes down. During this period of heavy load:
- the default select_idle_cpu() still scans the entire sched domain (or is
  throttled to 4) every time
- the patched select_idle_cpu() does not scan at all
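
To make the contrast concrete, a simplified sketch of the one line that differs
in select_idle_cpu() (taken from the patch in this thread; the two cpumask_and()
lines are alternatives, not sequential code, and the loop body is abridged):

        /* default: candidates are all CPUs of the LLC domain allowed for p */
        cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);

        /* patched: candidates are only the CPUs currently marked idle; when
         * the domain has been fully busy since the last update, this mask is
         * empty and the loop below iterates over nothing.
         */
        cpumask_and(cpus, sds_idle_cpus(sd->shared), p->cpus_ptr);

        for_each_cpu_wrap(cpu, cpus, target) {
                if (!--nr)
                        return -1;
                if (available_idle_cpu(cpu) || sched_idle_cpu(cpu))
                        break;
        }
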
> 
>>>
>>> tbench4
>>>   5.10.0-rc6 5.10.0-rc6
>>>   schedstat-v1r1  idlemask-v7r1
>>> Hmean 1504.76 (   0.00%)  500.14 *  -0.91%*
>>> Hmean 2   1001.22 (   0.00%)  970.37 *  -3.08%*
>>> Hmean 4   1930.56 (   0.00%) 1880.96 *  -2.57%*
>>> Hmean 8   3688.05 (   0.00%) 3537.72 *  -4.08%*
>>> Hmean 16  6352.71 (   0.00%) 6439.53 *   1.37%*
>>> Hmean 32 10066.37 (   0.00%)10124.65 *   0.58%*
>>> Hmean 64 12846.32 (   0.00%)11627.27 *  -9.49%*
>>> Hmean 128   22278.41 (   0.00%)   22304.33 *   0.12%*
>>> Hmean 256   21455.52 (   0.00%)   20900.13 *  -2.59%*
>>> Hmean 320   21802.38 (   0.00%)   21928.81 *   0.58%*
>>>
>>> Not very optimistic result. The schedstats indicate;
>>
>> How many client threads was the following schedstats collected?
>>
> 
> That's the overall summary for all client counts. While proc-schedstat
> was measured every few seconds over all client counts, presenting that
> in text format is not easy to parse. However, looking at the graphs over
> time, it did not appear that scan rates were consistently lower for any
> client count for tbench.
> 
>>>
>>> 5.10.0-rc6 5.10.0-rc6
>>> schedstat-v1r1  idlemask-v7r1
>>> Ops TTWU Count   5599714302.00  5589495123.00
>>> Ops TTWU Local   2687713250.00  2563662550.00
>>> Ops SIS Search   5596677950.00  5586381168.00
>>> Ops SIS Domain Search3268344934.00  3229088045.00
>>> Ops SIS Scanned 15909069113.00 16568899405.00
>>> Ops SIS Domain Scanned  13580736097.00 14211606282.00
>>> Ops SIS Failures 2944874939.00  2843113421.00
>>> Ops SIS Core Search   262853975.00   311781774.00
>>> Ops SIS Core Hit  185189656.00   216097102.00
>>> Ops SIS Core Miss  77664319.0095684672.00
>>> Ops SIS Recent Used Hit   124265515.00   146021086.00
>>> Ops SIS Recent Used Miss  338142547.00   403547579.00
>>> Ops SIS Recent Attempts   462408062.00   549568665.00
>>> Ops SIS Search Efficiency        35.18      33.72
>>> Ops SIS Domain Search Eff        24.07      22.72
>>> Ops SIS Fast Success Rate        41.60      42.20
>>> Ops SIS Success Rate 47.38  49.11
>>> Ops SIS Recent Success Rate  26.87  26.57
>>>
>>> The field I would expect to decrease is SIS Domain Scanned -- the number
>>> of runqueues that were examined but it's actually worse and graphing over
>>> time shows it's worse for the client thread counts.  select_idle_cpu()
>>> is definitely being called because "Domain Search" is 10 times higher than
>>> "Core Search" and there "Core Miss" is non-zero.
>>
>> Why SIS Domain Scanned can be decreased?
>>
> 
> Because if idle CPUs are being targetted and its a subset of the entire
> domain then it follows that fewer runqueues should be examined when
> scanning the domain.

Sorry, I probably messed up "SIS Domain Scanned" and "SIS Domain Search".
How is "SIS Domain Scanned" calculated?

> 
>> I thought SIS Scanned was supposed to be decreased but it seems not on your 
>> side.
>>
> 
> It *should* have been decreased but it's indicating that more runqueues

Re: [RFC PATCH v7] sched/fair: select idle cpu from idle cpumask for task wakeup

2020-12-10 Thread Li, Aubrey
Hi Mel,

On 2020/12/9 22:36, Mel Gorman wrote:
> On Wed, Dec 09, 2020 at 02:24:04PM +0800, Aubrey Li wrote:
>> Add idle cpumask to track idle cpus in sched domain. Every time
>> a CPU enters idle, the CPU is set in idle cpumask to be a wakeup
>> target. And if the CPU is not in idle, the CPU is cleared in idle
>> cpumask during scheduler tick to ratelimit idle cpumask update.
>>
>> When a task wakes up to select an idle cpu, scanning idle cpumask
>> has lower cost than scanning all the cpus in last level cache domain,
>> especially when the system is heavily loaded.
>>
>> Benchmarks including hackbench, schbench, uperf, sysbench mysql
>> and kbuild were tested on a x86 4 socket system with 24 cores per
>> socket and 2 hyperthreads per core, total 192 CPUs, no regression
>> found.
>>
> 
> I ran this patch with tbench on top of of the schedstat patches that
> track SIS efficiency. The tracking adds overhead so it's not a perfect
> performance comparison but the expectation would be that the patch reduces
> the number of runqueues that are scanned

Thanks for the measurement! I don't play with tbench so may need a while
to digest the data.

> 
> tbench4
>   5.10.0-rc6 5.10.0-rc6
>   schedstat-v1r1  idlemask-v7r1
> Hmean 1504.76 (   0.00%)  500.14 *  -0.91%*
> Hmean 2   1001.22 (   0.00%)  970.37 *  -3.08%*
> Hmean 4   1930.56 (   0.00%) 1880.96 *  -2.57%*
> Hmean 8   3688.05 (   0.00%) 3537.72 *  -4.08%*
> Hmean 16  6352.71 (   0.00%) 6439.53 *   1.37%*
> Hmean 32 10066.37 (   0.00%)10124.65 *   0.58%*
> Hmean 64 12846.32 (   0.00%)11627.27 *  -9.49%*
> Hmean 128   22278.41 (   0.00%)   22304.33 *   0.12%*
> Hmean 256   21455.52 (   0.00%)   20900.13 *  -2.59%*
> Hmean 320   21802.38 (   0.00%)   21928.81 *   0.58%*
> 
> Not very optimistic result. The schedstats indicate;

For how many client threads were the following schedstats collected?

> 
> 5.10.0-rc6 5.10.0-rc6
> schedstat-v1r1  idlemask-v7r1
> Ops TTWU Count   5599714302.00  5589495123.00
> Ops TTWU Local   2687713250.00  2563662550.00
> Ops SIS Search   5596677950.00  5586381168.00
> Ops SIS Domain Search3268344934.00  3229088045.00
> Ops SIS Scanned 15909069113.00 16568899405.00
> Ops SIS Domain Scanned  13580736097.00 14211606282.00
> Ops SIS Failures 2944874939.00  2843113421.00
> Ops SIS Core Search   262853975.00   311781774.00
> Ops SIS Core Hit  185189656.00   216097102.00
> Ops SIS Core Miss  77664319.0095684672.00
> Ops SIS Recent Used Hit   124265515.00   146021086.00
> Ops SIS Recent Used Miss  338142547.00   403547579.00
> Ops SIS Recent Attempts   462408062.00   549568665.00
> Ops SIS Search Efficiency        35.18      33.72
> Ops SIS Domain Search Eff        24.07      22.72
> Ops SIS Fast Success Rate        41.60      42.20
> Ops SIS Success Rate 47.38  49.11
> Ops SIS Recent Success Rate  26.87  26.57
> 
> The field I would expect to decrease is SIS Domain Scanned -- the number
> of runqueues that were examined but it's actually worse and graphing over
> time shows it's worse for the client thread counts.  select_idle_cpu()
> is definitely being called because "Domain Search" is 10 times higher than
> "Core Search" and there "Core Miss" is non-zero.

Why would SIS Domain Scanned decrease?

I thought SIS Scanned was the one supposed to decrease, but it seems that is
not the case on your side.

I printed some trace logs on my side with the uperf workload, and they look
correct. To make the log easy to read, I started a 4-vCPU VM running uperf
with 8 threads for 2 seconds.

stage 1: system idle, update_idle_cpumask is called from idle thread, set 
cpumask to 0-3

  <idle>-0   [002] d..1   137.408681: update_idle_cpumask: set_idle-1, cpumask: 2
  <idle>-0   [000] d..1   137.408713: update_idle_cpumask: set_idle-1, cpumask: 0,2
  <idle>-0   [003] d..1   137.408924: update_idle_cpumask: set_idle-1, cpumask: 0,2-3
  <idle>-0   [001] d..1   137.409035: update_idle_cpumask: set_idle-1, cpumask: 0-3

stage 2: uperf ramp up, cpumask changes back and forth

   uperf-561   [003] d..3   137.410620: select_task_rq_fair: scanning: 0-3
   uperf-560   [000] d..5   137.411384: select_task_rq_fair: scanning: 0-3
kworker/u8:3-1

Re: [PATCH 2/4] sched/fair: Move avg_scan_cost calculations under SIS_PROP

2020-12-09 Thread Li, Aubrey
On 2020/12/8 23:34, Mel Gorman wrote:
> As noted by Vincent Guittot, avg_scan_costs are calculated for SIS_PROP
> even if SIS_PROP is disabled. Move the time calculations under a SIS_PROP
> check and while we are at it, exclude the cost of initialising the CPU
> mask from the average scan cost.
> 
> Signed-off-by: Mel Gorman 
> ---
>  kernel/sched/fair.c | 14 --
>  1 file changed, 8 insertions(+), 6 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index ac7b34e7372b..5c41875aec23 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6153,6 +6153,8 @@ static int select_idle_cpu(struct task_struct *p, 
> struct sched_domain *sd, int t
>   if (!this_sd)
>   return -1;
>  
> + cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
> +
>   if (sched_feat(SIS_PROP)) {
>   u64 avg_cost, avg_idle, span_avg;
>  
> @@ -6168,11 +6170,9 @@ static int select_idle_cpu(struct task_struct *p, 
> struct sched_domain *sd, int t
>   nr = div_u64(span_avg, avg_cost);
>   else
>   nr = 4;
> - }
> -
> - time = cpu_clock(this);
>  
> - cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
> + time = cpu_clock(this);
> + }
>  
>   for_each_cpu_wrap(cpu, cpus, target) {
>   if (!--nr)
>   return -1;

I thought about this again and it seems inconsistent in two ways:
- even if nr drops to 0, shouldn't avg_scan_cost be updated before returning -1?
- if avg_scan_cost is not going to be updated because nr throttles the scan,
  the first "time = cpu_clock(this);" could be skipped as well: nr has already
  been calculated at that point, so we already know whether the cpumask weight
  or nr is larger.
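
For the first point, a rough sketch of what I mean (hypothetical, not a tested
patch): break out of the loop instead of returning, so the throttled case still
falls through to the accounting below:

        for_each_cpu_wrap(cpu, cpus, target) {
                if (!--nr) {
                        cpu = -1;
                        break;  /* fall through so the scan cost is accounted */
                }
                if (available_idle_cpu(cpu) || sched_idle_cpu(cpu))
                        break;
        }

        if (sched_feat(SIS_PROP)) {
                time = cpu_clock(this) - time;
                update_avg(&this_sd->avg_scan_cost, time);
        }

        return cpu;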

Thanks,
-Aubrey


[RFC PATCH v8] sched/fair: select idle cpu from idle cpumask for task wakeup

2020-12-09 Thread Aubrey Li
Add idle cpumask to track idle cpus in sched domain. Every time
a CPU enters idle, the CPU is set in idle cpumask to be a wakeup
target. And if the CPU is not in idle, the CPU is cleared in idle
cpumask during scheduler tick to ratelimit idle cpumask update.

When a task wakes up to select an idle cpu, scanning idle cpumask
has lower cost than scanning all the cpus in last level cache domain,
especially when the system is heavily loaded.

Benchmarks including hackbench, schbench, uperf, sysbench mysql and
kbuild have been tested on a x86 4 socket system with 24 cores per
socket and 2 hyperthreads per core, total 192 CPUs, no regression
found.

v7->v8:
- refine update_idle_cpumask, no functionality change
- fix a suspicious RCU usage warning with CONFIG_PROVE_RCU=y

v6->v7:
- place the whole idle cpumask mechanism under CONFIG_SMP

v5->v6:
- decouple idle cpumask update from stop_tick signal, set idle CPU
  in idle cpumask every time the CPU enters idle

v4->v5:
- add update_idle_cpumask for s2idle case
- keep the same ordering of tick_nohz_idle_stop_tick() and update_
  idle_cpumask() everywhere

v3->v4:
- change setting idle cpumask from every idle entry to tickless idle
  if cpu driver is available
- move clearing idle cpumask to scheduler_tick to decouple nohz mode

v2->v3:
- change setting idle cpumask to every idle entry, otherwise schbench
  has a regression of 99th percentile latency
- change clearing idle cpumask to nohz_balancer_kick(), so updating
  idle cpumask is ratelimited in the idle exiting path
- set SCHED_IDLE cpu in idle cpumask to allow it as a wakeup target

v1->v2:
- idle cpumask is updated in the nohz routines, by initializing idle
  cpumask with sched_domain_span(sd), nohz=off case remains the original
  behavior

Cc: Peter Zijlstra 
Cc: Mel Gorman 
Cc: Vincent Guittot 
Cc: Qais Yousef 
Cc: Valentin Schneider 
Cc: Jiang Biao 
Cc: Tim Chen 
Signed-off-by: Aubrey Li 
---
 include/linux/sched/topology.h | 13 ++
 kernel/sched/core.c|  2 ++
 kernel/sched/fair.c| 45 +-
 kernel/sched/idle.c|  5 
 kernel/sched/sched.h   |  4 +++
 kernel/sched/topology.c|  3 ++-
 6 files changed, 70 insertions(+), 2 deletions(-)

diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index 820511289857..b47b85163607 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -65,8 +65,21 @@ struct sched_domain_shared {
	atomic_t	ref;
	atomic_t	nr_busy_cpus;
	int		has_idle_cores;
+   /*
+* Span of all idle CPUs in this domain.
+*
+* NOTE: this field is variable length. (Allocated dynamically
+* by attaching extra space to the end of the structure,
+* depending on how many CPUs the kernel has booted up with)
+*/
+   unsigned long   idle_cpus_span[];
 };
 
+static inline struct cpumask *sds_idle_cpus(struct sched_domain_shared *sds)
+{
+   return to_cpumask(sds->idle_cpus_span);
+}
+
 struct sched_domain {
/* These fields must be setup */
struct sched_domain __rcu *parent;  /* top domain must be null 
terminated */
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index c4da7e17b906..b136e2440ea4 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4011,6 +4011,7 @@ void scheduler_tick(void)
 
 #ifdef CONFIG_SMP
rq->idle_balance = idle_cpu(cpu);
+   update_idle_cpumask(cpu, rq->idle_balance);
trigger_load_balance(rq);
 #endif
 }
@@ -7186,6 +7187,7 @@ void __init sched_init(void)
rq->idle_stamp = 0;
rq->avg_idle = 2*sysctl_sched_migration_cost;
rq->max_idle_balance_cost = sysctl_sched_migration_cost;
+   rq->last_idle_state = 1;
 
INIT_LIST_HEAD(>cfs_tasks);
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c0c4d9ad7da8..25f36ecfee54 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6146,7 +6146,12 @@ static int select_idle_cpu(struct task_struct *p, struct 
sched_domain *sd, int t
 
time = cpu_clock(this);
 
-   cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
+   /*
+* sched_domain_shared is set only at shared cache level,
+* this works only because select_idle_cpu is called with
+* sd_llc.
+*/
+   cpumask_and(cpus, sds_idle_cpus(sd->shared), p->cpus_ptr);
 
for_each_cpu_wrap(cpu, cpus, target) {
if (!--nr)
@@ -6806,6 +6811,44 @@ balance_fair(struct rq *rq, struct task_struct *prev, 
struct rq_flags *rf)
 
return newidle_balance(rq, rf) != 0;
 }
+
+/*
+ * Update cpu idle state and record this information
+ * in sd_llc_shared->idle_cpus_span.
+ *
+ * This function is called with interrupts disabled.
+ */
+void update_idle_cpumask(int cpu, bool idle)
+{
+   st

Re: [RFC PATCH v7] sched/fair: select idle cpu from idle cpumask for task wakeup

2020-12-09 Thread Li, Aubrey
On 2020/12/9 21:09, Vincent Guittot wrote:
> On Wed, 9 Dec 2020 at 11:58, Li, Aubrey  wrote:
>>
>> On 2020/12/9 16:15, Vincent Guittot wrote:
>>> On Wednesday 09 Dec 2020 at 14:24:04 (+0800), Aubrey Li wrote:
>>>> Add idle cpumask to track idle cpus in sched domain. Every time
>>>> a CPU enters idle, the CPU is set in idle cpumask to be a wakeup
>>>> target. And if the CPU is not in idle, the CPU is cleared in idle
>>>> cpumask during scheduler tick to ratelimit idle cpumask update.
>>>>
>>>> When a task wakes up to select an idle cpu, scanning idle cpumask
>>>> has lower cost than scanning all the cpus in last level cache domain,
>>>> especially when the system is heavily loaded.
>>>>
>>>> Benchmarks including hackbench, schbench, uperf, sysbench mysql
>>>> and kbuild were tested on a x86 4 socket system with 24 cores per
>>>> socket and 2 hyperthreads per core, total 192 CPUs, no regression
>>>> found.
>>>>
>>>> v6->v7:
>>>> - place the whole idle cpumask mechanism under CONFIG_SMP.
>>>>
>>>> v5->v6:
>>>> - decouple idle cpumask update from stop_tick signal, set idle CPU
>>>>   in idle cpumask every time the CPU enters idle
>>>>
>>>> v4->v5:
>>>> - add update_idle_cpumask for s2idle case
>>>> - keep the same ordering of tick_nohz_idle_stop_tick() and update_
>>>>   idle_cpumask() everywhere
>>>>
>>>> v3->v4:
>>>> - change setting idle cpumask from every idle entry to tickless idle
>>>>   if cpu driver is available.
>>>> - move clearing idle cpumask to scheduler_tick to decouple nohz mode.
>>>>
>>>> v2->v3:
>>>> - change setting idle cpumask to every idle entry, otherwise schbench
>>>>   has a regression of 99th percentile latency.
>>>> - change clearing idle cpumask to nohz_balancer_kick(), so updating
>>>>   idle cpumask is ratelimited in the idle exiting path.
>>>> - set SCHED_IDLE cpu in idle cpumask to allow it as a wakeup target.
>>>>
>>>> v1->v2:
>>>> - idle cpumask is updated in the nohz routines, by initializing idle
>>>>   cpumask with sched_domain_span(sd), nohz=off case remains the original
>>>>   behavior.
>>>>
>>>> Cc: Peter Zijlstra 
>>>> Cc: Mel Gorman 
>>>> Cc: Vincent Guittot 
>>>> Cc: Qais Yousef 
>>>> Cc: Valentin Schneider 
>>>> Cc: Jiang Biao 
>>>> Cc: Tim Chen 
>>>> Signed-off-by: Aubrey Li 
>>>> ---
>>>>  include/linux/sched/topology.h | 13 +
>>>>  kernel/sched/core.c|  2 ++
>>>>  kernel/sched/fair.c| 51 +-
>>>>  kernel/sched/idle.c|  5 
>>>>  kernel/sched/sched.h   |  4 +++
>>>>  kernel/sched/topology.c|  3 +-
>>>>  6 files changed, 76 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/include/linux/sched/topology.h 
>>>> b/include/linux/sched/topology.h
>>>> index 820511289857..b47b85163607 100644
>>>> --- a/include/linux/sched/topology.h
>>>> +++ b/include/linux/sched/topology.h
>>>> @@ -65,8 +65,21 @@ struct sched_domain_shared {
>>>>  atomic_tref;
>>>>  atomic_tnr_busy_cpus;
>>>>  int has_idle_cores;
>>>> +/*
>>>> + * Span of all idle CPUs in this domain.
>>>> + *
>>>> + * NOTE: this field is variable length. (Allocated dynamically
>>>> + * by attaching extra space to the end of the structure,
>>>> + * depending on how many CPUs the kernel has booted up with)
>>>> + */
>>>> +unsigned long   idle_cpus_span[];
>>>>  };
>>>>
>>>> +static inline struct cpumask *sds_idle_cpus(struct sched_domain_shared 
>>>> *sds)
>>>> +{
>>>> +return to_cpumask(sds->idle_cpus_span);
>>>> +}
>>>> +
>>>>  struct sched_domain {
>>>>  /* These fields must be setup */
>>>>  struct sched_domain __rcu *parent;  /* top domain must be null 
>>>> terminated */
>>>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>>>> index c4da7e17b906..c4c51ff3402a 100644
>>>> --- a/kernel/sched/core.c
>>>> +++ b/kernel/sched/core.c
>>>> @@ -4011,6 +4011,7 @@ void scheduler_tick(void)
>>>>
>>>>  #ifdef CONFIG_SMP
>>>>  rq->idle_balance = idle_cpu(cpu);
>>>> +update_idle_cpumask(cpu, false);
>>>
>>> Test rq->idle_balance here instead of adding the test in 
>>> update_idle_cpumask which is only
>>> relevant for this situation.
>>
>> If called from the idle path, set_idle is true, so !set_idle is false and
>> rq->idle_balance won't actually be tested.
>>
>> if (!set_idle && rq->idle_balance)
>> return;
>>
>> So is it okay to leave it here to keep scheduler_tick a bit concise?
> 
> I don't like having a tick specific condition in a generic function.
> rq->idle_balance is only relevant in this case
> 
> calling update_idle_cpumask(rq->idle_balance) in scheduler_tick()
> should do the job and we can remove the check of rq->idle_balance in
> update_idle_cpumask()
> 
> In case of scheduler_tick() called when idle , we will only test if
> (rq->last_idle_state == idle_state) and return
> 

I see, will come up with a v8 soon.

Thanks,
-Aubrey
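
For reference, a minimal sketch of what that suggestion could look like in v8 (an
assumption, not the posted patch):

@@ void scheduler_tick(void)
 #ifdef CONFIG_SMP
 	rq->idle_balance = idle_cpu(cpu);
-	update_idle_cpumask(cpu, false);
+	update_idle_cpumask(cpu, rq->idle_balance);
 	trigger_load_balance(rq);
 #endif

with the "if (!set_idle && rq->idle_balance) return;" early return dropped from
update_idle_cpumask(), so the tick path only hits the rq->last_idle_state check.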




Re: [PATCH 2/4] sched/fair: Move avg_scan_cost calculations under SIS_PROP

2020-12-09 Thread Li, Aubrey
On 2020/12/9 17:05, Mel Gorman wrote:
> On Wed, Dec 09, 2020 at 01:28:11PM +0800, Li, Aubrey wrote:
>>>> nr = div_u64(span_avg, avg_cost);
>>>> else
>>>> nr = 4;
>>>> -   }
>>>> -
>>>> -   time = cpu_clock(this);
>>>>
>>>> -   cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
>>>> +   time = cpu_clock(this);
>>>> +   }
>>>>
>>>> for_each_cpu_wrap(cpu, cpus, target) {
>>>> if (!--nr)
>>
>> nr is the key of this throttling mechanism, so it needs to be placed under
>> sched_feat(SIS_PROP) as well.
>>
> 
> It isn't necessary as nr in initialised to INT_MAX if !SIS_PROP.
>If !SIS_PROP, nr still needs to be decremented and then tested in the loop,
>instead of testing the feature flag directly.
But testing SIS_PROP inside the loop would add a check on every iteration.
Since SIS_PROP is default true, I think it's okay to keep the current way.

Thanks,
-Aubrey
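
For reference, the resulting shape of select_idle_cpu() with Mel's patch applied,
reconstructed from the hunks quoted above; nr starts at INT_MAX (the existing
initialisation), which is why the in-loop "!--nr" test is effectively a no-op
when SIS_PROP is disabled:

	int cpu, nr = INT_MAX;

	if (sched_feat(SIS_PROP)) {
		u64 avg_cost, avg_idle, span_avg;

		/* ... compute avg_cost/avg_idle/span_avg ... */
		if (span_avg > 4*avg_cost)
			nr = div_u64(span_avg, avg_cost);
		else
			nr = 4;

		time = cpu_clock(this);
	}

	for_each_cpu_wrap(cpu, cpus, target) {
		if (!--nr)	/* never fires in practice when nr == INT_MAX */
			return -1;
		if (available_idle_cpu(cpu) || sched_idle_cpu(cpu))
			break;
	}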


Re: [RFC PATCH v7] sched/fair: select idle cpu from idle cpumask for task wakeup

2020-12-09 Thread Li, Aubrey
On 2020/12/9 16:15, Vincent Guittot wrote:
> Le mercredi 09 déc. 2020 à 14:24:04 (+0800), Aubrey Li a écrit :
>> Add idle cpumask to track idle cpus in sched domain. Every time
>> a CPU enters idle, the CPU is set in idle cpumask to be a wakeup
>> target. And if the CPU is not in idle, the CPU is cleared in idle
>> cpumask during scheduler tick to ratelimit idle cpumask update.
>>
>> When a task wakes up to select an idle cpu, scanning idle cpumask
>> has lower cost than scanning all the cpus in last level cache domain,
>> especially when the system is heavily loaded.
>>
>> Benchmarks including hackbench, schbench, uperf, sysbench mysql
>> and kbuild were tested on a x86 4 socket system with 24 cores per
>> socket and 2 hyperthreads per core, total 192 CPUs, no regression
>> found.
>>
>> v6->v7:
>> - place the whole idle cpumask mechanism under CONFIG_SMP.
>>
>> v5->v6:
>> - decouple idle cpumask update from stop_tick signal, set idle CPU
>>   in idle cpumask every time the CPU enters idle
>>
>> v4->v5:
>> - add update_idle_cpumask for s2idle case
>> - keep the same ordering of tick_nohz_idle_stop_tick() and update_
>>   idle_cpumask() everywhere
>>
>> v3->v4:
>> - change setting idle cpumask from every idle entry to tickless idle
>>   if cpu driver is available.
>> - move clearing idle cpumask to scheduler_tick to decouple nohz mode.
>>
>> v2->v3:
>> - change setting idle cpumask to every idle entry, otherwise schbench
>>   has a regression of 99th percentile latency.
>> - change clearing idle cpumask to nohz_balancer_kick(), so updating
>>   idle cpumask is ratelimited in the idle exiting path.
>> - set SCHED_IDLE cpu in idle cpumask to allow it as a wakeup target.
>>
>> v1->v2:
>> - idle cpumask is updated in the nohz routines, by initializing idle
>>   cpumask with sched_domain_span(sd), nohz=off case remains the original
>>   behavior.
>>
>> Cc: Peter Zijlstra 
>> Cc: Mel Gorman 
>> Cc: Vincent Guittot 
>> Cc: Qais Yousef 
>> Cc: Valentin Schneider 
>> Cc: Jiang Biao 
>> Cc: Tim Chen 
>> Signed-off-by: Aubrey Li 
>> ---
>>  include/linux/sched/topology.h | 13 +
>>  kernel/sched/core.c|  2 ++
>>  kernel/sched/fair.c| 51 +-
>>  kernel/sched/idle.c|  5 
>>  kernel/sched/sched.h   |  4 +++
>>  kernel/sched/topology.c|  3 +-
>>  6 files changed, 76 insertions(+), 2 deletions(-)
>>
>> diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
>> index 820511289857..b47b85163607 100644
>> --- a/include/linux/sched/topology.h
>> +++ b/include/linux/sched/topology.h
>> @@ -65,8 +65,21 @@ struct sched_domain_shared {
>>  atomic_tref;
>>  atomic_tnr_busy_cpus;
>>  int has_idle_cores;
>> +/*
>> + * Span of all idle CPUs in this domain.
>> + *
>> + * NOTE: this field is variable length. (Allocated dynamically
>> + * by attaching extra space to the end of the structure,
>> + * depending on how many CPUs the kernel has booted up with)
>> + */
>> +unsigned long   idle_cpus_span[];
>>  };
>>  
>> +static inline struct cpumask *sds_idle_cpus(struct sched_domain_shared *sds)
>> +{
>> +return to_cpumask(sds->idle_cpus_span);
>> +}
>> +
>>  struct sched_domain {
>>  /* These fields must be setup */
>>  struct sched_domain __rcu *parent;  /* top domain must be null 
>> terminated */
>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>> index c4da7e17b906..c4c51ff3402a 100644
>> --- a/kernel/sched/core.c
>> +++ b/kernel/sched/core.c
>> @@ -4011,6 +4011,7 @@ void scheduler_tick(void)
>>  
>>  #ifdef CONFIG_SMP
>>  rq->idle_balance = idle_cpu(cpu);
>> +update_idle_cpumask(cpu, false);
> 
> Test rq->idle_balance here instead of adding the test in update_idle_cpumask 
> which is only
> relevant for this situation.

If called from the idle path, set_idle is true, so !set_idle is false and
rq->idle_balance won't actually be tested.

if (!set_idle && rq->idle_balance)
return;

So is it okay to leave it here to keep scheduler_tick a bit concise?

Thanks,
-Aubrey


[RFC PATCH v7] sched/fair: select idle cpu from idle cpumask for task wakeup

2020-12-08 Thread Aubrey Li
Add idle cpumask to track idle cpus in sched domain. Every time
a CPU enters idle, the CPU is set in idle cpumask to be a wakeup
target. And if the CPU is not in idle, the CPU is cleared in idle
cpumask during scheduler tick to ratelimit idle cpumask update.

When a task wakes up to select an idle cpu, scanning idle cpumask
has lower cost than scanning all the cpus in last level cache domain,
especially when the system is heavily loaded.

Benchmarks including hackbench, schbench, uperf, sysbench mysql
and kbuild were tested on a x86 4 socket system with 24 cores per
socket and 2 hyperthreads per core, total 192 CPUs, no regression
found.

v6->v7:
- place the whole idle cpumask mechanism under CONFIG_SMP.

v5->v6:
- decouple idle cpumask update from stop_tick signal, set idle CPU
  in idle cpumask every time the CPU enters idle

v4->v5:
- add update_idle_cpumask for s2idle case
- keep the same ordering of tick_nohz_idle_stop_tick() and update_
  idle_cpumask() everywhere

v3->v4:
- change setting idle cpumask from every idle entry to tickless idle
  if cpu driver is available.
- move clearing idle cpumask to scheduler_tick to decouple nohz mode.

v2->v3:
- change setting idle cpumask to every idle entry, otherwise schbench
  has a regression of 99th percentile latency.
- change clearing idle cpumask to nohz_balancer_kick(), so updating
  idle cpumask is ratelimited in the idle exiting path.
- set SCHED_IDLE cpu in idle cpumask to allow it as a wakeup target.

v1->v2:
- idle cpumask is updated in the nohz routines, by initializing idle
  cpumask with sched_domain_span(sd), nohz=off case remains the original
  behavior.

Cc: Peter Zijlstra 
Cc: Mel Gorman 
Cc: Vincent Guittot 
Cc: Qais Yousef 
Cc: Valentin Schneider 
Cc: Jiang Biao 
Cc: Tim Chen 
Signed-off-by: Aubrey Li 
---
 include/linux/sched/topology.h | 13 +
 kernel/sched/core.c|  2 ++
 kernel/sched/fair.c| 51 +-
 kernel/sched/idle.c|  5 
 kernel/sched/sched.h   |  4 +++
 kernel/sched/topology.c|  3 +-
 6 files changed, 76 insertions(+), 2 deletions(-)

diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index 820511289857..b47b85163607 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -65,8 +65,21 @@ struct sched_domain_shared {
atomic_tref;
atomic_tnr_busy_cpus;
int has_idle_cores;
+   /*
+* Span of all idle CPUs in this domain.
+*
+* NOTE: this field is variable length. (Allocated dynamically
+* by attaching extra space to the end of the structure,
+* depending on how many CPUs the kernel has booted up with)
+*/
+   unsigned long   idle_cpus_span[];
 };
 
+static inline struct cpumask *sds_idle_cpus(struct sched_domain_shared *sds)
+{
+   return to_cpumask(sds->idle_cpus_span);
+}
+
 struct sched_domain {
/* These fields must be setup */
struct sched_domain __rcu *parent;  /* top domain must be null 
terminated */
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index c4da7e17b906..c4c51ff3402a 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4011,6 +4011,7 @@ void scheduler_tick(void)
 
 #ifdef CONFIG_SMP
rq->idle_balance = idle_cpu(cpu);
+   update_idle_cpumask(cpu, false);
trigger_load_balance(rq);
 #endif
 }
@@ -7186,6 +7187,7 @@ void __init sched_init(void)
rq->idle_stamp = 0;
rq->avg_idle = 2*sysctl_sched_migration_cost;
rq->max_idle_balance_cost = sysctl_sched_migration_cost;
+   rq->last_idle_state = 1;
 
INIT_LIST_HEAD(&rq->cfs_tasks);
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c0c4d9ad7da8..7306f8886120 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6146,7 +6146,12 @@ static int select_idle_cpu(struct task_struct *p, struct 
sched_domain *sd, int t
 
time = cpu_clock(this);
 
-   cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
+   /*
+* sched_domain_shared is set only at shared cache level,
+* this works only because select_idle_cpu is called with
+* sd_llc.
+*/
+   cpumask_and(cpus, sds_idle_cpus(sd->shared), p->cpus_ptr);
 
for_each_cpu_wrap(cpu, cpus, target) {
if (!--nr)
@@ -6806,6 +6811,50 @@ balance_fair(struct rq *rq, struct task_struct *prev, 
struct rq_flags *rf)
 
return newidle_balance(rq, rf) != 0;
 }
+
+/*
+ * Update cpu idle state and record this information
+ * in sd_llc_shared->idle_cpus_span.
+ */
+void update_idle_cpumask(int cpu, bool set_idle)
+{
+   struct sched_domain *sd;
+   struct rq *rq = cpu_rq(cpu);
+   int idle_state;
+
+   /*
+* If called from scheduler tick, only update
+* idle cpumask if the CPU is busy, as id

Re: [PATCH 2/4] sched/fair: Move avg_scan_cost calculations under SIS_PROP

2020-12-08 Thread Li, Aubrey
On 2020/12/9 0:03, Vincent Guittot wrote:
> On Tue, 8 Dec 2020 at 16:35, Mel Gorman  wrote:
>>
>> As noted by Vincent Guittot, avg_scan_costs are calculated for SIS_PROP
>> even if SIS_PROP is disabled. Move the time calculations under a SIS_PROP
>> check and while we are at it, exclude the cost of initialising the CPU
>> mask from the average scan cost.
>>
>> Signed-off-by: Mel Gorman 
>> ---
>>  kernel/sched/fair.c | 14 --
>>  1 file changed, 8 insertions(+), 6 deletions(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index ac7b34e7372b..5c41875aec23 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -6153,6 +6153,8 @@ static int select_idle_cpu(struct task_struct *p, 
>> struct sched_domain *sd, int t
>> if (!this_sd)
>> return -1;
> 
> Just noticed while reviewing the patch that the above related to
> this_sd can also go under sched_feat(SIS_PROP)
> 
>>
>> +   cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
>> +
>> if (sched_feat(SIS_PROP)) {
>> u64 avg_cost, avg_idle, span_avg;
>>
>> @@ -6168,11 +6170,9 @@ static int select_idle_cpu(struct task_struct *p, 
>> struct sched_domain *sd, int t
>> nr = div_u64(span_avg, avg_cost);
>> else
>> nr = 4;
>> -   }
>> -
>> -   time = cpu_clock(this);
>>
>> -   cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
>> +   time = cpu_clock(this);
>> +   }
>>
>> for_each_cpu_wrap(cpu, cpus, target) {
>> if (!--nr)

nr is the key of this throttling mechanism, so it needs to be placed under
sched_feat(SIS_PROP) as well.

Thanks,
-Aubrey


Re: [RFC PATCH v6] sched/fair: select idle cpu from idle cpumask for task wakeup

2020-12-08 Thread Li, Aubrey
Hi Peter,

Thanks for the comments.

On 2020/12/8 22:16, Peter Zijlstra wrote:
> On Tue, Dec 08, 2020 at 09:49:57AM +0800, Aubrey Li wrote:
>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>> index c4da7e17b906..b8af602dea79 100644
>> --- a/kernel/sched/core.c
>> +++ b/kernel/sched/core.c
>> @@ -3999,6 +3999,7 @@ void scheduler_tick(void)
>>  rq_lock(rq, &rf);
>>  
>>  update_rq_clock(rq);
>> +update_idle_cpumask(rq, false);
> 
> Does that really need to be done with rq->lock held?> 
>>  thermal_pressure = arch_scale_thermal_pressure(cpu_of(rq));
>>  update_thermal_load_avg(rq_clock_thermal(rq), rq, thermal_pressure);
> 
>> @@ -6808,6 +6813,51 @@ balance_fair(struct rq *rq, struct task_struct *prev, 
>> struct rq_flags *rf)
>>  }
>>  #endif /* CONFIG_SMP */
>>  
>> +/*
>> + * Update cpu idle state and record this information
>> + * in sd_llc_shared->idle_cpus_span.
>> + */
>> +void update_idle_cpumask(struct rq *rq, bool set_idle)
>> +{
>> +struct sched_domain *sd;
>> +int cpu = cpu_of(rq);
>> +int idle_state;
>> +
>> +/*
>> + * If called from scheduler tick, only update
>> + * idle cpumask if the CPU is busy, as idle
>> + * cpumask is also updated on idle entry.
>> + *
>> + */
>> +if (!set_idle && idle_cpu(cpu))
>> +return;
> 
> scheduler_tick() already calls idle_cpu() when SMP.
> 
>> +/*
>> + * Also set SCHED_IDLE cpu in idle cpumask to
>> + * allow SCHED_IDLE cpu as a wakeup target
>> + */
>> +idle_state = set_idle || sched_idle_cpu(cpu);
>> +/*
>> + * No need to update idle cpumask if the state
>> + * does not change.
>> + */
>> +if (rq->last_idle_state == idle_state)
>> +return;
>> +
>> +rcu_read_lock();
> 
> This is called with IRQs disabled, surely we can forgo rcu_read_lock()
> here.
> 
>> +sd = rcu_dereference(per_cpu(sd_llc, cpu));
>> +if (!sd || !sd->shared)
>> +goto unlock;
> 
> I don't think !sd->shared is possible here.
> 
>> +if (idle_state)
>> +cpumask_set_cpu(cpu, sds_idle_cpus(sd->shared));
>> +else
>> +cpumask_clear_cpu(cpu, sds_idle_cpus(sd->shared));
>> +
>> +rq->last_idle_state = idle_state;
>> +unlock:
>> +rcu_read_unlock();
>> +}
>> +
>>  static unsigned long wakeup_gran(struct sched_entity *se)
>>  {
>>  unsigned long gran = sysctl_sched_wakeup_granularity;
>> diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
>> index f324dc36fc43..f995660edf2b 100644
>> --- a/kernel/sched/idle.c
>> +++ b/kernel/sched/idle.c
>> @@ -156,6 +156,11 @@ static void cpuidle_idle_call(void)
>>  return;
>>  }
>>  
>> +/*
>> + * The CPU is about to go idle, set it in idle cpumask
>> + * to be a wake up target.
>> + */
>> +update_idle_cpumask(this_rq(), true);
> 
> This should be in do_idle(), right around arch_cpu_idle_enter().
> 
>>  /*
>>   * The RCU framework needs to be told that we are entering an idle
>>   * section, so no more rcu read side critical sections and one more
>> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
>> index 8d1ca65db3b0..db460b20217a 100644
>> --- a/kernel/sched/sched.h
>> +++ b/kernel/sched/sched.h
>> @@ -1004,6 +1004,7 @@ struct rq {
>>  /* This is used to determine avg_idle's max value */
>>  u64 max_idle_balance_cost;
>>  #endif /* CONFIG_SMP */
>> +unsigned char   last_idle_state;
> 
> All of that is pointless for UP. Also, is this the best location?
> 
Good point, I should put all of these under SMP. I'll refine the patch soon.

Thanks,
-Aubrey
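
For reference, a sketch of how Peter's do_idle() suggestion could be applied; the
exact placement in the next revision is an assumption:

--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ static void do_idle(void)
 		arch_cpu_idle_enter();
+
+		/*
+		 * The CPU is about to go idle; mark it in the LLC idle
+		 * cpumask so it can be picked as a wakeup target.
+		 */
+		update_idle_cpumask(cpu, true);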


Re: [RFC PATCH 0/4] Reduce worst-case scanning of runqueues in select_idle_sibling

2020-12-07 Thread Li, Aubrey
On 2020/12/7 23:42, Mel Gorman wrote:
> On Mon, Dec 07, 2020 at 04:04:41PM +0100, Vincent Guittot wrote:
>> On Mon, 7 Dec 2020 at 10:15, Mel Gorman  wrote:
>>>
>>> This is a minimal series to reduce the amount of runqueue scanning in
>>> select_idle_sibling in the worst case.
>>>
>>> Patch 1 removes SIS_AVG_CPU because it's unused.
>>>
>>> Patch 2 improves the hit rate of p->recent_used_cpu to reduce the amount
>>> of scanning. It should be relatively uncontroversial
>>>
>>> Patch 3-4 scans the runqueues in a single pass for select_idle_core()
>>> and select_idle_cpu() so runqueues are not scanned twice. It's
>>> a tradeoff because it benefits deep scans but introduces overhead
>>> for shallow scans.
>>>
>>> Even if patch 3-4 is rejected to allow more time for Aubrey's idle cpu mask
>>
>> patch 3 looks fine and doesn't collide with Aubrey's work. But I don't
>> like patch 4  which manipulates different cpumask including
>> load_balance_mask out of LB and I prefer to wait for v6 of Aubrey's
>> patchset which should fix the problem of possibly  scanning twice busy
>> cpus in select_idle_core and select_idle_cpu
>>
> 
> Seems fair, we can see where we stand after V6 of Aubrey's work.  A lot
> of the motivation for patch 4 would go away if we managed to avoid calling
> select_idle_core() unnecessarily. As it stands, we can call it a lot from
> hackbench even though the chance of getting an idle core are minimal.
> 

Sorry for the delay, I sent v6 out just now. Compared to v5, v6 follows Vincent's
suggestion to decouple the idle cpumask update from the stop_tick signal; that is,
the CPU is set in the idle cpumask every time it enters idle. This should address
Peter's concern about the facebook tail-latency workload, as I didn't see any
regression in the schbench 99th percentile latency report.

However, I also didn't see any significant benefit so far, so I probably should
put more load on the system. I'll do more characterization of the uperf workload
to see if I can find anything.

Thanks,
-Aubrey


[RFC PATCH v6] sched/fair: select idle cpu from idle cpumask for task wakeup

2020-12-07 Thread Aubrey Li
Add idle cpumask to track idle cpus in sched domain. Every time
a CPU enters idle, the CPU is set in idle cpumask to be a wakeup
target. And if the CPU is not in idle, the CPU is cleared in idle
cpumask during scheduler tick to ratelimit idle cpumask update.

When a task wakes up to select an idle cpu, scanning idle cpumask
has lower cost than scanning all the cpus in last level cache domain,
especially when the system is heavily loaded.

Benchmarks including hackbench, schbench, uperf, sysbench mysql
and kbuild were tested on a x86 4 socket system with 24 cores per
socket and 2 hyperthreads per core, total 192 CPUs, no significant
data change found.

v5->v6:
- decouple idle cpumask update from stop_tick signal, set idle CPU
  in idle cpumask every time the CPU enters idle

v4->v5:
- add update_idle_cpumask for s2idle case
- keep the same ordering of tick_nohz_idle_stop_tick() and update_
  idle_cpumask() everywhere

v3->v4:
- change setting idle cpumask from every idle entry to tickless idle
  if cpu driver is available.
- move clearing idle cpumask to scheduler_tick to decouple nohz mode.

v2->v3:
- change setting idle cpumask to every idle entry, otherwise schbench
  has a regression of 99th percentile latency.
- change clearing idle cpumask to nohz_balancer_kick(), so updating
  idle cpumask is ratelimited in the idle exiting path.
- set SCHED_IDLE cpu in idle cpumask to allow it as a wakeup target.

v1->v2:
- idle cpumask is updated in the nohz routines, by initializing idle
  cpumask with sched_domain_span(sd), nohz=off case remains the original
  behavior.

Cc: Mel Gorman 
Cc: Vincent Guittot 
Cc: Qais Yousef 
Cc: Valentin Schneider 
Cc: Jiang Biao 
Cc: Tim Chen 
Signed-off-by: Aubrey Li 
---
 include/linux/sched/topology.h | 13 +
 kernel/sched/core.c|  2 ++
 kernel/sched/fair.c| 52 +-
 kernel/sched/idle.c|  5 
 kernel/sched/sched.h   |  2 ++
 kernel/sched/topology.c|  3 +-
 6 files changed, 75 insertions(+), 2 deletions(-)

diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index 820511289857..b47b85163607 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -65,8 +65,21 @@ struct sched_domain_shared {
atomic_tref;
atomic_tnr_busy_cpus;
int has_idle_cores;
+   /*
+* Span of all idle CPUs in this domain.
+*
+* NOTE: this field is variable length. (Allocated dynamically
+* by attaching extra space to the end of the structure,
+* depending on how many CPUs the kernel has booted up with)
+*/
+   unsigned long   idle_cpus_span[];
 };
 
+static inline struct cpumask *sds_idle_cpus(struct sched_domain_shared *sds)
+{
+   return to_cpumask(sds->idle_cpus_span);
+}
+
 struct sched_domain {
/* These fields must be setup */
struct sched_domain __rcu *parent;  /* top domain must be null 
terminated */
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index c4da7e17b906..b8af602dea79 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3999,6 +3999,7 @@ void scheduler_tick(void)
rq_lock(rq, &rf);
 
update_rq_clock(rq);
+   update_idle_cpumask(rq, false);
thermal_pressure = arch_scale_thermal_pressure(cpu_of(rq));
update_thermal_load_avg(rq_clock_thermal(rq), rq, thermal_pressure);
curr->sched_class->task_tick(rq, curr, 0);
@@ -7197,6 +7198,7 @@ void __init sched_init(void)
rq_csd_init(rq, &rq->nohz_csd, nohz_csd_func);
 #endif
 #endif /* CONFIG_SMP */
+   rq->last_idle_state = 1;
hrtick_rq_init(rq);
atomic_set(&rq->nr_iowait, 0);
}
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c0c4d9ad7da8..1b5c7ed08544 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6146,7 +6146,12 @@ static int select_idle_cpu(struct task_struct *p, struct 
sched_domain *sd, int t
 
time = cpu_clock(this);
 
-   cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
+   /*
+* sched_domain_shared is set only at shared cache level,
+* this works only because select_idle_cpu is called with
+* sd_llc.
+*/
+   cpumask_and(cpus, sds_idle_cpus(sd->shared), p->cpus_ptr);
 
for_each_cpu_wrap(cpu, cpus, target) {
if (!--nr)
@@ -6808,6 +6813,51 @@ balance_fair(struct rq *rq, struct task_struct *prev, 
struct rq_flags *rf)
 }
 #endif /* CONFIG_SMP */
 
+/*
+ * Update cpu idle state and record this information
+ * in sd_llc_shared->idle_cpus_span.
+ */
+void update_idle_cpumask(struct rq *rq, bool set_idle)
+{
+   struct sched_domain *sd;
+   int cpu = cpu_of(rq);
+   int idle_state;
+
+   /*
+* If called from scheduler tick, only update
+* idle cpumask if the CPU is busy, as id
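
The kernel/sched/topology.c hunk (3 lines per the diffstat) is also cut off in
this archived copy. Given the "variable length / extra space attached to the end
of the structure" note in the sched_domain_shared change above, it presumably
just grows the allocation, roughly like the sketch below (an assumption, not the
posted hunk):

-		sds = kzalloc_node(sizeof(struct sched_domain_shared),
+		sds = kzalloc_node(sizeof(struct sched_domain_shared) +
+				   cpumask_size(),
 				GFP_KERNEL, cpu_to_node(j));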

Re: [PATCH 06/10] sched/fair: Clear the target CPU from the cpumask of CPUs searched

2020-12-04 Thread Li, Aubrey
On 2020/12/4 21:47, Vincent Guittot wrote:
> On Fri, 4 Dec 2020 at 14:40, Li, Aubrey  wrote:
>>
>> On 2020/12/4 21:17, Vincent Guittot wrote:
>>> On Fri, 4 Dec 2020 at 14:13, Vincent Guittot  
>>> wrote:
>>>>
>>>> On Fri, 4 Dec 2020 at 12:30, Mel Gorman  
>>>> wrote:
>>>>>
>>>>> On Fri, Dec 04, 2020 at 11:56:36AM +0100, Vincent Guittot wrote:
>>>>>>> The intent was that the sibling might still be an idle candidate. In
>>>>>>> the current draft of the series, I do not even clear this so that the
>>>>>>> SMT sibling is considered as an idle candidate. The reasoning is that if
>>>>>>> there are no idle cores then an SMT sibling of the target is as good an
>>>>>>> idle CPU to select as any.
>>>>>>
>>>>>> Isn't the purpose of select_idle_smt ?
>>>>>>
>>>>>
>>>>> Only in part.
>>>>>
>>>>>> select_idle_core() looks for an idle core and opportunistically saves
>>>>>> an idle CPU candidate to skip select_idle_cpu. In this case this is
>>>>>> useless loops for select_idle_core() because we are sure that the core
>>>>>> is not idle
>>>>>>
>>>>>
>>>>> If select_idle_core() finds an idle candidate other than the sibling,
>>>>> it'll use it if there is no idle core -- it picks a busy sibling based
>>>>> on a linear walk of the cpumask. Similarly, select_idle_cpu() is not
>>>>
>>>> My point is that it's a waste of time to loop the sibling cpus of
>>>> target in select_idle_core because it will not help to find an idle
>>>> core. The sibling  cpus will then be check either by select_idle_cpu
>>>> of select_idle_smt
>>>
>>> also, while looping the cpumask, the sibling cpus of not idle cpu are
>>> removed and will not be check
>>>
>>
>> IIUC, select_idle_core and select_idle_cpu share the same 
>> cpumask(select_idle_mask)?
>> If the target's sibling is removed from select_idle_mask from 
>> select_idle_core(),
>> select_idle_cpu() will lose the chance to pick it up?
> 
> This is only relevant for patch 10 which is not to be included IIUC
> what mel said in cover letter : "Patches 9 and 10 are stupid in the
> context of this series."

So the target's sibling can be removed from the cpumask in select_idle_core in
patch 6, and needs to be added back in select_idle_core in patch 10 :)


Re: [PATCH 06/10] sched/fair: Clear the target CPU from the cpumask of CPUs searched

2020-12-04 Thread Li, Aubrey
On 2020/12/4 21:40, Li, Aubrey wrote:
> On 2020/12/4 21:17, Vincent Guittot wrote:
>> On Fri, 4 Dec 2020 at 14:13, Vincent Guittot  
>> wrote:
>>>
>>> On Fri, 4 Dec 2020 at 12:30, Mel Gorman  wrote:
>>>>
>>>> On Fri, Dec 04, 2020 at 11:56:36AM +0100, Vincent Guittot wrote:
>>>>>> The intent was that the sibling might still be an idle candidate. In
>>>>>> the current draft of the series, I do not even clear this so that the
>>>>>> SMT sibling is considered as an idle candidate. The reasoning is that if
>>>>>> there are no idle cores then an SMT sibling of the target is as good an
>>>>>> idle CPU to select as any.
>>>>>
>>>>> Isn't the purpose of select_idle_smt ?
>>>>>
>>>>
>>>> Only in part.
>>>>
>>>>> select_idle_core() looks for an idle core and opportunistically saves
>>>>> an idle CPU candidate to skip select_idle_cpu. In this case this is
>>>>> useless loops for select_idle_core() because we are sure that the core
>>>>> is not idle
>>>>>
>>>>
>>>> If select_idle_core() finds an idle candidate other than the sibling,
>>>> it'll use it if there is no idle core -- it picks a busy sibling based
>>>> on a linear walk of the cpumask. Similarly, select_idle_cpu() is not
>>>
>>> My point is that it's a waste of time to loop the sibling cpus of
>>> target in select_idle_core because it will not help to find an idle
>>> core. The sibling  cpus will then be check either by select_idle_cpu
>>> of select_idle_smt
>>
>> also, while looping the cpumask, the sibling cpus of not idle cpu are
>> removed and will not be check
>>
> 
> IIUC, select_idle_core and select_idle_cpu share the same 
> cpumask(select_idle_mask)?
> If the target's sibling is removed from select_idle_mask from 
> select_idle_core(),
> select_idle_cpu() will lose the chance to pick it up?

aha, no, select_idle_mask will be re-assigned in select_idle_cpu() by:

cpumask_and(cpus, sds_idle_cpus(sd->shared), p->cpus_ptr);

So, yes, I guess we can remove the cpu_smt_mask(target) from select_idle_core() 
safely.
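
In the current code, both scans start from the same per-CPU scratch mask but each
re-initialises it on entry, which is why clearing bits in one does not starve the
other. A sketch of the two entry points (assuming the usual select_idle_mask
scratch cpumask):

	/* select_idle_core() */
	struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask);

	cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);

	/* select_idle_cpu(), with the idle cpumask patch applied */
	struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask);

	cpumask_and(cpus, sds_idle_cpus(sd->shared), p->cpus_ptr);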

> 
> Thanks,
> -Aubrey
> 



Re: [PATCH 06/10] sched/fair: Clear the target CPU from the cpumask of CPUs searched

2020-12-04 Thread Li, Aubrey
On 2020/12/4 21:17, Vincent Guittot wrote:
> On Fri, 4 Dec 2020 at 14:13, Vincent Guittot  
> wrote:
>>
>> On Fri, 4 Dec 2020 at 12:30, Mel Gorman  wrote:
>>>
>>> On Fri, Dec 04, 2020 at 11:56:36AM +0100, Vincent Guittot wrote:
>>>>> The intent was that the sibling might still be an idle candidate. In
>>>>> the current draft of the series, I do not even clear this so that the
>>>>> SMT sibling is considered as an idle candidate. The reasoning is that if
>>>>> there are no idle cores then an SMT sibling of the target is as good an
>>>>> idle CPU to select as any.
>>>>
>>>> Isn't the purpose of select_idle_smt ?
>>>>
>>>
>>> Only in part.
>>>
>>>> select_idle_core() looks for an idle core and opportunistically saves
>>>> an idle CPU candidate to skip select_idle_cpu. In this case this is
>>>> useless loops for select_idle_core() because we are sure that the core
>>>> is not idle
>>>>
>>>
>>> If select_idle_core() finds an idle candidate other than the sibling,
>>> it'll use it if there is no idle core -- it picks a busy sibling based
>>> on a linear walk of the cpumask. Similarly, select_idle_cpu() is not
>>
>> My point is that it's a waste of time to loop the sibling cpus of
>> target in select_idle_core because it will not help to find an idle
>> core. The sibling  cpus will then be check either by select_idle_cpu
>> of select_idle_smt
> 
> also, while looping the cpumask, the sibling cpus of not idle cpu are
> removed and will not be check
>

IIUC, select_idle_core and select_idle_cpu share the same 
cpumask(select_idle_mask)?
If the target's sibling is removed from select_idle_mask from 
select_idle_core(),
select_idle_cpu() will lose the chance to pick it up?

Thanks,
-Aubrey


Re: [PATCH -tip 14/32] sched: migration changes for core scheduling

2020-12-02 Thread Li, Aubrey
On 2020/12/2 22:09, Li, Aubrey wrote:
> Hi Balbir,
> 
> I've kept the patch embedded in this thread; any comments are welcome.

Sorry, that version needs more work; it is refined as below. I also realized
I should put a version number on the patch, starting from v2 now.

Thanks,
-Aubrey
==
>From aff2919889635aa9311d15bac3e949af0300ddc1 Mon Sep 17 00:00:00 2001
From: Aubrey Li 
Date: Thu, 3 Dec 2020 00:51:18 +
Subject: [PATCH v2] sched: migration changes for core scheduling

 - Don't migrate if there is a cookie mismatch
 Load balance tries to move task from busiest CPU to the
 destination CPU. When core scheduling is enabled, if the
 task's cookie does not match with the destination CPU's
 core cookie, this task will be skipped by this CPU. This
 mitigates the forced idle time on the destination CPU.

 - Select cookie matched idle CPU
 In the fast path of task wakeup, select the first cookie matched
 idle CPU instead of the first idle CPU.

 - Find cookie matched idlest CPU
 In the slow path of task wakeup, find the idlest CPU whose core
 cookie matches with task's cookie

 - Don't migrate task if cookie not match
 For the NUMA load balance, don't migrate task to the CPU whose
 core cookie does not match with task's cookie

Cc: Balbir Singh 
Cc: Vincent Guittot 
Tested-by: Julien Desfossez 
Signed-off-by: Aubrey Li 
Signed-off-by: Tim Chen 
Signed-off-by: Vineeth Remanan Pillai 
Signed-off-by: Joel Fernandes (Google) 
---
 kernel/sched/fair.c  | 33 +---
 kernel/sched/sched.h | 72 
 2 files changed, 101 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index de82f88ba98c..afdfea70c58c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1921,6 +1921,13 @@ static void task_numa_find_cpu(struct task_numa_env *env,
if (!cpumask_test_cpu(cpu, env->p->cpus_ptr))
continue;
 
+   /*
+* Skip this cpu if source task's cookie does not match
+* with CPU's core cookie.
+*/
+   if (!sched_core_cookie_match(cpu_rq(cpu), env->p))
+   continue;
+
env->dst_cpu = cpu;
if (task_numa_compare(env, taskimp, groupimp, maymove))
break;
@@ -5867,11 +5874,15 @@ find_idlest_group_cpu(struct sched_group *group, struct 
task_struct *p, int this
 
/* Traverse only the allowed CPUs */
for_each_cpu_and(i, sched_group_span(group), p->cpus_ptr) {
+   struct rq *rq = cpu_rq(i);
+
+   if (!sched_core_cookie_match(rq, p))
+   continue;
+
if (sched_idle_cpu(i))
return i;
 
if (available_idle_cpu(i)) {
-   struct rq *rq = cpu_rq(i);
struct cpuidle_state *idle = idle_get_state(rq);
if (idle && idle->exit_latency < min_exit_latency) {
/*
@@ -6129,7 +6140,9 @@ static int select_idle_cpu(struct task_struct *p, struct 
sched_domain *sd, int t
for_each_cpu_wrap(cpu, cpus, target) {
if (!--nr)
return -1;
-   if (available_idle_cpu(cpu) || sched_idle_cpu(cpu))
+
+   if ((available_idle_cpu(cpu) || sched_idle_cpu(cpu)) &&
+   sched_cpu_cookie_match(cpu_rq(cpu), p))
break;
}
 
@@ -7530,8 +7543,9 @@ int can_migrate_task(struct task_struct *p, struct lb_env 
*env)
 * We do not migrate tasks that are:
 * 1) throttled_lb_pair, or
 * 2) cannot be migrated to this CPU due to cpus_ptr, or
-* 3) running (obviously), or
-* 4) are cache-hot on their current CPU.
+* 3) task's cookie does not match with this CPU's core cookie
+* 4) running (obviously), or
+* 5) are cache-hot on their current CPU.
 */
if (throttled_lb_pair(task_group(p), env->src_cpu, env->dst_cpu))
return 0;
@@ -7566,6 +7580,13 @@ int can_migrate_task(struct task_struct *p, struct 
lb_env *env)
return 0;
}
 
+   /*
+* Don't migrate task if the task's cookie does not match
+* with the destination CPU's core cookie.
+*/
+   if (!sched_core_cookie_match(cpu_rq(env->dst_cpu), p))
+   return 0;
+
/* Record that we found atleast one task that could run on dst_cpu */
env->flags &= ~LBF_ALL_PINNED;
 
@@ -8792,6 +8813,10 @@ find_idlest_group(struct sched_domain *sd, struct 
task_struct *p, int this_cpu)
p->cpus_ptr))
continue;
 
+   /* Skip over this group if no co
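
The sched.h hunk with the cookie-match helpers is cut off above. For readers
following the call sites, one plausible shape of sched_core_cookie_match() is
sketched below; the details are an assumption, not the posted code:

static inline bool sched_core_cookie_match(struct rq *rq, struct task_struct *p)
{
	bool idle_core = true;
	int cpu;

	/* Ignore cookie match if core scheduling is not enabled on the CPU. */
	if (!sched_core_enabled(rq))
		return true;

	for_each_cpu(cpu, cpu_smt_mask(cpu_of(rq))) {
		if (!available_idle_cpu(cpu)) {
			idle_core = false;
			break;
		}
	}

	/* An idle core is always a good target, whatever the cookie. */
	return idle_core || rq->core->core_cookie == p->core_cookie;
}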

Re: [sched/fair] 8d86968ac3: netperf.Throughput_tps -29.5% regression

2020-12-02 Thread Li, Aubrey
Hi Mel,

On 2020/11/26 20:13, Mel Gorman wrote:
> On Thu, Nov 26, 2020 at 02:57:07PM +0800, Li, Aubrey wrote:
>> Hi Robot,
>>
>> On 2020/11/25 17:09, kernel test robot wrote:
>>> Greeting,
>>>
>>> FYI, we noticed a -29.5% regression of netperf.Throughput_tps due to commit:
>>>
>>>
>>> commit: 8d86968ac36ea5bff487f70b5ffc252a87d44c51 ("[RFC PATCH v4] 
>>> sched/fair: select idle cpu from idle cpumask for task wakeup")
>>> url: 
>>> https://github.com/0day-ci/linux/commits/Aubrey-Li/sched-fair-select-idle-cpu-from-idle-cpumask-for-task-wakeup/20201118-115145
>>> base: https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git 
>>> 09162bc32c880a791c6c0668ce0745cf7958f576
>>
>> I tried to replicate this on my side on a 192 threads(with SMT) machine as 
>> well and didn't see the regression.
>>
>> nr_threads   v5.9.8  +patch
>> 96(50%)  1 (+/- 2.499%)  1.007672(+/- 3.0872%)
>>
>> I also tested another 100% case and see similar improvement as what I saw on 
>> uperf benchmark
>>
>> nr_threads   v5.9.8  +patch
>> 192(100%)1 (+/- 45.32%)  1.864917(+/- 23.29%)
>>
>> My base is v5.9.8 BTW.
>>
>>> ip: ipv4
>>> runtime: 300s
>>> nr_threads: 50%
>>> cluster: cs-localhost
>>> test: UDP_RR
>>> cpufreq_governor: performance
>>> ucode: 0x5003003
>>>
> 
> Note that I suspect that regressions with this will be tricky to reproduce
> because it'll depend on the timing of when the idle mask gets updated. With
> this configuration there are 50% "threads" which likely gets translates
> into 1 client/server per thread or 100% of CPUs active but as it's a
> ping-pong workload, the pairs are rapidly idling for very short periods.

I tried to replicate this regression but found nothing solid. I ran the 300s,
50%-threads netperf case 30 times, and all the results were better than the
baseline. The only interesting thing I found is the option
CONFIG_DEBUG_FORCE_FUNCTION_ALIGN_32B=y, but it behaves differently on different
machines. In case I missed anything, do you have any suggestions for replicating
this regression?

> 
> If the idle mask is not getting cleared then select_idle_cpu() is
> probably returning immediately. select_idle_core() is almost certainly
> failing so that just leaves select_idle_smt() to find a potentially idle
> CPU. That's a limited search space so tasks may be getting stacked and
> missing CPUs that are idling for short periods.

Vincent suggested we decouple the idle cpumask update from short idle (stop tick)
and set it every time the CPU enters idle; I'll make this change in V6.

> 
> On the flip side, I expect cases like hackbench to benefit because it
> can saturate a machine to such a degree that select_idle_cpu() is a waste
> of time.

Yes, I believe that's also why I saw uperf/netperf improvement at high
load levels.

> 
> That said, I haven't followed the different versions closely. I know v5
> got a lot of feedback so will take a closer look at v6. Fundamentally
> though I expect that using the idle mask will be a mixed bag. At low
> utilisation or over-saturation, it'll be a benefit. At the point where
> the machine is almost fully busy, some workloads will benefit (lightly
> communicating workloads that occasionally migrate) and others will not
> (ping-pong workloads looking for CPUs that are idle for very brief
> periods).

Do you have any workload [matrix] of interest that I can use for the measurement?

> 
> It's tricky enough that it might benefit from a sched_feat() check that
> is default true so it gets tested. For regressions that show up, it'll
> be easy enough to ask for the feature to be disabled to see if it fixes
> it. Over time, that might give an idea of exactly what sort of workloads
> benefit and what suffers.

Okay, I'll add a sched_feat() for this feature.

> 
> Note that the cost of select_idle_cpu() can also be reduced by enabling
> SIS_AVG_CPU so it would be interesting to know if the idle mask is superior
> or inferior to SIS_AVG_CPU for workloads that show regressions.
> 

Thanks,
-Aubrey
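
A minimal sketch of the sched_feat() guard mentioned above; the feature name
SIS_IDLE_MASK is made up here for illustration, not taken from any posted patch:

--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
+SCHED_FEAT(SIS_IDLE_MASK, true)

--- a/kernel/sched/fair.c (select_idle_cpu)
-	cpumask_and(cpus, sds_idle_cpus(sd->shared), p->cpus_ptr);
+	if (sched_feat(SIS_IDLE_MASK))
+		cpumask_and(cpus, sds_idle_cpus(sd->shared), p->cpus_ptr);
+	else
+		cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);

With that, a regression report can be re-tested with the feature flipped off via
/sys/kernel/debug/sched_features, as Mel suggests above.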


Re: [PATCH -tip 14/32] sched: migration changes for core scheduling

2020-12-02 Thread Li, Aubrey
Hi Balbir,

I've kept the patch embedded in this thread; any comments are welcome.

Thanks,
-Aubrey
==

>From d64455dcaf47329673903a68a9df1151400cdd7a Mon Sep 17 00:00:00 2001
From: Aubrey Li 
Date: Wed, 2 Dec 2020 13:53:30 +
Subject: [PATCH] sched: migration changes for core scheduling

 - Don't migrate if there is a cookie mismatch
 Load balance tries to move task from busiest CPU to the
 destination CPU. When core scheduling is enabled, if the
 task's cookie does not match with the destination CPU's
 core cookie, this task will be skipped by this CPU. This
 mitigates the forced idle time on the destination CPU.

 - Select cookie matched idle CPU
 In the fast path of task wakeup, select the first cookie matched
 idle CPU instead of the first idle CPU.

 - Find cookie matched idlest CPU
 In the slow path of task wakeup, find the idlest CPU whose core
 cookie matches with task's cookie

 - Don't migrate task if cookie not match
 For the NUMA load balance, don't migrate task to the CPU whose
 core cookie does not match with task's cookie

Cc: Balbir Singh 
Cc: Vincent Guittot 
Tested-by: Julien Desfossez 
Signed-off-by: Aubrey Li 
Signed-off-by: Tim Chen 
Signed-off-by: Vineeth Remanan Pillai 
Signed-off-by: Joel Fernandes (Google) 
---
 kernel/sched/fair.c  | 33 +---
 kernel/sched/sched.h | 71 
 2 files changed, 100 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index de82f88ba98c..b8657766b660 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1921,6 +1921,13 @@ static void task_numa_find_cpu(struct task_numa_env *env,
if (!cpumask_test_cpu(cpu, env->p->cpus_ptr))
continue;
 
+   /*
+* Skip this cpu if source task's cookie does not match
+* with CPU's core cookie.
+*/
+   if (!sched_core_cookie_match(cpu_rq(cpu), env->p))
+   continue;
+
env->dst_cpu = cpu;
if (task_numa_compare(env, taskimp, groupimp, maymove))
break;
@@ -5867,11 +5874,15 @@ find_idlest_group_cpu(struct sched_group *group, struct 
task_struct *p, int this
 
/* Traverse only the allowed CPUs */
for_each_cpu_and(i, sched_group_span(group), p->cpus_ptr) {
+   struct rq *rq = cpu_rq(i);
+
+   if (!sched_core_cookie_match(rq, p))
+   continue;
+
if (sched_idle_cpu(i))
return i;
 
if (available_idle_cpu(i)) {
-   struct rq *rq = cpu_rq(i);
struct cpuidle_state *idle = idle_get_state(rq);
if (idle && idle->exit_latency < min_exit_latency) {
/*
@@ -6129,7 +6140,9 @@ static int select_idle_cpu(struct task_struct *p, struct 
sched_domain *sd, int t
for_each_cpu_wrap(cpu, cpus, target) {
if (!--nr)
return -1;
-   if (available_idle_cpu(cpu) || sched_idle_cpu(cpu))
+
+   if (available_idle_cpu(cpu) || sched_idle_cpu(cpu) &&
+   sched_cpu_cookie_match(cpu_rq(cpu), p))
break;
}
 
@@ -7530,8 +7543,9 @@ int can_migrate_task(struct task_struct *p, struct lb_env 
*env)
 * We do not migrate tasks that are:
 * 1) throttled_lb_pair, or
 * 2) cannot be migrated to this CPU due to cpus_ptr, or
-* 3) running (obviously), or
-* 4) are cache-hot on their current CPU.
+* 3) task's cookie does not match with this CPU's core cookie
+* 4) running (obviously), or
+* 5) are cache-hot on their current CPU.
 */
if (throttled_lb_pair(task_group(p), env->src_cpu, env->dst_cpu))
return 0;
@@ -7566,6 +7580,13 @@ int can_migrate_task(struct task_struct *p, struct 
lb_env *env)
return 0;
}
 
+   /*
+* Don't migrate task if the task's cookie does not match
+* with the destination CPU's core cookie.
+*/
+   if (!sched_core_cookie_match(cpu_rq(env->dst_cpu), p))
+   return 0;
+
/* Record that we found atleast one task that could run on dst_cpu */
env->flags &= ~LBF_ALL_PINNED;
 
@@ -8792,6 +8813,10 @@ find_idlest_group(struct sched_domain *sd, struct 
task_struct *p, int this_cpu)
p->cpus_ptr))
continue;
 
+   /* Skip over this group if no cookie matched */
+   if (!sched_group_cookie_match(cpu_rq(this_cpu), p, group))
+   continue;
+
local_group = cpumask_test_cpu(this_cpu,
  

Re: [PATCH] sched/fair: Clear SMT siblings after determining the core is not idle

2020-11-30 Thread Li, Aubrey
On 2020/11/30 22:47, Vincent Guittot wrote:
> On Mon, 30 Nov 2020 at 15:40, Mel Gorman  wrote:
>>
>> The clearing of SMT siblings from the SIS mask before checking for an idle
>> core is a small but unnecessary cost. Defer the clearing of the siblings
>> until the scan moves to the next potential target. The cost of this was
>> not measured as it is borderline noise but it should be self-evident.
> 
> Good point

This is more reasonable, thanks Mel.

> 
>>
>> Signed-off-by: Mel Gorman 
> 
> Reviewed-by: Vincent Guittot 
> 
>> ---
>>  kernel/sched/fair.c | 3 ++-
>>  1 file changed, 2 insertions(+), 1 deletion(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 0d54d69ba1a5..d9acd55d309b 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -6087,10 +6087,11 @@ static int select_idle_core(struct task_struct *p, 
>> struct sched_domain *sd, int
>> break;
>> }
>> }
>> -   cpumask_andnot(cpus, cpus, cpu_smt_mask(core));
>>
>> if (idle)
>> return core;
>> +
>> +   cpumask_andnot(cpus, cpus, cpu_smt_mask(core));
>> }
>>
>> /*



Re: [PATCH -tip 14/32] sched: migration changes for core scheduling

2020-11-30 Thread Li, Aubrey
On 2020/11/30 18:35, Vincent Guittot wrote:
> On Wed, 18 Nov 2020 at 00:20, Joel Fernandes (Google)
>  wrote:
>>
>> From: Aubrey Li 
>>
>>  - Don't migrate if there is a cookie mismatch
>>  Load balance tries to move task from busiest CPU to the
>>  destination CPU. When core scheduling is enabled, if the
>>  task's cookie does not match with the destination CPU's
>>  core cookie, this task will be skipped by this CPU. This
>>  mitigates the forced idle time on the destination CPU.
>>
>>  - Select cookie matched idle CPU
>>  In the fast path of task wakeup, select the first cookie matched
>>  idle CPU instead of the first idle CPU.
>>
>>  - Find cookie matched idlest CPU
>>  In the slow path of task wakeup, find the idlest CPU whose core
>>  cookie matches with task's cookie
>>
>>  - Don't migrate task if cookie not match
>>  For the NUMA load balance, don't migrate task to the CPU whose
>>  core cookie does not match with task's cookie
>>
>> Tested-by: Julien Desfossez 
>> Signed-off-by: Aubrey Li 
>> Signed-off-by: Tim Chen 
>> Signed-off-by: Vineeth Remanan Pillai 
>> Signed-off-by: Joel Fernandes (Google) 
>> ---
>>  kernel/sched/fair.c  | 64 
>>  kernel/sched/sched.h | 29 
>>  2 files changed, 88 insertions(+), 5 deletions(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index de82f88ba98c..ceb3906c9a8a 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -1921,6 +1921,15 @@ static void task_numa_find_cpu(struct task_numa_env 
>> *env,
>> if (!cpumask_test_cpu(cpu, env->p->cpus_ptr))
>> continue;
>>
>> +#ifdef CONFIG_SCHED_CORE
>> +   /*
>> +* Skip this cpu if source task's cookie does not match
>> +* with CPU's core cookie.
>> +*/
>> +   if (!sched_core_cookie_match(cpu_rq(cpu), env->p))
>> +   continue;
>> +#endif
>> +
>> env->dst_cpu = cpu;
>> if (task_numa_compare(env, taskimp, groupimp, maymove))
>> break;
>> @@ -5867,11 +5876,17 @@ find_idlest_group_cpu(struct sched_group *group, 
>> struct task_struct *p, int this
>>
>> /* Traverse only the allowed CPUs */
>> for_each_cpu_and(i, sched_group_span(group), p->cpus_ptr) {
>> +   struct rq *rq = cpu_rq(i);
>> +
>> +#ifdef CONFIG_SCHED_CORE
>> +   if (!sched_core_cookie_match(rq, p))
>> +   continue;
>> +#endif
>> +
>> if (sched_idle_cpu(i))
>> return i;
>>
>> if (available_idle_cpu(i)) {
>> -   struct rq *rq = cpu_rq(i);
>> struct cpuidle_state *idle = idle_get_state(rq);
>> if (idle && idle->exit_latency < min_exit_latency) {
>> /*
>> @@ -6129,8 +6144,18 @@ static int select_idle_cpu(struct task_struct *p, 
>> struct sched_domain *sd, int t
>> for_each_cpu_wrap(cpu, cpus, target) {
>> if (!--nr)
>> return -1;
>> -   if (available_idle_cpu(cpu) || sched_idle_cpu(cpu))
>> -   break;
>> +
>> +   if (available_idle_cpu(cpu) || sched_idle_cpu(cpu)) {
>> +#ifdef CONFIG_SCHED_CORE
>> +   /*
>> +* If Core Scheduling is enabled, select this cpu
>> +* only if the process cookie matches core cookie.
>> +*/
>> +   if (sched_core_enabled(cpu_rq(cpu)) &&
>> +   p->core_cookie == cpu_rq(cpu)->core->core_cookie)
>> +#endif
>> +   break;
>> +   }
> 
> This makes code unreadable.
> Put this coresched specific stuff in an inline function; You can have
> a look at what is done with asym_fits_capacity()
> 
This is done in a refined version. Sorry, the version I pasted in this thread
is not the latest one.

>> }
>>
>> time = cpu_clock(this) - time;
>> @@ -7530,8 +7555,9 @@ int can_migrate_task(struct task_struct *p, struct 
>> lb_env *env)
>>  * We do not migrate tasks that are:

Re: [PATCH -tip 14/32] sched: migration changes for core scheduling

2020-11-30 Thread Li, Aubrey
On 2020/11/30 17:33, Balbir Singh wrote:
> On Thu, Nov 26, 2020 at 05:26:31PM +0800, Li, Aubrey wrote:
>> On 2020/11/26 16:32, Balbir Singh wrote:
>>> On Thu, Nov 26, 2020 at 11:20:41AM +0800, Li, Aubrey wrote:
>>>> On 2020/11/26 6:57, Balbir Singh wrote:
>>>>> On Wed, Nov 25, 2020 at 11:12:53AM +0800, Li, Aubrey wrote:
>>>>>> On 2020/11/24 23:42, Peter Zijlstra wrote:
>>>>>>> On Mon, Nov 23, 2020 at 12:36:10PM +0800, Li, Aubrey wrote:
>>>>>>>>>> +#ifdef CONFIG_SCHED_CORE
>>>>>>>>>> +/*
>>>>>>>>>> + * Skip this cpu if source task's cookie does not match
>>>>>>>>>> + * with CPU's core cookie.
>>>>>>>>>> + */
>>>>>>>>>> +if (!sched_core_cookie_match(cpu_rq(cpu), env->p))
>>>>>>>>>> +continue;
>>>>>>>>>> +#endif
>>>>>>>>>> +
>>>>>>>>>
>>>>>>>>> Any reason this is under an #ifdef? In sched_core_cookie_match() won't
>>>>>>>>> the check for sched_core_enabled() do the right thing even when
>>>>>>>>> CONFIG_SCHED_CORE is not enabed?> 
>>>>>>>> Yes, sched_core_enabled works properly when CONFIG_SCHED_CORE is not
>>>>>>>> enabled. But when CONFIG_SCHED_CORE is not enabled, it does not make
>>>>>>>> sense to leave a core scheduler specific function here even at compile
>>>>>>>> time. Also, for the cases in hot path, this saves CPU cycles to avoid
>>>>>>>> a judgment.
>>>>>>>
>>>>>>> No, that's nonsense. If it works, remove the #ifdef. Less (#ifdef) is
>>>>>>> more.
>>>>>>>
>>>>>>
>>>>>> Okay, I pasted the refined patch here.
>>>>>> @Joel, please let me know if you want me to send it in a separated 
>>>>>> thread.
>>>>>>
>>>>>
>>>>> You still have a bunch of #ifdefs, can't we just do
>>>>>
>>>>> #ifndef CONFIG_SCHED_CORE
>>>>> static inline bool sched_core_enabled(struct rq *rq)
>>>>> {
>>>>> return false;
>>>>> }
>>>>> #endif
>>>>>
>>>>> and frankly I think even that is not needed because there is a jump
>>>>> label __sched_core_enabled that tells us if sched_core is enabled or
>>>>> not.
>>>>
>>>> Hmm..., I need another wrapper for CONFIG_SCHED_CORE specific variables.
>>>> How about this one?
>>>>
>>>
>>> Much better :)
>>>  
>>>> Thanks,
>>>> -Aubrey
>>>>
>>>> From 61dac9067e66b5b9ea26c684c8c8235714bab38a Mon Sep 17 00:00:00 2001
>>>> From: Aubrey Li 
>>>> Date: Thu, 26 Nov 2020 03:08:04 +
>>>> Subject: [PATCH] sched: migration changes for core scheduling
>>>>
>>>>  - Don't migrate if there is a cookie mismatch
>>>>  Load balance tries to move task from busiest CPU to the
>>>>  destination CPU. When core scheduling is enabled, if the
>>>>  task's cookie does not match with the destination CPU's
>>>>  core cookie, this task will be skipped by this CPU. This
>>>>  mitigates the forced idle time on the destination CPU.
>>>>
>>>>  - Select cookie matched idle CPU
>>>>  In the fast path of task wakeup, select the first cookie matched
>>>>  idle CPU instead of the first idle CPU.
>>>>
>>>>  - Find cookie matched idlest CPU
>>>>  In the slow path of task wakeup, find the idlest CPU whose core
>>>>  cookie matches with task's cookie
>>>>
>>>>  - Don't migrate task if cookie not match
>>>>  For the NUMA load balance, don't migrate task to the CPU whose
>>>>  core cookie does not match with task's cookie
>>>>
>>>> Tested-by: Julien Desfossez 
>>>> Signed-off-by: Aubrey Li 
>>>> Signed-off-by: Tim Chen 
>>>> Signed-off-by: Vineeth Remanan Pillai 
>>>> Signed-off-by: Joel Fernandes (Google) 
>>>> ---
>>>>  kernel/sched/fair.c  | 57 

Re: [RFC PATCH v5] sched/fair: select idle cpu from idle cpumask for task wakeup

2020-11-26 Thread Li, Aubrey
On 2020/11/26 16:14, Vincent Guittot wrote:
> On Wed, 25 Nov 2020 at 14:37, Li, Aubrey  wrote:
>>
>> On 2020/11/25 16:31, Vincent Guittot wrote:
>>> On Wed, 25 Nov 2020 at 03:03, Li, Aubrey  wrote:
>>>>
>>>> On 2020/11/25 1:01, Vincent Guittot wrote:
>>>>> Hi Aubrey,
>>>>>
>>>>> Le mardi 24 nov. 2020 à 15:01:38 (+0800), Li, Aubrey a écrit :
>>>>>> Hi Vincent,
>>>>>>
>>>>>> On 2020/11/23 17:27, Vincent Guittot wrote:
>>>>>>> Hi Aubrey,
>>>>>>>
>>>>>>> On Thu, 19 Nov 2020 at 13:15, Aubrey Li  
>>>>>>> wrote:
>>>>>>>>
>>>>>>>> Add idle cpumask to track idle cpus in sched domain. When a CPU
>>>>>>>> enters idle, if the idle driver indicates to stop tick, this CPU
>>>>>>>> is set in the idle cpumask to be a wakeup target. And if the CPU
>>>>>>>> is not in idle, the CPU is cleared in idle cpumask during scheduler
>>>>>>>> tick to ratelimit idle cpumask update.
>>>>>>>>
>>>>>>>> When a task wakes up to select an idle cpu, scanning idle cpumask
>>>>>>>> has low cost than scanning all the cpus in last level cache domain,
>>>>>>>> especially when the system is heavily loaded.
>>>>>>>>
>>>>>>>> Benchmarks were tested on a x86 4 socket system with 24 cores per
>>>>>>>> socket and 2 hyperthreads per core, total 192 CPUs. Hackbench and
>>>>>>>> schbench have no notable change, uperf has:
>>>>>>>>
>>>>>>>> uperf throughput: netperf workload, tcp_nodelay, r/w size = 90
>>>>>>>>
>>>>>>>>   threads   baseline-avg%stdpatch-avg   %std
>>>>>>>>   961   0.831.233.27
>>>>>>>>   144   1   1.031.672.67
>>>>>>>>   192   1   0.691.813.59
>>>>>>>>   240   1   2.841.512.67
>>>>>>>>
>>>>>>>> v4->v5:
>>>>>>>> - add update_idle_cpumask for s2idle case
>>>>>>>> - keep the same ordering of tick_nohz_idle_stop_tick() and update_
>>>>>>>>   idle_cpumask() everywhere
>>>>>>>>
>>>>>>>> v3->v4:
>>>>>>>> - change setting idle cpumask from every idle entry to tickless idle
>>>>>>>>   if cpu driver is available.
>>>>>>>
>>>>>>> Could you remind me why you did this change ? Clearing the cpumask is
>>>>>>> done during the tick to rate limit the number of updates of the
>>>>>>> cpumask but It's not clear for me why you have associated the set with
>>>>>>> the tick stop condition too.
>>>>>>
>>>>>> I found the current implementation has better performance at a more
>>>>>> suitable load range.
>>>>>>
>>>>>> The two kinds of implementions(v4 and v5) have the same rate(scheduler
>>>>>> tick) to shrink idle cpumask when the system is busy, but
>>>>>
>>>>> I'm ok with the part above
>>>>>
>>>>>>
>>>>>> - Setting the idle mask everytime the cpu enters idle requires a much
>>>>>> heavier load level to preserve the idle cpumask(not call into idle),
>>>>>> otherwise the bits cleared in scheduler tick will be restored when the
>>>>>> cpu enters idle. That is, idle cpumask is almost equal to the domain
>>>>>> cpumask during task wakeup if the system load is not heavy enough.
>>>>>
>>>>> But setting the idle cpumask is useful because it helps to select an idle
>>>>> cpu at wake up instead of waiting ifor ILB to fill the empty CPU. IMO,
>>>>> the idle cpu mask is useful in heavy cases because a system, which is
>>>>> already fully busy with work, doesn't want to waste time looking for an
>>>>> idle cpu that doesn't exist.
>>>>
>>>> Yes, this is what v3 does.
>>>>
>>>>> But if there is an idle cpu, we should still looks for it.
>>>>

Re: [PATCH -tip 14/32] sched: migration changes for core scheduling

2020-11-26 Thread Li, Aubrey
On 2020/11/26 16:32, Balbir Singh wrote:
> On Thu, Nov 26, 2020 at 11:20:41AM +0800, Li, Aubrey wrote:
>> On 2020/11/26 6:57, Balbir Singh wrote:
>>> On Wed, Nov 25, 2020 at 11:12:53AM +0800, Li, Aubrey wrote:
>>>> On 2020/11/24 23:42, Peter Zijlstra wrote:
>>>>> On Mon, Nov 23, 2020 at 12:36:10PM +0800, Li, Aubrey wrote:
>>>>>>>> +#ifdef CONFIG_SCHED_CORE
>>>>>>>> +  /*
>>>>>>>> +   * Skip this cpu if source task's cookie does not match
>>>>>>>> +   * with CPU's core cookie.
>>>>>>>> +   */
>>>>>>>> +  if (!sched_core_cookie_match(cpu_rq(cpu), env->p))
>>>>>>>> +  continue;
>>>>>>>> +#endif
>>>>>>>> +
>>>>>>>
>>>>>>> Any reason this is under an #ifdef? In sched_core_cookie_match() won't
>>>>>>> the check for sched_core_enabled() do the right thing even when
>>>>>>> CONFIG_SCHED_CORE is not enabed?> 
>>>>>> Yes, sched_core_enabled works properly when CONFIG_SCHED_CORE is not
>>>>>> enabled. But when CONFIG_SCHED_CORE is not enabled, it does not make
>>>>>> sense to leave a core scheduler specific function here even at compile
>>>>>> time. Also, for the cases in hot path, this saves CPU cycles to avoid
>>>>>> a judgment.
>>>>>
>>>>> No, that's nonsense. If it works, remove the #ifdef. Less (#ifdef) is
>>>>> more.
>>>>>
>>>>
>>>> Okay, I pasted the refined patch here.
>>>> @Joel, please let me know if you want me to send it in a separated thread.
>>>>
>>>
>>> You still have a bunch of #ifdefs, can't we just do
>>>
>>> #ifndef CONFIG_SCHED_CORE
>>> static inline bool sched_core_enabled(struct rq *rq)
>>> {
>>> return false;
>>> }
>>> #endif
>>>
>>> and frankly I think even that is not needed because there is a jump
>>> label __sched_core_enabled that tells us if sched_core is enabled or
>>> not.
>>
>> Hmm..., I need another wrapper for CONFIG_SCHED_CORE specific variables.
>> How about this one?
>>
> 
> Much better :)
>  
>> Thanks,
>> -Aubrey
>>
>> From 61dac9067e66b5b9ea26c684c8c8235714bab38a Mon Sep 17 00:00:00 2001
>> From: Aubrey Li 
>> Date: Thu, 26 Nov 2020 03:08:04 +
>> Subject: [PATCH] sched: migration changes for core scheduling
>>
>>  - Don't migrate if there is a cookie mismatch
>>  Load balance tries to move task from busiest CPU to the
>>  destination CPU. When core scheduling is enabled, if the
>>  task's cookie does not match with the destination CPU's
>>  core cookie, this task will be skipped by this CPU. This
>>  mitigates the forced idle time on the destination CPU.
>>
>>  - Select cookie matched idle CPU
>>  In the fast path of task wakeup, select the first cookie matched
>>  idle CPU instead of the first idle CPU.
>>
>>  - Find cookie matched idlest CPU
>>  In the slow path of task wakeup, find the idlest CPU whose core
>>  cookie matches with task's cookie
>>
>>  - Don't migrate task if cookie not match
>>  For the NUMA load balance, don't migrate task to the CPU whose
>>  core cookie does not match with task's cookie
>>
>> Tested-by: Julien Desfossez 
>> Signed-off-by: Aubrey Li 
>> Signed-off-by: Tim Chen 
>> Signed-off-by: Vineeth Remanan Pillai 
>> Signed-off-by: Joel Fernandes (Google) 
>> ---
>>  kernel/sched/fair.c  | 57 
>>  kernel/sched/sched.h | 43 +
>>  2 files changed, 95 insertions(+), 5 deletions(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index de82f88ba98c..70dd013dff1d 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -1921,6 +1921,13 @@ static void task_numa_find_cpu(struct task_numa_env 
>> *env,
>>  if (!cpumask_test_cpu(cpu, env->p->cpus_ptr))
>>  continue;
>>  
>> +/*
>> + * Skip this cpu if source task's cookie does not match
>> + * with CPU's core cookie.
>> + */
>> +if (!sched_core_cookie_match(cpu_rq(cpu), env->p))
>

Re: [PATCH -tip 14/32] sched: migration changes for core scheduling

2020-11-25 Thread Li, Aubrey
On 2020/11/26 6:57, Balbir Singh wrote:
> On Wed, Nov 25, 2020 at 11:12:53AM +0800, Li, Aubrey wrote:
>> On 2020/11/24 23:42, Peter Zijlstra wrote:
>>> On Mon, Nov 23, 2020 at 12:36:10PM +0800, Li, Aubrey wrote:
>>>>>> +#ifdef CONFIG_SCHED_CORE
>>>>>> +/*
>>>>>> + * Skip this cpu if source task's cookie does not match
>>>>>> + * with CPU's core cookie.
>>>>>> + */
>>>>>> +if (!sched_core_cookie_match(cpu_rq(cpu), env->p))
>>>>>> +continue;
>>>>>> +#endif
>>>>>> +
>>>>>
>>>>> Any reason this is under an #ifdef? In sched_core_cookie_match() won't
>>>>> the check for sched_core_enabled() do the right thing even when
>>>>> CONFIG_SCHED_CORE is not enabed?> 
>>>> Yes, sched_core_enabled works properly when CONFIG_SCHED_CORE is not
>>>> enabled. But when CONFIG_SCHED_CORE is not enabled, it does not make
>>>> sense to leave a core scheduler specific function here even at compile
>>>> time. Also, for the cases in hot path, this saves CPU cycles to avoid
>>>> a judgment.
>>>
>>> No, that's nonsense. If it works, remove the #ifdef. Less (#ifdef) is
>>> more.
>>>
>>
>> Okay, I pasted the refined patch here.
>> @Joel, please let me know if you want me to send it in a separated thread.
>>
> 
> You still have a bunch of #ifdefs, can't we just do
> 
> #ifndef CONFIG_SCHED_CORE
> static inline bool sched_core_enabled(struct rq *rq)
> {
> return false;
> }
> #endif
> 
> and frankly I think even that is not needed because there is a jump
> label __sched_core_enabled that tells us if sched_core is enabled or
> not.

Hmm..., I need another wrapper for CONFIG_SCHED_CORE specific variables.
How about this one?

Thanks,
-Aubrey

From 61dac9067e66b5b9ea26c684c8c8235714bab38a Mon Sep 17 00:00:00 2001
From: Aubrey Li 
Date: Thu, 26 Nov 2020 03:08:04 +
Subject: [PATCH] sched: migration changes for core scheduling

 - Don't migrate if there is a cookie mismatch
 Load balance tries to move task from busiest CPU to the
 destination CPU. When core scheduling is enabled, if the
 task's cookie does not match with the destination CPU's
 core cookie, this task will be skipped by this CPU. This
 mitigates the forced idle time on the destination CPU.

 - Select cookie matched idle CPU
 In the fast path of task wakeup, select the first cookie matched
 idle CPU instead of the first idle CPU.

 - Find cookie matched idlest CPU
 In the slow path of task wakeup, find the idlest CPU whose core
 cookie matches with task's cookie

 - Don't migrate task if cookie does not match
 For the NUMA load balance, don't migrate task to the CPU whose
 core cookie does not match with task's cookie

Tested-by: Julien Desfossez 
Signed-off-by: Aubrey Li 
Signed-off-by: Tim Chen 
Signed-off-by: Vineeth Remanan Pillai 
Signed-off-by: Joel Fernandes (Google) 
---
 kernel/sched/fair.c  | 57 
 kernel/sched/sched.h | 43 +
 2 files changed, 95 insertions(+), 5 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index de82f88ba98c..70dd013dff1d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1921,6 +1921,13 @@ static void task_numa_find_cpu(struct task_numa_env *env,
if (!cpumask_test_cpu(cpu, env->p->cpus_ptr))
continue;
 
+   /*
+* Skip this cpu if source task's cookie does not match
+* with CPU's core cookie.
+*/
+   if (!sched_core_cookie_match(cpu_rq(cpu), env->p))
+   continue;
+
env->dst_cpu = cpu;
if (task_numa_compare(env, taskimp, groupimp, maymove))
break;
@@ -5867,11 +5874,15 @@ find_idlest_group_cpu(struct sched_group *group, struct 
task_struct *p, int this
 
/* Traverse only the allowed CPUs */
for_each_cpu_and(i, sched_group_span(group), p->cpus_ptr) {
+   struct rq *rq = cpu_rq(i);
+
+   if (!sched_core_cookie_match(rq, p))
+   continue;
+
if (sched_idle_cpu(i))
return i;
 
if (available_idle_cpu(i)) {
-   struct rq *rq = cpu_rq(i);
struct cpuidle_state *idle = idle_get_state(rq);
if (idle && idle->exit_latency < min_exit_latency) {
/*
@@ -6129,8 +6140,19 @@ static i

Re: [RFC PATCH v5] sched/fair: select idle cpu from idle cpumask for task wakeup

2020-11-25 Thread Li, Aubrey
On 2020/11/25 16:31, Vincent Guittot wrote:
> On Wed, 25 Nov 2020 at 03:03, Li, Aubrey  wrote:
>>
>> On 2020/11/25 1:01, Vincent Guittot wrote:
>>> Hi Aubrey,
>>>
>>> On Tuesday, 24 Nov 2020 at 15:01:38 (+0800), Li, Aubrey wrote:
>>>> Hi Vincent,
>>>>
>>>> On 2020/11/23 17:27, Vincent Guittot wrote:
>>>>> Hi Aubrey,
>>>>>
>>>>> On Thu, 19 Nov 2020 at 13:15, Aubrey Li  wrote:
>>>>>>
>>>>>> Add idle cpumask to track idle cpus in sched domain. When a CPU
>>>>>> enters idle, if the idle driver indicates to stop tick, this CPU
>>>>>> is set in the idle cpumask to be a wakeup target. And if the CPU
>>>>>> is not in idle, the CPU is cleared in idle cpumask during scheduler
>>>>>> tick to ratelimit idle cpumask update.
>>>>>>
>>>>>> When a task wakes up to select an idle cpu, scanning idle cpumask
>>>>>> has low cost than scanning all the cpus in last level cache domain,
>>>>>> especially when the system is heavily loaded.
>>>>>>
>>>>>> Benchmarks were tested on a x86 4 socket system with 24 cores per
>>>>>> socket and 2 hyperthreads per core, total 192 CPUs. Hackbench and
>>>>>> schbench have no notable change, uperf has:
>>>>>>
>>>>>> uperf throughput: netperf workload, tcp_nodelay, r/w size = 90
>>>>>>
>>>>>>   threads   baseline-avg   %std   patch-avg   %std
>>>>>>   96        1              0.83   1.23        3.27
>>>>>>   144       1              1.03   1.67        2.67
>>>>>>   192       1              0.69   1.81        3.59
>>>>>>   240       1              2.84   1.51        2.67
>>>>>>
>>>>>> v4->v5:
>>>>>> - add update_idle_cpumask for s2idle case
>>>>>> - keep the same ordering of tick_nohz_idle_stop_tick() and update_
>>>>>>   idle_cpumask() everywhere
>>>>>>
>>>>>> v3->v4:
>>>>>> - change setting idle cpumask from every idle entry to tickless idle
>>>>>>   if cpu driver is available.
>>>>>
>>>>> Could you remind me why you did this change ? Clearing the cpumask is
>>>>> done during the tick to rate limit the number of updates of the
>>>>> cpumask but It's not clear for me why you have associated the set with
>>>>> the tick stop condition too.
>>>>
>>>> I found the current implementation has better performance at a more
>>>> suitable load range.
>>>>
>>>> The two kinds of implementions(v4 and v5) have the same rate(scheduler
>>>> tick) to shrink idle cpumask when the system is busy, but
>>>
>>> I'm ok with the part above
>>>
>>>>
>>>> - Setting the idle mask everytime the cpu enters idle requires a much
>>>> heavier load level to preserve the idle cpumask(not call into idle),
>>>> otherwise the bits cleared in scheduler tick will be restored when the
>>>> cpu enters idle. That is, idle cpumask is almost equal to the domain
>>>> cpumask during task wakeup if the system load is not heavy enough.
>>>
>>> But setting the idle cpumask is useful because it helps to select an idle
>>> cpu at wake up instead of waiting ifor ILB to fill the empty CPU. IMO,
>>> the idle cpu mask is useful in heavy cases because a system, which is
>>> already fully busy with work, doesn't want to waste time looking for an
>>> idle cpu that doesn't exist.
>>
>> Yes, this is what v3 does.
>>
>>> But if there is an idle cpu, we should still looks for it.
>>
>> IMHO, this is a potential opportunity can be improved. The idle cpu could be
>> in different idle state, the idle duration could be long or could be very 
>> short.
>> For example, if there are two idle cpus:
>>
>> - CPU1 is very busy, the pattern is 50us idle and 950us work.
>> - CPU2 is in idle for a tick length and wake up to do the regular work
>>
>> If both added to the idle cpumask, we want the latter one, or we can just add
>> the later one into the idle cpumask. That's why I want to associate tick stop
>> signal with it.
>>
>>>
>>>>
>>>>
>>>> - Associating with tick stop tolerates idle

Re: [PATCH -tip 14/32] sched: migration changes for core scheduling

2020-11-24 Thread Li, Aubrey
On 2020/11/24 23:42, Peter Zijlstra wrote:
> On Mon, Nov 23, 2020 at 12:36:10PM +0800, Li, Aubrey wrote:
>>>> +#ifdef CONFIG_SCHED_CORE
>>>> +  /*
>>>> +   * Skip this cpu if source task's cookie does not match
>>>> +   * with CPU's core cookie.
>>>> +   */
>>>> +  if (!sched_core_cookie_match(cpu_rq(cpu), env->p))
>>>> +  continue;
>>>> +#endif
>>>> +
>>>
>>> Any reason this is under an #ifdef? In sched_core_cookie_match() won't
>>> the check for sched_core_enabled() do the right thing even when
>>> CONFIG_SCHED_CORE is not enabed?> 
>> Yes, sched_core_enabled works properly when CONFIG_SCHED_CORE is not
>> enabled. But when CONFIG_SCHED_CORE is not enabled, it does not make
>> sense to leave a core scheduler specific function here even at compile
>> time. Also, for the cases in hot path, this saves CPU cycles to avoid
>> a judgment.
> 
> No, that's nonsense. If it works, remove the #ifdef. Less (#ifdef) is
> more.
> 

Okay, I pasted the refined patch here.
@Joel, please let me know if you want me to send it in a separate thread.

Thanks,
-Aubrey
==
From 18e4f4592c2a159fcbae637f3a422e37ad24cb5a Mon Sep 17 00:00:00 2001
From: Aubrey Li 
Date: Wed, 25 Nov 2020 02:43:46 +
Subject: [PATCH 14/33] sched: migration changes for core scheduling

 - Don't migrate if there is a cookie mismatch
 Load balance tries to move task from busiest CPU to the
 destination CPU. When core scheduling is enabled, if the
 task's cookie does not match with the destination CPU's
 core cookie, this task will be skipped by this CPU. This
 mitigates the forced idle time on the destination CPU.

 - Select cookie matched idle CPU
 In the fast path of task wakeup, select the first cookie matched
 idle CPU instead of the first idle CPU.

 - Find cookie matched idlest CPU
 In the slow path of task wakeup, find the idlest CPU whose core
 cookie matches with task's cookie

 - Don't migrate task if cookie does not match
 For the NUMA load balance, don't migrate task to the CPU whose
 core cookie does not match with task's cookie

Tested-by: Julien Desfossez 
Signed-off-by: Aubrey Li 
Signed-off-by: Tim Chen 
Signed-off-by: Vineeth Remanan Pillai 
Signed-off-by: Joel Fernandes (Google) 
---
 kernel/sched/fair.c  | 58 
 kernel/sched/sched.h | 33 +
 2 files changed, 86 insertions(+), 5 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index de82f88ba98c..7eea5da6685a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1921,6 +1921,13 @@ static void task_numa_find_cpu(struct task_numa_env *env,
if (!cpumask_test_cpu(cpu, env->p->cpus_ptr))
continue;
 
+   /*
+* Skip this cpu if source task's cookie does not match
+* with CPU's core cookie.
+*/
+   if (!sched_core_cookie_match(cpu_rq(cpu), env->p))
+   continue;
+
env->dst_cpu = cpu;
if (task_numa_compare(env, taskimp, groupimp, maymove))
break;
@@ -5867,11 +5874,15 @@ find_idlest_group_cpu(struct sched_group *group, struct 
task_struct *p, int this
 
/* Traverse only the allowed CPUs */
for_each_cpu_and(i, sched_group_span(group), p->cpus_ptr) {
+   struct rq *rq = cpu_rq(i);
+
+   if (!sched_core_cookie_match(rq, p))
+   continue;
+
if (sched_idle_cpu(i))
return i;
 
if (available_idle_cpu(i)) {
-   struct rq *rq = cpu_rq(i);
struct cpuidle_state *idle = idle_get_state(rq);
if (idle && idle->exit_latency < min_exit_latency) {
/*
@@ -6129,8 +6140,18 @@ static int select_idle_cpu(struct task_struct *p, struct 
sched_domain *sd, int t
for_each_cpu_wrap(cpu, cpus, target) {
if (!--nr)
return -1;
-   if (available_idle_cpu(cpu) || sched_idle_cpu(cpu))
-   break;
+
+   if (available_idle_cpu(cpu) || sched_idle_cpu(cpu)) {
+#ifdef CONFIG_SCHED_CORE
+   /*
+* If Core Scheduling is enabled, select this cpu
+* only if the process cookie matches core cookie.
+*/
+   if (sched_core_enabled(cpu_rq(cpu)) &&
+   p->core_cookie == cpu_rq(cpu)->core->core_cookie)
+#endif
+   

Re: [RFC PATCH v5] sched/fair: select idle cpu from idle cpumask for task wakeup

2020-11-24 Thread Li, Aubrey
On 2020/11/25 1:01, Vincent Guittot wrote:
> Hi Aubrey,
> 
> On Tuesday, 24 Nov 2020 at 15:01:38 (+0800), Li, Aubrey wrote:
>> Hi Vincent,
>>
>> On 2020/11/23 17:27, Vincent Guittot wrote:
>>> Hi Aubrey,
>>>
>>> On Thu, 19 Nov 2020 at 13:15, Aubrey Li  wrote:
>>>>
>>>> Add idle cpumask to track idle cpus in sched domain. When a CPU
>>>> enters idle, if the idle driver indicates to stop tick, this CPU
>>>> is set in the idle cpumask to be a wakeup target. And if the CPU
>>>> is not in idle, the CPU is cleared in idle cpumask during scheduler
>>>> tick to ratelimit idle cpumask update.
>>>>
>>>> When a task wakes up to select an idle cpu, scanning idle cpumask
>>>> has low cost than scanning all the cpus in last level cache domain,
>>>> especially when the system is heavily loaded.
>>>>
>>>> Benchmarks were tested on a x86 4 socket system with 24 cores per
>>>> socket and 2 hyperthreads per core, total 192 CPUs. Hackbench and
>>>> schbench have no notable change, uperf has:
>>>>
>>>> uperf throughput: netperf workload, tcp_nodelay, r/w size = 90
>>>>
>>>>   threads   baseline-avg   %std   patch-avg   %std
>>>>   96        1              0.83   1.23        3.27
>>>>   144       1              1.03   1.67        2.67
>>>>   192       1              0.69   1.81        3.59
>>>>   240       1              2.84   1.51        2.67
>>>>
>>>> v4->v5:
>>>> - add update_idle_cpumask for s2idle case
>>>> - keep the same ordering of tick_nohz_idle_stop_tick() and update_
>>>>   idle_cpumask() everywhere
>>>>
>>>> v3->v4:
>>>> - change setting idle cpumask from every idle entry to tickless idle
>>>>   if cpu driver is available.
>>>
>>> Could you remind me why you did this change ? Clearing the cpumask is
>>> done during the tick to rate limit the number of updates of the
>>> cpumask but It's not clear for me why you have associated the set with
>>> the tick stop condition too.
>>
>> I found the current implementation has better performance at a more 
>> suitable load range.
>>
>> The two kinds of implementions(v4 and v5) have the same rate(scheduler
>> tick) to shrink idle cpumask when the system is busy, but
> 
> I'm ok with the part above
> 
>>
>> - Setting the idle mask everytime the cpu enters idle requires a much
>> heavier load level to preserve the idle cpumask(not call into idle),
>> otherwise the bits cleared in scheduler tick will be restored when the
>> cpu enters idle. That is, idle cpumask is almost equal to the domain
>> cpumask during task wakeup if the system load is not heavy enough.
> 
> But setting the idle cpumask is useful because it helps to select an idle
> cpu at wake up instead of waiting ifor ILB to fill the empty CPU. IMO,
> the idle cpu mask is useful in heavy cases because a system, which is
> already fully busy with work, doesn't want to waste time looking for an
> idle cpu that doesn't exist. 

Yes, this is what v3 does.

> But if there is an idle cpu, we should still looks for it.

IMHO, this is a potential opportunity that can be improved. The idle cpus could
be in different idle states, and the idle duration could be long or very short.
For example, if there are two idle cpus:

- CPU1 is very busy, the pattern is 50us idle and 950us work.
- CPU2 is idle for a tick length and wakes up to do the regular work

If both are added to the idle cpumask, we want the latter one, or we can just add
the latter one into the idle cpumask. That's why I want to associate the tick stop
signal with it.
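
A minimal sketch of what I mean (simplified; apart from cpuidle_select(),
tick_nohz_idle_stop_tick() and the patch's update_idle_cpumask(), the names
below are illustrative rather than the actual idle-path code):

static void idle_entry_update(struct rq *rq, struct cpuidle_driver *drv,
			      struct cpuidle_device *dev)
{
	bool stop_tick = true;
	int next_state;

	/* The cpuidle governor also tells us whether the tick should stop. */
	next_state = cpuidle_select(drv, dev, &stop_tick);

	if (stop_tick) {
		/* CPU2-like case: expected to stay idle, advertise it. */
		update_idle_cpumask(rq, true);
		tick_nohz_idle_stop_tick();
	}
	/*
	 * CPU1-like case (tick kept): likely back to work within tens of
	 * microseconds, so keep whatever the last scheduler tick decided.
	 */

	call_cpuidle(drv, dev, next_state);
}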

> 
>>
>>
>> - Associating with tick stop tolerates idle to preserve the idle cpumask
>> but only short idle, which causes tick retains. This is more fitable for
>> the real workload.
> 
> I don't agree with this and real use cases with interaction will probably
> not agree as well as they want to run on an idle cpu if any but not wait
> on an already busy one.

The problem is scan overhead: scanning for an idle cpu takes time. If an idle cpu
is in the short-idle mode, it is very likely that by the time it is picked for a
wakeup task, it has gone back to work again, so the wakeup task has to wait too,
maybe even longer because the running task has just started.

One benefit of waiting on the previous CPU is warm cache.

> Also keep in mind that a tick can be up to 10ms long

Right, but the point here is that if the tick is retained for those 10ms, the
CPU should be in the short-idle mode.

Re: [RFC PATCH v5] sched/fair: select idle cpu from idle cpumask for task wakeup

2020-11-23 Thread Li, Aubrey
Hi Vincent,

On 2020/11/23 17:27, Vincent Guittot wrote:
> Hi Aubrey,
> 
> On Thu, 19 Nov 2020 at 13:15, Aubrey Li  wrote:
>>
>> Add idle cpumask to track idle cpus in sched domain. When a CPU
>> enters idle, if the idle driver indicates to stop tick, this CPU
>> is set in the idle cpumask to be a wakeup target. And if the CPU
>> is not in idle, the CPU is cleared in idle cpumask during scheduler
>> tick to ratelimit idle cpumask update.
>>
>> When a task wakes up to select an idle cpu, scanning idle cpumask
>> has low cost than scanning all the cpus in last level cache domain,
>> especially when the system is heavily loaded.
>>
>> Benchmarks were tested on a x86 4 socket system with 24 cores per
>> socket and 2 hyperthreads per core, total 192 CPUs. Hackbench and
>> schbench have no notable change, uperf has:
>>
>> uperf throughput: netperf workload, tcp_nodelay, r/w size = 90
>>
>>   threads   baseline-avg   %std   patch-avg   %std
>>   96        1              0.83   1.23        3.27
>>   144       1              1.03   1.67        2.67
>>   192       1              0.69   1.81        3.59
>>   240       1              2.84   1.51        2.67
>>
>> v4->v5:
>> - add update_idle_cpumask for s2idle case
>> - keep the same ordering of tick_nohz_idle_stop_tick() and update_
>>   idle_cpumask() everywhere
>>
>> v3->v4:
>> - change setting idle cpumask from every idle entry to tickless idle
>>   if cpu driver is available.
> 
> Could you remind me why you did this change ? Clearing the cpumask is
> done during the tick to rate limit the number of updates of the
> cpumask but It's not clear for me why you have associated the set with
> the tick stop condition too.

I found the current implementation has better performance over a more
suitable load range.

The two kinds of implementations (v4 and v5) shrink the idle cpumask at the
same rate (scheduler tick) when the system is busy, but

- Setting the idle mask every time the cpu enters idle requires a much
heavier load level to preserve the idle cpumask (that is, to not call into
idle at all), otherwise the bits cleared in the scheduler tick will be
restored when the cpu enters idle. That is, the idle cpumask is almost equal
to the domain cpumask during task wakeup if the system load is not heavy enough.

- Associating the set with tick stop still tolerates idle while preserving the
cleared bits, but only for short idle periods, i.e. the ones that retain the
tick. This fits real workloads better.

> 
> This change means that a cpu will not be part of the idle mask if the
> tick is not stopped. On some arm/arm64 platforms, the tick stops only
> if the idle duration is expected to be higher than 1-2ms which starts
> to be significantly long. Also, the cpuidle governor can easily
> mis-predict a short idle duration whereas it will be finally a long
> idle duration; In this case, the next tick will correct the situation
> and select a deeper state, but this can happen up to 4ms later on
> arm/arm64.

Yes, this is intended. If the tick is not stopped, that indicates the
CPU is very busy: the cpuidle governor selected the polling idle state, and/or
the expected idle duration is shorter than the tick period length. For
example, uperf enters and exits idle 80 times between two ticks when it
utilizes 100% of a CPU, and the average idle residency is < 50us.

If this CPU is added to the idle cpumask, the wakeup task likely needs to
wait in the runqueue, as this CPU will run its current task again very soon.

> 
> So I would prefer to keep trying to set the idle mask everytime the
> cpu enters idle. If a tick has not happened between 2 idle phases, the
> cpumask will not be updated and the overhead will be mostly testing if
> (rq->last_idle_state == idle_state).

Not sure if I addressed your concern; did you see any workloads or cases where
v4 performs better than v5?

Thanks,
-Aubrey

> 
> 
>> - move clearing idle cpumask to scheduler_tick to decouple nohz mode.
>>
>> v2->v3:
>> - change setting idle cpumask to every idle entry, otherwise schbench
>>   has a regression of 99th percentile latency.
>> - change clearing idle cpumask to nohz_balancer_kick(), so updating
>>   idle cpumask is ratelimited in the idle exiting path.
>> - set SCHED_IDLE cpu in idle cpumask to allow it as a wakeup target.
>>
>> v1->v2:
>> - idle cpumask is updated in the nohz routines, by initializing idle
>>   cpumask with sched_domain_span(sd), nohz=off case remains the original
>>   behavior.
>>


Re: [PATCH -tip 13/32] sched: Trivial forced-newidle balancer

2020-11-23 Thread Li, Aubrey
On 2020/11/24 7:35, Balbir Singh wrote:
> On Mon, Nov 23, 2020 at 11:07:27PM +0800, Li, Aubrey wrote:
>> On 2020/11/23 12:38, Balbir Singh wrote:
>>> On Tue, Nov 17, 2020 at 06:19:43PM -0500, Joel Fernandes (Google) wrote:
>>>> From: Peter Zijlstra 
>>>>
>>>> When a sibling is forced-idle to match the core-cookie; search for
>>>> matching tasks to fill the core.
>>>>
>>>> rcu_read_unlock() can incur an infrequent deadlock in
>>>> sched_core_balance(). Fix this by using the RCU-sched flavor instead.
>>>>
>>> ...
>>>> +
>>>> +  if (p->core_occupation > dst->idle->core_occupation)
>>>> +  goto next;
>>>> +
>>>
>>> I am unable to understand this check, a comment or clarification in the
>>> changelog will help. I presume we are looking at either one or two cpus
>>> to define the core_occupation and we expect to match it against the
>>> destination CPU.
>>
>> IIUC, this check prevents a task from keeping jumping among the cores 
>> forever.
>>
>> For example, on a SMT2 platform:
>> - core0 runs taskA and taskB, core_occupation is 2
>> - core1 runs taskC, core_occupation is 1
>>
>> Without this check, taskB could ping-pong between core0 and core1 by core 
>> load
>> balance.
> 
> But the comparison is p->core_occuption (as in tasks core occuptation,
> not sure what that means, can a task have a core_occupation of > 1?)
>

p->core_occupation is assigned the core's occupation in the last
pick_next_task().
(So yes, it can have a core_occupation > 1.)

Thanks,
-Aubrey


Re: [PATCH -tip 13/32] sched: Trivial forced-newidle balancer

2020-11-23 Thread Li, Aubrey
On 2020/11/23 12:38, Balbir Singh wrote:
> On Tue, Nov 17, 2020 at 06:19:43PM -0500, Joel Fernandes (Google) wrote:
>> From: Peter Zijlstra 
>>
>> When a sibling is forced-idle to match the core-cookie; search for
>> matching tasks to fill the core.
>>
>> rcu_read_unlock() can incur an infrequent deadlock in
>> sched_core_balance(). Fix this by using the RCU-sched flavor instead.
>>
> ...
>> +
>> +if (p->core_occupation > dst->idle->core_occupation)
>> +goto next;
>> +
> 
> I am unable to understand this check, a comment or clarification in the
> changelog will help. I presume we are looking at either one or two cpus
> to define the core_occupation and we expect to match it against the
> destination CPU.

IIUC, this check prevents a task from jumping among the cores forever.

For example, on an SMT2 platform:
- core0 runs taskA and taskB, core_occupation is 2
- core1 runs taskC, core_occupation is 1

Without this check, taskB could ping-pong between core0 and core1 via core load
balancing.
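
To make the example concrete with the quoted check (the task walk and the
migrate helper below are only illustrative; the real code iterates the source
runqueue's cookie-matched tasks and uses goto next):

/*
 * core0: taskA + taskB running  -> taskB's p->core_occupation == 2
 * core1: only taskC running     -> dst->idle->core_occupation == 1
 */
list_for_each_entry(p, &src_rq->cfs_tasks, se.group_node) {
	if (p->core_occupation > dst->idle->core_occupation)
		/*
		 * taskB comes from a fuller core than the one it would fill
		 * here; pulling it only shifts the imbalance, and the next
		 * balance pass could pull it straight back. Skip it.
		 */
		continue;

	/* Tasks from an equally or less occupied core may be pulled. */
	try_migrate_to_dst(p);	/* hypothetical helper */
}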

Thanks,
-Aubrey




Re: [PATCH -tip 14/32] sched: migration changes for core scheduling

2020-11-22 Thread Li, Aubrey
On 2020/11/23 7:54, Balbir Singh wrote:
> On Tue, Nov 17, 2020 at 06:19:44PM -0500, Joel Fernandes (Google) wrote:
>> From: Aubrey Li 
>>
>>  - Don't migrate if there is a cookie mismatch
>>  Load balance tries to move task from busiest CPU to the
>>  destination CPU. When core scheduling is enabled, if the
>>  task's cookie does not match with the destination CPU's
>>  core cookie, this task will be skipped by this CPU. This
>>  mitigates the forced idle time on the destination CPU.
>>
>>  - Select cookie matched idle CPU
>>  In the fast path of task wakeup, select the first cookie matched
>>  idle CPU instead of the first idle CPU.
>>
>>  - Find cookie matched idlest CPU
>>  In the slow path of task wakeup, find the idlest CPU whose core
>>  cookie matches with task's cookie
>>
>>  - Don't migrate task if cookie not match
>>  For the NUMA load balance, don't migrate task to the CPU whose
>>  core cookie does not match with task's cookie
>>
>> Tested-by: Julien Desfossez 
>> Signed-off-by: Aubrey Li 
>> Signed-off-by: Tim Chen 
>> Signed-off-by: Vineeth Remanan Pillai 
>> Signed-off-by: Joel Fernandes (Google) 
>> ---
>>  kernel/sched/fair.c  | 64 
>>  kernel/sched/sched.h | 29 
>>  2 files changed, 88 insertions(+), 5 deletions(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index de82f88ba98c..ceb3906c9a8a 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -1921,6 +1921,15 @@ static void task_numa_find_cpu(struct task_numa_env 
>> *env,
>>  if (!cpumask_test_cpu(cpu, env->p->cpus_ptr))
>>  continue;
>>  
>> +#ifdef CONFIG_SCHED_CORE
>> +/*
>> + * Skip this cpu if source task's cookie does not match
>> + * with CPU's core cookie.
>> + */
>> +if (!sched_core_cookie_match(cpu_rq(cpu), env->p))
>> +continue;
>> +#endif
>> +
> 
> Any reason this is under an #ifdef? In sched_core_cookie_match() won't
> the check for sched_core_enabled() do the right thing even when
> CONFIG_SCHED_CORE is not enabed?> 
Yes, sched_core_enabled works properly when CONFIG_SCHED_CORE is not
enabled. But when CONFIG_SCHED_CORE is not enabled, it does not make
sense to leave a core-scheduler-specific function here even at compile
time. Also, for the cases in the hot path, this saves CPU cycles by
avoiding an extra check.
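
(If the #ifdef at the call sites does have to go away, the helper itself can
carry the fallback; this is a rough sketch of that shape only -- the real
helper in the series also checks whether the whole target core is idle, which
is omitted here:)

#ifdef CONFIG_SCHED_CORE
static inline bool sched_core_cookie_match(struct rq *rq, struct task_struct *p)
{
	/* No cookie constraint while core scheduling is disabled. */
	if (!sched_core_enabled(rq))
		return true;

	return rq->core->core_cookie == p->core_cookie;
}
#else
static inline bool sched_core_cookie_match(struct rq *rq, struct task_struct *p)
{
	return true;
}
#endif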


>>  env->dst_cpu = cpu;
>>  if (task_numa_compare(env, taskimp, groupimp, maymove))
>>  break;
>> @@ -5867,11 +5876,17 @@ find_idlest_group_cpu(struct sched_group *group, 
>> struct task_struct *p, int this
>>  
>>  /* Traverse only the allowed CPUs */
>>  for_each_cpu_and(i, sched_group_span(group), p->cpus_ptr) {
>> +struct rq *rq = cpu_rq(i);
>> +
>> +#ifdef CONFIG_SCHED_CORE
>> +if (!sched_core_cookie_match(rq, p))
>> +continue;
>> +#endif
>> +
>>  if (sched_idle_cpu(i))
>>  return i;
>>  
>>  if (available_idle_cpu(i)) {
>> -struct rq *rq = cpu_rq(i);
>>  struct cpuidle_state *idle = idle_get_state(rq);
>>  if (idle && idle->exit_latency < min_exit_latency) {
>>  /*
>> @@ -6129,8 +6144,18 @@ static int select_idle_cpu(struct task_struct *p, 
>> struct sched_domain *sd, int t
>>  for_each_cpu_wrap(cpu, cpus, target) {
>>  if (!--nr)
>>  return -1;
>> -if (available_idle_cpu(cpu) || sched_idle_cpu(cpu))
>> -break;
>> +
>> +if (available_idle_cpu(cpu) || sched_idle_cpu(cpu)) {
>> +#ifdef CONFIG_SCHED_CORE
>> +/*
>> + * If Core Scheduling is enabled, select this cpu
>> + * only if the process cookie matches core cookie.
>> + */
>> +if (sched_core_enabled(cpu_rq(cpu)) &&
>> +p->core_cookie == cpu_rq(cpu)->core->core_cookie)
>> +#endif
>> +break;
>> +}
>>  }
>>  
>>  time = cpu_clock(this) - time;
>> @@ -7530,8 +7555,9 @@ int can_migrate_task(struct task_struct *p, struct 
>&

[RFC PATCH v5] sched/fair: select idle cpu from idle cpumask for task wakeup

2020-11-19 Thread Aubrey Li
Add idle cpumask to track idle cpus in sched domain. When a CPU
enters idle, if the idle driver indicates to stop tick, this CPU
is set in the idle cpumask to be a wakeup target. And if the CPU
is not in idle, the CPU is cleared in idle cpumask during scheduler
tick to ratelimit idle cpumask update.

When a task wakes up to select an idle cpu, scanning the idle cpumask
has a lower cost than scanning all the cpus in the last level cache domain,
especially when the system is heavily loaded.

Benchmarks were tested on a x86 4 socket system with 24 cores per
socket and 2 hyperthreads per core, total 192 CPUs. Hackbench and
schbench have no notable change, uperf has:

uperf throughput: netperf workload, tcp_nodelay, r/w size = 90

  threads   baseline-avg   %std   patch-avg   %std
  96        1              0.83   1.23        3.27
  144       1              1.03   1.67        2.67
  192       1              0.69   1.81        3.59
  240       1              2.84   1.51        2.67

v4->v5:
- add update_idle_cpumask for s2idle case
- keep the same ordering of tick_nohz_idle_stop_tick() and update_
  idle_cpumask() everywhere

v3->v4:
- change setting idle cpumask from every idle entry to tickless idle
  if cpu driver is available.
- move clearing idle cpumask to scheduler_tick to decouple nohz mode.

v2->v3:
- change setting idle cpumask to every idle entry, otherwise schbench
  has a regression of 99th percentile latency.
- change clearing idle cpumask to nohz_balancer_kick(), so updating
  idle cpumask is ratelimited in the idle exiting path.
- set SCHED_IDLE cpu in idle cpumask to allow it as a wakeup target.

v1->v2:
- idle cpumask is updated in the nohz routines, by initializing idle
  cpumask with sched_domain_span(sd), nohz=off case remains the original
  behavior.

Cc: Mel Gorman 
Cc: Vincent Guittot 
Cc: Qais Yousef 
Cc: Valentin Schneider 
Cc: Jiang Biao 
Cc: Tim Chen 
Signed-off-by: Aubrey Li 
---
 include/linux/sched/topology.h | 13 +
 kernel/sched/core.c|  2 ++
 kernel/sched/fair.c| 52 +-
 kernel/sched/idle.c|  8 --
 kernel/sched/sched.h   |  2 ++
 kernel/sched/topology.c|  3 +-
 6 files changed, 76 insertions(+), 4 deletions(-)

diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index 820511289857..b47b85163607 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -65,8 +65,21 @@ struct sched_domain_shared {
atomic_tref;
atomic_tnr_busy_cpus;
int has_idle_cores;
+   /*
+* Span of all idle CPUs in this domain.
+*
+* NOTE: this field is variable length. (Allocated dynamically
+* by attaching extra space to the end of the structure,
+* depending on how many CPUs the kernel has booted up with)
+*/
+   unsigned long   idle_cpus_span[];
 };
 
+static inline struct cpumask *sds_idle_cpus(struct sched_domain_shared *sds)
+{
+   return to_cpumask(sds->idle_cpus_span);
+}
+
 struct sched_domain {
/* These fields must be setup */
struct sched_domain __rcu *parent;  /* top domain must be null 
terminated */
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index b1e0da56abca..c86ae0495163 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3994,6 +3994,7 @@ void scheduler_tick(void)
rq_lock(rq, &rf);
 
update_rq_clock(rq);
+   update_idle_cpumask(rq, false);
thermal_pressure = arch_scale_thermal_pressure(cpu_of(rq));
update_thermal_load_avg(rq_clock_thermal(rq), rq, thermal_pressure);
curr->sched_class->task_tick(rq, curr, 0);
@@ -7192,6 +7193,7 @@ void __init sched_init(void)
rq_csd_init(rq, &rq->nohz_csd, nohz_csd_func);
 #endif
 #endif /* CONFIG_SMP */
+   rq->last_idle_state = 1;
hrtick_rq_init(rq);
atomic_set(&rq->nr_iowait, 0);
}
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 48a6d442b444..d67fba5e406b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6145,7 +6145,12 @@ static int select_idle_cpu(struct task_struct *p, struct 
sched_domain *sd, int t
 
time = cpu_clock(this);
 
-   cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
+   /*
+* sched_domain_shared is set only at shared cache level,
+* this works only because select_idle_cpu is called with
+* sd_llc.
+*/
+   cpumask_and(cpus, sds_idle_cpus(sd->shared), p->cpus_ptr);
 
for_each_cpu_wrap(cpu, cpus, target) {
if (!--nr)
@@ -6807,6 +6812,51 @@ balance_fair(struct rq *rq, struct task_struct *prev, 
struct rq_flags *rf)
 }
 #endif /* CONFIG_SMP */
 
+/*
+ * Update cpu idle state and record this information
+ * in sd_llc_shared->idle_cpus_span.
+ */
+void update

Re: [RFC PATCH v4] sched/fair: select idle cpu from idle cpumask for task wakeup

2020-11-19 Thread Li, Aubrey
On 2020/11/19 16:19, Vincent Guittot wrote:
> On Thu, 19 Nov 2020 at 02:34, Li, Aubrey  wrote:
>>
>> Hi Vincent,
>>
>> On 2020/11/18 21:36, Vincent Guittot wrote:
>>> On Wed, 18 Nov 2020 at 04:48, Aubrey Li  wrote:
>>>>
>>>> From: Aubrey Li 
>>>>
>>>> Add idle cpumask to track idle cpus in sched domain. When a CPU
>>>> enters idle, if the idle driver indicates to stop tick, this CPU
>>>> is set in the idle cpumask to be a wakeup target. And if the CPU
>>>> is not in idle, the CPU is cleared in idle cpumask during scheduler
>>>> tick to ratelimit idle cpumask update.
>>>>
>>>> When a task wakes up to select an idle cpu, scanning idle cpumask
>>>> has low cost than scanning all the cpus in last level cache domain,
>>>> especially when the system is heavily loaded.
>>>>
>>>> Benchmarks were tested on a x86 4 socket system with 24 cores per
>>>> socket and 2 hyperthreads per core, total 192 CPUs. Hackbench and
>>>> schbench have no notable change, uperf has:
>>>>
>>>> uperf throughput: netperf workload, tcp_nodelay, r/w size = 90
>>>>
>>>>   threads   baseline-avg   %std   patch-avg   %std
>>>>   96        1              0.83   1.23        3.27
>>>>   144       1              1.03   1.67        2.67
>>>>   192       1              0.69   1.81        3.59
>>>>   240       1              2.84   1.51        2.67
>>>>
>>>> Cc: Mel Gorman 
>>>> Cc: Vincent Guittot 
>>>> Cc: Qais Yousef 
>>>> Cc: Valentin Schneider 
>>>> Cc: Jiang Biao 
>>>> Cc: Tim Chen 
>>>> Signed-off-by: Aubrey Li 
>>>> ---
>>>>  include/linux/sched/topology.h | 13 +
>>>>  kernel/sched/core.c|  2 ++
>>>>  kernel/sched/fair.c| 52 +-
>>>>  kernel/sched/idle.c|  7 +++--
>>>>  kernel/sched/sched.h   |  2 ++
>>>>  kernel/sched/topology.c|  3 +-
>>>>  6 files changed, 74 insertions(+), 5 deletions(-)
>>>>
>>>> diff --git a/include/linux/sched/topology.h 
>>>> b/include/linux/sched/topology.h
>>>> index 820511289857..b47b85163607 100644
>>>> --- a/include/linux/sched/topology.h
>>>> +++ b/include/linux/sched/topology.h
>>>> @@ -65,8 +65,21 @@ struct sched_domain_shared {
>>>> atomic_tref;
>>>> atomic_tnr_busy_cpus;
>>>> int has_idle_cores;
>>>> +   /*
>>>> +* Span of all idle CPUs in this domain.
>>>> +*
>>>> +* NOTE: this field is variable length. (Allocated dynamically
>>>> +* by attaching extra space to the end of the structure,
>>>> +* depending on how many CPUs the kernel has booted up with)
>>>> +*/
>>>> +   unsigned long   idle_cpus_span[];
>>>>  };
>>>>
>>>> +static inline struct cpumask *sds_idle_cpus(struct sched_domain_shared 
>>>> *sds)
>>>> +{
>>>> +   return to_cpumask(sds->idle_cpus_span);
>>>> +}
>>>> +
>>>>  struct sched_domain {
>>>> /* These fields must be setup */
>>>> struct sched_domain __rcu *parent;  /* top domain must be null 
>>>> terminated */
>>>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>>>> index b1e0da56abca..c86ae0495163 100644
>>>> --- a/kernel/sched/core.c
>>>> +++ b/kernel/sched/core.c
>>>> @@ -3994,6 +3994,7 @@ void scheduler_tick(void)
>>>> rq_lock(rq, &rf);
>>>>
>>>> update_rq_clock(rq);
>>>> +   update_idle_cpumask(rq, false);
>>>> thermal_pressure = arch_scale_thermal_pressure(cpu_of(rq));
>>>> update_thermal_load_avg(rq_clock_thermal(rq), rq, 
>>>> thermal_pressure);
>>>> curr->sched_class->task_tick(rq, curr, 0);
>>>> @@ -7192,6 +7193,7 @@ void __init sched_init(void)
>>>> rq_csd_init(rq, &rq->nohz_csd, nohz_csd_func);
>>>>  #endif
>>>>  #endif /* CONFIG_SMP */
>>>> +   rq->last_idle_state = 1;
>>>> hrtick_rq_init(rq);
>>>>

Re: [RFC PATCH v4] sched/fair: select idle cpu from idle cpumask for task wakeup

2020-11-18 Thread Li, Aubrey
Hi Vincent,

On 2020/11/18 21:36, Vincent Guittot wrote:
> On Wed, 18 Nov 2020 at 04:48, Aubrey Li  wrote:
>>
>> From: Aubrey Li 
>>
>> Add idle cpumask to track idle cpus in sched domain. When a CPU
>> enters idle, if the idle driver indicates to stop tick, this CPU
>> is set in the idle cpumask to be a wakeup target. And if the CPU
>> is not in idle, the CPU is cleared in idle cpumask during scheduler
>> tick to ratelimit idle cpumask update.
>>
>> When a task wakes up to select an idle cpu, scanning idle cpumask
>> has low cost than scanning all the cpus in last level cache domain,
>> especially when the system is heavily loaded.
>>
>> Benchmarks were tested on a x86 4 socket system with 24 cores per
>> socket and 2 hyperthreads per core, total 192 CPUs. Hackbench and
>> schbench have no notable change, uperf has:
>>
>> uperf throughput: netperf workload, tcp_nodelay, r/w size = 90
>>
>>   threads   baseline-avg   %std   patch-avg   %std
>>   96        1              0.83   1.23        3.27
>>   144       1              1.03   1.67        2.67
>>   192       1              0.69   1.81        3.59
>>   240       1              2.84   1.51        2.67
>>
>> Cc: Mel Gorman 
>> Cc: Vincent Guittot 
>> Cc: Qais Yousef 
>> Cc: Valentin Schneider 
>> Cc: Jiang Biao 
>> Cc: Tim Chen 
>> Signed-off-by: Aubrey Li 
>> ---
>>  include/linux/sched/topology.h | 13 +
>>  kernel/sched/core.c|  2 ++
>>  kernel/sched/fair.c| 52 +-
>>  kernel/sched/idle.c|  7 +++--
>>  kernel/sched/sched.h   |  2 ++
>>  kernel/sched/topology.c|  3 +-
>>  6 files changed, 74 insertions(+), 5 deletions(-)
>>
>> diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
>> index 820511289857..b47b85163607 100644
>> --- a/include/linux/sched/topology.h
>> +++ b/include/linux/sched/topology.h
>> @@ -65,8 +65,21 @@ struct sched_domain_shared {
>> atomic_tref;
>> atomic_tnr_busy_cpus;
>> int has_idle_cores;
>> +   /*
>> +* Span of all idle CPUs in this domain.
>> +*
>> +* NOTE: this field is variable length. (Allocated dynamically
>> +* by attaching extra space to the end of the structure,
>> +* depending on how many CPUs the kernel has booted up with)
>> +*/
>> +   unsigned long   idle_cpus_span[];
>>  };
>>
>> +static inline struct cpumask *sds_idle_cpus(struct sched_domain_shared *sds)
>> +{
>> +   return to_cpumask(sds->idle_cpus_span);
>> +}
>> +
>>  struct sched_domain {
>> /* These fields must be setup */
>> struct sched_domain __rcu *parent;  /* top domain must be null 
>> terminated */
>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>> index b1e0da56abca..c86ae0495163 100644
>> --- a/kernel/sched/core.c
>> +++ b/kernel/sched/core.c
>> @@ -3994,6 +3994,7 @@ void scheduler_tick(void)
>> rq_lock(rq, &rf);
>>
>> update_rq_clock(rq);
>> +   update_idle_cpumask(rq, false);
>> thermal_pressure = arch_scale_thermal_pressure(cpu_of(rq));
>> update_thermal_load_avg(rq_clock_thermal(rq), rq, thermal_pressure);
>> curr->sched_class->task_tick(rq, curr, 0);
>> @@ -7192,6 +7193,7 @@ void __init sched_init(void)
>> rq_csd_init(rq, &rq->nohz_csd, nohz_csd_func);
>>  #endif
>>  #endif /* CONFIG_SMP */
>> +   rq->last_idle_state = 1;
>> hrtick_rq_init(rq);
>> atomic_set(&rq->nr_iowait, 0);
>> }
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 48a6d442b444..d67fba5e406b 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -6145,7 +6145,12 @@ static int select_idle_cpu(struct task_struct *p, 
>> struct sched_domain *sd, int t
>>
>> time = cpu_clock(this);
>>
>> -   cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
>> +   /*
>> +* sched_domain_shared is set only at shared cache level,
>> +* this works only because select_idle_cpu is called with
>> +* sd_llc.
>> +*/
>> +   cpumask_and(cpus, sds_idle_cpus(sd->shared), p->cpus_ptr);
>>
>> for_each_cpu_wrap(cpu, cpus, target) {
>> if (!--n

Re: [RFC PATCH v4] sched/fair: select idle cpu from idle cpumask for task wakeup

2020-11-18 Thread Li, Aubrey
On 2020/11/18 20:06, Valentin Schneider wrote:
> 
> On 16/11/20 20:04, Aubrey Li wrote:
>> From: Aubrey Li 
>>
>> Add idle cpumask to track idle cpus in sched domain. When a CPU
>> enters idle, if the idle driver indicates to stop tick, this CPU
>> is set in the idle cpumask to be a wakeup target. And if the CPU
>> is not in idle, the CPU is cleared in idle cpumask during scheduler
>> tick to ratelimit idle cpumask update.
>>
>> When a task wakes up to select an idle cpu, scanning idle cpumask
>> has low cost than scanning all the cpus in last level cache domain,
>> especially when the system is heavily loaded.
>>
>> Benchmarks were tested on a x86 4 socket system with 24 cores per
>> socket and 2 hyperthreads per core, total 192 CPUs. Hackbench and
>> schbench have no notable change, uperf has:
>>
>> uperf throughput: netperf workload, tcp_nodelay, r/w size = 90
>>
>>   threads   baseline-avg   %std   patch-avg   %std
>>   96        1              0.83   1.23        3.27
>>   144       1              1.03   1.67        2.67
>>   192       1              0.69   1.81        3.59
>>   240       1              2.84   1.51        2.67
>>
>> Cc: Mel Gorman 
>> Cc: Vincent Guittot 
>> Cc: Qais Yousef 
>> Cc: Valentin Schneider 
>> Cc: Jiang Biao 
>> Cc: Tim Chen 
>> Signed-off-by: Aubrey Li 
> 
> That's missing a v3 -> v4 change summary
> 

Okay, I'll add it in the next version soon.

Thanks,
-Aubrey


[RFC PATCH v4] sched/fair: select idle cpu from idle cpumask for task wakeup

2020-11-17 Thread Aubrey Li
From: Aubrey Li 

Add idle cpumask to track idle cpus in sched domain. When a CPU
enters idle, if the idle driver indicates to stop tick, this CPU
is set in the idle cpumask to be a wakeup target. And if the CPU
is not in idle, the CPU is cleared in idle cpumask during scheduler
tick to ratelimit idle cpumask update.

When a task wakes up to select an idle cpu, scanning the idle cpumask
has a lower cost than scanning all the cpus in the last level cache domain,
especially when the system is heavily loaded.

Benchmarks were tested on a x86 4 socket system with 24 cores per
socket and 2 hyperthreads per core, total 192 CPUs. Hackbench and
schbench have no notable change, uperf has:

uperf throughput: netperf workload, tcp_nodelay, r/w size = 90

  threads   baseline-avg   %std   patch-avg   %std
  96        1              0.83   1.23        3.27
  144       1              1.03   1.67        2.67
  192       1              0.69   1.81        3.59
  240       1              2.84   1.51        2.67

Cc: Mel Gorman 
Cc: Vincent Guittot 
Cc: Qais Yousef 
Cc: Valentin Schneider 
Cc: Jiang Biao 
Cc: Tim Chen 
Signed-off-by: Aubrey Li 
---
 include/linux/sched/topology.h | 13 +
 kernel/sched/core.c|  2 ++
 kernel/sched/fair.c| 52 +-
 kernel/sched/idle.c|  7 +++--
 kernel/sched/sched.h   |  2 ++
 kernel/sched/topology.c|  3 +-
 6 files changed, 74 insertions(+), 5 deletions(-)

diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index 820511289857..b47b85163607 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -65,8 +65,21 @@ struct sched_domain_shared {
atomic_tref;
atomic_tnr_busy_cpus;
int has_idle_cores;
+   /*
+* Span of all idle CPUs in this domain.
+*
+* NOTE: this field is variable length. (Allocated dynamically
+* by attaching extra space to the end of the structure,
+* depending on how many CPUs the kernel has booted up with)
+*/
+   unsigned long   idle_cpus_span[];
 };
 
+static inline struct cpumask *sds_idle_cpus(struct sched_domain_shared *sds)
+{
+   return to_cpumask(sds->idle_cpus_span);
+}
+
 struct sched_domain {
/* These fields must be setup */
struct sched_domain __rcu *parent;  /* top domain must be null 
terminated */
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index b1e0da56abca..c86ae0495163 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3994,6 +3994,7 @@ void scheduler_tick(void)
rq_lock(rq, &rf);
 
update_rq_clock(rq);
+   update_idle_cpumask(rq, false);
thermal_pressure = arch_scale_thermal_pressure(cpu_of(rq));
update_thermal_load_avg(rq_clock_thermal(rq), rq, thermal_pressure);
curr->sched_class->task_tick(rq, curr, 0);
@@ -7192,6 +7193,7 @@ void __init sched_init(void)
rq_csd_init(rq, &rq->nohz_csd, nohz_csd_func);
 #endif
 #endif /* CONFIG_SMP */
+   rq->last_idle_state = 1;
hrtick_rq_init(rq);
atomic_set(&rq->nr_iowait, 0);
}
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 48a6d442b444..d67fba5e406b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6145,7 +6145,12 @@ static int select_idle_cpu(struct task_struct *p, struct 
sched_domain *sd, int t
 
time = cpu_clock(this);
 
-   cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
+   /*
+* sched_domain_shared is set only at shared cache level,
+* this works only because select_idle_cpu is called with
+* sd_llc.
+*/
+   cpumask_and(cpus, sds_idle_cpus(sd->shared), p->cpus_ptr);
 
for_each_cpu_wrap(cpu, cpus, target) {
if (!--nr)
@@ -6807,6 +6812,51 @@ balance_fair(struct rq *rq, struct task_struct *prev, 
struct rq_flags *rf)
 }
 #endif /* CONFIG_SMP */
 
+/*
+ * Update cpu idle state and record this information
+ * in sd_llc_shared->idle_cpus_span.
+ */
+void update_idle_cpumask(struct rq *rq, bool set_idle)
+{
+   struct sched_domain *sd;
+   int cpu = cpu_of(rq);
+   int idle_state;
+
+   /*
+* If called from scheduler tick, only update
+* idle cpumask if the CPU is busy, as idle
+* cpumask is also updated on idle entry.
+*
+*/
+   if (!set_idle && idle_cpu(cpu))
+   return;
+   /*
+* Also set SCHED_IDLE cpu in idle cpumask to
+* allow SCHED_IDLE cpu as a wakeup target
+*/
+   idle_state = set_idle || sched_idle_cpu(cpu);
+   /*
+* No need to update idle cpumask if the state
+* does not change.
+*/
+   if (rq->last_idle_state == idle_state)
+   return;
+
+   rcu_read_lock();
+  

Re: [RFC PATCH v3] sched/fair: select idle cpu from idle cpumask for task wakeup

2020-11-12 Thread Li, Aubrey
On 2020/11/12 18:57, Qais Yousef wrote:
> On 10/21/20 23:03, Aubrey Li wrote:
>> From: Aubrey Li 
>>
>> Added idle cpumask to track idle cpus in sched domain. When a CPU
>> enters idle, its corresponding bit in the idle cpumask will be set,
>> and when the CPU exits idle, its bit will be cleared.
>>
>> When a task wakes up to select an idle cpu, scanning idle cpumask
>> has low cost than scanning all the cpus in last level cache domain,
>> especially when the system is heavily loaded.
>>
>> v2->v3:
>> - change setting idle cpumask to every idle entry, otherwise schbench
>>   has a regression of 99th percentile latency.
>> - change clearing idle cpumask to nohz_balancer_kick(), so updating
>>   idle cpumask is ratelimited in the idle exiting path.
>> - set SCHED_IDLE cpu in idle cpumask to allow it as a wakeup target.
>>
>> v1->v2:
>> - idle cpumask is updated in the nohz routines, by initializing idle
>>   cpumask with sched_domain_span(sd), nohz=off case remains the original
>>   behavior.
> 
> Did you intend to put the patch version history in the commit message?
> 
> I started looking at this last week but got distracted. I see you already got
> enough reviews, so my 2p is that I faced some compilation issues:
> 
>   aarch64-linux-gnu-ld: kernel/sched/idle.o: in function 
> `set_next_task_idle':
>   /mnt/data/src/linux/kernel/sched/idle.c:405: undefined reference to 
> `update_idle_cpumask'
>   aarch64-linux-gnu-ld: kernel/sched/fair.o: in function 
> `nohz_balancer_kick':
>   /mnt/data/src/linux/kernel/sched/fair.c:10150: undefined reference to 
> `update_idle_cpumask'
>   aarch64-linux-gnu-ld: /mnt/data/src/linux/kernel/sched/fair.c:10148: 
> undefined reference to `update_idle_cpumask'
> 
> Because of the missing CONFIG_SCHED_SMT in my .config. I think
> update_idle_cpumask() should be defined unconditionally.

Thanks for pointing this out promptly :), I'll fix it in the next version.
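
Roughly, the declaration side could look like this (a sketch only; whether a
!CONFIG_SMP stub is needed depends on where the callers end up):

/* kernel/sched/sched.h -- visible regardless of CONFIG_SCHED_SMT */
#ifdef CONFIG_SMP
extern void update_idle_cpumask(struct rq *rq, bool idle_state);
#else
static inline void update_idle_cpumask(struct rq *rq, bool idle_state) { }
#endif

with the definition in kernel/sched/fair.c moved out of the CONFIG_SCHED_SMT
block so that configs like yours still link.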

-Aubrey


Re: [RFC PATCH v3] sched/fair: select idle cpu from idle cpumask for task wakeup

2020-11-11 Thread Li, Aubrey
On 2020/11/9 23:54, Valentin Schneider wrote:
> 
> On 09/11/20 13:40, Li, Aubrey wrote:
>> On 2020/11/7 5:20, Valentin Schneider wrote:
>>>
>>> On 21/10/20 16:03, Aubrey Li wrote:
>>>> From: Aubrey Li 
>>>>
>>>> Added idle cpumask to track idle cpus in sched domain. When a CPU
>>>> enters idle, its corresponding bit in the idle cpumask will be set,
>>>> and when the CPU exits idle, its bit will be cleared.
>>>>
>>>> When a task wakes up to select an idle cpu, scanning idle cpumask
>>>> has low cost than scanning all the cpus in last level cache domain,
>>>> especially when the system is heavily loaded.
>>>>
>>>
>>> FWIW I gave this a spin on my arm64 desktop (Ampere eMAG, 32 core). I get
>>> some barely noticeable (AIUI not statistically significant for bench sched)
>>> changes for 100 iterations of:
>>>
>>> | bench                              | metric   |   mean |     std |    q90 |    q99 |
>>> |------------------------------------+----------+--------+---------+--------+--------|
>>> | hackbench --loops 5000 --groups 1  | duration | -1.07% |  -2.23% | -0.88% | -0.25% |
>>> | hackbench --loops 5000 --groups 2  | duration | -0.79% | +30.60% | -0.49% | -0.74% |
>>> | hackbench --loops 5000 --groups 4  | duration | -0.54% |  +6.99% | -0.21% | -0.12% |
>>> | perf bench sched pipe -T -l 10     | ops/sec  | +1.05% |  -2.80% | -0.17% | +0.39% |
>>>
>>> q90 & q99 being the 90th and 99th percentile.
>>>
>>> Base was tip/sched/core at:
>>> d8fcb81f1acf ("sched/fair: Check for idle core in wake_affine")
>>
>> Thanks for the data, Valentin! So does the negative value mean improvement?
>>
> 
> For hackbench yes (shorter is better); for perf bench sched no, since the
> metric here is ops/sec so higher is better.
> 
> That said, I (use a tool that) run a 2-sample Kolmogorov–Smirnov test
> against the two sample sets (tip/sched/core vs tip/sched/core+patch), and
> the p-value for perf sched bench is quite high (~0.9) which means we can't
> reject that both sample sets come from the same distribution; long story
> short we can't say whether the patch had a noticeable impact for that
> benchmark.
> 
>> If so the data looks expected to me. As we set idle cpumask every time we
>> enter idle, but only clear it at the tick frequency, so if the workload
>> is not heavy enough, there could be a lot of idle during two ticks, so idle
>> cpumask is almost equal to sched_domain_span(sd), which makes no difference.
>>
>> But if the system load is heavy enough, CPU has few/no chance to enter idle,
>> then idle cpumask can be cleared during tick, which makes the bit number in
>> sds_idle_cpus(sd->shared) far less than the bit number in 
>> sched_domain_span(sd)
>> if llc domain has large count of CPUs.
>>
> 
> With hackbench -g 4 that's 160 tasks (against 32 CPUs, all under same LLC),
> although the work done by each task isn't much. I'll try bumping that a
> notch, or increasing the size of the messages.

As long as the system is busy enough to not schedule the idle thread, the
idle cpumask will shrink tick by tick, and we'll see a lower sd->avg_scan_cost.

This version of the patch sets the idle cpu bit every time the cpu enters idle,
so it takes a heavy load for the scheduler to not switch the idle thread in.

I personally like the logic in the previous versions, because in those versions,
- when the cpu enters idle, the cpuidle governor returns a "stop_tick" flag
- if the tick is stopped, which indicates the CPU is not busy, it can be set
  idle in the idle cpumask
- otherwise, the CPU is likely going to work very soon, so it is not set in
  the idle cpumask.

But apparently I missed the "nohz=off" case in the previous implementation. For
the "nohz=off" case I chose to keep the original behavior, which didn't satisfy
Mel. Probably I can refine it in the next version.
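
One shape I'm considering (an assumption on my side, not what v3 does today):
seed the mask with the full LLC span when the domain is built, so a nohz=off
kernel simply degenerates to today's full-span scan.

/* kernel/sched/topology.c -- illustrative hook, the name is made up */
static void sd_init_idle_cpumask(struct sched_domain *sd)
{
	/*
	 * Start with every CPU marked idle; if the idle entry/exit updates
	 * never run, select_idle_cpu() scans the same span it does today.
	 */
	cpumask_copy(sds_idle_cpus(sd->shared), sched_domain_span(sd));
}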

Do you have any suggestions?

Thanks,
-Aubrey


Re: [RFC PATCH v3] sched/fair: select idle cpu from idle cpumask for task wakeup

2020-11-09 Thread Li, Aubrey
On 2020/11/7 5:20, Valentin Schneider wrote:
> 
> On 21/10/20 16:03, Aubrey Li wrote:
>> From: Aubrey Li 
>>
>> Added idle cpumask to track idle cpus in sched domain. When a CPU
>> enters idle, its corresponding bit in the idle cpumask will be set,
>> and when the CPU exits idle, its bit will be cleared.
>>
>> When a task wakes up to select an idle cpu, scanning idle cpumask
>> has low cost than scanning all the cpus in last level cache domain,
>> especially when the system is heavily loaded.
>>
> 
> FWIW I gave this a spin on my arm64 desktop (Ampere eMAG, 32 core). I get
> some barely noticeable (AIUI not statistically significant for bench sched)
> changes for 100 iterations of:
> 
> | bench                              | metric   |   mean |     std |    q90 |    q99 |
> |------------------------------------+----------+--------+---------+--------+--------|
> | hackbench --loops 5000 --groups 1  | duration | -1.07% |  -2.23% | -0.88% | -0.25% |
> | hackbench --loops 5000 --groups 2  | duration | -0.79% | +30.60% | -0.49% | -0.74% |
> | hackbench --loops 5000 --groups 4  | duration | -0.54% |  +6.99% | -0.21% | -0.12% |
> | perf bench sched pipe -T -l 10     | ops/sec  | +1.05% |  -2.80% | -0.17% | +0.39% |
> 
> q90 & q99 being the 90th and 99th percentile.
> 
> Base was tip/sched/core at:
> d8fcb81f1acf ("sched/fair: Check for idle core in wake_affine")

Thanks for the data, Valentin! So does the negative value mean improvement?

If so, the data looks as expected to me. We set the idle cpumask every time we
enter idle, but only clear it at the tick frequency, so if the workload
is not heavy enough, there can be a lot of idle between two ticks, and the idle
cpumask stays almost equal to sched_domain_span(sd), which makes no difference.

But if the system load is heavy enough, the CPU has little or no chance to enter
idle, so the idle cpumask can be cleared during the tick, which makes the number
of bits in sds_idle_cpus(sd->shared) far less than the number of bits in
sched_domain_span(sd) if the llc domain has a large count of CPUs.

For example, if I run 4x overcommitted uperf on a system with 192 CPUs,
I observed:
- default, the average of this_sd->avg_scan_cost is 223.12ns
- patch, the average of this_sd->avg_scan_cost is 63.4ns

And select_idle_cpu is called 7670253 times per second, so per CPU the
scan cost saved is (223.12 - 63.4) * 7670253 / 192 = 6.4ms per second. As a
result, I saw uperf throughput improve by 60+%.
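
The same arithmetic as a stand-alone check, with the numbers copied from above:

#include <stdio.h>

int main(void)
{
	double default_ns = 223.12;		/* avg_scan_cost, default kernel */
	double patched_ns = 63.4;		/* avg_scan_cost, patched kernel */
	double calls_per_sec = 7670253.0;	/* select_idle_cpu() calls, system-wide */
	int cpus = 192;
	double saved_ms_per_cpu;

	saved_ms_per_cpu = (default_ns - patched_ns) * calls_per_sec / cpus / 1e6;

	/* prints roughly 6.4 ms of scan time saved per CPU per second */
	printf("saved per CPU per second: %.1f ms\n", saved_ms_per_cpu);
	return 0;
}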

Thanks,
-Aubrey

 
 






Re: [RFC PATCH v3] sched/fair: select idle cpu from idle cpumask for task wakeup

2020-11-08 Thread Li, Aubrey
On 2020/11/6 15:58, Vincent Guittot wrote:
> On Wed, 21 Oct 2020 at 17:05, Aubrey Li  wrote:
>>
>> From: Aubrey Li 
>>
>> Added idle cpumask to track idle cpus in sched domain. When a CPU
>> enters idle, its corresponding bit in the idle cpumask will be set,
>> and when the CPU exits idle, its bit will be cleared.
>>
>> When a task wakes up to select an idle cpu, scanning idle cpumask
>> has low cost than scanning all the cpus in last level cache domain,
>> especially when the system is heavily loaded.
>>
>> v2->v3:
>> - change setting idle cpumask to every idle entry, otherwise schbench
>>   has a regression of 99th percentile latency.
>> - change clearing idle cpumask to nohz_balancer_kick(), so updating
>>   idle cpumask is ratelimited in the idle exiting path.
>> - set SCHED_IDLE cpu in idle cpumask to allow it as a wakeup target.
>>
>> v1->v2:
>> - idle cpumask is updated in the nohz routines, by initializing idle
>>   cpumask with sched_domain_span(sd), nohz=off case remains the original
>>   behavior.
>>
>> Cc: Mel Gorman 
>> Cc: Vincent Guittot 
>> Cc: Qais Yousef 
>> Cc: Valentin Schneider 
>> Cc: Jiang Biao 
>> Cc: Tim Chen 
>> Signed-off-by: Aubrey Li 
>> ---
>>  include/linux/sched/topology.h | 13 ++
>>  kernel/sched/fair.c| 45 +-
>>  kernel/sched/idle.c|  1 +
>>  kernel/sched/sched.h   |  1 +
>>  kernel/sched/topology.c|  3 ++-
>>  5 files changed, 61 insertions(+), 2 deletions(-)
>>
>> diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
>> index fb11091129b3..43a641d26154 100644
>> --- a/include/linux/sched/topology.h
>> +++ b/include/linux/sched/topology.h
>> @@ -65,8 +65,21 @@ struct sched_domain_shared {
>> atomic_tref;
>> atomic_tnr_busy_cpus;
>> int has_idle_cores;
>> +   /*
>> +* Span of all idle CPUs in this domain.
>> +*
>> +* NOTE: this field is variable length. (Allocated dynamically
>> +* by attaching extra space to the end of the structure,
>> +* depending on how many CPUs the kernel has booted up with)
>> +*/
>> +   unsigned long   idle_cpus_span[];
>>  };
>>
>> +static inline struct cpumask *sds_idle_cpus(struct sched_domain_shared *sds)
>> +{
>> +   return to_cpumask(sds->idle_cpus_span);
>> +}
>> +
>>  struct sched_domain {
>> /* These fields must be setup */
>> struct sched_domain __rcu *parent;  /* top domain must be null 
>> terminated */
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 6b3b59cc51d6..088d1995594f 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -6023,6 +6023,38 @@ void __update_idle_core(struct rq *rq)
>> rcu_read_unlock();
>>  }
>>
>> +static DEFINE_PER_CPU(bool, cpu_idle_state);
>> +/*
>> + * Update cpu idle state and record this information
>> + * in sd_llc_shared->idle_cpus_span.
>> + */
>> +void update_idle_cpumask(struct rq *rq, bool idle_state)
>> +{
>> +   struct sched_domain *sd;
>> +   int cpu = cpu_of(rq);
>> +
>> +   /*
>> +* No need to update idle cpumask if the state
>> +* does not change.
>> +*/
>> +   if (per_cpu(cpu_idle_state, cpu) == idle_state)
>> +   return;
>> +
>> +   per_cpu(cpu_idle_state, cpu) = idle_state;
>> +
>> +   rcu_read_lock();
>> +
>> +   sd = rcu_dereference(per_cpu(sd_llc, cpu));
>> +   if (!sd || !sd->shared)
>> +   goto unlock;
>> +   if (idle_state)
>> +   cpumask_set_cpu(cpu, sds_idle_cpus(sd->shared));
>> +   else
>> +   cpumask_clear_cpu(cpu, sds_idle_cpus(sd->shared));
>> +unlock:
>> +   rcu_read_unlock();
>> +}
>> +
>>  /*
>>   * Scan the entire LLC domain for idle cores; this dynamically switches off 
>> if
>>   * there are no idle cores left in the system; tracked through
>> @@ -6136,7 +6168,12 @@ static int select_idle_cpu(struct task_struct *p, 
>> struct sched_domain *sd, int t
>>
>> time = cpu_clock(this);
>>
>> -   cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
>> +   /*
>> +* sched_domain_shared is set only at shared cache level,
>> +* this works onl

Re: [PATCH v8 -tip 00/26] Core scheduling

2020-11-08 Thread Li, Aubrey
On 2020/11/7 1:54, Joel Fernandes wrote:
> On Fri, Nov 06, 2020 at 10:58:58AM +0800, Li, Aubrey wrote:
> 
>>>
>>> -- workload D, new added syscall workload, performance drop in cs_on:
>>> +--+--+---+
>>> |  | **   | will-it-scale  * 192  |
>>> |  |  | (pipe based context_switch)   |
>>> +==+==+===+
>>> | cgroup   | **   | cg_will-it-scale  |
>>> +--+--+---+
>>> | record_item  | **   | threads_avg   |
>>> +--+--+---+
>>> | coresched_normalized | **   | 0.2   |
>>> +--+--+---+
>>> | default_normalized   | **   | 1 |
>>> +--+--+---+
>>> | smtoff_normalized| **   | 0.89  |
>>> +--+--+---+
>>
>> will-it-scale may be a very extreme case. The story here is,
>> - On one sibling reader/writer gets blocked and tries to schedule another 
>> reader/writer in.
>> - The other sibling tries to wake up reader/writer.
>>
>> Both CPUs are acquiring rq->__lock,
>>
>> So when coresched off, they are two different locks, lock stat(1 second 
>> delta) below:
>>
>> class namecon-bouncescontentions   waittime-min   waittime-max 
>> waittime-total   waittime-avgacq-bounces   acquisitions   holdtime-min   
>> holdtime-max holdtime-total   holdtime-avg
>> &rq->__lock:  210210   0.10   3.04   
>>   180.87   0.86797   79165021   0.03 
>>  20.6960650198.34   0.77
>>
>> But when coresched on, they are actually one same lock, lock stat(1 second 
>> delta) below:
>>
>> class namecon-bouncescontentions   waittime-min   waittime-max 
>> waittime-total   waittime-avgacq-bounces   acquisitions   holdtime-min   
>> holdtime-max holdtime-total   holdtime-avg
>> &rq->__lock:  64794596484857   0.05 216.46
>> 60829776.85   9.388346319   15399739   0.03  
>> 95.5681119515.38   5.27
>>
>> This nature of core scheduling may degrade the performance of similar 
>> workloads with frequent context switching.
> 
> When core sched is off, is SMT off as well? From the above table, it seems to
> be. So even for core sched off, there will be a single lock per physical CPU
> core (assuming SMT is also off) right? Or did I miss something?
> 

The table includes 3 cases:
- default:  SMT on,  coresched off
- coresched:SMT on,  coresched on
- smtoff:   SMT off, coresched off

I was comparing the default (coresched off & SMT on) case with the (coresched
on & SMT on) case.

If SMT is off, then the reader and writer on different cores have different
rq->locks, so the lock contention is not that serious.

class namecon-bouncescontentions   waittime-min   waittime-max 
waittime-total   waittime-avgacq-bounces   acquisitions   holdtime-min   
holdtime-max holdtime-total   holdtime-avg
&rq->__lock:   60 60   0.11   1.92  
41.33   0.69127   67184172   0.03  
22.9533160428.37   0.49

Does this address your concern?

Thanks,
-Aubrey



Re: [PATCH v8 -tip 00/26] Core scheduling

2020-11-05 Thread Li, Aubrey
ake up reader/writer.

Both CPUs are acquiring rq->__lock,

So when coresched off, they are two different locks, lock stat(1 second delta) 
below:

class namecon-bouncescontentions   waittime-min   waittime-max 
waittime-total   waittime-avgacq-bounces   acquisitions   holdtime-min   
holdtime-max holdtime-total   holdtime-avg
&rq->__lock:  210210   0.10   3.04 
180.87   0.86797   79165021   0.03  
20.6960650198.34   0.77

But when coresched on, they are actually one same lock, lock stat(1 second 
delta) below:

class namecon-bouncescontentions   waittime-min   waittime-max 
waittime-total   waittime-avgacq-bounces   acquisitions   holdtime-min   
holdtime-max holdtime-total   holdtime-avg
&rq->__lock:  64794596484857   0.05 216.46
60829776.85   9.388346319   15399739   0.03 
 95.5681119515.38   5.27

This nature of core scheduling may degrade the performance of similar workloads 
with frequent context switching.

Any thoughts?

Thanks,
-Aubrey


Re: [RFC PATCH v3] sched/fair: select idle cpu from idle cpumask for task wakeup

2020-11-04 Thread Li, Aubrey
Hi Valentin,

Thanks for your reply.

On 2020/11/4 3:27, Valentin Schneider wrote:
> 
> Hi,
> 
> On 21/10/20 16:03, Aubrey Li wrote:
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 6b3b59cc51d6..088d1995594f 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -6023,6 +6023,38 @@ void __update_idle_core(struct rq *rq)
>>   rcu_read_unlock();
>>  }
>>
>> +static DEFINE_PER_CPU(bool, cpu_idle_state);
> 
> I would've expected this to be far less compact than a cpumask, but that's
> not the story readelf is telling me. Objdump tells me this is recouping
> some of the padding in .data..percpu, at least with the arm64 defconfig.
> 
> In any case this ought to be better wrt cacheline bouncing, which I suppose
> is what we ultimately want here.

Yes, every CPU has a byte, so it may not be smaller than a cpumask. Perhaps I
can put it into struct rq instead; do you have any better suggestions?
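
A rough sketch of that alternative (the field name is made up; everything
else follows the helper in the patch):

/* kernel/sched/sched.h: keep the last idle state pushed into
 * sd_llc_shared->idle_cpus_span in the runqueue itself, instead of a
 * separate per-CPU variable. Sketch only. */
struct rq {
	/* ... existing fields ... */
	bool			last_idle_state;
};

/* kernel/sched/fair.c */
void update_idle_cpumask(struct rq *rq, bool idle_state)
{
	struct sched_domain *sd;
	int cpu = cpu_of(rq);

	/* No need to update the idle cpumask if the state does not change. */
	if (rq->last_idle_state == idle_state)
		return;

	rq->last_idle_state = idle_state;

	rcu_read_lock();
	sd = rcu_dereference(per_cpu(sd_llc, cpu));
	if (!sd || !sd->shared)
		goto unlock;
	if (idle_state)
		cpumask_set_cpu(cpu, sds_idle_cpus(sd->shared));
	else
		cpumask_clear_cpu(cpu, sds_idle_cpus(sd->shared));
unlock:
	rcu_read_unlock();
}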

> 
> Also, see rambling about init value below.
> 
>> @@ -10070,6 +10107,12 @@ static void nohz_balancer_kick(struct rq *rq)
>>   if (unlikely(rq->idle_balance))
>>   return;
>>
>> +/* The CPU is not in idle, update idle cpumask */
>> +if (unlikely(sched_idle_cpu(cpu))) {
>> +/* Allow SCHED_IDLE cpu as a wakeup target */
>> +update_idle_cpumask(rq, true);
>> +} else
>> +update_idle_cpumask(rq, false);
> 
> This means that without CONFIG_NO_HZ_COMMON, a CPU going into idle will
> never be accounted as going out of it, right? Eventually the cpumask
> should end up full, which conceptually implements the previous behaviour of
> select_idle_cpu() but in a fairly roundabout way...

Maybe I can move it to scheduler_tick().
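
Roughly along these lines (sketch only, reusing update_idle_cpumask() from
the patch; whether the tick is a good enough rate limit still needs to be
measured):

/* kernel/sched/core.c, scheduler_tick() -- sketch only */
void scheduler_tick(void)
{
	int cpu = smp_processor_id();
	struct rq *rq = cpu_rq(cpu);
	...
	update_rq_clock(rq);
	/*
	 * A busy tick means this CPU is not idle; pass sched_idle_cpu()
	 * so a CPU running only SCHED_IDLE tasks stays a wakeup target,
	 * like the nohz_balancer_kick() hook in the patch.
	 */
	update_idle_cpumask(rq, sched_idle_cpu(cpu));
	...
}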

> 
>> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
>> index 9079d865a935..f14a6ef4de57 100644
>> --- a/kernel/sched/topology.c
>> +++ b/kernel/sched/topology.c
>> @@ -1407,6 +1407,7 @@ sd_init(struct sched_domain_topology_level *tl,
>>   sd->shared = *per_cpu_ptr(sdd->sds, sd_id);
>>   atomic_inc(>shared->ref);
>>   atomic_set(>shared->nr_busy_cpus, sd_weight);
>> +cpumask_copy(sds_idle_cpus(sd->shared), sched_domain_span(sd));
> 
> So at init you would have (single LLC for sake of simplicity):
> 
>   \all cpu : cpu_idle_state[cpu]  == false
>   cpumask_full(sds_idle_cpus) == true
> 
> IOW it'll require all CPUs to go idle at some point for these two states to
> be properly aligned. Should cpu_idle_state not then be init'd to 1?
> 
> This then happens again for hotplug, except that cpu_idle_state[cpu] may be
> either true or false when the sds_idle_cpus mask is reset to 1's.
> 

okay, will refine this in the next version.
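
For the init-value part, a minimal change on top of the patch could be
(sketch; hotplug would still need an explicit reset):

-static DEFINE_PER_CPU(bool, cpu_idle_state);
+/* Start from "idle" so the flag agrees with idle_cpus_span, which is
+ * initialized to the full domain span in sd_init(). */
+static DEFINE_PER_CPU(bool, cpu_idle_state) = true;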

Thanks,
-Aubrey


[PATCH v1] coresched/proc: add forceidle report with coresched enabled

2020-10-29 Thread Aubrey Li
When a CPU is running a task with coresched enabled, its sibling will
be forced idle if the sibling does not have a trusted task to run. It
is useful to report forceidle time to understand how tasks with different
cookies perform throughout the system.

forceidle is added as the last column of /proc/stat:

  $ cat /proc/stat
  cpu  102034 0 11992 8347016 1046 0 11 0 0 0 991
  cpu0 59 0 212 80364 59 0 0 0 0 0 0
  cpu1 72057 0 89 9102 0 0 0 0 0 0 90

So forceidle% can be computed by any user-space tool, for example:

  CPU     user%   system%   iowait%   forceidle%   idle%
  cpu53   24.75   0.00      0.00%     0.99%        74.26%
  CPU     user%   system%   iowait%   forceidle%   idle%
  cpu53   25.74   0.00      0.00%     0.99%        73.27%
  CPU     user%   system%   iowait%   forceidle%   idle%
  cpu53   24.75   0.00      0.00%     0.99%        74.26%
  CPU     user%   system%   iowait%   forceidle%   idle%
  cpu53   25.24   0.00      0.00%     3.88%        70.87%
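
For reference, a minimal user-space sketch of that computation (not part of
the patch; it assumes forceidle is the 11th value on each cpu line as shown
above, and hard-codes cpu53):

#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Read the 11 counters of one "cpuN" line from /proc/stat. */
static int read_cpu(const char *name, unsigned long long v[11])
{
	char line[256], cpu[16];
	FILE *f = fopen("/proc/stat", "r");
	int found = 0;

	if (!f)
		return -1;
	while (fgets(line, sizeof(line), f)) {
		if (sscanf(line, "%15s %llu %llu %llu %llu %llu %llu %llu %llu %llu %llu %llu",
			   cpu, &v[0], &v[1], &v[2], &v[3], &v[4], &v[5],
			   &v[6], &v[7], &v[8], &v[9], &v[10]) == 12 &&
		    !strcmp(cpu, name)) {
			found = 1;
			break;
		}
	}
	fclose(f);
	return found ? 0 : -1;
}

int main(void)
{
	unsigned long long a[11], b[11], total = 0;
	int i;

	if (read_cpu("cpu53", a))
		return 1;
	sleep(1);
	if (read_cpu("cpu53", b))
		return 1;

	for (i = 0; i < 11; i++)
		total += b[i] - a[i];

	/* v[10] is the appended forceidle column */
	printf("forceidle%%: %.2f\n",
	       total ? 100.0 * (b[10] - a[10]) / total : 0.0);
	return 0;
}

Sampling like this once per second reproduces the kind of table shown above.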

Signed-off-by: Aubrey Li 
---
 fs/proc/stat.c  | 48 +
 include/linux/kernel_stat.h |  1 +
 include/linux/tick.h|  2 ++
 kernel/time/tick-sched.c| 48 +
 kernel/time/tick-sched.h|  3 +++
 5 files changed, 102 insertions(+)

diff --git a/fs/proc/stat.c b/fs/proc/stat.c
index 46b3293015fe..b27ccac7b5a4 100644
--- a/fs/proc/stat.c
+++ b/fs/proc/stat.c
@@ -28,7 +28,11 @@ static u64 get_idle_time(struct kernel_cpustat *kcs, int cpu)
u64 idle;
 
idle = kcs->cpustat[CPUTIME_IDLE];
+#ifdef CONFIG_SCHED_CORE
+   if (cpu_online(cpu) && !nr_iowait_cpu(cpu) && 
!cpu_rq(cpu)->core->core_forceidle)
+#else
if (cpu_online(cpu) && !nr_iowait_cpu(cpu))
+#endif
idle += arch_idle_time(cpu);
return idle;
 }
@@ -43,6 +47,17 @@ static u64 get_iowait_time(struct kernel_cpustat *kcs, int 
cpu)
return iowait;
 }
 
+#ifdef CONFIG_SCHED_CORE
+static u64 get_forceidle_time(struct kernel_cpustat *kcs, int cpu)
+{
+   u64 forceidle;
+
+   forceidle = kcs->cpustat[CPUTIME_FORCEIDLE];
+   if (cpu_online(cpu) && cpu_rq(cpu)->core->core_forceidle)
+   forceidle += arch_idle_time(cpu);
+   return forceidle;
+}
+#endif
 #else
 
 static u64 get_idle_time(struct kernel_cpustat *kcs, int cpu)
@@ -77,6 +92,21 @@ static u64 get_iowait_time(struct kernel_cpustat *kcs, int 
cpu)
return iowait;
 }
 
+static u64 get_forceidle_time(struct kernel_cpustat *kcs, int cpu)
+{
+   u64 forceidle, forceidle_usecs = -1ULL;
+
+   if (cpu_online(cpu))
+   forceidle_usecs = get_cpu_forceidle_time_us(cpu, NULL);
+
+   if (forceidle_usecs == -1ULL)
+   /* !NO_HZ or cpu offline so we can rely on cpustat.forceidle */
+   forceidle = kcs->cpustat[CPUTIME_FORCEIDLE];
+   else
+   forceidle = forceidle_usecs * NSEC_PER_USEC;
+
+   return forceidle;
+}
 #endif
 
 static void show_irq_gap(struct seq_file *p, unsigned int gap)
@@ -111,12 +141,18 @@ static int show_stat(struct seq_file *p, void *v)
u64 guest, guest_nice;
u64 sum = 0;
u64 sum_softirq = 0;
+#ifdef CONFIG_SCHED_CORE
+   u64 forceidle;
+#endif
unsigned int per_softirq_sums[NR_SOFTIRQS] = {0};
struct timespec64 boottime;
 
user = nice = system = idle = iowait =
irq = softirq = steal = 0;
guest = guest_nice = 0;
+#ifdef CONFIG_SCHED_CORE
+   forceidle = 0;
+#endif
	getboottime64(&boottime);
 
for_each_possible_cpu(i) {
@@ -130,6 +166,9 @@ static int show_stat(struct seq_file *p, void *v)
system  += cpustat[CPUTIME_SYSTEM];
	idle		+= get_idle_time(&kcpustat, i);
	iowait		+= get_iowait_time(&kcpustat, i);
+#ifdef CONFIG_SCHED_CORE
+	forceidle	+= get_forceidle_time(&kcpustat, i);
+#endif
irq += cpustat[CPUTIME_IRQ];
softirq += cpustat[CPUTIME_SOFTIRQ];
steal   += cpustat[CPUTIME_STEAL];
@@ -157,6 +196,9 @@ static int show_stat(struct seq_file *p, void *v)
seq_put_decimal_ull(p, " ", nsec_to_clock_t(steal));
seq_put_decimal_ull(p, " ", nsec_to_clock_t(guest));
seq_put_decimal_ull(p, " ", nsec_to_clock_t(guest_nice));
+#ifdef CONFIG_SCHED_CORE
+   seq_put_decimal_ull(p, " ", nsec_to_clock_t(forceidle));
+#endif
seq_putc(p, '\n');
 
for_each_online_cpu(i) {
@@ -171,6 +213,9 @@ static int show_stat(struct seq_file *p, void *v)
system  = cpustat[CPUTIME_SYSTEM];
	idle		= get_idle_time(&kcpustat, i);
	iowait		= get_iowait_time(&kcpustat, i);
+#ifdef CONFIG_SCHED_CORE
+	forceidle	= get_forceidle_time(&kcpustat, i);
+#endif
irq = cpustat[CPUTIME_IRQ];
softir

Re: [PATCH v8 -tip 02/26] sched: Introduce sched_class::pick_task()

2020-10-26 Thread Li, Aubrey
On 2020/10/26 17:01, Peter Zijlstra wrote:
> On Sat, Oct 24, 2020 at 08:27:16AM -0400, Vineeth Pillai wrote:
>>
>>
>> On 10/24/20 7:10 AM, Vineeth Pillai wrote:
>>>
>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>> index 93a3b874077d..4cae5ac48b60 100644
>>> --- a/kernel/sched/fair.c
>>> +++ b/kernel/sched/fair.c
>>> @@ -4428,12 +4428,14 @@ pick_next_entity(struct cfs_rq *cfs_rq, struct
>>> sched_entity *curr)
>>>     se = second;
>>>     }
>>>
>>> -   if (cfs_rq->next && wakeup_preempt_entity(cfs_rq->next, left) <
>>> 1) {
>>> +   if (left && cfs_rq->next &&
>>> +   wakeup_preempt_entity(cfs_rq->next, left) < 1) {
>>>     /*
>>>  * Someone really wants this to run. If it's not unfair,
>>> run it.
>>>  */
>>>     se = cfs_rq->next;
>>> -   } else if (cfs_rq->last && wakeup_preempt_entity(cfs_rq->last,
>>> left) < 1) {
>>> +   } else if (left && cfs_rq->last &&
>>> +   wakeup_preempt_entity(cfs_rq->last, left) < 1) {
>>>     /*
>>>  * Prefer last buddy, try to return the CPU to a
>>> preempted task.
>>>
>>>
>>> There reason for left being NULL needs to be investigated. This was
>>> there from v1 and we did not yet get to it. I shall try to debug later
>>> this week.
>>
>> Thinking more about it and looking at the crash, I think that
>> 'left == NULL' can happen in pick_next_entity for core scheduling.
>> If a cfs_rq has only one task that is running, then it will be
>> dequeued and 'left = __pick_first_entity()' will be NULL as the
>> cfs_rq will be empty. This would not happen outside of coresched
>> because we never call pick_tack() before put_prev_task() which
>> will enqueue the task back.
>>
>> With core scheduling, a cpu can call pick_task() for its sibling while
>> the sibling is still running the active task and put_prev_task has yet
>> not been called. This can result in 'left == NULL'.
> 
> Quite correct. Hurmph.. the reason we do this is because... we do the
> update_curr() the wrong way around. And I can't seem to remember why we
> do that (it was in my original patches).
> 
> Something like so seems the obvious thing to do, but I can't seem to
> remember why we're not doing it :-(
> 
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6950,15 +6950,10 @@ static struct task_struct *pick_task_fai
>   do {
>   struct sched_entity *curr = cfs_rq->curr;
>  
> - se = pick_next_entity(cfs_rq, NULL);
> + if (curr && curr->on_rq)
> + update_curr(cfs_rq);
>  
> - if (curr) {
> - if (se && curr->on_rq)
> - update_curr(cfs_rq);
> -
> - if (!se || entity_before(curr, se))
> - se = curr;
> - }
> + se = pick_next_entity(cfs_rq, curr);
>  
>   cfs_rq = group_cfs_rq(se);
>   } while (cfs_rq);
> 

This patch works too for my benchmark, thanks Peter!


Re: [PATCH v8 -tip 24/26] sched: Move core-scheduler interfacing code to a new file

2020-10-25 Thread Li, Aubrey
On 2020/10/20 9:43, Joel Fernandes (Google) wrote:
> core.c is already huge. The core-tagging interface code is largely
> independent of it. Move it to its own file to make both files easier to
> maintain.
> 
> Tested-by: Julien Desfossez 
> Signed-off-by: Joel Fernandes (Google) 
> ---
>  kernel/sched/Makefile  |   1 +
>  kernel/sched/core.c| 481 +
>  kernel/sched/coretag.c | 468 +++
>  kernel/sched/sched.h   |  56 -
>  4 files changed, 523 insertions(+), 483 deletions(-)
>  create mode 100644 kernel/sched/coretag.c
> 
> diff --git a/kernel/sched/Makefile b/kernel/sched/Makefile
> index 5fc9c9b70862..c526c20adf9d 100644
> --- a/kernel/sched/Makefile
> +++ b/kernel/sched/Makefile
> @@ -36,3 +36,4 @@ obj-$(CONFIG_CPU_FREQ_GOV_SCHEDUTIL) += cpufreq_schedutil.o
>  obj-$(CONFIG_MEMBARRIER) += membarrier.o
>  obj-$(CONFIG_CPU_ISOLATION) += isolation.o
>  obj-$(CONFIG_PSI) += psi.o
> +obj-$(CONFIG_SCHED_CORE) += coretag.o
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index b3afbba5abe1..211e0784675f 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -162,11 +162,6 @@ static bool sched_core_empty(struct rq *rq)
>   return RB_EMPTY_ROOT(>core_tree);
>  }
>  
> -static bool sched_core_enqueued(struct task_struct *task)
> -{
> - return !RB_EMPTY_NODE(>core_node);
> -}
> -
>  static struct task_struct *sched_core_first(struct rq *rq)
>  {
>   struct task_struct *task;
> @@ -188,7 +183,7 @@ static void sched_core_flush(int cpu)
>   rq->core->core_task_seq++;
>  }
>  
> -static void sched_core_enqueue(struct rq *rq, struct task_struct *p)
> +void sched_core_enqueue(struct rq *rq, struct task_struct *p)
>  {
>   struct rb_node *parent, **node;
>   struct task_struct *node_task;
> @@ -215,7 +210,7 @@ static void sched_core_enqueue(struct rq *rq, struct 
> task_struct *p)
>   rb_insert_color(>core_node, >core_tree);
>  }
>  
> -static void sched_core_dequeue(struct rq *rq, struct task_struct *p)
> +void sched_core_dequeue(struct rq *rq, struct task_struct *p)
>  {
>   rq->core->core_task_seq++;
>  
> @@ -310,7 +305,6 @@ static int __sched_core_stopper(void *data)
>  }
>  
>  static DEFINE_MUTEX(sched_core_mutex);
> -static DEFINE_MUTEX(sched_core_tasks_mutex);
>  static int sched_core_count;
>  
>  static void __sched_core_enable(void)
> @@ -346,16 +340,6 @@ void sched_core_put(void)
>   __sched_core_disable();
>   mutex_unlock(_core_mutex);
>  }
> -
> -static int sched_core_share_tasks(struct task_struct *t1, struct task_struct 
> *t2);
> -
> -#else /* !CONFIG_SCHED_CORE */
> -
> -static inline void sched_core_enqueue(struct rq *rq, struct task_struct *p) 
> { }
> -static inline void sched_core_dequeue(struct rq *rq, struct task_struct *p) 
> { }
> -static bool sched_core_enqueued(struct task_struct *task) { return false; }
> -static int sched_core_share_tasks(struct task_struct *t1, struct task_struct 
> *t2) { }
> -
>  #endif /* CONFIG_SCHED_CORE */
>  
>  /*
> @@ -8505,9 +8489,6 @@ void sched_offline_group(struct task_group *tg)
>   spin_unlock_irqrestore(_group_lock, flags);
>  }
>  
> -#define SCHED_CORE_GROUP_COOKIE_MASK ((1UL << (sizeof(unsigned long) * 4)) - 
> 1)
> -static unsigned long cpu_core_get_group_cookie(struct task_group *tg);
> -
>  static void sched_change_group(struct task_struct *tsk, int type)
>  {
>   struct task_group *tg;
> @@ -8583,11 +8564,6 @@ void sched_move_task(struct task_struct *tsk)
>   task_rq_unlock(rq, tsk, );
>  }
>  
> -static inline struct task_group *css_tg(struct cgroup_subsys_state *css)
> -{
> - return css ? container_of(css, struct task_group, css) : NULL;
> -}
> -
>  static struct cgroup_subsys_state *
>  cpu_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
>  {
> @@ -9200,459 +9176,6 @@ static u64 cpu_rt_period_read_uint(struct 
> cgroup_subsys_state *css,
>  }
>  #endif /* CONFIG_RT_GROUP_SCHED */
>  
> -#ifdef CONFIG_SCHED_CORE
> -/*
> - * A simple wrapper around refcount. An allocated sched_core_cookie's
> - * address is used to compute the cookie of the task.
> - */
> -struct sched_core_cookie {
> - refcount_t refcnt;
> -};
> -
> -/*
> - * sched_core_tag_requeue - Common helper for all interfaces to set a cookie.
> - * @p: The task to assign a cookie to.
> - * @cookie: The cookie to assign.
> - * @group: is it a group interface or a per-task interface.
> - *
> - * This function is typically called from a stop-machine handler.
> - */
> -void sched_core_tag_requeue(struct task_struct *p, unsigned long cookie, 
> bool group)
> -{
> - if (!p)
> - return;
> -
> - if (group)
> - p->core_group_cookie = cookie;
> - else
> - p->core_task_cookie = cookie;
> -
> - /* Use up half of the cookie's bits for task cookie and remaining for 
> group cookie. */
> - p->core_cookie = (p->core_task_cookie <<
> - 

Re: [PATCH v8 -tip 02/26] sched: Introduce sched_class::pick_task()

2020-10-24 Thread Li, Aubrey
On 2020/10/24 20:27, Vineeth Pillai wrote:
> 
> 
> On 10/24/20 7:10 AM, Vineeth Pillai wrote:
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 93a3b874077d..4cae5ac48b60 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -4428,12 +4428,14 @@ pick_next_entity(struct cfs_rq *cfs_rq, struct 
>> sched_entity *curr)
>>     se = second;
>>     }
>>
>> -   if (cfs_rq->next && wakeup_preempt_entity(cfs_rq->next, left) < 1) {
>> +   if (left && cfs_rq->next &&
>> +   wakeup_preempt_entity(cfs_rq->next, left) < 1) {
>>     /*
>>  * Someone really wants this to run. If it's not unfair, run 
>> it.
>>  */
>>     se = cfs_rq->next;
>> -   } else if (cfs_rq->last && wakeup_preempt_entity(cfs_rq->last, left) 
>> < 1) {
>> +   } else if (left && cfs_rq->last &&
>> +   wakeup_preempt_entity(cfs_rq->last, left) < 1) {
>>     /*
>>  * Prefer last buddy, try to return the CPU to a preempted 
>> task.
>>
>>
>> There reason for left being NULL needs to be investigated. This was
>> there from v1 and we did not yet get to it. I shall try to debug later
>> this week.
> 
> Thinking more about it and looking at the crash, I think that
> 'left == NULL' can happen in pick_next_entity for core scheduling.
> If a cfs_rq has only one task that is running, then it will be
> dequeued and 'left = __pick_first_entity()' will be NULL as the
> cfs_rq will be empty. This would not happen outside of coresched
> because we never call pick_tack() before put_prev_task() which
> will enqueue the task back.
> 
> With core scheduling, a cpu can call pick_task() for its sibling while
> the sibling is still running the active task and put_prev_task has yet
> not been called. This can result in 'left == NULL'. So I think the
> above fix is appropriate when core scheduling is active. It could be
> cleaned up a bit though.

This patch works, thanks Vineeth for the quick fix!


Re: [PATCH v8 -tip 02/26] sched: Introduce sched_class::pick_task()

2020-10-23 Thread Li, Aubrey
On 2020/10/24 5:47, Joel Fernandes wrote:
> On Fri, Oct 23, 2020 at 01:25:38PM +0800, Li, Aubrey wrote:
>>>>> @@ -2517,6 +2528,7 @@ const struct sched_class dl_sched_class
>>>>>
>>>>>  #ifdef CONFIG_SMP
>>>>>   .balance= balance_dl,
>>>>> + .pick_task  = pick_task_dl,
>>>>>   .select_task_rq = select_task_rq_dl,
>>>>>   .migrate_task_rq= migrate_task_rq_dl,
>>>>>   .set_cpus_allowed   = set_cpus_allowed_dl,
>>>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>>>> index dbd9368a959d..bd6aed63f5e3 100644
>>>>> --- a/kernel/sched/fair.c
>>>>> +++ b/kernel/sched/fair.c
>>>>> @@ -4450,7 +4450,7 @@ pick_next_entity(struct cfs_rq *cfs_rq, struct 
>>>>> sched_entity *curr)
>>>>>* Avoid running the skip buddy, if running something else can
>>>>>* be done without getting too unfair.
>>>>>*/
>>>>> - if (cfs_rq->skip == se) {
>>>>> + if (cfs_rq->skip && cfs_rq->skip == se) {
>>>>>   struct sched_entity *second;
>>>>>
>>>>>   if (se == curr) {
>>>>> @@ -6976,6 +6976,35 @@ static void check_preempt_wakeup(struct rq *rq, 
>>>>> struct task_struct *p, int wake_
>>>>>   set_last_buddy(se);
>>>>>  }
>>>>>
>>>>> +#ifdef CONFIG_SMP
>>>>> +static struct task_struct *pick_task_fair(struct rq *rq)
>>>>> +{
>>>>> + struct cfs_rq *cfs_rq = &rq->cfs;
>>>>> + struct sched_entity *se;
>>>>> +
>>>>> + if (!cfs_rq->nr_running)
>>>>> + return NULL;
>>>>> +
>>>>> + do {
>>>>> + struct sched_entity *curr = cfs_rq->curr;
>>>>> +
>>>>> + se = pick_next_entity(cfs_rq, NULL);
>>>>> +
>>>>> + if (curr) {
>>>>> + if (se && curr->on_rq)
>>>>> + update_curr(cfs_rq);
>>>>> +
>>>>> + if (!se || entity_before(curr, se))
>>>>> + se = curr;
>>>>> + }
>>>>> +
>>>>> + cfs_rq = group_cfs_rq(se);
>>>>> + } while (cfs_rq);
>>>>> ++
>>>>> + return task_of(se);
>>>>> +}
>>>>> +#endif
>>>>
>>>> One of my machines hangs when I run uperf with only one message:
>>>> [  719.034962] BUG: kernel NULL pointer dereference, address: 
>>>> 0050
>>>>
>>>> Then I replicated the problem on my another machine(no serial console),
>>>> here is the stack by manual copy.
>>>>
>>>> Call Trace:
>>>>  pick_next_entity+0xb0/0x160
>>>>  pick_task_fair+0x4b/0x90
>>>>  __schedule+0x59b/0x12f0
>>>>  schedule_idle+0x1e/0x40
>>>>  do_idle+0x193/0x2d0
>>>>  cpu_startup_entry+0x19/0x20
>>>>  start_secondary+0x110/0x150
>>>>  secondary_startup_64_no_verify+0xa6/0xab
>>>
>>> Interesting. Wondering if we screwed something up in the rebase.
>>>
>>> Questions:
>>> 1. Does the issue happen if you just apply only up until this patch,
>>> or the entire series?
>>
>> I applied the entire series and just find a related patch to report the
>> issue.
> 
> Ok.
> 
>>> 2. Do you see the issue in v7? Not much if at all has changed in this
>>> part of the code from v7 -> v8 but could be something in the newer
>>> kernel.
>>>
>>
>> IIRC, I can run uperf successfully on v7.
>> I'm on tip/master 2d3e8c9424c9 (origin/master) "Merge branch 'linus'."
>> Please let me know if this is a problem, or you have a repo I can pull
>> for testing.
> 
> Here is a repo with v8 series on top of v5.9 release:
> https://git.kernel.org/pub/scm/linux/kernel/git/jfern/linux.git/log/?h=coresched-v5.9

I didn't see the NULL pointer dereference BUG with this repo; I will post
performance data later.

Thanks,
-Aubrey


Re: [PATCH v8 -tip 02/26] sched: Introduce sched_class::pick_task()

2020-10-22 Thread Li, Aubrey
On 2020/10/22 23:25, Joel Fernandes wrote:
> On Thu, Oct 22, 2020 at 12:59 AM Li, Aubrey  wrote:
>>
>> On 2020/10/20 9:43, Joel Fernandes (Google) wrote:
>>> From: Peter Zijlstra 
>>>
>>> Because sched_class::pick_next_task() also implies
>>> sched_class::set_next_task() (and possibly put_prev_task() and
>>> newidle_balance) it is not state invariant. This makes it unsuitable
>>> for remote task selection.
>>>
>>> Tested-by: Julien Desfossez 
>>> Signed-off-by: Peter Zijlstra (Intel) 
>>> Signed-off-by: Vineeth Remanan Pillai 
>>> Signed-off-by: Julien Desfossez 
>>> Signed-off-by: Joel Fernandes (Google) 
>>> ---
>>>  kernel/sched/deadline.c  | 16 ++--
>>>  kernel/sched/fair.c  | 32 +++-
>>>  kernel/sched/idle.c  |  8 
>>>  kernel/sched/rt.c| 14 --
>>>  kernel/sched/sched.h |  3 +++
>>>  kernel/sched/stop_task.c | 13 +++--
>>>  6 files changed, 79 insertions(+), 7 deletions(-)
>>>
>>> diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
>>> index 814ec49502b1..0271a7848ab3 100644
>>> --- a/kernel/sched/deadline.c
>>> +++ b/kernel/sched/deadline.c
>>> @@ -1848,7 +1848,7 @@ static struct sched_dl_entity 
>>> *pick_next_dl_entity(struct rq *rq,
>>>   return rb_entry(left, struct sched_dl_entity, rb_node);
>>>  }
>>>
>>> -static struct task_struct *pick_next_task_dl(struct rq *rq)
>>> +static struct task_struct *pick_task_dl(struct rq *rq)
>>>  {
>>>   struct sched_dl_entity *dl_se;
>>>   struct dl_rq *dl_rq = &rq->dl;
>>> @@ -1860,7 +1860,18 @@ static struct task_struct *pick_next_task_dl(struct 
>>> rq *rq)
>>>   dl_se = pick_next_dl_entity(rq, dl_rq);
>>>   BUG_ON(!dl_se);
>>>   p = dl_task_of(dl_se);
>>> - set_next_task_dl(rq, p, true);
>>> +
>>> + return p;
>>> +}
>>> +
>>> +static struct task_struct *pick_next_task_dl(struct rq *rq)
>>> +{
>>> + struct task_struct *p;
>>> +
>>> + p = pick_task_dl(rq);
>>> + if (p)
>>> + set_next_task_dl(rq, p, true);
>>> +
>>>   return p;
>>>  }
>>>
>>> @@ -2517,6 +2528,7 @@ const struct sched_class dl_sched_class
>>>
>>>  #ifdef CONFIG_SMP
>>>   .balance= balance_dl,
>>> + .pick_task  = pick_task_dl,
>>>   .select_task_rq = select_task_rq_dl,
>>>   .migrate_task_rq= migrate_task_rq_dl,
>>>   .set_cpus_allowed   = set_cpus_allowed_dl,
>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>> index dbd9368a959d..bd6aed63f5e3 100644
>>> --- a/kernel/sched/fair.c
>>> +++ b/kernel/sched/fair.c
>>> @@ -4450,7 +4450,7 @@ pick_next_entity(struct cfs_rq *cfs_rq, struct 
>>> sched_entity *curr)
>>>* Avoid running the skip buddy, if running something else can
>>>* be done without getting too unfair.
>>>*/
>>> - if (cfs_rq->skip == se) {
>>> + if (cfs_rq->skip && cfs_rq->skip == se) {
>>>   struct sched_entity *second;
>>>
>>>   if (se == curr) {
>>> @@ -6976,6 +6976,35 @@ static void check_preempt_wakeup(struct rq *rq, 
>>> struct task_struct *p, int wake_
>>>   set_last_buddy(se);
>>>  }
>>>
>>> +#ifdef CONFIG_SMP
>>> +static struct task_struct *pick_task_fair(struct rq *rq)
>>> +{
>>> + struct cfs_rq *cfs_rq = &rq->cfs;
>>> + struct sched_entity *se;
>>> +
>>> + if (!cfs_rq->nr_running)
>>> + return NULL;
>>> +
>>> + do {
>>> + struct sched_entity *curr = cfs_rq->curr;
>>> +
>>> + se = pick_next_entity(cfs_rq, NULL);
>>> +
>>> + if (curr) {
>>> + if (se && curr->on_rq)
>>> + update_curr(cfs_rq);
>>> +
>>> + if (!se || entity_before(curr, se))
>>> + se = curr;
>>> + }
>>> +
>>> + cfs_rq = group_cfs_rq(se);
>>> + } while (cfs_rq);
>>> ++
>>> + return task_of(se);
>>> +}
>

Re: [PATCH v8 -tip 02/26] sched: Introduce sched_class::pick_task()

2020-10-22 Thread Li, Aubrey
On 2020/10/20 9:43, Joel Fernandes (Google) wrote:
> From: Peter Zijlstra 
> 
> Because sched_class::pick_next_task() also implies
> sched_class::set_next_task() (and possibly put_prev_task() and
> newidle_balance) it is not state invariant. This makes it unsuitable
> for remote task selection.
> 
> Tested-by: Julien Desfossez 
> Signed-off-by: Peter Zijlstra (Intel) 
> Signed-off-by: Vineeth Remanan Pillai 
> Signed-off-by: Julien Desfossez 
> Signed-off-by: Joel Fernandes (Google) 
> ---
>  kernel/sched/deadline.c  | 16 ++--
>  kernel/sched/fair.c  | 32 +++-
>  kernel/sched/idle.c  |  8 
>  kernel/sched/rt.c| 14 --
>  kernel/sched/sched.h |  3 +++
>  kernel/sched/stop_task.c | 13 +++--
>  6 files changed, 79 insertions(+), 7 deletions(-)
> 
> diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
> index 814ec49502b1..0271a7848ab3 100644
> --- a/kernel/sched/deadline.c
> +++ b/kernel/sched/deadline.c
> @@ -1848,7 +1848,7 @@ static struct sched_dl_entity 
> *pick_next_dl_entity(struct rq *rq,
>   return rb_entry(left, struct sched_dl_entity, rb_node);
>  }
>  
> -static struct task_struct *pick_next_task_dl(struct rq *rq)
> +static struct task_struct *pick_task_dl(struct rq *rq)
>  {
>   struct sched_dl_entity *dl_se;
>   struct dl_rq *dl_rq = &rq->dl;
> @@ -1860,7 +1860,18 @@ static struct task_struct *pick_next_task_dl(struct rq 
> *rq)
>   dl_se = pick_next_dl_entity(rq, dl_rq);
>   BUG_ON(!dl_se);
>   p = dl_task_of(dl_se);
> - set_next_task_dl(rq, p, true);
> +
> + return p;
> +}
> +
> +static struct task_struct *pick_next_task_dl(struct rq *rq)
> +{
> + struct task_struct *p;
> +
> + p = pick_task_dl(rq);
> + if (p)
> + set_next_task_dl(rq, p, true);
> +
>   return p;
>  }
>  
> @@ -2517,6 +2528,7 @@ const struct sched_class dl_sched_class
>  
>  #ifdef CONFIG_SMP
>   .balance= balance_dl,
> + .pick_task  = pick_task_dl,
>   .select_task_rq = select_task_rq_dl,
>   .migrate_task_rq= migrate_task_rq_dl,
>   .set_cpus_allowed   = set_cpus_allowed_dl,
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index dbd9368a959d..bd6aed63f5e3 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -4450,7 +4450,7 @@ pick_next_entity(struct cfs_rq *cfs_rq, struct 
> sched_entity *curr)
>* Avoid running the skip buddy, if running something else can
>* be done without getting too unfair.
>*/
> - if (cfs_rq->skip == se) {
> + if (cfs_rq->skip && cfs_rq->skip == se) {
>   struct sched_entity *second;
>  
>   if (se == curr) {
> @@ -6976,6 +6976,35 @@ static void check_preempt_wakeup(struct rq *rq, struct 
> task_struct *p, int wake_
>   set_last_buddy(se);
>  }
>  
> +#ifdef CONFIG_SMP
> +static struct task_struct *pick_task_fair(struct rq *rq)
> +{
> +	struct cfs_rq *cfs_rq = &rq->cfs;
> + struct sched_entity *se;
> +
> + if (!cfs_rq->nr_running)
> + return NULL;
> +
> + do {
> + struct sched_entity *curr = cfs_rq->curr;
> +
> + se = pick_next_entity(cfs_rq, NULL);
> +
> + if (curr) {
> + if (se && curr->on_rq)
> + update_curr(cfs_rq);
> +
> + if (!se || entity_before(curr, se))
> + se = curr;
> + }
> +
> + cfs_rq = group_cfs_rq(se);
> + } while (cfs_rq);
> +
> + return task_of(se);
> +}
> +#endif

One of my machines hangs when I run uperf, with only one message printed:
[  719.034962] BUG: kernel NULL pointer dereference, address: 0050

Then I replicated the problem on my another machine(no serial console),
here is the stack by manual copy.

Call Trace:
 pick_next_entity+0xb0/0x160
 pick_task_fair+0x4b/0x90
 __schedule+0x59b/0x12f0
 schedule_idle+0x1e/0x40
 do_idle+0x193/0x2d0
 cpu_startup_entry+0x19/0x20
 start_secondary+0x110/0x150
 secondary_startup_64_no_verify+0xa6/0xab

> +
>  struct task_struct *
>  pick_next_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags 
> *rf)
>  {
> @@ -11173,6 +11202,7 @@ const struct sched_class fair_sched_class
>  
>  #ifdef CONFIG_SMP
>   .balance= balance_fair,
> + .pick_task  = pick_task_fair,
>   .select_task_rq = select_task_rq_fair,
>   .migrate_task_rq= migrate_task_rq_fair,
>  
> diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
> index 8ce6e80352cf..ce7552c6bc65 100644
> --- a/kernel/sched/idle.c
> +++ b/kernel/sched/idle.c
> @@ -405,6 +405,13 @@ static void set_next_task_idle(struct rq *rq, struct 
> task_struct *next, bool fir
>   schedstat_inc(rq->sched_goidle);
>  }
>  
> +#ifdef CONFIG_SMP
> +static struct task_struct *pick_task_idle(struct rq *rq)
> +{
> + return rq->idle;
> +}
> +#endif

Re: [PATCH v8 -tip 13/26] kernel/entry: Add support for core-wide protection of kernel-mode

2020-10-21 Thread Li, Aubrey
On 2020/10/20 9:43, Joel Fernandes (Google) wrote:
> Core-scheduling prevents hyperthreads in usermode from attacking each
> other, but it does not do anything about one of the hyperthreads
> entering the kernel for any reason. This leaves the door open for MDS
> and L1TF attacks with concurrent execution sequences between
> hyperthreads.
> 
> This patch therefore adds support for protecting all syscall and IRQ
> kernel mode entries. Care is taken to track the outermost usermode exit
> and entry using per-cpu counters. In cases where one of the hyperthreads
> enter the kernel, no additional IPIs are sent. Further, IPIs are avoided
> when not needed - example: idle and non-cookie HTs do not need to be
> forced into kernel mode.
> 
> More information about attacks:
> For MDS, it is possible for syscalls, IRQ and softirq handlers to leak
> data to either host or guest attackers. For L1TF, it is possible to leak
> to guest attackers. There is no possible mitigation involving flushing
> of buffers to avoid this since the execution of attacker and victims
> happen concurrently on 2 or more HTs.
> 
> Cc: Julien Desfossez 
> Cc: Tim Chen 
> Cc: Aaron Lu 
> Cc: Aubrey Li 
> Cc: Tim Chen 
> Cc: Paul E. McKenney 
> Co-developed-by: Vineeth Pillai 
> Tested-by: Julien Desfossez 
> Signed-off-by: Vineeth Pillai 
> Signed-off-by: Joel Fernandes (Google) 
> ---
>  .../admin-guide/kernel-parameters.txt |   7 +
>  include/linux/entry-common.h  |   2 +-
>  include/linux/sched.h |  12 +
>  kernel/entry/common.c |  25 +-
>  kernel/sched/core.c   | 229 ++
>  kernel/sched/sched.h  |   3 +
>  6 files changed, 275 insertions(+), 3 deletions(-)
> 
> diff --git a/Documentation/admin-guide/kernel-parameters.txt 
> b/Documentation/admin-guide/kernel-parameters.txt
> index 3236427e2215..48567110f709 100644
> --- a/Documentation/admin-guide/kernel-parameters.txt
> +++ b/Documentation/admin-guide/kernel-parameters.txt
> @@ -4678,6 +4678,13 @@
>  
>   sbni=   [NET] Granch SBNI12 leased line adapter
>  
> + sched_core_protect_kernel=
> + [SCHED_CORE] Pause SMT siblings of a core running in
> + user mode, if at least one of the siblings of the core
> + is running in kernel mode. This is to guarantee that
> + kernel data is not leaked to tasks which are not trusted
> + by the kernel.
> +
>   sched_debug [KNL] Enables verbose scheduler debug messages.
>  
>   schedstats= [KNL,X86] Enable or disable scheduled statistics.
> diff --git a/include/linux/entry-common.h b/include/linux/entry-common.h
> index 474f29638d2c..260216de357b 100644
> --- a/include/linux/entry-common.h
> +++ b/include/linux/entry-common.h
> @@ -69,7 +69,7 @@
>  
>  #define EXIT_TO_USER_MODE_WORK   
> \
>   (_TIF_SIGPENDING | _TIF_NOTIFY_RESUME | _TIF_UPROBE |   \
> -  _TIF_NEED_RESCHED | _TIF_PATCH_PENDING |   \
> +  _TIF_NEED_RESCHED | _TIF_PATCH_PENDING | _TIF_UNSAFE_RET | \
>ARCH_EXIT_TO_USER_MODE_WORK)
>  
>  /**
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index d38e904dd603..fe6f225bfbf9 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -2071,4 +2071,16 @@ int sched_trace_rq_nr_running(struct rq *rq);
>  
>  const struct cpumask *sched_trace_rd_span(struct root_domain *rd);
>  
> +#ifdef CONFIG_SCHED_CORE
> +void sched_core_unsafe_enter(void);
> +void sched_core_unsafe_exit(void);
> +bool sched_core_wait_till_safe(unsigned long ti_check);
> +bool sched_core_kernel_protected(void);
> +#else
> +#define sched_core_unsafe_enter(ignore) do { } while (0)
> +#define sched_core_unsafe_exit(ignore) do { } while (0)
> +#define sched_core_wait_till_safe(ignore) do { } while (0)
> +#define sched_core_kernel_protected(ignore) do { } while (0)
> +#endif
> +
>  #endif
> diff --git a/kernel/entry/common.c b/kernel/entry/common.c
> index 0a1e20f8d4e8..c8dc6b1b1f40 100644
> --- a/kernel/entry/common.c
> +++ b/kernel/entry/common.c
> @@ -137,6 +137,26 @@ static __always_inline void exit_to_user_mode(void)
>  /* Workaround to allow gradual conversion of architecture code */
>  void __weak arch_do_signal(struct pt_regs *regs) { }
>  
> +unsigned long exit_to_user_get_work(void)
> +{
> + unsigned long ti_work = READ_ONCE(current_thread_info()->flags);
> +
> + if (IS_ENABLED(CONFIG_SCHED_CORE) && !sched_core_kernel_protected())
> + return

[RFC PATCH v3] sched/fair: select idle cpu from idle cpumask for task wakeup

2020-10-21 Thread Aubrey Li
From: Aubrey Li 

Added idle cpumask to track idle cpus in sched domain. When a CPU
enters idle, its corresponding bit in the idle cpumask will be set,
and when the CPU exits idle, its bit will be cleared.

When a task wakes up to select an idle cpu, scanning the idle cpumask
has a lower cost than scanning all the cpus in the last-level cache domain,
especially when the system is heavily loaded.

v2->v3:
- change setting idle cpumask to every idle entry, otherwise schbench
  has a regression of 99th percentile latency.
- change clearing idle cpumask to nohz_balancer_kick(), so updating
  idle cpumask is ratelimited in the idle exiting path.
- set SCHED_IDLE cpu in idle cpumask to allow it as a wakeup target.

v1->v2:
- idle cpumask is updated in the nohz routines, by initializing idle
  cpumask with sched_domain_span(sd), nohz=off case remains the original
  behavior.

Cc: Mel Gorman 
Cc: Vincent Guittot 
Cc: Qais Yousef 
Cc: Valentin Schneider 
Cc: Jiang Biao 
Cc: Tim Chen 
Signed-off-by: Aubrey Li 
---
 include/linux/sched/topology.h | 13 ++
 kernel/sched/fair.c| 45 +-
 kernel/sched/idle.c|  1 +
 kernel/sched/sched.h   |  1 +
 kernel/sched/topology.c|  3 ++-
 5 files changed, 61 insertions(+), 2 deletions(-)

diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index fb11091129b3..43a641d26154 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -65,8 +65,21 @@ struct sched_domain_shared {
	atomic_t	ref;
	atomic_t	nr_busy_cpus;
int has_idle_cores;
+   /*
+* Span of all idle CPUs in this domain.
+*
+* NOTE: this field is variable length. (Allocated dynamically
+* by attaching extra space to the end of the structure,
+* depending on how many CPUs the kernel has booted up with)
+*/
+   unsigned long   idle_cpus_span[];
 };
 
+static inline struct cpumask *sds_idle_cpus(struct sched_domain_shared *sds)
+{
+   return to_cpumask(sds->idle_cpus_span);
+}
+
 struct sched_domain {
/* These fields must be setup */
struct sched_domain __rcu *parent;  /* top domain must be null 
terminated */
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 6b3b59cc51d6..088d1995594f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6023,6 +6023,38 @@ void __update_idle_core(struct rq *rq)
rcu_read_unlock();
 }
 
+static DEFINE_PER_CPU(bool, cpu_idle_state);
+/*
+ * Update cpu idle state and record this information
+ * in sd_llc_shared->idle_cpus_span.
+ */
+void update_idle_cpumask(struct rq *rq, bool idle_state)
+{
+   struct sched_domain *sd;
+   int cpu = cpu_of(rq);
+
+   /*
+* No need to update idle cpumask if the state
+* does not change.
+*/
+   if (per_cpu(cpu_idle_state, cpu) == idle_state)
+   return;
+
+   per_cpu(cpu_idle_state, cpu) = idle_state;
+
+   rcu_read_lock();
+
+   sd = rcu_dereference(per_cpu(sd_llc, cpu));
+   if (!sd || !sd->shared)
+   goto unlock;
+   if (idle_state)
+   cpumask_set_cpu(cpu, sds_idle_cpus(sd->shared));
+   else
+   cpumask_clear_cpu(cpu, sds_idle_cpus(sd->shared));
+unlock:
+   rcu_read_unlock();
+}
+
 /*
  * Scan the entire LLC domain for idle cores; this dynamically switches off if
  * there are no idle cores left in the system; tracked through
@@ -6136,7 +6168,12 @@ static int select_idle_cpu(struct task_struct *p, struct 
sched_domain *sd, int t
 
time = cpu_clock(this);
 
-   cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
+   /*
+* sched_domain_shared is set only at shared cache level,
+* this works only because select_idle_cpu is called with
+* sd_llc.
+*/
+   cpumask_and(cpus, sds_idle_cpus(sd->shared), p->cpus_ptr);
 
for_each_cpu_wrap(cpu, cpus, target) {
if (!--nr)
@@ -10070,6 +10107,12 @@ static void nohz_balancer_kick(struct rq *rq)
if (unlikely(rq->idle_balance))
return;
 
+   /* The CPU is not in idle, update idle cpumask */
+   if (unlikely(sched_idle_cpu(cpu))) {
+   /* Allow SCHED_IDLE cpu as a wakeup target */
+   update_idle_cpumask(rq, true);
+   } else
+   update_idle_cpumask(rq, false);
/*
 * We may be recently in ticked or tickless idle mode. At the first
 * busy tick after returning from idle, we will update the busy stats.
diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index 1ae95b9150d3..ce1f929d7fbb 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -405,6 +405,7 @@ static void put_prev_task_idle(struct rq *rq, struct 
task_struct *prev)
 static void set_next_task_idle(struct rq *rq, struct task_struct *next, bool 
firs

Re: [RFC PATCH v2] sched/fair: select idle cpu from idle cpumask in sched domain

2020-09-27 Thread Li, Aubrey
On 2020/9/26 0:45, Vincent Guittot wrote:
> Le vendredi 25 sept. 2020 à 17:21:46 (+0800), Li, Aubrey a écrit :
>> Hi Vicent,
>>
>> On 2020/9/24 21:09, Vincent Guittot wrote:
>>>>>>
>>>>>> Would you mind share uperf(netperf load) result on your side? That's the
>>>>>> workload I have seen the most benefit this patch contributed under heavy
>>>>>> load level.
>>>>>
>>>>> with uperf, i've got the same kind of result as sched pipe
>>>>> tip/sched/core: Throughput 24.83Mb/s (+/- 0.09%)
>>>>> with this patch:  Throughput 19.02Mb/s (+/- 0.71%) which is a 23%
>>>>> regression as for sched pipe
>>>>>
>>>> In case this is caused by the logic error in this patch(sorry again), did
>>>> you see any improvement in patch V2? Though it does not helps for nohz=off
>>>> case, just want to know if it helps or does not help at all on arm 
>>>> platform.
>>>
>>> With the v2 which rate limit the update of the cpumask (but doesn't
>>> support sched_idle stask),  I don't see any performance impact:
>>
>> I agree we should go the way with cpumask update rate limited.
>>
>> And I think no performance impact for sched-pipe is expected, as this 
>> workload
>> has only 2 threads and the platform has 8 cores, so mostly previous cpu is
>> returned, and even if select_idle_sibling is called, select_idle_core is hit
>> and rarely call select_idle_cpu.
> 
> my platform is not smt so select_idle_core is nop. Nevertheless 
> select_idle_cpu
> is almost never called because prev is idle and selected before calling it in
> our case
> 
>>
>> But I'm more curious why there is 23% performance penalty? So for this 
>> patch, if
>> you revert this change but keep cpumask updated, is 23% penalty still there?
>>
>> -   cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
>> +   cpumask_and(cpus, sds_idle_cpus(sd->shared), p->cpus_ptr);
> 
> I was about to say that reverting this line should not change anything because
> we never reach this point but it does in fact. And after looking at a trace,
> I can see that the 2 threads of perf bench sched pipe are on the same CPU and
> that the sds_idle_cpus(sd->shared) is always empty. In fact, the rq->curr is
> not yet idle and still point to the cfs task when you call 
> update_idle_cpumask().
> This means that once cleared, the bit will never be set
> You can remove the test in update_idle_cpumask() which is called either when
> entering idle or when there is only sched_idle tasks that are runnable.
> 
> @@ -6044,8 +6044,7 @@ void update_idle_cpumask(struct rq *rq)
> sd = rcu_dereference(per_cpu(sd_llc, cpu));
> if (!sd || !sd->shared)
> goto unlock;
> -   if (!available_idle_cpu(cpu) || !sched_idle_cpu(cpu))
> -   goto unlock;
> +
> cpumask_set_cpu(cpu, sds_idle_cpus(sd->shared));
>  unlock:
> rcu_read_unlock();
> 
> With this fix, the performance decrease is only 2%
> 
>>
>> I just wonder if it's caused by the atomic ops as you have two cache domains 
>> with
>> sd_llc(?). Do you have a x86 machine to make a comparison? It's hard for me 
>> to find
>> an ARM machine but I'll try.
>>
>> Also, for uperf(task thread num = cpu num) workload, how is it on patch v2? 
>> no any
>> performance impact?
> 
> with v2 :  Throughput 24.97Mb/s (+/- 0.07%) so there is no perf regression
> 

Thanks Vincent, let me try to refine this patch.

-Aubrey


Re: [RFC PATCH v2] sched/fair: select idle cpu from idle cpumask in sched domain

2020-09-25 Thread Li, Aubrey
Hi Vicent,

On 2020/9/24 21:09, Vincent Guittot wrote:
>>>>
>>>> Would you mind share uperf(netperf load) result on your side? That's the
>>>> workload I have seen the most benefit this patch contributed under heavy
>>>> load level.
>>>
>>> with uperf, i've got the same kind of result as sched pipe
>>> tip/sched/core: Throughput 24.83Mb/s (+/- 0.09%)
>>> with this patch:  Throughput 19.02Mb/s (+/- 0.71%) which is a 23%
>>> regression as for sched pipe
>>>
>> In case this is caused by the logic error in this patch(sorry again), did
>> you see any improvement in patch V2? Though it does not helps for nohz=off
>> case, just want to know if it helps or does not help at all on arm platform.
> 
> With the v2 which rate limit the update of the cpumask (but doesn't
> support sched_idle stask),  I don't see any performance impact:

I agree we should go with rate-limited cpumask updates.

And I think no performance impact for sched-pipe is expected, as this workload
has only 2 threads and the platform has 8 cores, so mostly the previous cpu is
returned; even if select_idle_sibling is called, select_idle_core is hit and
select_idle_cpu is rarely called.

But I'm more curious why there is 23% performance penalty? So for this patch, if
you revert this change but keep cpumask updated, is 23% penalty still there?

-   cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
+   cpumask_and(cpus, sds_idle_cpus(sd->shared), p->cpus_ptr);

I just wonder if it's caused by the atomic ops as you have two cache domains 
with
sd_llc(?). Do you have an x86 machine to make a comparison? It's hard for me to 
find
an ARM machine but I'll try.

Also, for the uperf (task thread count = cpu count) workload, how is it on
patch v2? No performance impact at all?

Thanks,
-Aubrey


Re: [RFC PATCH v2] sched/fair: select idle cpu from idle cpumask in sched domain

2020-09-17 Thread Li, Aubrey
On 2020/9/16 19:00, Mel Gorman wrote:
> On Wed, Sep 16, 2020 at 12:31:03PM +0800, Aubrey Li wrote:
>> Added idle cpumask to track idle cpus in sched domain. When a CPU
>> enters idle, its corresponding bit in the idle cpumask will be set,
>> and when the CPU exits idle, its bit will be cleared.
>>
>> When a task wakes up to select an idle cpu, scanning idle cpumask
>> has low cost than scanning all the cpus in last level cache domain,
>> especially when the system is heavily loaded.
>>
>> The following benchmarks were tested on a x86 4 socket system with
>> 24 cores per socket and 2 hyperthreads per core, total 192 CPUs:
>>
> 
> This still appears to be tied to turning the tick off. An idle CPU
> available for computation does not necessarily have the tick turned off
> if it's for short periods of time. When nohz is disabled or a machine is
> active enough that CPUs are not disabling the tick, select_idle_cpu may
> fail to select an idle CPU and instead stack tasks on the old CPU.
> 
> The other subtlety is that select_idle_sibling() currently allows a
> SCHED_IDLE cpu to be used as a wakeup target. The CPU is not really
> idle as such, it's simply running a low priority task that is suitable
> for preemption. I suspect this patch breaks that.
> 
Thanks!

I shall post a v3 with performance data. I did a quick uperf test and found
the benefit is still there, so I'm posting the patch here and looking forward
to your comments before I start the full benchmarks.

Thanks,
-Aubrey

---
diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index fb11091129b3..43a641d26154 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -65,8 +65,21 @@ struct sched_domain_shared {
	atomic_t	ref;
	atomic_t	nr_busy_cpus;
int has_idle_cores;
+   /*
+* Span of all idle CPUs in this domain.
+*
+* NOTE: this field is variable length. (Allocated dynamically
+* by attaching extra space to the end of the structure,
+* depending on how many CPUs the kernel has booted up with)
+*/
+   unsigned long   idle_cpus_span[];
 };
 
+static inline struct cpumask *sds_idle_cpus(struct sched_domain_shared *sds)
+{
+   return to_cpumask(sds->idle_cpus_span);
+}
+
 struct sched_domain {
/* These fields must be setup */
struct sched_domain __rcu *parent;  /* top domain must be null 
terminated */
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 6b3b59cc51d6..9a3c82645472 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6023,6 +6023,26 @@ void __update_idle_core(struct rq *rq)
rcu_read_unlock();
 }
 
+/*
+ * Update cpu idle state and record this information
+ * in sd_llc_shared->idle_cpus_span.
+ */
+void update_idle_cpumask(struct rq *rq)
+{
+   struct sched_domain *sd;
+   int cpu = cpu_of(rq);
+
+   rcu_read_lock();
+   sd = rcu_dereference(per_cpu(sd_llc, cpu));
+   if (!sd || !sd->shared)
+   goto unlock;
+   if (!available_idle_cpu(cpu) || !sched_idle_cpu(cpu))
+   goto unlock;
+   cpumask_set_cpu(cpu, sds_idle_cpus(sd->shared));
+unlock:
+   rcu_read_unlock();
+}
+
 /*
  * Scan the entire LLC domain for idle cores; this dynamically switches off if
  * there are no idle cores left in the system; tracked through
@@ -6136,7 +6156,12 @@ static int select_idle_cpu(struct task_struct *p, struct 
sched_domain *sd, int t
 
time = cpu_clock(this);
 
-   cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
+   /*
+* sched_domain_shared is set only at shared cache level,
+* this works only because select_idle_cpu is called with
+* sd_llc.
+*/
+   cpumask_and(cpus, sds_idle_cpus(sd->shared), p->cpus_ptr);
 
for_each_cpu_wrap(cpu, cpus, target) {
if (!--nr)
@@ -6712,6 +6737,10 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, 
int sd_flag, int wake_f
 
if (want_affine)
current->recent_used_cpu = cpu;
+
+   sd = rcu_dereference(per_cpu(sd_llc, new_cpu));
+   if (sd && sd->shared)
+   cpumask_clear_cpu(new_cpu, sds_idle_cpus(sd->shared));
}
rcu_read_unlock();
 
@@ -10871,6 +10900,9 @@ static void set_next_task_fair(struct rq *rq, struct 
task_struct *p, bool first)
/* ensure bandwidth has been allocated on our new cfs_rq */
account_cfs_rq_runtime(cfs_rq, 0);
}
+   /* Update idle cpumask if task has idle policy */
+   if (unlikely(task_has_idle_policy(p)))
+   update_idle_cpumask(rq);
 }
 
 void init_cfs_rq(struct cfs_rq *cfs_rq)
diff --git a/

Re: [RFC PATCH v7 11/23] sched/fair: core wide cfs task priority comparison

2020-09-16 Thread Li, Aubrey
On 2020/9/17 4:53, chris hyser wrote:
> On 9/16/20 10:24 AM, chris hyser wrote:
>> On 9/16/20 8:57 AM, Li, Aubrey wrote:
>>>> Here are the uperf results of the various patchsets. Note, that disabling 
>>>> smt is better for these tests and that that presumably reflects the 
>>>> overall overhead of core scheduling which went from bad to really bad. The 
>>>> primary focus in this email is to start to understand what happened within 
>>>> core sched itself.
>>>>
>>>> patchset  smt=on/cs=off  smt=off    smt=on/cs=on
>>>> 
>>>> v5-v5.6.y  :    1.78Gb/s 1.57Gb/s 1.07Gb/s
>>>> pre-v6-v5.6.y  :    1.75Gb/s 1.55Gb/s    822.16Mb/s
>>>> v6-5.7 :    1.87Gs/s 1.56Gb/s    561.6Mb/s
>>>> v6-5.7-hotplug :    1.75Gb/s 1.58Gb/s    438.21Mb/s
>>>> v7 :    1.80Gb/s 1.61Gb/s    440.44Mb/s
>>>
>>> I haven't had a chance to play with v7, but I got something different.
>>>
>>>    branch    smt=on/cs=on
>>> coresched/v5-v5.6.y    1.09Gb/s
>>> coresched/v6-v5.7.y    1.05Gb/s
>>>
>>> I attached my kernel config in case you want to make a comparison, or you
>>> can send yours, I'll try to see I can replicate your result.
>>
>> I will give this config a try. One of the reports forwarded to me about the 
>> drop in uperf perf was an email from you I believe mentioning a 50% perf 
>> drop between v5 and v6?? I was actually setting out to duplicate your 
>> results. :-)
> 
> The first thing I did was to verify I built and tested the right bits. 
> Presumably as I get same numbers. I'm still trying to tweak your config to 
> get a root disk in my setup. Oh, one thing I missed in reading your first 
> response, I had 24 cores/48 cpus. I think you had half that, though my guess 
> is that that should have actually made the numbers even worse. :-)
> 
> The following was forwarded to me originally sent on Aug 3, by you I believe:
> 
>> We found uperf(in cgroup) throughput drops by ~50% with corescheduling.
>>
>> The problem is, uperf triggered a lot of softirq and offloaded softirq
>> service to *ksoftirqd* thread.
>>
>> - default, ksoftirqd thread can run with uperf on the same core, we saw
>>   100% CPU utilization.
>> - coresched enabled, ksoftirqd's core cookie is different from uperf, so
>>   they can't run concurrently on the same core, we saw ~15% forced idle.
>>
>> I guess this kind of performance drop can be replicated by other similar
>> (a lot of softirq activities) workloads.
>>
>> Currently core scheduler picks cookie-match tasks for all SMT siblings, does
>> it make sense we add a policy to allow cookie-compatible task running 
>> together?
>> For example, if a task is trusted(set by admin), it can work with kernel 
>> thread.
>> The difference from corescheduling disabled is that we still have user to 
>> user
>> isolation.
>>
>> Thanks,
>> -Aubrey
> 
> Would you please elaborate on what this test was? In trying to duplicate 
> this, I just kept adding uperf threads to my setup until I started to see 
> performance losses similar to what is reported above (and a second report 
> about v7). Also, I wasn't looking for absolute numbers per-se, just 
> significant enough differences to try to track where the performance went.
> 

This test compared smt-on/cs-on against smt-on/cs-off on the same
core-scheduling version; we didn't see as big a regression between different
versions as you encountered.

Thanks,
-Aubrey


[RFC PATCH v2] sched/fair: select idle cpu from idle cpumask in sched domain

2020-09-15 Thread Aubrey Li
Added idle cpumask to track idle cpus in sched domain. When a CPU
enters idle, its corresponding bit in the idle cpumask will be set,
and when the CPU exits idle, its bit will be cleared.

When a task wakes up to select an idle cpu, scanning the idle cpumask
has a lower cost than scanning all the cpus in the last-level cache domain,
especially when the system is heavily loaded.

The following benchmarks were tested on a x86 4 socket system with
24 cores per socket and 2 hyperthreads per core, total 192 CPUs:

uperf throughput: netperf workload, tcp_nodelay, r/w size = 90

  threads   baseline-avg   %std    patch-avg   %std
  96        1              1.24    0.98        2.76
  144       1              1.13    1.35        4.01
  192       1              0.58    1.67        3.25
  240       1              2.49    1.68        3.55

hackbench: process mode, 10 loops, 40 file descriptors per group

  group     baseline-avg   %std    patch-avg   %std
  2(80)     1              12.05   0.97        9.88
  3(120)    1              12.48   0.95        11.62
  4(160)    1              13.83   0.97        13.22
  5(200)    1              2.76    1.01        2.94

schbench: 99th percentile latency, 16 workers per message thread

  mthread   baseline-avg   %std    patch-avg   %std
  6(96)     1              1.24    0.993       1.73
  9(144)    1              0.38    0.998       0.39
  12(192)   1              1.58    0.995       1.64
  15(240)   1              51.71   0.606       37.41

sysbench mysql throughput: read/write, table size = 10,000,000

  thread    baseline-avg   %std    patch-avg   %std
  96        1              1.77    1.015       1.71
  144       1              3.39    0.998       4.05
  192       1              2.88    1.002       2.81
  240       1              2.07    1.011       2.09

kbuild: kexec reboot every time

  baseline-avg  patch-avg
  1 1

v1->v2:
- idle cpumask is updated in the nohz routines, by initializing idle
  cpumask with sched_domain_span(sd), nohz=off case remains the original
  behavior.

Cc: Qais Yousef 
Cc: Valentin Schneider 
Cc: Jiang Biao 
Cc: Tim Chen 
Signed-off-by: Aubrey Li 
---
 include/linux/sched/topology.h | 13 +
 kernel/sched/fair.c|  9 -
 kernel/sched/topology.c|  3 ++-
 3 files changed, 23 insertions(+), 2 deletions(-)

diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index fb11091129b3..43a641d26154 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -65,8 +65,21 @@ struct sched_domain_shared {
        atomic_t        ref;
        atomic_t        nr_busy_cpus;
        int             has_idle_cores;
+   /*
+* Span of all idle CPUs in this domain.
+*
+* NOTE: this field is variable length. (Allocated dynamically
+* by attaching extra space to the end of the structure,
+* depending on how many CPUs the kernel has booted up with)
+*/
+   unsigned long   idle_cpus_span[];
 };
 
+static inline struct cpumask *sds_idle_cpus(struct sched_domain_shared *sds)
+{
+   return to_cpumask(sds->idle_cpus_span);
+}
+
 struct sched_domain {
/* These fields must be setup */
        struct sched_domain __rcu *parent;      /* top domain must be null terminated */
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 6b3b59cc51d6..cfe78fcf69da 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6136,7 +6136,12 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
 
time = cpu_clock(this);
 
-   cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
+   /*
+* sched_domain_shared is set only at shared cache level,
+* this works only because select_idle_cpu is called with
+* sd_llc.
+*/
+   cpumask_and(cpus, sds_idle_cpus(sd->shared), p->cpus_ptr);
 
for_each_cpu_wrap(cpu, cpus, target) {
if (!--nr)
@@ -10182,6 +10187,7 @@ static void set_cpu_sd_state_busy(int cpu)
sd->nohz_idle = 0;
 
        atomic_inc(&sd->shared->nr_busy_cpus);
+   cpumask_clear_cpu(cpu, sds_idle_cpus(sd->shared));
 unlock:
rcu_read_unlock();
 }
@@ -10212,6 +10218,7 @@ static void set_cpu_sd_state_idle(int cpu)
sd->nohz_idle = 1;
 
        atomic_dec(&sd->shared->nr_busy_cpus);
+   cpumask_set_cpu(cpu, sds_idle_cpus(sd->shared));
 unlock:
rcu_read_unlock();
 }
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 9079d865a935..f14a6ef4de57 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1407,6 +1407,7 @@ sd_init(struct sched_domain_topology_level *tl,
sd->shared = *per_cpu_ptr(sdd->sds, sd_id);
atomic_
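
A minimal sketch of the sd_init() seeding described in the v1->v2 note,
assuming the shared idle cpumask is filled from sched_domain_span(sd) so
that the nohz=off case (where the mask is never updated) keeps the original
full-LLC scan; this is an illustrative reconstruction, not necessarily the
exact posted hunk:

        if (sd->flags & SD_SHARE_PKG_RESOURCES) {
                sd->shared = *per_cpu_ptr(sdd->sds, sd_id);
                atomic_inc(&sd->shared->ref);
                atomic_set(&sd->shared->nr_busy_cpus, sd_weight);
                /* Seed with the full domain span: if the nohz update
                 * paths never run, select_idle_cpu() still sees every
                 * CPU in the LLC, as before. */
                cpumask_copy(sds_idle_cpus(sd->shared), sched_domain_span(sd));
        }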

Re: [RFC PATCH v1 1/1] sched/fair: select idle cpu from idle cpumask in sched domain

2020-09-15 Thread Li, Aubrey
On 2020/9/15 17:23, Vincent Guittot wrote:
> On Tue, 15 Sep 2020 at 10:47, Jiang Biao  wrote:
>>
>> Hi, Vincent
>>
>> On Mon, 14 Sep 2020 at 20:26, Vincent Guittot
>>  wrote:
>>>
>>> On Sun, 13 Sep 2020 at 05:59, Jiang Biao  wrote:
>>>>
>>>> Hi, Aubrey
>>>>
>>>> On Fri, 11 Sep 2020 at 23:48, Aubrey Li  wrote:
>>>>>
>>>>> Added idle cpumask to track idle cpus in sched domain. When a CPU
>>>>> enters idle, its corresponding bit in the idle cpumask will be set,
>>>>> and when the CPU exits idle, its bit will be cleared.
>>>>>
>>>>> When a task wakes up to select an idle cpu, scanning idle cpumask
>>>>> has low cost than scanning all the cpus in last level cache domain,
>>>>> especially when the system is heavily loaded.
>>>>>
>>>>> Signed-off-by: Aubrey Li 
>>>>> ---
>>>>>  include/linux/sched/topology.h | 13 +
>>>>>  kernel/sched/fair.c|  4 +++-
>>>>>  kernel/sched/topology.c|  2 +-
>>>>>  3 files changed, 17 insertions(+), 2 deletions(-)
>>>>>
>>>>> diff --git a/include/linux/sched/topology.h 
>>>>> b/include/linux/sched/topology.h
>>>>> index fb11091129b3..43a641d26154 100644
>>>>> --- a/include/linux/sched/topology.h
>>>>> +++ b/include/linux/sched/topology.h
>>>>> @@ -65,8 +65,21 @@ struct sched_domain_shared {
>>>>> atomic_t        ref;
>>>>> atomic_t        nr_busy_cpus;
>>>>> int has_idle_cores;
>>>>> +   /*
>>>>> +* Span of all idle CPUs in this domain.
>>>>> +*
>>>>> +* NOTE: this field is variable length. (Allocated dynamically
>>>>> +* by attaching extra space to the end of the structure,
>>>>> +* depending on how many CPUs the kernel has booted up with)
>>>>> +*/
>>>>> +   unsigned long   idle_cpus_span[];
>>>>>  };
>>>>>
>>>>> +static inline struct cpumask *sds_idle_cpus(struct sched_domain_shared 
>>>>> *sds)
>>>>> +{
>>>>> +   return to_cpumask(sds->idle_cpus_span);
>>>>> +}
>>>>> +
>>>>>  struct sched_domain {
>>>>> /* These fields must be setup */
>>>>> struct sched_domain __rcu *parent;  /* top domain must be 
>>>>> null terminated */
>>>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>>>> index 6b3b59cc51d6..3b6f8a3589be 100644
>>>>> --- a/kernel/sched/fair.c
>>>>> +++ b/kernel/sched/fair.c
>>>>> @@ -6136,7 +6136,7 @@ static int select_idle_cpu(struct task_struct *p, 
>>>>> struct sched_domain *sd, int t
>>>>>
>>>>> time = cpu_clock(this);
>>>>>
>>>>> -   cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
>>>>> +   cpumask_and(cpus, sds_idle_cpus(sd->shared), p->cpus_ptr);
>>>> Is the sds_idle_cpus() always empty if nohz=off?
>>>
>>> Good point
>>>
>>>> Do we need to initialize the idle_cpus_span with sched_domain_span(sd)?
>>>>
>>>>>
>>>>> for_each_cpu_wrap(cpu, cpus, target) {
>>>>> if (!--nr)
>>>>> @@ -10182,6 +10182,7 @@ static void set_cpu_sd_state_busy(int cpu)
>>>>> sd->nohz_idle = 0;
>>>>>
>>>>> atomic_inc(&sd->shared->nr_busy_cpus);
>>>>> +   cpumask_clear_cpu(cpu, sds_idle_cpus(sd->shared));
>>>>>  unlock:
>>>>> rcu_read_unlock();
>>>>>  }
>>>>> @@ -10212,6 +10213,7 @@ static void set_cpu_sd_state_idle(int cpu)
>>>>> sd->nohz_idle = 1;
>>>>>
>>>>> atomic_dec(&sd->shared->nr_busy_cpus);
>>>>> +   cpumask_set_cpu(cpu, sds_idle_cpus(sd->shared));
>>>> This only works when entering/exiting tickless mode? :)
>>>> Why not update idle_cpus_span during tick_nohz_idle_enter()/exit()?
>>>
>>> set_cpu_sd_state_busy is only called during a tick in order to limit
>>> the rate of the update to once per tick per cpu at most and prevents
>>> any kind of storm of update if short running tasks wake/sleep all the
>>> time. We don't want to update a cpumask at each and every enter/leave
>>> idle.
>>>
>> Agree. But set_cpu_sd_state_busy seems not being reached when
>> nohz=off, which means it will not work for that case? :)
> 
> Yes set_cpu_sd_state_idle/busy are nohz function

Thanks, Biao, for pointing this out.

If the shared idle cpumask is initialized with sched_domain_span(sd),
then the nohz=off case will keep the previous behavior.

Thanks,
-Aubrey


Re: [RFC PATCH v1 1/1] sched/fair: select idle cpu from idle cpumask in sched domain

2020-09-11 Thread Li, Aubrey
On 2020/9/12 7:04, Li, Aubrey wrote:
> On 2020/9/12 0:28, Qais Yousef wrote:
>> On 09/10/20 13:42, Aubrey Li wrote:
>>> Added idle cpumask to track idle cpus in sched domain. When a CPU
>>> enters idle, its corresponding bit in the idle cpumask will be set,
>>> and when the CPU exits idle, its bit will be cleared.
>>>
>>> When a task wakes up to select an idle cpu, scanning idle cpumask
>>> has low cost than scanning all the cpus in last level cache domain,
>>> especially when the system is heavily loaded.
>>>
>>> Signed-off-by: Aubrey Li 
>>> ---
>>>  include/linux/sched/topology.h | 13 +
>>>  kernel/sched/fair.c|  4 +++-
>>>  kernel/sched/topology.c|  2 +-
>>>  3 files changed, 17 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
>>> index fb11091129b3..43a641d26154 100644
>>> --- a/include/linux/sched/topology.h
>>> +++ b/include/linux/sched/topology.h
>>> @@ -65,8 +65,21 @@ struct sched_domain_shared {
>>> atomic_t        ref;
>>> atomic_t        nr_busy_cpus;
>>> int has_idle_cores;
>>> +   /*
>>> +* Span of all idle CPUs in this domain.
>>> +*
>>> +* NOTE: this field is variable length. (Allocated dynamically
>>> +* by attaching extra space to the end of the structure,
>>> +* depending on how many CPUs the kernel has booted up with)
>>> +*/
>>> +   unsigned long   idle_cpus_span[];
>>
>> Can't you use cpumask_var_t and zalloc_cpumask_var() instead?
> 
> I can use the existing free code. Do we have a problem of this?
> 
>>
>> The patch looks useful. Did it help you with any particular workload? It'd be
>> good to expand on that in the commit message.
>>
> Odd, that included in patch v1 0/1, did you receive it?

I found it here:

https://lkml.org/lkml/2020/9/11/645

> 
> Thanks,
> -Aubrey
> 



Re: [RFC PATCH v1 1/1] sched/fair: select idle cpu from idle cpumask in sched domain

2020-09-11 Thread Li, Aubrey
On 2020/9/12 0:28, Qais Yousef wrote:
> On 09/10/20 13:42, Aubrey Li wrote:
>> Added idle cpumask to track idle cpus in sched domain. When a CPU
>> enters idle, its corresponding bit in the idle cpumask will be set,
>> and when the CPU exits idle, its bit will be cleared.
>>
>> When a task wakes up to select an idle cpu, scanning idle cpumask
>> has low cost than scanning all the cpus in last level cache domain,
>> especially when the system is heavily loaded.
>>
>> Signed-off-by: Aubrey Li 
>> ---
>>  include/linux/sched/topology.h | 13 +
>>  kernel/sched/fair.c|  4 +++-
>>  kernel/sched/topology.c|  2 +-
>>  3 files changed, 17 insertions(+), 2 deletions(-)
>>
>> diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
>> index fb11091129b3..43a641d26154 100644
>> --- a/include/linux/sched/topology.h
>> +++ b/include/linux/sched/topology.h
>> @@ -65,8 +65,21 @@ struct sched_domain_shared {
>>  atomic_t        ref;
>>  atomic_t        nr_busy_cpus;
>>  int has_idle_cores;
>> +/*
>> + * Span of all idle CPUs in this domain.
>> + *
>> + * NOTE: this field is variable length. (Allocated dynamically
>> + * by attaching extra space to the end of the structure,
>> + * depending on how many CPUs the kernel has booted up with)
>> + */
>> +unsigned long   idle_cpus_span[];
> 
> Can't you use cpumask_var_t and zalloc_cpumask_var() instead?

This way I can use the existing free code. Is there a problem with this?

> 
> The patch looks useful. Did it help you with any particular workload? It'd be
> good to expand on that in the commit message.
> 
Odd, that was included in patch v1 0/1; did you receive it?

Thanks,
-Aubrey
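
For comparison, a rough sketch of the cpumask_var_t alternative suggested
above (illustrative only, not part of the posted patch): the mask becomes a
separately allocated member, with an explicit allocation next to the
sched_domain_shared allocation and a matching free_cpumask_var() on the
teardown path.

        struct sched_domain_shared {
                atomic_t        ref;
                atomic_t        nr_busy_cpus;
                int             has_idle_cores;
                cpumask_var_t   idle_cpus_span;
        };

        /* e.g. in __sdt_alloc(), after sds has been allocated: */
        if (!zalloc_cpumask_var_node(&sds->idle_cpus_span,
                                     GFP_KERNEL, cpu_to_node(j)))
                return -ENOMEM;

The flexible-array form used in the patch avoids the extra allocation and
separate free, at the cost of adding cpumask_size() to the kzalloc_node()
size.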


[RFC PATCH v1 0/1] select idle cpu from idle cpumask in sched domain

2020-09-11 Thread Aubrey Li
I'm writing to see if it makes sense to track idle CPUs in a shared cpumask
in the sched domain. When a task wakes up, it can then select an idle CPU from
this cpumask instead of scanning all the CPUs in the last level cache domain;
especially when the system is heavily loaded, the scanning cost can be
significantly reduced. The price is that atomic cpumask ops are added to the
idle entry and exit paths.
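
Distilled from the accompanying patch, the idea in short: the wakeup path
ANDs against the tracked idle mask instead of the whole LLC span, and the
cost is one atomic cpumask update per idle transition.

        /* wakeup path (select_idle_cpu): scan only the tracked idle CPUs */
        cpumask_and(cpus, sds_idle_cpus(sd->shared), p->cpus_ptr);

        /* idle entry/exit at the LLC level: one atomic bitmap op each way */
        cpumask_set_cpu(cpu, sds_idle_cpus(sd->shared));   /* CPU goes idle  */
        cpumask_clear_cpu(cpu, sds_idle_cpus(sd->shared)); /* CPU gets busy  */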

I tested the following benchmarks on an x86 4-socket system with 24 cores
per socket and 2 hyperthreads per core, 192 CPUs in total:

uperf throughput: netperf workload, tcp_nodelay, r/w size = 90

  threads   baseline-avg   %std    patch-avg   %std
  96        1              1.24    0.98        2.76
  144       1              1.13    1.35        4.01
  192       1              0.58    1.67        3.25
  240       1              2.49    1.68        3.55

hackbench: process mode, 10 loops, 40 file descriptors per group

  group     baseline-avg   %std    patch-avg   %std
  2(80)     1              12.05   0.97        9.88
  3(120)    1              12.48   0.95        11.62
  4(160)    1              13.83   0.97        13.22
  5(200)    1              2.76    1.01        2.94

schbench: 99th percentile latency, 16 workers per message thread

  mthread   baseline-avg   %std    patch-avg   %std
  6(96)     1              1.24    0.993       1.73
  9(144)    1              0.38    0.998       0.39
  12(192)   1              1.58    0.995       1.64
  15(240)   1              51.71   0.606       37.41

sysbench mysql throughput: read/write, table size = 10,000,000

  thread    baseline-avg   %std    patch-avg   %std
  96        1              1.77    1.015       1.71
  144       1              3.39    0.998       4.05
  192       1              2.88    1.002       2.81
  240       1              2.07    1.011       2.09

kbuild: kexec reboot every time

  baseline-avg   patch-avg
  1              1

Any suggestions are highly appreciated!

Thanks,
-Aubrey

Aubrey Li (1):
  sched/fair: select idle cpu from idle cpumask in sched domain

 include/linux/sched/topology.h | 13 +
 kernel/sched/fair.c|  4 +++-
 kernel/sched/topology.c|  2 +-
 3 files changed, 17 insertions(+), 2 deletions(-)

-- 
2.25.1



[RFC PATCH v1 1/1] sched/fair: select idle cpu from idle cpumask in sched domain

2020-09-11 Thread Aubrey Li
Add an idle cpumask to track idle CPUs in the sched domain. When a CPU
enters idle, its corresponding bit in the idle cpumask will be set,
and when the CPU exits idle, its bit will be cleared.

When a task wakes up to select an idle CPU, scanning the idle cpumask
has a lower cost than scanning all the CPUs in the last level cache
domain, especially when the system is heavily loaded.

Signed-off-by: Aubrey Li 
---
 include/linux/sched/topology.h | 13 +
 kernel/sched/fair.c|  4 +++-
 kernel/sched/topology.c|  2 +-
 3 files changed, 17 insertions(+), 2 deletions(-)

diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index fb11091129b3..43a641d26154 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -65,8 +65,21 @@ struct sched_domain_shared {
        atomic_t        ref;
        atomic_t        nr_busy_cpus;
        int             has_idle_cores;
+   /*
+* Span of all idle CPUs in this domain.
+*
+* NOTE: this field is variable length. (Allocated dynamically
+* by attaching extra space to the end of the structure,
+* depending on how many CPUs the kernel has booted up with)
+*/
+   unsigned long   idle_cpus_span[];
 };
 
+static inline struct cpumask *sds_idle_cpus(struct sched_domain_shared *sds)
+{
+   return to_cpumask(sds->idle_cpus_span);
+}
+
 struct sched_domain {
/* These fields must be setup */
        struct sched_domain __rcu *parent;      /* top domain must be null terminated */
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 6b3b59cc51d6..3b6f8a3589be 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6136,7 +6136,7 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
 
time = cpu_clock(this);
 
-   cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
+   cpumask_and(cpus, sds_idle_cpus(sd->shared), p->cpus_ptr);
 
for_each_cpu_wrap(cpu, cpus, target) {
if (!--nr)
@@ -10182,6 +10182,7 @@ static void set_cpu_sd_state_busy(int cpu)
sd->nohz_idle = 0;
 
        atomic_inc(&sd->shared->nr_busy_cpus);
+   cpumask_clear_cpu(cpu, sds_idle_cpus(sd->shared));
 unlock:
rcu_read_unlock();
 }
@@ -10212,6 +10213,7 @@ static void set_cpu_sd_state_idle(int cpu)
sd->nohz_idle = 1;
 
        atomic_dec(&sd->shared->nr_busy_cpus);
+   cpumask_set_cpu(cpu, sds_idle_cpus(sd->shared));
 unlock:
rcu_read_unlock();
 }
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 9079d865a935..92d0aeef86bf 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1769,7 +1769,7 @@ static int __sdt_alloc(const struct cpumask *cpu_map)
 
*per_cpu_ptr(sdd->sd, j) = sd;
 
-   sds = kzalloc_node(sizeof(struct sched_domain_shared),
+       sds = kzalloc_node(sizeof(struct sched_domain_shared) + cpumask_size(),
GFP_KERNEL, cpu_to_node(j));
if (!sds)
return -ENOMEM;
-- 
2.25.1



Re: [RFC PATCH 00/16] Core scheduling v6(Internet mail)

2020-08-13 Thread Li, Aubrey
On 2020/8/14 12:04, benbjiang(蒋彪) wrote:
> 
> 
>> On Aug 14, 2020, at 9:36 AM, Li, Aubrey  wrote:
>>
>> On 2020/8/14 8:26, benbjiang(蒋彪) wrote:
>>>
>>>
>>>> On Aug 13, 2020, at 12:28 PM, Li, Aubrey  wrote:
>>>>
>>>> On 2020/8/13 7:08, Joel Fernandes wrote:
>>>>> On Wed, Aug 12, 2020 at 10:01:24AM +0800, Li, Aubrey wrote:
>>>>>> Hi Joel,
>>>>>>
>>>>>> On 2020/8/10 0:44, Joel Fernandes wrote:
>>>>>>> Hi Aubrey,
>>>>>>>
>>>>>>> Apologies for replying late as I was still looking into the details.
>>>>>>>
>>>>>>> On Wed, Aug 05, 2020 at 11:57:20AM +0800, Li, Aubrey wrote:
>>>>>>> [...]
>>>>>>>> +/*
>>>>>>>> + * Core scheduling policy:
>>>>>>>> + * - CORE_SCHED_DISABLED: core scheduling is disabled.
>>>>>>>> + * - CORE_COOKIE_MATCH: tasks with same cookie can run
>>>>>>>> + * on the same core concurrently.
>>>>>>>> + * - CORE_COOKIE_TRUST: trusted task can run with kernel
>>>>>>>>thread on the same core concurrently. 
>>>>>>>> + * - CORE_COOKIE_LONELY: tasks with cookie can run only
>>>>>>>> + * with idle thread on the same core.
>>>>>>>> + */
>>>>>>>> +enum coresched_policy {
>>>>>>>> +   CORE_SCHED_DISABLED,
>>>>>>>> +   CORE_SCHED_COOKIE_MATCH,
>>>>>>>> +  CORE_SCHED_COOKIE_TRUST,
>>>>>>>> +   CORE_SCHED_COOKIE_LONELY,
>>>>>>>> +};
>>>>>>>>
>>>>>>>> We can set policy to CORE_COOKIE_TRUST of uperf cgroup and fix this 
>>>>>>>> kind
>>>>>>>> of performance regression. Not sure if this sounds attractive?
>>>>>>>
>>>>>>> Instead of this, I think it can be something simpler IMHO:
>>>>>>>
>>>>>>> 1. Consider all cookie-0 task as trusted. (Even right now, if you apply 
>>>>>>> the
>>>>>>>  core-scheduling patchset, such tasks will share a core and sniff on 
>>>>>>> each
>>>>>>>  other. So let us not pretend that such tasks are not trusted).
>>>>>>>
>>>>>>> 2. All kernel threads and idle task would have a cookie 0 (so that will 
>>>>>>> cover
>>>>>>>  ksoftirqd reported in your original issue).
>>>>>>>
>>>>>>> 3. Add a config option (CONFIG_SCHED_CORE_DEFAULT_TASKS_UNTRUSTED). 
>>>>>>> Default
>>>>>>>  enable it. Setting this option would tag all tasks that are forked 
>>>>>>> from a
>>>>>>>  cookie-0 task with their own cookie. Later on, such tasks can be added 
>>>>>>> to
>>>>>>>  a group. This cover's PeterZ's ask about having 'default untrusted').
>>>>>>>  (Users like ChromeOS that don't want to userspace system processes to 
>>>>>>> be
>>>>>>>  tagged can disable this option so such tasks will be cookie-0).
>>>>>>>
>>>>>>> 4. Allow prctl/cgroup interfaces to create groups of tasks and override 
>>>>>>> the
>>>>>>>  above behaviors.
>>>>>>
>>>>>> How does uperf in a cgroup work with ksoftirqd? Are you suggesting I set 
>>>>>> uperf's
>>>>>> cookie to be cookie-0 via prctl?
>>>>>
>>>>> Yes, but let me try to understand better. There are 2 problems here I 
>>>>> think:
>>>>>
>>>>> 1. ksoftirqd getting idled when HT is turned on, because uperf is sharing 
>>>>> a
>>>>> core with it: This should not be any worse than SMT OFF, because even SMT 
>>>>> OFF
>>>>> would also reduce ksoftirqd's CPU time just core sched is doing. Sure
>>>>> core-scheduling adds some overhead with IPIs but such a huge drop of perf 
>>>>> is
>>>>> strange. Peter any thoughts on that?
>>>>>
>>>>> 2. Interface: To solve the performance problem, you are saying you want 
>>>>> uperf
>>>>> to share a core with ksoftirqd so that it is not forced into idle.  Why 
>>>>> not
>>>>> just keep uperf out of the cgroup?
>>>>
>>>> I guess this is unacceptable for who runs their apps in container and vm.
>>> IMHO,  just as Joel proposed, 
>>> 1. Consider all cookie-0 task as trusted.
>>> 2. All kernel threads and idle task would have a cookie 0 
>>> In that way, all tasks with cookies(including uperf in a cgroup) could run
>>> concurrently with kernel threads.
>>> That could be a good solution for the issue. :)
>>
>> From uperf point of review, it can trust cookie-0(I assume we still need
>> some modifications to change cookie-match to cookie-compatible to allow
>> ZERO and NONZERO run together).
>>
>> But from kernel thread point of review, it can NOT trust uperf, unless
>> we set uperf's cookie to 0.
> That’s right. :)
> Could we set the cookie of cgroup where uperf lies to 0?
> 
IMHO the disadvantage is that if two or more cgroups are set to cookie-0,
then the user applications in these cgroups could run concurrently on a core;
even though all of them are set as trusted, this opens a hole in user->user
isolation.

Thanks,
-Aubrey


Re: [RFC PATCH 00/16] Core scheduling v6(Internet mail)

2020-08-13 Thread Li, Aubrey
On 2020/8/14 8:26, benbjiang(蒋彪) wrote:
> 
> 
>> On Aug 13, 2020, at 12:28 PM, Li, Aubrey  wrote:
>>
>> On 2020/8/13 7:08, Joel Fernandes wrote:
>>> On Wed, Aug 12, 2020 at 10:01:24AM +0800, Li, Aubrey wrote:
>>>> Hi Joel,
>>>>
>>>> On 2020/8/10 0:44, Joel Fernandes wrote:
>>>>> Hi Aubrey,
>>>>>
>>>>> Apologies for replying late as I was still looking into the details.
>>>>>
>>>>> On Wed, Aug 05, 2020 at 11:57:20AM +0800, Li, Aubrey wrote:
>>>>> [...]
>>>>>> +/*
>>>>>> + * Core scheduling policy:
>>>>>> + * - CORE_SCHED_DISABLED: core scheduling is disabled.
>>>>>> + * - CORE_COOKIE_MATCH: tasks with same cookie can run
>>>>>> + * on the same core concurrently.
>>>>>> + * - CORE_COOKIE_TRUST: trusted task can run with kernel
>>>>>>  thread on the same core concurrently. 
>>>>>> + * - CORE_COOKIE_LONELY: tasks with cookie can run only
>>>>>> + * with idle thread on the same core.
>>>>>> + */
>>>>>> +enum coresched_policy {
>>>>>> +   CORE_SCHED_DISABLED,
>>>>>> +   CORE_SCHED_COOKIE_MATCH,
>>>>>> +CORE_SCHED_COOKIE_TRUST,
>>>>>> +   CORE_SCHED_COOKIE_LONELY,
>>>>>> +};
>>>>>>
>>>>>> We can set policy to CORE_COOKIE_TRUST of uperf cgroup and fix this kind
>>>>>> of performance regression. Not sure if this sounds attractive?
>>>>>
>>>>> Instead of this, I think it can be something simpler IMHO:
>>>>>
>>>>> 1. Consider all cookie-0 task as trusted. (Even right now, if you apply 
>>>>> the
>>>>>   core-scheduling patchset, such tasks will share a core and sniff on each
>>>>>   other. So let us not pretend that such tasks are not trusted).
>>>>>
>>>>> 2. All kernel threads and idle task would have a cookie 0 (so that will 
>>>>> cover
>>>>>   ksoftirqd reported in your original issue).
>>>>>
>>>>> 3. Add a config option (CONFIG_SCHED_CORE_DEFAULT_TASKS_UNTRUSTED). 
>>>>> Default
>>>>>   enable it. Setting this option would tag all tasks that are forked from 
>>>>> a
>>>>>   cookie-0 task with their own cookie. Later on, such tasks can be added 
>>>>> to
>>>>>   a group. This cover's PeterZ's ask about having 'default untrusted').
>>>>>   (Users like ChromeOS that don't want to userspace system processes to be
>>>>>   tagged can disable this option so such tasks will be cookie-0).
>>>>>
>>>>> 4. Allow prctl/cgroup interfaces to create groups of tasks and override 
>>>>> the
>>>>>   above behaviors.
>>>>
>>>> How does uperf in a cgroup work with ksoftirqd? Are you suggesting I set 
>>>> uperf's
>>>> cookie to be cookie-0 via prctl?
>>>
>>> Yes, but let me try to understand better. There are 2 problems here I think:
>>>
>>> 1. ksoftirqd getting idled when HT is turned on, because uperf is sharing a
>>> core with it: This should not be any worse than SMT OFF, because even SMT 
>>> OFF
>>> would also reduce ksoftirqd's CPU time just core sched is doing. Sure
>>> core-scheduling adds some overhead with IPIs but such a huge drop of perf is
>>> strange. Peter any thoughts on that?
>>>
>>> 2. Interface: To solve the performance problem, you are saying you want 
>>> uperf
>>> to share a core with ksoftirqd so that it is not forced into idle.  Why not
>>> just keep uperf out of the cgroup?
>>
>> I guess this is unacceptable for who runs their apps in container and vm.
> IMHO,  just as Joel proposed, 
> 1. Consider all cookie-0 task as trusted.
> 2. All kernel threads and idle task would have a cookie 0 
> In that way, all tasks with cookies(including uperf in a cgroup) could run
> concurrently with kernel threads.
> That could be a good solution for the issue. :)

From uperf's point of view, it can trust cookie-0 (I assume we still need
some modifications to change cookie-match to cookie-compatible to allow
ZERO and NONZERO to run together).

But from the kernel thread's point of view, it can NOT trust uperf, unless
we set uperf's cookie to 0.

Thanks,
-Aubrey
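
A minimal sketch of the "cookie compatible" idea mentioned above (a
hypothetical helper, not part of the posted series): cookie 0, i.e. trusted
tasks and kernel threads, may pair with anything, while two different
non-zero cookies still may not share a core.

        static inline bool core_cookies_compatible(unsigned long a, unsigned long b)
        {
                /* cookie 0 (trusted / kernel threads) pairs with anything */
                if (a == 0 || b == 0)
                        return true;
                /* otherwise keep the existing strict cookie match */
                return a == b;
        }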


Re: [RFC PATCH 00/16] Core scheduling v6

2020-08-12 Thread Li, Aubrey
On 2020/8/13 7:08, Joel Fernandes wrote:
> On Wed, Aug 12, 2020 at 10:01:24AM +0800, Li, Aubrey wrote:
>> Hi Joel,
>>
>> On 2020/8/10 0:44, Joel Fernandes wrote:
>>> Hi Aubrey,
>>>
>>> Apologies for replying late as I was still looking into the details.
>>>
>>> On Wed, Aug 05, 2020 at 11:57:20AM +0800, Li, Aubrey wrote:
>>> [...]
>>>> +/*
>>>> + * Core scheduling policy:
>>>> + * - CORE_SCHED_DISABLED: core scheduling is disabled.
>>>> + * - CORE_COOKIE_MATCH: tasks with same cookie can run
>>>> + * on the same core concurrently.
>>>> + * - CORE_COOKIE_TRUST: trusted task can run with kernel
>>>>thread on the same core concurrently. 
>>>> + * - CORE_COOKIE_LONELY: tasks with cookie can run only
>>>> + * with idle thread on the same core.
>>>> + */
>>>> +enum coresched_policy {
>>>> +   CORE_SCHED_DISABLED,
>>>> +   CORE_SCHED_COOKIE_MATCH,
>>>> +  CORE_SCHED_COOKIE_TRUST,
>>>> +   CORE_SCHED_COOKIE_LONELY,
>>>> +};
>>>>
>>>> We can set policy to CORE_COOKIE_TRUST of uperf cgroup and fix this kind
>>>> of performance regression. Not sure if this sounds attractive?
>>>
>>> Instead of this, I think it can be something simpler IMHO:
>>>
>>> 1. Consider all cookie-0 task as trusted. (Even right now, if you apply the
>>>core-scheduling patchset, such tasks will share a core and sniff on each
>>>other. So let us not pretend that such tasks are not trusted).
>>>
>>> 2. All kernel threads and idle task would have a cookie 0 (so that will 
>>> cover
>>>ksoftirqd reported in your original issue).
>>>
>>> 3. Add a config option (CONFIG_SCHED_CORE_DEFAULT_TASKS_UNTRUSTED). Default
>>>enable it. Setting this option would tag all tasks that are forked from a
>>>cookie-0 task with their own cookie. Later on, such tasks can be added to
>>>a group. This cover's PeterZ's ask about having 'default untrusted').
>>>(Users like ChromeOS that don't want to userspace system processes to be
>>>tagged can disable this option so such tasks will be cookie-0).
>>>
>>> 4. Allow prctl/cgroup interfaces to create groups of tasks and override the
>>>above behaviors.
>>
>> How does uperf in a cgroup work with ksoftirqd? Are you suggesting I set 
>> uperf's
>> cookie to be cookie-0 via prctl?
> 
> Yes, but let me try to understand better. There are 2 problems here I think:
> 
> 1. ksoftirqd getting idled when HT is turned on, because uperf is sharing a
> core with it: This should not be any worse than SMT OFF, because even SMT OFF
> would also reduce ksoftirqd's CPU time just core sched is doing. Sure
> core-scheduling adds some overhead with IPIs but such a huge drop of perf is
> strange. Peter any thoughts on that?
> 
> 2. Interface: To solve the performance problem, you are saying you want uperf
> to share a core with ksoftirqd so that it is not forced into idle.  Why not
> just keep uperf out of the cgroup?

I guess this is unacceptable for those who run their apps in containers and VMs.

Thanks,
-Aubrey

> Then it will have cookie 0 and be able to
> share core with kernel threads. About user-user isolation that you need, if
> you tag any "untrusted" threads by adding it to CGroup, then there will
> automatically isolated from uperf while allowing uperf to share CPU with
> kernel threads.
> 
> Please let me know your thoughts and thanks,
> 
>  - Joel
> 
>>
>> Thanks,
>> -Aubrey
>>>
>>> 5. Document everything clearly so the semantics are clear both to the
>>>developers of core scheduling and to system administrators.
>>>
>>> Note that, with the concept of "system trusted cookie", we can also do
>>> optimizations like:
>>> 1. Disable STIBP when switching into trusted tasks.
>>> 2. Disable L1D flushing / verw stuff for L1TF/MDS issues, when switching 
>>> into
>>>trusted tasks.
>>>
>>> At least #1 seems to be biting enabling HT on ChromeOS right now, and one
>>> other engineer requested I do something like #2 already.
>>>
>>> Once we get full-syscall isolation working, threads belonging to a process
>>> can also share a core so those can just share a core with the task-group
>>> leader.
>>>
>>>>> Is the uperf throughput worse with SMT+core-scheduling versus no-SMT ?
>>>>
>>>> This is a good question, from the data we measured by uperf,
>>>> SMT+core-scheduling is 28.2% worse than no-SMT, :(
>>>
>>> This is worrying for sure. :-(. We ought to debug/profile it more to see 
>>> what
>>> is causing the overhead. Me/Vineeth added it as a topic for LPC as well.
>>>
>>> Any other thoughts from others on this?
>>>
>>> thanks,
>>>
>>>  - Joel
>>>
>>>
>>>>> thanks,
>>>>>
>>>>>  - Joel
>>>>> PS: I am planning to write a patch behind a CONFIG option that tags
>>>>> all processes (default untrusted) so everything gets a cookie which
>>>>> some folks said was how they wanted (have a whitelist instead of
>>>>> blacklist).
>>>>>
>>>>
>>



Re: [RFC PATCH 00/16] Core scheduling v6

2020-08-11 Thread Li, Aubrey
Hi Joel,

On 2020/8/10 0:44, Joel Fernandes wrote:
> Hi Aubrey,
> 
> Apologies for replying late as I was still looking into the details.
> 
> On Wed, Aug 05, 2020 at 11:57:20AM +0800, Li, Aubrey wrote:
> [...]
>> +/*
>> + * Core scheduling policy:
>> + * - CORE_SCHED_DISABLED: core scheduling is disabled.
>> + * - CORE_COOKIE_MATCH: tasks with same cookie can run
>> + * on the same core concurrently.
>> + * - CORE_COOKIE_TRUST: trusted task can run with kernel
>>  thread on the same core concurrently. 
>> + * - CORE_COOKIE_LONELY: tasks with cookie can run only
>> + * with idle thread on the same core.
>> + */
>> +enum coresched_policy {
>> +   CORE_SCHED_DISABLED,
>> +   CORE_SCHED_COOKIE_MATCH,
>> +CORE_SCHED_COOKIE_TRUST,
>> +   CORE_SCHED_COOKIE_LONELY,
>> +};
>>
>> We can set policy to CORE_COOKIE_TRUST of uperf cgroup and fix this kind
>> of performance regression. Not sure if this sounds attractive?
> 
> Instead of this, I think it can be something simpler IMHO:
> 
> 1. Consider all cookie-0 task as trusted. (Even right now, if you apply the
>core-scheduling patchset, such tasks will share a core and sniff on each
>other. So let us not pretend that such tasks are not trusted).
> 
> 2. All kernel threads and idle task would have a cookie 0 (so that will cover
>ksoftirqd reported in your original issue).
> 
> 3. Add a config option (CONFIG_SCHED_CORE_DEFAULT_TASKS_UNTRUSTED). Default
>enable it. Setting this option would tag all tasks that are forked from a
>cookie-0 task with their own cookie. Later on, such tasks can be added to
>a group. This cover's PeterZ's ask about having 'default untrusted').
>(Users like ChromeOS that don't want to userspace system processes to be
>tagged can disable this option so such tasks will be cookie-0).
> 
> 4. Allow prctl/cgroup interfaces to create groups of tasks and override the
>above behaviors.

How does uperf in a cgroup work with ksoftirqd? Are you suggesting I set uperf's
cookie to be cookie-0 via prctl?

Thanks,
-Aubrey
> 
> 5. Document everything clearly so the semantics are clear both to the
>developers of core scheduling and to system administrators.
> 
> Note that, with the concept of "system trusted cookie", we can also do
> optimizations like:
> 1. Disable STIBP when switching into trusted tasks.
> 2. Disable L1D flushing / verw stuff for L1TF/MDS issues, when switching into
>trusted tasks.
> 
> At least #1 seems to be biting enabling HT on ChromeOS right now, and one
> other engineer requested I do something like #2 already.
> 
> Once we get full-syscall isolation working, threads belonging to a process
> can also share a core so those can just share a core with the task-group
> leader.
> 
>>> Is the uperf throughput worse with SMT+core-scheduling versus no-SMT ?
>>
>> This is a good question, from the data we measured by uperf,
>> SMT+core-scheduling is 28.2% worse than no-SMT, :(
> 
> This is worrying for sure. :-(. We ought to debug/profile it more to see what
> is causing the overhead. Me/Vineeth added it as a topic for LPC as well.
> 
> Any other thoughts from others on this?
> 
> thanks,
> 
>  - Joel
> 
> 
>>> thanks,
>>>
>>>  - Joel
>>> PS: I am planning to write a patch behind a CONFIG option that tags
>>> all processes (default untrusted) so everything gets a cookie which
>>> some folks said was how they wanted (have a whitelist instead of
>>> blacklist).
>>>
>>



Re: [RFC PATCH 00/16] Core scheduling v6

2020-08-04 Thread Li, Aubrey
On 2020/8/4 0:53, Joel Fernandes wrote:
> Hi Aubrey,
> 
> On Mon, Aug 3, 2020 at 4:23 AM Li, Aubrey  wrote:
>>
>> On 2020/7/1 5:32, Vineeth Remanan Pillai wrote:
>>> Sixth iteration of the Core-Scheduling feature.
>>>
>>> Core scheduling is a feature that allows only trusted tasks to run
>>> concurrently on cpus sharing compute resources (eg: hyperthreads on a
>>> core). The goal is to mitigate the core-level side-channel attacks
>>> without requiring to disable SMT (which has a significant impact on
>>> performance in some situations). Core scheduling (as of v6) mitigates
>>> user-space to user-space attacks and user to kernel attack when one of
>>> the siblings enters the kernel via interrupts. It is still possible to
>>> have a task attack the sibling thread when it enters the kernel via
>>> syscalls.
>>>
>>> By default, the feature doesn't change any of the current scheduler
>>> behavior. The user decides which tasks can run simultaneously on the
>>> same core (for now by having them in the same tagged cgroup). When a
>>> tag is enabled in a cgroup and a task from that cgroup is running on a
>>> hardware thread, the scheduler ensures that only idle or trusted tasks
>>> run on the other sibling(s). Besides security concerns, this feature
>>> can also be beneficial for RT and performance applications where we
>>> want to control how tasks make use of SMT dynamically.
>>>
>>> This iteration is mostly a cleanup of v5 except for a major feature of
>>> pausing sibling when a cpu enters kernel via nmi/irq/softirq. Also
>>> introducing documentation and includes minor crash fixes.
>>>
>>> One major cleanup was removing the hotplug support and related code.
>>> The hotplug related crashes were not documented and the fixes piled up
>>> over time leading to complex code. We were not able to reproduce the
>>> crashes in the limited testing done. But if they are reroducable, we
>>> don't want to hide them. We should document them and design better
>>> fixes if any.
>>>
>>> In terms of performance, the results in this release are similar to
>>> v5. On a x86 system with N hardware threads:
>>> - if only N/2 hardware threads are busy, the performance is similar
>>>   between baseline, corescheduling and nosmt
>>> - if N hardware threads are busy with N different corescheduling
>>>   groups, the impact of corescheduling is similar to nosmt
>>> - if N hardware threads are busy and multiple active threads share the
>>>   same corescheduling cookie, they gain a performance improvement over
>>>   nosmt.
>>>   The specific performance impact depends on the workload, but for a
>>>   really busy database 12-vcpu VM (1 coresched tag) running on a 36
>>>   hardware threads NUMA node with 96 mostly idle neighbor VMs (each in
>>>   their own coresched tag), the performance drops by 54% with
>>>   corescheduling and drops by 90% with nosmt.
>>>
>>
>> We found uperf(in cgroup) throughput drops by ~50% with corescheduling.
>>
>> The problem is, uperf triggered a lot of softirq and offloaded softirq
>> service to *ksoftirqd* thread.
>>
>> - default, ksoftirqd thread can run with uperf on the same core, we saw
>>   100% CPU utilization.
>> - coresched enabled, ksoftirqd's core cookie is different from uperf, so
>>   they can't run concurrently on the same core, we saw ~15% forced idle.
>>
>> I guess this kind of performance drop can be replicated by other similar
>> (a lot of softirq activities) workloads.
>>
>> Currently core scheduler picks cookie-match tasks for all SMT siblings, does
>> it make sense we add a policy to allow cookie-compatible task running 
>> together?
>> For example, if a task is trusted(set by admin), it can work with kernel 
>> thread.
>> The difference from corescheduling disabled is that we still have user to 
>> user
>> isolation.
> 
> In ChromeOS we are considering all cookie-0 tasks as trusted.
> Basically if you don't trust a task, then that is when you assign the
> task a tag. We do this for the sandboxed processes.

I have a proposal for this: change cpu.tag to cpu.coresched_policy,
something like the following:

+/*
+ * Core scheduling policy:
+ * - CORE_SCHED_DISABLED: core scheduling is disabled.
+ * - CORE_COOKIE_MATCH: tasks with same cookie can run
+ * on the same core concurrently.
+ * - CORE_COOKIE_TRUST: trusted task can run with kernel
thread on the same core concurren
