[tip: sched/core] sched/fair: Reduce long-tail newly idle balance cost

2021-03-23 Thread tip-bot2 for Aubrey Li
The following commit has been merged into the sched/core branch of tip:

Commit-ID: acb4decc1e900468d51b33c5f1ee445278e716a7
Gitweb:
https://git.kernel.org/tip/acb4decc1e900468d51b33c5f1ee445278e716a7
Author: Aubrey Li
AuthorDate: Wed, 24 Feb 2021 16:15:49 +08:00
Committer: Peter Zijlstra 
CommitterDate: Tue, 23 Mar 2021 16:01:59 +01:00

sched/fair: Reduce long-tail newly idle balance cost

A long-tail load balance cost is observed on the newly idle path. It
is caused by a race window between the first nr_running check of the
busiest runqueue and its nr_running recheck in detach_tasks().

Before the busiest runqueue is locked, its tasks could be pulled by
other CPUs, so nr_running of the busiest runqueue becomes 1, or even 0
if the running task goes idle. This causes detach_tasks() to break out
with the LBF_ALL_PINNED flag set and triggers a load_balance() redo at
the same sched_domain level.

In order to find the new busiest sched_group and CPU, load balance will
recompute and update the various load statistics, which eventually leads
to the long-tail load balance cost.

This patch clears the LBF_ALL_PINNED flag for this race condition, and
hence reduces the long-tail cost of newly idle balance.
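
For context (not part of the patch): when detach_tasks() returns without
moving anything and LBF_ALL_PINNED is still set, load_balance() takes the
redo path sketched below, which is the recompute cost described above.
The inner cpumask_subset()/goto redo part is visible as context in the
earlier RFC in this thread; the surrounding lines are paraphrased from
the load_balance() code of that era, so treat the details as approximate:

    if (unlikely(env.flags & LBF_ALL_PINNED)) {
            __cpumask_clear_cpu(cpu_of(busiest), cpus);
            if (!cpumask_subset(cpus, env.dst_grpmask)) {
                    env.loop = 0;
                    env.loop_break = sched_nr_migrate_break;
                    /* recompute busiest group/CPU statistics and retry */
                    goto redo;
            }
            goto out_all_pinned;
    }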

Signed-off-by: Aubrey Li 
Signed-off-by: Peter Zijlstra (Intel) 
Reviewed-by: Vincent Guittot 
Link: 
https://lkml.kernel.org/r/1614154549-116078-1-git-send-email-aubrey...@intel.com
---
 kernel/sched/fair.c |  9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index aaa0dfa..6d73bdb 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7687,6 +7687,15 @@ static int detach_tasks(struct lb_env *env)
 
 	lockdep_assert_held(&env->src_rq->lock);
 
+   /*
+* Source run queue has been emptied by another CPU, clear
+* LBF_ALL_PINNED flag as we will not test any task.
+*/
+   if (env->src_rq->nr_running <= 1) {
+   env->flags &= ~LBF_ALL_PINNED;
+   return 0;
+   }
+
if (env->imbalance <= 0)
return 0;
 


[PATCH v10] sched/fair: select idle cpu from idle cpumask for task wakeup

2021-03-15 Thread Aubrey Li
From: Aubrey Li 

Add an idle cpumask to track idle cpus in the sched domain. Every time
a CPU enters idle, the CPU is set in the idle cpumask to be a wakeup
target. If the CPU is not idle, it is cleared from the idle cpumask
during the scheduler tick, to ratelimit idle cpumask updates.

When a task wakes up and selects an idle cpu, scanning the idle cpumask
has lower cost than scanning all the cpus in the last level cache
domain, especially when the system is heavily loaded.
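
For reference, a minimal sketch of the update path described above,
pieced together from the update_idle_cpumask() hunks visible in the
older versions further down this thread (the v10 body is truncated in
this archive, so treat the helper and field names, e.g. sds_idle_cpus()
and rq->last_idle_state, as those used by this series rather than a
verbatim quote):

    /*
     * Sketch: set the CPU in the LLC-shared idle cpumask on idle entry,
     * clear it from the scheduler tick when the CPU is busy, and skip
     * the atomic cpumask op entirely when the state did not change.
     */
    void update_idle_cpumask(int cpu, bool idle)
    {
            struct rq *rq = cpu_rq(cpu);
            struct sched_domain *sd;
            int idle_state;

            /* Tick path: only clear busy CPUs; idle entry already set the bit. */
            if (!idle && idle_cpu(cpu))
                    return;

            /* A SCHED_IDLE cpu is still a valid wakeup target. */
            idle_state = idle || sched_idle_cpu(cpu);

            /* Ratelimit: no cpumask update if the state did not change. */
            if (rq->last_idle_state == idle_state)
                    return;
            rq->last_idle_state = idle_state;

            rcu_read_lock();
            sd = rcu_dereference(per_cpu(sd_llc, cpu));
            if (!sd || !sd->shared)
                    goto unlock;

            if (idle_state)
                    cpumask_set_cpu(cpu, sds_idle_cpus(sd->shared));
            else
                    cpumask_clear_cpu(cpu, sds_idle_cpus(sd->shared));
    unlock:
            rcu_read_unlock();
    }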

v9->v10:
- Update scan cost only when the idle cpumask is scanned, i.e, the
  idle cpumask is not empty

v8->v9:
- rebase on top of tip/sched/core, no code change

v7->v8:
- refine update_idle_cpumask, no functionality change
- fix a suspicious RCU usage warning with CONFIG_PROVE_RCU=y

v6->v7:
- place the whole idle cpumask mechanism under CONFIG_SMP

v5->v6:
- decouple idle cpumask update from stop_tick signal, set idle CPU
  in idle cpumask every time the CPU enters idle

v4->v5:
- add update_idle_cpumask for s2idle case
- keep the same ordering of tick_nohz_idle_stop_tick() and update_
  idle_cpumask() everywhere

v3->v4:
- change setting idle cpumask from every idle entry to tickless idle
  if cpu driver is available
- move clearing idle cpumask to scheduler_tick to decouple nohz mode

v2->v3:
- change setting idle cpumask to every idle entry, otherwise schbench
  has a regression of 99th percentile latency
- change clearing idle cpumask to nohz_balancer_kick(), so updating
  idle cpumask is ratelimited in the idle exiting path
- set SCHED_IDLE cpu in idle cpumask to allow it as a wakeup target

v1->v2:
- idle cpumask is updated in the nohz routines, by initializing idle
  cpumask with sched_domain_span(sd), nohz=off case remains the original
  behavior

Cc: Peter Zijlstra 
Cc: Mel Gorman 
Cc: Vincent Guittot 
Cc: Qais Yousef 
Cc: Valentin Schneider 
Cc: Jiang Biao 
Cc: Tim Chen 
Signed-off-by: Aubrey Li 
---
 include/linux/sched/topology.h | 13 
 kernel/sched/core.c|  2 ++
 kernel/sched/fair.c| 47 --
 kernel/sched/idle.c|  5 +
 kernel/sched/sched.h   |  4 
 kernel/sched/topology.c|  3 ++-
 6 files changed, 71 insertions(+), 3 deletions(-)

diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index 8f0f778..905e382 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -74,8 +74,21 @@ struct sched_domain_shared {
 atomic_t        ref;
 atomic_t        nr_busy_cpus;
int has_idle_cores;
+   /*
+* Span of all idle CPUs in this domain.
+*
+* NOTE: this field is variable length. (Allocated dynamically
+* by attaching extra space to the end of the structure,
+* depending on how many CPUs the kernel has booted up with)
+*/
+   unsigned long   idle_cpus_span[];
 };
 
+static inline struct cpumask *sds_idle_cpus(struct sched_domain_shared *sds)
+{
+   return to_cpumask(sds->idle_cpus_span);
+}
+
 struct sched_domain {
/* These fields must be setup */
struct sched_domain __rcu *parent;  /* top domain must be null 
terminated */
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index ca2bb62..310bf9a 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4552,6 +4552,7 @@ void scheduler_tick(void)
 
 #ifdef CONFIG_SMP
rq->idle_balance = idle_cpu(cpu);
+   update_idle_cpumask(cpu, rq->idle_balance);
trigger_load_balance(rq);
 #endif
 }
@@ -8209,6 +8210,7 @@ void __init sched_init(void)
rq->idle_stamp = 0;
rq->avg_idle = 2*sysctl_sched_migration_cost;
rq->max_idle_balance_cost = sysctl_sched_migration_cost;
+   rq->last_idle_state = 1;
 
 	INIT_LIST_HEAD(&rq->cfs_tasks);
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 794c2cb..24384b4 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6134,7 +6134,12 @@ static int select_idle_cpu(struct task_struct *p, struct 
sched_domain *sd, int t
if (!this_sd)
return -1;
 
-   cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
+   /*
+* sched_domain_shared is set only at shared cache level,
+* this works only because select_idle_cpu is called with
+* sd_llc.
+*/
+   cpumask_and(cpus, sds_idle_cpus(sd->shared), p->cpus_ptr);
 
if (sched_feat(SIS_PROP) && !smt) {
u64 avg_cost, avg_idle, span_avg;
@@ -6173,7 +6178,7 @@ static int select_idle_cpu(struct task_struct *p, struct 
sched_domain *sd, int t
if (smt)
set_idle_cores(this, false);
 
-   if (sched_feat(SIS_PROP) && !smt) {
+   if (sched_feat(SIS_PROP) && !smt && (cpu < nr_cpumask_bits)) {
time = cpu_clock(this) - time;
  

[PATCH v9 1/2] sched/fair: select idle cpu from idle cpumask for task wakeup

2021-03-09 Thread Aubrey Li
From: Aubrey Li 

Add an idle cpumask to track idle cpus in the sched domain. Every time
a CPU enters idle, the CPU is set in the idle cpumask to be a wakeup
target. If the CPU is not idle, it is cleared from the idle cpumask
during the scheduler tick, to ratelimit idle cpumask updates.

When a task wakes up and selects an idle cpu, scanning the idle cpumask
has lower cost than scanning all the cpus in the last level cache
domain, especially when the system is heavily loaded.

v8->v9:
- rebase on top of tip/sched/core, no functionality change

v7->v8:
- refine update_idle_cpumask, no functionality change
- fix a suspicious RCU usage warning with CONFIG_PROVE_RCU=y

v6->v7:
- place the whole idle cpumask mechanism under CONFIG_SMP

v5->v6:
- decouple idle cpumask update from stop_tick signal, set idle CPU
  in idle cpumask every time the CPU enters idle

v4->v5:
- add update_idle_cpumask for s2idle case
- keep the same ordering of tick_nohz_idle_stop_tick() and update_
  idle_cpumask() everywhere

v3->v4:
- change setting idle cpumask from every idle entry to tickless idle
  if cpu driver is available
- move clearing idle cpumask to scheduler_tick to decouple nohz mode

v2->v3:
- change setting idle cpumask to every idle entry, otherwise schbench
  has a regression of 99th percentile latency
- change clearing idle cpumask to nohz_balancer_kick(), so updating
  idle cpumask is ratelimited in the idle exiting path
- set SCHED_IDLE cpu in idle cpumask to allow it as a wakeup target

v1->v2:
- idle cpumask is updated in the nohz routines, by initializing idle
  cpumask with sched_domain_span(sd), nohz=off case remains the original
  behavior

Cc: Peter Zijlstra 
Cc: Mel Gorman 
Cc: Vincent Guittot 
Cc: Qais Yousef 
Cc: Valentin Schneider 
Cc: Jiang Biao 
Cc: Tim Chen 
Signed-off-by: Aubrey Li 
---
 include/linux/sched/topology.h | 13 
 kernel/sched/core.c|  2 ++
 kernel/sched/fair.c| 45 +-
 kernel/sched/idle.c|  5 +
 kernel/sched/sched.h   |  4 
 kernel/sched/topology.c|  3 ++-
 6 files changed, 70 insertions(+), 2 deletions(-)

diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index 8f0f778..905e382 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -74,8 +74,21 @@ struct sched_domain_shared {
 atomic_t        ref;
 atomic_t        nr_busy_cpus;
int has_idle_cores;
+   /*
+* Span of all idle CPUs in this domain.
+*
+* NOTE: this field is variable length. (Allocated dynamically
+* by attaching extra space to the end of the structure,
+* depending on how many CPUs the kernel has booted up with)
+*/
+   unsigned long   idle_cpus_span[];
 };
 
+static inline struct cpumask *sds_idle_cpus(struct sched_domain_shared *sds)
+{
+   return to_cpumask(sds->idle_cpus_span);
+}
+
 struct sched_domain {
/* These fields must be setup */
struct sched_domain __rcu *parent;  /* top domain must be null 
terminated */
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index ca2bb62..310bf9a 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4552,6 +4552,7 @@ void scheduler_tick(void)
 
 #ifdef CONFIG_SMP
rq->idle_balance = idle_cpu(cpu);
+   update_idle_cpumask(cpu, rq->idle_balance);
trigger_load_balance(rq);
 #endif
 }
@@ -8209,6 +8210,7 @@ void __init sched_init(void)
rq->idle_stamp = 0;
rq->avg_idle = 2*sysctl_sched_migration_cost;
rq->max_idle_balance_cost = sysctl_sched_migration_cost;
+   rq->last_idle_state = 1;
 
 	INIT_LIST_HEAD(&rq->cfs_tasks);
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 794c2cb..15d23d2 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6134,7 +6134,12 @@ static int select_idle_cpu(struct task_struct *p, struct 
sched_domain *sd, int t
if (!this_sd)
return -1;
 
-   cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
+   /*
+* sched_domain_shared is set only at shared cache level,
+* this works only because select_idle_cpu is called with
+* sd_llc.
+*/
+   cpumask_and(cpus, sds_idle_cpus(sd->shared), p->cpus_ptr);
 
if (sched_feat(SIS_PROP) && !smt) {
u64 avg_cost, avg_idle, span_avg;
@@ -6838,6 +6843,44 @@ balance_fair(struct rq *rq, struct task_struct *prev, 
struct rq_flags *rf)
 
return newidle_balance(rq, rf) != 0;
 }
+
+/*
+ * Update cpu idle state and record this information
+ * in sd_llc_shared->idle_cpus_span.
+ *
+ * This function is called with interrupts disabled.
+ */
+void update_idle_cpumask(int cpu, bool idle)
+{
+   struct sched_domain *sd;
+   struct rq *rq = cpu_rq(cpu);
+   int idle_state;
+
+   /*
+ 

[PATCH v9 2/2] sched/fair: Remove SIS_PROP

2021-03-09 Thread Aubrey Li
From: Aubrey Li 

Scanning idle cpus from the idle cpumask avoids superfluous scans of
the LLC domain, as the first set bit in the idle cpumask is the target.
Since the selected target could have become busy, the idle recheck is
kept, but the SIS_PROP feature becomes meaningless, so remove the
avg_scan_cost computation as well.
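
With the idle cpumask in place, the non-SMT path of select_idle_cpu()
after this removal boils down to the following (a sketch assembled from
the hunks below; the SMT/idle-core handling is omitted for brevity):

    /*
     * Sketch: intersect the LLC idle cpumask with the task's affinity
     * and re-check idleness, since a CPU in the mask may have become
     * busy in the meantime.
     */
    static int select_idle_cpu_sketch(struct task_struct *p,
                                      struct sched_domain *sd, int target)
    {
            struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
            int cpu, idle_cpu = -1;

            cpumask_and(cpus, sds_idle_cpus(sd->shared), p->cpus_ptr);

            for_each_cpu_wrap(cpu, cpus, target) {
                    idle_cpu = __select_idle_cpu(cpu);  /* re-check: may be busy now */
                    if ((unsigned int)idle_cpu < nr_cpumask_bits)
                            break;
            }

            return idle_cpu;
    }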

Cc: Peter Zijlstra 
Cc: Mel Gorman 
Cc: Vincent Guittot 
Cc: Qais Yousef 
Cc: Valentin Schneider 
Cc: Tim Chen 
Cc: Jiang Biao 
Signed-off-by: Aubrey Li 
---
 include/linux/sched/topology.h |  2 --
 kernel/sched/fair.c| 33 ++---
 kernel/sched/features.h|  5 -
 3 files changed, 2 insertions(+), 38 deletions(-)

diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index 905e382..2a37596 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -113,8 +113,6 @@ struct sched_domain {
u64 max_newidle_lb_cost;
unsigned long next_decay_max_lb_cost;
 
-   u64 avg_scan_cost;  /* select_idle_sibling */
-
 #ifdef CONFIG_SCHEDSTATS
/* load_balance() stats */
unsigned int lb_count[CPU_MAX_IDLE_TYPES];
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 15d23d2..6236822 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6117,18 +6117,15 @@ static inline int select_idle_core(struct task_struct 
*p, int core, struct cpuma
 #endif /* CONFIG_SCHED_SMT */
 
 /*
- * Scan the LLC domain for idle CPUs; this is dynamically regulated by
- * comparing the average scan cost (tracked in sd->avg_scan_cost) against the
- * average idle time for this rq (as found in rq->avg_idle).
+ * Scan idle cpumask in the LLC domain for idle CPUs
  */
 static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int 
target)
 {
struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
-   int i, cpu, idle_cpu = -1, nr = INT_MAX;
+   int i, cpu, idle_cpu = -1;
bool smt = test_idle_cores(target, false);
int this = smp_processor_id();
struct sched_domain *this_sd;
-   u64 time;
 
 	this_sd = rcu_dereference(*this_cpu_ptr(&sd_llc));
if (!this_sd)
@@ -6141,25 +6138,6 @@ static int select_idle_cpu(struct task_struct *p, struct 
sched_domain *sd, int t
 */
cpumask_and(cpus, sds_idle_cpus(sd->shared), p->cpus_ptr);
 
-   if (sched_feat(SIS_PROP) && !smt) {
-   u64 avg_cost, avg_idle, span_avg;
-
-   /*
-* Due to large variance we need a large fuzz factor;
-* hackbench in particularly is sensitive here.
-*/
-   avg_idle = this_rq()->avg_idle / 512;
-   avg_cost = this_sd->avg_scan_cost + 1;
-
-   span_avg = sd->span_weight * avg_idle;
-   if (span_avg > 4*avg_cost)
-   nr = div_u64(span_avg, avg_cost);
-   else
-   nr = 4;
-
-   time = cpu_clock(this);
-   }
-
for_each_cpu_wrap(cpu, cpus, target) {
if (smt) {
i = select_idle_core(p, cpu, cpus, _cpu);
@@ -6167,8 +6145,6 @@ static int select_idle_cpu(struct task_struct *p, struct 
sched_domain *sd, int t
return i;
 
} else {
-   if (!--nr)
-   return -1;
idle_cpu = __select_idle_cpu(cpu);
if ((unsigned int)idle_cpu < nr_cpumask_bits)
break;
@@ -6178,11 +6154,6 @@ static int select_idle_cpu(struct task_struct *p, struct 
sched_domain *sd, int t
if (smt)
set_idle_cores(this, false);
 
-   if (sched_feat(SIS_PROP) && !smt) {
-   time = cpu_clock(this) - time;
-   update_avg(&this_sd->avg_scan_cost, time);
-   }
-
return idle_cpu;
 }
 
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index 1bc2b15..267aa774 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -53,11 +53,6 @@ SCHED_FEAT(NONTASK_CAPACITY, true)
 SCHED_FEAT(TTWU_QUEUE, true)
 
 /*
- * When doing wakeups, attempt to limit superfluous scans of the LLC domain.
- */
-SCHED_FEAT(SIS_PROP, true)
-
-/*
  * Issue a WARN when we do multiple update_rq_clock() calls
  * in a single rq->lock section. Default disabled because the
  * annotations are not complete.
-- 
2.7.4



[PATCH v2] sched/fair: reduce long-tail newly idle balance cost

2021-02-24 Thread Aubrey Li
A long-tail load balance cost is observed on the newly idle path. It
is caused by a race window between the first nr_running check of the
busiest runqueue and its nr_running recheck in detach_tasks().

Before the busiest runqueue is locked, its tasks could be pulled by
other CPUs, so nr_running of the busiest runqueue becomes 1, or even 0
if the running task goes idle. This causes detach_tasks() to break out
with the LBF_ALL_PINNED flag set and triggers a load_balance() redo at
the same sched_domain level.

In order to find the new busiest sched_group and CPU, load balance will
recompute and update the various load statistics, which eventually leads
to the long-tail load balance cost.

This patch clears the LBF_ALL_PINNED flag for this race condition, and
hence reduces the long-tail cost of newly idle balance.

Cc: Vincent Guittot 
Cc: Mel Gorman 
Cc: Andi Kleen 
Cc: Tim Chen 
Cc: Srinivas Pandruvada 
Cc: Rafael J. Wysocki 
Signed-off-by: Aubrey Li 
---
 kernel/sched/fair.c | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 04a3ce2..5c67804 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7675,6 +7675,15 @@ static int detach_tasks(struct lb_env *env)
 
 	lockdep_assert_held(&env->src_rq->lock);
 
+   /*
+* Source run queue has been emptied by another CPU, clear
+* LBF_ALL_PINNED flag as we will not test any task.
+*/
+   if (env->src_rq->nr_running <= 1) {
+   env->flags &= ~LBF_ALL_PINNED;
+   return 0;
+   }
+
if (env->imbalance <= 0)
return 0;
 
-- 
2.7.4



[RFC PATCH v1] sched/fair: limit load balance redo times at the same sched_domain level

2021-01-24 Thread Aubrey Li
A long-tail load balance cost is observed on the newly idle path. It
is caused by a race window between the first nr_running check of the
busiest runqueue and its nr_running recheck in detach_tasks().

Before the busiest runqueue is locked, its tasks could be pulled by
other CPUs, so nr_running of the busiest runqueue becomes 1. This
causes detach_tasks() to break out with the LBF_ALL_PINNED flag set
and triggers a load_balance() redo at the same sched_domain level.

In order to find the new busiest sched_group and CPU, load balance will
recompute and update the various load statistics, which eventually leads
to the long-tail load balance cost.

This patch introduces a variable (sched_nr_lb_redo) to limit the number
of load balance redos. Combined with sysctl_sched_nr_migrate, the
maximum load balance cost is reduced from 100+ us to 70+ us, measured
on a 4s x86 system with 192 logical CPUs.

Cc: Andi Kleen 
Cc: Tim Chen 
Cc: Srinivas Pandruvada 
Cc: Rafael J. Wysocki 
Signed-off-by: Aubrey Li 
---
 kernel/sched/fair.c | 7 ++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ae7ceba..b59f371 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7407,6 +7407,8 @@ struct lb_env {
 unsigned int            loop;
 unsigned int            loop_break;
 unsigned int            loop_max;
+   unsigned int            redo_cnt;
+   unsigned int            redo_max;
 
enum fbq_type   fbq_type;
enum migration_type migration_type;
@@ -9525,6 +9527,7 @@ static int should_we_balance(struct lb_env *env)
return group_balance_cpu(sg) == env->dst_cpu;
 }
 
+static const unsigned int sched_nr_lb_redo = 1;
 /*
  * Check this_cpu to ensure it is balanced within domain. Attempt to move
  * tasks if there is an imbalance.
@@ -9547,6 +9550,7 @@ static int load_balance(int this_cpu, struct rq *this_rq,
.dst_grpmask= sched_group_span(sd->groups),
.idle   = idle,
.loop_break = sched_nr_migrate_break,
+   .redo_max   = sched_nr_lb_redo,
.cpus   = cpus,
.fbq_type   = all,
.tasks  = LIST_HEAD_INIT(env.tasks),
@@ -9682,7 +9686,8 @@ static int load_balance(int this_cpu, struct rq *this_rq,
 * destination group that is receiving any migrated
 * load.
 */
-   if (!cpumask_subset(cpus, env.dst_grpmask)) {
+   if (!cpumask_subset(cpus, env.dst_grpmask) &&
+   ++env.redo_cnt < env.redo_max) {
env.loop = 0;
env.loop_break = sched_nr_migrate_break;
goto redo;
-- 
2.7.4



[PATCH] cpuset: fix typos in comments

2021-01-12 Thread Aubrey Li
Change hierachy to hierarchy and congifured to configured, no functionality
changed.

Signed-off-by: Aubrey Li 
---
 kernel/cgroup/cpuset.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 57b5b5d..15f4300 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -98,7 +98,7 @@ struct cpuset {
 * and if it ends up empty, it will inherit the parent's mask.
 *
 *
-* On legacy hierachy:
+* On legacy hierarchy:
 *
 * The user-configured masks are always the same with effective masks.
 */
@@ -1286,10 +1286,10 @@ static int update_parent_subparts_cpumask(struct cpuset 
*cpuset, int cmd,
  * @cs:  the cpuset to consider
  * @tmp: temp variables for calculating effective_cpus & partition setup
  *
- * When congifured cpumask is changed, the effective cpumasks of this cpuset
+ * When configured cpumask is changed, the effective cpumasks of this cpuset
  * and all its descendants need to be updated.
  *
- * On legacy hierachy, effective_cpus will be the same with cpu_allowed.
+ * On legacy hierarchy, effective_cpus will be the same with cpu_allowed.
  *
  * Called with cpuset_mutex held
  */
-- 
2.7.4



[RFC PATCH v8] sched/fair: select idle cpu from idle cpumask for task wakeup

2020-12-09 Thread Aubrey Li
Add an idle cpumask to track idle cpus in the sched domain. Every time
a CPU enters idle, the CPU is set in the idle cpumask to be a wakeup
target. If the CPU is not idle, it is cleared from the idle cpumask
during the scheduler tick, to ratelimit idle cpumask updates.

When a task wakes up and selects an idle cpu, scanning the idle cpumask
has lower cost than scanning all the cpus in the last level cache
domain, especially when the system is heavily loaded.

Benchmarks including hackbench, schbench, uperf, sysbench mysql and
kbuild have been tested on an x86 4-socket system with 24 cores per
socket and 2 hyperthreads per core (192 CPUs in total); no regression
was found.

v7->v8:
- refine update_idle_cpumask, no functionality change
- fix a suspicious RCU usage warning with CONFIG_PROVE_RCU=y

v6->v7:
- place the whole idle cpumask mechanism under CONFIG_SMP

v5->v6:
- decouple idle cpumask update from stop_tick signal, set idle CPU
  in idle cpumask every time the CPU enters idle

v4->v5:
- add update_idle_cpumask for s2idle case
- keep the same ordering of tick_nohz_idle_stop_tick() and update_
  idle_cpumask() everywhere

v3->v4:
- change setting idle cpumask from every idle entry to tickless idle
  if cpu driver is available
- move clearing idle cpumask to scheduler_tick to decouple nohz mode

v2->v3:
- change setting idle cpumask to every idle entry, otherwise schbench
  has a regression of 99th percentile latency
- change clearing idle cpumask to nohz_balancer_kick(), so updating
  idle cpumask is ratelimited in the idle exiting path
- set SCHED_IDLE cpu in idle cpumask to allow it as a wakeup target

v1->v2:
- idle cpumask is updated in the nohz routines, by initializing idle
  cpumask with sched_domain_span(sd), nohz=off case remains the original
  behavior

Cc: Peter Zijlstra 
Cc: Mel Gorman 
Cc: Vincent Guittot 
Cc: Qais Yousef 
Cc: Valentin Schneider 
Cc: Jiang Biao 
Cc: Tim Chen 
Signed-off-by: Aubrey Li 
---
 include/linux/sched/topology.h | 13 ++
 kernel/sched/core.c|  2 ++
 kernel/sched/fair.c| 45 +-
 kernel/sched/idle.c|  5 
 kernel/sched/sched.h   |  4 +++
 kernel/sched/topology.c|  3 ++-
 6 files changed, 70 insertions(+), 2 deletions(-)

diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index 820511289857..b47b85163607 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -65,8 +65,21 @@ struct sched_domain_shared {
 atomic_t        ref;
 atomic_t        nr_busy_cpus;
int has_idle_cores;
+   /*
+* Span of all idle CPUs in this domain.
+*
+* NOTE: this field is variable length. (Allocated dynamically
+* by attaching extra space to the end of the structure,
+* depending on how many CPUs the kernel has booted up with)
+*/
+   unsigned long   idle_cpus_span[];
 };
 
+static inline struct cpumask *sds_idle_cpus(struct sched_domain_shared *sds)
+{
+   return to_cpumask(sds->idle_cpus_span);
+}
+
 struct sched_domain {
/* These fields must be setup */
struct sched_domain __rcu *parent;  /* top domain must be null 
terminated */
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index c4da7e17b906..b136e2440ea4 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4011,6 +4011,7 @@ void scheduler_tick(void)
 
 #ifdef CONFIG_SMP
rq->idle_balance = idle_cpu(cpu);
+   update_idle_cpumask(cpu, rq->idle_balance);
trigger_load_balance(rq);
 #endif
 }
@@ -7186,6 +7187,7 @@ void __init sched_init(void)
rq->idle_stamp = 0;
rq->avg_idle = 2*sysctl_sched_migration_cost;
rq->max_idle_balance_cost = sysctl_sched_migration_cost;
+   rq->last_idle_state = 1;
 
 	INIT_LIST_HEAD(&rq->cfs_tasks);
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c0c4d9ad7da8..25f36ecfee54 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6146,7 +6146,12 @@ static int select_idle_cpu(struct task_struct *p, struct 
sched_domain *sd, int t
 
time = cpu_clock(this);
 
-   cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
+   /*
+* sched_domain_shared is set only at shared cache level,
+* this works only because select_idle_cpu is called with
+* sd_llc.
+*/
+   cpumask_and(cpus, sds_idle_cpus(sd->shared), p->cpus_ptr);
 
for_each_cpu_wrap(cpu, cpus, target) {
if (!--nr)
@@ -6806,6 +6811,44 @@ balance_fair(struct rq *rq, struct task_struct *prev, 
struct rq_flags *rf)
 
return newidle_balance(rq, rf) != 0;
 }
+
+/*
+ * Update cpu idle state and record this information
+ * in sd_llc_shared->idle_cpus_span.
+ *
+ * This function is called with interrupts disabled.
+ */
+void update_idle_cpumask(int cpu, bool idle)
+{
+   st

[RFC PATCH v7] sched/fair: select idle cpu from idle cpumask for task wakeup

2020-12-08 Thread Aubrey Li
Add an idle cpumask to track idle cpus in the sched domain. Every time
a CPU enters idle, the CPU is set in the idle cpumask to be a wakeup
target. If the CPU is not idle, it is cleared from the idle cpumask
during the scheduler tick, to ratelimit idle cpumask updates.

When a task wakes up and selects an idle cpu, scanning the idle cpumask
has lower cost than scanning all the cpus in the last level cache
domain, especially when the system is heavily loaded.

Benchmarks including hackbench, schbench, uperf, sysbench mysql
and kbuild were tested on an x86 4-socket system with 24 cores per
socket and 2 hyperthreads per core (192 CPUs in total); no regression
was found.

v6->v7:
- place the whole idle cpumask mechanism under CONFIG_SMP.

v5->v6:
- decouple idle cpumask update from stop_tick signal, set idle CPU
  in idle cpumask every time the CPU enters idle

v4->v5:
- add update_idle_cpumask for s2idle case
- keep the same ordering of tick_nohz_idle_stop_tick() and update_
  idle_cpumask() everywhere

v3->v4:
- change setting idle cpumask from every idle entry to tickless idle
  if cpu driver is available.
- move clearing idle cpumask to scheduler_tick to decouple nohz mode.

v2->v3:
- change setting idle cpumask to every idle entry, otherwise schbench
  has a regression of 99th percentile latency.
- change clearing idle cpumask to nohz_balancer_kick(), so updating
  idle cpumask is ratelimited in the idle exiting path.
- set SCHED_IDLE cpu in idle cpumask to allow it as a wakeup target.

v1->v2:
- idle cpumask is updated in the nohz routines, by initializing idle
  cpumask with sched_domain_span(sd), nohz=off case remains the original
  behavior.

Cc: Peter Zijlstra 
Cc: Mel Gorman 
Cc: Vincent Guittot 
Cc: Qais Yousef 
Cc: Valentin Schneider 
Cc: Jiang Biao 
Cc: Tim Chen 
Signed-off-by: Aubrey Li 
---
 include/linux/sched/topology.h | 13 +
 kernel/sched/core.c|  2 ++
 kernel/sched/fair.c| 51 +-
 kernel/sched/idle.c|  5 
 kernel/sched/sched.h   |  4 +++
 kernel/sched/topology.c|  3 +-
 6 files changed, 76 insertions(+), 2 deletions(-)

diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index 820511289857..b47b85163607 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -65,8 +65,21 @@ struct sched_domain_shared {
 atomic_t        ref;
 atomic_t        nr_busy_cpus;
int has_idle_cores;
+   /*
+* Span of all idle CPUs in this domain.
+*
+* NOTE: this field is variable length. (Allocated dynamically
+* by attaching extra space to the end of the structure,
+* depending on how many CPUs the kernel has booted up with)
+*/
+   unsigned long   idle_cpus_span[];
 };
 
+static inline struct cpumask *sds_idle_cpus(struct sched_domain_shared *sds)
+{
+   return to_cpumask(sds->idle_cpus_span);
+}
+
 struct sched_domain {
/* These fields must be setup */
struct sched_domain __rcu *parent;  /* top domain must be null 
terminated */
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index c4da7e17b906..c4c51ff3402a 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4011,6 +4011,7 @@ void scheduler_tick(void)
 
 #ifdef CONFIG_SMP
rq->idle_balance = idle_cpu(cpu);
+   update_idle_cpumask(cpu, false);
trigger_load_balance(rq);
 #endif
 }
@@ -7186,6 +7187,7 @@ void __init sched_init(void)
rq->idle_stamp = 0;
rq->avg_idle = 2*sysctl_sched_migration_cost;
rq->max_idle_balance_cost = sysctl_sched_migration_cost;
+   rq->last_idle_state = 1;
 
 	INIT_LIST_HEAD(&rq->cfs_tasks);
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c0c4d9ad7da8..7306f8886120 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6146,7 +6146,12 @@ static int select_idle_cpu(struct task_struct *p, struct 
sched_domain *sd, int t
 
time = cpu_clock(this);
 
-   cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
+   /*
+* sched_domain_shared is set only at shared cache level,
+* this works only because select_idle_cpu is called with
+* sd_llc.
+*/
+   cpumask_and(cpus, sds_idle_cpus(sd->shared), p->cpus_ptr);
 
for_each_cpu_wrap(cpu, cpus, target) {
if (!--nr)
@@ -6806,6 +6811,50 @@ balance_fair(struct rq *rq, struct task_struct *prev, 
struct rq_flags *rf)
 
return newidle_balance(rq, rf) != 0;
 }
+
+/*
+ * Update cpu idle state and record this information
+ * in sd_llc_shared->idle_cpus_span.
+ */
+void update_idle_cpumask(int cpu, bool set_idle)
+{
+   struct sched_domain *sd;
+   struct rq *rq = cpu_rq(cpu);
+   int idle_state;
+
+   /*
+* If called from scheduler tick, only update
+* idle cpumask if the CPU is busy, as id

[RFC PATCH v6] sched/fair: select idle cpu from idle cpumask for task wakeup

2020-12-07 Thread Aubrey Li
Add an idle cpumask to track idle cpus in the sched domain. Every time
a CPU enters idle, the CPU is set in the idle cpumask to be a wakeup
target. If the CPU is not idle, it is cleared from the idle cpumask
during the scheduler tick, to ratelimit idle cpumask updates.

When a task wakes up and selects an idle cpu, scanning the idle cpumask
has lower cost than scanning all the cpus in the last level cache
domain, especially when the system is heavily loaded.

Benchmarks including hackbench, schbench, uperf, sysbench mysql
and kbuild were tested on an x86 4-socket system with 24 cores per
socket and 2 hyperthreads per core (192 CPUs in total); no significant
change in the data was found.

v5->v6:
- decouple idle cpumask update from stop_tick signal, set idle CPU
  in idle cpumask every time the CPU enters idle

v4->v5:
- add update_idle_cpumask for s2idle case
- keep the same ordering of tick_nohz_idle_stop_tick() and update_
  idle_cpumask() everywhere

v3->v4:
- change setting idle cpumask from every idle entry to tickless idle
  if cpu driver is available.
- move clearing idle cpumask to scheduler_tick to decouple nohz mode.

v2->v3:
- change setting idle cpumask to every idle entry, otherwise schbench
  has a regression of 99th percentile latency.
- change clearing idle cpumask to nohz_balancer_kick(), so updating
  idle cpumask is ratelimited in the idle exiting path.
- set SCHED_IDLE cpu in idle cpumask to allow it as a wakeup target.

v1->v2:
- idle cpumask is updated in the nohz routines, by initializing idle
  cpumask with sched_domain_span(sd), nohz=off case remains the original
  behavior.

Cc: Mel Gorman 
Cc: Vincent Guittot 
Cc: Qais Yousef 
Cc: Valentin Schneider 
Cc: Jiang Biao 
Cc: Tim Chen 
Signed-off-by: Aubrey Li 
---
 include/linux/sched/topology.h | 13 +
 kernel/sched/core.c|  2 ++
 kernel/sched/fair.c| 52 +-
 kernel/sched/idle.c|  5 
 kernel/sched/sched.h   |  2 ++
 kernel/sched/topology.c|  3 +-
 6 files changed, 75 insertions(+), 2 deletions(-)

diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index 820511289857..b47b85163607 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -65,8 +65,21 @@ struct sched_domain_shared {
 atomic_t        ref;
 atomic_t        nr_busy_cpus;
int has_idle_cores;
+   /*
+* Span of all idle CPUs in this domain.
+*
+* NOTE: this field is variable length. (Allocated dynamically
+* by attaching extra space to the end of the structure,
+* depending on how many CPUs the kernel has booted up with)
+*/
+   unsigned long   idle_cpus_span[];
 };
 
+static inline struct cpumask *sds_idle_cpus(struct sched_domain_shared *sds)
+{
+   return to_cpumask(sds->idle_cpus_span);
+}
+
 struct sched_domain {
/* These fields must be setup */
struct sched_domain __rcu *parent;  /* top domain must be null 
terminated */
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index c4da7e17b906..b8af602dea79 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3999,6 +3999,7 @@ void scheduler_tick(void)
 	rq_lock(rq, &rf);
 
update_rq_clock(rq);
+   update_idle_cpumask(rq, false);
thermal_pressure = arch_scale_thermal_pressure(cpu_of(rq));
update_thermal_load_avg(rq_clock_thermal(rq), rq, thermal_pressure);
curr->sched_class->task_tick(rq, curr, 0);
@@ -7197,6 +7198,7 @@ void __init sched_init(void)
 	rq_csd_init(rq, &rq->nohz_csd, nohz_csd_func);
 #endif
 #endif /* CONFIG_SMP */
+   rq->last_idle_state = 1;
hrtick_rq_init(rq);
 	atomic_set(&rq->nr_iowait, 0);
}
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c0c4d9ad7da8..1b5c7ed08544 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6146,7 +6146,12 @@ static int select_idle_cpu(struct task_struct *p, struct 
sched_domain *sd, int t
 
time = cpu_clock(this);
 
-   cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
+   /*
+* sched_domain_shared is set only at shared cache level,
+* this works only because select_idle_cpu is called with
+* sd_llc.
+*/
+   cpumask_and(cpus, sds_idle_cpus(sd->shared), p->cpus_ptr);
 
for_each_cpu_wrap(cpu, cpus, target) {
if (!--nr)
@@ -6808,6 +6813,51 @@ balance_fair(struct rq *rq, struct task_struct *prev, 
struct rq_flags *rf)
 }
 #endif /* CONFIG_SMP */
 
+/*
+ * Update cpu idle state and record this information
+ * in sd_llc_shared->idle_cpus_span.
+ */
+void update_idle_cpumask(struct rq *rq, bool set_idle)
+{
+   struct sched_domain *sd;
+   int cpu = cpu_of(rq);
+   int idle_state;
+
+   /*
+* If called from scheduler tick, only update
+* idle cpumask if the CPU is busy, as id

[RFC PATCH v5] sched/fair: select idle cpu from idle cpumask for task wakeup

2020-11-19 Thread Aubrey Li
Add an idle cpumask to track idle cpus in the sched domain. When a CPU
enters idle, if the idle driver indicates to stop the tick, this CPU
is set in the idle cpumask to be a wakeup target. If the CPU is not
idle, it is cleared from the idle cpumask during the scheduler tick,
to ratelimit idle cpumask updates.

When a task wakes up and selects an idle cpu, scanning the idle cpumask
has lower cost than scanning all the cpus in the last level cache
domain, especially when the system is heavily loaded.

Benchmarks were tested on a x86 4 socket system with 24 cores per
socket and 2 hyperthreads per core, total 192 CPUs. Hackbench and
schbench have no notable change, uperf has:

uperf throughput: netperf workload, tcp_nodelay, r/w size = 90

  threads   baseline-avg    %std    patch-avg   %std
  96        1               0.83    1.23        3.27
  144       1               1.03    1.67        2.67
  192       1               0.69    1.81        3.59
  240       1               2.84    1.51        2.67

v4->v5:
- add update_idle_cpumask for s2idle case
- keep the same ordering of tick_nohz_idle_stop_tick() and update_
  idle_cpumask() everywhere

v3->v4:
- change setting idle cpumask from every idle entry to tickless idle
  if cpu driver is available.
- move clearing idle cpumask to scheduler_tick to decouple nohz mode.

v2->v3:
- change setting idle cpumask to every idle entry, otherwise schbench
  has a regression of 99th percentile latency.
- change clearing idle cpumask to nohz_balancer_kick(), so updating
  idle cpumask is ratelimited in the idle exiting path.
- set SCHED_IDLE cpu in idle cpumask to allow it as a wakeup target.

v1->v2:
- idle cpumask is updated in the nohz routines, by initializing idle
  cpumask with sched_domain_span(sd), nohz=off case remains the original
  behavior.

Cc: Mel Gorman 
Cc: Vincent Guittot 
Cc: Qais Yousef 
Cc: Valentin Schneider 
Cc: Jiang Biao 
Cc: Tim Chen 
Signed-off-by: Aubrey Li 
---
 include/linux/sched/topology.h | 13 +
 kernel/sched/core.c|  2 ++
 kernel/sched/fair.c| 52 +-
 kernel/sched/idle.c|  8 --
 kernel/sched/sched.h   |  2 ++
 kernel/sched/topology.c|  3 +-
 6 files changed, 76 insertions(+), 4 deletions(-)

diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index 820511289857..b47b85163607 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -65,8 +65,21 @@ struct sched_domain_shared {
 atomic_t        ref;
 atomic_t        nr_busy_cpus;
int has_idle_cores;
+   /*
+* Span of all idle CPUs in this domain.
+*
+* NOTE: this field is variable length. (Allocated dynamically
+* by attaching extra space to the end of the structure,
+* depending on how many CPUs the kernel has booted up with)
+*/
+   unsigned long   idle_cpus_span[];
 };
 
+static inline struct cpumask *sds_idle_cpus(struct sched_domain_shared *sds)
+{
+   return to_cpumask(sds->idle_cpus_span);
+}
+
 struct sched_domain {
/* These fields must be setup */
struct sched_domain __rcu *parent;  /* top domain must be null 
terminated */
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index b1e0da56abca..c86ae0495163 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3994,6 +3994,7 @@ void scheduler_tick(void)
 	rq_lock(rq, &rf);
 
update_rq_clock(rq);
+   update_idle_cpumask(rq, false);
thermal_pressure = arch_scale_thermal_pressure(cpu_of(rq));
update_thermal_load_avg(rq_clock_thermal(rq), rq, thermal_pressure);
curr->sched_class->task_tick(rq, curr, 0);
@@ -7192,6 +7193,7 @@ void __init sched_init(void)
 	rq_csd_init(rq, &rq->nohz_csd, nohz_csd_func);
 #endif
 #endif /* CONFIG_SMP */
+   rq->last_idle_state = 1;
hrtick_rq_init(rq);
 	atomic_set(&rq->nr_iowait, 0);
}
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 48a6d442b444..d67fba5e406b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6145,7 +6145,12 @@ static int select_idle_cpu(struct task_struct *p, struct 
sched_domain *sd, int t
 
time = cpu_clock(this);
 
-   cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
+   /*
+* sched_domain_shared is set only at shared cache level,
+* this works only because select_idle_cpu is called with
+* sd_llc.
+*/
+   cpumask_and(cpus, sds_idle_cpus(sd->shared), p->cpus_ptr);
 
for_each_cpu_wrap(cpu, cpus, target) {
if (!--nr)
@@ -6807,6 +6812,51 @@ balance_fair(struct rq *rq, struct task_struct *prev, 
struct rq_flags *rf)
 }
 #endif /* CONFIG_SMP */
 
+/*
+ * Update cpu idle state and record this information
+ * in sd_llc_shared->idle_cpus_span.
+ */
+void update

[RFC PATCH v4] sched/fair: select idle cpu from idle cpumask for task wakeup

2020-11-17 Thread Aubrey Li
From: Aubrey Li 

Add an idle cpumask to track idle cpus in the sched domain. When a CPU
enters idle, if the idle driver indicates to stop the tick, this CPU
is set in the idle cpumask to be a wakeup target. If the CPU is not
idle, it is cleared from the idle cpumask during the scheduler tick,
to ratelimit idle cpumask updates.

When a task wakes up and selects an idle cpu, scanning the idle cpumask
has lower cost than scanning all the cpus in the last level cache
domain, especially when the system is heavily loaded.

Benchmarks were tested on a x86 4 socket system with 24 cores per
socket and 2 hyperthreads per core, total 192 CPUs. Hackbench and
schbench have no notable change, uperf has:

uperf throughput: netperf workload, tcp_nodelay, r/w size = 90

  threads   baseline-avg    %std    patch-avg   %std
  96        1               0.83    1.23        3.27
  144       1               1.03    1.67        2.67
  192       1               0.69    1.81        3.59
  240       1               2.84    1.51        2.67

Cc: Mel Gorman 
Cc: Vincent Guittot 
Cc: Qais Yousef 
Cc: Valentin Schneider 
Cc: Jiang Biao 
Cc: Tim Chen 
Signed-off-by: Aubrey Li 
---
 include/linux/sched/topology.h | 13 +
 kernel/sched/core.c|  2 ++
 kernel/sched/fair.c| 52 +-
 kernel/sched/idle.c|  7 +++--
 kernel/sched/sched.h   |  2 ++
 kernel/sched/topology.c|  3 +-
 6 files changed, 74 insertions(+), 5 deletions(-)

diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index 820511289857..b47b85163607 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -65,8 +65,21 @@ struct sched_domain_shared {
 atomic_t        ref;
 atomic_t        nr_busy_cpus;
int has_idle_cores;
+   /*
+* Span of all idle CPUs in this domain.
+*
+* NOTE: this field is variable length. (Allocated dynamically
+* by attaching extra space to the end of the structure,
+* depending on how many CPUs the kernel has booted up with)
+*/
+   unsigned long   idle_cpus_span[];
 };
 
+static inline struct cpumask *sds_idle_cpus(struct sched_domain_shared *sds)
+{
+   return to_cpumask(sds->idle_cpus_span);
+}
+
 struct sched_domain {
/* These fields must be setup */
struct sched_domain __rcu *parent;  /* top domain must be null 
terminated */
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index b1e0da56abca..c86ae0495163 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3994,6 +3994,7 @@ void scheduler_tick(void)
 	rq_lock(rq, &rf);
 
update_rq_clock(rq);
+   update_idle_cpumask(rq, false);
thermal_pressure = arch_scale_thermal_pressure(cpu_of(rq));
update_thermal_load_avg(rq_clock_thermal(rq), rq, thermal_pressure);
curr->sched_class->task_tick(rq, curr, 0);
@@ -7192,6 +7193,7 @@ void __init sched_init(void)
 	rq_csd_init(rq, &rq->nohz_csd, nohz_csd_func);
 #endif
 #endif /* CONFIG_SMP */
+   rq->last_idle_state = 1;
hrtick_rq_init(rq);
 	atomic_set(&rq->nr_iowait, 0);
}
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 48a6d442b444..d67fba5e406b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6145,7 +6145,12 @@ static int select_idle_cpu(struct task_struct *p, struct 
sched_domain *sd, int t
 
time = cpu_clock(this);
 
-   cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
+   /*
+* sched_domain_shared is set only at shared cache level,
+* this works only because select_idle_cpu is called with
+* sd_llc.
+*/
+   cpumask_and(cpus, sds_idle_cpus(sd->shared), p->cpus_ptr);
 
for_each_cpu_wrap(cpu, cpus, target) {
if (!--nr)
@@ -6807,6 +6812,51 @@ balance_fair(struct rq *rq, struct task_struct *prev, 
struct rq_flags *rf)
 }
 #endif /* CONFIG_SMP */
 
+/*
+ * Update cpu idle state and record this information
+ * in sd_llc_shared->idle_cpus_span.
+ */
+void update_idle_cpumask(struct rq *rq, bool set_idle)
+{
+   struct sched_domain *sd;
+   int cpu = cpu_of(rq);
+   int idle_state;
+
+   /*
+* If called from scheduler tick, only update
+* idle cpumask if the CPU is busy, as idle
+* cpumask is also updated on idle entry.
+*
+*/
+   if (!set_idle && idle_cpu(cpu))
+   return;
+   /*
+* Also set SCHED_IDLE cpu in idle cpumask to
+* allow SCHED_IDLE cpu as a wakeup target
+*/
+   idle_state = set_idle || sched_idle_cpu(cpu);
+   /*
+* No need to update idle cpumask if the state
+* does not change.
+*/
+   if (rq->last_idle_state == idle_state)
+   return;
+
+   rcu_read_lock();
+  

[PATCH v1] coresched/proc: add forceidle report with coresched enabled

2020-10-29 Thread Aubrey Li
When a CPU is running a task with coresched enabled, its sibling will
be forced idle if the sibling does not have a trusted task to run. It
is useful to report forceidle time to understand the performance of
the different task cookies throughout the system.

forceidle is added at the last column of /proc/stat:

  $ cat /proc/stat
  cpu  102034 0 11992 8347016 1046 0 11 0 0 0 991
  cpu0 59 0 212 80364 59 0 0 0 0 0 0
  cpu1 72057 0 89 9102 0 0 0 0 0 0 90

So forceidle% can be computed by any user-space tool, for example (a
minimal sketch of such a computation follows the sample output below):

  CPU       user%   system% iowait% forceidle%  idle%
  cpu53     24.75   0.00    0.00%   0.99%       74.26%
  CPU       user%   system% iowait% forceidle%  idle%
  cpu53     25.74   0.00    0.00%   0.99%       73.27%
  CPU       user%   system% iowait% forceidle%  idle%
  cpu53     24.75   0.00    0.00%   0.99%       74.26%
  CPU       user%   system% iowait% forceidle%  idle%
  cpu53     25.24   0.00    0.00%   3.88%       70.87%
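
Below is a minimal user-space sketch (not part of the patch) of that
computation for a single CPU: it takes two samples of /proc/stat and
assumes the forceidle counter is appended as the last field of each
cpuN line, as in the example above. The CPU name and sampling interval
are arbitrary:

    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    /* Read one "cpuN ..." line of /proc/stat into f[0..10]:
     * user nice system idle iowait irq softirq steal guest guest_nice forceidle
     */
    static int read_cpu(const char *name, unsigned long long f[11])
    {
            char line[512], tag[16];
            FILE *fp = fopen("/proc/stat", "r");

            if (!fp)
                    return -1;
            while (fgets(line, sizeof(line), fp)) {
                    if (sscanf(line, "%15s %llu %llu %llu %llu %llu %llu %llu %llu %llu %llu %llu",
                               tag, &f[0], &f[1], &f[2], &f[3], &f[4], &f[5],
                               &f[6], &f[7], &f[8], &f[9], &f[10]) == 12 &&
                        !strcmp(tag, name)) {
                            fclose(fp);
                            return 0;
                    }
            }
            fclose(fp);
            return -1;
    }

    int main(void)
    {
            unsigned long long a[11], b[11], total = 0;
            int i;

            if (read_cpu("cpu53", a))
                    return 1;
            sleep(1);
            if (read_cpu("cpu53", b))
                    return 1;
            /* Use the sum of all reported fields as the (approximate) time base. */
            for (i = 0; i < 11; i++)
                    total += b[i] - a[i];
            if (total)
                    printf("forceidle%%: %.2f\n", 100.0 * (b[10] - a[10]) / total);
            return 0;
    }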

Signed-off-by: Aubrey Li 
---
 fs/proc/stat.c  | 48 +
 include/linux/kernel_stat.h |  1 +
 include/linux/tick.h|  2 ++
 kernel/time/tick-sched.c| 48 +
 kernel/time/tick-sched.h|  3 +++
 5 files changed, 102 insertions(+)

diff --git a/fs/proc/stat.c b/fs/proc/stat.c
index 46b3293015fe..b27ccac7b5a4 100644
--- a/fs/proc/stat.c
+++ b/fs/proc/stat.c
@@ -28,7 +28,11 @@ static u64 get_idle_time(struct kernel_cpustat *kcs, int cpu)
u64 idle;
 
idle = kcs->cpustat[CPUTIME_IDLE];
+#ifdef CONFIG_SCHED_CORE
+   if (cpu_online(cpu) && !nr_iowait_cpu(cpu) && 
!cpu_rq(cpu)->core->core_forceidle)
+#else
if (cpu_online(cpu) && !nr_iowait_cpu(cpu))
+#endif
idle += arch_idle_time(cpu);
return idle;
 }
@@ -43,6 +47,17 @@ static u64 get_iowait_time(struct kernel_cpustat *kcs, int 
cpu)
return iowait;
 }
 
+#ifdef CONFIG_SCHED_CORE
+static u64 get_forceidle_time(struct kernel_cpustat *kcs, int cpu)
+{
+   u64 forceidle;
+
+   forceidle = kcs->cpustat[CPUTIME_FORCEIDLE];
+   if (cpu_online(cpu) && cpu_rq(cpu)->core->core_forceidle)
+   forceidle += arch_idle_time(cpu);
+   return forceidle;
+}
+#endif
 #else
 
 static u64 get_idle_time(struct kernel_cpustat *kcs, int cpu)
@@ -77,6 +92,21 @@ static u64 get_iowait_time(struct kernel_cpustat *kcs, int 
cpu)
return iowait;
 }
 
+static u64 get_forceidle_time(struct kernel_cpustat *kcs, int cpu)
+{
+   u64 forceidle, forceidle_usecs = -1ULL;
+
+   if (cpu_online(cpu))
+   forceidle_usecs = get_cpu_forceidle_time_us(cpu, NULL);
+
+   if (forceidle_usecs == -1ULL)
+   /* !NO_HZ or cpu offline so we can rely on cpustat.forceidle */
+   forceidle = kcs->cpustat[CPUTIME_FORCEIDLE];
+   else
+   forceidle = forceidle_usecs * NSEC_PER_USEC;
+
+   return forceidle;
+}
 #endif
 
 static void show_irq_gap(struct seq_file *p, unsigned int gap)
@@ -111,12 +141,18 @@ static int show_stat(struct seq_file *p, void *v)
u64 guest, guest_nice;
u64 sum = 0;
u64 sum_softirq = 0;
+#ifdef CONFIG_SCHED_CORE
+   u64 forceidle;
+#endif
unsigned int per_softirq_sums[NR_SOFTIRQS] = {0};
struct timespec64 boottime;
 
user = nice = system = idle = iowait =
irq = softirq = steal = 0;
guest = guest_nice = 0;
+#ifdef CONFIG_SCHED_CORE
+   forceidle = 0;
+#endif
 	getboottime64(&boottime);
 
for_each_possible_cpu(i) {
@@ -130,6 +166,9 @@ static int show_stat(struct seq_file *p, void *v)
system  += cpustat[CPUTIME_SYSTEM];
 idle        += get_idle_time(&kcpustat, i);
 iowait      += get_iowait_time(&kcpustat, i);
+#ifdef CONFIG_SCHED_CORE
+   forceidle   += get_forceidle_time(&kcpustat, i);
+#endif
irq += cpustat[CPUTIME_IRQ];
softirq += cpustat[CPUTIME_SOFTIRQ];
steal   += cpustat[CPUTIME_STEAL];
@@ -157,6 +196,9 @@ static int show_stat(struct seq_file *p, void *v)
seq_put_decimal_ull(p, " ", nsec_to_clock_t(steal));
seq_put_decimal_ull(p, " ", nsec_to_clock_t(guest));
seq_put_decimal_ull(p, " ", nsec_to_clock_t(guest_nice));
+#ifdef CONFIG_SCHED_CORE
+   seq_put_decimal_ull(p, " ", nsec_to_clock_t(forceidle));
+#endif
seq_putc(p, '\n');
 
for_each_online_cpu(i) {
@@ -171,6 +213,9 @@ static int show_stat(struct seq_file *p, void *v)
system  = cpustat[CPUTIME_SYSTEM];
 idle        = get_idle_time(&kcpustat, i);
 iowait      = get_iowait_time(&kcpustat, i);
+#ifdef CONFIG_SCHED_CORE
+   forceidle   = get_forceidle_time(&kcpustat, i);
+#endif
irq = cpustat[CPUTIME_IRQ];
softir

[RFC PATCH v3] sched/fair: select idle cpu from idle cpumask for task wakeup

2020-10-21 Thread Aubrey Li
From: Aubrey Li 

Added an idle cpumask to track idle cpus in the sched domain. When a
CPU enters idle, its corresponding bit in the idle cpumask will be set,
and when the CPU exits idle, its bit will be cleared.

When a task wakes up and selects an idle cpu, scanning the idle cpumask
has lower cost than scanning all the cpus in the last level cache
domain, especially when the system is heavily loaded.

v2->v3:
- change setting idle cpumask to every idle entry, otherwise schbench
  has a regression of 99th percentile latency.
- change clearing idle cpumask to nohz_balancer_kick(), so updating
  idle cpumask is ratelimited in the idle exiting path.
- set SCHED_IDLE cpu in idle cpumask to allow it as a wakeup target.

v1->v2:
- idle cpumask is updated in the nohz routines, by initializing idle
  cpumask with sched_domain_span(sd), nohz=off case remains the original
  behavior.

Cc: Mel Gorman 
Cc: Vincent Guittot 
Cc: Qais Yousef 
Cc: Valentin Schneider 
Cc: Jiang Biao 
Cc: Tim Chen 
Signed-off-by: Aubrey Li 
---
 include/linux/sched/topology.h | 13 ++
 kernel/sched/fair.c| 45 +-
 kernel/sched/idle.c|  1 +
 kernel/sched/sched.h   |  1 +
 kernel/sched/topology.c|  3 ++-
 5 files changed, 61 insertions(+), 2 deletions(-)

diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index fb11091129b3..43a641d26154 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -65,8 +65,21 @@ struct sched_domain_shared {
 atomic_t        ref;
 atomic_t        nr_busy_cpus;
int has_idle_cores;
+   /*
+* Span of all idle CPUs in this domain.
+*
+* NOTE: this field is variable length. (Allocated dynamically
+* by attaching extra space to the end of the structure,
+* depending on how many CPUs the kernel has booted up with)
+*/
+   unsigned long   idle_cpus_span[];
 };
 
+static inline struct cpumask *sds_idle_cpus(struct sched_domain_shared *sds)
+{
+   return to_cpumask(sds->idle_cpus_span);
+}
+
 struct sched_domain {
/* These fields must be setup */
struct sched_domain __rcu *parent;  /* top domain must be null 
terminated */
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 6b3b59cc51d6..088d1995594f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6023,6 +6023,38 @@ void __update_idle_core(struct rq *rq)
rcu_read_unlock();
 }
 
+static DEFINE_PER_CPU(bool, cpu_idle_state);
+/*
+ * Update cpu idle state and record this information
+ * in sd_llc_shared->idle_cpus_span.
+ */
+void update_idle_cpumask(struct rq *rq, bool idle_state)
+{
+   struct sched_domain *sd;
+   int cpu = cpu_of(rq);
+
+   /*
+* No need to update idle cpumask if the state
+* does not change.
+*/
+   if (per_cpu(cpu_idle_state, cpu) == idle_state)
+   return;
+
+   per_cpu(cpu_idle_state, cpu) = idle_state;
+
+   rcu_read_lock();
+
+   sd = rcu_dereference(per_cpu(sd_llc, cpu));
+   if (!sd || !sd->shared)
+   goto unlock;
+   if (idle_state)
+   cpumask_set_cpu(cpu, sds_idle_cpus(sd->shared));
+   else
+   cpumask_clear_cpu(cpu, sds_idle_cpus(sd->shared));
+unlock:
+   rcu_read_unlock();
+}
+
 /*
  * Scan the entire LLC domain for idle cores; this dynamically switches off if
  * there are no idle cores left in the system; tracked through
@@ -6136,7 +6168,12 @@ static int select_idle_cpu(struct task_struct *p, struct 
sched_domain *sd, int t
 
time = cpu_clock(this);
 
-   cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
+   /*
+* sched_domain_shared is set only at shared cache level,
+* this works only because select_idle_cpu is called with
+* sd_llc.
+*/
+   cpumask_and(cpus, sds_idle_cpus(sd->shared), p->cpus_ptr);
 
for_each_cpu_wrap(cpu, cpus, target) {
if (!--nr)
@@ -10070,6 +10107,12 @@ static void nohz_balancer_kick(struct rq *rq)
if (unlikely(rq->idle_balance))
return;
 
+   /* The CPU is not in idle, update idle cpumask */
+   if (unlikely(sched_idle_cpu(cpu))) {
+   /* Allow SCHED_IDLE cpu as a wakeup target */
+   update_idle_cpumask(rq, true);
+   } else
+   update_idle_cpumask(rq, false);
/*
 * We may be recently in ticked or tickless idle mode. At the first
 * busy tick after returning from idle, we will update the busy stats.
diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index 1ae95b9150d3..ce1f929d7fbb 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -405,6 +405,7 @@ static void put_prev_task_idle(struct rq *rq, struct 
task_struct *prev)
 static void set_next_task_idle(struct rq *rq, struct task_struct *next, bool 
firs

[RFC PATCH v2] sched/fair: select idle cpu from idle cpumask in sched domain

2020-09-15 Thread Aubrey Li
Added an idle cpumask to track idle cpus in the sched domain. When a
CPU enters idle, its corresponding bit in the idle cpumask will be set,
and when the CPU exits idle, its bit will be cleared.

When a task wakes up and selects an idle cpu, scanning the idle cpumask
has lower cost than scanning all the cpus in the last level cache
domain, especially when the system is heavily loaded.

The following benchmarks were tested on a x86 4 socket system with
24 cores per socket and 2 hyperthreads per core, total 192 CPUs:

uperf throughput: netperf workload, tcp_nodelay, r/w size = 90

  threads   baseline-avg    %std    patch-avg   %std
  96        1               1.24    0.98        2.76
  144       1               1.13    1.35        4.01
  192       1               0.58    1.67        3.25
  240       1               2.49    1.68        3.55

hackbench: process mode, 10 loops, 40 file descriptors per group

  group     baseline-avg    %std    patch-avg   %std
  2(80)     1               12.05   0.97        9.88
  3(120)    1               12.48   0.95        11.62
  4(160)    1               13.83   0.97        13.22
  5(200)    1               2.76    1.01        2.94

schbench: 99th percentile latency, 16 workers per message thread

  mthread   baseline-avg    %std    patch-avg   %std
  6(96)     1               1.24    0.993       1.73
  9(144)    1               0.38    0.998       0.39
  12(192)   1               1.58    0.995       1.64
  15(240)   1               51.71   0.606       37.41

sysbench mysql throughput: read/write, table size = 10,000,000

  thread    baseline-avg    %std    patch-avg   %std
  96        1               1.77    1.015       1.71
  144       1               3.39    0.998       4.05
  192       1               2.88    1.002       2.81
  240       1               2.07    1.011       2.09

kbuild: kexec reboot every time

  baseline-avg  patch-avg
  1 1

v1->v2:
- idle cpumask is updated in the nohz routines, by initializing idle
  cpumask with sched_domain_span(sd), nohz=off case remains the original
  behavior.

Cc: Qais Yousef 
Cc: Valentin Schneider 
Cc: Jiang Biao 
Cc: Tim Chen 
Signed-off-by: Aubrey Li 
---
 include/linux/sched/topology.h | 13 +
 kernel/sched/fair.c|  9 -
 kernel/sched/topology.c|  3 ++-
 3 files changed, 23 insertions(+), 2 deletions(-)

diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index fb11091129b3..43a641d26154 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -65,8 +65,21 @@ struct sched_domain_shared {
atomic_tref;
atomic_tnr_busy_cpus;
int has_idle_cores;
+   /*
+* Span of all idle CPUs in this domain.
+*
+* NOTE: this field is variable length. (Allocated dynamically
+* by attaching extra space to the end of the structure,
+* depending on how many CPUs the kernel has booted up with)
+*/
+   unsigned long   idle_cpus_span[];
 };
 
+static inline struct cpumask *sds_idle_cpus(struct sched_domain_shared *sds)
+{
+   return to_cpumask(sds->idle_cpus_span);
+}
+
 struct sched_domain {
/* These fields must be setup */
struct sched_domain __rcu *parent;  /* top domain must be null 
terminated */
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 6b3b59cc51d6..cfe78fcf69da 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6136,7 +6136,12 @@ static int select_idle_cpu(struct task_struct *p, struct 
sched_domain *sd, int t
 
time = cpu_clock(this);
 
-   cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
+   /*
+* sched_domain_shared is set only at shared cache level,
+* this works only because select_idle_cpu is called with
+* sd_llc.
+*/
+   cpumask_and(cpus, sds_idle_cpus(sd->shared), p->cpus_ptr);
 
for_each_cpu_wrap(cpu, cpus, target) {
if (!--nr)
@@ -10182,6 +10187,7 @@ static void set_cpu_sd_state_busy(int cpu)
sd->nohz_idle = 0;
 
 	atomic_inc(&sd->shared->nr_busy_cpus);
+   cpumask_clear_cpu(cpu, sds_idle_cpus(sd->shared));
 unlock:
rcu_read_unlock();
 }
@@ -10212,6 +10218,7 @@ static void set_cpu_sd_state_idle(int cpu)
sd->nohz_idle = 1;
 
 	atomic_dec(&sd->shared->nr_busy_cpus);
+   cpumask_set_cpu(cpu, sds_idle_cpus(sd->shared));
 unlock:
rcu_read_unlock();
 }
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 9079d865a935..f14a6ef4de57 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1407,6 +1407,7 @@ sd_init(struct sched_domain_topology_level *tl,
sd->shared = *per_cpu_ptr(sdd->sds, sd_id);
atomic_

[RFC PATCH v1 0/1] select idle cpu from idle cpumask in sched domain

2020-09-11 Thread Aubrey Li
I'm writing to see if it makes sense to track idle cpus in a shared
cpumask in the sched domain, so that when a task wakes up it can select
an idle cpu from this cpumask instead of scanning all the cpus in the
last level cache domain; especially when the system is heavily loaded,
the scanning cost could be significantly reduced. The price is that
atomic cpumask ops are added to the idle entry and exit paths.
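
In code terms the idea boils down to the following sketch, condensed
from the patch in this series (mark_cpu_idle() is an illustrative
helper; the actual patch hooks the existing set_cpu_sd_state_idle() and
set_cpu_sd_state_busy() nohz paths):

    /* Idle entry/exit side: one atomic cpumask op per state change. */
    static void mark_cpu_idle(int cpu, bool idle)
    {
            struct sched_domain *sd;

            rcu_read_lock();
            sd = rcu_dereference(per_cpu(sd_llc, cpu));
            if (sd && sd->shared) {
                    if (idle)
                            cpumask_set_cpu(cpu, sds_idle_cpus(sd->shared));   /* idle entry */
                    else
                            cpumask_clear_cpu(cpu, sds_idle_cpus(sd->shared)); /* idle exit */
            }
            rcu_read_unlock();
    }

    /* Wakeup side: scan only the CPUs currently marked idle in the LLC. */
    cpumask_and(cpus, sds_idle_cpus(sd->shared), p->cpus_ptr);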

I tested the following benchmarks on a x86 4 socket system with 24 cores per
socket and 2 hyperthreads per core, total 192 CPUs:

uperf throughput: netperf workload, tcp_nodelay, r/w size = 90

  threads   baseline-avg    %std    patch-avg   %std
  96        1               1.24    0.98        2.76
  144       1               1.13    1.35        4.01
  192       1               0.58    1.67        3.25
  240       1               2.49    1.68        3.55

hackbench: process mode, 10 loops, 40 file descriptors per group

  group     baseline-avg    %std    patch-avg   %std
  2(80)     1               12.05   0.97        9.88
  3(120)    1               12.48   0.95        11.62
  4(160)    1               13.83   0.97        13.22
  5(200)    1               2.76    1.01        2.94

schbench: 99th percentile latency, 16 workers per message thread

  mthread   baseline-avg    %std    patch-avg   %std
  6(96)     1               1.24    0.993       1.73
  9(144)    1               0.38    0.998       0.39
  12(192)   1               1.58    0.995       1.64
  15(240)   1               51.71   0.606       37.41

sysbench mysql throughput: read/write, table size = 10,000,000

  thread    baseline-avg    %std    patch-avg   %std
  96        1               1.77    1.015       1.71
  144       1               3.39    0.998       4.05
  192       1               2.88    1.002       2.81
  240       1               2.07    1.011       2.09

kbuild: kexec reboot every time

  baseline-avg  patch-avg
  1 1

Any suggestions are highly appreciated!

Thanks,
-Aubrey

Aubrey Li (1):
  sched/fair: select idle cpu from idle cpumask in sched domain

 include/linux/sched/topology.h | 13 +
 kernel/sched/fair.c|  4 +++-
 kernel/sched/topology.c|  2 +-
 3 files changed, 17 insertions(+), 2 deletions(-)

-- 
2.25.1



[RFC PATCH v1 1/1] sched/fair: select idle cpu from idle cpumask in sched domain

2020-09-11 Thread Aubrey Li
Add an idle cpumask to track idle cpus in the sched domain. When a CPU
enters idle, its corresponding bit in the idle cpumask is set, and when
the CPU exits idle, its bit is cleared.

When a task wakes up and selects an idle cpu, scanning the idle cpumask
has a lower cost than scanning all the cpus in the last level cache
domain, especially when the system is heavily loaded.

Signed-off-by: Aubrey Li 
---
 include/linux/sched/topology.h | 13 +
 kernel/sched/fair.c|  4 +++-
 kernel/sched/topology.c|  2 +-
 3 files changed, 17 insertions(+), 2 deletions(-)

diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index fb11091129b3..43a641d26154 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -65,8 +65,21 @@ struct sched_domain_shared {
	atomic_t	ref;
	atomic_t	nr_busy_cpus;
int has_idle_cores;
+   /*
+* Span of all idle CPUs in this domain.
+*
+* NOTE: this field is variable length. (Allocated dynamically
+* by attaching extra space to the end of the structure,
+* depending on how many CPUs the kernel has booted up with)
+*/
+   unsigned long   idle_cpus_span[];
 };
 
+static inline struct cpumask *sds_idle_cpus(struct sched_domain_shared *sds)
+{
+   return to_cpumask(sds->idle_cpus_span);
+}
+
 struct sched_domain {
/* These fields must be setup */
	struct sched_domain __rcu *parent;	/* top domain must be null terminated */
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 6b3b59cc51d6..3b6f8a3589be 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6136,7 +6136,7 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
 
time = cpu_clock(this);
 
-   cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
+   cpumask_and(cpus, sds_idle_cpus(sd->shared), p->cpus_ptr);
 
for_each_cpu_wrap(cpu, cpus, target) {
if (!--nr)
@@ -10182,6 +10182,7 @@ static void set_cpu_sd_state_busy(int cpu)
sd->nohz_idle = 0;
 
	atomic_inc(&sd->shared->nr_busy_cpus);
+   cpumask_clear_cpu(cpu, sds_idle_cpus(sd->shared));
 unlock:
rcu_read_unlock();
 }
@@ -10212,6 +10213,7 @@ static void set_cpu_sd_state_idle(int cpu)
sd->nohz_idle = 1;
 
	atomic_dec(&sd->shared->nr_busy_cpus);
+   cpumask_set_cpu(cpu, sds_idle_cpus(sd->shared));
 unlock:
rcu_read_unlock();
 }
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 9079d865a935..92d0aeef86bf 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1769,7 +1769,7 @@ static int __sdt_alloc(const struct cpumask *cpu_map)
 
*per_cpu_ptr(sdd->sd, j) = sd;
 
-   sds = kzalloc_node(sizeof(struct sched_domain_shared),
+	sds = kzalloc_node(sizeof(struct sched_domain_shared) + cpumask_size(),
GFP_KERNEL, cpu_to_node(j));
if (!sds)
return -ENOMEM;
-- 
2.25.1



Re: [RFC PATCH 11/16] sched: migration changes for core scheduling(Internet mail)

2020-07-23 Thread Aubrey Li
On Thu, Jul 23, 2020 at 4:28 PM benbjiang(蒋彪)  wrote:
>
> Hi,
>
> > On Jul 23, 2020, at 4:06 PM, Li, Aubrey  wrote:
> >
> > On 2020/7/23 15:47, benbjiang(蒋彪) wrote:
> >> Hi,
> >>
> >>> On Jul 23, 2020, at 1:39 PM, Li, Aubrey  wrote:
> >>>
> >>> On 2020/7/23 12:23, benbjiang(蒋彪) wrote:
> >>>> Hi,
> >>>>> On Jul 23, 2020, at 11:35 AM, Li, Aubrey  
> >>>>> wrote:
> >>>>>
> >>>>> On 2020/7/23 10:42, benbjiang(蒋彪) wrote:
> >>>>>> Hi,
> >>>>>>
> >>>>>>> On Jul 23, 2020, at 9:57 AM, Li, Aubrey  
> >>>>>>> wrote:
> >>>>>>>
> >>>>>>> On 2020/7/22 22:32, benbjiang(蒋彪) wrote:
> >>>>>>>> Hi,
> >>>>>>>>
> >>>>>>>>> On Jul 22, 2020, at 8:13 PM, Li, Aubrey  
> >>>>>>>>> wrote:
> >>>>>>>>>
> >>>>>>>>> On 2020/7/22 16:54, benbjiang(蒋彪) wrote:
> >>>>>>>>>> Hi, Aubrey,
> >>>>>>>>>>
> >>>>>>>>>>> On Jul 1, 2020, at 5:32 AM, Vineeth Remanan Pillai 
> >>>>>>>>>>>  wrote:
> >>>>>>>>>>>
> >>>>>>>>>>> From: Aubrey Li 
> >>>>>>>>>>>
> >>>>>>>>>>> - Don't migrate if there is a cookie mismatch
> >>>>>>>>>>> Load balance tries to move task from busiest CPU to the
> >>>>>>>>>>> destination CPU. When core scheduling is enabled, if the
> >>>>>>>>>>> task's cookie does not match with the destination CPU's
> >>>>>>>>>>> core cookie, this task will be skipped by this CPU. This
> >>>>>>>>>>> mitigates the forced idle time on the destination CPU.
> >>>>>>>>>>>
> >>>>>>>>>>> - Select cookie matched idle CPU
> >>>>>>>>>>> In the fast path of task wakeup, select the first cookie matched
> >>>>>>>>>>> idle CPU instead of the first idle CPU.
> >>>>>>>>>>>
> >>>>>>>>>>> - Find cookie matched idlest CPU
> >>>>>>>>>>> In the slow path of task wakeup, find the idlest CPU whose core
> >>>>>>>>>>> cookie matches with task's cookie
> >>>>>>>>>>>
> >>>>>>>>>>> - Don't migrate task if cookie not match
> >>>>>>>>>>> For the NUMA load balance, don't migrate task to the CPU whose
> >>>>>>>>>>> core cookie does not match with task's cookie
> >>>>>>>>>>>
> >>>>>>>>>>> Signed-off-by: Aubrey Li 
> >>>>>>>>>>> Signed-off-by: Tim Chen 
> >>>>>>>>>>> Signed-off-by: Vineeth Remanan Pillai 
> >>>>>>>>>>> ---
> >>>>>>>>>>> kernel/sched/fair.c  | 64 
> >>>>>>>>>>> 
> >>>>>>>>>>> kernel/sched/sched.h | 29 
> >>>>>>>>>>> 2 files changed, 88 insertions(+), 5 deletions(-)
> >>>>>>>>>>>
> >>>>>>>>>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> >>>>>>>>>>> index d16939766361..33dc4bf01817 100644
> >>>>>>>>>>> --- a/kernel/sched/fair.c
> >>>>>>>>>>> +++ b/kernel/sched/fair.c
> >>>>>>>>>>> @@ -2051,6 +2051,15 @@ static void task_numa_find_cpu(struct 
> >>>>>>>>>>> task_numa_env *env,
> >>>>>>>>>>> if (!cpumask_test_cpu(cpu, env->p->cpus_ptr))
> >>>>>>>>>>> continue;
> >>>>>>>>>>>
> >>>>>>>>>>> +#ifdef CONFIG_SCHED_CORE
> >>>>>>>>>>> +   /*
> >>>>>>

Re: [RFC PATCH 14/16] irq: Add support for core-wide protection of IRQ and softirq

2020-07-10 Thread Aubrey Li
On Fri, Jul 10, 2020 at 9:36 PM Vineeth Remanan Pillai
 wrote:
>
> Hi Aubrey,
>
> On Fri, Jul 10, 2020 at 8:19 AM Li, Aubrey  wrote:
> >
> > Hi Joel/Vineeth,
> > [...]
> > The problem is gone when we reverted this patch. We are running multiple
> > uperf threads(equal to cpu number) in a cgroup with coresched enabled.
> > This is 100% reproducible on our side.
> >
> > Just wonder if anything already known before we dig into it.
> >
> Thanks for reporting this. We haven't seen any lockups like this
> in our testing yet.

This is reproducible on a bare metal machine. We tried to reproduce it
on an 8-cpu KVM VM but failed.

> Could you please add more information on how to reproduce this?
> Was it a simple uperf run without any options or was it running any
> specific kind of network test?

I put our scripts at here:
https://github.com/aubreyli/uperf

>
> We shall also try to reproduce this and investigate.

I'll try to see if I can narrow down the test case and grab some logs
next week.

Thanks,
-Aubrey


Re: [RFC PATCH v3 00/16] Core scheduling v3

2019-09-25 Thread Aubrey Li
On Thu, Sep 26, 2019 at 1:24 AM Tim Chen  wrote:
>
> On 9/24/19 7:40 PM, Aubrey Li wrote:
> > On Sat, Sep 7, 2019 at 2:30 AM Tim Chen  wrote:
> >> +static inline s64 core_sched_imbalance_delta(int src_cpu, int dst_cpu,
> >> +   int src_sibling, int dst_sibling,
> >> +   struct task_group *tg, u64 task_load)
> >> +{
> >> +   struct sched_entity *se, *se_sibling, *dst_se, *dst_se_sibling;
> >> +   s64 excess, deficit, old_mismatch, new_mismatch;
> >> +
> >> +   if (src_cpu == dst_cpu)
> >> +   return -1;
> >> +
> >> +   /* XXX SMT4 will require additional logic */
> >> +
> >> +   se = tg->se[src_cpu];
> >> +   se_sibling = tg->se[src_sibling];
> >> +
> >> +   excess = se->avg.load_avg - se_sibling->avg.load_avg;
> >> +   if (src_sibling == dst_cpu) {
> >> +   old_mismatch = abs(excess);
> >> +   new_mismatch = abs(excess - 2*task_load);
> >> +   return old_mismatch - new_mismatch;
> >> +   }
> >> +
> >> +   dst_se = tg->se[dst_cpu];
> >> +   dst_se_sibling = tg->se[dst_sibling];
> >> +   deficit = dst_se->avg.load_avg - dst_se_sibling->avg.load_avg;
> >> +
> >> +   old_mismatch = abs(excess) + abs(deficit);
> >> +   new_mismatch = abs(excess - (s64) task_load) +
> >> +  abs(deficit + (s64) task_load);
> >
> > If I understood correctly, these formulas made an assumption that the task
> > being moved to the destination is matched the destination's core cookie.
>
> That's not the case.  We do not need to match the destination's core cookie,

I actually meant destination core's core cookie.

> as that may change after context switches. It needs to reduce the load 
> mismatch
> with the destination CPU's sibling for that cgroup.

So the new_mismatch does not always hold, especially when there are more
cgroups and more core cookies on the system.

Thanks,
-Aubrey


Re: [RFC PATCH v3 00/16] Core scheduling v3

2019-09-24 Thread Aubrey Li
On Sat, Sep 7, 2019 at 2:30 AM Tim Chen  wrote:
> +static inline s64 core_sched_imbalance_delta(int src_cpu, int dst_cpu,
> +   int src_sibling, int dst_sibling,
> +   struct task_group *tg, u64 task_load)
> +{
> +   struct sched_entity *se, *se_sibling, *dst_se, *dst_se_sibling;
> +   s64 excess, deficit, old_mismatch, new_mismatch;
> +
> +   if (src_cpu == dst_cpu)
> +   return -1;
> +
> +   /* XXX SMT4 will require additional logic */
> +
> +   se = tg->se[src_cpu];
> +   se_sibling = tg->se[src_sibling];
> +
> +   excess = se->avg.load_avg - se_sibling->avg.load_avg;
> +   if (src_sibling == dst_cpu) {
> +   old_mismatch = abs(excess);
> +   new_mismatch = abs(excess - 2*task_load);
> +   return old_mismatch - new_mismatch;
> +   }
> +
> +   dst_se = tg->se[dst_cpu];
> +   dst_se_sibling = tg->se[dst_sibling];
> +   deficit = dst_se->avg.load_avg - dst_se_sibling->avg.load_avg;
> +
> +   old_mismatch = abs(excess) + abs(deficit);
> +   new_mismatch = abs(excess - (s64) task_load) +
> +  abs(deficit + (s64) task_load);

If I understood correctly, these formulas assume that the task being moved
to the destination matches the destination core's cookie. So if the task
does not match the dst core's cookie and still has to stay on the runqueue,
the formula is no longer correct.
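
For illustration, plugging made-up numbers into the quoted formula shows why
that assumption matters (a sketch, not from the original patch):

	/* hypothetical values */
	s64 excess = 100, deficit = -100, task_load = 60;

	s64 old_mismatch = abs(excess) + abs(deficit);			/* 200 */
	s64 new_mismatch = abs(excess - (s64)task_load) +
			   abs(deficit + (s64)task_load);		/* 40 + 40 = 80 */

	/*
	 * The delta (200 - 80 = 120) suggests the migration helps, but that
	 * only holds if the moved task can actually run on dst_cpu. If its
	 * cookie does not match the dst core's cookie, the task just waits
	 * on dst's runqueue and the real per-sibling load does not change
	 * as the formula predicts.
	 */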

>  /**
>   * update_sg_lb_stats - Update sched_group's statistics for load balancing.
>   * @env: The load balancing environment.
> @@ -8345,7 +8492,8 @@ static inline void update_sg_lb_stats(struct lb_env 
> *env,
> else
> load = source_load(i, load_idx);
>
> -   sgs->group_load += load;

Why is this load update line removed?

> +   core_sched_imbalance_scan(sgs, i, env->dst_cpu);
> +
> sgs->group_util += cpu_util(i);
> sgs->sum_nr_running += rq->cfs.h_nr_running;
>

Thanks,
-Aubrey


Re: [RFC PATCH v3 00/16] Core scheduling v3

2019-09-18 Thread Aubrey Li
On Thu, Sep 19, 2019 at 4:41 AM Tim Chen  wrote:
>
> On 9/17/19 6:33 PM, Aubrey Li wrote:
> > On Sun, Sep 15, 2019 at 10:14 PM Aaron Lu  
> > wrote:
>
> >>
> >> And I have pushed Tim's branch to:
> >> https://github.com/aaronlu/linux coresched-v3-v5.1.5-test-tim
> >>
> >> Mine:
> >> https://github.com/aaronlu/linux coresched-v3-v5.1.5-test-core_vruntime
>
>
> Aubrey,
>
> Thanks for testing with your set up.
>
> I think the test that's of interest is to see my load balancing added on top
> of Aaron's fairness patch, instead of using my previous version of
> forced idle approach in coresched-v3-v5.1.5-test-tim branch.
>

I'm trying to figure out a way to solve fairness only (not including task
placement). So @Vineeth - if everyone is okay with Aaron's fairness patch,
maybe we should have a v4?

Thanks,
-Aubrey


Re: [RFC PATCH v3 00/16] Core scheduling v3

2019-09-17 Thread Aubrey Li
On Sun, Sep 15, 2019 at 10:14 PM Aaron Lu  wrote:
>
> On Fri, Sep 13, 2019 at 07:12:52AM +0800, Aubrey Li wrote:
> > On Thu, Sep 12, 2019 at 8:04 PM Aaron Lu  wrote:
> > >
> > > On Wed, Sep 11, 2019 at 09:19:02AM -0700, Tim Chen wrote:
> > > > On 9/11/19 7:02 AM, Aaron Lu wrote:
> > > > I think Julien's result show that my patches did not do as well as
> > > > your patches for fairness. Aubrey did some other testing with the same
> > > > conclusion.  So I think keeping the forced idle time balanced is not
> > > > enough for maintaining fairness.
> > >
> > > Well, I have done following tests:
> > > 1 Julien's test script: https://paste.debian.net/plainh/834cf45c
> > > 2 start two tagged will-it-scale/page_fault1, see how each performs;
> > > 3 Aubrey's mysql test: https://github.com/aubreyli/coresched_bench.git
> > >
> > > They all show your patchset performs equally well...And consider what
> > > the patch does, I think they are really doing the same thing in
> > > different ways.
> >
> > It looks like we are not on the same page, if you don't mind, can both of
> > you rebase your patchset onto v5.3-rc8 and provide a public branch so I
> > can fetch and test it at least by my benchmark?
>
> I'm using the following branch as base which is v5.1.5 based:
> https://github.com/digitalocean/linux-coresched coresched-v3-v5.1.5-test
>
> And I have pushed Tim's branch to:
> https://github.com/aaronlu/linux coresched-v3-v5.1.5-test-tim
>
> Mine:
> https://github.com/aaronlu/linux coresched-v3-v5.1.5-test-core_vruntime
>
> The two branches both have two patches I have sent previouslly:
> https://lore.kernel.org/lkml/20190810141556.GA73644@aaronlu/
> Although it has some potential performance loss as pointed out by
> Vineeth, I haven't got time to rework it yet.

In terms of these two branches, we tested two cases:

1) 32 AVX threads and 32 mysql threads on one core(2 HT)
2) 192 AVX threads and 192 mysql threads on 96 cores(192 HTs)

For case 1), we saw the two branches are on par

Branch: coresched-v3-v5.1.5-test-core_vruntime
- Avg throughput: 1865.62 (std: 20.6%)
- Avg latency: 26.43 (std: 8.3%)

Branch: coresched-v3-v5.1.5-test-tim
- Avg throughput: 1804.88 (std: 20.1%)
- Avg latency: 29.78 (std: 11.8%)

For case 2), we saw core vruntime performs better than counting forced
idle time

Branch: coresched-v3-v5.1.5-test-core_vruntime
- Avg throughput: 5528.56 (std: 44.2%)
- Avg latency: 165.99 (std: 45.2%)

Branch: coresched-v3-v5.1.5-test-tim
- Avg throughput: 3842.33 (std: 35.1%)
- Avg latency: 306.99 (std: 72.9%)

As Aaron pointed out, vruntime takes the se's weight into account, which
could be a reason for the difference.

So should we go with core vruntime approach?
Or Tim - do you want to improve forced idle time approach?

Thanks,
-Aubrey


Re: [RFC PATCH v3 00/16] Core scheduling v3

2019-09-12 Thread Aubrey Li
On Thu, Sep 12, 2019 at 8:04 PM Aaron Lu  wrote:
>
> On Wed, Sep 11, 2019 at 09:19:02AM -0700, Tim Chen wrote:
> > On 9/11/19 7:02 AM, Aaron Lu wrote:
> > I think Julien's result show that my patches did not do as well as
> > your patches for fairness. Aubrey did some other testing with the same
> > conclusion.  So I think keeping the forced idle time balanced is not
> > enough for maintaining fairness.
>
> Well, I have done following tests:
> 1 Julien's test script: https://paste.debian.net/plainh/834cf45c
> 2 start two tagged will-it-scale/page_fault1, see how each performs;
> 3 Aubrey's mysql test: https://github.com/aubreyli/coresched_bench.git
>
> They all show your patchset performs equally well...And consider what
> the patch does, I think they are really doing the same thing in
> different ways.

It looks like we are not on the same page. If you don't mind, can both of
you rebase your patchsets onto v5.3-rc8 and provide a public branch, so I
can fetch and test them, at least with my benchmark?

Thanks,
-Aubrey


Re: [RFC PATCH v3 00/16] Core scheduling v3

2019-08-27 Thread Aubrey Li
On Wed, Aug 28, 2019 at 5:14 AM Matthew Garrett  wrote:
>
> Apple have provided a sysctl that allows applications to indicate that
> specific threads should make use of core isolation while allowing
> the rest of the system to make use of SMT, and browsers (Safari, Firefox
> and Chrome, at least) are now making use of this. Trying to do something
> similar using cgroups seems a bit awkward. Would something like this be
> reasonable? Having spoken to the Chrome team, I believe that the
> semantics we want are:
>
> 1) A thread to be able to indicate that it should not run on the same
> core as anything not in posession of the same cookie
> 2) Descendents of that thread to (by default) have the same cookie
> 3) No other thread be able to obtain the same cookie
> 4) Threads not be able to rejoin the global group (ie, threads can
> segregate themselves from their parent and peers, but can never rejoin
> that group once segregated)
>
> but don't know if that's what everyone else would want.
>
> diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
> index 094bb03b9cc2..5d411246d4d5 100644
> --- a/include/uapi/linux/prctl.h
> +++ b/include/uapi/linux/prctl.h
> @@ -229,4 +229,5 @@ struct prctl_mm_map {
>  # define PR_PAC_APDBKEY(1UL << 3)
>  # define PR_PAC_APGAKEY(1UL << 4)
>
> +#define PR_CORE_ISOLATE55
>  #endif /* _LINUX_PRCTL_H */
> diff --git a/kernel/sys.c b/kernel/sys.c
> index 12df0e5434b8..a054cfcca511 100644
> --- a/kernel/sys.c
> +++ b/kernel/sys.c
> @@ -2486,6 +2486,13 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, 
> arg2, unsigned long, arg3,
> return -EINVAL;
> error = PAC_RESET_KEYS(me, arg2);
> break;
> +   case PR_CORE_ISOLATE:
> +#ifdef CONFIG_SCHED_CORE
> +   current->core_cookie = (unsigned long)current;

Because AVX512 instructions could pull down the core frequency,
we also want to give a magic cookie number to all AVX512-using
tasks on the system, so they won't affect the performance/latency
of any other tasks.

This could be done by putting all AVX512 tasks into a cgroup, or by the
AVX512 detection that the following patch introduced:

https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git/commit/?id=2f7726f955572e587d5f50fbe9b2deed5334bd90
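
A rough sketch of what that could look like (the names below are made up for
illustration and are not existing kernel symbols):

	/* one well-known cookie shared by all AVX512-using tasks */
	#define AVX512_CORE_COOKIE	0xa512c001UL

	static void core_isolate_avx512(struct task_struct *p)
	{
		/*
		 * All AVX512 users get the same cookie, so they can only be
		 * co-scheduled with each other on SMT siblings and never pull
		 * down the frequency of unrelated tasks on the same core.
		 */
		p->core_cookie = AVX512_CORE_COOKIE;
	}

This could be driven either from the prctl() above or from the AVX512
detection referenced by the link.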

Thanks,
-Aubrey


Re: [PATCH] x86/apic: Handle missing global clockevent gracefully

2019-08-12 Thread Aubrey Li
On Mon, Aug 12, 2019 at 8:25 PM Thomas Gleixner  wrote:
>
> On Mon, 12 Aug 2019, Li, Aubrey wrote:
> > On 2019/8/9 20:54, Thomas Gleixner wrote:
> > > +   local_irq_disable();
> > > /*
> > >  * Setup the APIC counter to maximum. There is no way the lapic
> > >  * can underflow in the 100ms detection time frame
> > >  */
> > > __setup_APIC_LVTT(0x, 0, 0);
> > >
> > > -   /* Let the interrupts run */
> > > -   local_irq_enable();
> > > +   /*
> > > +* Methods to terminate the calibration loop:
> > > +*  1) Global clockevent if available (jiffies)
> > > +*  2) TSC if available and frequency is known
> > > +*/
> > > +   jif_start = READ_ONCE(jiffies);
> > > +
> > > +   if (tsc_khz) {
> > > +   tsc_start = rdtsc();
> > > +   tsc_perj = div_u64((u64)tsc_khz * 1000, HZ);
> > > +   }
> > > +
> > > +   while (lapic_cal_loops <= LAPIC_CAL_LOOPS) {
> >
> > Is this loop still meaningful, can we just invoke the handler twice
> > before and after the tick?
>
> And that solves what?
>

I meant, can we do this one time?
- lapic_cal_t1 = read APIC counter
- /* Wait for a tick to elapse */
- lapic_cal_t2 = read APIC counter

I'm not clear why we still need this loop; is it just to reuse the
existing lapic_cal_handler()?
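
Something like this is what I have in mind (untested sketch, just to
illustrate the question; the helper name is made up, and it assumes
interrupts are enabled during the wait so jiffies can advance, otherwise
the TSC based wait from the patch would be needed):

	/* measure how far the APIC counter runs down across exactly one tick */
	static u32 lapic_one_tick_delta(void)
	{
		unsigned long jif = READ_ONCE(jiffies);
		u32 t1, t2;

		/* align to a tick boundary first */
		while (READ_ONCE(jiffies) == jif)
			cpu_relax();

		t1 = apic_read(APIC_TMCCT);
		jif = READ_ONCE(jiffies);
		while (READ_ONCE(jiffies) == jif)
			cpu_relax();
		t2 = apic_read(APIC_TMCCT);

		return t1 - t2;	/* the APIC timer counts down */
	}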

Thanks,
-Aubrey


Re: [RFC PATCH v3 00/16] Core scheduling v3

2019-08-06 Thread Aubrey Li
On Tue, Aug 6, 2019 at 11:24 AM Aaron Lu  wrote:
>
> On Mon, Aug 05, 2019 at 08:55:28AM -0700, Tim Chen wrote:
> > On 8/2/19 8:37 AM, Julien Desfossez wrote:
> > > We tested both Aaron's and Tim's patches and here are our results.
> > >
> > > Test setup:
> > > - 2 1-thread sysbench, one running the cpu benchmark, the other one the
> > >   mem benchmark
> > > - both started at the same time
> > > - both are pinned on the same core (2 hardware threads)
> > > - 10 30-seconds runs
> > > - test script: https://paste.debian.net/plainh/834cf45c
> > > - only showing the CPU events/sec (higher is better)
> > > - tested 4 tag configurations:
> > >   - no tag
> > >   - sysbench mem untagged, sysbench cpu tagged
> > >   - sysbench mem tagged, sysbench cpu untagged
> > >   - both tagged with a different tag
> > > - "Alone" is the sysbench CPU running alone on the core, no tag
> > > - "nosmt" is both sysbench pinned on the same hardware thread, no tag
> > > - "Tim's full patchset + sched" is an experiment with Tim's patchset
> > >   combined with Aaron's "hack patch" to get rid of the remaining deep
> > >   idle cases
> > > - In all test cases, both tasks can run simultaneously (which was not
> > >   the case without those patches), but the standard deviation is a
> > >   pretty good indicator of the fairness/consistency.
> >
> > Thanks for testing the patches and giving such detailed data.
>
> Thanks Julien.
>
> > I came to realize that for my scheme, the accumulated deficit of forced 
> > idle could be wiped
> > out in one execution of a task on the forced idle cpu, with the update of 
> > the min_vruntime,
> > even if the execution time could be far less than the accumulated deficit.
> > That's probably one reason my scheme didn't achieve fairness.
>
> I've been thinking if we should consider core wide tenent fairness?
>
> Let's say there are 3 tasks on 2 threads' rq of the same core, 2 tasks
> (e.g. A1, A2) belong to tenent A and the 3rd B1 belong to another tenent
> B. Assume A1 and B1 are queued on the same thread and A2 on the other
> thread, when we decide priority for A1 and B1, shall we also consider
> A2's vruntime? i.e. shall we consider A1 and A2 as a whole since they
> belong to the same tenent? I tend to think we should make fairness per
> core per tenent, instead of per thread(cpu) per task(sched entity). What
> do you guys think?
>

I also think we need a way to make fairness per cookie per core. Is this
what you want to propose?

Thanks,
-Aubrey

> Implemention of the idea is a mess to me, as I feel I'm duplicating the
> existing per cpu per sched_entity enqueue/update vruntime/dequeue logic
> for the per core per tenent stuff.


Re: setup_boot_APIC_clock() NULL dereference during early boot on reduced hardware platforms

2019-08-01 Thread Aubrey Li
On Thu, Aug 1, 2019 at 3:35 PM Thomas Gleixner  wrote:
>
> On Thu, 1 Aug 2019, Aubrey Li wrote:
> > On Thu, Aug 1, 2019 at 2:26 PM Daniel Drake  wrote:
> > > global_clock_event is NULL here. This is a "reduced hardware" ACPI
> > > platform so acpi_generic_reduced_hw_init() has set timer_init to NULL,
> > > avoiding the usual codepaths that would set up global_clock_event.
> > >
> > IIRC, acpi_generic_reduced_hw_init() avoids initializing PIT, the status of
> > this legacy device is unknown in ACPI hw-reduced mode.
> >
> > > I tried the obvious:
> > >  if (!global_clock_event)
> > > return -1;
> > >
> > No, the platform needs a global clock event, can you turn on some other
>
> Wrong. The kernel boots perfectly fine without a global clock event. But
> for that the TSC and LAPIC frequency must be known.

I think LAPIC fast calibration is only supported on Intel platforms, while
Daniel's box is an AMD platform. That's why lapic_init_clockevent() failed
and fell into the code path which needs a global clock event.

Thanks,
-Aubrey


Re: setup_boot_APIC_clock() NULL dereference during early boot on reduced hardware platforms

2019-08-01 Thread Aubrey Li
On Thu, Aug 1, 2019 at 2:26 PM Daniel Drake  wrote:
>
> Hi,
>
> Working with a new consumer laptop based on AMD R7-3700U, we are
> seeing a kernel panic during early boot (before the display
> initializes). It's a new product and there is no previous known
> working kernel version (tested 5.0, 5.2 and current linus master).
>
> We may have also seen this problem on a MiniPC based on AMD APU 7010
> from another vendor, but we don't have it in hands right now to
> confirm that it's the exact same crash.
>
> earlycon shows the details: a NULL dereference under
> setup_boot_APIC_clock(), which actually happens in
> calibrate_APIC_clock():
>
> /* Replace the global interrupt handler */
> real_handler = global_clock_event->event_handler;
> global_clock_event->event_handler = lapic_cal_handler;
>
> global_clock_event is NULL here. This is a "reduced hardware" ACPI
> platform so acpi_generic_reduced_hw_init() has set timer_init to NULL,
> avoiding the usual codepaths that would set up global_clock_event.
>
IIRC, acpi_generic_reduced_hw_init() avoids initializing the PIT, since the
status of this legacy device is unknown in ACPI hw-reduced mode.

> I tried the obvious:
>  if (!global_clock_event)
> return -1;
>
No, the platform needs a global clock event. Can you turn on some other
clock source on your platform, like HPET?

Thanks,
-Aubrey

> However I'm probably missing part of the big picture here, as this
> only makes boot fail later on. It continues til the next point that
> something leads to schedule(), such as a driver calling msleep() or
> mark_readonly() calling rcu_barrier(), etc. Then it hangs.
>
> Is something missing in terms of timer setup here? Suggestions appreciated...
>
> Thanks
> Daniel


Re: [RFC PATCH v3 00/16] Core scheduling v3

2019-07-22 Thread Aubrey Li
On Mon, Jul 22, 2019 at 6:43 PM Aaron Lu  wrote:
>
> On 2019/7/22 18:26, Aubrey Li wrote:
> > The granularity period of util_avg seems too large to decide task priority
> > during pick_task(), at least it is in my case, cfs_prio_less() always picked
> > core max task, so pick_task() eventually picked idle, which causes this 
> > change
> > not very helpful for my case.
> >
> >  -0 [057] dN..83.716973: __schedule: max: sysbench/2578
> > 889050f68600
> >  -0 [057] dN..83.716974: __schedule:
> > (swapper/5/0;140,0,0) ?< (mysqld/2511;119,1042118143,0)
> >  -0 [057] dN..83.716975: __schedule:
> > (sysbench/2578;119,96449836,0) ?< (mysqld/2511;119,1042118143,0)
> >  -0 [057] dN..83.716975: cfs_prio_less: picked
> > sysbench/2578 util_avg: 20 527 -507 <=== here===
> >  -0 [057] dN..83.716976: __schedule: pick_task cookie
> > pick swapper/5/0 889050f68600
>
> Can you share your setup of the test? I would like to try it locally.

My setup is a co-location of AVX512 tasks (gemmbench) and non-AVX512 tasks
(sysbench MySQL). Let me simplify it and send it offline.

Thanks,
-Aubrey


Re: [RFC PATCH v3 00/16] Core scheduling v3

2019-07-22 Thread Aubrey Li
On Thu, Jul 18, 2019 at 6:07 PM Aaron Lu  wrote:
>
> On Wed, Jun 19, 2019 at 02:33:02PM -0400, Julien Desfossez wrote:
> > On 17-Jun-2019 10:51:27 AM, Aubrey Li wrote:
> > > The result looks still unfair, and particularly, the variance is too high,
> >
> > I just want to confirm that I am also seeing the same issue with a
> > similar setup. I also tried with the priority boost fix we previously
> > posted, the results are slightly better, but we are still seeing a very
> > high variance.
> >
> > On average, the results I get for 10 30-seconds runs are still much
> > better than nosmt (both sysbench pinned on the same sibling) for the
> > memory benchmark, and pretty similar for the CPU benchmark, but the high
> > variance between runs is indeed concerning.
>
> I was thinking to use util_avg signal to decide which task win in
> __prio_less() in the cross cpu case. The reason util_avg is chosen
> is because it represents how cpu intensive the task is, so the end
> result is, less cpu intensive task will preempt more cpu intensive
> task.
>
> Here is the test I have done to see how util_avg works
> (on a single node, 16 cores, 32 cpus vm):
> 1 Start tmux and then start 3 windows with each running bash;
> 2 Place two shells into two different cgroups and both have cpu.tag set;
> 3 Switch to the 1st tmux window, start
>   will-it-scale/page_fault1_processes -t 16 -s 30
>   in the first tagged shell;
> 4 Switch to the 2nd tmux window;
> 5 Start
>   will-it-scale/page_fault1_processes -t 16 -s 30
>   in the 2nd tagged shell;
> 6 Switch to the 3rd tmux window;
> 7 Do some simple things in the 3rd untagged shell like ls to see if
>   untagged task is able to proceed;
> 8 Wait for the two page_fault workloads to finish.
>
> With v3 here, I can not do step 4 and later steps, i.e. the 16
> page_fault1 processes started in step 3 will occupy all 16 cores and
> other tasks do not have a chance to run, including tmux, which made
> switching tmux window impossible.
>
> With the below patch on top of v3 that makes use of util_avg to decide
> which task win, I can do all 8 steps and the final scores of the 2
> workloads are: 1796191 and 2199586. The score number are not close,
> suggesting some unfairness, but I can finish the test now...
>
> Here is the diff(consider it as a POC):
>
> ---
>  kernel/sched/core.c  | 35 ++-
>  kernel/sched/fair.c  | 36 
>  kernel/sched/sched.h |  2 ++
>  3 files changed, 40 insertions(+), 33 deletions(-)
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 26fea68f7f54..7557a7bbb481 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -105,25 +105,8 @@ static inline bool prio_less(struct task_struct *a, 
> struct task_struct *b)
> if (pa == -1) /* dl_prio() doesn't work because of stop_class above */
> return !dl_time_before(a->dl.deadline, b->dl.deadline);
>
> -   if (pa == MAX_RT_PRIO + MAX_NICE)  { /* fair */
> -   u64 a_vruntime = a->se.vruntime;
> -   u64 b_vruntime = b->se.vruntime;
> -
> -   /*
> -* Normalize the vruntime if tasks are in different cpus.
> -*/
> -   if (task_cpu(a) != task_cpu(b)) {
> -   b_vruntime -= task_cfs_rq(b)->min_vruntime;
> -   b_vruntime += task_cfs_rq(a)->min_vruntime;
> -
> -   trace_printk("(%d:%Lu,%Lu,%Lu) <> (%d:%Lu,%Lu,%Lu)\n",
> -a->pid, a_vruntime, a->se.vruntime, 
> task_cfs_rq(a)->min_vruntime,
> -b->pid, b_vruntime, b->se.vruntime, 
> task_cfs_rq(b)->min_vruntime);
> -
> -   }
> -
> -   return !((s64)(a_vruntime - b_vruntime) <= 0);
> -   }
> +   if (pa == MAX_RT_PRIO + MAX_NICE) /* fair */
> +   return cfs_prio_less(a, b);
>
> return false;
>  }
> @@ -3663,20 +3646,6 @@ pick_task(struct rq *rq, const struct sched_class 
> *class, struct task_struct *ma
> if (!class_pick)
> return NULL;
>
> -   if (!cookie) {
> -   /*
> -* If class_pick is tagged, return it only if it has
> -* higher priority than max.
> -*/
> -   bool max_is_higher = sched_feat(CORESCHED_STALL_FIX) ?
> -max && !prio_less(max, class_pick) :
> -max && prio_less(class_pick, max);
> -   if (class_pick->

Re: [RFC PATCH v3 00/16] Core scheduling v3

2019-07-19 Thread Aubrey Li
On Fri, Jul 19, 2019 at 1:53 PM Aaron Lu  wrote:
>
> On Thu, Jul 18, 2019 at 04:27:19PM -0700, Tim Chen wrote:
> >
> >
> > On 7/18/19 3:07 AM, Aaron Lu wrote:
> > > On Wed, Jun 19, 2019 at 02:33:02PM -0400, Julien Desfossez wrote:
> >
> > >
> > > With the below patch on top of v3 that makes use of util_avg to decide
> > > which task win, I can do all 8 steps and the final scores of the 2
> > > workloads are: 1796191 and 2199586. The score number are not close,
> > > suggesting some unfairness, but I can finish the test now...
> >
> > Aaron,
> >
> > Do you still see high variance in terms of workload throughput that
> > was a problem with the previous version?
>
> Any suggestion how to measure this?
> It's not clear how Aubrey did his test, will need to take a look at
> sysbench.
>

Well, thanks for posting this at the end of my vacation ;)
I'll go back to the office next week and give it a shot.
I actually have a new setup co-locating AVX512 tasks with
sysbench MySQL. Both throughput and latency were unacceptable
on top of v3. Looking forward to seeing the difference with
this patch.

Thanks,
-Aubrey


Re: [RFC PATCH v3 00/16] Core scheduling v3

2019-06-16 Thread Aubrey Li
On Thu, Jun 13, 2019 at 11:22 AM Julien Desfossez
 wrote:
>
> On 12-Jun-2019 05:03:08 PM, Subhra Mazumdar wrote:
> >
> > On 6/12/19 9:33 AM, Julien Desfossez wrote:
> > >After reading more traces and trying to understand why only untagged
> > >tasks are starving when there are cpu-intensive tasks running on the
> > >same set of CPUs, we noticed a difference in behavior in ‘pick_task’. In
> > >the case where ‘core_cookie’ is 0, we are supposed to only prefer the
> > >tagged task if it’s priority is higher, but when the priorities are
> > >equal we prefer it as well which causes the starving. ‘pick_task’ is
> > >biased toward selecting its first parameter in case of equality which in
> > >this case was the ‘class_pick’ instead of ‘max’. Reversing the order of
> > >the parameter solves this issue and matches the expected behavior.
> > >
> > >So we can get rid of this vruntime_boost concept.
> > >
> > >We have tested the fix below and it seems to work well with
> > >tagged/untagged tasks.
> > >
> > My 2 DB instance runs with this patch are better with CORESCHED_STALL_FIX
> > than NO_CORESCHED_STALL_FIX in terms of performance, std deviation and
> > idleness. May be enable it by default?
>
> Yes if the fix is approved, we will just remove the option and it will
> always be enabled.
>

sysbench --report-interval option unveiled something.

benchmark setup
-
two cgroups, cpuset.cpus = 1, 53(one core, two siblings)
sysbench cpu mode, one thread in cgroup1
sysbench memory mode, one thread in cgroup2

no core scheduling
--
cpu throughput eps: 405.8, std: 0.14%
mem bandwidth MB/s: 5785.7, std: 0.11%

cgroup1 enable core scheduling(cpu mode)
cgroup2 disable core scheduling(memory mode)
-
cpu throughput eps: 8.7, std: 519.2%
mem bandwidth MB/s: 6263.2, std: 9.3%

cgroup1 disable core scheduling(cpu mode)
cgroup2 enable core scheduling(memory mode)
-
cpu throughput eps: 468.0 , std: 8.7%
mem bandwidth MB/S: 311.6 , std: 169.1%

cgroup1 enable core scheduling(cpu mode)
cgroup2 enable core scheduling(memory mode)

cpu throughput eps: 76.4 , std: 168.0%
mem bandwidth MB/S: 5388.3 , std: 30.9%

The result still looks unfair, and in particular the variance is too high.
sysbench cpu log:
snip
[ 10s ] thds: 1 eps: 296.00 lat (ms,95%): 2.03
[ 11s ] thds: 1 eps: 0.00 lat (ms,95%): 1170.65
[ 12s ] thds: 1 eps: 1.00 lat (ms,95%): 0.00
[ 13s ] thds: 1 eps: 0.00 lat (ms,95%): 0.00
[ 14s ] thds: 1 eps: 295.91 lat (ms,95%): 2.03
[ 15s ] thds: 1 eps: 1.00 lat (ms,95%): 170.48
[ 16s ] thds: 1 eps: 0.00 lat (ms,95%): 2009.23
[ 17s ] thds: 1 eps: 1.00 lat (ms,95%): 995.51
[ 18s ] thds: 1 eps: 296.00 lat (ms,95%): 2.03
[ 19s ] thds: 1 eps: 1.00 lat (ms,95%): 170.48
[ 20s ] thds: 1 eps: 0.00 lat (ms,95%): 2009.23
snip

Thanks,
-Aubrey


[tip:x86/core] Documentation/filesystems/proc.txt: Add arch_status file

2019-06-12 Thread tip-bot for Aubrey Li
Commit-ID:  711486fd18596315d42cebaac3dba8c408f60a3d
Gitweb: https://git.kernel.org/tip/711486fd18596315d42cebaac3dba8c408f60a3d
Author: Aubrey Li 
AuthorDate: Thu, 6 Jun 2019 09:22:36 +0800
Committer:  Thomas Gleixner 
CommitDate: Wed, 12 Jun 2019 11:42:13 +0200

Documentation/filesystems/proc.txt: Add arch_status file

Add documentation for /proc//arch_status file and the x86 specific
AVX512_elapsed_ms entry in it.

[ tglx: Massage changelog ]

Signed-off-by: Aubrey Li 
Signed-off-by: Thomas Gleixner 
Cc: a...@linux-foundation.org
Cc: pet...@infradead.org
Cc: h...@zytor.com
Cc: a...@linux.intel.com
Cc: tim.c.c...@linux.intel.com
Cc: dave.han...@intel.com
Cc: ar...@linux.intel.com
Cc: adobri...@gmail.com
Cc: aubrey...@intel.com
Cc: linux-...@vger.kernel.org
Cc: Andy Lutomirski 
Cc: Peter Zijlstra 
Cc: Andi Kleen 
Cc: Tim Chen 
Cc: Dave Hansen 
Cc: Arjan van de Ven 
Cc: Alexey Dobriyan 
Cc: Andrew Morton 
Cc: Linux API 
Link: https://lkml.kernel.org/r/20190606012236.9391-3-aubrey...@linux.intel.com

---
 Documentation/filesystems/proc.txt | 40 ++
 1 file changed, 40 insertions(+)

diff --git a/Documentation/filesystems/proc.txt 
b/Documentation/filesystems/proc.txt
index 66cad5c86171..a226061fa109 100644
--- a/Documentation/filesystems/proc.txt
+++ b/Documentation/filesystems/proc.txt
@@ -45,6 +45,7 @@ Table of Contents
  3.9   /proc/<pid>/map_files - Information about memory mapped files
  3.10  /proc/<pid>/timerslack_ns - Task timerslack value
  3.11 /proc/<pid>/patch_state - Livepatch patch operation state
+  3.12 /proc/<pid>/arch_status - Task architecture specific information
 
   4Configuring procfs
   4.1  Mount options
@@ -1948,6 +1949,45 @@ patched.  If the patch is being enabled, then the task 
has already been
 patched.  If the patch is being disabled, then the task hasn't been
 unpatched yet.
 
+3.12 /proc/<pid>/arch_status - task architecture specific status
+---
+When CONFIG_PROC_PID_ARCH_STATUS is enabled, this file displays the
+architecture specific status of the task.
+
+Example
+---
+ $ cat /proc/6753/arch_status
+ AVX512_elapsed_ms:  8
+
+Description
+---
+
+x86 specific entries:
+-
+ AVX512_elapsed_ms:
+ --
+  If AVX512 is supported on the machine, this entry shows the milliseconds
+  elapsed since the last time AVX512 usage was recorded. The recording
+  happens on a best effort basis when a task is scheduled out. This means
+  that the value depends on two factors:
+
+1) The time which the task spent on the CPU without being scheduled
+   out. With CPU isolation and a single runnable task this can take
+   several seconds.
+
+2) The time since the task was scheduled out last. Depending on the
+   reason for being scheduled out (time slice exhausted, syscall ...)
+   this can be arbitrary long time.
+
+  As a consequence the value cannot be considered precise and authoritative
+  information. The application which uses this information has to be aware
+  of the overall scenario on the system in order to determine whether a
+  task is a real AVX512 user or not. Precise information can be obtained
+  with performance counters.
+
+  A special value of '-1' indicates that no AVX512 usage was recorded, thus
+  the task is unlikely an AVX512 user, but depends on the workload and the
+  scheduling scenario, it also could be a false negative mentioned above.
 
 --
 Configuring procfs


[tip:x86/core] x86/process: Add AVX-512 usage elapsed time to /proc/pid/arch_status

2019-06-12 Thread tip-bot for Aubrey Li
Commit-ID:  0c608dad2a771c0a11b6d12148d1a8b975e015d4
Gitweb: https://git.kernel.org/tip/0c608dad2a771c0a11b6d12148d1a8b975e015d4
Author: Aubrey Li 
AuthorDate: Thu, 6 Jun 2019 09:22:35 +0800
Committer:  Thomas Gleixner 
CommitDate: Wed, 12 Jun 2019 11:42:13 +0200

x86/process: Add AVX-512 usage elapsed time to /proc/pid/arch_status

AVX-512 components usage can result in turbo frequency drop. So it's useful
to expose AVX-512 usage elapsed time as a heuristic hint for user space job
schedulers to cluster the AVX-512 using tasks together.

Examples:
$ while [ 1 ]; do cat /proc/tid/arch_status | grep AVX512; sleep 1; done
AVX512_elapsed_ms:  4
AVX512_elapsed_ms:  8
AVX512_elapsed_ms:  4

This means that 4 milliseconds have elapsed since the task's AVX512 usage was
detected when the task was scheduled out.

$ cat /proc/tid/arch_status | grep AVX512
AVX512_elapsed_ms:  -1

'-1' indicates that no AVX512 usage was recorded for this task.

The time exposed is not necessarily accurate when the arch_status file is
read as the AVX512 usage is only evaluated when a task is scheduled
out. Accurate usage information can be obtained with performance counters.

[ tglx: Massaged changelog ]

Signed-off-by: Aubrey Li 
Signed-off-by: Thomas Gleixner 
Cc: a...@linux-foundation.org
Cc: pet...@infradead.org
Cc: h...@zytor.com
Cc: a...@linux.intel.com
Cc: tim.c.c...@linux.intel.com
Cc: dave.han...@intel.com
Cc: ar...@linux.intel.com
Cc: adobri...@gmail.com
Cc: aubrey...@intel.com
Cc: linux-...@vger.kernel.org
Cc: Andy Lutomirski 
Cc: Peter Zijlstra 
Cc: Andi Kleen 
Cc: Tim Chen 
Cc: Dave Hansen 
Cc: Arjan van de Ven 
Cc: Alexey Dobriyan 
Cc: Andrew Morton 
Cc: Linux API 
Link: https://lkml.kernel.org/r/20190606012236.9391-2-aubrey...@linux.intel.com

---
 arch/x86/Kconfig |  1 +
 arch/x86/kernel/fpu/xstate.c | 47 
 2 files changed, 48 insertions(+)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 2bbbd4d1ba31..8a49b4b03f6b 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -217,6 +217,7 @@ config X86
select USER_STACKTRACE_SUPPORT
select VIRT_TO_BUS
	select X86_FEATURE_NAMES	if PROC_FS
+   select PROC_PID_ARCH_STATUS if PROC_FS
 
 config INSTRUCTION_DECODER
def_bool y
diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
index 3c36dd1784db..591ddde3b3e8 100644
--- a/arch/x86/kernel/fpu/xstate.c
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -8,6 +8,8 @@
 #include 
 #include 
 #include 
+#include 
+#include 
 
 #include 
 #include 
@@ -1240,3 +1242,48 @@ int copy_user_to_xstate(struct xregs_state *xsave, const 
void __user *ubuf)
 
return 0;
 }
+
+#ifdef CONFIG_PROC_PID_ARCH_STATUS
+/*
+ * Report the amount of time elapsed in millisecond since last AVX512
+ * use in the task.
+ */
+static void avx512_status(struct seq_file *m, struct task_struct *task)
+{
+   unsigned long timestamp = READ_ONCE(task->thread.fpu.avx512_timestamp);
+   long delta;
+
+   if (!timestamp) {
+   /*
+* Report -1 if no AVX512 usage
+*/
+   delta = -1;
+   } else {
+   delta = (long)(jiffies - timestamp);
+   /*
+* Cap to LONG_MAX if time difference > LONG_MAX
+*/
+   if (delta < 0)
+   delta = LONG_MAX;
+   delta = jiffies_to_msecs(delta);
+   }
+
+   seq_put_decimal_ll(m, "AVX512_elapsed_ms:\t", delta);
+   seq_putc(m, '\n');
+}
+
+/*
+ * Report architecture specific information
+ */
+int proc_pid_arch_status(struct seq_file *m, struct pid_namespace *ns,
+   struct pid *pid, struct task_struct *task)
+{
+   /*
+* Report AVX512 state if the processor and build option supported.
+*/
+   if (cpu_feature_enabled(X86_FEATURE_AVX512F))
+   avx512_status(m, task);
+
+   return 0;
+}
+#endif /* CONFIG_PROC_PID_ARCH_STATUS */


[tip:x86/core] proc: Add /proc/<pid>/arch_status

2019-06-12 Thread tip-bot for Aubrey Li
Commit-ID:  68bc30bb9f33fc8d11e3d110d29e06490896a999
Gitweb: https://git.kernel.org/tip/68bc30bb9f33fc8d11e3d110d29e06490896a999
Author: Aubrey Li 
AuthorDate: Thu, 6 Jun 2019 09:22:34 +0800
Committer:  Thomas Gleixner 
CommitDate: Wed, 12 Jun 2019 11:42:13 +0200

proc: Add /proc/<pid>/arch_status

Exposing architecture specific per process information is useful for
various reasons. An example is the AVX512 usage on x86 which is important
for task placement for power/performance optimizations.

Adding this information to the existing /proc/pid/status file would be the
obvious choice, but it has been agreed on that an explicit arch_status file
is better in separating the generic and architecture specific information.

[ tglx: Massage changelog ]

Signed-off-by: Aubrey Li 
Signed-off-by: Thomas Gleixner 
Acked-by: Andrew Morton 
Cc: pet...@infradead.org
Cc: h...@zytor.com
Cc: a...@linux.intel.com
Cc: tim.c.c...@linux.intel.com
Cc: dave.han...@intel.com
Cc: ar...@linux.intel.com
Cc: adobri...@gmail.com
Cc: aubrey...@intel.com
Cc: linux-...@vger.kernel.org
Cc: Andy Lutomirski 
Cc: Peter Zijlstra 
Cc: Andi Kleen 
Cc: Tim Chen 
Cc: Dave Hansen 
Cc: Arjan van de Ven 
Cc: Alexey Dobriyan 
Cc: Linux API 
Link: https://lkml.kernel.org/r/20190606012236.9391-1-aubrey...@linux.intel.com

---
 fs/proc/Kconfig | 4 
 fs/proc/base.c  | 6 ++
 include/linux/proc_fs.h | 9 +
 3 files changed, 19 insertions(+)

diff --git a/fs/proc/Kconfig b/fs/proc/Kconfig
index 62ee41b4bbd0..4c3dcb718961 100644
--- a/fs/proc/Kconfig
+++ b/fs/proc/Kconfig
@@ -98,3 +98,7 @@ config PROC_CHILDREN
 
  Say Y if you are running any user-space software which takes benefit 
from
  this interface. For example, rkt is such a piece of software.
+
+config PROC_PID_ARCH_STATUS
+   def_bool n
+   depends on PROC_FS
diff --git a/fs/proc/base.c b/fs/proc/base.c
index 9c8ca6cd3ce4..ec436c61eece 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -3061,6 +3061,9 @@ static const struct pid_entry tgid_base_stuff[] = {
 #ifdef CONFIG_STACKLEAK_METRICS
ONE("stack_depth", S_IRUGO, proc_stack_depth),
 #endif
+#ifdef CONFIG_PROC_PID_ARCH_STATUS
+   ONE("arch_status", S_IRUGO, proc_pid_arch_status),
+#endif
 };
 
 static int proc_tgid_base_readdir(struct file *file, struct dir_context *ctx)
@@ -3449,6 +3452,9 @@ static const struct pid_entry tid_base_stuff[] = {
 #ifdef CONFIG_LIVEPATCH
ONE("patch_state",  S_IRUSR, proc_pid_patch_state),
 #endif
+#ifdef CONFIG_PROC_PID_ARCH_STATUS
+   ONE("arch_status", S_IRUGO, proc_pid_arch_status),
+#endif
 };
 
 static int proc_tid_base_readdir(struct file *file, struct dir_context *ctx)
diff --git a/include/linux/proc_fs.h b/include/linux/proc_fs.h
index 52a283ba0465..a705aa2d03f9 100644
--- a/include/linux/proc_fs.h
+++ b/include/linux/proc_fs.h
@@ -75,6 +75,15 @@ struct proc_dir_entry *proc_create_net_single_write(const 
char *name, umode_t mo
void *data);
 extern struct pid *tgid_pidfd_to_pid(const struct file *file);
 
+#ifdef CONFIG_PROC_PID_ARCH_STATUS
+/*
+ * The architecture which selects CONFIG_PROC_PID_ARCH_STATUS must
+ * provide proc_pid_arch_status() definition.
+ */
+int proc_pid_arch_status(struct seq_file *m, struct pid_namespace *ns,
+   struct pid *pid, struct task_struct *task);
+#endif /* CONFIG_PROC_PID_ARCH_STATUS */
+
 #else /* CONFIG_PROC_FS */
 
 static inline void proc_root_init(void)


[PATCH v19 1/3] proc: add /proc/<pid>/arch_status

2019-06-05 Thread Aubrey Li
The architecture specific information of the running processes
could be useful to the userland. Add /proc/<pid>/arch_status
interface support to examine process architecture specific
information externally.

v3:
  Add a /proc/<pid>/arch_state interface to expose per-task
  cpu specific state values.
v5:
  Change the interface to /proc/pid/status since no other
  architectures need a separated CPU specific interface.
v18:
  Change the interface to /proc/pid/arch_status. The interface
  /proc/<pid>/status should not be different on different
  architectures. It would be better to separate the arch stuff
  into its own file /proc/<pid>/arch_status and make sure that
  everything in it is namespaced.

Signed-off-by: Aubrey Li 
Acked-by: Andrew Morton 
Cc: Thomas Gleixner 
Cc: Peter Zijlstra 
Cc: Andi Kleen 
Cc: Tim Chen 
Cc: Dave Hansen 
Cc: Arjan van de Ven 
Cc: Alexey Dobriyan 
Cc: Andrew Morton 
Cc: Andy Lutomirski 
Cc: Linux API 
---
 fs/proc/Kconfig | 4 
 fs/proc/base.c  | 6 ++
 include/linux/proc_fs.h | 9 +
 3 files changed, 19 insertions(+)

diff --git a/fs/proc/Kconfig b/fs/proc/Kconfig
index 817c02b13b1d..d80ebf19d5f1 100644
--- a/fs/proc/Kconfig
+++ b/fs/proc/Kconfig
@@ -97,3 +97,7 @@ config PROC_CHILDREN
 
  Say Y if you are running any user-space software which takes benefit 
from
  this interface. For example, rkt is such a piece of software.
+
+config PROC_PID_ARCH_STATUS
+   def_bool n
+   depends on PROC_FS
diff --git a/fs/proc/base.c b/fs/proc/base.c
index 0c9bef89ac43..39ce939d8964 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -3066,6 +3066,9 @@ static const struct pid_entry tgid_base_stuff[] = {
 #ifdef CONFIG_STACKLEAK_METRICS
ONE("stack_depth", S_IRUGO, proc_stack_depth),
 #endif
+#ifdef CONFIG_PROC_PID_ARCH_STATUS
+   ONE("arch_status", S_IRUGO, proc_pid_arch_status),
+#endif
 };
 
 static int proc_tgid_base_readdir(struct file *file, struct dir_context *ctx)
@@ -3454,6 +3457,9 @@ static const struct pid_entry tid_base_stuff[] = {
 #ifdef CONFIG_LIVEPATCH
ONE("patch_state",  S_IRUSR, proc_pid_patch_state),
 #endif
+#ifdef CONFIG_PROC_PID_ARCH_STATUS
+   ONE("arch_status", S_IRUGO, proc_pid_arch_status),
+#endif
 };
 
 static int proc_tid_base_readdir(struct file *file, struct dir_context *ctx)
diff --git a/include/linux/proc_fs.h b/include/linux/proc_fs.h
index 52a283ba0465..a705aa2d03f9 100644
--- a/include/linux/proc_fs.h
+++ b/include/linux/proc_fs.h
@@ -75,6 +75,15 @@ struct proc_dir_entry *proc_create_net_single_write(const 
char *name, umode_t mo
void *data);
 extern struct pid *tgid_pidfd_to_pid(const struct file *file);
 
+#ifdef CONFIG_PROC_PID_ARCH_STATUS
+/*
+ * The architecture which selects CONFIG_PROC_PID_ARCH_STATUS must
+ * provide proc_pid_arch_status() definition.
+ */
+int proc_pid_arch_status(struct seq_file *m, struct pid_namespace *ns,
+   struct pid *pid, struct task_struct *task);
+#endif /* CONFIG_PROC_PID_ARCH_STATUS */
+
 #else /* CONFIG_PROC_FS */
 
 static inline void proc_root_init(void)
-- 
2.17.1



[PATCH v19 3/3] Documentation/filesystems/proc.txt: add arch_status file

2019-06-05 Thread Aubrey Li
Added /proc/<pid>/arch_status file, and added AVX512_elapsed_ms in
/proc/<pid>/arch_status. Report it in Documentation/filesystems/proc.txt

Signed-off-by: Aubrey Li 
Cc: Thomas Gleixner 
Cc: Peter Zijlstra 
Cc: Andi Kleen 
Cc: Tim Chen 
Cc: Dave Hansen 
Cc: Arjan van de Ven 
Cc: Alexey Dobriyan 
Cc: Andrew Morton 
Cc: Andy Lutomirski 
Cc: Linux API 
---
 Documentation/filesystems/proc.txt | 39 ++
 1 file changed, 39 insertions(+)

diff --git a/Documentation/filesystems/proc.txt 
b/Documentation/filesystems/proc.txt
index 66cad5c86171..e8bc403d15df 100644
--- a/Documentation/filesystems/proc.txt
+++ b/Documentation/filesystems/proc.txt
@@ -45,6 +45,7 @@ Table of Contents
  3.9   /proc/<pid>/map_files - Information about memory mapped files
  3.10  /proc/<pid>/timerslack_ns - Task timerslack value
  3.11 /proc/<pid>/patch_state - Livepatch patch operation state
+  3.12 /proc/<pid>/arch_status - Task architecture specific information
 
   4Configuring procfs
   4.1  Mount options
@@ -1948,6 +1949,44 @@ patched.  If the patch is being enabled, then the task 
has already been
 patched.  If the patch is being disabled, then the task hasn't been
 unpatched yet.
 
+3.12 /proc/<pid>/arch_status - task architecture specific status
+---
+When CONFIG_PROC_PID_ARCH_STATUS is enabled, this file displays the
+architecture specific status of the task.
+
+Example
+---
+ $ cat /proc/6753/arch_status
+ AVX512_elapsed_ms:  8
+
+Description
+---
+
+x86 specific entries:
+-
+ AVX512_elapsed_ms:
+ --
+  If AVX512 is supported on the machine, this entry shows the milliseconds
+  elapsed since the last time AVX512 usage was recorded. The recording
+  happens on a best effort basis when a task is scheduled out. This means
+  that the value depends on two factors:
+
+1) The time which the task spent on the CPU without being scheduled
+   out. With CPU isolation and a single runnable task this can take
+   several seconds.
+
+2) The time since the task was scheduled out last. Depending on the
+   reason for being scheduled out (time slice exhausted, syscall ...)
+   this can be arbitrary long time.
+
+  As a consequence the value cannot be considered precise and authoritative
+  information. The application which uses this information has to be aware
+  of the overall scenario on the system in order to determine whether a
+  task is a real AVX512 user or not.
+
+  A special value of '-1' indicates that no AVX512 usage was recorded, thus
+  the task is unlikely an AVX512 user, but depends on the workload and the
+  scheduling scenario, it also could be a false negative mentioned above.
 
 --
 Configuring procfs
-- 
2.17.1



[PATCH v19 2/3] x86,/proc/pid/arch_status: Add AVX-512 usage elapsed time

2019-06-05 Thread Aubrey Li
AVX-512 components use could cause core turbo frequency drop. So
it's useful to expose AVX-512 usage elapsed time as a heuristic hint
for the user space job scheduler to cluster the AVX-512 using tasks
together.

Tensorflow example:
$ while [ 1 ]; do cat /proc/tid/arch_status | grep AVX512; sleep 1; done
AVX512_elapsed_ms:  4
AVX512_elapsed_ms:  8
AVX512_elapsed_ms:  4

This means that 4 milliseconds have elapsed since the AVX512 usage
of tensorflow task was detected when the task was scheduled out.

Or:
$ cat /proc/tid/arch_status | grep AVX512
AVX512_elapsed_ms:  -1

The number '-1' indicates that no AVX512 usage recorded before
thus the task unlikely has frequency drop issue.

User space tools may want to further check by:

$ perf stat --pid  -e core_power.lvl2_turbo_license -- sleep 1

 Performance counter stats for process id '3558':

 3,251,565,961  core_power.lvl2_turbo_license

   1.004031387 seconds time elapsed

Non-zero counter value confirms that the task causes frequency drop.
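
For illustration only (not part of this patch), a minimal user space sketch
of how a job scheduler could consume the value; the helper name and the 10 ms
threshold are made up:

	#include <stdio.h>
	#include <stdlib.h>
	#include <string.h>
	#include <sys/types.h>

	/* returns ms elapsed since last recorded AVX-512 use, -1 if none/unknown */
	static long avx512_elapsed_ms(pid_t tid)
	{
		char path[64], key[32];
		long val = -1;
		FILE *f;

		snprintf(path, sizeof(path), "/proc/%d/arch_status", (int)tid);
		f = fopen(path, "r");
		if (!f)
			return -1;

		while (fscanf(f, " %31[^:]: %ld", key, &val) == 2) {
			if (!strcmp(key, "AVX512_elapsed_ms"))
				break;
			val = -1;	/* reset for non-matching lines */
		}
		fclose(f);
		return val;
	}

	int main(int argc, char **argv)
	{
		long ms;

		if (argc < 2)
			return 1;

		ms = avx512_elapsed_ms((pid_t)atoi(argv[1]));
		if (ms >= 0 && ms <= 10)	/* recently used AVX-512 */
			printf("task %s: likely AVX-512 user (%ld ms ago)\n", argv[1], ms);
		else
			printf("task %s: no recent AVX-512 usage recorded\n", argv[1]);
		return 0;
	}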

Signed-off-by: Aubrey Li 
Cc: Thomas Gleixner 
Cc: Peter Zijlstra 
Cc: Andi Kleen 
Cc: Tim Chen 
Cc: Dave Hansen 
Cc: Arjan van de Ven 
Cc: Alexey Dobriyan 
Cc: Andrew Morton 
Cc: Andy Lutomirski 
Cc: Linux API 
---
 arch/x86/Kconfig |  1 +
 arch/x86/kernel/fpu/xstate.c | 47 
 2 files changed, 48 insertions(+)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 62fc3fda1a05..5003c6f3a4d5 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -208,6 +208,7 @@ config X86
select USER_STACKTRACE_SUPPORT
select VIRT_TO_BUS
	select X86_FEATURE_NAMES	if PROC_FS
+   select PROC_PID_ARCH_STATUS if PROC_FS
 
 config INSTRUCTION_DECODER
def_bool y
diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
index d7432c2b1051..fcaaf21aa015 100644
--- a/arch/x86/kernel/fpu/xstate.c
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -7,6 +7,8 @@
 #include 
 #include 
 #include 
+#include 
+#include 
 
 #include 
 #include 
@@ -1243,3 +1245,48 @@ int copy_user_to_xstate(struct xregs_state *xsave, const 
void __user *ubuf)
 
return 0;
 }
+
+#ifdef CONFIG_PROC_PID_ARCH_STATUS
+/*
+ * Report the amount of time elapsed in millisecond since last AVX512
+ * use in the task.
+ */
+static void avx512_status(struct seq_file *m, struct task_struct *task)
+{
+   unsigned long timestamp = READ_ONCE(task->thread.fpu.avx512_timestamp);
+   long delta;
+
+   if (!timestamp) {
+   /*
+* Report -1 if no AVX512 usage
+*/
+   delta = -1;
+   } else {
+   delta = (long)(jiffies - timestamp);
+   /*
+* Cap to LONG_MAX if time difference > LONG_MAX
+*/
+   if (delta < 0)
+   delta = LONG_MAX;
+   delta = jiffies_to_msecs(delta);
+   }
+
+   seq_put_decimal_ll(m, "AVX512_elapsed_ms:\t", delta);
+   seq_putc(m, '\n');
+}
+
+/*
+ * Report architecture specific information
+ */
+int proc_pid_arch_status(struct seq_file *m, struct pid_namespace *ns,
+   struct pid *pid, struct task_struct *task)
+{
+   /*
+* Report AVX512 state if the processor and build option supported.
+*/
+   if (cpu_feature_enabled(X86_FEATURE_AVX512F))
+   avx512_status(m, task);
+
+   return 0;
+}
+#endif /* CONFIG_PROC_PID_ARCH_STATUS */
-- 
2.17.1



Re: [RFC PATCH v3 00/16] Core scheduling v3

2019-05-31 Thread Aubrey Li
On Fri, May 31, 2019 at 3:45 PM Aaron Lu  wrote:
>
> On Fri, May 31, 2019 at 02:53:21PM +0800, Aubrey Li wrote:
> > On Fri, May 31, 2019 at 2:09 PM Aaron Lu  wrote:
> > >
> > > On 2019/5/31 13:12, Aubrey Li wrote:
> > > > On Fri, May 31, 2019 at 11:01 AM Aaron Lu  
> > > > wrote:
> > > >>
> > > >> This feels like "date" failed to schedule on some CPU
> > > >> on time.
> > > >>
> > > >> My first reaction is: when shell wakes up from sleep, it will
> > > >> fork date. If the script is untagged and those workloads are
> > > >> tagged and all available cores are already running workload
> > > >> threads, the forked date can lose to the running workload
> > > >> threads due to __prio_less() can't properly do vruntime comparison
> > > >> for tasks on different CPUs. So those idle siblings can't run
> > > >> date and are idled instead. See my previous post on this:
> > > >> https://lore.kernel.org/lkml/20190429033620.GA128241@aaronlu/
> > > >> (Now that I re-read my post, I see that I didn't make it clear
> > > >> that se_bash and se_hog are assigned different tags(e.g. hog is
> > > >> tagged and bash is untagged).
> > > >
> > > > Yes, script is untagged. This looks like exactly the problem in you
> > > > previous post. I didn't follow that, does that discussion lead to a 
> > > > solution?
> > >
> > > No immediate solution yet.
> > >
> > > >>
> > > >> Siblings being forced idle is expected due to the nature of core
> > > >> scheduling, but when two tasks belonging to two siblings are
> > > >> fighting for schedule, we should let the higher priority one win.
> > > >>
> > > >> It used to work on v2 is probably due to we mistakenly
> > > >> allow different tagged tasks to schedule on the same core at
> > > >> the same time, but that is fixed in v3.
> > > >
> > > > I have 64 threads running on a 104-CPU server, that is, when the
> > >
> > > 104-CPU means 52 cores I guess.
> > > 64 threads may(should?) spread on all the 52 cores and that is enough
> > > to make 'date' suffer.
> >
> > 64 threads should spread onto all the 52 cores, but why they can get
> > scheduled while untagged "date" can not? Is it because in the current
>
> If 'date' didn't get scheduled, there will be no output at all unless
> all those workload threads finished :-)

Certainly, I meant that untagged "date" cannot be scheduled on time. :)

>
> I guess the workload you used is not entirely CPU intensive, or 'date'
> can be totally blocked due to START_DEBIT. But note that START_DEBIT
> isn't the problem here, cross CPU vruntime comparison is.
>
> > implementation the task with cookie always has higher priority than the
> > task without a cookie?
>
> No.

I checked the benchmark log manually; the data for the two benchmarks with
cookies look acceptable, but the ones without cookies are really bad.


Re: [RFC PATCH v3 00/16] Core scheduling v3

2019-05-31 Thread Aubrey Li
On Fri, May 31, 2019 at 2:09 PM Aaron Lu  wrote:
>
> On 2019/5/31 13:12, Aubrey Li wrote:
> > On Fri, May 31, 2019 at 11:01 AM Aaron Lu  
> > wrote:
> >>
> >> This feels like "date" failed to schedule on some CPU
> >> on time.
> >>
> >> My first reaction is: when shell wakes up from sleep, it will
> >> fork date. If the script is untagged and those workloads are
> >> tagged and all available cores are already running workload
> >> threads, the forked date can lose to the running workload
> >> threads due to __prio_less() can't properly do vruntime comparison
> >> for tasks on different CPUs. So those idle siblings can't run
> >> date and are idled instead. See my previous post on this:
> >> https://lore.kernel.org/lkml/20190429033620.GA128241@aaronlu/
> >> (Now that I re-read my post, I see that I didn't make it clear
> >> that se_bash and se_hog are assigned different tags(e.g. hog is
> >> tagged and bash is untagged).
> >
> > Yes, script is untagged. This looks like exactly the problem in you
> > previous post. I didn't follow that, does that discussion lead to a 
> > solution?
>
> No immediate solution yet.
>
> >>
> >> Siblings being forced idle is expected due to the nature of core
> >> scheduling, but when two tasks belonging to two siblings are
> >> fighting for schedule, we should let the higher priority one win.
> >>
> >> It used to work on v2 is probably due to we mistakenly
> >> allow different tagged tasks to schedule on the same core at
> >> the same time, but that is fixed in v3.
> >
> > I have 64 threads running on a 104-CPU server, that is, when the
>
> 104-CPU means 52 cores I guess.
> 64 threads may(should?) spread on all the 52 cores and that is enough
> to make 'date' suffer.

64 threads should spread onto all the 52 cores, but why can they get
scheduled while untagged "date" cannot? Is it because, in the current
implementation, a task with a cookie always has higher priority than a
task without a cookie?

Thanks,
-Aubrey


Re: [RFC PATCH v3 00/16] Core scheduling v3

2019-05-30 Thread Aubrey Li
On Fri, May 31, 2019 at 11:01 AM Aaron Lu  wrote:
>
> This feels like "date" failed to schedule on some CPU
> on time.
>
> My first reaction is: when shell wakes up from sleep, it will
> fork date. If the script is untagged and those workloads are
> tagged and all available cores are already running workload
> threads, the forked date can lose to the running workload
> threads due to __prio_less() can't properly do vruntime comparison
> for tasks on different CPUs. So those idle siblings can't run
> date and are idled instead. See my previous post on this:
> https://lore.kernel.org/lkml/20190429033620.GA128241@aaronlu/
> (Now that I re-read my post, I see that I didn't make it clear
> that se_bash and se_hog are assigned different tags(e.g. hog is
> tagged and bash is untagged).

Yes, the script is untagged. This looks like exactly the problem in your
previous post. I didn't follow that; did that discussion lead to a solution?

>
> Siblings being forced idle is expected due to the nature of core
> scheduling, but when two tasks belonging to two siblings are
> fighting for schedule, we should let the higher priority one win.
>
> It used to work on v2 is probably due to we mistakenly
> allow different tagged tasks to schedule on the same core at
> the same time, but that is fixed in v3.

I have 64 threads running on a 104-CPU server, that is, the system has
~40% idle time, and "date" still fails to be picked up onto a CPU on
time. This may be the nature of core scheduling, but it seems to be far
from fair.

Shouldn't we share the core between (sysbench+gemmbench)
and (date)? I mean core-level sharing instead of "date" starvation.

Thanks,
-Aubrey


Re: [RFC PATCH v3 00/16] Core scheduling v3

2019-05-30 Thread Aubrey Li
On Thu, May 30, 2019 at 10:17 PM Julien Desfossez
 wrote:
>
> Interesting, could you detail a bit more your test setup (commands used,
> type of machine, any cgroup/pinning configuration, etc) ? I would like
> to reproduce it and investigate.

Let me see if I can simplify my test to reproduce it.

Thanks,
-Aubrey


Re: [RFC PATCH v3 00/16] Core scheduling v3

2019-05-30 Thread Aubrey Li
On Thu, May 30, 2019 at 4:36 AM Vineeth Remanan Pillai
 wrote:
>
> Third iteration of the Core-Scheduling feature.
>
> This version fixes mostly correctness related issues in v2 and
> addresses performance issues. Also, addressed some crashes related
> to cgroups and cpu hotplugging.
>
> We have tested and verified that incompatible processes are not
> selected during schedule. In terms of performance, the impact
> depends on the workload:
> - on CPU intensive applications that use all the logical CPUs with
>   SMT enabled, enabling core scheduling performs better than nosmt.
> - on mixed workloads with considerable io compared to cpu usage,
>   nosmt seems to perform better than core scheduling.

My testing scripts could not complete on this version. I figured out that the
number of CPU utilization report entries didn't reach my minimum requirement.
Then I wrote a simple script to verify:

$ cat test.sh
#!/bin/sh

for i in `seq 1 10`
do
echo `date`, $i
sleep 1
done


Normally it works as below:

Thu May 30 14:13:40 CST 2019, 1
Thu May 30 14:13:41 CST 2019, 2
Thu May 30 14:13:42 CST 2019, 3
Thu May 30 14:13:43 CST 2019, 4
Thu May 30 14:13:44 CST 2019, 5
Thu May 30 14:13:45 CST 2019, 6
Thu May 30 14:13:46 CST 2019, 7
Thu May 30 14:13:47 CST 2019, 8
Thu May 30 14:13:48 CST 2019, 9
Thu May 30 14:13:49 CST 2019, 10

When the system was running 32 sysbench threads and
32 gemmbench threads, it worked as below (the system
had ~38% idle time):
Thu May 30 14:14:20 CST 2019, 1
Thu May 30 14:14:21 CST 2019, 2
Thu May 30 14:14:22 CST 2019, 3
Thu May 30 14:14:24 CST 2019, 4 <===x=
Thu May 30 14:14:25 CST 2019, 5
Thu May 30 14:14:26 CST 2019, 6
Thu May 30 14:14:28 CST 2019, 7 <===x=
Thu May 30 14:14:29 CST 2019, 8
Thu May 30 14:14:31 CST 2019, 9 <===x=
Thu May 30 14:14:34 CST 2019, 10 <===x=

And it got worse when the system was running the 64/64 case;
the system still had ~3% idle time:
Thu May 30 14:26:40 CST 2019, 1
Thu May 30 14:26:46 CST 2019, 2
Thu May 30 14:26:53 CST 2019, 3
Thu May 30 14:27:01 CST 2019, 4
Thu May 30 14:27:03 CST 2019, 5
Thu May 30 14:27:11 CST 2019, 6
Thu May 30 14:27:31 CST 2019, 7
Thu May 30 14:27:32 CST 2019, 8
Thu May 30 14:27:41 CST 2019, 9
Thu May 30 14:27:56 CST 2019, 10

Any thoughts?

Thanks,
-Aubrey


Re: [RFC PATCH v2 13/17] sched: Add core wide task selection and scheduling.

2019-05-21 Thread Aubrey Li
On Mon, May 20, 2019 at 10:04 PM Vineeth Pillai
 wrote:
>
> > > The following patch improved my test cases.
> > > Welcome any comments.
> > >
> >
> > This is certainly better than violating the point of the core scheduler :)
> >
> > If I'm understanding this right what will happen in this case is instead
> > of using the idle process selected by the sibling we do the core scheduling
> > again. This may start with a newidle_balance which might bring over 
> > something
> > to run that matches what we want to put on the sibling. If that works then I
> > can see this helping.
> >
> > But I'd be a little concerned that we could end up thrashing. Once we do 
> > core
> > scheduling again here we'd force the sibling to resched and if we got a 
> > different
> > result which "helped" him pick idle we'd go around again.

Thrashing means more IPIs, right? That's not what I observed: because the idle
task has less chance to get onto a CPU, rescheduling is reduced accordingly.

> > I think inherent in the concept of core scheduling (barring a perfectly 
> > aligned set
> > of jobs) is some extra idle time on siblings.

Yeah, I understand and agree with this, but 10-15% idle time on an overloaded
system makes me want to figure out how this could happen and whether we
can improve it.

> >
> >
> I was also thinking along the same lines. This change basically always
> tries to avoid idle and there by constantly interrupting the sibling.
> While this change might benefit a very small subset of workloads, it
> might introduce thrashing more often.

Thrashing is not observed in the overloaded case but may happen in a
light-load or mid-load case; I need to investigate more.

>
> One other reason you might be seeing performance improvement is
> because of the bugs that caused both siblings to go idle even though
> there are runnable and compatible threads in the queue. Most of the
> issues are fixed based on all the feedback received in v2. We have a
> github repo with the pre v3 changes here:
> https://github.com/digitalocean/linux-coresched/tree/coresched

Okay, thanks. It looks like the core functions pick_next_task() and pick_task()
have a lot of changes against v2. This needs more brain power...

>
> Please try this and see how it compares with the vanilla v2. I think its
> time for a v3 now and we shall be posting it soon after some more
> testing and benchmarking.

Is there any potential change between pre-v3 and v3? I prefer working
based on v3 so that everyone is on the same page.

Thanks,
-Aubrey


Re: [RFC PATCH v2 13/17] sched: Add core wide task selection and scheduling.

2019-05-18 Thread Aubrey Li
On Wed, Apr 24, 2019 at 12:18 AM Vineeth Remanan Pillai
 wrote:
>
> From: Peter Zijlstra (Intel) 
>
> Instead of only selecting a local task, select a task for all SMT
> siblings for every reschedule on the core (irrespective which logical
> CPU does the reschedule).
>
> NOTE: there is still potential for siblings rivalry.
> NOTE: this is far too complicated; but thus far I've failed to
>   simplify it further.
>
> Signed-off-by: Peter Zijlstra (Intel) 
> ---
>  kernel/sched/core.c  | 222 ++-
>  kernel/sched/sched.h |   5 +-
>  2 files changed, 224 insertions(+), 3 deletions(-)
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index e5bdc1c4d8d7..9e6e90c6f9b9 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -3574,7 +3574,7 @@ static inline void schedule_debug(struct task_struct 
> *prev)
>   * Pick up the highest-prio task:
>   */
>  static inline struct task_struct *
> -pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
> +__pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags 
> *rf)
>  {
> const struct sched_class *class;
> struct task_struct *p;
> @@ -3619,6 +3619,220 @@ pick_next_task(struct rq *rq, struct task_struct 
> *prev, struct rq_flags *rf)
> BUG();
>  }
>
> +#ifdef CONFIG_SCHED_CORE
> +
> +static inline bool cookie_match(struct task_struct *a, struct task_struct *b)
> +{
> +   if (is_idle_task(a) || is_idle_task(b))
> +   return true;
> +
> +   return a->core_cookie == b->core_cookie;
> +}
> +
> +// XXX fairness/fwd progress conditions
> +static struct task_struct *
> +pick_task(struct rq *rq, const struct sched_class *class, struct task_struct 
> *max)
> +{
> +   struct task_struct *class_pick, *cookie_pick;
> +   unsigned long cookie = 0UL;
> +
> +   /*
> +* We must not rely on rq->core->core_cookie here, because we fail to 
> reset
> +* rq->core->core_cookie on new picks, such that we can detect if we 
> need
> +* to do single vs multi rq task selection.
> +*/
> +
> +   if (max && max->core_cookie) {
> +   WARN_ON_ONCE(rq->core->core_cookie != max->core_cookie);
> +   cookie = max->core_cookie;
> +   }
> +
> +   class_pick = class->pick_task(rq);
> +   if (!cookie)
> +   return class_pick;
> +
> +   cookie_pick = sched_core_find(rq, cookie);
> +   if (!class_pick)
> +   return cookie_pick;
> +
> +   /*
> +* If class > max && class > cookie, it is the highest priority task 
> on
> +* the core (so far) and it must be selected, otherwise we must go 
> with
> +* the cookie pick in order to satisfy the constraint.
> +*/
> +   if (cpu_prio_less(cookie_pick, class_pick) && core_prio_less(max, 
> class_pick))
> +   return class_pick;
> +
> +   return cookie_pick;
> +}
> +
> +static struct task_struct *
> +pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
> +{
> +   struct task_struct *next, *max = NULL;
> +   const struct sched_class *class;
> +   const struct cpumask *smt_mask;
> +   int i, j, cpu;
> +
> +   if (!sched_core_enabled(rq))
> +   return __pick_next_task(rq, prev, rf);
> +
> +   /*
> +* If there were no {en,de}queues since we picked (IOW, the task
> +* pointers are all still valid), and we haven't scheduled the last
> +* pick yet, do so now.
> +*/
> +   if (rq->core->core_pick_seq == rq->core->core_task_seq &&
> +   rq->core->core_pick_seq != rq->core_sched_seq) {
> +   WRITE_ONCE(rq->core_sched_seq, rq->core->core_pick_seq);
> +
> +   next = rq->core_pick;
> +   if (next != prev) {
> +   put_prev_task(rq, prev);
> +   set_next_task(rq, next);
> +   }
> +   return next;
> +   }
> +

The following patch improved my test cases.
Any comments are welcome.

Thanks,
-Aubrey

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 3e3162f..86031f4 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3685,10 +3685,12 @@ pick_next_task(struct rq *rq, struct
task_struct *prev, struct rq_flags *rf)
/*
 * If there were no {en,de}queues since we picked (IOW, the task
 * pointers are all still valid), and we haven't scheduled the last
-* pick yet, do so now.
+* pick yet, do so now. If the last pick is idle task, we abandon
+* last pick and try to pick up task this time.
 */
if (rq->core->core_pick_seq == rq->core->core_task_seq &&
-   rq->core->core_pick_seq != rq->core_sched_seq) {
+   rq->core->core_pick_seq != rq->core_sched_seq &&
+   !is_idle_task(rq->core_pick)) {
WRITE_ONCE(rq->core_sched_seq, rq->core->core_pick_seq);

next = 

Re: [RFC PATCH v2 17/17] sched: Debug bits...

2019-05-17 Thread Aubrey Li
On Wed, Apr 24, 2019 at 12:18 AM Vineeth Remanan Pillai
 wrote:
>
> From: Peter Zijlstra (Intel) 
>
> Not-Signed-off-by: Peter Zijlstra (Intel) 
> ---
>  kernel/sched/core.c | 38 +-
>  1 file changed, 37 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 0e3c51a1b54a..e8e5f26db052 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -106,6 +106,10 @@ static inline bool __prio_less(struct task_struct *a, 
> struct task_struct *b, boo
>
> int pa = __task_prio(a), pb = __task_prio(b);
>
> +   trace_printk("(%s/%d;%d,%Lu,%Lu) ?< (%s/%d;%d,%Lu,%Lu)\n",
> +a->comm, a->pid, pa, a->se.vruntime, a->dl.deadline,
> +b->comm, b->pid, pa, b->se.vruntime, b->dl.deadline);
> +

a minor nitpick

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 3e3162f..68c518c 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -93,7 +93,7 @@ static inline bool __prio_less(struct task_struct
*a, struct task_struct *b, u64

trace_printk("(%s/%d;%d,%Lu,%Lu) ?< (%s/%d;%d,%Lu,%Lu)\n",
a->comm, a->pid, pa, a->se.vruntime, a->dl.deadline,
-   b->comm, b->pid, pa, b->se.vruntime, b->dl.deadline);
+   b->comm, b->pid, pb, b->se.vruntime, b->dl.deadline);


if (-pa < -pb)
return true;


Re: [RFC PATCH v2 11/17] sched: Basic tracking of matching tasks

2019-05-09 Thread Aubrey Li
On Thu, May 9, 2019 at 10:14 AM Subhra Mazumdar
 wrote:
>
>
> On 5/8/19 6:38 PM, Aubrey Li wrote:
> > On Thu, May 9, 2019 at 8:29 AM Subhra Mazumdar
> >  wrote:
> >>
> >> On 5/8/19 5:01 PM, Aubrey Li wrote:
> >>> On Thu, May 9, 2019 at 2:41 AM Subhra Mazumdar
> >>>  wrote:
> >>>> On 5/8/19 11:19 AM, Subhra Mazumdar wrote:
> >>>>> On 5/8/19 8:49 AM, Aubrey Li wrote:
> >>>>>>> Pawan ran an experiment setting up 2 VMs, with one VM doing a
> >>>>>>> parallel kernel build and one VM doing sysbench,
> >>>>>>> limiting both VMs to run on 16 cpu threads (8 physical cores), with
> >>>>>>> 8 vcpu for each VM.
> >>>>>>> Making the fix did improve kernel build time by 7%.
> >>>>>> I'm gonna agree with the patch below, but just wonder if the testing
> >>>>>> result is consistent,
> >>>>>> as I didn't see any improvement in my testing environment.
> >>>>>>
> >>>>>> IIUC, from the code behavior, especially for 2 VMs case(only 2
> >>>>>> different cookies), the
> >>>>>> per-rq rb tree unlikely has nodes with different cookies, that is, all
> >>>>>> the nodes on this
> >>>>>> tree should have the same cookie, so:
> >>>>>> - if the parameter cookie is equal to the rb tree cookie, we meet a
> >>>>>> match and go the
> >>>>>> third branch
> >>>>>> - else, no matter we go left or right, we can't find a match, and
> >>>>>> we'll return idle thread
> >>>>>> finally.
> >>>>>>
> >>>>>> Please correct me if I was wrong.
> >>>>>>
> >>>>>> Thanks,
> >>>>>> -Aubrey
> >>>>> This is searching in the per core rb tree (rq->core_tree) which can have
> >>>>> 2 different cookies. But having said that, even I didn't see any
> >>>>> improvement with the patch for my DB test case. But logically it is
> >>>>> correct.
> >>>>>
> >>>> Ah, my bad. It is per rq. But still can have 2 different cookies. Not 
> >>>> sure
> >>>> why you think it is unlikely?
> >>> Yeah, I meant 2 different cookies on the system, but unlikely 2
> >>> different cookies
> >>> on one same rq.
> >>>
> >>> If I read the source correctly, for the sched_core_balance path, when try 
> >>> to
> >>> steal cookie from another CPU, sched_core_find() uses dst's cookie to 
> >>> search
> >>> if there is a cookie match in src's rq, and sched_core_find() returns 
> >>> idle or
> >>> matched task, and later put this matched task onto dst's rq 
> >>> (activate_task() in
> >>> sched_core_find()). At this moment, the nodes on the rq's rb tree should 
> >>> have
> >>> same cookies.
> >>>
> >>> Thanks,
> >>> -Aubrey
> >> Yes, but sched_core_find is also called from pick_task to find a local
> >> matching task.
> > Can a local searching introduce a different cookies? Where is it from?
> No. I meant the local search uses the same binary search of sched_core_find
> so it has to be correct.
> >
> >> The enqueue side logic of the scheduler is unchanged with
> >> core scheduling,
> > But only the task with cookies is placed onto this rb tree?
> >
> >> so it is possible tasks with different cookies are
> >> enqueued on the same rq. So while searching for a matching task locally
> >> doing it correctly should matter.
> > May I know how exactly?
> select_task_rq_* seems to be unchanged. So the search logic to find a cpu
> to enqueue when a task becomes runnable is same as before and doesn't do
> any kind of cookie matching.

Okay, that's true in the task wakeup path, and load_balance also seems to pull
tasks without checking the cookie. But my system was not overloaded when I
tested this patch, so there was no task, or only one task, in the rq and on the
rq's rb tree, so this patch does not make a difference.

The question is, should we do cookie checking when a task selects a CPU and
when load balance pulls a task onto a CPU?

Thanks,
-Aubrey


Re: [RFC PATCH v2 11/17] sched: Basic tracking of matching tasks

2019-05-08 Thread Aubrey Li
On Thu, May 9, 2019 at 8:29 AM Subhra Mazumdar
 wrote:
>
>
> On 5/8/19 5:01 PM, Aubrey Li wrote:
> > On Thu, May 9, 2019 at 2:41 AM Subhra Mazumdar
> >  wrote:
> >>
> >> On 5/8/19 11:19 AM, Subhra Mazumdar wrote:
> >>> On 5/8/19 8:49 AM, Aubrey Li wrote:
> >>>>> Pawan ran an experiment setting up 2 VMs, with one VM doing a
> >>>>> parallel kernel build and one VM doing sysbench,
> >>>>> limiting both VMs to run on 16 cpu threads (8 physical cores), with
> >>>>> 8 vcpu for each VM.
> >>>>> Making the fix did improve kernel build time by 7%.
> >>>> I'm gonna agree with the patch below, but just wonder if the testing
> >>>> result is consistent,
> >>>> as I didn't see any improvement in my testing environment.
> >>>>
> >>>> IIUC, from the code behavior, especially for 2 VMs case(only 2
> >>>> different cookies), the
> >>>> per-rq rb tree unlikely has nodes with different cookies, that is, all
> >>>> the nodes on this
> >>>> tree should have the same cookie, so:
> >>>> - if the parameter cookie is equal to the rb tree cookie, we meet a
> >>>> match and go the
> >>>> third branch
> >>>> - else, no matter we go left or right, we can't find a match, and
> >>>> we'll return idle thread
> >>>> finally.
> >>>>
> >>>> Please correct me if I was wrong.
> >>>>
> >>>> Thanks,
> >>>> -Aubrey
> >>> This is searching in the per core rb tree (rq->core_tree) which can have
> >>> 2 different cookies. But having said that, even I didn't see any
> >>> improvement with the patch for my DB test case. But logically it is
> >>> correct.
> >>>
> >> Ah, my bad. It is per rq. But still can have 2 different cookies. Not sure
> >> why you think it is unlikely?
> > Yeah, I meant 2 different cookies on the system, but unlikely 2
> > different cookies
> > on one same rq.
> >
> > If I read the source correctly, for the sched_core_balance path, when try to
> > steal cookie from another CPU, sched_core_find() uses dst's cookie to search
> > if there is a cookie match in src's rq, and sched_core_find() returns idle 
> > or
> > matched task, and later put this matched task onto dst's rq 
> > (activate_task() in
> > sched_core_find()). At this moment, the nodes on the rq's rb tree should 
> > have
> > same cookies.
> >
> > Thanks,
> > -Aubrey
> Yes, but sched_core_find is also called from pick_task to find a local
> matching task.

Can a local search introduce different cookies? Where would they come from?

> The enqueue side logic of the scheduler is unchanged with
> core scheduling,

But only tasks with cookies are placed onto this rb tree?

> so it is possible tasks with different cookies are
> enqueued on the same rq. So while searching for a matching task locally
> doing it correctly should matter.

May I know how exactly?

Thanks,
-Aubrey


Re: [RFC PATCH v2 11/17] sched: Basic tracking of matching tasks

2019-05-08 Thread Aubrey Li
On Thu, May 9, 2019 at 2:41 AM Subhra Mazumdar
 wrote:
>
>
> On 5/8/19 11:19 AM, Subhra Mazumdar wrote:
> >
> > On 5/8/19 8:49 AM, Aubrey Li wrote:
> >>> Pawan ran an experiment setting up 2 VMs, with one VM doing a
> >>> parallel kernel build and one VM doing sysbench,
> >>> limiting both VMs to run on 16 cpu threads (8 physical cores), with
> >>> 8 vcpu for each VM.
> >>> Making the fix did improve kernel build time by 7%.
> >> I'm gonna agree with the patch below, but just wonder if the testing
> >> result is consistent,
> >> as I didn't see any improvement in my testing environment.
> >>
> >> IIUC, from the code behavior, especially for 2 VMs case(only 2
> >> different cookies), the
> >> per-rq rb tree unlikely has nodes with different cookies, that is, all
> >> the nodes on this
> >> tree should have the same cookie, so:
> >> - if the parameter cookie is equal to the rb tree cookie, we meet a
> >> match and go the
> >> third branch
> >> - else, no matter we go left or right, we can't find a match, and
> >> we'll return idle thread
> >> finally.
> >>
> >> Please correct me if I was wrong.
> >>
> >> Thanks,
> >> -Aubrey
> > This is searching in the per core rb tree (rq->core_tree) which can have
> > 2 different cookies. But having said that, even I didn't see any
> > improvement with the patch for my DB test case. But logically it is
> > correct.
> >
> Ah, my bad. It is per rq. But still can have 2 different cookies. Not sure
> why you think it is unlikely?

Yeah, I meant 2 different cookies on the system, but it's unlikely to have
2 different cookies on one and the same rq.

If I read the source correctly, on the sched_core_balance path, when trying to
steal a cookie from another CPU, sched_core_find() uses dst's cookie to search
for a cookie match in src's rq; sched_core_find() returns the idle task or a
matched task, and this matched task is later put onto dst's rq (activate_task()
in sched_core_find()). At this moment, the nodes on the rq's rb tree should
have the same cookie.

Thanks,
-Aubrey


Re: [RFC PATCH v2 11/17] sched: Basic tracking of matching tasks

2019-05-08 Thread Aubrey Li
On Fri, May 3, 2019 at 8:06 AM Tim Chen  wrote:
>
> On 5/1/19 4:27 PM, Tim Chen wrote:
> > On 4/28/19 11:15 PM, Aaron Lu wrote:
> >> On Tue, Apr 23, 2019 at 04:18:16PM +, Vineeth Remanan Pillai wrote:
> >>> +/*
> >>> + * Find left-most (aka, highest priority) task matching @cookie.
> >>> + */
> >>> +struct task_struct *sched_core_find(struct rq *rq, unsigned long cookie)
> >>> +{
> >>> +   struct rb_node *node = rq->core_tree.rb_node;
> >>> +   struct task_struct *node_task, *match;
> >>> +
> >>> +   /*
> >>> +* The idle task always matches any cookie!
> >>> +*/
> >>> +   match = idle_sched_class.pick_task(rq);
> >>> +
> >>> +   while (node) {
> >>> +   node_task = container_of(node, struct task_struct, core_node);
> >>> +
> >>> +   if (node_task->core_cookie < cookie) {
> >>> +   node = node->rb_left;
> >>
> >> Should go right here?
> >>
> >
> > I think Aaron is correct.  We order the rb tree where tasks with smaller 
> > core cookies
> > go to the left part of the tree.
> >
> > In this case, the cookie we are looking for is larger than the current 
> > node's cookie.
> > It seems like we should move to the right to look for a node with matching 
> > cookie.
> >
> > At least making the following change still allow us to run the system 
> > stably for sysbench.
> > Need to gather more data to see how performance changes.
>
> Pawan ran an experiment setting up 2 VMs, with one VM doing a parallel kernel 
> build and one VM doing sysbench,
> limiting both VMs to run on 16 cpu threads (8 physical cores), with 8 vcpu 
> for each VM.
> Making the fix did improve kernel build time by 7%.

I'm gonna agree with the patch below, but I just wonder whether the testing
result is consistent, as I didn't see any improvement in my testing environment.

IIUC, from the code behavior, especially for the 2-VM case (only 2 different
cookies), the per-rq rb tree is unlikely to have nodes with different cookies,
that is, all the nodes on this tree should have the same cookie, so:
- if the parameter cookie is equal to the rb tree cookie, we meet a match and
  take the third branch
- else, no matter whether we go left or right, we can't find a match, and we'll
  finally return the idle thread.

Please correct me if I was wrong.

Thanks,
-Aubrey
>
> Tim
>
>
> >
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index 25638a47c408..ed4cfa49e3f2 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -208,9 +208,9 @@ static struct task_struct *sched_core_find(struct rq 
> > *rq, unsigned long cookie)
> > while (node) {
> > node_task = container_of(node, struct task_struct, 
> > core_node);
> >
> > -   if (node_task->core_cookie < cookie) {
> > +   if (cookie < node_task->core_cookie) {
> > node = node->rb_left;
> > -   } else if (node_task->core_cookie > cookie) {
> > +   } else if (cookie > node_task->core_cookie) {
> > node = node->rb_right;
> > } else {
> > match = node_task;
> >
> >
>


Re: [RFC PATCH v2 00/17] Core scheduling v2

2019-04-29 Thread Aubrey Li
On Tue, Apr 30, 2019 at 12:01 AM Ingo Molnar  wrote:
> * Li, Aubrey  wrote:
>
> > > I.e. showing the approximate CPU thread-load figure column would be
> > > very useful too, where '50%' shows half-loaded, '100%' fully-loaded,
> > > '200%' over-saturated, etc. - for each row?
> >
> > See below, hope this helps.
> > .--.
> > |NA/AVX vanilla-SMT [std% / sem%] cpu% |coresched-SMT   [std% / 
> > sem%] +/- cpu% |  no-SMT [std% / sem%]   +/-  cpu% |
> > |--|
> > |  1/1508.5 [ 0.2%/ 0.0%] 2.1% |504.7   [ 1.1%/ 
> > 0.1%]-0.8%2.1% |   509.0 [ 0.2%/ 0.0%]   0.1% 4.3% |
> > |  2/2   1000.2 [ 1.4%/ 0.1%] 4.1% |   1004.1   [ 1.6%/ 
> > 0.2%] 0.4%4.1% |   997.6 [ 1.2%/ 0.1%]  -0.3% 8.1% |
> > |  4/4   1912.1 [ 1.0%/ 0.1%] 7.9% |   1904.2   [ 1.1%/ 
> > 0.1%]-0.4%7.9% |  1914.9 [ 1.3%/ 0.1%]   0.1%15.1% |
> > |  8/8   3753.5 [ 0.3%/ 0.0%]14.9% |   3748.2   [ 0.3%/ 
> > 0.0%]-0.1%   14.9% |  3751.3 [ 0.4%/ 0.0%]  -0.1%30.5% |
> > | 16/16  7139.3 [ 2.4%/ 0.2%]30.3% |   7137.9   [ 1.8%/ 
> > 0.2%]-0.0%   30.3% |  7049.2 [ 2.4%/ 0.2%]  -1.3%60.4% |
> > | 32/32 10899.0 [ 4.2%/ 0.4%]60.3% |  10780.3   [ 4.4%/ 
> > 0.4%]-1.1%   55.9% | 10339.2 [ 9.6%/ 0.9%]  -5.1%97.7% |
> > | 64/64 15086.1 [11.5%/ 1.2%]97.7% |  14262.0   [ 8.2%/ 
> > 0.8%]-5.5%   82.0% | 11168.7 [22.2%/ 1.7%] -26.0%   100.0% |
> > |128/12815371.9 [22.0%/ 2.2%]   100.0% |  14675.8   [14.4%/ 
> > 1.4%]-4.5%   82.8% | 10963.9 [18.5%/ 1.4%] -28.7%   100.0% |
> > |256/25615990.8 [22.0%/ 2.2%]   100.0% |  12227.9   [10.3%/ 
> > 1.0%]   -23.5%   73.2% | 10469.9 [19.6%/ 1.7%] -34.5%   100.0% |
> > '--'
>
> Very nice, thank you!
>
> What's interesting is how in the over-saturated case (the last three
> rows: 128, 256 and 512 total threads) coresched-SMT leaves 20-30% CPU
> performance on the floor according to the load figures.

Yeah, I found the next thing to focus on.

>
> Is this true idle time (which shows up as 'id' during 'top'), or some
> load average artifact?
>

vmstat periodically reported intermediate CPU utilization every second; it was
running simultaneously while the benchmarks ran. The cpu% is computed as
the average of the (100 - idle) series.
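
For reference, a minimal sketch of that computation (assuming the idle numbers
come from a saved 'vmstat 1' log; this is not the exact script used here):

#!/usr/bin/env python3
# Minimal sketch: compute cpu% as the average of the (100 - idle) series
# from a saved "vmstat 1" log. Assumption: the log contains the standard
# vmstat header line with an "id" (idle) column.
import statistics
import sys

def cpu_percent(path):
    idle_col = None
    idle = []
    with open(path) as f:
        for line in f:
            cols = line.split()
            if idle_col is None and "id" in cols:
                idle_col = cols.index("id")      # header row with column names
            elif idle_col is not None and cols and cols[0].isdigit():
                idle.append(float(cols[idle_col]))
    return statistics.mean(100.0 - v for v in idle)

if __name__ == "__main__":
    print("cpu%%: %.1f" % cpu_percent(sys.argv[1]))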

Thanks,
-Aubrey


Re: [RFC PATCH v2 00/17] Core scheduling v2

2019-04-29 Thread Aubrey Li
On Mon, Apr 29, 2019 at 11:39 PM Phil Auld  wrote:
>
> On Mon, Apr 29, 2019 at 09:25:35PM +0800 Li, Aubrey wrote:
> > .------------------------------------------------------------------------------------------------------------------------.
> > | NA/AVX  | vanilla-SMT [std%/sem%]  cpu% | coresched-SMT [std%/sem%]    +/-   cpu% | no-SMT [std%/sem%]     +/-     cpu% |
> > |------------------------------------------------------------------------------------------------------------------------|
> > |   1/1   |   508.5 [ 0.2%/ 0.0%]    2.1% |    504.7 [ 1.1%/ 0.1%]     -0.8%   2.1% |   509.0 [ 0.2%/ 0.0%]    0.1%   4.3% |
> > |   2/2   |  1000.2 [ 1.4%/ 0.1%]    4.1% |   1004.1 [ 1.6%/ 0.2%]      0.4%   4.1% |   997.6 [ 1.2%/ 0.1%]   -0.3%   8.1% |
> > |   4/4   |  1912.1 [ 1.0%/ 0.1%]    7.9% |   1904.2 [ 1.1%/ 0.1%]     -0.4%   7.9% |  1914.9 [ 1.3%/ 0.1%]    0.1%  15.1% |
> > |   8/8   |  3753.5 [ 0.3%/ 0.0%]   14.9% |   3748.2 [ 0.3%/ 0.0%]     -0.1%  14.9% |  3751.3 [ 0.4%/ 0.0%]   -0.1%  30.5% |
> > |  16/16  |  7139.3 [ 2.4%/ 0.2%]   30.3% |   7137.9 [ 1.8%/ 0.2%]     -0.0%  30.3% |  7049.2 [ 2.4%/ 0.2%]   -1.3%  60.4% |
> > |  32/32  | 10899.0 [ 4.2%/ 0.4%]   60.3% |  10780.3 [ 4.4%/ 0.4%]     -1.1%  55.9% | 10339.2 [ 9.6%/ 0.9%]   -5.1%  97.7% |
> > |  64/64  | 15086.1 [11.5%/ 1.2%]   97.7% |  14262.0 [ 8.2%/ 0.8%]     -5.5%  82.0% | 11168.7 [22.2%/ 1.7%]  -26.0% 100.0% |
> > | 128/128 | 15371.9 [22.0%/ 2.2%]  100.0% |  14675.8 [14.4%/ 1.4%]     -4.5%  82.8% | 10963.9 [18.5%/ 1.4%]  -28.7% 100.0% |
> > | 256/256 | 15990.8 [22.0%/ 2.2%]  100.0% |  12227.9 [10.3%/ 1.0%]    -23.5%  73.2% | 10469.9 [19.6%/ 1.7%]  -34.5% 100.0% |
> > '------------------------------------------------------------------------------------------------------------------------'
> >
>
> That's really nice and clear.
>
> We start to see the penalty for the coresched at 32/32, leaving some cpus 
> more idle than otherwise.
> But it's pretty good overall, for this benchmark at least.
>
> Is this with stock v2 or with any of the fixes posted after? I wonder how 
> much the fixes for
> the race that violates the rule effects this, for example.
>

Yeah, this data is based on v2 without any of the fixes posted after.
I also tried some fixes with potential performance impact, but no luck so far.
Please let me know if I missed anything.

Thanks,
-Aubrey


Re: [RFC PATCH v2 00/17] Core scheduling v2

2019-04-28 Thread Aubrey Li
On Sun, Apr 28, 2019 at 5:33 PM Ingo Molnar  wrote:
> So because I'm a big fan of presenting data in a readable fashion, here
> are your results, tabulated:

I thought I had tried my best to make it readable, but this one looks much better,
thanks. ;-)
>
>  #
>  # Sysbench throughput comparison of 3 different kernels at different
>  # load levels, higher numbers are better:
>  #
>
>  
>  .--------------------------------------------------------------------------------------------.
>  |  NA/AVX   vanilla-SMT  [stddev%] |   coresched-SMT  [stddev%]    +/-  |  no-SMT  [stddev%]    +/-  |
>  |--------------------------------------------------------------------------------------------|
>  |   1/1        508.5     [  0.2% ] |       504.7      [  1.1% ]    0.8% |    509.0 [  0.2% ]    0.1% |
>  |   2/2       1000.2     [  1.4% ] |      1004.1      [  1.6% ]    0.4% |    997.6 [  1.2% ]    0.3% |
>  |   4/4       1912.1     [  1.0% ] |      1904.2      [  1.1% ]    0.4% |   1914.9 [  1.3% ]    0.1% |
>  |   8/8       3753.5     [  0.3% ] |      3748.2      [  0.3% ]    0.1% |   3751.3 [  0.4% ]    0.1% |
>  |  16/16      7139.3     [  2.4% ] |      7137.9      [  1.8% ]    0.0% |   7049.2 [  2.4% ]    1.3% |
>  |  32/32     10899.0     [  4.2% ] |     10780.3      [  4.4% ]   -1.1% |  10339.2 [  9.6% ]   -5.1% |
>  |  64/64     15086.1     [ 11.5% ] |     14262.0      [  8.2% ]   -5.5% |  11168.7 [ 22.2% ]  -26.0% |
>  | 128/128    15371.9     [ 22.0% ] |     14675.8      [ 14.4% ]   -4.5% |  10963.9 [ 18.5% ]  -28.7% |
>  | 256/256    15990.8     [ 22.0% ] |     12227.9      [ 10.3% ]  -23.5% |  10469.9 [ 19.6% ]  -34.5% |
>  '--------------------------------------------------------------------------------------------'
>
> One major thing that sticks out is that if we compare the stddev numbers
> to the +/- comparisons then it's pretty clear that the benchmarks are
> very noisy: in all but the last row stddev is actually higher than the
> measured effect.
>
> So what does 'stddev' mean here, exactly? The stddev of multipe runs,
> i.e. measured run-to-run variance? Or is it some internal metric of the
> benchmark?
>

The benchmark periodically reports intermediate statistics every second;
the raw log looks like below:
[ 11s ] thds: 256 eps: 14346.72 lat (ms,95%): 44.17
[ 12s ] thds: 256 eps: 14328.45 lat (ms,95%): 44.17
[ 13s ] thds: 256 eps: 13773.06 lat (ms,95%): 43.39
[ 14s ] thds: 256 eps: 13752.31 lat (ms,95%): 43.39
[ 15s ] thds: 256 eps: 15362.79 lat (ms,95%): 43.39
[ 16s ] thds: 256 eps: 26580.65 lat (ms,95%): 35.59
[ 17s ] thds: 256 eps: 15011.78 lat (ms,95%): 36.89
[ 18s ] thds: 256 eps: 15025.78 lat (ms,95%): 39.65
[ 19s ] thds: 256 eps: 15350.87 lat (ms,95%): 39.65
[ 20s ] thds: 256 eps: 15491.70 lat (ms,95%): 36.89

I have a python script to parse eps (events per second) and lat (latency)
out and compute the average and stddev. (And I can draw a curve locally.)
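
For reference, a rough sketch of that parsing step, assuming the raw log format
shown above (not the exact script; the plotting part is omitted, and std% is
taken as stddev relative to the average):

#!/usr/bin/env python3
# Rough sketch of the parsing step, assuming lines like
# "[ 11s ] thds: 256 eps: 14346.72 lat (ms,95%): 44.17".
import re
import statistics
import sys

LINE = re.compile(r"eps:\s*([0-9.]+)\s+lat \(ms,95%\):\s*([0-9.]+)")

def summarize(path):
    eps, lat = [], []
    with open(path) as f:
        for line in f:
            m = LINE.search(line)
            if m:
                eps.append(float(m.group(1)))
                lat.append(float(m.group(2)))
    for name, series in (("eps", eps), ("lat (ms,95%)", lat)):
        avg = statistics.mean(series)
        std = statistics.stdev(series)
        # std% below: stddev relative to the average (assumed convention)
        print("%s: avg = %.3f, stddev = %.3f (%.2f%%)"
              % (name, avg, std, 100.0 * std / avg))

if __name__ == "__main__":
    summarize(sys.argv[1])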

It's noisy indeed when the task number is greater than the CPU number.
It's probably caused by highly frequent load balancing and context switches.
Do you have any suggestions? Or any other information I can provide?

Thanks,
-Aubrey


Re: [RFC PATCH v2 00/17] Core scheduling v2

2019-04-27 Thread Aubrey Li
On Sat, Apr 27, 2019 at 10:21 PM Ingo Molnar  wrote:
>
> * Aubrey Li  wrote:
>
> > On Sat, Apr 27, 2019 at 5:17 PM Ingo Molnar  wrote:
> > >
> > >
> > > * Aubrey Li  wrote:
> > >
> > > > I have the same environment setup above, for nosmt cases, I used
> > > > /sys interface Thomas mentioned, below is the result:
> > > >
> > > > NA/AVX  baseline(std%)  coresched(std%) +/- nosmt(std%) +/-
> > > > 1/1  1.987( 1.97%)   2.043( 1.76%) -2.84% 1.985( 1.70%)  0.12%
> > > > NA/AVX  baseline(std%)  coresched(std%) +/- nosmt(std%) +/-
> > > > 2/2  2.074( 1.16%)   2.057( 2.09%)  0.81% 2.072( 0.77%)  0.10%
> > > > NA/AVX  baseline(std%)  coresched(std%) +/- nosmt(std%) +/-
> > > > 4/4  2.140( 0.00%)   2.138( 0.49%)  0.09% 2.137( 0.89%)  0.12%
> > > > NA/AVX  baseline(std%)  coresched(std%) +/- nosmt(std%) +/-
> > > > 8/8  2.140( 0.00%)   2.144( 0.53%) -0.17% 2.140( 0.00%)  0.00%
> > > > NA/AVX  baseline(std%)  coresched(std%) +/- nosmt(std%) +/-
> > > > 16/162.361( 2.99%)   2.369( 2.65%) -0.30% 2.406( 2.53%) -1.87%
> > > > NA/AVX  baseline(std%)  coresched(std%) +/- nosmt(std%) +/-
> > > > 32/325.032( 8.68%)   3.485( 0.49%) 30.76% 6.002(27.21%) -19.27%
> > > > NA/AVX  baseline(std%)  coresched(std%) +/- nosmt(std%) +/-
> > > > 64/647.577(34.35%)   3.972(23.18%) 47.57% 18.235(14.14%) -140.68%
> > > > NA/AVX  baseline(std%)  coresched(std%) +/- nosmt(std%) +/-
> > > > 128/128 24.639(14.28%)  27.440( 8.24%) -11.37% 34.746( 6.92%) -41.02%
> > > > NA/AVX  baseline(std%)  coresched(std%) +/- nosmt(std%) +/-
> > > > 256/256 38.797( 8.59%)  44.067(16.20%) -13.58% 42.536( 7.57%) -9.64%
> > >
> > > What do these numbers mean? Are these latencies, i.e. lower is better?
> >
> > Yeah, like above setup, I run sysbench(Non-AVX task, NA) and gemmbench
> > (AVX512 task, AVX) in different level utilizatoin. The machine has 104 
> > CPUs, so
> > nosmt has 52 CPUs.  These numbers are 95th percentile latency of sysbench,
> > lower is better.
>
> But what we are really interested in are throughput numbers under these
> three kernel variants, right?
>

These are sysbench events-per-second numbers; higher is better.

NA/AVX     baseline(std%)     coresched(std%)      +/-       nosmt(std%)       +/-
1/1          508.5( 0.2%)       504.7( 1.1%)      -0.8%     509.0( 0.2%)      0.1%
2/2         1000.2( 1.4%)      1004.1( 1.6%)       0.4%     997.6( 1.2%)     -0.3%
4/4         1912.1( 1.0%)      1904.2( 1.1%)      -0.4%    1914.9( 1.3%)      0.1%
8/8         3753.5( 0.3%)      3748.2( 0.3%)      -0.1%    3751.3( 0.4%)     -0.1%
16/16       7139.3( 2.4%)      7137.9( 1.8%)      -0.0%    7049.2( 2.4%)     -1.3%
32/32      10899.0( 4.2%)     10780.3( 4.4%)      -1.1%   10339.2( 9.6%)     -5.1%
64/64      15086.1(11.5%)     14262.0( 8.2%)      -5.5%   11168.7(22.2%)    -26.0%
128/128    15371.9(22.0%)     14675.8(14.4%)      -4.5%   10963.9(18.5%)    -28.7%
256/256    15990.8(22.0%)     12227.9(10.3%)     -23.5%   10469.9(19.6%)    -34.5%


Re: [RFC PATCH v2 00/17] Core scheduling v2

2019-04-27 Thread Aubrey Li
On Sat, Apr 27, 2019 at 5:17 PM Ingo Molnar  wrote:
>
>
> * Aubrey Li  wrote:
>
> > I have the same environment setup above, for nosmt cases, I used
> > /sys interface Thomas mentioned, below is the result:
> >
> > NA/AVX  baseline(std%)  coresched(std%) +/- nosmt(std%) +/-
> > 1/1  1.987( 1.97%)   2.043( 1.76%) -2.84% 1.985( 1.70%)  0.12%
> > NA/AVX  baseline(std%)  coresched(std%) +/- nosmt(std%) +/-
> > 2/2  2.074( 1.16%)   2.057( 2.09%)  0.81% 2.072( 0.77%)  0.10%
> > NA/AVX  baseline(std%)  coresched(std%) +/- nosmt(std%) +/-
> > 4/4  2.140( 0.00%)   2.138( 0.49%)  0.09% 2.137( 0.89%)  0.12%
> > NA/AVX  baseline(std%)  coresched(std%) +/- nosmt(std%) +/-
> > 8/8  2.140( 0.00%)   2.144( 0.53%) -0.17% 2.140( 0.00%)  0.00%
> > NA/AVX  baseline(std%)  coresched(std%) +/- nosmt(std%) +/-
> > 16/162.361( 2.99%)   2.369( 2.65%) -0.30% 2.406( 2.53%) -1.87%
> > NA/AVX  baseline(std%)  coresched(std%) +/- nosmt(std%) +/-
> > 32/325.032( 8.68%)   3.485( 0.49%) 30.76% 6.002(27.21%) -19.27%
> > NA/AVX  baseline(std%)  coresched(std%) +/- nosmt(std%) +/-
> > 64/647.577(34.35%)   3.972(23.18%) 47.57% 18.235(14.14%) -140.68%
> > NA/AVX  baseline(std%)  coresched(std%) +/- nosmt(std%) +/-
> > 128/128 24.639(14.28%)  27.440( 8.24%) -11.37% 34.746( 6.92%) -41.02%
> > NA/AVX  baseline(std%)  coresched(std%) +/- nosmt(std%) +/-
> > 256/256 38.797( 8.59%)  44.067(16.20%) -13.58% 42.536( 7.57%) -9.64%
>
> What do these numbers mean? Are these latencies, i.e. lower is better?

Yeah, with the same setup as above, I ran sysbench (non-AVX task, NA) and gemmbench
(AVX512 task, AVX) at different utilization levels. The machine has 104 CPUs, so
nosmt has 52 CPUs. These numbers are the 95th percentile latency of sysbench;
lower is better.

Thanks,
-Aubrey


Re: [RFC PATCH v2 00/17] Core scheduling v2

2019-04-26 Thread Aubrey Li
On Thu, Apr 25, 2019 at 5:55 PM Ingo Molnar  wrote:
> * Aubrey Li  wrote:
> > On Wed, Apr 24, 2019 at 10:00 PM Julien Desfossez
> >  wrote:
> > >
> > > On 24-Apr-2019 09:13:10 PM, Aubrey Li wrote:
> > > > On Wed, Apr 24, 2019 at 12:18 AM Vineeth Remanan Pillai
> > > >  wrote:
> > > > >
> > > > > Second iteration of the core-scheduling feature.
> > > > >
> > > > > This version fixes apparent bugs and performance issues in v1. This
> > > > > doesn't fully address the issue of core sharing between processes
> > > > > with different tags. Core sharing still happens 1% to 5% of the time
> > > > > based on the nature of workload and timing of the runnable processes.
> > > > >
> > > > > Changes in v2
> > > > > -
> > > > > - rebased on mainline commit: 6d906f99817951e2257d577656899da02bb33105
> > > >
> > > > Thanks to post v2, based on this version, here is my benchmarks result.
> > > >
> > > > Environment setup
> > > > --
> > > > Skylake server, 2 numa nodes, 104 CPUs (HT on)
> > > > cgroup1 workload, sysbench (CPU intensive non AVX workload)
> > > > cgroup2 workload, gemmbench (AVX512 workload)
> > > >
> > > > Case 1: task number < CPU num
> > > > 
> > > > 36 sysbench threads in cgroup1
> > > > 36 gemmbench threads in cgroup2
> > > >
> > > > core sched off:
> > > > - sysbench 95th percentile latency(ms): avg = 4.952, stddev = 0.55342
> > > > core sched on:
> > > > - sysbench 95th percentile latency(ms): avg = 3.549, stddev = 0.04449
> > > >
> > > > Due to core cookie matching, sysbench tasks won't be affect by AVX512
> > > > tasks, latency has ~28% improvement!!!
> > > >
> > > > Case 2: task number > CPU number
> > > > -
> > > > 72 sysbench threads in cgroup1
> > > > 72 gemmbench threads in cgroup2
> > > >
> > > > core sched off:
> > > > - sysbench 95th percentile latency(ms): avg = 11.914, stddev = 3.259
> > > > core sched on:
> > > > - sysbench 95th percentile latency(ms): avg = 13.289, stddev = 4.863
> > > >
> > > > So not only power, now security and performance is a pair of 
> > > > contradictions.
> > > > Due to core cookie not matching and forced idle introduced, latency has 
> > > > ~12%
> > > > regression.
> > > >
> > > > Any comments?
> > >
> > > Would it be possible to post the results with HT off as well ?
> >
> > What's the point here to turn HT off? The latency is sensitive to the
> > relationship
> > between the task number and CPU number. Usually less CPU number, more run
> > queue wait time, and worse result.
>
> HT-off numbers are mandatory: turning HT off is by far the simplest way
> to solve the security bugs in these CPUs.
>
> Any core-scheduling solution *must* perform better than HT-off for all
> relevant workloads, otherwise what's the point?
>
I have the same environment setup as above; for the nosmt cases, I used
the /sys interface Thomas mentioned. Below is the result:

NA/AVX     baseline(std%)     coresched(std%)      +/-       nosmt(std%)        +/-
1/1         1.987( 1.97%)      2.043( 1.76%)     -2.84%     1.985( 1.70%)      0.12%
2/2         2.074( 1.16%)      2.057( 2.09%)      0.81%     2.072( 0.77%)      0.10%
4/4         2.140( 0.00%)      2.138( 0.49%)      0.09%     2.137( 0.89%)      0.12%
8/8         2.140( 0.00%)      2.144( 0.53%)     -0.17%     2.140( 0.00%)      0.00%
16/16       2.361( 2.99%)      2.369( 2.65%)     -0.30%     2.406( 2.53%)     -1.87%
32/32       5.032( 8.68%)      3.485( 0.49%)     30.76%     6.002(27.21%)    -19.27%
64/64       7.577(34.35%)      3.972(23.18%)     47.57%    18.235(14.14%)   -140.68%
128/128    24.639(14.28%)     27.440( 8.24%)    -11.37%    34.746( 6.92%)    -41.02%
256/256    38.797( 8.59%)     44.067(16.20%)    -13.58%    42.536( 7.57%)     -9.64%

Thanks,
-Aubrey


Re: [RFC PATCH v2 00/17] Core scheduling v2

2019-04-25 Thread Aubrey Li
On Thu, Apr 25, 2019 at 5:55 PM Ingo Molnar  wrote:
>
>
> * Aubrey Li  wrote:
>
> > On Wed, Apr 24, 2019 at 10:00 PM Julien Desfossez
> >  wrote:
> > >
> > > On 24-Apr-2019 09:13:10 PM, Aubrey Li wrote:
> > > > On Wed, Apr 24, 2019 at 12:18 AM Vineeth Remanan Pillai
> > > >  wrote:
> > > > >
> > > > > Second iteration of the core-scheduling feature.
> > > > >
> > > > > This version fixes apparent bugs and performance issues in v1. This
> > > > > doesn't fully address the issue of core sharing between processes
> > > > > with different tags. Core sharing still happens 1% to 5% of the time
> > > > > based on the nature of workload and timing of the runnable processes.
> > > > >
> > > > > Changes in v2
> > > > > -
> > > > > - rebased on mainline commit: 6d906f99817951e2257d577656899da02bb33105
> > > >
> > > > Thanks to post v2, based on this version, here is my benchmarks result.
> > > >
> > > > Environment setup
> > > > --
> > > > Skylake server, 2 numa nodes, 104 CPUs (HT on)
> > > > cgroup1 workload, sysbench (CPU intensive non AVX workload)
> > > > cgroup2 workload, gemmbench (AVX512 workload)
> > > >
> > > > Case 1: task number < CPU num
> > > > 
> > > > 36 sysbench threads in cgroup1
> > > > 36 gemmbench threads in cgroup2
> > > >
> > > > core sched off:
> > > > - sysbench 95th percentile latency(ms): avg = 4.952, stddev = 0.55342
> > > > core sched on:
> > > > - sysbench 95th percentile latency(ms): avg = 3.549, stddev = 0.04449
> > > >
> > > > Due to core cookie matching, sysbench tasks won't be affect by AVX512
> > > > tasks, latency has ~28% improvement!!!
> > > >
> > > > Case 2: task number > CPU number
> > > > -
> > > > 72 sysbench threads in cgroup1
> > > > 72 gemmbench threads in cgroup2
> > > >
> > > > core sched off:
> > > > - sysbench 95th percentile latency(ms): avg = 11.914, stddev = 3.259
> > > > core sched on:
> > > > - sysbench 95th percentile latency(ms): avg = 13.289, stddev = 4.863
> > > >
> > > > So not only power, now security and performance is a pair of 
> > > > contradictions.
> > > > Due to core cookie not matching and forced idle introduced, latency has 
> > > > ~12%
> > > > regression.
> > > >
> > > > Any comments?
> > >
> > > Would it be possible to post the results with HT off as well ?
> >
> > What's the point here to turn HT off? The latency is sensitive to the
> > relationship
> > between the task number and CPU number. Usually less CPU number, more run
> > queue wait time, and worse result.
>
> HT-off numbers are mandatory: turning HT off is by far the simplest way
> to solve the security bugs in these CPUs.
>
> Any core-scheduling solution *must* perform better than HT-off for all
> relevant workloads, otherwise what's the point?
>
Got it, I'll measure HT-off cases soon.

Thanks,
-Aubrey


[PATCH v18 2/3] x86,/proc/pid/arch_status: Add AVX-512 usage elapsed time

2019-04-25 Thread Aubrey Li
Use of AVX-512 components can cause a core turbo frequency drop. So
it's useful to expose the AVX-512 usage elapsed time as a heuristic hint
for a user space job scheduler to cluster the AVX-512-using tasks
together.

Tensorflow example:
$ while [ 1 ]; do cat /proc/tid/arch_status | grep AVX512; sleep 1; done
AVX512_elapsed_ms:  4
AVX512_elapsed_ms:  8
AVX512_elapsed_ms:  4

This means that 4 milliseconds have elapsed since AVX512 usage of the
tensorflow task was last detected; the detection happens when the task is
scheduled out.

Or:
$ cat /proc/tid/arch_status | grep AVX512
AVX512_elapsed_ms:  -1

The number '-1' indicates that no AVX512 usage was recorded before,
thus the task is unlikely to have the frequency drop issue.

User space tools may want to further check by:

$ perf stat --pid <pid> -e core_power.lvl2_turbo_license -- sleep 1

 Performance counter stats for process id '3558':

 3,251,565,961  core_power.lvl2_turbo_license

   1.004031387 seconds time elapsed

Non-zero counter value confirms that the task causes frequency drop.
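
As an illustration only (not part of this patch), a user space job scheduler
could consume the hint roughly like the sketch below; the 10 ms threshold and
the function name are arbitrary examples, only the file name, the field name
and the '-1' semantics come from this patch:

#!/usr/bin/env python3
# Minimal sketch: decide whether a task looks like a recent AVX512 user
# based on AVX512_elapsed_ms. The threshold is an arbitrary example value.
def recent_avx512_user(pid, threshold_ms=10):
    try:
        with open("/proc/%d/arch_status" % pid) as f:
            for line in f:
                if line.startswith("AVX512_elapsed_ms:"):
                    elapsed = int(line.split(":")[1])
                    # -1 means no AVX512 usage was recorded for this task
                    return elapsed != -1 and elapsed <= threshold_ms
    except OSError:
        pass
    return False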

Signed-off-by: Aubrey Li 
Cc: Thomas Gleixner 
Cc: Peter Zijlstra 
Cc: Andi Kleen 
Cc: Tim Chen 
Cc: Dave Hansen 
Cc: Arjan van de Ven 
Cc: Alexey Dobriyan 
Cc: Andrew Morton 
Cc: Andy Lutomirski 
Cc: Linux API 
---
 arch/x86/Kconfig |  1 +
 arch/x86/kernel/fpu/xstate.c | 47 
 2 files changed, 48 insertions(+)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 5ad92419be19..d5a9c5ddd453 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -208,6 +208,7 @@ config X86
select USER_STACKTRACE_SUPPORT
select VIRT_TO_BUS
select X86_FEATURE_NAMESif PROC_FS
+   select PROC_PID_ARCH_STATUS if PROC_FS
 
 config INSTRUCTION_DECODER
def_bool y
diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
index d7432c2b1051..fcaaf21aa015 100644
--- a/arch/x86/kernel/fpu/xstate.c
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -7,6 +7,8 @@
 #include 
 #include 
 #include 
+#include 
+#include 
 
 #include 
 #include 
@@ -1243,3 +1245,48 @@ int copy_user_to_xstate(struct xregs_state *xsave, const 
void __user *ubuf)
 
return 0;
 }
+
+#ifdef CONFIG_PROC_PID_ARCH_STATUS
+/*
+ * Report the amount of time elapsed in millisecond since last AVX512
+ * use in the task.
+ */
+static void avx512_status(struct seq_file *m, struct task_struct *task)
+{
+   unsigned long timestamp = READ_ONCE(task->thread.fpu.avx512_timestamp);
+   long delta;
+
+   if (!timestamp) {
+   /*
+* Report -1 if no AVX512 usage
+*/
+   delta = -1;
+   } else {
+   delta = (long)(jiffies - timestamp);
+   /*
+* Cap to LONG_MAX if time difference > LONG_MAX
+*/
+   if (delta < 0)
+   delta = LONG_MAX;
+   delta = jiffies_to_msecs(delta);
+   }
+
+   seq_put_decimal_ll(m, "AVX512_elapsed_ms:\t", delta);
+   seq_putc(m, '\n');
+}
+
+/*
+ * Report architecture specific information
+ */
+int proc_pid_arch_status(struct seq_file *m, struct pid_namespace *ns,
+   struct pid *pid, struct task_struct *task)
+{
+   /*
+* Report AVX512 state if the processor and build option supported.
+*/
+   if (cpu_feature_enabled(X86_FEATURE_AVX512F))
+   avx512_status(m, task);
+
+   return 0;
+}
+#endif /* CONFIG_PROC_PID_ARCH_STATUS */
-- 
2.17.1



[PATCH v18 1/3] proc: add /proc/<pid>/arch_status

2019-04-25 Thread Aubrey Li
The architecture specific information of running processes
could be useful to userland. Add /proc/<pid>/arch_status
interface support to examine process architecture specific
information externally.

Signed-off-by: Aubrey Li 
Cc: Thomas Gleixner 
Cc: Peter Zijlstra 
Cc: Andi Kleen 
Cc: Tim Chen 
Cc: Dave Hansen 
Cc: Arjan van de Ven 
Cc: Alexey Dobriyan 
Cc: Andrew Morton 
Cc: Andy Lutomirski 
Cc: Linux API 
---
 fs/proc/Kconfig | 4 
 fs/proc/base.c  | 6 ++
 include/linux/proc_fs.h | 9 +
 3 files changed, 19 insertions(+)

diff --git a/fs/proc/Kconfig b/fs/proc/Kconfig
index 817c02b13b1d..d80ebf19d5f1 100644
--- a/fs/proc/Kconfig
+++ b/fs/proc/Kconfig
@@ -97,3 +97,7 @@ config PROC_CHILDREN
 
  Say Y if you are running any user-space software which takes benefit 
from
  this interface. For example, rkt is such a piece of software.
+
+config PROC_PID_ARCH_STATUS
+   def_bool n
+   depends on PROC_FS
diff --git a/fs/proc/base.c b/fs/proc/base.c
index 6a803a0b75df..8c71eba47031 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -3061,6 +3061,9 @@ static const struct pid_entry tgid_base_stuff[] = {
 #ifdef CONFIG_STACKLEAK_METRICS
ONE("stack_depth", S_IRUGO, proc_stack_depth),
 #endif
+#ifdef CONFIG_PROC_PID_ARCH_STATUS
+   ONE("arch_status", S_IRUGO, proc_pid_arch_status),
+#endif
 };
 
 static int proc_tgid_base_readdir(struct file *file, struct dir_context *ctx)
@@ -3449,6 +3452,9 @@ static const struct pid_entry tid_base_stuff[] = {
 #ifdef CONFIG_LIVEPATCH
ONE("patch_state",  S_IRUSR, proc_pid_patch_state),
 #endif
+#ifdef CONFIG_PROC_PID_ARCH_STATUS
+   ONE("arch_status", S_IRUGO, proc_pid_arch_status),
+#endif
 };
 
 static int proc_tid_base_readdir(struct file *file, struct dir_context *ctx)
diff --git a/include/linux/proc_fs.h b/include/linux/proc_fs.h
index 52a283ba0465..a705aa2d03f9 100644
--- a/include/linux/proc_fs.h
+++ b/include/linux/proc_fs.h
@@ -75,6 +75,15 @@ struct proc_dir_entry *proc_create_net_single_write(const 
char *name, umode_t mo
void *data);
 extern struct pid *tgid_pidfd_to_pid(const struct file *file);
 
+#ifdef CONFIG_PROC_PID_ARCH_STATUS
+/*
+ * The architecture which selects CONFIG_PROC_PID_ARCH_STATUS must
+ * provide proc_pid_arch_status() definition.
+ */
+int proc_pid_arch_status(struct seq_file *m, struct pid_namespace *ns,
+   struct pid *pid, struct task_struct *task);
+#endif /* CONFIG_PROC_PID_ARCH_STATUS */
+
 #else /* CONFIG_PROC_FS */
 
 static inline void proc_root_init(void)
-- 
2.17.1



[PATCH v18 3/3] Documentation/filesystems/proc.txt: add arch_status file

2019-04-25 Thread Aubrey Li
Added the /proc/<pid>/arch_status file, and added AVX512_elapsed_ms in
/proc/<pid>/arch_status. Report it in Documentation/filesystems/proc.txt.

Signed-off-by: Aubrey Li 
Cc: Thomas Gleixner 
Cc: Peter Zijlstra 
Cc: Andi Kleen 
Cc: Tim Chen 
Cc: Dave Hansen 
Cc: Arjan van de Ven 
Cc: Alexey Dobriyan 
Cc: Andrew Morton 
Cc: Andy Lutomirski 
Cc: Linux API 
---
 Documentation/filesystems/proc.txt | 39 ++
 1 file changed, 39 insertions(+)

diff --git a/Documentation/filesystems/proc.txt 
b/Documentation/filesystems/proc.txt
index 66cad5c86171..e8bc403d15df 100644
--- a/Documentation/filesystems/proc.txt
+++ b/Documentation/filesystems/proc.txt
@@ -45,6 +45,7 @@ Table of Contents
   3.9   /proc//map_files - Information about memory mapped files
   3.10  /proc//timerslack_ns - Task timerslack value
   3.11 /proc//patch_state - Livepatch patch operation state
+  3.12 /proc//arch_status - Task architecture specific information
 
   4Configuring procfs
   4.1  Mount options
@@ -1948,6 +1949,44 @@ patched.  If the patch is being enabled, then the task 
has already been
 patched.  If the patch is being disabled, then the task hasn't been
 unpatched yet.
 
+3.12 /proc//arch_status - task architecture specific status
+---
+When CONFIG_PROC_PID_ARCH_STATUS is enabled, this file displays the
+architecture specific status of the task.
+
+Example
+---
+ $ cat /proc/6753/arch_status
+ AVX512_elapsed_ms:  8
+
+Description
+---
+
+x86 specific entries:
+-
+ AVX512_elapsed_ms:
+ --
+  If AVX512 is supported on the machine, this entry shows the milliseconds
+  elapsed since the last time AVX512 usage was recorded. The recording
+  happens on a best effort basis when a task is scheduled out. This means
+  that the value depends on two factors:
+
+1) The time which the task spent on the CPU without being scheduled
+   out. With CPU isolation and a single runnable task this can take
+   several seconds.
+
+2) The time since the task was scheduled out last. Depending on the
+   reason for being scheduled out (time slice exhausted, syscall ...)
+   this can be arbitrary long time.
+
+  As a consequence the value cannot be considered precise and authoritative
+  information. The application which uses this information has to be aware
+  of the overall scenario on the system in order to determine whether a
+  task is a real AVX512 user or not.
+
+  A special value of '-1' indicates that no AVX512 usage was recorded, thus
+  the task is unlikely an AVX512 user, but depends on the workload and the
+  scheduling scenario, it also could be a false negative mentioned above.
 
 --
 Configuring procfs
-- 
2.17.1



Re: [RFC PATCH v2 00/17] Core scheduling v2

2019-04-24 Thread Aubrey Li
On Wed, Apr 24, 2019 at 10:00 PM Julien Desfossez
 wrote:
>
> On 24-Apr-2019 09:13:10 PM, Aubrey Li wrote:
> > On Wed, Apr 24, 2019 at 12:18 AM Vineeth Remanan Pillai
> >  wrote:
> > >
> > > Second iteration of the core-scheduling feature.
> > >
> > > This version fixes apparent bugs and performance issues in v1. This
> > > doesn't fully address the issue of core sharing between processes
> > > with different tags. Core sharing still happens 1% to 5% of the time
> > > based on the nature of workload and timing of the runnable processes.
> > >
> > > Changes in v2
> > > -
> > > - rebased on mainline commit: 6d906f99817951e2257d577656899da02bb33105
> >
> > Thanks to post v2, based on this version, here is my benchmarks result.
> >
> > Environment setup
> > --
> > Skylake server, 2 numa nodes, 104 CPUs (HT on)
> > cgroup1 workload, sysbench (CPU intensive non AVX workload)
> > cgroup2 workload, gemmbench (AVX512 workload)
> >
> > Case 1: task number < CPU num
> > 
> > 36 sysbench threads in cgroup1
> > 36 gemmbench threads in cgroup2
> >
> > core sched off:
> > - sysbench 95th percentile latency(ms): avg = 4.952, stddev = 0.55342
> > core sched on:
> > - sysbench 95th percentile latency(ms): avg = 3.549, stddev = 0.04449
> >
> > Due to core cookie matching, sysbench tasks won't be affect by AVX512
> > tasks, latency has ~28% improvement!!!
> >
> > Case 2: task number > CPU number
> > -
> > 72 sysbench threads in cgroup1
> > 72 gemmbench threads in cgroup2
> >
> > core sched off:
> > - sysbench 95th percentile latency(ms): avg = 11.914, stddev = 3.259
> > core sched on:
> > - sysbench 95th percentile latency(ms): avg = 13.289, stddev = 4.863
> >
> > So not only power, now security and performance is a pair of contradictions.
> > Due to core cookie not matching and forced idle introduced, latency has ~12%
> > regression.
> >
> > Any comments?
>
> Would it be possible to post the results with HT off as well ?

What's the point here of turning HT off? The latency is sensitive to the
relationship between the task count and the CPU count. Usually, fewer
CPUs mean more run-queue wait time and a worse result.

Thanks,
-Aubrey


Re: [RFC PATCH v2 00/17] Core scheduling v2

2019-04-24 Thread Aubrey Li
On Wed, Apr 24, 2019 at 12:18 AM Vineeth Remanan Pillai
 wrote:
>
> Second iteration of the core-scheduling feature.
>
> This version fixes apparent bugs and performance issues in v1. This
> doesn't fully address the issue of core sharing between processes
> with different tags. Core sharing still happens 1% to 5% of the time
> based on the nature of workload and timing of the runnable processes.
>
> Changes in v2
> -
> - rebased on mainline commit: 6d906f99817951e2257d577656899da02bb33105

Thanks for posting v2. Based on this version, here are my benchmark results.

Environment setup
--
Skylake server, 2 numa nodes, 104 CPUs (HT on)
cgroup1 workload, sysbench (CPU intensive non AVX workload)
cgroup2 workload, gemmbench (AVX512 workload)

Case 1: task number < CPU num

36 sysbench threads in cgroup1
36 gemmbench threads in cgroup2

core sched off:
- sysbench 95th percentile latency(ms): avg = 4.952, stddev = 0.55342
core sched on:
- sysbench 95th percentile latency(ms): avg = 3.549, stddev = 0.04449

Due to core cookie matching, sysbench tasks won't be affected by AVX512
tasks, so latency has a ~28% improvement!!!

Case 2: task number > CPU number
-
72 sysbench threads in cgroup1
72 gemmbench threads in cgroup2

core sched off:
- sysbench 95th percentile latency(ms): avg = 11.914, stddev = 3.259
core sched on:
- sysbench 95th percentile latency(ms): avg = 13.289, stddev = 4.863

So it's not only power: now security and performance are also a pair of
contradictions. Due to core cookie mismatches and the forced idle they
introduce, latency has a ~12% regression.

Any comments?

Thanks,
-Aubrey
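
For reference, the cgroup setup above can be reproduced with something like
the minimal sketch below. It assumes the series' cgroup-v1 tagging knob is
named cpu.tag and that the cpu controller is mounted at /sys/fs/cgroup/cpu;
both the knob name and the paths are assumptions about this RFC, not a
stable interface:

/*
 * Sketch: give cgroup1 (sysbench) and cgroup2 (gemmbench) their own
 * core scheduling cookies by tagging both groups. The cpu.tag knob
 * name and the mount point are assumptions about this RFC series.
 */
#include <stdio.h>

static int write_str(const char *path, const char *val)
{
	FILE *f = fopen(path, "w");

	if (!f) {
		perror(path);
		return -1;
	}
	fputs(val, f);
	return fclose(f);
}

int main(void)
{
	/* cgroup1 and cgroup2 are assumed to already exist */
	write_str("/sys/fs/cgroup/cpu/cgroup1/cpu.tag", "1");
	write_str("/sys/fs/cgroup/cpu/cgroup2/cpu.tag", "1");
	return 0;
}

The sysbench and gemmbench threads are then placed into the two groups via
the usual cgroup.procs interface before the latency measurement starts.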


Re: [RFC PATCH v2 15/17] sched: Trivial forced-newidle balancer

2019-04-23 Thread Aubrey Li
On Wed, Apr 24, 2019 at 12:18 AM Vineeth Remanan Pillai
 wrote:
>
> From: Peter Zijlstra (Intel) 
>
> When a sibling is forced-idle to match the core-cookie; search for
> matching tasks to fill the core.
>
> Signed-off-by: Peter Zijlstra (Intel) 
> ---
>  include/linux/sched.h |   1 +
>  kernel/sched/core.c   | 131 +-
>  kernel/sched/idle.c   |   1 +
>  kernel/sched/sched.h  |   6 ++
>  4 files changed, 138 insertions(+), 1 deletion(-)
>
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index a4b39a28236f..1a309e8546cd 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -641,6 +641,7 @@ struct task_struct {
>  #ifdef CONFIG_SCHED_CORE
> struct rb_node  core_node;
> unsigned long   core_cookie;
> +   unsigned intcore_occupation;
>  #endif
>
>  #ifdef CONFIG_CGROUP_SCHED
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 9e6e90c6f9b9..e8f5ec641d0a 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -217,6 +217,21 @@ struct task_struct *sched_core_find(struct rq *rq, 
> unsigned long cookie)
> return match;
>  }
>
> +struct task_struct *sched_core_next(struct task_struct *p, unsigned long 
> cookie)
> +{
> +   struct rb_node *node = >core_node;
> +
> +   node = rb_next(node);
> +   if (!node)
> +   return NULL;
> +
> +   p = container_of(node, struct task_struct, core_node);
> +   if (p->core_cookie != cookie)
> +   return NULL;
> +
> +   return p;
> +}
> +
>  /*
>   * The static-key + stop-machine variable are needed such that:
>   *
> @@ -3672,7 +3687,7 @@ pick_next_task(struct rq *rq, struct task_struct *prev, 
> struct rq_flags *rf)
> struct task_struct *next, *max = NULL;
> const struct sched_class *class;
> const struct cpumask *smt_mask;
> -   int i, j, cpu;
> +   int i, j, cpu, occ = 0;
>
> if (!sched_core_enabled(rq))
> return __pick_next_task(rq, prev, rf);
> @@ -3763,6 +3778,9 @@ pick_next_task(struct rq *rq, struct task_struct *prev, 
> struct rq_flags *rf)
> goto done;
> }
>
> +   if (!is_idle_task(p))
> +   occ++;
> +
> rq_i->core_pick = p;
>
> /*
> @@ -3786,6 +3804,7 @@ pick_next_task(struct rq *rq, struct task_struct *prev, 
> struct rq_flags *rf)
>
> cpu_rq(j)->core_pick = NULL;
> }
> +   occ = 1;
> goto again;
> }
> }
> @@ -3808,6 +3827,8 @@ next_class:;
>
> WARN_ON_ONCE(!rq_i->core_pick);
>
> +   rq_i->core_pick->core_occupation = occ;
> +
> if (i == cpu)
> continue;
>
> @@ -3823,6 +3844,114 @@ next_class:;
> return next;
>  }
>
> +static bool try_steal_cookie(int this, int that)
> +{
> +   struct rq *dst = cpu_rq(this), *src = cpu_rq(that);
> +   struct task_struct *p;
> +   unsigned long cookie;
> +   bool success = false;
> +

try_steal_cookie() is called in the loop of for_each_cpu_wrap().
The root domain could be large, and we should avoid stealing a cookie
if the source rq has only one task or the dst is really busy.

The following patch eliminated a deadlock issue on my side, if I didn't
miss anything in v1. I'll double check with v2, but it at least avoids
unnecessary irq off/on and double rq locking. In particular, it avoids
contending on the rq lock that an idle cpu holds while it is in the
middle of load_balance(), which this path would otherwise try to take.
I think it might be worth picking up.

Thanks,
-Aubrey

---
 kernel/sched/core.c | 7 +++
 1 file changed, 7 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 191ebf9..973a75d 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3876,6 +3876,13 @@ static bool try_steal_cookie(int this, int that)
unsigned long cookie;
bool success = false;

+   /*
+* Don't steal if src is idle or has only one runnable task,
+* or dst already has a runnable task
+*/
+   if (src->nr_running <= 1 || unlikely(dst->nr_running >= 1))
+   return false;
+
local_irq_disable();
double_rq_lock(dst, src);

-- 
2.7.4

> +   local_irq_disable();
> +   double_rq_lock(dst, src);
> +
> +   cookie = dst->core->core_cookie;
> +   if (!cookie)
> +   goto unlock;
> +
> +   if (dst->curr != dst->idle)
> +   goto unlock;
> +
> +   p = sched_core_find(src, cookie);
> +   if (p == src->idle)
> +   goto unlock;
> +
> +   do {
> +   if (p == src->core_pick || p == src->curr)

Re: [RFC PATCH v2 00/17] Core scheduling v2

2019-04-23 Thread Aubrey Li
On Wed, Apr 24, 2019 at 12:18 AM Vineeth Remanan Pillai
 wrote:
>
> Second iteration of the core-scheduling feature.
>
> This version fixes apparent bugs and performance issues in v1. This
> doesn't fully address the issue of core sharing between processes
> with different tags. Core sharing still happens 1% to 5% of the time
> based on the nature of workload and timing of the runnable processes.
>
> Changes in v2
> -
> - rebased on mainline commit: 6d906f99817951e2257d577656899da02bb33105
> - Fixes for couple of NULL pointer dereference crashes
>   - Subhra Mazumdar
>   - Tim Chen

Is this one missed, or fixed with a better implementation?

The boot-up CPUs don't match the possible cpu map, so rq->core of the
not-yet-onlined CPUs is not initialized, which causes a NULL pointer
dereference panic in online_fair_sched_group():

Thanks,
-Aubrey

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 85c728d..bdabf20 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10492,6 +10492,10 @@ void online_fair_sched_group(struct task_group *tg)
rq = cpu_rq(i);
se = tg->se[i];

+#ifdef CONFIG_SCHED_CORE
+   if (!rq->core)
+   continue;
+#endif
raw_spin_lock_irq(rq_lockp(rq));
update_rq_clock(rq);
attach_entity_cfs_rq(se);

> - Improves priority comparison logic for process in different cpus
>   - Peter Zijlstra
>   - Aaron Lu
> - Fixes a hard lockup in rq locking
>   - Vineeth Pillai
>   - Julien Desfossez
> - Fixes a performance issue seen on IO heavy workloads
>   - Vineeth Pillai
>   - Julien Desfossez
> - Fix for 32bit build
>   - Aubrey Li
>
> Issues
> --
> - Processes with different tags can still share the core
> - A crash when disabling cpus with core-scheduling on
>- https://paste.debian.net/plainh/fa6bcfa8
>
> ---
>
> Peter Zijlstra (16):
>   stop_machine: Fix stop_cpus_in_progress ordering
>   sched: Fix kerneldoc comment for ia64_set_curr_task
>   sched: Wrap rq::lock access
>   sched/{rt,deadline}: Fix set_next_task vs pick_next_task
>   sched: Add task_struct pointer to sched_class::set_curr_task
>   sched/fair: Export newidle_balance()
>   sched: Allow put_prev_task() to drop rq->lock
>   sched: Rework pick_next_task() slow-path
>   sched: Introduce sched_class::pick_task()
>   sched: Core-wide rq->lock
>   sched: Basic tracking of matching tasks
>   sched: A quick and dirty cgroup tagging interface
>   sched: Add core wide task selection and scheduling.
>   sched/fair: Add a few assertions
>   sched: Trivial forced-newidle balancer
>   sched: Debug bits...
>
> Vineeth Remanan Pillai (1):
>   sched: Wake up sibling if it has something to run
>
>  include/linux/sched.h|   9 +-
>  kernel/Kconfig.preempt   |   7 +-
>  kernel/sched/core.c  | 800 +--
>  kernel/sched/cpuacct.c   |  12 +-
>  kernel/sched/deadline.c  |  99 +++--
>  kernel/sched/debug.c |   4 +-
>  kernel/sched/fair.c  | 137 +--
>  kernel/sched/idle.c  |  42 +-
>  kernel/sched/pelt.h  |   2 +-
>  kernel/sched/rt.c|  96 +++--
>  kernel/sched/sched.h | 185 ++---
>  kernel/sched/stop_task.c |  35 +-
>  kernel/sched/topology.c  |   4 +-
>  kernel/stop_machine.c|   2 +
>  14 files changed, 1145 insertions(+), 289 deletions(-)
>
> --
> 2.17.1
>


[PATCH v17 1/3] proc: add /proc//arch_status

2019-04-21 Thread Aubrey Li
The architecture specific information of the running processes
could be useful to the userland. Add /proc//arch_status
interface support to examine process architecture specific
information externally.

Signed-off-by: Aubrey Li 
Cc: Thomas Gleixner 
Cc: Peter Zijlstra 
Cc: Andi Kleen 
Cc: Tim Chen 
Cc: Dave Hansen 
Cc: Arjan van de Ven 
Cc: Alexey Dobriyan 
Cc: Andrew Morton 
Cc: Andy Lutomirski 
Cc: Linux API 
---
 arch/x86/Kconfig |  1 +
 fs/proc/Kconfig  | 10 ++
 fs/proc/base.c   | 23 +++
 3 files changed, 34 insertions(+)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 5ad92419be19..d5a9c5ddd453 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -208,6 +208,7 @@ config X86
select USER_STACKTRACE_SUPPORT
select VIRT_TO_BUS
select X86_FEATURE_NAMES if PROC_FS
+   select PROC_PID_ARCH_STATUS if PROC_FS
 
 config INSTRUCTION_DECODER
def_bool y
diff --git a/fs/proc/Kconfig b/fs/proc/Kconfig
index 817c02b13b1d..101bf5054e81 100644
--- a/fs/proc/Kconfig
+++ b/fs/proc/Kconfig
@@ -97,3 +97,13 @@ config PROC_CHILDREN
 
  Say Y if you are running any user-space software which takes benefit 
from
  this interface. For example, rkt is such a piece of software.
+
+config PROC_PID_ARCH_STATUS
+   bool "Enable /proc//arch_status file"
+   default n
+   help
+ Provides a way to examine process architecture specific information.
+ See  for more information.
+
+ Say Y if you are running any user-space software which wants to obtain
+ process architecture specific information from this interface.
diff --git a/fs/proc/base.c b/fs/proc/base.c
index 6a803a0b75df..a890d9f12851 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -94,6 +94,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include "internal.h"
 #include "fd.h"
@@ -2957,6 +2958,22 @@ static int proc_stack_depth(struct seq_file *m, struct 
pid_namespace *ns,
 }
 #endif /* CONFIG_STACKLEAK_METRICS */
 
+/*
+ * Add support for task architecture specific output in /proc/pid/arch_status.
+ * task_arch_status() must be defined in asm/processor.h
+ */
+#ifdef CONFIG_PROC_PID_ARCH_STATUS
+# ifndef task_arch_status
+# define task_arch_status(m, task)
+# endif
+static int proc_pid_arch_status(struct seq_file *m, struct pid_namespace *ns,
+   struct pid *pid, struct task_struct *task)
+{
+   task_arch_status(m, task);
+   return 0;
+}
+#endif /* CONFIG_PROC_PID_ARCH_STATUS */
+
 /*
  * Thread groups
  */
@@ -3061,6 +3078,9 @@ static const struct pid_entry tgid_base_stuff[] = {
 #ifdef CONFIG_STACKLEAK_METRICS
ONE("stack_depth", S_IRUGO, proc_stack_depth),
 #endif
+#ifdef CONFIG_PROC_PID_ARCH_STATUS
+   ONE("arch_status", S_IRUGO, proc_pid_arch_status),
+#endif
 };
 
 static int proc_tgid_base_readdir(struct file *file, struct dir_context *ctx)
@@ -3449,6 +3469,9 @@ static const struct pid_entry tid_base_stuff[] = {
 #ifdef CONFIG_LIVEPATCH
ONE("patch_state",  S_IRUSR, proc_pid_patch_state),
 #endif
+#ifdef CONFIG_PROC_PID_ARCH_STATUS
+   ONE("arch_status", S_IRUGO, proc_pid_arch_status),
+#endif
 };
 
 static int proc_tid_base_readdir(struct file *file, struct dir_context *ctx)
-- 
2.17.1
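
The task_arch_status() hook above uses a common kernel pattern: the generic
code falls back to an empty macro unless the architecture header both
implements the function and defines a macro of the same name as a marker.
A minimal standalone illustration of the pattern (my_arch_hook is a made-up
name for the example):

#include <stdio.h>

/*
 * "Architecture" side: provide the hook and define a same-named macro
 * so the generic code can detect that an implementation exists.
 */
static void my_arch_hook(const char *task)
{
	printf("%s: arch specific status\n", task);
}
#define my_arch_hook my_arch_hook

/*
 * Generic side: if no architecture provided the hook, fall back to a
 * no-op macro, so callers need no #ifdefs. (Remove the block above to
 * see the fallback take effect.)
 */
#ifndef my_arch_hook
#define my_arch_hook(task) do { } while (0)
#endif

int main(void)
{
	my_arch_hook("demo-task");	/* resolves to the arch version here */
	return 0;
}

fs/proc/base.c above does exactly this with task_arch_status(), and arch/x86
provides the real implementation in patch 2/3.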



[PATCH v17 2/3] /proc/pid/arch_status: Add AVX-512 usage elapsed time

2019-04-21 Thread Aubrey Li
Use of AVX-512 components could cause the core turbo frequency to drop.
So it's useful to expose the AVX-512 usage elapsed time as a heuristic
hint for a user-space job scheduler to cluster the AVX-512-using tasks
together.

Tensorflow example:
$ while [ 1 ]; do cat /proc/tid/arch_status | grep AVX512; sleep 1; done
AVX512_elapsed_ms:  4
AVX512_elapsed_ms:  8
AVX512_elapsed_ms:  4

This means that 4 milliseconds have elapsed since AVX512 usage of the
tensorflow task was last recorded when the task was scheduled out.

Or:
$ cat /proc/tid/arch_status | grep AVX512
AVX512_elapsed_ms:  -1

The number '-1' indicates that no AVX512 usage was recorded before,
thus the task is unlikely to have a frequency drop issue.

User space tools may want to further check by:

$ perf stat --pid  -e core_power.lvl2_turbo_license -- sleep 1

 Performance counter stats for process id '3558':

 3,251,565,961  core_power.lvl2_turbo_license

   1.004031387 seconds time elapsed

Non-zero counter value confirms that the task causes frequency drop.

Signed-off-by: Aubrey Li 
Cc: Thomas Gleixner 
Cc: Peter Zijlstra 
Cc: Andi Kleen 
Cc: Tim Chen 
Cc: Dave Hansen 
Cc: Arjan van de Ven 
Cc: Alexey Dobriyan 
Cc: Andrew Morton 
Cc: Andy Lutomirski 
Cc: Linux API 
---
 arch/x86/include/asm/processor.h |  6 +
 arch/x86/kernel/fpu/xstate.c | 43 
 2 files changed, 49 insertions(+)

diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index 2bb3a648fc12..0728848473a2 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -991,4 +991,10 @@ enum l1tf_mitigations {
 
 extern enum l1tf_mitigations l1tf_mitigation;
 
+#ifdef CONFIG_PROC_PID_ARCH_STATUS
+/* Add support for task architecture specific output in /proc/pid/arch_status 
*/
+void task_arch_status(struct seq_file *m, struct task_struct *task);
+#define task_arch_status task_arch_status
+#endif /* CONFIG_PROC_PID_ARCH_STATUS */
+
 #endif /* _ASM_X86_PROCESSOR_H */
diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
index d7432c2b1051..a0dda11ab72e 100644
--- a/arch/x86/kernel/fpu/xstate.c
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -7,6 +7,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -1243,3 +1244,45 @@ int copy_user_to_xstate(struct xregs_state *xsave, const 
void __user *ubuf)
 
return 0;
 }
+
+#ifdef CONFIG_PROC_PID_ARCH_STATUS
+/*
+ * Report the amount of time elapsed in millisecond since last AVX512
+ * use in the task.
+ */
+static void avx512_status(struct seq_file *m, struct task_struct *task)
+{
+   unsigned long timestamp = READ_ONCE(task->thread.fpu.avx512_timestamp);
+   long delta;
+
+   if (!timestamp) {
+   /*
+* Report -1 if no AVX512 usage
+*/
+   delta = -1;
+   } else {
+   delta = (long)(jiffies - timestamp);
+   /*
+* Cap to LONG_MAX if time difference > LONG_MAX
+*/
+   if (delta < 0)
+   delta = LONG_MAX;
+   delta = jiffies_to_msecs(delta);
+   }
+
+   seq_put_decimal_ll(m, "AVX512_elapsed_ms:\t", delta);
+   seq_putc(m, '\n');
+}
+
+/*
+ * Report architecture specific information
+ */
+void task_arch_status(struct seq_file *m, struct task_struct *task)
+{
+   /*
+* Report AVX512 state if the processor and build option supported.
+*/
+   if (cpu_feature_enabled(X86_FEATURE_AVX512F))
+   avx512_status(m, task);
+}
+#endif /* CONFIG_PROC_PID_ARCH_STATUS */
-- 
2.17.1
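
The delta < 0 check in avx512_status() above guards against the case where
the jiffies difference no longer fits in a signed long. A small standalone
demonstration of the same arithmetic, using plain unsigned longs in place of
jiffies (the jiffies_to_msecs() conversion is omitted here):

#include <limits.h>
#include <stdio.h>

/* Mirror of the capping logic: unsigned subtraction handles counter
 * wrap, and a negative signed result means the difference exceeded
 * LONG_MAX, so it is clamped.
 */
static long elapsed(unsigned long now, unsigned long then)
{
	long delta = (long)(now - then);

	if (delta < 0)
		delta = LONG_MAX;
	return delta;
}

int main(void)
{
	printf("%ld\n", elapsed(1000, 990));		/* 10 */
	printf("%ld\n", elapsed(10, ULONG_MAX - 5));	/* 16: wrapped counter */
	printf("%ld\n", elapsed(ULONG_MAX, 0));		/* clamped to LONG_MAX */
	return 0;
}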



[PATCH v17 3/3] Documentation/filesystems/proc.txt: add arch_status file

2019-04-21 Thread Aubrey Li
Added /proc//arch_status file, and added AVX512_elapsed_ms in
/proc//arch_status. Report it in Documentation/filesystems/proc.txt

Signed-off-by: Aubrey Li 
Cc: Thomas Gleixner 
Cc: Peter Zijlstra 
Cc: Andi Kleen 
Cc: Tim Chen 
Cc: Dave Hansen 
Cc: Arjan van de Ven 
Cc: Alexey Dobriyan 
Cc: Andrew Morton 
Cc: Andy Lutomirski 
Cc: Linux API 
---
 Documentation/filesystems/proc.txt | 37 ++
 1 file changed, 37 insertions(+)

diff --git a/Documentation/filesystems/proc.txt 
b/Documentation/filesystems/proc.txt
index 66cad5c86171..cf5114a8fb13 100644
--- a/Documentation/filesystems/proc.txt
+++ b/Documentation/filesystems/proc.txt
@@ -45,6 +45,7 @@ Table of Contents
   3.9   /proc//map_files - Information about memory mapped files
   3.10  /proc//timerslack_ns - Task timerslack value
   3.11 /proc//patch_state - Livepatch patch operation state
+  3.12 /proc//arch_status - Task architecture specific information
 
   4Configuring procfs
   4.1  Mount options
@@ -1948,6 +1949,42 @@ patched.  If the patch is being enabled, then the task 
has already been
 patched.  If the patch is being disabled, then the task hasn't been
 unpatched yet.
 
+3.12 /proc//arch_status - task architecture specific status
+---
+When CONFIG_PROC_PID_ARCH_STATUS is enabled, this file displays the
+architecture specific status of the task.
+
+Example
+---
+ $ cat /proc/6753/arch_status
+ AVX512_elapsed_ms:  8
+
+Description
+---
+
+ AVX512_elapsed_ms:
+ --
+  If AVX512 is supported on the machine, this entry shows the milliseconds
+  elapsed since the last time AVX512 usage was recorded. The recording
+  happens on a best effort basis when a task is scheduled out. This means
+  that the value depends on two factors:
+
+1) The time which the task spent on the CPU without being scheduled
+   out. With CPU isolation and a single runnable task this can take
+   several seconds.
+
+2) The time since the task was scheduled out last. Depending on the
+   reason for being scheduled out (time slice exhausted, syscall ...)
+   this can be an arbitrarily long time.
+
+  As a consequence the value cannot be considered precise and authoritative
+  information. The application which uses this information has to be aware
+  of the overall scenario on the system in order to determine whether a
+  task is a real AVX512 user or not.
+
+  A special value of '-1' indicates that no AVX512 usage was recorded, thus
+  the task is unlikely to be an AVX512 user; but depending on the workload
+  and scheduling scenario, it could also be a false negative as mentioned above.
 
 --
 Configuring procfs
-- 
2.17.1



[PATCH v16 3/3] Documentation/filesystems/proc.txt: add AVX512_elapsed_ms

2019-04-17 Thread Aubrey Li
Added AVX512_elapsed_ms in /proc//status. Report it
in Documentation/filesystems/proc.txt

Signed-off-by: Aubrey Li 
Cc: Peter Zijlstra 
Cc: Andi Kleen 
Cc: Tim Chen 
Cc: Dave Hansen 
Cc: Arjan van de Ven 
Cc: Linux API 
Cc: Alexey Dobriyan 
Cc: Andrew Morton 
---
 Documentation/filesystems/proc.txt | 29 -
 1 file changed, 28 insertions(+), 1 deletion(-)

diff --git a/Documentation/filesystems/proc.txt 
b/Documentation/filesystems/proc.txt
index 66cad5c86171..c4a9e48681ad 100644
--- a/Documentation/filesystems/proc.txt
+++ b/Documentation/filesystems/proc.txt
@@ -207,6 +207,7 @@ read the file /proc/PID/status:
   Speculation_Store_Bypass:   thread vulnerable
   voluntary_ctxt_switches:0
   nonvoluntary_ctxt_switches: 1
+  AVX512_elapsed_ms:   8
 
 This shows you nearly the same information you would get if you viewed it with
 the ps  command.  In  fact,  ps  uses  the  proc  file  system  to  obtain its
@@ -224,7 +225,7 @@ asynchronous manner and the value may not be very precise. 
To see a precise
 snapshot of a moment, you can see /proc//smaps file and scan page table.
 It's slow but very precise.
 
-Table 1-2: Contents of the status files (as of 4.19)
+Table 1-2: Contents of the status files (as of 5.1)
 ..
  Field   Content
  Namefilename of the executable
@@ -289,6 +290,32 @@ Table 1-2: Contents of the status files (as of 4.19)
  Mems_allowed_list   Same as previous, but in "list format"
  voluntary_ctxt_switches number of voluntary context switches
  nonvoluntary_ctxt_switches  number of non voluntary context switches
+ AVX512_elapsed_ms   time elapsed since last AVX512 usage recorded
+
+ AVX512_elapsed_ms:
+ --
+  If AVX512 is supported on the machine, this entry shows the milliseconds
+  elapsed since the last time AVX512 usage was recorded. The recording
+  happens on a best effort basis when a task is scheduled out. This means
+  that the value depends on two factors:
+
+1) The time which the task spent on the CPU without being scheduled
+   out. With CPU isolation and a single runnable task this can take
+   several seconds.
+
+2) The time since the task was scheduled out last. Depending on the
+   reason for being scheduled out (time slice exhausted, syscall ...)
+   this can be an arbitrarily long time.
+
+  As a consequence the value cannot be considered precise and authoritative
+  information. The application which uses this information has to be aware
+  of the overall scenario on the system in order to determine whether a
+  task is a real AVX512 user or not.
+
+  A special value of '-1' indicates that no AVX512 usage was recorded, thus
+  the task is unlikely to be an AVX512 user; but depending on the workload
+  and scheduling scenario, it could also be a false negative as mentioned above.
+
 ..
 
 Table 1-3: Contents of the statm files (as of 2.6.8-rc3)
-- 
2.21.0



[PATCH v16 2/3] x86,/proc/pid/status: Add AVX-512 usage elapsed time

2019-04-17 Thread Aubrey Li
Use of AVX-512 components could cause the core turbo frequency to drop.
So it's useful to expose the AVX-512 usage elapsed time as a heuristic
hint for a user-space job scheduler to cluster the AVX-512-using tasks
together.

Tensorflow example:
$ while [ 1 ]; do cat /proc/tid/status | grep AVX; sleep 1; done
AVX512_elapsed_ms:  4
AVX512_elapsed_ms:  8
AVX512_elapsed_ms:  4

This means that 4 milliseconds have elapsed since AVX512 usage of the
tensorflow task was last recorded when the task was scheduled out.

Or:
$ cat /proc/tid/status | grep AVX512_elapsed_ms
AVX512_elapsed_ms:  -1

The number '-1' indicates that no AVX512 usage was recorded before,
thus the task is unlikely to have a frequency drop issue.

User space tools may want to further check by:

$ perf stat --pid  -e core_power.lvl2_turbo_license -- sleep 1

 Performance counter stats for process id '3558':

 3,251,565,961  core_power.lvl2_turbo_license

   1.004031387 seconds time elapsed

Non-zero counter value confirms that the task causes frequency drop.

Signed-off-by: Aubrey Li 
Cc: Peter Zijlstra 
Cc: Andi Kleen 
Cc: Tim Chen 
Cc: Dave Hansen 
Cc: Arjan van de Ven 
Cc: Linux API 
Cc: Alexey Dobriyan 
Cc: Andrew Morton 
---
 arch/x86/include/asm/processor.h |  4 +++
 arch/x86/kernel/fpu/xstate.c | 42 
 2 files changed, 46 insertions(+)

diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index 2bb3a648fc12..5a7271ab78d8 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -991,4 +991,8 @@ enum l1tf_mitigations {
 
 extern enum l1tf_mitigations l1tf_mitigation;
 
+/* Add support for architecture specific output in /proc/pid/status */
+void arch_proc_pid_status(struct seq_file *m, struct task_struct *task);
+#define arch_proc_pid_status arch_proc_pid_status
+
 #endif /* _ASM_X86_PROCESSOR_H */
diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
index d7432c2b1051..5e55ed9584ab 100644
--- a/arch/x86/kernel/fpu/xstate.c
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -7,6 +7,8 @@
 #include 
 #include 
 #include 
+#include 
+#include 
 
 #include 
 #include 
@@ -1243,3 +1245,43 @@ int copy_user_to_xstate(struct xregs_state *xsave, const 
void __user *ubuf)
 
return 0;
 }
+
+/*
+ * Report the amount of time elapsed in millisecond since last AVX512
+ * use in the task.
+ */
+static void avx512_status(struct seq_file *m, struct task_struct *task)
+{
+   unsigned long timestamp = READ_ONCE(task->thread.fpu.avx512_timestamp);
+   long delta;
+
+   if (!timestamp) {
+   /*
+* Report -1 if no AVX512 usage
+*/
+   delta = -1;
+   } else {
+   delta = (long)(jiffies - timestamp);
+   /*
+* Cap to LONG_MAX if time difference > LONG_MAX
+*/
+   if (delta < 0)
+   delta = LONG_MAX;
+   delta = jiffies_to_msecs(delta);
+   }
+
+   seq_put_decimal_ll(m, "AVX512_elapsed_ms:\t", delta);
+   seq_putc(m, '\n');
+}
+
+/*
+ * Report architecture specific information
+ */
+void arch_proc_pid_status(struct seq_file *m, struct task_struct *task)
+{
+   /*
+* Report AVX512 state if the processor and build option supported.
+*/
+   if (cpu_feature_enabled(X86_FEATURE_AVX512F))
+   avx512_status(m, task);
+}
-- 
2.21.0



[PATCH v16 1/3] /proc/pid/status: Add support for architecture specific output

2019-04-17 Thread Aubrey Li
The architecture specific information of the running processes could
be useful to the userland. Add support to examine process architecture
specific information externally.

Signed-off-by: Aubrey Li 
Cc: Peter Zijlstra 
Cc: Andi Kleen 
Cc: Tim Chen 
Cc: Dave Hansen 
Cc: Arjan van de Ven 
Cc: Linux API 
Cc: Alexey Dobriyan 
Cc: Andrew Morton 
---
 fs/proc/array.c | 9 +
 1 file changed, 9 insertions(+)

diff --git a/fs/proc/array.c b/fs/proc/array.c
index 2edbb657f859..a6b394402ea2 100644
--- a/fs/proc/array.c
+++ b/fs/proc/array.c
@@ -96,6 +96,14 @@
 #include 
 #include "internal.h"
 
+/*
+ * Add support for architecture specific output in /proc/pid/status.
+ * arch_proc_pid_status() must be defined in asm/processor.h
+ */
+#ifndef arch_proc_pid_status
+#define arch_proc_pid_status(m, task)
+#endif
+
 void proc_task_name(struct seq_file *m, struct task_struct *p, bool escape)
 {
char *buf;
@@ -424,6 +432,7 @@ int proc_pid_status(struct seq_file *m, struct 
pid_namespace *ns,
task_cpus_allowed(m, task);
cpuset_task_status_allowed(m, task);
task_context_switch_counts(m, task);
+   arch_proc_pid_status(m, task);
return 0;
 }
 
-- 
2.21.0



[PATCH v15 3/3] Documentation/filesystems/proc.txt: add AVX512_elapsed_ms

2019-04-16 Thread Aubrey Li
Added AVX512_elapsed_ms in /proc//status. Report it
in Documentation/filesystems/proc.txt

Signed-off-by: Aubrey Li 
Cc: Peter Zijlstra 
Cc: Andi Kleen 
Cc: Tim Chen 
Cc: Dave Hansen 
Cc: Arjan van de Ven 
Cc: Linux API 
Cc: Alexey Dobriyan 
Cc: Andrew Morton 
---
 Documentation/filesystems/proc.txt | 29 -
 1 file changed, 28 insertions(+), 1 deletion(-)

diff --git a/Documentation/filesystems/proc.txt 
b/Documentation/filesystems/proc.txt
index 66cad5c86171..c4a9e48681ad 100644
--- a/Documentation/filesystems/proc.txt
+++ b/Documentation/filesystems/proc.txt
@@ -207,6 +207,7 @@ read the file /proc/PID/status:
   Speculation_Store_Bypass:   thread vulnerable
   voluntary_ctxt_switches:0
   nonvoluntary_ctxt_switches: 1
+  AVX512_elapsed_ms:   8
 
 This shows you nearly the same information you would get if you viewed it with
 the ps  command.  In  fact,  ps  uses  the  proc  file  system  to  obtain its
@@ -224,7 +225,7 @@ asynchronous manner and the value may not be very precise. 
To see a precise
 snapshot of a moment, you can see /proc//smaps file and scan page table.
 It's slow but very precise.
 
-Table 1-2: Contents of the status files (as of 4.19)
+Table 1-2: Contents of the status files (as of 5.1)
 ..
  Field   Content
  Namefilename of the executable
@@ -289,6 +290,32 @@ Table 1-2: Contents of the status files (as of 4.19)
  Mems_allowed_list   Same as previous, but in "list format"
  voluntary_ctxt_switches number of voluntary context switches
  nonvoluntary_ctxt_switches  number of non voluntary context switches
+ AVX512_elapsed_ms   time elapsed since last AVX512 usage recorded
+
+ AVX512_elapsed_ms:
+ --
+  If AVX512 is supported on the machine, this entry shows the milliseconds
+  elapsed since the last time AVX512 usage was recorded. The recording
+  happens on a best effort basis when a task is scheduled out. This means
+  that the value depends on two factors:
+
+1) The time which the task spent on the CPU without being scheduled
+   out. With CPU isolation and a single runnable task this can take
+   several seconds.
+
+2) The time since the task was scheduled out last. Depending on the
+   reason for being scheduled out (time slice exhausted, syscall ...)
+   this can be an arbitrarily long time.
+
+  As a consequence the value cannot be considered precise and authoritative
+  information. The application which uses this information has to be aware
+  of the overall scenario on the system in order to determine whether a
+  task is a real AVX512 user or not.
+
+  A special value of '-1' indicates that no AVX512 usage was recorded, thus
+  the task is unlikely to be an AVX512 user; but depending on the workload
+  and scheduling scenario, it could also be a false negative as mentioned above.
+
 ..
 
 Table 1-3: Contents of the statm files (as of 2.6.8-rc3)
-- 
2.21.0



[PATCH v15 2/3] x86,/proc/pid/status: Add AVX-512 usage elapsed time

2019-04-16 Thread Aubrey Li
Use of AVX-512 components could cause the core turbo frequency to drop.
So it's useful to expose the AVX-512 usage elapsed time as a heuristic
hint for a user-space job scheduler to cluster the AVX-512-using tasks
together.

Tensorflow example:
$ while [ 1 ]; do cat /proc/tid/status | grep AVX; sleep 1; done
AVX512_elapsed_ms:  4
AVX512_elapsed_ms:  8
AVX512_elapsed_ms:  4

This means that 4 milliseconds have elapsed since AVX512 usage of the
tensorflow task was last recorded when the task was scheduled out.

Or:
$ cat /proc/tid/status | grep AVX512_elapsed_ms
AVX512_elapsed_ms:  -1

The number '-1' indicates that no AVX512 usage was recorded before,
thus the task is unlikely to have a frequency drop issue.

User space tools may want to further check by:

$ perf stat --pid  -e core_power.lvl2_turbo_license -- sleep 1

 Performance counter stats for process id '3558':

 3,251,565,961  core_power.lvl2_turbo_license

   1.004031387 seconds time elapsed

Non-zero counter value confirms that the task causes frequency drop.

Signed-off-by: Aubrey Li 
Cc: Peter Zijlstra 
Cc: Andi Kleen 
Cc: Tim Chen 
Cc: Dave Hansen 
Cc: Arjan van de Ven 
Cc: Linux API 
Cc: Alexey Dobriyan 
Cc: Andrew Morton 
---
 arch/x86/include/asm/processor.h |  4 +++
 arch/x86/kernel/fpu/xstate.c | 42 
 2 files changed, 46 insertions(+)

diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index 2bb3a648fc12..5a7271ab78d8 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -991,4 +991,8 @@ enum l1tf_mitigations {
 
 extern enum l1tf_mitigations l1tf_mitigation;
 
+/* Add support for architecture specific output in /proc/pid/status */
+void arch_proc_pid_status(struct seq_file *m, struct task_struct *task);
+#define arch_proc_pid_status arch_proc_pid_status
+
 #endif /* _ASM_X86_PROCESSOR_H */
diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
index d7432c2b1051..5e55ed9584ab 100644
--- a/arch/x86/kernel/fpu/xstate.c
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -7,6 +7,8 @@
 #include 
 #include 
 #include 
+#include 
+#include 
 
 #include 
 #include 
@@ -1243,3 +1245,43 @@ int copy_user_to_xstate(struct xregs_state *xsave, const 
void __user *ubuf)
 
return 0;
 }
+
+/*
+ * Report the amount of time elapsed in millisecond since last AVX512
+ * use in the task.
+ */
+static void avx512_status(struct seq_file *m, struct task_struct *task)
+{
+   unsigned long timestamp = READ_ONCE(task->thread.fpu.avx512_timestamp);
+   long delta;
+
+   if (!timestamp) {
+   /*
+* Report -1 if no AVX512 usage
+*/
+   delta = -1;
+   } else {
+   delta = (long)(jiffies - timestamp);
+   /*
+* Cap to LONG_MAX if time difference > LONG_MAX
+*/
+   if (delta < 0)
+   delta = LONG_MAX;
+   delta = jiffies_to_msecs(delta);
+   }
+
+   seq_put_decimal_ll(m, "AVX512_elapsed_ms:\t", delta);
+   seq_putc(m, '\n');
+}
+
+/*
+ * Report architecture specific information
+ */
+void arch_proc_pid_status(struct seq_file *m, struct task_struct *task)
+{
+   /*
+* Report AVX512 state if the processor and build option supported.
+*/
+   if (cpu_feature_enabled(X86_FEATURE_AVX512F))
+   avx512_status(m, task);
+}
-- 
2.21.0



[PATCH v15 1/3] /proc/pid/status: Add support for architecture specific output

2019-04-16 Thread Aubrey Li
The architecture specific information of the running processes could
be useful to the userland. Add support to examine process architecture
specific information externally.

Signed-off-by: Aubrey Li 
Cc: Peter Zijlstra 
Cc: Andi Kleen 
Cc: Tim Chen 
Cc: Dave Hansen 
Cc: Arjan van de Ven 
Cc: Linux API 
Cc: Alexey Dobriyan 
Cc: Andrew Morton 
---
 fs/proc/array.c | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/fs/proc/array.c b/fs/proc/array.c
index 2edbb657f859..87bc7e882d35 100644
--- a/fs/proc/array.c
+++ b/fs/proc/array.c
@@ -96,6 +96,11 @@
 #include 
 #include "internal.h"
 
+/* Add support for architecture specific output in /proc/pid/status */
+#ifndef arch_proc_pid_status
+#define arch_proc_pid_status(m, task)
+#endif
+
 void proc_task_name(struct seq_file *m, struct task_struct *p, bool escape)
 {
char *buf;
@@ -424,6 +429,7 @@ int proc_pid_status(struct seq_file *m, struct 
pid_namespace *ns,
task_cpus_allowed(m, task);
cpuset_task_status_allowed(m, task);
task_context_switch_counts(m, task);
+   arch_proc_pid_status(m, task);
return 0;
 }
 
-- 
2.21.0



Re: [RFC][PATCH 13/16] sched: Add core wide task selection and scheduling.

2019-04-10 Thread Aubrey Li
On Wed, Apr 10, 2019 at 12:36 PM Aaron Lu  wrote:
>
> On Tue, Apr 09, 2019 at 11:09:45AM -0700, Tim Chen wrote:
> > Now that we have accumulated quite a number of different fixes to your 
> > orginal
> > posted patches.  Would you like to post a v2 of the core scheduler with the 
> > fixes?
>
> One more question I'm not sure: should a task with cookie=0, i.e. tasks
> that are untagged, be allowed to scheduled on the the same core with
> another tagged task?
>
> The current patch seems to disagree on this, e.g. in pick_task(),
> if max is already chosen but max->core_cookie == 0, then we didn't care
> about cookie and simply use class_pick for the other cpu. This means we
> could schedule two tasks with different cookies(one is zero and the
> other can be tagged).
>
> But then sched_core_find() only allow idle task to match with any tagged
> tasks(we didn't place untagged tasks to the core tree of course :-).
>
> Thoughts? Do I understand this correctly? If so, I think we probably
> want to make this clear before v2. I personally feel, we shouldn't allow
> untagged tasks(like kernel threads) to match with tagged tasks.

Does it make sense if we treat untagged tasks as the hypervisor, and
tasks with different cookies as different VMs? Isolation is done between
VMs, not between a VM and the hypervisor.

Did you see anything harmful if an untagged task and a tagged task
run simultaneously on the same core?

Thanks,
-Aubrey


[PATCH v14 1/3] /proc/pid/status: Add support for architecture specific output

2019-04-09 Thread Aubrey Li
The architecture specific information of the running processes could
be useful to the userland. Add support to examine process architecture
specific information externally.

Signed-off-by: Aubrey Li 
Cc: Peter Zijlstra 
Cc: Andi Kleen 
Cc: Tim Chen 
Cc: Dave Hansen 
Cc: Arjan van de Ven 
Cc: Linux API 
Cc: Alexey Dobriyan 
Cc: Andrew Morton 
---
 fs/proc/array.c | 5 +
 include/linux/proc_fs.h | 2 ++
 2 files changed, 7 insertions(+)

diff --git a/fs/proc/array.c b/fs/proc/array.c
index 2edbb657f859..331592a61718 100644
--- a/fs/proc/array.c
+++ b/fs/proc/array.c
@@ -401,6 +401,10 @@ static inline void task_thp_status(struct seq_file *m, 
struct mm_struct *mm)
seq_printf(m, "THP_enabled:\t%d\n", thp_enabled);
 }
 
+void __weak arch_proc_pid_status(struct seq_file *m, struct task_struct *task)
+{
+}
+
 int proc_pid_status(struct seq_file *m, struct pid_namespace *ns,
struct pid *pid, struct task_struct *task)
 {
@@ -424,6 +428,7 @@ int proc_pid_status(struct seq_file *m, struct 
pid_namespace *ns,
task_cpus_allowed(m, task);
cpuset_task_status_allowed(m, task);
task_context_switch_counts(m, task);
+   arch_proc_pid_status(m, task);
return 0;
 }
 
diff --git a/include/linux/proc_fs.h b/include/linux/proc_fs.h
index 52a283ba0465..bf4328cb58ed 100644
--- a/include/linux/proc_fs.h
+++ b/include/linux/proc_fs.h
@@ -74,6 +74,8 @@ struct proc_dir_entry *proc_create_net_single_write(const 
char *name, umode_t mo
proc_write_t write,
void *data);
 extern struct pid *tgid_pidfd_to_pid(const struct file *file);
+/* Add support for architecture specific output in /proc/pid/status */
+void arch_proc_pid_status(struct seq_file *m, struct task_struct *task);
 
 #else /* CONFIG_PROC_FS */
 
-- 
2.21.0
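
Unlike the later versions, which use an #ifndef/#define fallback in
fs/proc/array.c, this v14 relies on a __weak default that an architecture
overrides simply by providing a strong definition. A minimal standalone
illustration, using the GCC attribute that __weak expands to (the file
names and arch_hook are made up for the example):

/* generic.c: generic code supplies a weak no-op default, mirroring the
 * empty __weak arch_proc_pid_status() stub in this patch.
 */
#include <stdio.h>

void __attribute__((weak)) arch_hook(const char *task)
{
	(void)task;		/* default: report nothing */
}

int main(void)
{
	arch_hook("demo-task");
	return 0;
}

/* arch.c: an architecture overrides the weak symbol just by defining a
 * strong version; no macro or #ifdef is needed at the call site.
 *
 *	$ cc generic.c          && ./a.out	# prints nothing
 *	$ cc generic.c arch.c   && ./a.out	# prints the arch line
 *
 *	#include <stdio.h>
 *
 *	void arch_hook(const char *task)
 *	{
 *		printf("%s: arch specific status\n", task);
 *	}
 */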



[PATCH v14 3/3] Documentation/filesystems/proc.txt: add AVX512_elapsed_ms

2019-04-09 Thread Aubrey Li
Added AVX512_elapsed_ms in /proc//status. Report it
in Documentation/filesystems/proc.txt

Signed-off-by: Aubrey Li 
Cc: Peter Zijlstra 
Cc: Andi Kleen 
Cc: Tim Chen 
Cc: Dave Hansen 
Cc: Arjan van de Ven 
Cc: Linux API 
Cc: Alexey Dobriyan 
Cc: Andrew Morton 
---
 Documentation/filesystems/proc.txt | 29 -
 1 file changed, 28 insertions(+), 1 deletion(-)

diff --git a/Documentation/filesystems/proc.txt 
b/Documentation/filesystems/proc.txt
index 66cad5c86171..c4a9e48681ad 100644
--- a/Documentation/filesystems/proc.txt
+++ b/Documentation/filesystems/proc.txt
@@ -207,6 +207,7 @@ read the file /proc/PID/status:
   Speculation_Store_Bypass:   thread vulnerable
   voluntary_ctxt_switches:0
   nonvoluntary_ctxt_switches: 1
+  AVX512_elapsed_ms:   8
 
 This shows you nearly the same information you would get if you viewed it with
 the ps  command.  In  fact,  ps  uses  the  proc  file  system  to  obtain its
@@ -224,7 +225,7 @@ asynchronous manner and the value may not be very precise. 
To see a precise
 snapshot of a moment, you can see /proc//smaps file and scan page table.
 It's slow but very precise.
 
-Table 1-2: Contents of the status files (as of 4.19)
+Table 1-2: Contents of the status files (as of 5.1)
 ..
  Field   Content
  Namefilename of the executable
@@ -289,6 +290,32 @@ Table 1-2: Contents of the status files (as of 4.19)
  Mems_allowed_list   Same as previous, but in "list format"
  voluntary_ctxt_switches number of voluntary context switches
  nonvoluntary_ctxt_switches  number of non voluntary context switches
+ AVX512_elapsed_ms   time elapsed since last AVX512 usage recorded
+
+ AVX512_elapsed_ms:
+ --
+  If AVX512 is supported on the machine, this entry shows the milliseconds
+  elapsed since the last time AVX512 usage was recorded. The recording
+  happens on a best effort basis when a task is scheduled out. This means
+  that the value depends on two factors:
+
+1) The time which the task spent on the CPU without being scheduled
+   out. With CPU isolation and a single runnable task this can take
+   several seconds.
+
+2) The time since the task was scheduled out last. Depending on the
+   reason for being scheduled out (time slice exhausted, syscall ...)
+   this can be an arbitrarily long time.
+
+  As a consequence the value cannot be considered precise and authoritative
+  information. The application which uses this information has to be aware
+  of the overall scenario on the system in order to determine whether a
+  task is a real AVX512 user or not.
+
+  A special value of '-1' indicates that no AVX512 usage was recorded, thus
+  the task is unlikely to be an AVX512 user; but depending on the workload
+  and scheduling scenario, it could also be a false negative as mentioned above.
+
 ..
 
 Table 1-3: Contents of the statm files (as of 2.6.8-rc3)
-- 
2.21.0



[PATCH v14 2/3] x86,/proc/pid/status: Add AVX-512 usage elapsed time

2019-04-09 Thread Aubrey Li
Use of AVX-512 components could cause the core turbo frequency to drop.
So it's useful to expose the AVX-512 usage elapsed time as a heuristic
hint for a user-space job scheduler to cluster the AVX-512-using tasks
together.

Tensorflow example:
$ while [ 1 ]; do cat /proc/tid/status | grep AVX; sleep 1; done
AVX512_elapsed_ms:  4
AVX512_elapsed_ms:  8
AVX512_elapsed_ms:  4

This means that 4 milliseconds have elapsed since AVX512 usage of the
tensorflow task was last recorded when the task was scheduled out.

Or:
$ cat /proc/tid/status | grep AVX512_elapsed_ms
AVX512_elapsed_ms:  -1

The number '-1' indicates that no AVX512 usage was recorded before,
thus the task is unlikely to have a frequency drop issue.

User space tools may want to further check by:

$ perf stat --pid  -e core_power.lvl2_turbo_license -- sleep 1

 Performance counter stats for process id '3558':

 3,251,565,961  core_power.lvl2_turbo_license

   1.004031387 seconds time elapsed

Non-zero counter value confirms that the task causes frequency drop.

Signed-off-by: Aubrey Li 
Cc: Peter Zijlstra 
Cc: Andi Kleen 
Cc: Tim Chen 
Cc: Dave Hansen 
Cc: Arjan van de Ven 
Cc: Linux API 
Cc: Alexey Dobriyan 
Cc: Andrew Morton 
---
 arch/x86/kernel/fpu/xstate.c | 42 
 1 file changed, 42 insertions(+)

diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
index d7432c2b1051..5e55ed9584ab 100644
--- a/arch/x86/kernel/fpu/xstate.c
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -7,6 +7,8 @@
 #include 
 #include 
 #include 
+#include 
+#include 
 
 #include 
 #include 
@@ -1243,3 +1245,43 @@ int copy_user_to_xstate(struct xregs_state *xsave, const 
void __user *ubuf)
 
return 0;
 }
+
+/*
+ * Report the amount of time elapsed in millisecond since last AVX512
+ * use in the task.
+ */
+static void avx512_status(struct seq_file *m, struct task_struct *task)
+{
+   unsigned long timestamp = READ_ONCE(task->thread.fpu.avx512_timestamp);
+   long delta;
+
+   if (!timestamp) {
+   /*
+* Report -1 if no AVX512 usage
+*/
+   delta = -1;
+   } else {
+   delta = (long)(jiffies - timestamp);
+   /*
+* Cap to LONG_MAX if time difference > LONG_MAX
+*/
+   if (delta < 0)
+   delta = LONG_MAX;
+   delta = jiffies_to_msecs(delta);
+   }
+
+   seq_put_decimal_ll(m, "AVX512_elapsed_ms:\t", delta);
+   seq_putc(m, '\n');
+}
+
+/*
+ * Report architecture specific information
+ */
+void arch_proc_pid_status(struct seq_file *m, struct task_struct *task)
+{
+   /*
+* Report AVX512 state if the processor and build option supported.
+*/
+   if (cpu_feature_enabled(X86_FEATURE_AVX512F))
+   avx512_status(m, task);
+}
-- 
2.21.0



Re: [RFC][PATCH 15/16] sched: Trivial forced-newidle balancer

2019-04-05 Thread Aubrey Li
On Thu, Apr 4, 2019 at 4:31 PM Aubrey Li  wrote:
>
> On Fri, Feb 22, 2019 at 12:42 AM Peter Zijlstra  wrote:
> >
> > On Thu, Feb 21, 2019 at 04:19:46PM +, Valentin Schneider wrote:
> > > Hi,
> > >
> > > On 18/02/2019 16:56, Peter Zijlstra wrote:
> > > [...]
> > > > +static bool try_steal_cookie(int this, int that)
> > > > +{
> > > > +   struct rq *dst = cpu_rq(this), *src = cpu_rq(that);
> > > > +   struct task_struct *p;
> > > > +   unsigned long cookie;
> > > > +   bool success = false;
> > > > +
> > > > +   local_irq_disable();
> > > > +   double_rq_lock(dst, src);
>
> Here, should we check dst's and src's rq status before locking their rqs?
> If src is idle, it could already be in the middle of load balancing.
>
> Thanks,
> -Aubrey
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 3e3162f..a1e0a6f 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -3861,6 +3861,13 @@ static bool try_steal_cookie(int this, int that)
> unsigned long cookie;
> bool success = false;
>
> +   /*
> +* Don't steal if src is idle or has only one runnable task,
> +* or dst already has a runnable task
> +*/
> +   if (src->nr_running <= 1 || unlikely(dst->nr_running >= 1))
> +   return false;
> +
> local_irq_disable();
> double_rq_lock(dst, src);

This seems to eliminate a hard lockup on my side.

Thanks,
-Aubrey

[  122.961909] NMI watchdog: Watchdog detected hard LOCKUP on cpu 0
[  122.961910] Modules linked in: ipt_MASQUERADE xfrm_user xfrm_algo
iptable_nat nf_nat_ipv4 xt_addrtype iptable_filter ip_tables
xt_conntrack x_tables nf_nat nf_conntracki
[  122.961940] irq event stamp: 8200
[  122.961941] hardirqs last  enabled at (8199): []
trace_hardirqs_on_thunk+0x1a/0x1c
[  122.961942] hardirqs last disabled at (8200): []
trace_hardirqs_off_thunk+0x1a/0x1c
[  122.961942] softirqs last  enabled at (8192): []
__do_softirq+0x3a3/0x3f2
[  122.961943] softirqs last disabled at (8185): []
irq_exit+0xc1/0xd0
[  122.961944] CPU: 0 PID: 2704 Comm: schbench Tainted: G  I
5.0.0-rc8-00544-gf24f5e9-dirty #20
[  122.961945] Hardware name: Intel Corporation S2600CP/S2600CP, BIOS
SE5C600.86B.99.99.x058.082120120902 08/21/2012
[  122.961945] RIP: 0010:native_queued_spin_lock_slowpath+0x5c/0x1d0
[  122.961946] Code: ff ff ff 75 40 f0 0f ba 2f 08 0f 82 cd 00 00 00
8b 07 30 e4 09 c6 f7 c6 00 ff ff ff 75 1b 85 f6 74 0e 8b 07 84 c0 74
08 f3 90 <8b> 07 84 c0 75 f8 b8 05
[  122.961947] RSP: :888c09c03e78 EFLAGS: 0002
[  122.961948] RAX: 00740101 RBX: 888c0ade4400 RCX: 0540
[  122.961948] RDX: 0002 RSI: 0001 RDI: 888c0ade4400
[  122.961949] RBP: 888c0ade4400 R08:  R09: 0001
[  122.961950] R10: 888c09c03e20 R11:  R12: 001e4400
[  122.961950] R13:  R14: 0016 R15: 888be034
[  122.961951] FS:  7f21e17ea700() GS:888c09c0()
knlGS:
[  122.961951] CS:  0010 DS:  ES:  CR0: 80050033
[  122.961952] CR2: 7f21e17ea728 CR3: 000be0b6a002 CR4: 000606f0
[  122.961952] Call Trace:
[  122.961953]  
[  122.961953]  do_raw_spin_lock+0xb4/0xc0
[  122.961954]  _raw_spin_lock+0x4b/0x60
[  122.961954]  scheduler_tick+0x48/0x170
[  122.961955]  ? tick_sched_do_timer+0x60/0x60
[  122.961955]  update_process_times+0x40/0x50
[  122.961956]  tick_sched_handle+0x22/0x60
[  122.961956]  tick_sched_timer+0x37/0x70
[  122.961957]  __hrtimer_run_queues+0xed/0x3f0
[  122.961957]  hrtimer_interrupt+0x122/0x270
[  122.961958]  smp_apic_timer_interrupt+0x86/0x210
[  122.961958]  apic_timer_interrupt+0xf/0x20
[  122.961959]  
[  122.961959] RIP: 0033:0x7fff855a2839
[  122.961960] Code: 08 3b 15 6a c8 ff ff 75 df 31 c0 5d c3 b8 e4 00
00 00 5d 0f 05 c3 f3 90 eb ce 0f 1f 80 00 00 00 00 55 48 85 ff 48 89
e5 41 54 <49> 89 f4 53 74 30 48 8b
[  122.961961] RSP: 002b:7f21e17e9e38 EFLAGS: 0206 ORIG_RAX:
ff13
[  122.961962] RAX: 0002038b RBX: 002dc6c0 RCX: 
[  122.961962] RDX: 7f21e17e9e60 RSI:  RDI: 7f21e17e9e50
[  122.961963] RBP: 7f21e17e9e40 R08:  R09: 7a14
[  122.961963] R10: 7f21e17e9e30 R11: 0246 R12: 7f21e17e9ed0
[  122.961964] R13: 7f2202716e6f R14:  R15: 7f21fc826b00
[  122.961964] Kernel panic - not syncing: Hard LOCKUP
[  122.961965] CPU: 0 PID: 2704 Comm: schbench Tainted: G  I
5.0.0-rc8-00544-gf24f5e9-dirty #20
[  122.961966] Hardware name: Intel Corporation S2600CP/S2600CP, BIOS
SE5C600.86B.99.99.x058.082120120902 08/21

Re: [RFC][PATCH 15/16] sched: Trivial forced-newidle balancer

2019-04-04 Thread Aubrey Li
On Fri, Feb 22, 2019 at 12:42 AM Peter Zijlstra  wrote:
>
> On Thu, Feb 21, 2019 at 04:19:46PM +, Valentin Schneider wrote:
> > Hi,
> >
> > On 18/02/2019 16:56, Peter Zijlstra wrote:
> > [...]
> > > +static bool try_steal_cookie(int this, int that)
> > > +{
> > > +   struct rq *dst = cpu_rq(this), *src = cpu_rq(that);
> > > +   struct task_struct *p;
> > > +   unsigned long cookie;
> > > +   bool success = false;
> > > +
> > > +   local_irq_disable();
> > > +   double_rq_lock(dst, src);

Here, should we check dst's and src's rq status before locking their rqs?
If src is idle, it could already be in the middle of load balancing.

Thanks,
-Aubrey

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 3e3162f..a1e0a6f 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3861,6 +3861,13 @@ static bool try_steal_cookie(int this, int that)
unsigned long cookie;
bool success = false;

+   /*
+* Don't steal if src is idle or has only one runnable task,
+* or dst already has a runnable task
+*/
+   if (src->nr_running <= 1 || unlikely(dst->nr_running >= 1))
+   return false;
+
local_irq_disable();
double_rq_lock(dst, src);


Re: [RFC][PATCH 00/16] sched: Core scheduling

2019-03-18 Thread Aubrey Li
On Tue, Mar 12, 2019 at 7:36 AM Subhra Mazumdar
 wrote:
>
>
> On 3/11/19 11:34 AM, Subhra Mazumdar wrote:
> >
> > On 3/10/19 9:23 PM, Aubrey Li wrote:
> >> On Sat, Mar 9, 2019 at 3:50 AM Subhra Mazumdar
> >>  wrote:
> >>> expected. Most of the performance recovery happens in patch 15 which,
> >>> unfortunately, is also the one that introduces the hard lockup.
> >>>
> >> After applied Subhra's patch, the following is triggered by enabling
> >> core sched when a cgroup is
> >> under heavy load.
> >>
> > It seems you are facing some other deadlock where printk is involved.
> > Can you
> > drop the last patch (patch 16 sched: Debug bits...) and try?
> >
> > Thanks,
> > Subhra
> >
> Never Mind, I am seeing the same lockdep deadlock output even w/o patch
> 16. Btw
> the NULL fix had something missing, following works.
>

Okay, here is another one: on my system, the boot-up CPUs don't match
the possible cpu map, so rq->core of the not-yet-onlined CPUs is not
initialized, which causes a NULL pointer dereference panic in
online_fair_sched_group():

And here is a quick fix.
-
@@ -10488,7 +10493,8 @@ void online_fair_sched_group(struct task_group *tg)
for_each_possible_cpu(i) {
rq = cpu_rq(i);
se = tg->se[i];
-
+   if (!rq->core)
+   continue;
raw_spin_lock_irq(rq_lockp(rq));
update_rq_clock(rq);
attach_entity_cfs_rq(se);

Thanks,
-Aubrey


Re: [RFC][PATCH 00/16] sched: Core scheduling

2019-03-13 Thread Aubrey Li
On Thu, Mar 14, 2019 at 8:35 AM Tim Chen  wrote:
> >>
> >> One more NULL pointer dereference:
> >>
> >> Mar 12 02:24:46 aubrey-ivb kernel: [  201.916741] core sched enabled
> >> [  201.950203] BUG: unable to handle kernel NULL pointer dereference
> >> at 0008
> >> [  201.950254] [ cut here ]
> >> [  201.959045] #PF error: [normal kernel read fault]
> >> [  201.964272] !se->on_rq
> >> [  201.964287] WARNING: CPU: 22 PID: 2965 at kernel/sched/fair.c:6849
> >> set_next_buddy+0x52/0x70
> >
> Shouldn't the for_each_sched_entity(se) skip the code block for !se case
> have avoided null pointer access of se?
>
> Since
> #define for_each_sched_entity(se) \
> for (; se; se = se->parent)
>
> Scratching my head a bit here on how your changes would have made
> a difference.

This NULL pointer dereference is not reproducible, which made me think
the change works...

>
> In your original log, I wonder if the !se->on_rq warning on CPU 22 is mixed 
> with the actual OOPs?
> Saw also in your original log rb_insert_color.  Wonder if that
> was actually the source of the Oops?

No chance to figure this out; I only saw it once, while the lockup
occurs more frequently.

Thanks,
-Aubrey


Re: [RFC][PATCH 00/16] sched: Core scheduling

2019-03-12 Thread Aubrey Li
On Tue, Mar 12, 2019 at 3:45 PM Aubrey Li  wrote:
>
> On Tue, Mar 12, 2019 at 7:36 AM Subhra Mazumdar
>  wrote:
> >
> >
> > On 3/11/19 11:34 AM, Subhra Mazumdar wrote:
> > >
> > > On 3/10/19 9:23 PM, Aubrey Li wrote:
> > >> On Sat, Mar 9, 2019 at 3:50 AM Subhra Mazumdar
> > >>  wrote:
> > >>> expected. Most of the performance recovery happens in patch 15 which,
> > >>> unfortunately, is also the one that introduces the hard lockup.
> > >>>
> > >> After applied Subhra's patch, the following is triggered by enabling
> > >> core sched when a cgroup is
> > >> under heavy load.
> > >>
> > > It seems you are facing some other deadlock where printk is involved.
> > > Can you
> > > drop the last patch (patch 16 sched: Debug bits...) and try?
> > >
> > > Thanks,
> > > Subhra
> > >
> > Never Mind, I am seeing the same lockdep deadlock output even w/o patch
> > 16. Btw
> > the NULL fix had something missing,
>
> One more NULL pointer dereference:
>
> Mar 12 02:24:46 aubrey-ivb kernel: [  201.916741] core sched enabled
> [  201.950203] BUG: unable to handle kernel NULL pointer dereference
> at 0008
> [  201.950254] [ cut here ]
> [  201.959045] #PF error: [normal kernel read fault]
> [  201.964272] !se->on_rq
> [  201.964287] WARNING: CPU: 22 PID: 2965 at kernel/sched/fair.c:6849
> set_next_buddy+0x52/0x70

A quick workaround below:

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 1d0dac4fd94f..ef6acfe2cf7d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6834,7 +6834,7 @@ static void set_last_buddy(struct sched_entity *se)
return;

for_each_sched_entity(se) {
-   if (SCHED_WARN_ON(!se->on_rq))
+   if (SCHED_WARN_ON(!(se && se->on_rq)))
return;
cfs_rq_of(se)->last = se;
}
@@ -6846,7 +6846,7 @@ static void set_next_buddy(struct sched_entity *se)
return;

for_each_sched_entity(se) {
-   if (SCHED_WARN_ON(!se->on_rq))
+   if (SCHED_WARN_ON(!(se && se->on_rq)))
return;
cfs_rq_of(se)->next = se;
}

And now I'm running into a hard LOCKUP:

[  326.336279] NMI watchdog: Watchdog detected hard LOCKUP on cpu 31
[  326.336280] Modules linked in: ipt_MASQUERADE xfrm_user xfrm_algo
iptable_nat nf_nat_ipv4 xt_addrtype iptable_filter ip_tables
xt_conntrack x_tables nf_nat nf_conntracki
[  326.336311] irq event stamp: 164460
[  326.336312] hardirqs last  enabled at (164459):
[] sched_core_balance+0x247/0x470
[  326.336312] hardirqs last disabled at (164460):
[] sched_core_balance+0x113/0x470
[  326.336313] softirqs last  enabled at (164250):
[] __do_softirq+0x359/0x40a
[  326.336314] softirqs last disabled at (164213):
[] irq_exit+0xc1/0xd0
[  326.336315] CPU: 31 PID: 0 Comm: swapper/31 Tainted: G  I
5.0.0-rc8-00542-gd697415be692-dirty #15
[  326.336316] Hardware name: Intel Corporation S2600CP/S2600CP, BIOS
SE5C600.86B.99.99.x058.082120120902 08/21/2012
[  326.336317] RIP: 0010:native_queued_spin_lock_slowpath+0x18f/0x1c0
[  326.336318] Code: c1 ee 12 83 e0 03 83 ee 01 48 c1 e0 05 48 63 f6
48 05 80 51 1e 00 48 03 04 f5 40 58 39 82 48 89 10 8b 42 08 85 c0 75
09 f3 90 <8b> 42 08 85 c0 74 f7 4b
[  326.336318] RSP: :c9000643bd58 EFLAGS: 0046
[  326.336319] RAX:  RBX: 888c0ade4400 RCX: 0080
[  326.336320] RDX: 88980bbe5180 RSI: 0019 RDI: 888c0ade4400
[  326.336321] RBP: 888c0ade4400 R08: 0080 R09: 001e3a80
[  326.336321] R10: c9000643bd08 R11:  R12: 
[  326.336322] R13:  R14: 88980bbe4400 R15: 001f
[  326.336323] FS:  () GS:88980ba0()
knlGS:
[  326.336323] CS:  0010 DS:  ES:  CR0: 80050033
[  326.336324] CR2: 7fdcd7fd7728 CR3: 0017e821a001 CR4: 000606e0
[  326.336325] Call Trace:
[  326.336325]  do_raw_spin_lock+0xab/0xb0
[  326.336326]  _raw_spin_lock+0x4b/0x60
[  326.336326]  double_rq_lock+0x99/0x140
[  326.336327]  sched_core_balance+0x11e/0x470
[  326.336327]  __balance_callback+0x49/0xa0
[  326.336328]  __schedule+0x1113/0x1570
[  326.336328]  schedule_idle+0x1e/0x40
[  326.336329]  do_idle+0x16b/0x2a0
[  326.336329]  cpu_startup_entry+0x19/0x20
[  326.336330]  start_secondary+0x17f/0x1d0
[  326.336331]  secondary_startup_64+0xa4/0xb0
[  330.959367] ---[ end Kernel panic - not syncing: Hard LOCKUP ]---


Re: [RFC][PATCH 00/16] sched: Core scheduling

2019-03-12 Thread Aubrey Li
On Tue, Mar 12, 2019 at 7:36 AM Subhra Mazumdar
 wrote:
>
>
> On 3/11/19 11:34 AM, Subhra Mazumdar wrote:
> >
> > On 3/10/19 9:23 PM, Aubrey Li wrote:
> >> On Sat, Mar 9, 2019 at 3:50 AM Subhra Mazumdar
> >>  wrote:
> >>> expected. Most of the performance recovery happens in patch 15 which,
> >>> unfortunately, is also the one that introduces the hard lockup.
> >>>
> >> After applied Subhra's patch, the following is triggered by enabling
> >> core sched when a cgroup is
> >> under heavy load.
> >>
> > It seems you are facing some other deadlock where printk is involved.
> > Can you
> > drop the last patch (patch 16 sched: Debug bits...) and try?
> >
> > Thanks,
> > Subhra
> >
> Never Mind, I am seeing the same lockdep deadlock output even w/o patch
> 16. Btw
> the NULL fix had something missing,

One more NULL pointer dereference:

Mar 12 02:24:46 aubrey-ivb kernel: [  201.916741] core sched enabled
[  201.950203] BUG: unable to handle kernel NULL pointer dereference
at 0008
[  201.950254] [ cut here ]
[  201.959045] #PF error: [normal kernel read fault]
[  201.964272] !se->on_rq
[  201.964287] WARNING: CPU: 22 PID: 2965 at kernel/sched/fair.c:6849
set_next_buddy+0x52/0x70
[  201.969596] PGD 800be9ed7067 P4D 800be9ed7067 PUD c00911067 PMD 0
[  201.972300] Modules linked in: ipt_MASQUERADE xfrm_user xfrm_algo
iptable_nat nf_nat_ipv4 xt_addrtype iptable_filter ip_tables
xt_conntrack x_tables nf_nat nf_conntracki
[  201.981712] Oops:  [#1] SMP PTI
[  201.989463] CPU: 22 PID: 2965 Comm: schbench Tainted: G  I
 5.0.0-rc8-00542-gd697415be692-dirty #13
[  202.074710] CPU: 27 PID: 2947 Comm: schbench Tainted: G  I
 5.0.0-rc8-00542-gd697415be692-dirty #13
[  202.078662] Hardware name: Intel Corporation S2600CP/S2600CP, BIOS
SE5C600.86B.99.99.x058.082120120902 08/21/2012
[  202.078674] RIP: 0010:set_next_buddy+0x52/0x70
[  202.090135] Hardware name: Intel Corporation S2600CP/S2600CP, BIOS
SE5C600.86B.99.99.x058.082120120902 08/21/2012
[  202.090144] RIP: 0010:rb_insert_color+0x17/0x190
[  202.101623] Code: 48 85 ff 74 10 8b 47 40 85 c0 75 e2 80 3d 9e e5
6a 01 00 74 02 f3 c3 48 c7 c7 5c 05 2c 82 c6 05 8c e5 6a 01 01 e8 2e
bb fb ff <0f> 0b c3 83 bf 04 03 0e
[  202.113216] Code: f3 c3 31 c0 c3 0f 1f 40 00 66 2e 0f 1f 84 00 00
00 00 00 48 8b 17 48 85 d2 0f 84 4d 01 00 00 48 8b 02 a8 01 0f 85 6d
01 00 00 <48> 8b 48 08 49 89 c0 44
[  202.118263] RSP: 0018:c9000a5cbbb0 EFLAGS: 00010086
[  202.129858] RSP: 0018:c9000a463cc0 EFLAGS: 00010046
[  202.135102] RAX:  RBX: 88980047e800 RCX: 
[  202.135105] RDX: 888be28caa40 RSI: 0001 RDI: 8110c3fa
[  202.156251] RAX:  RBX: 888bfeb8 RCX: 888bfeb8
[  202.156255] RDX: 888be28c8348 RSI: 88980b5e50c8 RDI: 888bfeb80348
[  202.177390] RBP: 88980047ea00 R08:  R09: 001e3a80
[  202.177393] R10: c9000a5cbb28 R11:  R12: 888c0b9e4400
[  202.183317] RBP: 88980b5e4400 R08: 014f R09: 8898049cf000
[  202.183320] R10: 0078 R11: 8898049cfc5c R12: 0004
[  202.189241] R13: 888be28caa40 R14: 0009 R15: 0009
[  202.189245] FS:  7f05f87f8700() GS:888c0b80()
knlGS:
[  202.197310] R13: c9000a463d20 R14: 0246 R15: 001c
[  202.197314] FS:  7f0611cca700() GS:88980b20()
knlGS:
[  202.205373] CS:  0010 DS:  ES:  CR0: 80050033
[  202.205377] CR2: 7f05e9fdb728 CR3: 000be4d0e006 CR4: 000606e0
[  202.213441] CS:  0010 DS:  ES:  CR0: 80050033
[  202.213444] CR2: 0008 CR3: 000be4d0e005 CR4: 000606e0
[  202.221509] Call Trace:
[  202.229574] Call Trace:
[  202.237640]  dequeue_task_fair+0x7e/0x1b0
[  202.245700]  enqueue_task+0x6f/0xb0
[  202.253761]  __schedule+0xcc8/0x1570
[  202.261823]  ttwu_do_activate+0x6a/0xc0
[  202.270985]  schedule+0x28/0x70
[  202.279042]  try_to_wake_up+0x20b/0x510
[  202.288206]  futex_wait_queue_me+0xbf/0x130
[  202.294714]  wake_up_q+0x3f/0x80
[  202.302773]  futex_wait+0xeb/0x240
[  202.309282]  futex_wake+0x157/0x180
[  202.317353]  ? __switch_to_asm+0x40/0x70
[  202.320158]  do_futex+0x451/0xad0
[  202.322970]  ? __switch_to_asm+0x34/0x70
[  202.322980]  ? __switch_to_asm+0x40/0x70
[  202.327541]  ? do_nanosleep+0xcc/0x1a0
[  202.331521]  do_futex+0x479/0xad0
[  202.335599]  ? hrtimer_nanosleep+0xe7/0x230
[  202.339954]  ? lockdep_hardirqs_on+0xf0/0x180
[  202.343548]  __x64_sys_futex+0x134/0x180
[  202.347906]  ? _raw_spin_unlock_irq+0x29/0x40
[  202.352660]  ? trace_hardirqs_off_thunk+0x1a/0x1c
[  202.356343]  ? finish

Re: [RFC][PATCH 00/16] sched: Core scheduling

2019-03-10 Thread Aubrey Li
On Sat, Mar 9, 2019 at 3:50 AM Subhra Mazumdar
 wrote:
>
> expected. Most of the performance recovery happens in patch 15 which,
> unfortunately, is also the one that introduces the hard lockup.
>

After applying Subhra's patch, the following is triggered by enabling
core sched when a cgroup is under heavy load.

Mar 10 22:46:57 aubrey-ivb kernel: [ 2662.973792] core sched enabled
[ 2663.348371] WARNING: CPU: 5 PID: 3087 at kernel/sched/pelt.h:119
update_load_avg+00
[ 2663.357960] Modules linked in: ipt_MASQUERADE xfrm_user xfrm_algo
iptable_nat nf_ni
[ 2663.443269] CPU: 5 PID: 3087 Comm: schbench Tainted: G  I
5.0.0-rc8-7
[ 2663.454520] Hardware name: Intel Corporation S2600CP/S2600CP, BIOS
SE5C600.86B.99.2
[ 2663.466063] RIP: 0010:update_load_avg+0x52/0x5e0
[ 2663.471286] Code: 8b af 70 01 00 00 8b 3d 14 a6 6e 01 85 ff 74 1c
e9 4c 04 00 00 40
[ 2663.492350] RSP: :c9000a6a3dd8 EFLAGS: 00010046
[ 2663.498276] RAX:  RBX: 888be7937600 RCX: 0001
[ 2663.506337] RDX:  RSI: 888c09fe4418 RDI: 0046
[ 2663.514398] RBP: 888bdfb8aac0 R08:  R09: 888bdfb9aad8
[ 2663.522459] R10:  R11:  R12: 
[ 2663.530520] R13: 888c09fe4400 R14: 0001 R15: 888bdfb8aa40
[ 2663.538582] FS:  7f006a7cc700() GS:888c0a60()
knlGS:000
[ 2663.547739] CS:  0010 DS:  ES:  CR0: 80050033
[ 2663.554241] CR2: 00604048 CR3: 000bfdd64006 CR4: 000606e0
[ 2663.562310] Call Trace:
[ 2663.565128]  ? update_load_avg+0xa6/0x5e0
[ 2663.569690]  ? update_load_avg+0xa6/0x5e0
[ 2663.574252]  set_next_entity+0xd9/0x240
[ 2663.578619]  set_next_task_fair+0x6e/0xa0
[ 2663.583182]  __schedule+0x12af/0x1570
[ 2663.587350]  schedule+0x28/0x70
[ 2663.590937]  exit_to_usermode_loop+0x61/0xf0
[ 2663.595791]  prepare_exit_to_usermode+0xbf/0xd0
[ 2663.600936]  retint_user+0x8/0x18
[ 2663.604719] RIP: 0033:0x402057
[ 2663.608209] Code: 24 10 64 48 8b 04 25 28 00 00 00 48 89 44 24 38
31 c0 e8 2c eb ff
[ 2663.629351] RSP: 002b:7f006a7cbe50 EFLAGS: 0246 ORIG_RAX:
ff02
[ 2663.637924] RAX: 0029778f RBX: 002dc6c0 RCX: 0002
[ 2663.645985] RDX: 7f006a7cbe60 RSI:  RDI: 7f006a7cbe50
[ 2663.654046] RBP: 0006 R08: 0001 R09: 7ffe965450a0
[ 2663.662108] R10: 7f006a7cbe30 R11: 0003b368 R12: 7f006a7cbed0
[ 2663.670160] R13: 7f0098c1ce6f R14:  R15: 7f0084a30390
[ 2663.678226] irq event stamp: 27182
[ 2663.682114] hardirqs last  enabled at (27181): []
exit_to_usermo0
[ 2663.692348] hardirqs last disabled at (27182): []
__schedule+0xd0
[ 2663.701716] softirqs last  enabled at (27004): []
__do_softirq+0a
[ 2663.711268] softirqs last disabled at (26999): []
irq_exit+0xc1/0
[ 2663.720247] ---[ end trace d46e59b84bcde977 ]---
[ 2663.725503] BUG: unable to handle kernel paging request at 005df5f0
[ 2663.733377] #PF error: [WRITE]
[ 2663.736875] PGD 800bff037067 P4D 800bff037067 PUD bff0b1067
PMD bfbf02067 0
[ 2663.745954] Oops: 0002 [#1] SMP PTI
[ 2663.749931] CPU: 5 PID: 3078 Comm: schbench Tainted: GW I
5.0.0-rc8-7
[ 2663.761233] Hardware name: Intel Corporation S2600CP/S2600CP, BIOS
SE5C600.86B.99.2
[ 2663.772836] RIP: 0010:native_queued_spin_lock_slowpath+0x183/0x1c0
[ 2663.779827] Code: f3 90 48 8b 32 48 85 f6 74 f6 eb e8 c1 ee 12 83
e0 03 83 ee 01 42
[ 2663.800970] RSP: :c9000a633e18 EFLAGS: 00010006
[ 2663.806892] RAX: 005df5f0 RBX: 888bdfbf2a40 RCX: 0018
[ 2663.814954] RDX: 888c0a7e5180 RSI: 1fff RDI: 888bdfbf2a40
[ 2663.823015] RBP: 888bdfbf2a40 R08: 0018 R09: 0001
[ 2663.831068] R10: c9000a633dc0 R11: 888bdfbf2a58 R12: 0046
[ 2663.839129] R13: 888bdfb8aa40 R14: 888be5b90d80 R15: 888be5b90d80
[ 2663.847182] FS:  7f00797ea700() GS:888c0a60()
knlGS:000
[ 2663.856330] CS:  0010 DS:  ES:  CR0: 80050033
[ 2663.862834] CR2: 005df5f0 CR3: 000bfdd64006 CR4: 000606e0
[ 2663.870895] Call Trace:
[ 2663.873715]  do_raw_spin_lock+0xab/0xb0
[ 2663.878095]  _raw_spin_lock_irqsave+0x63/0x80
[ 2663.883066]  __balance_callback+0x19/0xa0
[ 2663.887626]  __schedule+0x1113/0x1570
[ 2663.891803]  ? trace_hardirqs_off_thunk+0x1a/0x1c
[ 2663.897142]  ? apic_timer_interrupt+0xa/0x20
[ 2663.901996]  ? interrupt_entry+0x9a/0xe0
[ 2663.906450]  ? apic_timer_interrupt+0xa/0x20
[ 2663.911307] Modules linked in: ipt_MASQUERADE xfrm_user xfrm_algo
iptable_nat nf_ni
[ 2663.996886] CR2: 005df5f0
[ 2664.000686] ---[ end trace d46e59b84bcde978 ]---
[ 2664.011393] RIP: 0010:native_queued_spin_lock_slowpath+0x183/0x1c0
[ 2664.018386] Code: f3 90 48 8b 32 48 85 f6 74 f6 eb e8 c1 ee 12 83
e0 03 83 ee 01 42
[ 2664.039529] RSP: :c9000a633e18 

Re: [RFC][PATCH 00/16] sched: Core scheduling

2019-02-26 Thread Aubrey Li
On Tue, Feb 26, 2019 at 4:26 PM Aubrey Li  wrote:
>
> On Sat, Feb 23, 2019 at 3:27 AM Tim Chen  wrote:
> >
> > On 2/22/19 6:20 AM, Peter Zijlstra wrote:
> > > On Fri, Feb 22, 2019 at 01:17:01PM +0100, Paolo Bonzini wrote:
> > >> On 18/02/19 21:40, Peter Zijlstra wrote:
> > >>> On Mon, Feb 18, 2019 at 09:49:10AM -0800, Linus Torvalds wrote:
> > >>>> On Mon, Feb 18, 2019 at 9:40 AM Peter Zijlstra  
> > >>>> wrote:
> > >>>>>
> > >>>>> However; whichever way around you turn this cookie; it is expensive 
> > >>>>> and nasty.
> > >>>>
> > >>>> Do you (or anybody else) have numbers for real loads?
> > >>>>
> > >>>> Because performance is all that matters. If performance is bad, then
> > >>>> it's pointless, since just turning off SMT is the answer.
> > >>>
> > >>> Not for these patches; they stopped crashing only yesterday and I
> > >>> cleaned them up and send them out.
> > >>>
> > >>> The previous version; which was more horrible; but L1TF complete, was
> > >>> between OK-ish and horrible depending on the number of VMEXITs a
> > >>> workload had.
> > >>>
> > >>> If there were close to no VMEXITs, it beat smt=off, if there were lots
> > >>> of VMEXITs it was far far worse. Supposedly hosting people try their
> > >>> very bestest to have no VMEXITs so it mostly works for them (with the
> > >>> obvious exception of single VCPU guests).
> > >>
> > >> If you are giving access to dedicated cores to guests, you also let them
> > >> do PAUSE/HLT/MWAIT without vmexits and the host just thinks it's a CPU
> > >> bound workload.
> > >>
> > >> In any case, IIUC what you are looking for is:
> > >>
> > >> 1) take a benchmark that *is* helped by SMT, this will be something CPU
> > >> bound.
> > >>
> > >> 2) compare two runs, one without SMT and without core scheduler, and one
> > >> with SMT+core scheduler.
> > >>
> > >> 3) find out whether performance is helped by SMT despite the increased
> > >> overhead of the core scheduler
> > >>
> > >> Do you want some other load in the host, so that the scheduler actually
> > >> does do something?  Or is the point just that you show that the
> > >> performance isn't affected when the scheduler does not have anything to
> > >> do (which should be obvious, but having numbers is always better)?
> > >
> > > Well, what _I_ want is for all this to just go away :-)
> > >
> > > Tim did much of testing last time around; and I don't think he did
> > > core-pinning of VMs much (although I'm sure he did some of that). I'm
> >
> > Yes. The last time around I tested basic scenarios like:
> > 1. single VM pinned on a core
> > 2. 2 VMs pinned on a core
> > 3. system oversubscription (no pinning)
> >
> > In general, CPU bound benchmarks and even things without too much I/O
> > causing lots of VMexits perform better with HT than without for Peter's
> > last patchset.
> >
> > > still a complete virt noob; I can barely boot a VM to save my life.
> > >
> > > (you should be glad to not have heard my cursing at qemu cmdline when
> > > trying to reproduce some of Tim's results -- lets just say that I can
> > > deal with gpg)
> > >
> > > I'm sure he tried some oversubscribed scenarios without pinning.
> >
> > We did try some oversubscribed scenarios like SPECVirt, that tried to
> > squeeze tons of VMs on a single system in over subscription mode.
> >
> > There're two main problems in the last go around:
> >
> > 1. Workload with high rate of Vmexits (SpecVirt is one)
> > were a major source of pain when we tried Peter's previous patchset.
> > The switch from vcpus to qemu and back in previous version of Peter's patch
> > requires some coordination between the hyperthread siblings via IPI.  And 
> > for
> > workload that does this a lot, the overhead quickly added up.
> >
> > For Peter's new patch, this overhead hopefully would be reduced and give
> > better performance.
> >
> > 2. Load balancing is quite tricky.  Peter's last patchset did not have
> > load balancing for consolidating compatible running threads.
> > I did some non-sophisticated load balancing
> > to pair vcpu

Re: [RFC][PATCH 00/16] sched: Core scheduling

2019-02-26 Thread Aubrey Li
On Sat, Feb 23, 2019 at 3:27 AM Tim Chen  wrote:
>
> On 2/22/19 6:20 AM, Peter Zijlstra wrote:
> > On Fri, Feb 22, 2019 at 01:17:01PM +0100, Paolo Bonzini wrote:
> >> On 18/02/19 21:40, Peter Zijlstra wrote:
> >>> On Mon, Feb 18, 2019 at 09:49:10AM -0800, Linus Torvalds wrote:
>  On Mon, Feb 18, 2019 at 9:40 AM Peter Zijlstra  
>  wrote:
> >
> > However; whichever way around you turn this cookie; it is expensive and 
> > nasty.
> 
>  Do you (or anybody else) have numbers for real loads?
> 
>  Because performance is all that matters. If performance is bad, then
>  it's pointless, since just turning off SMT is the answer.
> >>>
> >>> Not for these patches; they stopped crashing only yesterday and I
> >>> cleaned them up and send them out.
> >>>
> >>> The previous version; which was more horrible; but L1TF complete, was
> >>> between OK-ish and horrible depending on the number of VMEXITs a
> >>> workload had.
> >>>
> >>> If there were close to no VMEXITs, it beat smt=off, if there were lots
> >>> of VMEXITs it was far far worse. Supposedly hosting people try their
> >>> very bestest to have no VMEXITs so it mostly works for them (with the
> >>> obvious exception of single VCPU guests).
> >>
> >> If you are giving access to dedicated cores to guests, you also let them
> >> do PAUSE/HLT/MWAIT without vmexits and the host just thinks it's a CPU
> >> bound workload.
> >>
> >> In any case, IIUC what you are looking for is:
> >>
> >> 1) take a benchmark that *is* helped by SMT, this will be something CPU
> >> bound.
> >>
> >> 2) compare two runs, one without SMT and without core scheduler, and one
> >> with SMT+core scheduler.
> >>
> >> 3) find out whether performance is helped by SMT despite the increased
> >> overhead of the core scheduler
> >>
> >> Do you want some other load in the host, so that the scheduler actually
> >> does do something?  Or is the point just that you show that the
> >> performance isn't affected when the scheduler does not have anything to
> >> do (which should be obvious, but having numbers is always better)?
> >
> > Well, what _I_ want is for all this to just go away :-)
> >
> > Tim did much of testing last time around; and I don't think he did
> > core-pinning of VMs much (although I'm sure he did some of that). I'm
>
> Yes. The last time around I tested basic scenarios like:
> 1. single VM pinned on a core
> 2. 2 VMs pinned on a core
> 3. system oversubscription (no pinning)
>
> In general, CPU bound benchmarks and even things without too much I/O
> causing lots of VMexits perform better with HT than without for Peter's
> last patchset.
>
> > still a complete virt noob; I can barely boot a VM to save my life.
> >
> > (you should be glad to not have heard my cursing at qemu cmdline when
> > trying to reproduce some of Tim's results -- lets just say that I can
> > deal with gpg)
> >
> > I'm sure he tried some oversubscribed scenarios without pinning.
>
> We did try some oversubscribed scenarios like SPECVirt, that tried to
> squeeze tons of VMs on a single system in over subscription mode.
>
> There're two main problems in the last go around:
>
> 1. Workload with high rate of Vmexits (SpecVirt is one)
> were a major source of pain when we tried Peter's previous patchset.
> The switch from vcpus to qemu and back in previous version of Peter's patch
> requires some coordination between the hyperthread siblings via IPI.  And for
> workload that does this a lot, the overhead quickly added up.
>
> For Peter's new patch, this overhead hopefully would be reduced and give
> better performance.
>
> 2. Load balancing is quite tricky.  Peter's last patchset did not have
> load balancing for consolidating compatible running threads.
> I did some non-sophisticated load balancing
> to pair vcpus up.  But the constant vcpu migrations overhead probably ate up
> any improvements from better load pairing.  So I didn't get much
> improvement in the over-subscription case when turning on load balancing
> to consolidate the VCPUs of the same VM. We'll probably have to try
> out this incarnation of Peter's patch and see how well the load balancing
> works.
>
> I'll try to line up some benchmarking folks to do some tests.

I can help to do some basic tests.

Cgroup bias looks weird to me. If I have hundreds of cgroups, should I turn
core scheduling (cpu.tag) on one by one? Or is there a global knob I missed?

Thanks,
-Aubrey


[PATCH v13 1/3] /proc/pid/status: Add support for architecture specific output

2019-02-23 Thread Aubrey Li
The architecture-specific information of running processes could be
useful to userland. Add support to examine process architecture-specific
information externally.

Signed-off-by: Aubrey Li 
Cc: Peter Zijlstra 
Cc: Andi Kleen 
Cc: Tim Chen 
Cc: Dave Hansen 
Cc: Arjan van de Ven 
---
 fs/proc/array.c | 5 +
 include/linux/proc_fs.h | 2 ++
 2 files changed, 7 insertions(+)

diff --git a/fs/proc/array.c b/fs/proc/array.c
index 9d428d5a0ac8..ea7a981f289c 100644
--- a/fs/proc/array.c
+++ b/fs/proc/array.c
@@ -401,6 +401,10 @@ static inline void task_thp_status(struct seq_file *m, 
struct mm_struct *mm)
seq_printf(m, "THP_enabled:\t%d\n", thp_enabled);
 }
 
+void __weak arch_proc_pid_status(struct seq_file *m, struct task_struct *task)
+{
+}
+
 int proc_pid_status(struct seq_file *m, struct pid_namespace *ns,
struct pid *pid, struct task_struct *task)
 {
@@ -424,6 +428,7 @@ int proc_pid_status(struct seq_file *m, struct 
pid_namespace *ns,
task_cpus_allowed(m, task);
cpuset_task_status_allowed(m, task);
task_context_switch_counts(m, task);
+   arch_proc_pid_status(m, task);
return 0;
 }
 
diff --git a/include/linux/proc_fs.h b/include/linux/proc_fs.h
index d0e1f1522a78..1de9ba1b064f 100644
--- a/include/linux/proc_fs.h
+++ b/include/linux/proc_fs.h
@@ -73,6 +73,8 @@ struct proc_dir_entry *proc_create_net_single_write(const 
char *name, umode_t mo
int (*show)(struct seq_file 
*, void *),
proc_write_t write,
void *data);
+/* Add support for architecture specific output in /proc/pid/status */
+extern void arch_proc_pid_status(struct seq_file *m, struct task_struct *task);
 
 #else /* CONFIG_PROC_FS */
 
-- 
2.17.1
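
For reference, since arch_proc_pid_status() is declared __weak above, an
architecture opts in simply by providing a strong definition; patch 2/3 of
this series does exactly that for x86. A minimal sketch of the override
pattern follows; the "Arch_example" field name is purely hypothetical and
not part of this series.

#include <linux/sched.h>
#include <linux/seq_file.h>

/* Hypothetical override: emit one architecture-specific "Key:\tvalue" line. */
void arch_proc_pid_status(struct seq_file *m, struct task_struct *task)
{
        seq_printf(m, "Arch_example:\t%d\n", task_pid_nr(task));
}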



[PATCH v13 3/3] Documentation/filesystems/proc.txt: add AVX512_elapsed_ms

2019-02-23 Thread Aubrey Li
Added AVX512_elapsed_ms in /proc//status. Report it
in Documentation/filesystems/proc.txt

Signed-off-by: Aubrey Li 
Cc: Peter Zijlstra 
Cc: Andi Kleen 
Cc: Tim Chen 
Cc: Dave Hansen 
Cc: Arjan van de Ven 
---
 Documentation/filesystems/proc.txt | 29 -
 1 file changed, 28 insertions(+), 1 deletion(-)

diff --git a/Documentation/filesystems/proc.txt 
b/Documentation/filesystems/proc.txt
index 66cad5c86171..c4a9e48681ad 100644
--- a/Documentation/filesystems/proc.txt
+++ b/Documentation/filesystems/proc.txt
@@ -207,6 +207,7 @@ read the file /proc/PID/status:
   Speculation_Store_Bypass:   thread vulnerable
   voluntary_ctxt_switches:0
   nonvoluntary_ctxt_switches: 1
+  AVX512_elapsed_ms:   8
 
 This shows you nearly the same information you would get if you viewed it with
 the ps  command.  In  fact,  ps  uses  the  proc  file  system  to  obtain its
@@ -224,7 +225,7 @@ asynchronous manner and the value may not be very precise. 
To see a precise
 snapshot of a moment, you can see /proc//smaps file and scan page table.
 It's slow but very precise.
 
-Table 1-2: Contents of the status files (as of 4.19)
+Table 1-2: Contents of the status files (as of 5.1)
 ..
  Field   Content
  Namefilename of the executable
@@ -289,6 +290,32 @@ Table 1-2: Contents of the status files (as of 4.19)
  Mems_allowed_list   Same as previous, but in "list format"
  voluntary_ctxt_switches number of voluntary context switches
  nonvoluntary_ctxt_switches  number of non voluntary context switches
+ AVX512_elapsed_ms   time elapsed since last AVX512 usage recorded
+
+ AVX512_elapsed_ms:
+ --
+  If AVX512 is supported on the machine, this entry shows the milliseconds
+  elapsed since the last time AVX512 usage was recorded. The recording
+  happens on a best effort basis when a task is scheduled out. This means
+  that the value depends on two factors:
+
+1) The time which the task spent on the CPU without being scheduled
+   out. With CPU isolation and a single runnable task this can take
+   several seconds.
+
+2) The time since the task was scheduled out last. Depending on the
+   reason for being scheduled out (time slice exhausted, syscall ...)
+   this can be arbitrary long time.
+
+  As a consequence the value cannot be considered precise and authoritative
+  information. The application which uses this information has to be aware
+  of the overall scenario on the system in order to determine whether a
+  task is a real AVX512 user or not.
+
+  A special value of '-1' indicates that no AVX512 usage was recorded, thus
+  the task is unlikely an AVX512 user. But depending on the workload and the
+  scheduling scenario, it could also be a false negative as mentioned above.
+
 ..
 
 Table 1-3: Contents of the statm files (as of 2.6.8-rc3)
-- 
2.17.1
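
To make the guidance above concrete, a user-space consumer might poll the
field roughly as in the hedged sketch below: it reads AVX512_elapsed_ms from
/proc/<pid>/status and treats a small value as "recent AVX-512 user". The
classify_avx512_task() helper and the 5000 ms threshold are illustrative
assumptions, not part of the kernel interface.

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/*
 * Hedged sketch: returns 1 for "recent AVX-512 user", 0 for "not a user"
 * (field absent or '-1', i.e. no usage ever recorded), -1 on error.
 */
static int classify_avx512_task(int pid, long threshold_ms)
{
        char path[64], line[256];
        long elapsed_ms = -1;
        int found = 0;
        FILE *fp;

        snprintf(path, sizeof(path), "/proc/%d/status", pid);
        fp = fopen(path, "r");
        if (!fp)
                return -1;

        while (fgets(line, sizeof(line), fp)) {
                if (sscanf(line, "AVX512_elapsed_ms: %ld", &elapsed_ms) == 1) {
                        found = 1;
                        break;
                }
        }
        fclose(fp);

        /* Field missing (no AVX512F) or '-1': no recorded AVX-512 usage. */
        if (!found || elapsed_ms < 0)
                return 0;

        return elapsed_ms < threshold_ms;
}

int main(int argc, char **argv)
{
        int pid = (argc > 1) ? atoi(argv[1]) : getpid();

        printf("pid %d: avx512 user = %d\n", pid, classify_avx512_task(pid, 5000));
        return 0;
}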



[PATCH v13 2/3] x86,/proc/pid/status: Add AVX-512 usage elapsed time

2019-02-23 Thread Aubrey Li
Use of AVX-512 components could cause a core turbo frequency drop. So
it's useful to expose the AVX-512 usage elapsed time as a heuristic hint
for the user space job scheduler to cluster the AVX-512-using tasks
together.

Tensorflow example:
$ while [ 1 ]; do cat /proc/pid/status | grep AVX; sleep 1; done
AVX512_elapsed_ms:  4
AVX512_elapsed_ms:  8
AVX512_elapsed_ms:  4

This means that 4 milliseconds have elapsed since AVX512 usage of the
tensorflow task was last detected when the task was scheduled out.

Or:
$ cat /proc/pid/status | grep AVX512_elapsed_ms
AVX512_elapsed_ms:  -1

The number '-1' indicates that no AVX512 usage was recorded before,
thus the task is unlikely to have the frequency drop issue.

User space tools may want to further check by:

$ perf stat --pid  -e core_power.lvl2_turbo_license -- sleep 1

 Performance counter stats for process id '3558':

 3,251,565,961  core_power.lvl2_turbo_license

   1.004031387 seconds time elapsed

Non-zero counter value confirms that the task causes frequency drop.

Signed-off-by: Aubrey Li 
Cc: Peter Zijlstra 
Cc: Andi Kleen 
Cc: Tim Chen 
Cc: Dave Hansen 
Cc: Arjan van de Ven 
---
 arch/x86/kernel/fpu/xstate.c | 42 
 1 file changed, 42 insertions(+)

diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
index 9cc108456d0b..e480a535eeb2 100644
--- a/arch/x86/kernel/fpu/xstate.c
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -7,6 +7,8 @@
 #include 
 #include 
 #include 
+#include 
+#include 
 
 #include 
 #include 
@@ -1243,3 +1245,43 @@ int copy_user_to_xstate(struct xregs_state *xsave, const 
void __user *ubuf)
 
return 0;
 }
+
+/*
+ * Report the amount of time elapsed in millisecond since last AVX512
+ * use in the task.
+ */
+static void avx512_status(struct seq_file *m, struct task_struct *task)
+{
+   unsigned long timestamp = task->thread.fpu.avx512_timestamp;
+   long delta;
+
+   if (!timestamp) {
+   /*
+* Report -1 if no AVX512 usage
+*/
+   delta = -1;
+   } else {
+   delta = (long)(jiffies - timestamp);
+   /*
+* Cap to LONG_MAX if time difference > LONG_MAX
+*/
+   if (delta < 0)
+   delta = LONG_MAX;
+   delta = jiffies_to_msecs(delta);
+   }
+
+   seq_put_decimal_ll(m, "AVX512_elapsed_ms:\t", delta);
+   seq_putc(m, '\n');
+}
+
+/*
+ * Report architecture specific information
+ */
+void arch_proc_pid_status(struct seq_file *m, struct task_struct *task)
+{
+   /*
+* Report AVX512 state if the processor and build option supported.
+*/
+   if (cpu_feature_enabled(X86_FEATURE_AVX512F))
+   avx512_status(m, task);
+}
-- 
2.17.1
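
The reporting side above only reads thread.fpu.avx512_timestamp; per the
series' documentation, the timestamp is tagged with jiffies when AVX-512
usage is detected as the task is scheduled out. Below is a hedged sketch of
that recording step; the helper name and exactly where it would be called in
the FPU save path are assumptions, not something this patch adds.

#include <linux/jiffies.h>
#include <asm/fpu/types.h>

/*
 * Hedged sketch only: if any AVX-512 state component was in use when the
 * register state was saved, record the current jiffies in the per-task FPU
 * context that avx512_status() reads.  The call site is an assumption here.
 */
static inline void note_avx512_use(struct fpu *fpu)
{
        u64 avx512_mask = XFEATURE_MASK_ZMM_Hi256 | XFEATURE_MASK_Hi16_ZMM;

        if (fpu->state.xsave.header.xfeatures & avx512_mask)
                fpu->avx512_timestamp = jiffies;
}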



[PATCH v12 1/3] /proc/pid/status: Add support for architecture specific output

2019-02-20 Thread Aubrey Li
The architecture-specific information of running processes could be
useful to userland. Add support to examine process architecture-specific
information externally.

Signed-off-by: Aubrey Li 
Cc: Peter Zijlstra 
Cc: Andi Kleen 
Cc: Tim Chen 
Cc: Dave Hansen 
Cc: Arjan van de Ven 
---
 fs/proc/array.c | 5 +
 include/linux/proc_fs.h | 2 ++
 2 files changed, 7 insertions(+)

diff --git a/fs/proc/array.c b/fs/proc/array.c
index 9d428d5a0ac8..ea7a981f289c 100644
--- a/fs/proc/array.c
+++ b/fs/proc/array.c
@@ -401,6 +401,10 @@ static inline void task_thp_status(struct seq_file *m, 
struct mm_struct *mm)
seq_printf(m, "THP_enabled:\t%d\n", thp_enabled);
 }
 
+void __weak arch_proc_pid_status(struct seq_file *m, struct task_struct *task)
+{
+}
+
 int proc_pid_status(struct seq_file *m, struct pid_namespace *ns,
struct pid *pid, struct task_struct *task)
 {
@@ -424,6 +428,7 @@ int proc_pid_status(struct seq_file *m, struct 
pid_namespace *ns,
task_cpus_allowed(m, task);
cpuset_task_status_allowed(m, task);
task_context_switch_counts(m, task);
+   arch_proc_pid_status(m, task);
return 0;
 }
 
diff --git a/include/linux/proc_fs.h b/include/linux/proc_fs.h
index d0e1f1522a78..1de9ba1b064f 100644
--- a/include/linux/proc_fs.h
+++ b/include/linux/proc_fs.h
@@ -73,6 +73,8 @@ struct proc_dir_entry *proc_create_net_single_write(const 
char *name, umode_t mo
int (*show)(struct seq_file 
*, void *),
proc_write_t write,
void *data);
+/* Add support for architecture specific output in /proc/pid/status */
+extern void arch_proc_pid_status(struct seq_file *m, struct task_struct *task);
 
 #else /* CONFIG_PROC_FS */
 
-- 
2.17.1



[PATCH v12 3/3] Documentation/filesystems/proc.txt: add AVX512_elapsed_ms

2019-02-20 Thread Aubrey Li
Added AVX512_elapsed_ms in /proc//status. Report it
in Documentation/filesystems/proc.txt

Signed-off-by: Aubrey Li 
Cc: Peter Zijlstra 
Cc: Andi Kleen 
Cc: Tim Chen 
Cc: Dave Hansen 
Cc: Arjan van de Ven 
---
 Documentation/filesystems/proc.txt | 28 +++-
 1 file changed, 27 insertions(+), 1 deletion(-)

diff --git a/Documentation/filesystems/proc.txt 
b/Documentation/filesystems/proc.txt
index 66cad5c86171..425f2f09c9aa 100644
--- a/Documentation/filesystems/proc.txt
+++ b/Documentation/filesystems/proc.txt
@@ -45,6 +45,7 @@ Table of Contents
   3.9   /proc//map_files - Information about memory mapped files
   3.10  /proc//timerslack_ns - Task timerslack value
   3.11 /proc//patch_state - Livepatch patch operation state
+  3.12 /proc//AVX512_elapsed_ms - time elapsed since last AVX512 use
 
   4Configuring procfs
   4.1  Mount options
@@ -207,6 +208,7 @@ read the file /proc/PID/status:
   Speculation_Store_Bypass:   thread vulnerable
   voluntary_ctxt_switches:0
   nonvoluntary_ctxt_switches: 1
+  AVX512_elapsed_ms:   8
 
 This shows you nearly the same information you would get if you viewed it with
 the ps  command.  In  fact,  ps  uses  the  proc  file  system  to  obtain its
@@ -224,7 +226,7 @@ asynchronous manner and the value may not be very precise. 
To see a precise
 snapshot of a moment, you can see /proc//smaps file and scan page table.
 It's slow but very precise.
 
-Table 1-2: Contents of the status files (as of 4.19)
+Table 1-2: Contents of the status files (as of 5.1)
 ..
  Field   Content
  Namefilename of the executable
@@ -289,6 +291,7 @@ Table 1-2: Contents of the status files (as of 4.19)
  Mems_allowed_list   Same as previous, but in "list format"
  voluntary_ctxt_switches number of voluntary context switches
  nonvoluntary_ctxt_switches  number of non voluntary context switches
+ AVX512_elapsed_ms   time elapsed since last AVX512 use in millisecond
 ..
 
 Table 1-3: Contents of the statm files (as of 2.6.8-rc3)
@@ -1948,6 +1951,29 @@ patched.  If the patch is being enabled, then the task 
has already been
 patched.  If the patch is being disabled, then the task hasn't been
 unpatched yet.
 
+3.12   /proc//AVX512_elapsed_ms - time elapsed since last AVX512 use
+--
+If AVX512 is supported on the machine, this file displays the time elapsed
+since the last AVX512 usage of the task in milliseconds.
+
+The per-task AVX512 usage tracking mechanism is added during context switch.
+When the task is scheduled out, the AVX512 timestamp of the task is tagged
+by jiffies if AVX512 usage is detected.
+
+When this interface is queried, AVX512_elapsed_ms is calculated as follows:
+
+   delta = (long)(jiffies_now - AVX512_timestamp);
+   AVX512_elapsed_ms = jiffies_to_msecs(delta);
+
+Because this tracking mechanism depends on context switch, the value of
+AVX512_elapsed_ms could be inaccurate if the AVX512-using task runs alone on
+a CPU and is not scheduled out for a long time. An extreme experiment with a
+task spinning on AVX512 ops on an isolated CPU shows that the longest elapsed
+time is still close to 4 seconds (HZ = 250).
+
+So 5s or even longer is an appropriate threshold for the job scheduler to poll
+and decide if the task should be classified as an AVX512 task and migrated
+away from the core on which a non-AVX512 task is running.
 
 --
 Configuring procfs
-- 
2.17.1



[PATCH v12 2/3] x86,/proc/pid/status: Add AVX-512 usage elapsed time

2019-02-20 Thread Aubrey Li
Use of AVX-512 components could cause a core turbo frequency drop. So
it's useful to expose the AVX-512 usage elapsed time as a heuristic hint
for the user space job scheduler to cluster the AVX-512-using tasks
together.

Tensorflow example:
$ while [ 1 ]; do cat /proc/pid/status | grep AVX; sleep 1; done
AVX512_elapsed_ms:  4
AVX512_elapsed_ms:  8
AVX512_elapsed_ms:  4

This means that 4 milliseconds have elapsed since AVX512 usage of the
tensorflow task was last detected when the task was scheduled out.

Or:
$ cat /proc/pid/status | grep AVX512_elapsed_ms
AVX512_elapsed_ms:  -1

The number '-1' indicates the task didn't use AVX-512 components
before, thus it is unlikely to have the frequency drop issue.

User space tools may want to further check by:

$ perf stat --pid  -e core_power.lvl2_turbo_license -- sleep 1

 Performance counter stats for process id '3558':

 3,251,565,961  core_power.lvl2_turbo_license

   1.004031387 seconds time elapsed

Non-zero counter value confirms that the task causes frequency drop.

Signed-off-by: Aubrey Li 
Cc: Peter Zijlstra 
Cc: Andi Kleen 
Cc: Tim Chen 
Cc: Dave Hansen 
Cc: Arjan van de Ven 
---
 arch/x86/kernel/fpu/xstate.c | 42 
 1 file changed, 42 insertions(+)

diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
index 9cc108456d0b..e480a535eeb2 100644
--- a/arch/x86/kernel/fpu/xstate.c
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -7,6 +7,8 @@
 #include 
 #include 
 #include 
+#include 
+#include 
 
 #include 
 #include 
@@ -1243,3 +1245,43 @@ int copy_user_to_xstate(struct xregs_state *xsave, const 
void __user *ubuf)
 
return 0;
 }
+
+/*
+ * Report the amount of time elapsed in millisecond since last AVX512
+ * use in the task.
+ */
+static void avx512_status(struct seq_file *m, struct task_struct *task)
+{
+   unsigned long timestamp = task->thread.fpu.avx512_timestamp;
+   long delta;
+
+   if (!timestamp) {
+   /*
+* Report -1 if no AVX512 usage
+*/
+   delta = -1;
+   } else {
+   delta = (long)(jiffies - timestamp);
+   /*
+* Cap to LONG_MAX if time difference > LONG_MAX
+*/
+   if (delta < 0)
+   delta = LONG_MAX;
+   delta = jiffies_to_msecs(delta);
+   }
+
+   seq_put_decimal_ll(m, "AVX512_elapsed_ms:\t", delta);
+   seq_putc(m, '\n');
+}
+
+/*
+ * Report architecture specific information
+ */
+void arch_proc_pid_status(struct seq_file *m, struct task_struct *task)
+{
+   /*
+* Report AVX512 state if the processor and build option supported.
+*/
+   if (cpu_feature_enabled(X86_FEATURE_AVX512F))
+   avx512_status(m, task);
+}
-- 
2.17.1



[PATCH v11 1/3] /proc/pid/status: Add support for architecture specific output

2019-02-12 Thread Aubrey Li
The architecture-specific information of running processes could be
useful to userland. Add support to examine process architecture-specific
information externally.

Signed-off-by: Aubrey Li 
Cc: Peter Zijlstra 
Cc: Andi Kleen 
Cc: Tim Chen 
Cc: Dave Hansen 
Cc: Arjan van de Ven 
---
 fs/proc/array.c | 5 +
 include/linux/proc_fs.h | 2 ++
 2 files changed, 7 insertions(+)

diff --git a/fs/proc/array.c b/fs/proc/array.c
index 9d428d5a0ac8..ea7a981f289c 100644
--- a/fs/proc/array.c
+++ b/fs/proc/array.c
@@ -401,6 +401,10 @@ static inline void task_thp_status(struct seq_file *m, 
struct mm_struct *mm)
seq_printf(m, "THP_enabled:\t%d\n", thp_enabled);
 }
 
+void __weak arch_proc_pid_status(struct seq_file *m, struct task_struct *task)
+{
+}
+
 int proc_pid_status(struct seq_file *m, struct pid_namespace *ns,
struct pid *pid, struct task_struct *task)
 {
@@ -424,6 +428,7 @@ int proc_pid_status(struct seq_file *m, struct 
pid_namespace *ns,
task_cpus_allowed(m, task);
cpuset_task_status_allowed(m, task);
task_context_switch_counts(m, task);
+   arch_proc_pid_status(m, task);
return 0;
 }
 
diff --git a/include/linux/proc_fs.h b/include/linux/proc_fs.h
index d0e1f1522a78..1de9ba1b064f 100644
--- a/include/linux/proc_fs.h
+++ b/include/linux/proc_fs.h
@@ -73,6 +73,8 @@ struct proc_dir_entry *proc_create_net_single_write(const 
char *name, umode_t mo
int (*show)(struct seq_file 
*, void *),
proc_write_t write,
void *data);
+/* Add support for architecture specific output in /proc/pid/status */
+extern void arch_proc_pid_status(struct seq_file *m, struct task_struct *task);
 
 #else /* CONFIG_PROC_FS */
 
-- 
2.17.1



[PATCH v11 3/3] Documentation/filesystems/proc.txt: add AVX512_elapsed_ms

2019-02-12 Thread Aubrey Li
Added AVX512_elapsed_ms in /proc//status. Report it
in Documentation/filesystems/proc.txt

Signed-off-by: Aubrey Li 
Cc: Peter Zijlstra 
Cc: Andi Kleen 
Cc: Tim Chen 
Cc: Dave Hansen 
Cc: Arjan van de Ven 
---
 Documentation/filesystems/proc.txt | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/Documentation/filesystems/proc.txt 
b/Documentation/filesystems/proc.txt
index 66cad5c86171..8da60ddcda7f 100644
--- a/Documentation/filesystems/proc.txt
+++ b/Documentation/filesystems/proc.txt
@@ -207,6 +207,7 @@ read the file /proc/PID/status:
   Speculation_Store_Bypass:   thread vulnerable
   voluntary_ctxt_switches:0
   nonvoluntary_ctxt_switches: 1
+  AVX512_elapsed_ms:   1020
 
 This shows you nearly the same information you would get if you viewed it with
 the ps  command.  In  fact,  ps  uses  the  proc  file  system  to  obtain its
@@ -224,7 +225,7 @@ asynchronous manner and the value may not be very precise. 
To see a precise
 snapshot of a moment, you can see /proc//smaps file and scan page table.
 It's slow but very precise.
 
-Table 1-2: Contents of the status files (as of 4.19)
+Table 1-2: Contents of the status files (as of 5.1)
 ..
  Field   Content
  Namefilename of the executable
@@ -289,6 +290,7 @@ Table 1-2: Contents of the status files (as of 4.19)
  Mems_allowed_list   Same as previous, but in "list format"
  voluntary_ctxt_switches number of voluntary context switches
  nonvoluntary_ctxt_switches  number of non voluntary context switches
+ AVX512_elapsed_ms   time elapsed since last AVX512 use in millisecond
 ..
 
 Table 1-3: Contents of the statm files (as of 2.6.8-rc3)
-- 
2.17.1



[PATCH v11 2/3] x86,/proc/pid/status: Add AVX-512 usage elapsed time

2019-02-12 Thread Aubrey Li
Use of AVX-512 components could cause a core turbo frequency drop. So
it's useful to expose the AVX-512 usage elapsed time as a heuristic hint
for the user space job scheduler to cluster the AVX-512-using tasks
together.

Example:
$ cat /proc/pid/status | grep AVX512_elapsed_ms
AVX512_elapsed_ms:  1020

The number '1020' denotes that 1020 milliseconds have elapsed since the
last context switch at which the (now off-CPU) task was found using
AVX-512 components, thus the task could cause a core frequency drop.

Or:
$ cat /proc/pid/status | grep AVX512_elapsed_ms
AVX512_elapsed_ms:  -1

The number '-1' indicates the task didn't use AVX-512 components
before, thus it is unlikely to have the frequency drop issue.

User space tools may want to further check by:

$ perf stat --pid  -e core_power.lvl2_turbo_license -- sleep 1

 Performance counter stats for process id '3558':

 3,251,565,961  core_power.lvl2_turbo_license

   1.004031387 seconds time elapsed

Non-zero counter value confirms that the task causes frequency drop.

Signed-off-by: Aubrey Li 
Cc: Peter Zijlstra 
Cc: Andi Kleen 
Cc: Tim Chen 
Cc: Dave Hansen 
Cc: Arjan van de Ven 
---
 arch/x86/kernel/fpu/xstate.c | 42 
 1 file changed, 42 insertions(+)

diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
index 9cc108456d0b..e480a535eeb2 100644
--- a/arch/x86/kernel/fpu/xstate.c
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -7,6 +7,8 @@
 #include 
 #include 
 #include 
+#include 
+#include 
 
 #include 
 #include 
@@ -1243,3 +1245,43 @@ int copy_user_to_xstate(struct xregs_state *xsave, const 
void __user *ubuf)
 
return 0;
 }
+
+/*
+ * Report the amount of time elapsed in millisecond since last AVX512
+ * use in the task.
+ */
+static void avx512_status(struct seq_file *m, struct task_struct *task)
+{
+   unsigned long timestamp = task->thread.fpu.avx512_timestamp;
+   long delta;
+
+   if (!timestamp) {
+   /*
+* Report -1 if no AVX512 usage
+*/
+   delta = -1;
+   } else {
+   delta = (long)(jiffies - timestamp);
+   /*
+* Cap to LONG_MAX if time difference > LONG_MAX
+*/
+   if (delta < 0)
+   delta = LONG_MAX;
+   delta = jiffies_to_msecs(delta);
+   }
+
+   seq_put_decimal_ll(m, "AVX512_elapsed_ms:\t", delta);
+   seq_putc(m, '\n');
+}
+
+/*
+ * Report architecture specific information
+ */
+void arch_proc_pid_status(struct seq_file *m, struct task_struct *task)
+{
+   /*
+* Report AVX512 state if the processor and build option supported.
+*/
+   if (cpu_feature_enabled(X86_FEATURE_AVX512F))
+   avx512_status(m, task);
+}
-- 
2.17.1



[PATCH v10 2/3] x86,/proc/pid/status: Add AVX-512 usage elapsed time

2019-02-12 Thread Aubrey Li
Use of AVX-512 components could cause a core turbo frequency drop. So
it's useful to expose the AVX-512 usage elapsed time as a heuristic hint
for the user space job scheduler to cluster the AVX-512-using tasks
together.

Example:
$ cat /proc/pid/status | grep AVX512_elapsed_ms
AVX512_elapsed_ms:  1020

The number '1020' denotes that 1020 milliseconds have elapsed since the
last context switch at which the (now off-CPU) task was found using
AVX-512 components, thus the task could cause a core frequency drop.

Or:
$ cat /proc/pid/status | grep AVX512_elapsed_ms
AVX512_elapsed_ms:  -1

The number '-1' indicates the task didn't use AVX-512 components
before, thus it is unlikely to have the frequency drop issue.

User space tools may want to further check by:

$ perf stat --pid  -e core_power.lvl2_turbo_license -- sleep 1

 Performance counter stats for process id '3558':

 3,251,565,961  core_power.lvl2_turbo_license

   1.004031387 seconds time elapsed

Non-zero counter value confirms that the task causes frequency drop.

Signed-off-by: Aubrey Li 
Cc: Peter Zijlstra 
Cc: Andi Kleen 
Cc: Tim Chen 
Cc: Dave Hansen 
Cc: Arjan van de Ven 
---
 arch/x86/kernel/fpu/xstate.c | 41 
 1 file changed, 41 insertions(+)

diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
index 9cc108456d0b..c42a233f26f1 100644
--- a/arch/x86/kernel/fpu/xstate.c
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -7,6 +7,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -1243,3 +1244,43 @@ int copy_user_to_xstate(struct xregs_state *xsave, const 
void __user *ubuf)
 
return 0;
 }
+
+/*
+ * Report the amount of time elapsed in millisecond since last AVX512
+ * use in the task.
+ */
+void avx512_status(struct seq_file *m, struct task_struct *task)
+{
+   unsigned long timestamp = task->thread.fpu.avx512_timestamp;
+   long delta;
+
+   if (!timestamp) {
+   /*
+* Report -1 if no AVX512 usage
+*/
+   delta = -1;
+   } else {
+   delta = (long)(jiffies - timestamp);
+   /*
+* Cap to LONG_MAX if time difference > LONG_MAX
+*/
+   if (delta < 0)
+   delta = LONG_MAX;
+   delta = jiffies_to_msecs(delta);
+   }
+
+   seq_put_decimal_ll(m, "AVX512_elapsed_ms:\t", delta);
+   seq_putc(m, '\n');
+}
+
+/*
+ * Report architecture specific information
+ */
+void arch_proc_pid_status(struct seq_file *m, struct task_struct *task)
+{
+   /*
+* Report AVX512 state if the processor and build option supported.
+*/
+   if (cpu_feature_enabled(X86_FEATURE_AVX512F))
+   avx512_status(m, task);
+}
-- 
2.17.1



[PATCH v10 3/3] Documentation/filesystems/proc.txt: add AVX512_elapsed_ms

2019-02-12 Thread Aubrey Li
Added AVX512_elapsed_ms in /proc//status. Report it
in Documentation/filesystems/proc.txt

Signed-off-by: Aubrey Li 
Cc: Peter Zijlstra 
Cc: Andi Kleen 
Cc: Tim Chen 
Cc: Dave Hansen 
Cc: Arjan van de Ven 
---
 Documentation/filesystems/proc.txt | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/Documentation/filesystems/proc.txt 
b/Documentation/filesystems/proc.txt
index 66cad5c86171..8da60ddcda7f 100644
--- a/Documentation/filesystems/proc.txt
+++ b/Documentation/filesystems/proc.txt
@@ -207,6 +207,7 @@ read the file /proc/PID/status:
   Speculation_Store_Bypass:   thread vulnerable
   voluntary_ctxt_switches:0
   nonvoluntary_ctxt_switches: 1
+  AVX512_elapsed_ms:   1020
 
 This shows you nearly the same information you would get if you viewed it with
 the ps  command.  In  fact,  ps  uses  the  proc  file  system  to  obtain its
@@ -224,7 +225,7 @@ asynchronous manner and the value may not be very precise. 
To see a precise
 snapshot of a moment, you can see /proc//smaps file and scan page table.
 It's slow but very precise.
 
-Table 1-2: Contents of the status files (as of 4.19)
+Table 1-2: Contents of the status files (as of 5.1)
 ..
  Field   Content
  Namefilename of the executable
@@ -289,6 +290,7 @@ Table 1-2: Contents of the status files (as of 4.19)
  Mems_allowed_list   Same as previous, but in "list format"
  voluntary_ctxt_switches number of voluntary context switches
  nonvoluntary_ctxt_switches  number of non voluntary context switches
+ AVX512_elapsed_ms   time elapsed since last AVX512 use in millisecond
 ..
 
 Table 1-3: Contents of the statm files (as of 2.6.8-rc3)
-- 
2.17.1



[PATCH v10 1/3] /proc/pid/status: Add support for architecture specific output

2019-02-12 Thread Aubrey Li
The architecture-specific information of running processes could be
useful to userland. Add support to examine process architecture-specific
information externally.

Signed-off-by: Aubrey Li 
Cc: Peter Zijlstra 
Cc: Andi Kleen 
Cc: Tim Chen 
Cc: Dave Hansen 
Cc: Arjan van de Ven 
---
 fs/proc/array.c | 5 +
 include/linux/proc_fs.h | 2 ++
 2 files changed, 7 insertions(+)

diff --git a/fs/proc/array.c b/fs/proc/array.c
index 9d428d5a0ac8..ea7a981f289c 100644
--- a/fs/proc/array.c
+++ b/fs/proc/array.c
@@ -401,6 +401,10 @@ static inline void task_thp_status(struct seq_file *m, 
struct mm_struct *mm)
seq_printf(m, "THP_enabled:\t%d\n", thp_enabled);
 }
 
+void __weak arch_proc_pid_status(struct seq_file *m, struct task_struct *task)
+{
+}
+
 int proc_pid_status(struct seq_file *m, struct pid_namespace *ns,
struct pid *pid, struct task_struct *task)
 {
@@ -424,6 +428,7 @@ int proc_pid_status(struct seq_file *m, struct 
pid_namespace *ns,
task_cpus_allowed(m, task);
cpuset_task_status_allowed(m, task);
task_context_switch_counts(m, task);
+   arch_proc_pid_status(m, task);
return 0;
 }
 
diff --git a/include/linux/proc_fs.h b/include/linux/proc_fs.h
index d0e1f1522a78..1de9ba1b064f 100644
--- a/include/linux/proc_fs.h
+++ b/include/linux/proc_fs.h
@@ -73,6 +73,8 @@ struct proc_dir_entry *proc_create_net_single_write(const 
char *name, umode_t mo
int (*show)(struct seq_file 
*, void *),
proc_write_t write,
void *data);
+/* Add support for architecture specific output in /proc/pid/status */
+extern void arch_proc_pid_status(struct seq_file *m, struct task_struct *task);
 
 #else /* CONFIG_PROC_FS */
 
-- 
2.17.1



[PATCH v9 3/3] Documentation/filesystems/proc.txt: add AVX512_elapsed_ms

2019-02-11 Thread Aubrey Li
Added AVX512_elapsed_ms in /proc//status. Report it
in Documentation/filesystems/proc.txt

Signed-off-by: Aubrey Li 
Cc: Peter Zijlstra 
Cc: Andi Kleen 
Cc: Tim Chen 
Cc: Dave Hansen 
Cc: Arjan van de Ven 
---
 Documentation/filesystems/proc.txt | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/Documentation/filesystems/proc.txt 
b/Documentation/filesystems/proc.txt
index 520f6a84cf50..8275ccd766e4 100644
--- a/Documentation/filesystems/proc.txt
+++ b/Documentation/filesystems/proc.txt
@@ -197,6 +197,7 @@ read the file /proc/PID/status:
   Seccomp:0
   voluntary_ctxt_switches:0
   nonvoluntary_ctxt_switches: 1
+  AVX512_elapsed_ms:   1020
 
 This shows you nearly the same information you would get if you viewed it with
 the ps  command.  In  fact,  ps  uses  the  proc  file  system  to  obtain its
@@ -214,7 +215,7 @@ asynchronous manner and the value may not be very precise. 
To see a precise
 snapshot of a moment, you can see /proc//smaps file and scan page table.
 It's slow but very precise.
 
-Table 1-2: Contents of the status files (as of 4.8)
+Table 1-2: Contents of the status files (as of 5.1)
 ..
  Field   Content
  Namefilename of the executable
@@ -275,6 +276,7 @@ Table 1-2: Contents of the status files (as of 4.8)
  Mems_allowed_list   Same as previous, but in "list format"
  voluntary_ctxt_switches number of voluntary context switches
  nonvoluntary_ctxt_switches  number of non voluntary context switches
+ AVX512_elapsed_ms   time elapsed since last AVX512 use in millisecond
 ..
 
 Table 1-3: Contents of the statm files (as of 2.6.8-rc3)
-- 
2.17.1


