[tip: perf/core] x86/cpufeatures: Enumerate Intel Hybrid Technology feature bit

2021-04-20 Thread tip-bot2 for Ricardo Neri
The following commit has been merged into the perf/core branch of tip:

Commit-ID: a161545ab53b174c016b0eb63c289525d2f6
Gitweb:
https://git.kernel.org/tip/a161545ab53b174c016b0eb63c289525d2f6
Author:Ricardo Neri 
AuthorDate:Mon, 12 Apr 2021 07:30:41 -07:00
Committer: Peter Zijlstra 
CommitterDate: Mon, 19 Apr 2021 20:03:23 +02:00

x86/cpufeatures: Enumerate Intel Hybrid Technology feature bit

Add feature enumeration to identify a processor with Intel Hybrid
Technology: one in which CPUs of more than one type are in the same package.
On a hybrid processor, all CPUs support the same homogeneous (i.e.,
symmetric) instruction set. All CPUs enumerate the same features in CPUID.
Thus, software (user space and kernel) can run and migrate to any CPU in
the system as well as utilize any of the enumerated features without any
change or special provisions. The main differences among CPUs in a hybrid
processor are their power and performance properties.

Signed-off-by: Ricardo Neri 
Signed-off-by: Peter Zijlstra (Intel) 
Reviewed-by: Tony Luck 
Reviewed-by: Len Brown 
Acked-by: Borislav Petkov 
Link: 
https://lkml.kernel.org/r/1618237865-33448-2-git-send-email-kan.li...@linux.intel.com
---
 arch/x86/include/asm/cpufeatures.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
index cc96e26..1ba4a6e 100644
--- a/arch/x86/include/asm/cpufeatures.h
+++ b/arch/x86/include/asm/cpufeatures.h
@@ -374,6 +374,7 @@
 #define X86_FEATURE_MD_CLEAR   (18*32+10) /* VERW clears CPU buffers */
 #define X86_FEATURE_TSX_FORCE_ABORT(18*32+13) /* "" TSX_FORCE_ABORT */
 #define X86_FEATURE_SERIALIZE  (18*32+14) /* SERIALIZE instruction */
+#define X86_FEATURE_HYBRID_CPU (18*32+15) /* "" This part has CPUs of more than one type */
 #define X86_FEATURE_TSXLDTRK   (18*32+16) /* TSX Suspend Load Address Tracking */
 #define X86_FEATURE_PCONFIG(18*32+18) /* Intel PCONFIG */
 #define X86_FEATURE_ARCH_LBR   (18*32+19) /* Intel ARCH LBR */
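
For reference, word 18 in cpufeatures.h maps to CPUID.(EAX=07H,ECX=0):EDX, so the new flag corresponds to EDX bit 15 of leaf 7. A minimal user-space probe of that bit (not part of the patch; it assumes a GCC/Clang toolchain that provides <cpuid.h>) could look like this:

#include <cpuid.h>
#include <stdio.h>

int main(void)
{
	unsigned int eax, ebx, ecx, edx;

	/* CPUID leaf 7, subleaf 0: structured extended feature flags. */
	if (!__get_cpuid_count(0x7, 0, &eax, &ebx, &ecx, &edx))
		return 1;

	/* EDX bit 15 is the Hybrid bit enumerated by this patch. */
	printf("hybrid: %s\n", (edx & (1u << 15)) ? "yes" : "no");
	return 0;
}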


[tip: perf/core] x86/cpu: Add helper function to get the type of the current hybrid CPU

2021-04-20 Thread tip-bot2 for Ricardo Neri
The following commit has been merged into the perf/core branch of tip:

Commit-ID: 250b3c0d79d1f4a55e54d8a9ef48058660483fef
Gitweb:
https://git.kernel.org/tip/250b3c0d79d1f4a55e54d8a9ef48058660483fef
Author:Ricardo Neri 
AuthorDate:Mon, 12 Apr 2021 07:30:42 -07:00
Committer: Peter Zijlstra 
CommitterDate: Mon, 19 Apr 2021 20:03:23 +02:00

x86/cpu: Add helper function to get the type of the current hybrid CPU

On processors with Intel Hybrid Technology (i.e., one having more than
one type of CPU in the same package), all CPUs support the same
instruction set and enumerate the same features in CPUID. Thus, all
software can run on any CPU without restrictions. However, there may be
model-specific differences among types of CPUs. For instance, each type
of CPU may support a different number of performance counters. Also,
machine check error banks may be wired differently. Even though most
software will not care about these differences, kernel subsystems
dealing with these differences must know.

Add and expose a new helper function get_this_hybrid_cpu_type() to query
the type of the current hybrid CPU. The function will be used later in
the perf subsystem.

The Intel Software Developer's Manual defines the CPU type as an 8-bit
identifier.

Signed-off-by: Ricardo Neri 
Signed-off-by: Peter Zijlstra (Intel) 
Reviewed-by: Tony Luck 
Reviewed-by: Len Brown 
Acked-by: Borislav Petkov 
Link: 
https://lkml.kernel.org/r/1618237865-33448-3-git-send-email-kan.li...@linux.intel.com
---
 arch/x86/include/asm/cpu.h  |  6 ++
 arch/x86/kernel/cpu/intel.c | 16 
 2 files changed, 22 insertions(+)

diff --git a/arch/x86/include/asm/cpu.h b/arch/x86/include/asm/cpu.h
index da78ccb..610905d 100644
--- a/arch/x86/include/asm/cpu.h
+++ b/arch/x86/include/asm/cpu.h
@@ -45,6 +45,7 @@ extern void __init cpu_set_core_cap_bits(struct cpuinfo_x86 *c);
 extern void switch_to_sld(unsigned long tifn);
 extern bool handle_user_split_lock(struct pt_regs *regs, long error_code);
 extern bool handle_guest_split_lock(unsigned long ip);
+u8 get_this_hybrid_cpu_type(void);
 #else
 static inline void __init cpu_set_core_cap_bits(struct cpuinfo_x86 *c) {}
 static inline void switch_to_sld(unsigned long tifn) {}
@@ -57,6 +58,11 @@ static inline bool handle_guest_split_lock(unsigned long ip)
 {
return false;
 }
+
+static inline u8 get_this_hybrid_cpu_type(void)
+{
+   return 0;
+}
 #endif
 #ifdef CONFIG_IA32_FEAT_CTL
 void init_ia32_feat_ctl(struct cpuinfo_x86 *c);
diff --git a/arch/x86/kernel/cpu/intel.c b/arch/x86/kernel/cpu/intel.c
index 0e422a5..26fb626 100644
--- a/arch/x86/kernel/cpu/intel.c
+++ b/arch/x86/kernel/cpu/intel.c
@@ -1195,3 +1195,19 @@ void __init cpu_set_core_cap_bits(struct cpuinfo_x86 *c)
cpu_model_supports_sld = true;
split_lock_setup();
 }
+
+#define X86_HYBRID_CPU_TYPE_ID_SHIFT   24
+
+/**
+ * get_this_hybrid_cpu_type() - Get the type of this hybrid CPU
+ *
+ * Returns the CPU type [31:24] (i.e., Atom or Core) of a CPU in
+ * a hybrid processor. If the processor is not hybrid, returns 0.
+ */
+u8 get_this_hybrid_cpu_type(void)
+{
+   if (!cpu_feature_enabled(X86_FEATURE_HYBRID_CPU))
+   return 0;
+
+   return cpuid_eax(0x001a) >> X86_HYBRID_CPU_TYPE_ID_SHIFT;
+}
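
A hedged sketch of how a kernel-side consumer (e.g., the perf code this series targets) might branch on the returned type. The INTEL_CORE_TYPE_* names are made up for the example; the 0x20/0x40 values are the core types the Intel SDM documents for CPUID leaf 0x1A and are not defined by this patch:

/* Hypothetical names for this example; values per the Intel SDM, leaf 0x1A. */
#define INTEL_CORE_TYPE_ATOM	0x20
#define INTEL_CORE_TYPE_CORE	0x40

static void example_setup_this_cpu(void)
{
	switch (get_this_hybrid_cpu_type()) {
	case INTEL_CORE_TYPE_ATOM:
		/* e.g., select the Atom PMU parameters */
		break;
	case INTEL_CORE_TYPE_CORE:
		/* e.g., select the Core PMU parameters */
		break;
	default:
		/* 0: not a hybrid processor, keep the common defaults */
		break;
	}
}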


[PATCH v2 3/4] sched/fair: Consider SMT in ASYM_PACKING load balance

2021-04-13 Thread Ricardo Neri
When deciding to pull tasks in ASYM_PACKING, it is necessary not only to
check for the idle state of the CPU doing the load balancing, but also of
its SMT siblings.

If the CPU doing the balancing is idle but its SMT siblings are busy,
performance suffers if it pulls tasks from a medium priority CPU that does
not have SMT siblings. The decision to label a group for asymmetric packing
balancing is done in update_sg_lb_stats(). However, for SMT, that code does
not account for idle SMT siblings.

Implement asym_can_pull_tasks() to revisit the early decision on whether
the CPU doing the balance can pull tasks once the needed information is
available. arch_sched_asym_prefer_early() and
arch_asym_check_smt_siblings() are used to conserve the legacy behavior.

Cc: Aubrey Li 
Cc: Ben Segall 
Cc: Daniel Bristot de Oliveira 
Cc: Dietmar Eggemann 
Cc: Joel Fernandes (Google) 
Cc: Mel Gorman 
Cc: Quentin Perret 
Cc: Srinivas Pandruvada 
Cc: Steven Rostedt 
Cc: Tim Chen 
Reviewed-by: Len Brown 
Signed-off-by: Ricardo Neri 
---
Changes since v1:
 * Don't bailout in update_sd_pick_busiest() if dst_cpu cannot pull
   tasks. Instead, reclassify the candidate busiest group, as it
   may still be selected. (PeterZ)
 * Avoid an expensive and unnecessary call to cpumask_weight() when
   determining if a sched_group is comprised of SMT siblings.
   (PeterZ).
---
 include/linux/sched/topology.h |   1 +
 kernel/sched/fair.c| 127 +
 2 files changed, 128 insertions(+)

diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index 663b98959305..6487953b24e8 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -58,6 +58,7 @@ static inline int cpu_numa_flags(void)
 
 extern int arch_asym_cpu_priority(int cpu);
 extern bool arch_sched_asym_prefer_early(int a, int b);
+extern bool arch_asym_check_smt_siblings(void);
 
 struct sched_domain_attr {
int relax_domain_level;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e74da853b046..2a33f93646b2 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -115,6 +115,14 @@ bool __weak arch_sched_asym_prefer_early(int a, int b)
return sched_asym_prefer(a, b);
 }
 
+/*
+ * For asym packing, first check the state of SMT siblings before deciding to
+ * pull tasks.
+ */
+bool __weak arch_asym_check_smt_siblings(void)
+{
+   return false;
+}
 /*
  * The margin used when comparing utilization with CPU capacity.
  *
@@ -8483,6 +8491,107 @@ static inline void update_sg_lb_stats(struct lb_env *env,
sgs->group_capacity;
 }
 
+static bool cpu_group_is_smt(int cpu, struct sched_group *sg)
+{
+#ifdef CONFIG_SCHED_SMT
+   if (!static_branch_likely(&sched_smt_present))
+   return false;
+
+   if (sg->group_weight == 1)
+   return false;
+
+   return cpumask_equal(sched_group_span(sg), cpu_smt_mask(cpu));
+#else
+   return false;
+#endif
+}
+
+/**
+ * asym_can_pull_tasks - Check whether the load balancing CPU can pull tasks
+ * @dst_cpu:   CPU doing the load balancing
+ * @sds:   Load-balancing data with statistics of the local group
+ * @sgs:   Load-balancing statistics of the candidate busiest group
+ * @sg:    The candidate busiest group
+ *
+ * Check the state of the SMT siblings of both @sds::local and @sg and decide
+ * if @dst_cpu can pull tasks. If @dst_cpu does not have SMT siblings, it can
+ * pull tasks if two or more of the SMT siblings of @sg are busy. If only one
+ * CPU in @sg is busy, pull tasks only if @dst_cpu has higher priority.
+ *
+ * If both @dst_cpu and @sg have SMT siblings, even out the number of busy
+ * CPUs between @sds::local and @sg. Thus, pull tasks from @sg if the
+ * difference in the number of busy CPUs is 2 or more. If the difference is 1,
+ * only pull if @dst_cpu has higher priority. If @sg does not have SMT
+ * siblings, only pull tasks if all of the SMT siblings of @dst_cpu are idle
+ * and @sg has lower priority.
+ */
+static bool asym_can_pull_tasks(int dst_cpu, struct sd_lb_stats *sds,
+   struct sg_lb_stats *sgs, struct sched_group *sg)
+{
+#ifdef CONFIG_SCHED_SMT
+   int cpu, local_busy_cpus, sg_busy_cpus;
+   bool local_is_smt, sg_is_smt;
+
+   if (!arch_asym_check_smt_siblings())
+   return true;
+
+   cpu = group_first_cpu(sg);
+   local_is_smt = cpu_group_is_smt(dst_cpu, sds->local);
+   sg_is_smt = cpu_group_is_smt(cpu, sg);
+
+   sg_busy_cpus = sgs->group_weight - sgs->idle_cpus;
+
+   if (!local_is_smt) {
+   /*
+* If we are here, @dst_cpu is idle and does not have SMT
+* siblings. Pull tasks if candidate group has two or more
+* busy CPUs.
+*/
+   if (sg_is_smt && sg_busy_cpus >= 2)
+   return true;
+
+   /*
+ 
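
The archive truncates the patch at this point. Based on the kernel-doc comment above, the remaining branches of asym_can_pull_tasks() might continue roughly as sketched below (an illustration of the stated rules, not the author's exact hunk):

		 * @sg has at most one busy CPU. @dst_cpu can only help if it
		 * has higher priority.
		 */
		return sched_asym_prefer(dst_cpu, sg->asym_prefer_cpu);
	}

	local_busy_cpus = sds->local->group_weight - sds->local_stat.idle_cpus;

	if (sg_is_smt) {
		int busy_cpus_delta = sg_busy_cpus - local_busy_cpus;

		/* Even out the number of busy CPUs between the two groups. */
		if (busy_cpus_delta >= 2)
			return true;
		if (busy_cpus_delta == 1)
			return sched_asym_prefer(dst_cpu, sg->asym_prefer_cpu);

		return false;
	}

	/*
	 * @dst_cpu has SMT siblings but @sg does not: pull only if all SMT
	 * siblings of @dst_cpu are idle and @sg has lower priority.
	 */
	return !local_busy_cpus &&
	       sched_asym_prefer(dst_cpu, sg->asym_prefer_cpu);
#else
	return true;
#endif
}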

[PATCH v2 4/4] x86/sched: Enable checks of the state of SMT siblings in load balancing

2021-04-13 Thread Ricardo Neri
ITMT relies on asymmetric packing of tasks to ensure CPUs are populated in
priority order. When balancing load, the scheduler compares scheduling
groups in pairs, and compares only the priority of the highest-priority CPU
in each group. This may result in CPUs with medium priority being
overlooked. A recent change introduced logic to also consider the idle
state of the SMT siblings of the CPU doing the load balance. Enable those
checks for x86 when using ITMT.

Cc: Aubrey Li 
Cc: Ben Segall 
Cc: Daniel Bristot de Oliveira 
Cc: Dietmar Eggemann 
Cc: Joel Fernandes (Google) 
Cc: Mel Gorman 
Cc: Quentin Perret 
Cc: Srinivas Pandruvada 
Cc: Steven Rostedt 
Cc: Tim Chen 
Reviewed-by: Len Brown 
Signed-off-by: Ricardo Neri 
---
Changes since v1:
  * None
---
 arch/x86/kernel/itmt.c | 15 +++
 1 file changed, 15 insertions(+)

diff --git a/arch/x86/kernel/itmt.c b/arch/x86/kernel/itmt.c
index 1afbdd1dd777..1407120af82d 100644
--- a/arch/x86/kernel/itmt.c
+++ b/arch/x86/kernel/itmt.c
@@ -28,6 +28,8 @@ DEFINE_PER_CPU_READ_MOSTLY(int, sched_core_priority);
 
 /* Boolean to track if system has ITMT capabilities */
 static bool __read_mostly sched_itmt_capable;
+/* Boolean to activate checks on the state of SMT siblings */
+static bool __read_mostly sched_itmt_smt_checks;
 
 /*
  * Boolean to control whether we want to move processes to cpu capable
@@ -124,6 +126,8 @@ int sched_set_itmt_support(void)
 
sysctl_sched_itmt_enabled = 1;
 
+   sched_itmt_smt_checks = true;
+
x86_topology_update = true;
rebuild_sched_domains();
 
@@ -160,6 +164,7 @@ void sched_clear_itmt_support(void)
if (sysctl_sched_itmt_enabled) {
/* disable sched_itmt if we are no longer ITMT capable */
sysctl_sched_itmt_enabled = 0;
+   sched_itmt_smt_checks = false;
x86_topology_update = true;
rebuild_sched_domains();
}
@@ -167,6 +172,16 @@ void sched_clear_itmt_support(void)
mutex_unlock(&itmt_update_mutex);
 }
 
+bool arch_asym_check_smt_siblings(void)
+{
+   return sched_itmt_smt_checks;
+}
+
+bool arch_sched_asym_prefer_early(int a, int b)
+{
+   return sched_itmt_smt_checks;
+}
+
 int arch_asym_cpu_priority(int cpu)
 {
return per_cpu(sched_core_priority, cpu);
-- 
2.17.1



[PATCH v2 2/4] sched/fair: Introduce arch_sched_asym_prefer_early()

2021-04-13 Thread Ricardo Neri
Introduce arch_sched_asym_prefer_early() so that architectures with SMT
can delay the decision to label a candidate busiest group as
group_asym_packing.

When using asymmetric packing, high priority idle CPUs pull tasks from
scheduling groups with low priority CPUs. The decision on using asymmetric
packing for load balancing is done after collecting the statistics of a
candidate busiest group. However, this decision needs to consider the
state of SMT siblings of dst_cpu.

Cc: Aubrey Li 
Cc: Ben Segall 
Cc: Daniel Bristot de Oliveira 
Cc: Dietmar Eggemann 
Cc: Joel Fernandes (Google) 
Cc: Mel Gorman 
Cc: Quentin Perret 
Cc: Srinivas Pandruvada 
Cc: Steven Rostedt 
Cc: Tim Chen 
Reviewed-by: Len Brown 
Signed-off-by: Ricardo Neri 
---
Changes since v1:
  * None
---
 include/linux/sched/topology.h |  1 +
 kernel/sched/fair.c| 11 ++-
 2 files changed, 11 insertions(+), 1 deletion(-)

diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index 8f0f778b7c91..663b98959305 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -57,6 +57,7 @@ static inline int cpu_numa_flags(void)
 #endif
 
 extern int arch_asym_cpu_priority(int cpu);
+extern bool arch_sched_asym_prefer_early(int a, int b);
 
 struct sched_domain_attr {
int relax_domain_level;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 4ef3fa0d5e8d..e74da853b046 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -106,6 +106,15 @@ int __weak arch_asym_cpu_priority(int cpu)
return -cpu;
 }
 
+/*
+ * For asym packing, early check if CPUs with higher priority should be
+ * preferred. On some architectures, more data is needed to make a decision.
+ */
+bool __weak arch_sched_asym_prefer_early(int a, int b)
+{
+   return sched_asym_prefer(a, b);
+}
+
 /*
  * The margin used when comparing utilization with CPU capacity.
  *
@@ -8458,7 +8467,7 @@ static inline void update_sg_lb_stats(struct lb_env *env,
if (!local_group && env->sd->flags & SD_ASYM_PACKING &&
env->idle != CPU_NOT_IDLE &&
sgs->sum_h_nr_running &&
-   sched_asym_prefer(env->dst_cpu, group->asym_prefer_cpu)) {
+   arch_sched_asym_prefer_early(env->dst_cpu, group->asym_prefer_cpu)) {
sgs->group_asym_packing = 1;
}
 
-- 
2.17.1
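
arch_sched_asym_prefer_early() is declared __weak, so the generic definition above is only used when no architecture provides a strong one (x86 adds its override in patch 4/4). A small stand-alone sketch of that linkage mechanism, using the plain GCC attribute instead of the kernel's __weak macro and an invented priority rule for illustration:

/* fair_default.c (sketch): generic fallback, marked weak */
#include <stdbool.h>

__attribute__((weak)) bool arch_sched_asym_prefer_early(int a, int b)
{
	/* Stand-in for sched_asym_prefer(): lower CPU number = higher priority. */
	return a < b;
}

/* itmt_override.c (sketch): a strong definition replaces the weak one at link time */
#include <stdbool.h>

bool arch_sched_asym_prefer_early(int a, int b)
{
	/* Defer the real decision until SMT sibling state is known, as the
	 * ITMT override in patch 4/4 does while its checks are enabled. */
	return true;
}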



[PATCH v2 0/4] sched/fair: Fix load balancing of SMT siblings with ASYM_PACKING

2021-04-13 Thread Ricardo Neri
0)
threads-sockets group-4  1.00 (  0.70)   +0.61 (  0.88)
threads-sockets group-8  1.00 (  1.20)   +0.29 (  2.20)

netperf
===
caseloadbaseline(std%)  compare%( std%)
TCP_RR  thread-2 1.00 (  0.72)   +3.48 (  0.40)
TCP_RR  thread-4 1.00 (  0.38)   +4.11 (  0.35)
TCP_RR  thread-8 1.00 (  9.29)   -0.67 ( 14.12)
TCP_RR  thread-161.00 ( 21.18)   -0.10 ( 20.92)
UDP_RR  thread-2 1.00 (  0.23)   +4.08 (  0.16)
UDP_RR  thread-4 1.00 (  0.19)   +3.53 (  0.27)
UDP_RR  thread-8 1.00 (  6.83)   -1.63 ( 11.09)
UDP_RR  thread-161.00 ( 20.85)   -1.02 ( 21.20)

tbench
==
caseloadbaseline(std%)  compare%( std%)
loopbackthread-2 1.00 (  0.15)   +4.49 (  0.20)
loopbackthread-4 1.00 (  0.25)   +3.99 (  0.32)
loopbackthread-8 1.00 (  0.16)   +0.49 (  0.30)
loopbackthread-161.00 (  0.32)   +0.49 (  0.21)

schbench

caseloadbaseline(std%)  compare%( std%)
normal  mthread-11.00 ( 15.87)  -17.76 (  8.38)
normal  mthread-21.00 (  0.10)   -0.05 (  0.00)
normal  mthread-41.00 (  0.00)   +0.00 (  0.00)
normal  mthread-81.00 (  4.81)   +5.57 (  2.61)


  Table 2. Test results of patches with HWP
===
hackbench
=
caseloadbaseline(std%)  compare%( std%)
process-pipegroup-1  1.00 (  0.76)   +0.14 (  1.11)
process-pipegroup-2  1.00 (  2.59)   +0.88 (  3.16)
process-pipegroup-4  1.00 (  4.05)   +1.13 (  4.86)
process-pipegroup-8  1.00 (  7.37)   -2.34 ( 14.43)
process-sockets group-1  1.00 ( 15.98)   +7.77 (  1.44)
process-sockets group-2  1.00 (  1.64)   +0.00 (  1.59)
process-sockets group-4  1.00 (  0.95)   -1.54 (  0.85)
process-sockets group-8  1.00 (  1.84)   -3.27 (  4.86)
threads-pipegroup-1  1.00 (  3.27)   +0.64 (  2.91)
threads-pipegroup-2  1.00 (  3.02)   -0.09 (  1.50)
threads-pipegroup-4  1.00 (  5.39)   -6.34 (  3.11)
threads-pipegroup-8  1.00 (  5.56)   +4.61 ( 14.91)
threads-sockets group-1  1.00 (  2.76)   +4.70 (  0.94)
threads-sockets group-2  1.00 (  1.10)   +3.56 (  1.41)
threads-sockets group-4  1.00 (  0.45)   +2.11 (  1.32)
threads-sockets group-8  1.00 (  3.56)   +3.62 (  2.43)

netperf
===
caseloadbaseline(std%)  compare%( std%)
TCP_RR  thread-2 1.00 (  0.36)   +9.85 (  2.09)
TCP_RR  thread-4 1.00 (  0.31)   +1.30 (  0.53)
TCP_RR  thread-8 1.00 ( 11.70)   -0.42 ( 13.31)
TCP_RR  thread-161.00 ( 23.49)   -0.55 ( 21.79)
UDP_RR  thread-2 1.00 (  0.19)  +13.11 (  7.48)
UDP_RR  thread-4 1.00 (  0.13)   +2.69 (  0.26)
UDP_RR  thread-8 1.00 (  8.95)   -1.39 ( 12.39)
UDP_RR  thread-161.00 ( 21.54)   -0.77 ( 20.97)

tbench
==
caseloadbaseline(std%)  compare%( std%)
loopbackthread-2 1.00 (  0.22)   +5.26 (  0.46)
loopbackthread-4 1.00 (  2.56)  +52.11 (  0.73)
loopbackthread-8 1.00 (  0.41)   +0.53 (  0.42)
loopbackthread-161.00 (  0.60)   +1.39 (  0.33)

schbench

caseloadbaseline(std%)  compare%( std%)
normal  mthread-11.00 ( 12.01)   -1.72 (  2.08)
normal  mthread-21.00 (  0.00)   +0.00 (  0.00)
normal  mthread-41.00 (  0.00)   +0.00 (  0.00)
normal  mthread-81.00 (  0.00)   +0.00 (  0.00)


[1]. 
https://lore.kernel.org/lkml/20210406041108.7416-1-ricardo.neri-calde...@linux.intel.com/

Ricardo Neri (4):
  sched/fair: Optimize checking for group_asym_packing
  sched/fair: Introduce arch_sched_asym_prefer_early()
  sched/fair: Consider SMT in ASYM_PACKING load balance
  x86/sched: Enable checks of the state of SMT siblings in load
balancing

 arch/x86/kernel/itmt.c |  15 
 include/linux/sched/topology.h |   2 +
 kernel/sched/fair.c| 140 -
 3 files changed, 155 insertions(+), 2 deletions(-)

-- 
2.17.1



[PATCH v2 1/4] sched/fair: Optimize checking for group_asym_packing

2021-04-13 Thread Ricardo Neri
By checking local_group, we can avoid additional checks and invoking
sched_asym_prefer() when it is not needed.

Cc: Aubrey Li 
Cc: Ben Segall 
Cc: Daniel Bristot de Oliveira 
Cc: Dietmar Eggemann 
Cc: Joel Fernandes (Google) 
Cc: Mel Gorman 
Cc: Quentin Perret 
Cc: Srinivas Pandruvada 
Cc: Steven Rostedt 
Cc: Tim Chen 
Reviewed-by: Len Brown 
Signed-off-by: Ricardo Neri 
---
Changes since v1:
  * None
---
 kernel/sched/fair.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 04a3ce20da67..4ef3fa0d5e8d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8455,7 +8455,7 @@ static inline void update_sg_lb_stats(struct lb_env *env,
}
 
/* Check if dst CPU is idle and preferred to this group */
-   if (env->sd->flags & SD_ASYM_PACKING &&
+   if (!local_group && env->sd->flags & SD_ASYM_PACKING &&
env->idle != CPU_NOT_IDLE &&
sgs->sum_h_nr_running &&
sched_asym_prefer(env->dst_cpu, group->asym_prefer_cpu)) {
-- 
2.17.1



Re: [PATCH 3/4] sched/fair: Consider SMT in ASYM_PACKING load balance

2021-04-08 Thread Ricardo Neri
On Thu, Apr 08, 2021 at 01:21:22PM +0200, Peter Zijlstra wrote:
> On Tue, Apr 06, 2021 at 04:17:51PM -0700, Ricardo Neri wrote:
> > On Tue, Apr 06, 2021 at 01:18:09PM +0200, Peter Zijlstra wrote:
> > > On Mon, Apr 05, 2021 at 09:11:07PM -0700, Ricardo Neri wrote:
> > > > +static bool cpu_group_is_smt(int cpu, struct sched_group *sg)
> > > > +{
> > > > +#ifdef CONFIG_SCHED_SMT
> > > > +   if (!static_branch_likely(&sched_smt_present))
> > > > +   return false;
> > > > +
> > > > +   if (sg->group_weight == 1)
> > > > +   return false;
> > > > +
> > > > +   if (cpumask_weight(cpu_smt_mask(cpu)) == 1)
> > > > +   return false;
> > > 
> > > Please explain this condition. Why is it required?
> > 
> > Thank you for your quick review Peter!
> > 
> > Probably this is not required since the previous check verifies the
> > group weight, and the subsequent check makes sure that @sg matches the
> > SMT siblings of @cpu.
> 
> So the thing is that cpumask_weight() can be fairly expensive, depending
> on how large the machine is.
> 
> Now I suppose this mixing of SMT and !SMT cores is typical for 'small'
> machines (for now), but this is enabled for everything with ITMT on,
> which might very well include large systems.
> 
> So yes, if it can go away, that'd be good.

Sure Peter, I think this check can be removed. I'll post a v2 with the
updates.

Thanks and BR,
Ricardo


Re: [PATCH 3/4] sched/fair: Consider SMT in ASYM_PACKING load balance

2021-04-08 Thread Ricardo Neri
On Thu, Apr 08, 2021 at 01:10:39PM +0200, Peter Zijlstra wrote:
> On Tue, Apr 06, 2021 at 04:17:10PM -0700, Ricardo Neri wrote:
> > On Tue, Apr 06, 2021 at 01:17:28PM +0200, Peter Zijlstra wrote:
> > > On Mon, Apr 05, 2021 at 09:11:07PM -0700, Ricardo Neri wrote:
> > > > @@ -8507,6 +8619,10 @@ static bool update_sd_pick_busiest(struct lb_env 
> > > > *env,
> > > > if (!sgs->sum_h_nr_running)
> > > > return false;
> > > >  
> > > > +   if (sgs->group_type == group_asym_packing &&
> > > > +   !asym_can_pull_tasks(env->dst_cpu, sds, sgs, sg))
> > > > +   return false;
> > > 
> > > All of this makes my head hurt; but afaict this isn't right.
> > > 
> > > Your update_sg_lb_stats() change makes that we unconditionally set
> > > sgs->group_asym_packing, and then this is to undo that. But it's not
> > > clear this covers all cases right.
> > 
> > We could not make a decision to set sgs->group_asym_packing in
> > update_sg_lb_stats() because we don't have information about the dst_cpu
> > and its SMT siblings if any. That is the reason I proposed to delay the
> > decision to update_sd_pick_busiest(), where we can compare local and
> > sgs.
> 
> Yeah, I sorta got that.
> 
> > > Even if !sched_asym_prefer(), we could end up selecting this sg as
> > > busiest, but you're just bailing out here.
> > 
> > Even if sgs->group_asym_packing is unconditionally set, sgs can still
> > be classified as group_overloaded and group_imbalanced. In such cases
> > we wouldn't bailout. sgs could not be classified as group_fully_busy
> > or group_has_spare and we would bailout, though. Is your concern about
> > these? I can fixup these two cases.
> 
> Yes. Either explain (in a comment) why those cases are not relevant, or
> handle them properly.
> 
> Because when reading this, it wasn't at all obvious that this is correct
> or as intended.

Sure Peter, I will post a v2 handling the remaining cases properly.

Thanks and BR,
Ricardo
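
For reference, the reclassification that the v2 changelog describes could look roughly like the sketch below in update_sd_pick_busiest(); this is only an illustration of the idea (group_classify() is the existing helper in fair.c), not the hunk that was actually posted:

	if (sgs->group_type == group_asym_packing &&
	    !asym_can_pull_tasks(env->dst_cpu, sds, sgs, sg)) {
		/*
		 * @dst_cpu cannot pull tasks for asym packing. Stop treating
		 * @sg as group_asym_packing and let it compete under its
		 * other possible classifications (group_has_spare,
		 * group_fully_busy, ...).
		 */
		sgs->group_asym_packing = 0;
		sgs->group_type = group_classify(env->sd->imbalance_pct, sg, sgs);
	}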


Re: [PATCH 2/4] sched/fair: Introduce arch_sched_asym_prefer_early()

2021-04-06 Thread Ricardo Neri
On Tue, Apr 06, 2021 at 04:31:35PM +0200, Vincent Guittot wrote:
> On Tue, 6 Apr 2021 at 06:11, Ricardo Neri
>  wrote:
> >
> > Introduce arch_sched_asym_prefer_early() so that architectures with SMT
> > can delay the decision to label a candidate busiest group as
> > group_asym_packing.
> >
> > When using asymmetric packing, high priority idle CPUs pull tasks from
> > scheduling groups with low priority CPUs. The decision on using asymmetric
> > packing for load balancing is done after collecting the statistics of a
> > candidate busiest group. However, this decision needs to consider the
> > state of SMT siblings of dst_cpu.
> >
> > Cc: Aubrey Li 
> > Cc: Ben Segall 
> > Cc: Daniel Bristot de Oliveira 
> > Cc: Dietmar Eggemann 
> > Cc: Joel Fernandes (Google) 
> > Cc: Mel Gorman 
> > Cc: Quentin Perret 
> > Cc: Srinivas Pandruvada 
> > Cc: Steven Rostedt 
> > Cc: Tim Chen 
> > Reviewed-by: Len Brown 
> > Signed-off-by: Ricardo Neri 
> > ---
> >  include/linux/sched/topology.h |  1 +
> >  kernel/sched/fair.c| 11 ++-
> >  2 files changed, 11 insertions(+), 1 deletion(-)
> >
> > diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
> > index 8f0f778b7c91..663b98959305 100644
> > --- a/include/linux/sched/topology.h
> > +++ b/include/linux/sched/topology.h
> > @@ -57,6 +57,7 @@ static inline int cpu_numa_flags(void)
> >  #endif
> >
> >  extern int arch_asym_cpu_priority(int cpu);
> > +extern bool arch_sched_asym_prefer_early(int a, int b);
> >
> >  struct sched_domain_attr {
> > int relax_domain_level;
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 4ef3fa0d5e8d..e74da853b046 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -106,6 +106,15 @@ int __weak arch_asym_cpu_priority(int cpu)
> > return -cpu;
> >  }
> >
> > +/*
> > + * For asym packing, early check if CPUs with higher priority should be
> > + * preferred. On some architectures, more data is needed to make a 
> > decision.
> > + */
> > +bool __weak arch_sched_asym_prefer_early(int a, int b)
> > +{
> > +   return sched_asym_prefer(a, b);
> > +}
> > +
> >  /*
> >   * The margin used when comparing utilization with CPU capacity.
> >   *
> > @@ -8458,7 +8467,7 @@ static inline void update_sg_lb_stats(struct lb_env 
> > *env,
> > if (!local_group && env->sd->flags & SD_ASYM_PACKING &&
> > env->idle != CPU_NOT_IDLE &&
> > sgs->sum_h_nr_running &&
> > -   sched_asym_prefer(env->dst_cpu, group->asym_prefer_cpu)) {
> > +   arch_sched_asym_prefer_early(env->dst_cpu, 
> > group->asym_prefer_cpu)) {
> 
> If itmt set arch_sched_asym_prefer_early to true all groups will be
> set as group_asym_packing unconditionally which is wrong. The state
> has to be set only when we want asym packing migration

Thanks for your quick review Vincent! 

Indeed, sgs->group_asym_packing should only be set when we want
asym_packing migration. However, for ITMT we also need to know the state
of the SMT siblings of dst_cpu, if any. update_sg_lb_stats() does not know
anything about the local group. Thus, arch_sched_asym_prefer_early()
intends to give an early decision that can be revoked later when
statistics of the local group are known. I also thought about checking
the state of the SMT siblings from update_sg_lb_stats() but that would
duplicate the functionality of update_sg_lb_stats() when called on the
local group.

The group can still be classified as group_overloaded and
group_imbalanced as they take precedence over group_asym_packing. We
would miss the group_has_spare and group_fully_busy classifications,
though. I can look into fixing that.

Thanks and BR,
Ricardo


Re: [PATCH 3/4] sched/fair: Consider SMT in ASYM_PACKING load balance

2021-04-06 Thread Ricardo Neri
On Tue, Apr 06, 2021 at 01:18:09PM +0200, Peter Zijlstra wrote:
> On Mon, Apr 05, 2021 at 09:11:07PM -0700, Ricardo Neri wrote:
> > +static bool cpu_group_is_smt(int cpu, struct sched_group *sg)
> > +{
> > +#ifdef CONFIG_SCHED_SMT
> > +   if (!static_branch_likely(&sched_smt_present))
> > +   return false;
> > +
> > +   if (sg->group_weight == 1)
> > +   return false;
> > +
> > +   if (cpumask_weight(cpu_smt_mask(cpu)) == 1)
> > +   return false;
> 
> Please explain this condition. Why is it required?

Thank you for your quick review Peter!

Probably this is not required since the previous check verifies the
group weight, and the subsequent check makes sure that @sg matches the
SMT siblings of @cpu.

Thanks and BR,
Ricardo


Re: [PATCH 3/4] sched/fair: Consider SMT in ASYM_PACKING load balance

2021-04-06 Thread Ricardo Neri
On Tue, Apr 06, 2021 at 01:17:28PM +0200, Peter Zijlstra wrote:
> On Mon, Apr 05, 2021 at 09:11:07PM -0700, Ricardo Neri wrote:
> > @@ -8507,6 +8619,10 @@ static bool update_sd_pick_busiest(struct lb_env 
> > *env,
> > if (!sgs->sum_h_nr_running)
> > return false;
> >  
> > +   if (sgs->group_type == group_asym_packing &&
> > +   !asym_can_pull_tasks(env->dst_cpu, sds, sgs, sg))
> > +   return false;
> 
> All of this makes my head hurt; but afaict this isn't right.
> 
> Your update_sg_lb_stats() change makes that we unconditionally set
> sgs->group_asym_packing, and then this is to undo that. But it's not
> clear this covers all cases right.

We could not make a decision to set sgs->group_asym_packing in
update_sg_lb_stats() because we don't have information about the dst_cpu
and its SMT siblings if any. That is the reason I proposed to delay the
decision to update_sd_pick_busiest(), where we can compare local and
sgs.
> 
> Even if !sched_asym_prefer(), we could end up selecting this sg as
> busiest, but you're just bailing out here.

Even if sgs->group_asym_packing is unconditionally set, sgs can still
be classified as group_overloaded and group_imbalanced. In such cases
we wouldn't bailout. sgs could not be classified as group_fully_busy
or group_has_spare and we would bailout, though. Is your concern about
these? I can fixup these two cases.

Thanks and BR,
Ricardo
> 


[PATCH 4/4] x86/sched: Enable checks of the state of SMT siblings in load balancing

2021-04-05 Thread Ricardo Neri
ITMT relies on asymmetric packing of tasks to ensure CPUs are populated in
priority order. When balancing load, the scheduler compares scheduling
groups in pairs, and compares only the priority of the highest-priority CPU
in each group. This may result in CPUs with medium priority being
overlooked. A recent change introduced logic to also consider the idle
state of the SMT siblings of the CPU doing the load balance. Enable those
checks for x86 when using ITMT.

Cc: Aubrey Li 
Cc: Ben Segall 
Cc: Daniel Bristot de Oliveira 
Cc: Dietmar Eggemann 
Cc: Joel Fernandes (Google) 
Cc: Mel Gorman 
Cc: Quentin Perret 
Cc: Srinivas Pandruvada 
Cc: Steven Rostedt 
Cc: Tim Chen 
Reviewed-by: Len Brown 
Signed-off-by: Ricardo Neri 
---
 arch/x86/kernel/itmt.c | 15 +++
 1 file changed, 15 insertions(+)

diff --git a/arch/x86/kernel/itmt.c b/arch/x86/kernel/itmt.c
index 1afbdd1dd777..1407120af82d 100644
--- a/arch/x86/kernel/itmt.c
+++ b/arch/x86/kernel/itmt.c
@@ -28,6 +28,8 @@ DEFINE_PER_CPU_READ_MOSTLY(int, sched_core_priority);
 
 /* Boolean to track if system has ITMT capabilities */
 static bool __read_mostly sched_itmt_capable;
+/* Boolean to activate checks on the state of SMT siblings */
+static bool __read_mostly sched_itmt_smt_checks;
 
 /*
  * Boolean to control whether we want to move processes to cpu capable
@@ -124,6 +126,8 @@ int sched_set_itmt_support(void)
 
sysctl_sched_itmt_enabled = 1;
 
+   sched_itmt_smt_checks = true;
+
x86_topology_update = true;
rebuild_sched_domains();
 
@@ -160,6 +164,7 @@ void sched_clear_itmt_support(void)
if (sysctl_sched_itmt_enabled) {
/* disable sched_itmt if we are no longer ITMT capable */
sysctl_sched_itmt_enabled = 0;
+   sched_itmt_smt_checks = false;
x86_topology_update = true;
rebuild_sched_domains();
}
@@ -167,6 +172,16 @@ void sched_clear_itmt_support(void)
mutex_unlock(&itmt_update_mutex);
 }
 
+bool arch_asym_check_smt_siblings(void)
+{
+   return sched_itmt_smt_checks;
+}
+
+bool arch_sched_asym_prefer_early(int a, int b)
+{
+   return sched_itmt_smt_checks;
+}
+
 int arch_asym_cpu_priority(int cpu)
 {
return per_cpu(sched_core_priority, cpu);
-- 
2.17.1



[PATCH 2/4] sched/fair: Introduce arch_sched_asym_prefer_early()

2021-04-05 Thread Ricardo Neri
Introduce arch_sched_asym_prefer_early() so that architectures with SMT
can delay the decision to label a candidate busiest group as
group_asym_packing.

When using asymmetric packing, high priority idle CPUs pull tasks from
scheduling groups with low priority CPUs. The decision on using asymmetric
packing for load balancing is done after collecting the statistics of a
candidate busiest group. However, this decision needs to consider the
state of SMT siblings of dst_cpu.

Cc: Aubrey Li 
Cc: Ben Segall 
Cc: Daniel Bristot de Oliveira 
Cc: Dietmar Eggemann 
Cc: Joel Fernandes (Google) 
Cc: Mel Gorman 
Cc: Quentin Perret 
Cc: Srinivas Pandruvada 
Cc: Steven Rostedt 
Cc: Tim Chen 
Reviewed-by: Len Brown 
Signed-off-by: Ricardo Neri 
---
 include/linux/sched/topology.h |  1 +
 kernel/sched/fair.c| 11 ++-
 2 files changed, 11 insertions(+), 1 deletion(-)

diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index 8f0f778b7c91..663b98959305 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -57,6 +57,7 @@ static inline int cpu_numa_flags(void)
 #endif
 
 extern int arch_asym_cpu_priority(int cpu);
+extern bool arch_sched_asym_prefer_early(int a, int b);
 
 struct sched_domain_attr {
int relax_domain_level;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 4ef3fa0d5e8d..e74da853b046 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -106,6 +106,15 @@ int __weak arch_asym_cpu_priority(int cpu)
return -cpu;
 }
 
+/*
+ * For asym packing, early check if CPUs with higher priority should be
+ * preferred. On some architectures, more data is needed to make a decision.
+ */
+bool __weak arch_sched_asym_prefer_early(int a, int b)
+{
+   return sched_asym_prefer(a, b);
+}
+
 /*
  * The margin used when comparing utilization with CPU capacity.
  *
@@ -8458,7 +8467,7 @@ static inline void update_sg_lb_stats(struct lb_env *env,
if (!local_group && env->sd->flags & SD_ASYM_PACKING &&
env->idle != CPU_NOT_IDLE &&
sgs->sum_h_nr_running &&
-   sched_asym_prefer(env->dst_cpu, group->asym_prefer_cpu)) {
+   arch_sched_asym_prefer_early(env->dst_cpu, group->asym_prefer_cpu)) {
sgs->group_asym_packing = 1;
}
 
-- 
2.17.1



[PATCH 1/4] sched/fair: Optimize checking for group_asym_packing

2021-04-05 Thread Ricardo Neri
By checking local_group, we can avoid additional checks and invoking
sched_asym_prefer() when it is not needed.

Cc: Aubrey Li 
Cc: Ben Segall 
Cc: Daniel Bristot de Oliveira 
Cc: Dietmar Eggemann 
Cc: Joel Fernandes (Google) 
Cc: Mel Gorman 
Cc: Quentin Perret 
Cc: Srinivas Pandruvada 
Cc: Steven Rostedt 
Cc: Tim Chen 
Reviewed-by: Len Brown 
Signed-off-by: Ricardo Neri 
---
 kernel/sched/fair.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 04a3ce20da67..4ef3fa0d5e8d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8455,7 +8455,7 @@ static inline void update_sg_lb_stats(struct lb_env *env,
}
 
/* Check if dst CPU is idle and preferred to this group */
-   if (env->sd->flags & SD_ASYM_PACKING &&
+   if (!local_group && env->sd->flags & SD_ASYM_PACKING &&
env->idle != CPU_NOT_IDLE &&
sgs->sum_h_nr_running &&
sched_asym_prefer(env->dst_cpu, group->asym_prefer_cpu)) {
-- 
2.17.1



[PATCH 3/4] sched/fair: Consider SMT in ASYM_PACKING load balance

2021-04-05 Thread Ricardo Neri
When deciding to pull tasks in ASYM_PACKING, it is necessary not only to
check for the idle state of the CPU doing the load balancing, but also of
its SMT siblings.

If the CPU doing the balancing is idle but its SMT siblings are busy,
performance suffers if it pulls tasks from a medium priority CPU that does
not have SMT siblings. The decision to label a group for asymmetric packing
balancing is done in update_sg_lb_stats(). However, for SMT, that code does
not account for idle SMT siblings.

Implement asym_can_pull_tasks() to revisit the early decision on whether
the CPU doing the balance can pull tasks once the needed information is
available. arch_sched_asym_prefer_early() and
arch_asym_check_smt_siblings() are used to conserve the legacy behavior.

Cc: Aubrey Li 
Cc: Ben Segall 
Cc: Daniel Bristot de Oliveira 
Cc: Dietmar Eggemann 
Cc: Joel Fernandes (Google) 
Cc: Mel Gorman 
Cc: Quentin Perret 
Cc: Srinivas Pandruvada 
Cc: Steven Rostedt 
Cc: Tim Chen 
Reviewed-by: Len Brown 
Signed-off-by: Ricardo Neri 
---
 include/linux/sched/topology.h |   1 +
 kernel/sched/fair.c| 122 +
 2 files changed, 123 insertions(+)

diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index 663b98959305..6487953b24e8 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -58,6 +58,7 @@ static inline int cpu_numa_flags(void)
 
 extern int arch_asym_cpu_priority(int cpu);
 extern bool arch_sched_asym_prefer_early(int a, int b);
+extern bool arch_asym_check_smt_siblings(void);
 
 struct sched_domain_attr {
int relax_domain_level;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e74da853b046..14b18dfea5b1 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -115,6 +115,14 @@ bool __weak arch_sched_asym_prefer_early(int a, int b)
return sched_asym_prefer(a, b);
 }
 
+/*
+ * For asym packing, first check the state of SMT siblings before deciding to
+ * pull tasks.
+ */
+bool __weak arch_asym_check_smt_siblings(void)
+{
+   return false;
+}
 /*
  * The margin used when comparing utilization with CPU capacity.
  *
@@ -8483,6 +8491,110 @@ static inline void update_sg_lb_stats(struct lb_env *env,
sgs->group_capacity;
 }
 
+static bool cpu_group_is_smt(int cpu, struct sched_group *sg)
+{
+#ifdef CONFIG_SCHED_SMT
+   if (!static_branch_likely(&sched_smt_present))
+   return false;
+
+   if (sg->group_weight == 1)
+   return false;
+
+   if (cpumask_weight(cpu_smt_mask(cpu)) == 1)
+   return false;
+
+   return cpumask_equal(sched_group_span(sg), cpu_smt_mask(cpu));
+#else
+   return false;
+#endif
+}
+
+/**
+ * asym_can_pull_tasks - Check whether the load balancing CPU can pull tasks
+ * @dst_cpu:   CPU doing the load balancing
+ * @sds:   Load-balancing data with statistics of the local group
+ * @sgs:   Load-balancing statistics of the candidate busiest group
+ * @sg:    The candidate busiest group
+ *
+ * Check the state of the SMT siblings of both @sds::local and @sg and decide
+ * if @dst_cpu can pull tasks. If @dst_cpu does not have SMT siblings, it can
+ * pull tasks if two or more of the SMT siblings of @sg are busy. If only one
+ * CPU in @sg is busy, pull tasks only if @dst_cpu has higher priority.
+ *
+ * If both @dst_cpu and @sg have SMT siblings, even out the number of busy
+ * CPUs between @sds::local and @sg. Thus, pull tasks from @sg if the
+ * difference in the number of busy CPUs is 2 or more. If the difference is 1,
+ * only pull if @dst_cpu has higher priority. If @sg does not have SMT
+ * siblings, only pull tasks if all of the SMT siblings of @dst_cpu are idle
+ * and @sg has lower priority.
+ */
+static bool asym_can_pull_tasks(int dst_cpu, struct sd_lb_stats *sds,
+   struct sg_lb_stats *sgs, struct sched_group *sg)
+{
+#ifdef CONFIG_SCHED_SMT
+   int cpu, local_busy_cpus, sg_busy_cpus;
+   bool local_is_smt, sg_is_smt;
+
+   if (!arch_asym_check_smt_siblings())
+   return true;
+
+   cpu = group_first_cpu(sg);
+   local_is_smt = cpu_group_is_smt(dst_cpu, sds->local);
+   sg_is_smt = cpu_group_is_smt(cpu, sg);
+
+   sg_busy_cpus = sgs->group_weight - sgs->idle_cpus;
+
+   if (!local_is_smt) {
+   /*
+* If we are here, @dst_cpu is idle and does not have SMT
+* siblings. Pull tasks if candidate group has two or more
+* busy CPUs.
+*/
+   if (sg_is_smt && sg_busy_cpus >= 2)
+   return true;
+
+   /*
+* @dst_cpu does not have SMT siblings. @sg may have SMT
+* siblings and only one is busy. In such case, @dst_cpu
+* can help if it has higher priority and is idle.
+*/
+   return !

[PATCH 0/4] sched/fair: Fix load balancing of SMT siblings with ASYM_PACKING

2021-04-05 Thread Ricardo Neri
UDP_RR  thread-1 1.00 (  0.13)   -0.24 (  0.20)
UDP_RR  thread-3 1.00 (  0.32)   +3.82 (  0.46)
UDP_RR  thread-7 1.00 (  8.85)   +0.43 ( 13.60)
UDP_RR  thread-141.00 ( 20.71)   +0.58 ( 21.85)

tbench
==
caseloadbaseline(std%)  compare%( std%)
loopbackthread-1 1.00 (  0.10)   +0.06 (  0.27)
loopbackthread-3 1.00 (  0.27)   +3.87 (  0.11)
loopbackthread-7 1.00 (  0.28)   -0.27 (  1.29)
loopbackthread-141.00 (  0.07)   +0.52 (  0.24)

schbench

caseloadbaseline(std%)  compare%( std%)
normal  mthread-11.00 (  2.33)  +32.38 (  1.99)
normal  mthread-21.00 (  0.00)   +0.00 (  0.00)
normal  mthread-41.00 (  0.00)   +0.00 (  0.00)
normal  mthread-81.00 (  3.01)   -2.16 (  3.04)

  Table 2. Test results of patches with HWP
===

hackbench
=
caseloadbaseline(std%)  compare%( std%)
process-pipegroup-1  1.00 (  4.73)   -3.23 (  2.62)
process-pipegroup-2  1.00 (  2.32)   +0.13 (  0.36)
process-pipegroup-4  1.00 (  3.85)   +1.95 (  3.92)
process-pipegroup-8  1.00 ( 14.31)   +2.12 ( 13.44)
process-sockets group-1  1.00 (  3.10)   -1.22 (  1.95)
process-sockets group-2  1.00 (  0.91)   -1.30 (  1.20)
process-sockets group-4  1.00 (  0.87)   -1.29 (  1.35)
process-sockets group-8  1.00 (  4.98)   +1.21 (  4.57)
threads-pipegroup-1  1.00 (  0.76)   +2.87 (  0.89)
threads-pipegroup-2  1.00 (  1.41)   -3.19 (  1.18)
threads-pipegroup-4  1.00 (  2.37)   -1.22 (  1.41)
threads-pipegroup-8  1.00 ( 15.12)   +6.64 (  5.82)
threads-sockets group-1  1.00 (  2.73)   +0.89 (  0.92)
threads-sockets group-2  1.00 (  1.59)   -4.32 (  1.23)
threads-sockets group-4  1.00 (  0.35)   -1.42 (  1.13)
threads-sockets group-8  1.00 (  0.77)   +0.23 (  1.35)

netperf
===
caseloadbaseline(std%)  compare%( std%)
TCP_RR  thread-1 1.00 (  0.27)   +0.05 (  0.41)
TCP_RR  thread-3 1.00 (  4.71)   +8.43 (  0.45)
TCP_RR  thread-7 1.00 ( 10.15)   -1.16 ( 16.95)
TCP_RR  thread-141.00 ( 23.46)   +0.21 ( 23.69)
UDP_RR  thread-1 1.00 (  0.16)   +1.52 (  0.81)
UDP_RR  thread-3 1.00 (  4.00)   +8.33 (  0.50)
UDP_RR  thread-7 1.00 (  7.39)   -0.99 ( 11.34)
UDP_RR  thread-141.00 ( 23.74)   -0.51 ( 21.87)

tbench
==
caseloadbaseline(std%)  compare%( std%)
loopbackthread-1 1.00 (  0.51)   -0.86 (  0.08)
loopbackthread-3 1.00 (  1.40)  +52.18 (  1.13)
loopbackthread-7 1.00 (  0.94)   +0.31 (  0.20)
loopbackthread-141.00 (  0.21)   -0.24 (  0.47)

schbench

caseloadbaseline(std%)  compare%( std%)
normal  mthread-11.00 ( 20.14)  +61.19 (  5.44)
normal  mthread-21.00 (  0.00)   +0.00 (  0.00)
normal  mthread-41.00 (  0.00)   +0.00 (  0.00)
normal  mthread-81.00 (  3.05)   -2.16 (  2.99)

Ricardo Neri (4):
  sched/fair: Optimize checking for group_asym_packing
  sched/fair: Introduce arch_sched_asym_prefer_early()
  sched/fair: Consider SMT in ASYM_PACKING load balance
  x86/sched: Enable checks of the state of SMT siblings in load
balancing

 arch/x86/kernel/itmt.c |  15 
 include/linux/sched/topology.h |   2 +
 kernel/sched/fair.c| 135 -
 3 files changed, 150 insertions(+), 2 deletions(-)

-- 
2.17.1



Re: [PATCH V2 1/25] x86/cpufeatures: Enumerate Intel Hybrid Technology feature bit

2021-03-10 Thread Ricardo Neri
On Wed, Mar 10, 2021 at 09:01:47PM +0100, Borislav Petkov wrote:
> On Wed, Mar 10, 2021 at 11:46:44AM -0800, Ricardo Neri wrote:
> > But this series provides the use case, right? Kan's patches handle PMU 
> > counters
> > that may differ cross types of CPUs. In patch 2, get_hybrid_params()
> > needs to check first if X86_FEATURE_HYBRID_CPU is enabled before
> > querying the hybrid parameters. Otherwise, we would need to rely on the
> > maximum level of CPUID, which may not be reliable.
> 
> On Wed, Mar 10, 2021 at 11:33:54AM -0800, Srinivas Pandruvada wrote:
> > We are working on changes to P-State driver for hybrid CPUs using this
> > define. They are still work in progress.
> > But this patch can be submitted later with our set of changes.
> 
> Answering to both with a single mail:
> 
> I don't have a problem with X86_FEATURE_HYBRID_CPU - I simply don't want
> to show "hybrid_cpu" in /proc/cpuinfo unless there's a valid use case
> for userspace to know that it is running on a hybrid CPU.

Ah, I get your point now. You would like to see

#define X86_FEATURE_HYBRID_CPU   (18*32+15) /* "" This part has CPUs of more than one type */

Right? Now your first comment makes sense.

Srinivas, Kan, I don't think we need to expose "hybrid_cpu" in
/proc/cpuinfo, do we?

Thanks and BR,
Ricardo
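
For reference, the leading "" in a cpufeatures.h comment is what keeps a flag out of /proc/cpuinfo: the build scripts take the flag's user-visible name from the quoted string in the comment, and an empty string hides it while the bit stays usable in the kernel. The two variants under discussion differ only in that marker:

/* Shown to user space as "hybrid_cpu" in /proc/cpuinfo: */
#define X86_FEATURE_HYBRID_CPU	(18*32+15) /* This part has CPUs of more than one type */

/* Hidden from /proc/cpuinfo, still testable with cpu_feature_enabled() etc.: */
#define X86_FEATURE_HYBRID_CPU	(18*32+15) /* "" This part has CPUs of more than one type */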


Re: [PATCH V2 1/25] x86/cpufeatures: Enumerate Intel Hybrid Technology feature bit

2021-03-10 Thread Ricardo Neri
On Wed, Mar 10, 2021 at 05:53:58PM +0100, Borislav Petkov wrote:
> On Wed, Mar 10, 2021 at 08:37:37AM -0800, kan.li...@linux.intel.com wrote:
> > From: Ricardo Neri 
> > 
> > Add feature enumeration to identify a processor with Intel Hybrid
> > Technology: one in which CPUs of more than one type are in the same package.
> > On a hybrid processor, all CPUs support the same homogeneous (i.e.,
> > symmetric) instruction set. All CPUs enumerate the same features in CPUID.
> > Thus, software (user space and kernel) can run and migrate to any CPU in
> > the system as well as utilize any of the enumerated features without any
> > change or special provisions. The main differences among CPUs in a hybrid
> > processor are their power and performance properties.
> > 
> > Cc: Andi Kleen 
> > Cc: Kan Liang 
> > Cc: "Peter Zijlstra (Intel)" 
> > Cc: "Rafael J. Wysocki" 
> > Cc: "Ravi V. Shankar" 
> > Cc: Srinivas Pandruvada 
> > Cc: linux-kernel@vger.kernel.org
> > Reviewed-by: Len Brown 
> > Reviewed-by: Tony Luck 
> > Signed-off-by: Ricardo Neri 
> > ---
> > Changes since v1 (as part of patchset for perf change for Alderlake)
> >  * None
> > 
> > Changes since v1 (in a separate posting):
> >  * Reworded commit message to clearly state what is Intel Hybrid
> >Technology. Stress that all CPUs can run the same instruction
> >set and support the same features.
> > ---
> >  arch/x86/include/asm/cpufeatures.h | 1 +
> >  1 file changed, 1 insertion(+)
> > 
> > diff --git a/arch/x86/include/asm/cpufeatures.h 
> > b/arch/x86/include/asm/cpufeatures.h
> > index cc96e26d69f7..e7cfc9eedf8d 100644
> > --- a/arch/x86/include/asm/cpufeatures.h
> > +++ b/arch/x86/include/asm/cpufeatures.h
> > @@ -374,6 +374,7 @@
> >  #define X86_FEATURE_MD_CLEAR   (18*32+10) /* VERW clears CPU 
> > buffers */
> >  #define X86_FEATURE_TSX_FORCE_ABORT(18*32+13) /* "" 
> > TSX_FORCE_ABORT */
> >  #define X86_FEATURE_SERIALIZE  (18*32+14) /* SERIALIZE 
> > instruction */
> > +#define X86_FEATURE_HYBRID_CPU (18*32+15) /* This part has 
> > CPUs of more than one type */
> 
> /* "" This ...
> 
> unless you have a valid use case for "hybrid_cpu" being present there.

But this series provides the use case, right? Kan's patches handle PMU counters
that may differ across types of CPUs. In patch 2, get_hybrid_params()
needs to check first if X86_FEATURE_HYBRID_CPU is enabled before
querying the hybrid parameters. Otherwise, we would need to rely on the
maximum level of CPUID, which may not be reliable.

Thanks and BR,
Ricardo


Re: [PATCH 1/4] drivers core: Introduce CPU type sysfs interface

2020-10-07 Thread Ricardo Neri
On Wed, Oct 07, 2020 at 07:15:46AM +0200, Greg Kroah-Hartman wrote:
> On Tue, Oct 06, 2020 at 08:14:47PM -0700, Ricardo Neri wrote:
> > On Tue, Oct 06, 2020 at 09:37:44AM +0200, Greg Kroah-Hartman wrote:
> > > On Mon, Oct 05, 2020 at 05:57:36PM -0700, Ricardo Neri wrote:
> > > > On Sat, Oct 03, 2020 at 10:53:45AM +0200, Greg Kroah-Hartman wrote:
> > > > > On Fri, Oct 02, 2020 at 06:17:42PM -0700, Ricardo Neri wrote:
> > > > > > Hybrid CPU topologies combine CPUs of different microarchitectures 
> > > > > > in the
> > > > > > same die. Thus, even though the instruction set is compatible among 
> > > > > > all
> > > > > > CPUs, there may still be differences in features (e.g., some CPUs 
> > > > > > may
> > > > > > have counters that others CPU do not). There may be applications
> > > > > > interested in knowing the type of micro-architecture topology of the
> > > > > > system to make decisions about process affinity.
> > > > > > 
> > > > > > While the existing sysfs for capacity (/sys/devices/system/cpu/cpuX/
> > > > > > cpu_capacity) may be used to infer the types of micro-architecture 
> > > > > > of the
> > > > > > CPUs in the platform, it may not be entirely accurate. For 
> > > > > > instance, two
> > > > > > subsets of CPUs with different types of micro-architecture may have 
> > > > > > the
> > > > > > same capacity due to power or thermal constraints.
> > > > > > 
> > > > > > Create the new directory /sys/devices/system/cpu/types. Under such
> > > > > > directory, create individual subdirectories for each type of CPU 
> > > > > > micro-
> > > > > > architecture. Each subdirectory will have cpulist and cpumap files. 
> > > > > > This
> > > > > > makes it convenient for user space to read all the CPUs of the same 
> > > > > > type
> > > > > > at once without having to inspect each CPU individually.
> > > > > > 
> > > > > > Implement a generic interface using weak functions that 
> > > > > > architectures can
> > > > > > override to indicate a) support for CPU types, b) the CPU type 
> > > > > > number, and
> > > > > > c) a string to identify the CPU vendor and type.
> > > > > > 
> > > > > > For example, an x86 system with one Intel Core and four Intel Atom 
> > > > > > CPUs
> > > > > > would look like this (other architectures have the hooks to use 
> > > > > > whatever
> > > > > > directory naming convention below "types" that meets their needs):
> > > > > > 
> > > > > > user@host:~$: ls /sys/devices/system/cpu/types
> > > > > > intel_atom_0  intel_core_0
> > > > > > 
> > > > > > user@host:~$ ls /sys/devices/system/cpu/types/intel_atom_0
> > > > > > cpulist cpumap
> > > > > > 
> > > > > > user@host:~$ ls /sys/devices/system/cpu/types/intel_core_0
> > > > > > cpulist cpumap
> > > > > > 
> > > > > > user@host:~$ cat /sys/devices/system/cpu/types/intel_atom_0/cpumap
> > > > > > 0f
> > > > > > 
> > > > > > user@host:~$ cat /sys/devices/system/cpu/types/intel_atom_0/cpulist
> > > > > > 0-3
> > > > > > 
> > > > > > user@ihost:~$ cat /sys/devices/system/cpu/types/intel_core_0/cpumap
> > > > > > 10
> > > > > > 
> > > > > > user@host:~$ cat /sys/devices/system/cpu/types/intel_core_0/cpulist
> > > > > > 4
> > > > 
> > > > Thank you for the quick and detailed Greg!
> > > > 
> > > > > 
> > > > > The output of 'tree' sometimes makes it easier to see here, or:
> > > > >   grep -R . *
> > > > > also works well.
> > > > 
> > > > Indeed, this would definitely make it more readable.
> > > > 
> > > > > 
> > > > > > On non-hybrid systems, the /sys/devices/system/cpu/types directory 
> > > > > > is not
> > > > > > created. Add a hook for this purpose.
> > > > > 
> > > > > Why 

Re: [PATCH 1/4] drivers core: Introduce CPU type sysfs interface

2020-10-06 Thread Ricardo Neri
On Tue, Oct 06, 2020 at 09:37:44AM +0200, Greg Kroah-Hartman wrote:
> On Mon, Oct 05, 2020 at 05:57:36PM -0700, Ricardo Neri wrote:
> > On Sat, Oct 03, 2020 at 10:53:45AM +0200, Greg Kroah-Hartman wrote:
> > > On Fri, Oct 02, 2020 at 06:17:42PM -0700, Ricardo Neri wrote:
> > > > Hybrid CPU topologies combine CPUs of different microarchitectures in 
> > > > the
> > > > same die. Thus, even though the instruction set is compatible among all
> > > > CPUs, there may still be differences in features (e.g., some CPUs may
> > > > have counters that others CPU do not). There may be applications
> > > > interested in knowing the type of micro-architecture topology of the
> > > > system to make decisions about process affinity.
> > > > 
> > > > While the existing sysfs for capacity (/sys/devices/system/cpu/cpuX/
> > > > cpu_capacity) may be used to infer the types of micro-architecture of 
> > > > the
> > > > CPUs in the platform, it may not be entirely accurate. For instance, two
> > > > subsets of CPUs with different types of micro-architecture may have the
> > > > same capacity due to power or thermal constraints.
> > > > 
> > > > Create the new directory /sys/devices/system/cpu/types. Under such
> > > > directory, create individual subdirectories for each type of CPU micro-
> > > > architecture. Each subdirectory will have cpulist and cpumap files. This
> > > > makes it convenient for user space to read all the CPUs of the same type
> > > > at once without having to inspect each CPU individually.
> > > > 
> > > > Implement a generic interface using weak functions that architectures 
> > > > can
> > > > override to indicate a) support for CPU types, b) the CPU type number, 
> > > > and
> > > > c) a string to identify the CPU vendor and type.
> > > > 
> > > > For example, an x86 system with one Intel Core and four Intel Atom CPUs
> > > > would look like this (other architectures have the hooks to use whatever
> > > > directory naming convention below "types" that meets their needs):
> > > > 
> > > > user@host:~$: ls /sys/devices/system/cpu/types
> > > > intel_atom_0  intel_core_0
> > > > 
> > > > user@host:~$ ls /sys/devices/system/cpu/types/intel_atom_0
> > > > cpulist cpumap
> > > > 
> > > > user@host:~$ ls /sys/devices/system/cpu/types/intel_core_0
> > > > cpulist cpumap
> > > > 
> > > > user@host:~$ cat /sys/devices/system/cpu/types/intel_atom_0/cpumap
> > > > 0f
> > > > 
> > > > user@host:~$ cat /sys/devices/system/cpu/types/intel_atom_0/cpulist
> > > > 0-3
> > > > 
> > > > user@ihost:~$ cat /sys/devices/system/cpu/types/intel_core_0/cpumap
> > > > 10
> > > > 
> > > > user@host:~$ cat /sys/devices/system/cpu/types/intel_core_0/cpulist
> > > > 4
> > 
> > Thank you for the quick and detailed Greg!
> > 
> > > 
> > > The output of 'tree' sometimes makes it easier to see here, or:
> > >   grep -R . *
> > > also works well.
> > 
> > Indeed, this would definitely make it more readable.
> > 
> > > 
> > > > On non-hybrid systems, the /sys/devices/system/cpu/types directory is 
> > > > not
> > > > created. Add a hook for this purpose.
> > > 
> > > Why should these not show up if the system is not "hybrid"?
> > 
> > My thinking was that on a non-hybrid system, it does not make sense to
> > create this interface, as all the CPUs will be of the same type.
> 
> Why not just have this an attribute type in the existing cpuX directory?
> Why do this have to be a totally separate directory and userspace has to
> figure out to look in two different spots for the same cpu to determine
> what it is?

But if the type is located under cpuX, userspace would need to traverse
all the CPUs and create its own cpu masks. Under the types directory it
would only need to look once for each type of CPU, IMHO.
> 
> That feels wasteful, it should be much simpler to use the existing
> object, right?
> 
> That way, you also show the "type" of all cpus, no matter if they are
> "hybrid" or not, again, making userspace deal with things in a much
> simpler manner.

Indeed, that would be simpler to implement, and perhaps a natural extension
of the existing interface. 

Lastly, legacy and non-hybrid p

Re: [PATCH 0/4] drivers core: Introduce CPU type sysfs interface

2020-10-06 Thread Ricardo Neri
On Tue, Oct 06, 2020 at 09:51:53AM +0100, Qais Yousef wrote:
> Hi Ricardo

Hi Qais,

Thanks for chiming in.

> 
> Adding some people who might be interested.
> 
> On 10/02/20 18:17, Ricardo Neri wrote:
> > Hybrid CPU topologies combine processors with more than one type of
> > micro-architecture. Hence, it is possible that individual CPUs support
> > slightly different features (e.g., performance counters) and different
> > performance properties. Thus, there may be user space entities interested
> > in knowing the topology of the system based on the types of available
> > CPUs.
> > 
> > Currently, there exists an interface for the CPU capacity (/sys/devices/
> > system/cpu/cpuX/cpu_capacity). However, CPU capacity does not always map
> > to CPU types (by the way, I will submit a separate series to bring such
> > interface to x86).
> 
> Why do you need to do this mapping?
> 
> > 
> > This series proposes the new interface /sys/devices/system/cpu/types
> > which, in hybrid parts, creates a subdirectory for each type of CPU.
> > Each subdirectory contains a CPU list and a CPU map that user space can
> > query.
> 
> Why user space needs to query this info?
> 
> The rationale is missing the intention behind all of this. It seems you're
> expecting software to parse this info and take decisions based on that?

I propose this interface to be consumed by applications or libraries
wanting to know the topology of the system. Perhaps a utility program
that wants to depict CPUs by type. Another example is a policy manager
wanting to create a cgroup cpuset of CPUs of a given type for
performance reasons. Similarly, a program may want to affinitize itself
to a type of CPU, if for instance certain performance events are only
counted on a given CPU type.

Also, an interface with cpumasks makes it convenient for userspace to
not have to traverse each CPU individually.

   a) add the type of each CPU in /proc/cpuinfo
   b) add the type of each CPU in /sys/devices/system/cpu/cpu#/type
   c) for performance, derive the CPU type from
  /sys/devices/system/cpu/cpu#/cpu_capacity
   d) add an interface similar to what I propose in this series.

None of this implies that certain instructions will be able to run on one
type of CPU but not on another. Instead, better performance may be
obtained on one type of CPU vs. another.

Thanks and BR,
Ricardo
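
A hedged user-space sketch of the kind of consumer described above: it parses the proposed /sys/devices/system/cpu/types/<type>/cpulist file (the path and the intel_atom_0 name are taken from the example in the patch) and affinitizes the calling task to those CPUs. This illustrates the intended use; it is not code from the series:

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
	const char *path = "/sys/devices/system/cpu/types/intel_atom_0/cpulist";
	char buf[256], *tok, *save;
	cpu_set_t set;
	FILE *f;

	f = fopen(path, "r");
	if (!f)
		return 1;
	if (!fgets(buf, sizeof(buf), f)) {
		fclose(f);
		return 1;
	}
	fclose(f);

	CPU_ZERO(&set);
	/* cpulist format: comma-separated CPU numbers and ranges, e.g. "0-3,5" */
	for (tok = strtok_r(buf, ",\n", &save); tok;
	     tok = strtok_r(NULL, ",\n", &save)) {
		int lo, hi, n;

		n = sscanf(tok, "%d-%d", &lo, &hi);
		if (n < 1)
			continue;
		if (n == 1)
			hi = lo;
		for (; lo <= hi; lo++)
			CPU_SET(lo, &set);
	}

	/* Run only on CPUs of this type, e.g. for type-specific perf events. */
	return sched_setaffinity(0, sizeof(set), &set);
}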


Re: [PATCH 1/4] drivers core: Introduce CPU type sysfs interface

2020-10-05 Thread Ricardo Neri
On Fri, Oct 02, 2020 at 08:27:41PM -0700, Randy Dunlap wrote:
> On 10/2/20 6:17 PM, Ricardo Neri wrote:
> > +/**
> > + * arch_get_cpu_type() - Get the CPU type number
> > + * @cpu:   Index of the CPU of which the type is needed
> > + *
> > + * Get the CPU type number of @cpu, a non-zero unsigned 32-bit number that

Thank you for your feedback Randy!

> Are you sure that @cpu is non-zero?

This is the intent. Maybe it can be reworked to return -EINVAL instead?
I guess it is plausible, but less likely, that an arch defines a type as
0xffffffea.

> 
> > + * uniquely identifies a type of CPU micro-architecture. All CPUs of the 
> > same
> > + * type have the same type number. Type numbers are defined by each CPU
> > + * architecture.
> > + */
> > +u32 __weak arch_get_cpu_type(int cpu)
> > +{
> > +   return 0;
> > +}
> 
> arch_get_cpu_type() in patch 4/4 allows @cpu to be 0.

It should only return 0 if the system does not have
X86_FEATURE_HYBRID_CPU. The currently existing CPU types are all
non-zero as per the Intel SDM. Am I missing anything?

Thanks and BR,
Ricardo


Re: [PATCH 1/4] drivers core: Introduce CPU type sysfs interface

2020-10-05 Thread Ricardo Neri
On Sat, Oct 03, 2020 at 01:05:48PM +0200, Greg Kroah-Hartman wrote:
> On Sat, Oct 03, 2020 at 10:53:45AM +0200, Greg Kroah-Hartman wrote:
> > On Fri, Oct 02, 2020 at 06:17:42PM -0700, Ricardo Neri wrote:
> > > +/**
> > > + * arch_get_cpu_type_name() - Get the CPU type name
> > > + * @cpu_type:Type of CPU micro-architecture.
> > > + *
> > > + * Returns a string name associated with the CPU micro-architecture type 
> > > as
> > > + * indicated in @cpu_type. The format shall be _. 
> > > Returns
> > > + * NULL if the CPU type is not known.
> > > + */
> > > +const char __weak *arch_get_cpu_type_name(u32 cpu_type)
> > > +{
> > > + return NULL;
> > > +}
> > 
> > Why is vendor part of this?  Shouldn't it just be arch?
> > 
> > I say this as "vendor" is kind of "interesting" when it comes to other
> > arches...
> > 
> > Speaking of other arches, we all know that other arches have this
> > feature as well, have you worked with any other groups to verify that
> > this interface will also work with them?
> 
> Here's one set of patches for ARM64 for much the same type of cpu
> design:
>   https://android-review.googlesource.com/c/kernel/common/+/1437098/3
> Yes, it's not been posted to any kernel lists, but this is public so you
> need to work with the ARM developers to come up with an interface that
> works for everyone please.

Thanks for the pointer, Greg! I will study this proposal and work with
the ARM engineers.

BR,
Ricardo


Re: [PATCH 4/4] x86/cpu/topology: Implement the CPU type sysfs interface

2020-10-05 Thread Ricardo Neri
On Sat, Oct 03, 2020 at 10:55:06AM +0200, Greg Kroah-Hartman wrote:
> On Fri, Oct 02, 2020 at 06:17:45PM -0700, Ricardo Neri wrote:
> > Recent Intel processors combine CPUs with different types of micro-
> > architecture in the same package. There may be applications interested in
> > knowing the type topology of the system. For instance, it can be used to
> > determine which subsets of CPUs share a common feature.
> > 
> > Implement cpu_type sysfs interfaces for Intel processors.
> > 
> > For example, in a system with four Intel Atom CPUs and one Intel Core CPU,
> > these entries look as below. In this example, the native model IDs for
> > both types of CPUs are 0:
> > 
> > user@host:~$: ls /sys/devices/system/cpu/types
> > intel_atom_0 intel_core_0
> > 
> > user@host:~$ ls /sys/devices/system/cpu/types/intel_atom_0
> > cpulist cpumap
> > 
> > user@host:~$ ls /sys/devices/system/cpu/types/intel_core_0
> > cpulist cpumap
> > 
> > user@host:~$ cat /sys/devices/system/cpu/types/intel_atom_0/cpumap
> > 0f
> > 
> > user@host:~$ cat /sys/devices/system/cpu/types/intel_atom_0/cpulist
> > 0-3
> > 
> > user@host:~$ cat /sys/devices/system/cpu/types/intel_core_0/cpumap
> > 10
> > 
> > user@host:~$ cat /sys/devices/system/cpu/types/intel_core_0/cpulist
> > 4
> 
> You used the same changelog text here as you did in patch 1/4, why?

In both changesets, if merged, somebody could conveniently do git show
on either commit and quickly see the intent of the changeset.

Would it make it better if in patch 1/4 I put a hypothetical generic
example?

Something like:

user@host:~$: ls /sys/devices/system/cpu/types
_ 

Thanks and BR,
Ricardo


Re: [PATCH 1/4] drivers core: Introduce CPU type sysfs interface

2020-10-05 Thread Ricardo Neri
On Sat, Oct 03, 2020 at 10:53:45AM +0200, Greg Kroah-Hartman wrote:
> On Fri, Oct 02, 2020 at 06:17:42PM -0700, Ricardo Neri wrote:
> > Hybrid CPU topologies combine CPUs of different microarchitectures in the
> > same die. Thus, even though the instruction set is compatible among all
> > CPUs, there may still be differences in features (e.g., some CPUs may
> > have counters that others CPU do not). There may be applications
> > interested in knowing the type of micro-architecture topology of the
> > system to make decisions about process affinity.
> > 
> > While the existing sysfs for capacity (/sys/devices/system/cpu/cpuX/
> > cpu_capacity) may be used to infer the types of micro-architecture of the
> > CPUs in the platform, it may not be entirely accurate. For instance, two
> > subsets of CPUs with different types of micro-architecture may have the
> > same capacity due to power or thermal constraints.
> > 
> > Create the new directory /sys/devices/system/cpu/types. Under such
> > directory, create individual subdirectories for each type of CPU micro-
> > architecture. Each subdirectory will have cpulist and cpumap files. This
> > makes it convenient for user space to read all the CPUs of the same type
> > at once without having to inspect each CPU individually.
> > 
> > Implement a generic interface using weak functions that architectures can
> > override to indicate a) support for CPU types, b) the CPU type number, and
> > c) a string to identify the CPU vendor and type.
> > 
> > For example, an x86 system with one Intel Core and four Intel Atom CPUs
> > would look like this (other architectures have the hooks to use whatever
> > directory naming convention below "types" that meets their needs):
> > 
> > user@host:~$: ls /sys/devices/system/cpu/types
> > intel_atom_0  intel_core_0
> > 
> > user@host:~$ ls /sys/devices/system/cpu/types/intel_atom_0
> > cpulist cpumap
> > 
> > user@host:~$ ls /sys/devices/system/cpu/types/intel_core_0
> > cpulist cpumap
> > 
> > user@host:~$ cat /sys/devices/system/cpu/types/intel_atom_0/cpumap
> > 0f
> > 
> > user@host:~$ cat /sys/devices/system/cpu/types/intel_atom_0/cpulist
> > 0-3
> > 
> > user@host:~$ cat /sys/devices/system/cpu/types/intel_core_0/cpumap
> > 10
> > 
> > user@host:~$ cat /sys/devices/system/cpu/types/intel_core_0/cpulist
> > 4

Thank you for the quick and detailed feedback, Greg!

> 
> The output of 'tree' sometimes makes it easier to see here, or:
>   grep -R . *
> also works well.

Indeed, this would definitely make it more readable.

> 
> > On non-hybrid systems, the /sys/devices/system/cpu/types directory is not
> > created. Add a hook for this purpose.
> 
> Why should these not show up if the system is not "hybrid"?

My thinking was that on a non-hybrid system, it does not make sense to
create this interface, as all the CPUs will be of the same type.

> 
> > 
> > Cc: Andi Kleen 
> > Cc: Dave Hansen 
> > Cc: "Gautham R. Shenoy" 
> > Cc: Kan Liang 
> > Cc: Len Brown 
> > Cc: "Rafael J. Wysocki" 
> > Cc: "Ravi V. Shankar" 
> > Cc: Srinivas Pandruvada 
> > Cc: linux-kernel@vger.kernel.org
> > Reviewed-by: Tony Luck 
> > Suggested-by: Len Brown  # Necessity of the interface
> > Suggested-by: Dave Hansen  # Details of the interface
> > Signed-off-by: Ricardo Neri 
> > ---
> >  .../ABI/testing/sysfs-devices-system-cpu  |  13 ++
> >  drivers/base/cpu.c| 214 ++
> >  include/linux/cpu.h   |  12 +
> >  include/linux/cpuhotplug.h|   1 +
> >  4 files changed, 240 insertions(+)
> > 
> > diff --git a/Documentation/ABI/testing/sysfs-devices-system-cpu 
> > b/Documentation/ABI/testing/sysfs-devices-system-cpu
> > index b555df825447..bcb3d7195102 100644
> > --- a/Documentation/ABI/testing/sysfs-devices-system-cpu
> > +++ b/Documentation/ABI/testing/sysfs-devices-system-cpu
> > @@ -614,3 +614,16 @@ Description:   SPURR ticks for cpuX when it was idle.
> >  
> > This sysfs interface exposes the number of SPURR ticks
> > for cpuX when it was idle.
> > +
> > +What:  /sys/devices/system/cpu/types
> > +   /sys/devices/system/cpu/types/_/cpumap
> > +   /sys/devices/system/cpu/types/_/cpulist
> 
> Please split these up into different entries, because:

Sure, I can split these into two entries.

> 
> > +Date:  Oct 2020
> > +

Re: [PATCH 0/4] drivers core: Introduce CPU type sysfs interface

2020-10-05 Thread Ricardo Neri
On Sat, Oct 03, 2020 at 10:49:34AM +0200, Borislav Petkov wrote:
> On Fri, Oct 02, 2020 at 06:17:41PM -0700, Ricardo Neri wrote:
> > Patch 1 of the series proposes the generic interface, with hooks
> > that architectures can override to suit their needs. The remaining three
> > patches implement this interface for x86 (as per request from Boris,
> > I pulled patch 2 from a separate submission [1]).
> 
> So I ask you to show me the whole thing, how this is supposed to be used
> in a *real* use case and you're sending me a couple of patches which
> report these heterogeneous or whatever they're gonna be called CPUs.
> 
> Are you telling me that all this development effort was done so that
> you can report heterogeneity in sysfs? Or you just had to come up with
> *something*?
> 
> Let me try again: please show me the *big* *picture* with all the code
> how this is supposed to be used. In the patches I read a bunch of "may"
> formulations of what is possible and what userspace could do and so on.
> 
> Not that - show me the *full* and *real* use cases which you are
> enabling and which justify all that churn. Instead of leaving it all to
> userspace CPUID and the kernel not caring one bit.
> 
> Does that make more sense?

Yes Boris, thanks for the clarification. The proposed sysfs interface is
one instance in which we use cpuinfo_x86.x86_cpu_type. I have other
changes that use this new member. I will post them.

> 
> > [1]. https://lkml.org/lkml/2020/10/2/1013
> 
> For supplying links, we use lore.kernel.org/r/ solely.
> Please use that from now on.

Sure Boris, I will use lore.kernel.org in the future.

Thanks and BR,
Ricardo


Re: [PATCH 0/3] x86: Add initial support to discover Intel hybrid CPUs

2020-10-05 Thread Ricardo Neri
On Sat, Oct 03, 2020 at 11:04:29AM +0200, Borislav Petkov wrote:
> On Fri, Oct 02, 2020 at 07:17:30PM -0700, Luck, Tony wrote:
> > On Sat, Oct 03, 2020 at 03:39:29AM +0200, Thomas Gleixner wrote:
> > > On Fri, Oct 02 2020 at 13:19, Ricardo Neri wrote:
> > > > Add support to discover and enumerate CPUs in Intel hybrid parts. A 
> > > > hybrid
> > > > part has CPUs with more than one type of micro-architecture. Thus, 
> > > > certain
> > > > features may only be present in a specific CPU type.
> > > >
> > > > It is useful to know the type of CPUs present in a system. For instance,
> > > > perf may need to handle CPUs differently depending on the type of micro-
> > > > architecture. Decoding machine check error logs may need the additional
> > > > micro-architecture type information, so include that in the log.
> > > 
> > > 'It is useful' as justification just makes me barf.
> > 
> > This isn't "hetero" ... all of the cores are architecturally the same.
> 
> But it says above "A hybrid part has CPUs with more than one type of
> micro-architecture."
> 
> So which is it?

Yes, even though they have different micro-architectures, all instructions and
features will be the same across CPUs.

> 
> > If CPUID says that some feature is supported, then it will be supported
> > on all of the cores.
> 
> Ok.
> 
> > There might be some model specific performance counter events that only
> > apply to some cores.
> 
> That sounds like the perf counter scheduling code would have to pay
> attention to what is supported. I think we have some functionality for
> that due to some AMD parts but I'd prefer if Peter comments here.
> 
> > Or a machine check error code that is logged in the model specific
> > MSCOD field of IA32_MCi_STATUS. But any and all code can run on any
> > core.
> 
> As long as that is consumed only by userspace I guess that's ok. The
> moment someone starts to want to differentiate on what kind of CPU
> kernel code runs and acts accordingly, then it becomes ugly so we better
> hash it out upfront.

We are not planning to implement such changes.

Thanks and BR,
Ricardo


Re: [PATCH 0/3] x86: Add initial support to discover Intel hybrid CPUs

2020-10-05 Thread Ricardo Neri
On Sat, Oct 03, 2020 at 12:46:29PM +0200, Thomas Gleixner wrote:
> On Fri, Oct 02 2020 at 19:17, Tony Luck wrote:
> 
> > On Sat, Oct 03, 2020 at 03:39:29AM +0200, Thomas Gleixner wrote:
> >> On Fri, Oct 02 2020 at 13:19, Ricardo Neri wrote:
> >> > Add support to discover and enumerate CPUs in Intel hybrid parts. A 
> >> > hybrid
> >> > part has CPUs with more than one type of micro-architecture. Thus, 
> >> > certain
> >> > features may only be present in a specific CPU type.
> >> >
> >> > It is useful to know the type of CPUs present in a system. For instance,
> >> > perf may need to handle CPUs differently depending on the type of micro-
> >> > architecture. Decoding machine check error logs may need the additional
> >> > micro-architecture type information, so include that in the log.
> >> 
> >> 'It is useful' as justification just makes me barf.
> >
> > This isn't "hetero" ... all of the cores are architecturally the same.
> 
> The above clearly says:
> 
> >> > A hybrid part has CPUs with more than one type of micro-architecture.
> 
> Can you folks talk to each other and choose non-ambiguous wording in
> changelogs and cover letters?
> 
> > If CPUID says that some feature is supported, then it will be supported
> > on all of the cores.
> 
> That's a different story.

Thank you for the quick reply, Thomas.

I am sorry if my cover letter was not sufficiently clear. I see now that
I should have done a more detailed discussion of the terminology.

Yes, all features and instructions as enumerated in CPUID will be
supported in all CPUs. Thus, the kernel will not have to check if a
feature is supported before running. The hetero/hybrid term is meant to
reflect that more than one type of CPU will be present in the package,
and that the main difference is power and performance. The same kernel and
user code will run on all CPUs without any further checks.

> 
> > There might be some model specific performance counter events that only
> > apply to some cores. Or a machine check error code that is logged in the
> > model specific MSCOD field of IA32_MCi_STATUS. But any and all code can run
> > on any core.
> 
> Ok. The perf side should be doable, IIRC we already have something like
> that, but Peter should know better.
> 
> > Sure there will be some different power/performance tradeoffs on some
> > cores. But we already have that with some cores able to achieve higher
> > turbo frequencies than others.
> 
> Right, that's not a problem.

We are also working on this front.

Thanks and BR,
Ricardo


[PATCH 4/4] x86/cpu/topology: Implement the CPU type sysfs interface

2020-10-02 Thread Ricardo Neri
Recent Intel processors combine CPUs with different types of micro-
architecture in the same package. There may be applications interested in
knowing the type topology of the system. For instance, it can be used to
determine which subsets of CPUs share a common feature.

Implement cpu_type sysfs interfaces for Intel processors.

For example, in a system with four Intel Atom CPUs and one Intel Core CPU,
these entries look as below. In this example, the native model IDs for
both types of CPUs are 0:

user@host:~$: ls /sys/devices/system/cpu/types
intel_atom_0 intel_core_0

user@host:~$ ls /sys/devices/system/cpu/types/intel_atom_0
cpulist cpumap

user@host:~$ ls /sys/devices/system/cpu/types/intel_core_0
cpulist cpumap

user@host:~$ cat /sys/devices/system/cpu/types/intel_atom_0/cpumap
0f

user@host:~$ cat /sys/devices/system/cpu/types/intel_atom_0/cpulist
0-3

user@host:~$ cat /sys/devices/system/cpu/types/intel_core_0/cpumap
10

user@host:~$ cat /sys/devices/system/cpu/types/intel_core_0/cpulist
4

Cc: Andi Kleen 
Cc: Dave Hansen 
Cc: Kan Liang 
Cc: "Rafael J. Wysocki" 
Cc: "Ravi V. Shankar" 
Cc: Srinivas Pandruvada 
Cc: linux-kernel@vger.kernel.org
Reviewed-by: Tony Luck 
Suggested-by: Len Brown  # Necessity of the interface
Suggested-by: Dave Hansen  # Details of the interface
Signed-off-by: Ricardo Neri 
---
 arch/x86/include/asm/topology.h |  2 ++
 arch/x86/kernel/cpu/topology.c  | 23 +++
 2 files changed, 25 insertions(+)

diff --git a/arch/x86/include/asm/topology.h b/arch/x86/include/asm/topology.h
index f4234575f3fd..d4a3e1ce8338 100644
--- a/arch/x86/include/asm/topology.h
+++ b/arch/x86/include/asm/topology.h
@@ -218,4 +218,6 @@ static inline void arch_set_max_freq_ratio(bool 
turbo_disabled)
 }
 #endif
 
+#define CPUTYPES_MAX_NR 2
+
 #endif /* _ASM_X86_TOPOLOGY_H */
diff --git a/arch/x86/kernel/cpu/topology.c b/arch/x86/kernel/cpu/topology.c
index d3a0791bc052..709fc473f905 100644
--- a/arch/x86/kernel/cpu/topology.c
+++ b/arch/x86/kernel/cpu/topology.c
@@ -153,3 +153,26 @@ int detect_extended_topology(struct cpuinfo_x86 *c)
 #endif
return 0;
 }
+
+u32 arch_get_cpu_type(int cpu)
+{
+   struct cpuinfo_x86 *c = &cpu_data(cpu);
+
+   if (cpu < 0 || cpu >= nr_cpu_ids)
+   return 0;
+
+   return c->x86_cpu_type;
+}
+
+bool arch_has_cpu_type(void)
+{
+   return boot_cpu_has(X86_FEATURE_HYBRID_CPU);
+}
+
+const char *arch_get_cpu_type_name(u32 cpu_type)
+{
+   if (boot_cpu_data.x86_vendor == X86_VENDOR_INTEL)
+   return intel_get_hybrid_cpu_type_name(cpu_type);
+
+   return NULL;
+}
-- 
2.17.1



[PATCH 1/4] drivers core: Introduce CPU type sysfs interface

2020-10-02 Thread Ricardo Neri
Hybrid CPU topologies combine CPUs of different microarchitectures in the
same die. Thus, even though the instruction set is compatible among all
CPUs, there may still be differences in features (e.g., some CPUs may
have counters that others CPU do not). There may be applications
interested in knowing the type of micro-architecture topology of the
system to make decisions about process affinity.

While the existing sysfs for capacity (/sys/devices/system/cpu/cpuX/
cpu_capacity) may be used to infer the types of micro-architecture of the
CPUs in the platform, it may not be entirely accurate. For instance, two
subsets of CPUs with different types of micro-architecture may have the
same capacity due to power or thermal constraints.

Create the new directory /sys/devices/system/cpu/types. Under such
directory, create individual subdirectories for each type of CPU micro-
architecture. Each subdirectory will have cpulist and cpumap files. This
makes it convenient for user space to read all the CPUs of the same type
at once without having to inspect each CPU individually.

Implement a generic interface using weak functions that architectures can
override to indicate a) support for CPU types, b) the CPU type number, and
c) a string to identify the CPU vendor and type.
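
For reference, a minimal sketch of the three hooks and their generic (weak)
defaults as they are used in the rest of this series; the return value of the
bool default is an assumption, and the actual hunks are in the diff below:

/* Generic defaults; an architecture provides strong definitions to opt in. */
u32 __weak arch_get_cpu_type(int cpu)
{
	return 0;		/* 0: no type information for this CPU */
}

bool __weak arch_has_cpu_type(void)
{
	return false;		/* no /sys/devices/system/cpu/types directory */
}

const char __weak *arch_get_cpu_type_name(u32 cpu_type)
{
	return NULL;		/* unknown type: no subdirectory is created */
}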

For example, an x86 system with one Intel Core and four Intel Atom CPUs
would look like this (other architectures have the hooks to use whatever
directory naming convention below "types" that meets their needs):

user@host:~$: ls /sys/devices/system/cpu/types
intel_atom_0  intel_core_0

user@host:~$ ls /sys/devices/system/cpu/types/intel_atom_0
cpulist cpumap

user@host:~$ ls /sys/devices/system/cpu/types/intel_core_0
cpulist cpumap

user@host:~$ cat /sys/devices/system/cpu/types/intel_atom_0/cpumap
0f

user@host:~$ cat /sys/devices/system/cpu/types/intel_atom_0/cpulist
0-3

user@host:~$ cat /sys/devices/system/cpu/types/intel_core_0/cpumap
10

user@host:~$ cat /sys/devices/system/cpu/types/intel_core_0/cpulist
4

On non-hybrid systems, the /sys/devices/system/cpu/types directory is not
created. Add a hook for this purpose.

Cc: Andi Kleen 
Cc: Dave Hansen 
Cc: "Gautham R. Shenoy" 
Cc: Kan Liang 
Cc: Len Brown 
Cc: "Rafael J. Wysocki" 
Cc: "Ravi V. Shankar" 
Cc: Srinivas Pandruvada 
Cc: linux-kernel@vger.kernel.org
Reviewed-by: Tony Luck 
Suggested-by: Len Brown  # Necessity of the interface
Suggested-by: Dave Hansen  # Details of the interface
Signed-off-by: Ricardo Neri 
---
 .../ABI/testing/sysfs-devices-system-cpu  |  13 ++
 drivers/base/cpu.c| 214 ++
 include/linux/cpu.h   |  12 +
 include/linux/cpuhotplug.h|   1 +
 4 files changed, 240 insertions(+)

diff --git a/Documentation/ABI/testing/sysfs-devices-system-cpu 
b/Documentation/ABI/testing/sysfs-devices-system-cpu
index b555df825447..bcb3d7195102 100644
--- a/Documentation/ABI/testing/sysfs-devices-system-cpu
+++ b/Documentation/ABI/testing/sysfs-devices-system-cpu
@@ -614,3 +614,16 @@ Description:   SPURR ticks for cpuX when it was idle.
 
This sysfs interface exposes the number of SPURR ticks
for cpuX when it was idle.
+
+What:  /sys/devices/system/cpu/types
+   /sys/devices/system/cpu/types/_/cpumap
+   /sys/devices/system/cpu/types/_/cpulist
+Date:  Oct 2020
+Contact:   Linux kernel mailing list 
+Description:   Types of CPUs in hybrid systems
+
+   If a platform has CPUs with more than one type of micro-
+   architecture, there is one directory for each type of CPU.
+   Inside each directory, the files cpumap and cpulist contain
+   a mask and a list representing CPUs of the same type of micro-
+   architecture. These files only contain CPUs currently online.
diff --git a/drivers/base/cpu.c b/drivers/base/cpu.c
index d2136ab9b14a..7e98b11b15a1 100644
--- a/drivers/base/cpu.c
+++ b/drivers/base/cpu.c
@@ -607,11 +607,225 @@ static void __init cpu_register_vulnerabilities(void)
 static inline void cpu_register_vulnerabilities(void) { }
 #endif
 
+static ssize_t cpulist_show(struct device *device,
+   struct device_attribute *attr,
+   char *buf)
+{
+   struct cpumask *mask = dev_get_drvdata(device);
+
+   if (!mask)
+   return -EINVAL;
+
+   return cpumap_print_to_pagebuf(true, buf, mask);
+}
+
+static ssize_t cpumap_show(struct device *device,
+  struct device_attribute *attr,
+  char *buf)
+{
+   struct cpumask *mask = dev_get_drvdata(device);
+
+   if (!mask)
+   return -EINVAL;
+
+   return cpumap_print_to_pagebuf(false, buf, mask);
+}
+
+static DEVICE_ATTR_RO(cpumap);
+static DEVICE_ATTR_RO(cpulist);
+
+static struct attribute *cpu_type_attrs[] = {
+   _attr

[PATCH 2/4] x86/cpu: Describe hybrid CPUs in cpuinfo_x86

2020-10-02 Thread Ricardo Neri
When Linux runs on Intel hybrid parts (i.e., having more than one type of
CPU in the same package), subsystems that deal with specific CPU features
may need to know the type of CPU in which they run. Instead of having each
subsystem to inspect CPUID leaves on its own, add a new member to
cpuinfo_x86 that can be queried to know the type of CPU.

Also, hybrid parts have a native model ID to uniquely identify the
micro-architecture of each CPU. Please note that the native model ID is not
related to the existing x86_model_id read from CPUID leaf 0x1.

In order to uniquely identify a CPU by type and micro-architecture, combine
the aforementioned identifiers into a single new member, x86_cpu_type.

The Intel manual (SDM) defines the CPU type and the CPU native model ID as
8-bit and 24-bit identifiers, respectively; they are packed in %eax when
read from CPUID.

Also define masks that subsystems can use to obtain the CPU type or the
native model ID separately. The native model ID requires only a bit mask,
as it uses the 24 least significant bits of %eax. The CPU type identifier
requires only a shift value, as it uses the 8 most significant bits of
%eax.
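
For instance, a subsystem could split the new member as follows (a sketch
only; the helper name is made up for illustration):

/* Split cpuinfo_x86.x86_cpu_type into the 8-bit CPU type and the 24-bit
 * native model ID using the mask/shift introduced in this patch. */
static void split_hybrid_id(struct cpuinfo_x86 *c, u8 *type, u32 *native_model)
{
	*type = c->x86_cpu_type >> X86_HYBRID_CPU_TYPE_ID_SHIFT;
	*native_model = c->x86_cpu_type & X86_HYBRID_CPU_NATIVE_MODEL_ID_MASK;
}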

Cc: Andi Kleen 
Cc: Andy Lutomirski 
Cc: Dave Hansen 
Cc: Kan Liang 
Cc: Len Brown 
Cc: "Peter Zijlstra (Intel)" 
Cc: "Rafael J. Wysocki" 
Cc: "Ravi V. Shankar" 
Cc: Sean Christopherson 
Cc: Srinivas Pandruvada 
Cc: linux-kernel@vger.kernel.org
Reviewed-by: Tony Luck 
Signed-off-by: Ricardo Neri 
---
I pulled this patch from a separate series[1] as per request from Boris.

Changes wrt to such series:
 * Use cpuid_eax() instead of cpuid_count() to read %eax result from
   CPUID. (Boris)

[1]. https://lkml.org/lkml/2020/10/2/1013
---
 arch/x86/include/asm/processor.h | 13 +
 arch/x86/kernel/cpu/common.c |  3 +++
 2 files changed, 16 insertions(+)

diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index f88c74d7dbd4..d86cdf2b1562 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -141,6 +141,16 @@ struct cpuinfo_x86 {
u32 microcode;
/* Address space bits used by the cache internally */
u8  x86_cache_bits;
+   /*
+* In hybrid parts, there is a CPU type and a native model ID. The
+* CPU type (x86_cpu_type[31:24]) describes the type of micro-
+* architecture families. The native model ID (x86_cpu_type[23:0])
+* describes a specific microarchitecture version. Combining both
+* allows a CPU to be uniquely identified.
+*
+* Please note that the native model ID is not related to x86_model.
+*/
+   u32 x86_cpu_type;
unsignedinitialized : 1;
 } __randomize_layout;
 
@@ -168,6 +178,9 @@ enum cpuid_regs_idx {
 
 #define X86_VENDOR_UNKNOWN 0xff
 
+#define X86_HYBRID_CPU_TYPE_ID_SHIFT   24
+#define X86_HYBRID_CPU_NATIVE_MODEL_ID_MASK   0xffffff
+
 /*
  * capabilities of CPUs
  */
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index 35ad8480c464..a66c1fdc0e27 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -932,6 +932,9 @@ void get_cpu_cap(struct cpuinfo_x86 *c)
c->x86_capability[CPUID_D_1_EAX] = eax;
}
 
+   if (cpu_has(c, X86_FEATURE_HYBRID_CPU))
+   c->x86_cpu_type = cpuid_eax(0x001a);
+
/* AMD-defined flags: level 0x8001 */
eax = cpuid_eax(0x8000);
c->extended_cpuid_level = eax;
-- 
2.17.1



[PATCH 3/4] x86/cpu/intel: Add function to get name of hybrid CPU types

2020-10-02 Thread Ricardo Neri
Provide a human-friendly string name for each type of CPU micro-
architecture in Intel hybrid parts. This string is to be used in the
CPU type sysfs interface.

In order to uniquely identify CPUs of the same type, compose the name
string as _.

Cc: Andi Kleen 
Cc: Dave Hansen 
Cc: Kan Liang 
Cc: Len Brown 
Cc: "Rafael J. Wysocki" 
Cc: "Ravi V. Shankar" 
Cc: Srinivas Pandruvada 
Cc: linux-kernel@vger.kernel.org
Reviewed-by: Tony Luck 
Signed-off-by: Ricardo Neri 
---
 arch/x86/include/asm/intel-family.h |  4 
 arch/x86/kernel/cpu/cpu.h   |  3 +++
 arch/x86/kernel/cpu/intel.c | 23 +++
 3 files changed, 30 insertions(+)

diff --git a/arch/x86/include/asm/intel-family.h 
b/arch/x86/include/asm/intel-family.h
index 5e658ba2654a..4ec2272e0049 100644
--- a/arch/x86/include/asm/intel-family.h
+++ b/arch/x86/include/asm/intel-family.h
@@ -133,4 +133,8 @@
 /* Family 5 */
 #define INTEL_FAM5_QUARK_X1000 0x09 /* Quark X1000 SoC */
 
+/* Types of CPUs in hybrid parts. */
+#define INTEL_HYBRID_TYPE_ATOM 0x20
+#define INTEL_HYBRID_TYPE_CORE 0x40
+
 #endif /* _ASM_X86_INTEL_FAMILY_H */
diff --git a/arch/x86/kernel/cpu/cpu.h b/arch/x86/kernel/cpu/cpu.h
index 9d033693519a..b4474238e1f3 100644
--- a/arch/x86/kernel/cpu/cpu.h
+++ b/arch/x86/kernel/cpu/cpu.h
@@ -56,8 +56,11 @@ extern __ro_after_init enum tsx_ctrl_states tsx_ctrl_state;
 extern void __init tsx_init(void);
 extern void tsx_enable(void);
 extern void tsx_disable(void);
+extern const char *intel_get_hybrid_cpu_type_name(u32 cpu_type);
 #else
 static inline void tsx_init(void) { }
+static inline const char *intel_get_hybrid_cpu_type_name(u32 cpu_type)
+{ return NULL; }
 #endif /* CONFIG_CPU_SUP_INTEL */
 
 extern void get_cpu_cap(struct cpuinfo_x86 *c);
diff --git a/arch/x86/kernel/cpu/intel.c b/arch/x86/kernel/cpu/intel.c
index 59a1e3ce3f14..e1dee382cf98 100644
--- a/arch/x86/kernel/cpu/intel.c
+++ b/arch/x86/kernel/cpu/intel.c
@@ -1191,3 +1191,26 @@ void __init cpu_set_core_cap_bits(struct cpuinfo_x86 *c)
cpu_model_supports_sld = true;
split_lock_setup();
 }
+
+static char hybrid_name[64];
+
+const char *intel_get_hybrid_cpu_type_name(u32 cpu_type)
+{
+   u32 native_model_id = cpu_type & X86_HYBRID_CPU_NATIVE_MODEL_ID_MASK;
+   u8 type = cpu_type >> X86_HYBRID_CPU_TYPE_ID_SHIFT;
+
+   switch (type) {
+   case INTEL_HYBRID_TYPE_ATOM:
+   snprintf(hybrid_name, sizeof(hybrid_name), "intel_atom_%u",
+native_model_id);
+   break;
+   case INTEL_HYBRID_TYPE_CORE:
+   snprintf(hybrid_name, sizeof(hybrid_name), "intel_core_%u",
+native_model_id);
+   break;
+   default:
+   return NULL;
+   }
+
+   return hybrid_name;
+}
-- 
2.17.1



[PATCH 0/4] drivers core: Introduce CPU type sysfs interface

2020-10-02 Thread Ricardo Neri
Hybrid CPU topologies combine processors with more than one type of
micro-architecture. Hence, it is possible that individual CPUs support
slightly different features (e.g., performance counters) and different
performance properties. Thus, there may be user space entities interested
in knowing the topology of the system based on the types of available
CPUs.

Currently, there exists an interface for the CPU capacity (/sys/devices/
system/cpu/cpuX/cpu_capacity). However, CPU capacity does not always map
to CPU types (by the way, I will submit a separate series to bring such
interface to x86).

This series proposes the new interface /sys/devices/system/cpu/types
which, in hybrid parts, creates a subdirectory for each type of CPU.
Each subdirectory contains a CPU list and a CPU map that user space can
query.

Patch 1 of the series proposes the generic interface, with hooks
that architectures can override to suit their needs. The remaining three
patches implement this interface for x86 (as per request from Boris,
I pulled patch 2 from a separate submission [1]).

Thanks and BR,
Ricardo

[1]. https://lkml.org/lkml/2020/10/2/1013

Ricardo Neri (4):
  drivers core: Introduce CPU type sysfs interface
  x86/cpu: Describe hybrid CPUs in cpuinfo_x86
  x86/cpu/intel: Add function to get name of hybrid CPU types
  x86/cpu/topology: Implement the CPU type sysfs interface

 .../ABI/testing/sysfs-devices-system-cpu  |  13 ++
 arch/x86/include/asm/intel-family.h   |   4 +
 arch/x86/include/asm/processor.h  |  13 ++
 arch/x86/include/asm/topology.h   |   2 +
 arch/x86/kernel/cpu/common.c  |   3 +
 arch/x86/kernel/cpu/cpu.h |   3 +
 arch/x86/kernel/cpu/intel.c   |  23 ++
 arch/x86/kernel/cpu/topology.c|  23 ++
 drivers/base/cpu.c| 214 ++
 include/linux/cpu.h   |  12 +
 include/linux/cpuhotplug.h|   1 +
 11 files changed, 311 insertions(+)

-- 
2.17.1



Re: [PATCH 2/3] x86/cpu: Describe hybrid CPUs in cpuinfo_x86

2020-10-02 Thread Ricardo Neri
On Fri, Oct 02, 2020 at 11:03:06PM +0200, Borislav Petkov wrote:
> On Fri, Oct 02, 2020 at 02:02:31PM -0700, Ricardo Neri wrote:
> > What about patches 1 and 3? Should I resubmit the series with only
> > those?
> 
> Why would you need to resubmit? They're good to go as is, AFAICT.

Thanks for clarifying Boris. Just wanted to check if there was any
action required from me regarding patches 1 & 3.

Thanks and BR,
Ricardo


Re: [PATCH 2/3] x86/cpu: Describe hybrid CPUs in cpuinfo_x86

2020-10-02 Thread Ricardo Neri
On Fri, Oct 02, 2020 at 10:34:52PM +0200, Borislav Petkov wrote:
> On Fri, Oct 02, 2020 at 01:19:30PM -0700, Ricardo Neri wrote:
> > diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
> > index 35ad8480c464..0778b3ad26b3 100644
> > --- a/arch/x86/kernel/cpu/common.c
> > +++ b/arch/x86/kernel/cpu/common.c
> > @@ -932,6 +932,11 @@ void get_cpu_cap(struct cpuinfo_x86 *c)
> > c->x86_capability[CPUID_D_1_EAX] = eax;
> > }
> >  
> > +   if (cpu_has(c, X86_FEATURE_HYBRID_CPU)) {
> > +   cpuid_count(0x001a, 0, &eax, &ebx, &ecx, &edx);
> > +   c->x86_cpu_type = eax;
> 
>   c->x86_cpu_type = cpuid_eax(0x001a);
> 
> should do too.

Thank you for the quick feedback Boris! Sure I can implement it as you
suggest.

> 
> But send this patch together with the code that uses it.

Sure I can submit it along with the code using it.

What about patches 1 and 3? Should I resubmit the series with only
those?

Thanks and BR,
Ricardo


[PATCH 2/3] x86/cpu: Describe hybrid CPUs in cpuinfo_x86

2020-10-02 Thread Ricardo Neri
When Linux runs on Intel hybrid parts (i.e., having more than one type of
CPU in the same package), subsystems that deal with specific CPU features
may need to know the type of CPU in which they run. Instead of having each
subsystem inspect CPUID leaves on its own, add a new member to
cpuinfo_x86 that can be queried to know the type of CPU.

Also, hybrid parts have a native model ID to uniquely identify the
micro-architecture of each CPU. Please note that the native model ID is not
related to the existing x86_model_id read from CPUID leaf 0x1.

In order to uniquely identify a CPU by type and micro-architecture, combine
the aforementioned identifiers into a single new member, x86_cpu_type.

The Intel manual (SDM) defines the CPU type and the CPU native model ID as
8-bit and 24-bit identifiers, respectively; they are packed in %eax when
read from CPUID.

Also define masks that subsystems can use to obtain the CPU type or the
native model ID separately. The native model ID requires only a bit mask,
as it uses the 24 least significant bits of %eax. The CPU type identifier
requires only a shift value, as it uses the 8 most significant bits of
%eax.

Cc: Andi Kleen 
Cc: Andy Lutomirski 
Cc: Dave Hansen 
Cc: Kan Liang 
Cc: Len Brown 
Cc: "Peter Zijlstra (Intel)" 
Cc: "Rafael J. Wysocki" 
Cc: "Ravi V. Shankar" 
Cc: Sean Christopherson 
Cc: Srinivas Pandruvada 
Cc: linux-kernel@vger.kernel.org
Reviewed-by: Tony Luck 
Signed-off-by: Ricardo Neri 
---
 arch/x86/include/asm/processor.h | 13 +
 arch/x86/kernel/cpu/common.c |  5 +
 2 files changed, 18 insertions(+)

diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index f88c74d7dbd4..d86cdf2b1562 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -141,6 +141,16 @@ struct cpuinfo_x86 {
u32 microcode;
/* Address space bits used by the cache internally */
u8  x86_cache_bits;
+   /*
+* In hybrid parts, there is a CPU type and a native model ID. The
+* CPU type (x86_cpu_type[31:24]) describes the type of micro-
+* architecture families. The native model ID (x86_cpu_type[23:0])
+* describes a specific microarchitecture version. Combining both
+* allows a CPU to be uniquely identified.
+*
+* Please note that the native model ID is not related to x86_model.
+*/
+   u32 x86_cpu_type;
unsignedinitialized : 1;
 } __randomize_layout;
 
@@ -168,6 +178,9 @@ enum cpuid_regs_idx {
 
 #define X86_VENDOR_UNKNOWN 0xff
 
+#define X86_HYBRID_CPU_TYPE_ID_SHIFT   24
+#define X86_HYBRID_CPU_NATIVE_MODEL_ID_MASK   0xffffff
+
 /*
  * capabilities of CPUs
  */
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index 35ad8480c464..0778b3ad26b3 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -932,6 +932,11 @@ void get_cpu_cap(struct cpuinfo_x86 *c)
c->x86_capability[CPUID_D_1_EAX] = eax;
}
 
+   if (cpu_has(c, X86_FEATURE_HYBRID_CPU)) {
+   cpuid_count(0x001a, 0, &eax, &ebx, &ecx, &edx);
+   c->x86_cpu_type = eax;
+   }
+
/* AMD-defined flags: level 0x8001 */
eax = cpuid_eax(0x8000);
c->extended_cpuid_level = eax;
-- 
2.17.1



[PATCH 3/3] x86/mce: include type of core when reporting a machine check error

2020-10-02 Thread Ricardo Neri
In hybrid parts, each type of core reports different types of machine check
errors as the machine check error blocks are tied to different parts of the
hardware. Furthermore, errors may be different across micro-architecture
versions. Thus, in order to decode errors, userspace tools need to know the
type of core as well as the native model ID of the CPU which reported the
error.

This extra information is included in the error report only when running
on hybrid parts. This preserves the existing behavior when running on
non-hybrid parts. Hence, legacy userspace tools running on new kernels and
hybrid hardware can still understand the format of the reported error.
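
A user-space decoder would then unpack the new field the same way it is
packed from CPUID leaf 0x1a (CPU type in bits 31:24, native model ID in bits
23:0). A minimal sketch, for illustration only:

#include <stdio.h>

/* Decode the hybrid_info field of a struct mce record. */
static void print_hybrid_info(unsigned int hybrid_info)
{
	unsigned int cpu_type     = hybrid_info >> 24;
	unsigned int native_model = hybrid_info & 0xffffff;

	printf("HYBRID_TYPE %x HYBRID_NATIVE_MODEL_ID %x\n",
	       cpu_type, native_model);
}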

Cc: "Ravi V Shankar" 
Cc: linux-e...@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Reviewed-by: Tony Luck 
Signed-off-by: Ricardo Neri 
---
 arch/x86/include/uapi/asm/mce.h | 1 +
 arch/x86/kernel/cpu/mce/core.c  | 7 +++
 2 files changed, 8 insertions(+)

diff --git a/arch/x86/include/uapi/asm/mce.h b/arch/x86/include/uapi/asm/mce.h
index db9adc081c5a..e730572186d6 100644
--- a/arch/x86/include/uapi/asm/mce.h
+++ b/arch/x86/include/uapi/asm/mce.h
@@ -36,6 +36,7 @@ struct mce {
__u64 ppin; /* Protected Processor Inventory Number */
__u32 microcode;/* Microcode revision */
__u64 kflags;   /* Internal kernel use */
+   __u32 hybrid_info;  /* Type and native model ID in hybrid parts */
 };
 
 #define MCE_GET_RECORD_LEN   _IOR('M', 1, int)
diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
index a6ff407dec71..ecac8d9b6070 100644
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -143,6 +143,9 @@ noinstr void mce_setup(struct mce *m)
m->apicid = cpu_data(m->extcpu).initial_apicid;
m->mcgcap = __rdmsr(MSR_IA32_MCG_CAP);
 
+   if (this_cpu_has(X86_FEATURE_HYBRID_CPU))
+   m->hybrid_info = cpuid_eax(0x1a);
+
if (this_cpu_has(X86_FEATURE_INTEL_PPIN))
m->ppin = __rdmsr(MSR_PPIN);
else if (this_cpu_has(X86_FEATURE_AMD_PPIN))
@@ -264,6 +267,10 @@ static void __print_mce(struct mce *m)
pr_emerg(HW_ERR "PROCESSOR %u:%x TIME %llu SOCKET %u APIC %x microcode 
%x\n",
m->cpuvendor, m->cpuid, m->time, m->socketid, m->apicid,
m->microcode);
+
+   if (this_cpu_has(X86_FEATURE_HYBRID_CPU))
+   pr_emerg(HW_ERR "HYBRID_TYPE %x HYBRID_NATIVE_MODEL_ID %x\n",
+m->hybrid_info >> 24, m->hybrid_info & 0xffffff);
 }
 
 static void print_mce(struct mce *m)
-- 
2.17.1



[PATCH 1/3] x86/cpufeatures: Enumerate hybrid CPU feature bit

2020-10-02 Thread Ricardo Neri
Add feature enumeration to identify a hybrid part: one in which CPUs with
more than one type of micro-architecture exist in the same package.

Cc: Andi Kleen 
Cc: Kan Liang 
Cc: Len Brown 
Cc: "Peter Zijlstra (Intel)" 
Cc: "Rafael J. Wysocki" 
Cc: "Ravi V. Shankar" 
Cc: Srinivas Pandruvada 
Cc: linux-kernel@vger.kernel.org
Reviewed-by: Tony Luck 
Signed-off-by: Ricardo Neri 
---
 arch/x86/include/asm/cpufeatures.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/x86/include/asm/cpufeatures.h 
b/arch/x86/include/asm/cpufeatures.h
index dad350d42ecf..26ecc0f2a6fd 100644
--- a/arch/x86/include/asm/cpufeatures.h
+++ b/arch/x86/include/asm/cpufeatures.h
@@ -371,6 +371,7 @@
 #define X86_FEATURE_MD_CLEAR   (18*32+10) /* VERW clears CPU buffers */
 #define X86_FEATURE_TSX_FORCE_ABORT(18*32+13) /* "" TSX_FORCE_ABORT */
 #define X86_FEATURE_SERIALIZE  (18*32+14) /* SERIALIZE instruction */
+#define X86_FEATURE_HYBRID_CPU (18*32+15) /* This part has CPUs of 
more than one type */
 #define X86_FEATURE_TSXLDTRK   (18*32+16) /* TSX Suspend Load Address 
Tracking */
 #define X86_FEATURE_PCONFIG(18*32+18) /* Intel PCONFIG */
 #define X86_FEATURE_ARCH_LBR   (18*32+19) /* Intel ARCH LBR */
-- 
2.17.1



[PATCH 0/3] x86: Add initial support to discover Intel hybrid CPUs

2020-10-02 Thread Ricardo Neri
Add support to discover and enumerate CPUs in Intel hybrid parts. A hybrid
part has CPUs with more than one type of micro-architecture. Thus, certain
features may only be present in a specific CPU type.

It is useful to know the type of CPUs present in a system. For instance,
perf may need to handle CPUs differently depending on the type of micro-
architecture. Decoding machine check error logs may need the additional
micro-architecture type information, so include that in the log.

A hybrid part can be identified by reading a new CPUID feature bit.
Likewise, CPUID contains information about the CPU type as well as a new
native model ID. Details can be found in the Intel manual (SDM, [1]).
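
For reference, the same identification can be done from user space with the
CPUID instruction. A minimal sketch using GCC's cpuid.h (a complete tool
would also check the hybrid feature bit before trusting leaf 0x1a):

#include <cpuid.h>
#include <stdio.h>

/* Read CPUID leaf 0x1a, sub-leaf 0, and split EAX into the CPU type
 * (bits 31:24) and the native model ID (bits 23:0). */
int main(void)
{
	unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;

	if (!__get_cpuid_count(0x1a, 0, &eax, &ebx, &ecx, &edx))
		return 1;

	printf("cpu type: 0x%x, native model id: 0x%x\n",
	       eax >> 24, eax & 0xffffff);
	return 0;
}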

This series adds support for Intel hybrid parts in two areas: a) adding
the hybrid feature bit as well as the CPU type to struct cpuinfo_x86; and
b) decoding machine check errors on hybrid parts.

A later submission will use the proposed functionality to expose the CPU
topology to user space.

Thanks and BR,
Ricardo

[1].
https://software.intel.com/content/dam/develop/public/us/en/documents/325462-sdm-vol-1-2abcd-3abcd.pdf
Vol. 2, Section 3.2, CPUID leaf 0x1a.

Ricardo Neri (3):
  x86/cpufeatures: Enumerate hybrid CPU feature bit
  x86/cpu: Describe hybrid CPUs in cpuinfo_x86
  x86/mce: include type of core when reporting a machine check error

 arch/x86/include/asm/cpufeatures.h |  1 +
 arch/x86/include/asm/processor.h   | 13 +
 arch/x86/include/uapi/asm/mce.h|  1 +
 arch/x86/kernel/cpu/common.c   |  5 +
 arch/x86/kernel/cpu/mce/core.c |  7 +++
 5 files changed, 27 insertions(+)

-- 
2.17.1



[PATCH] powercap/rapl: Add support for Lakefield

2020-08-21 Thread Ricardo Neri
Simply add Lakefield model ID. No additional changes are needed.

Cc: Zhang Rui 
Cc: "Rafael J. Wysocki" 
Cc: "Ravi V. Shankar" 
Signed-off-by: Ricardo Neri 
---
 drivers/powercap/intel_rapl_common.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/powercap/intel_rapl_common.c 
b/drivers/powercap/intel_rapl_common.c
index 6f55aaef8afc..d85a3d95ef20 100644
--- a/drivers/powercap/intel_rapl_common.c
+++ b/drivers/powercap/intel_rapl_common.c
@@ -1036,6 +1036,7 @@ static const struct x86_cpu_id rapl_ids[] __initconst = {
X86_MATCH_INTEL_FAM6_MODEL(COMETLAKE,   _defaults_core),
X86_MATCH_INTEL_FAM6_MODEL(TIGERLAKE_L, _defaults_core),
X86_MATCH_INTEL_FAM6_MODEL(SAPPHIRERAPIDS_X,
_defaults_spr_server),
+   X86_MATCH_INTEL_FAM6_MODEL(LAKEFIELD,   _defaults_core),
 
X86_MATCH_INTEL_FAM6_MODEL(ATOM_SILVERMONT, _defaults_byt),
X86_MATCH_INTEL_FAM6_MODEL(ATOM_AIRMONT,_defaults_cht),
-- 
2.17.1



Re: [PATCH] x86/cpu: Fix typos and improve the comments in sync_core()

2020-08-18 Thread Ricardo Neri
On Tue, Aug 18, 2020 at 07:31:30AM +0200, Ingo Molnar wrote:
> 
> * tip-bot2 for Ricardo Neri  wrote:
> 
> > --- a/arch/x86/include/asm/sync_core.h
> > +++ b/arch/x86/include/asm/sync_core.h
> > @@ -5,6 +5,7 @@
> >  #include 
> >  #include 
> >  #include 
> > +#include <asm/special_insns.h>
> >  
> >  #ifdef CONFIG_X86_32
> >  static inline void iret_to_self(void)
> > @@ -54,14 +55,23 @@ static inline void iret_to_self(void)
> >  static inline void sync_core(void)
> >  {
> > /*
> > +* The SERIALIZE instruction is the most straightforward way to
> > +* do this but it not universally available.
> > +*/
> > +   if (static_cpu_has(X86_FEATURE_SERIALIZE)) {
> > +   serialize();
> > +   return;
> > +   }
> > +
> > +   /*
> > +* For all other processors, there are quite a few ways to do this.
> > +* IRET-to-self is nice because it works on every CPU, at any CPL
> > +* (so it's compatible with paravirtualization), and it never exits
> > +* to a hypervisor. The only down sides are that it's a bit slow
> > +* (it seems to be a bit more than 2x slower than the fastest
> > +* options) and that it unmasks NMIs.  The "push %cs" is needed
> > +* because, in paravirtual environments, __KERNEL_CS may not be a
> > +* valid CS value when we do IRET directly.
> 
> So there's two typos in the new comments, there are at least two 
> misapplied commas, it departs from existing style, and there's a typo 
> in the existing comments as well.
> 
> Also, before this patch the 'compiler barrier' comment was valid for 
> the whole function (there was no branching), but after this patch it 
> reads of it was only valid for the legacy IRET-to-self branch.
> 
> Which together broke my detector and triggered a bit of compulsive 
> bike-shed painting. ;-) See the resulting patch below.
> 
> Thanks,
> 
>   Ingo
> 
> >
> From: Ingo Molnar 
> Date: Tue, 18 Aug 2020 07:24:05 +0200
> Subject: [PATCH] x86/cpu: Fix typos and improve the comments in sync_core()
> 
> - Fix typos.
> 
> - Move the compiler barrier comment to the top, because it's valid for the
>   whole function, not just the legacy branch.
> 
> Signed-off-by: Ingo Molnar 
> ---
>  arch/x86/include/asm/sync_core.h | 16 
>  1 file changed, 8 insertions(+), 8 deletions(-)
> 
> diff --git a/arch/x86/include/asm/sync_core.h 
> b/arch/x86/include/asm/sync_core.h
> index 4631c0f969d4..0fd4a9dfb29c 100644
> --- a/arch/x86/include/asm/sync_core.h
> +++ b/arch/x86/include/asm/sync_core.h
> @@ -47,16 +47,19 @@ static inline void iret_to_self(void)
>   *
>   *  b) Text was modified on a different CPU, may subsequently be
>   * executed on this CPU, and you want to make sure the new version
> - * gets executed.  This generally means you're calling this in a IPI.
> + * gets executed.  This generally means you're calling this in an IPI.
>   *
>   * If you're calling this for a different reason, you're probably doing
>   * it wrong.
> + *
> + * Like all of Linux's memory ordering operations, this is a
> + * compiler barrier as well.
>   */
>  static inline void sync_core(void)
>  {
>   /*
>* The SERIALIZE instruction is the most straightforward way to
> -  * do this but it not universally available.
> +  * do this, but it is not universally available.

Indeed, I missed this grammar error.

>*/
>   if (static_cpu_has(X86_FEATURE_SERIALIZE)) {
>   serialize();
> @@ -67,10 +70,10 @@ static inline void sync_core(void)
>* For all other processors, there are quite a few ways to do this.
>* IRET-to-self is nice because it works on every CPU, at any CPL
>* (so it's compatible with paravirtualization), and it never exits
> -  * to a hypervisor. The only down sides are that it's a bit slow
> +  * to a hypervisor.  The only downsides are that it's a bit slow
>* (it seems to be a bit more than 2x slower than the fastest
> -  * options) and that it unmasks NMIs.  The "push %cs" is needed
> -  * because, in paravirtual environments, __KERNEL_CS may not be a
> +  * options) and that it unmasks NMIs.  The "push %cs" is needed,
> +  * because in paravirtual environments __KERNEL_CS may not be a

I didn't realize that the double spaces after the period were part of the
style.

FWIW,

Reviewed-by: Ricardo Neri 


[tip: x86/cpu] x86/cpu: Use SERIALIZE in sync_core() when available

2020-08-17 Thread tip-bot2 for Ricardo Neri
The following commit has been merged into the x86/cpu branch of tip:

Commit-ID: bf9c912f9a649776c2d741310486a6984edaac72
Gitweb:
https://git.kernel.org/tip/bf9c912f9a649776c2d741310486a6984edaac72
Author:Ricardo Neri 
AuthorDate:Thu, 06 Aug 2020 20:28:33 -07:00
Committer: Borislav Petkov 
CommitterDate: Mon, 17 Aug 2020 17:23:04 +02:00

x86/cpu: Use SERIALIZE in sync_core() when available

The SERIALIZE instruction gives software a way to force the processor to
complete all modifications to flags, registers and memory from previous
instructions and drain all buffered writes to memory before the next
instruction is fetched and executed. Thus, it serves the purpose of
sync_core(). Use it when available.

Suggested-by: Andy Lutomirski 
Signed-off-by: Ricardo Neri 
Signed-off-by: Borislav Petkov 
Reviewed-by: Tony Luck 
Link: 
https://lkml.kernel.org/r/20200807032833.17484-1-ricardo.neri-calde...@linux.intel.com
---
 arch/x86/include/asm/special_insns.h |  6 ++
 arch/x86/include/asm/sync_core.h | 26 ++
 2 files changed, 24 insertions(+), 8 deletions(-)

diff --git a/arch/x86/include/asm/special_insns.h 
b/arch/x86/include/asm/special_insns.h
index 59a3e13..5999b0b 100644
--- a/arch/x86/include/asm/special_insns.h
+++ b/arch/x86/include/asm/special_insns.h
@@ -234,6 +234,12 @@ static inline void clwb(volatile void *__p)
 
 #define nop() asm volatile ("nop")
 
+static inline void serialize(void)
+{
+   /* Instruction opcode for SERIALIZE; supported in binutils >= 2.35. */
+   asm volatile(".byte 0xf, 0x1, 0xe8" ::: "memory");
+}
+
 #endif /* __KERNEL__ */
 
 #endif /* _ASM_X86_SPECIAL_INSNS_H */
diff --git a/arch/x86/include/asm/sync_core.h b/arch/x86/include/asm/sync_core.h
index fdb5b35..4631c0f 100644
--- a/arch/x86/include/asm/sync_core.h
+++ b/arch/x86/include/asm/sync_core.h
@@ -5,6 +5,7 @@
 #include 
 #include 
 #include 
+#include <asm/special_insns.h>
 
 #ifdef CONFIG_X86_32
 static inline void iret_to_self(void)
@@ -54,14 +55,23 @@ static inline void iret_to_self(void)
 static inline void sync_core(void)
 {
/*
-* There are quite a few ways to do this.  IRET-to-self is nice
-* because it works on every CPU, at any CPL (so it's compatible
-* with paravirtualization), and it never exits to a hypervisor.
-* The only down sides are that it's a bit slow (it seems to be
-* a bit more than 2x slower than the fastest options) and that
-* it unmasks NMIs.  The "push %cs" is needed because, in
-* paravirtual environments, __KERNEL_CS may not be a valid CS
-* value when we do IRET directly.
+* The SERIALIZE instruction is the most straightforward way to
+* do this but it not universally available.
+*/
+   if (static_cpu_has(X86_FEATURE_SERIALIZE)) {
+   serialize();
+   return;
+   }
+
+   /*
+* For all other processors, there are quite a few ways to do this.
+* IRET-to-self is nice because it works on every CPU, at any CPL
+* (so it's compatible with paravirtualization), and it never exits
+* to a hypervisor. The only down sides are that it's a bit slow
+* (it seems to be a bit more than 2x slower than the fastest
+* options) and that it unmasks NMIs.  The "push %cs" is needed
+* because, in paravirtual environments, __KERNEL_CS may not be a
+* valid CS value when we do IRET directly.
 *
 * In case NMI unmasking or performance ever becomes a problem,
 * the next best option appears to be MOV-to-CR2 and an


[PATCH v4] x86/cpu: Use SERIALIZE in sync_core() when available

2020-08-06 Thread Ricardo Neri
The SERIALIZE instruction gives software a way to force the processor to
complete all modifications to flags, registers and memory from previous
instructions and drain all buffered writes to memory before the next
instruction is fetched and executed. Thus, it serves the purpose of
sync_core(). Use it when available.

Cc: Andy Lutomirski 
Cc: Cathy Zhang 
Cc: Dave Hansen 
Cc: Fenghua Yu 
Cc: "H. Peter Anvin" 
Cc: Kyung Min Park 
Cc: Peter Zijlstra 
Cc: "Ravi V. Shankar" 
Cc: Sean Christopherson 
Cc: linux-e...@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Reviewed-by: Tony Luck 
Suggested-by: Andy Lutomirski 
Signed-off-by: Ricardo Neri 
---
This is v4, following my three previous submissions [1], [2], and [3]. The first
three patches of the series have been merged in Linus' tree. Hence, I am
submitting only this patch for review.

[1]. https://lkml.org/lkml/2020/7/27/8
[2]. https://lkml.org/lkml/2020/8/4/1090
[3]. https://lkml.org/lkml/2020/8/6/808

Changes since v3:
 * Reworked comments in sync_core() for better readability. (Dave Hansen)
 * Reworked the implementation to align with the style in special_insns.h.
   No functional changes were introduced. (Tony Luck)

Changes since v2:
 * Support serialize with static_cpu_has() instead of using alternative
   runtime patching directly. (Borislav Petkov)

Changes since v1:
 * Support SERIALIZE using alternative runtime patching.
   (Peter Zijlstra, H. Peter Anvin)
 * Added a note to specify which version of binutils supports SERIALIZE.
   (Peter Zijlstra)
 * Verified that (::: "memory") is used. (H. Peter Anvin)
---
 arch/x86/include/asm/special_insns.h |  6 ++
 arch/x86/include/asm/sync_core.h | 26 ++
 2 files changed, 24 insertions(+), 8 deletions(-)

diff --git a/arch/x86/include/asm/special_insns.h 
b/arch/x86/include/asm/special_insns.h
index 59a3e13204c3..5999b0b3dd4a 100644
--- a/arch/x86/include/asm/special_insns.h
+++ b/arch/x86/include/asm/special_insns.h
@@ -234,6 +234,12 @@ static inline void clwb(volatile void *__p)
 
 #define nop() asm volatile ("nop")
 
+static inline void serialize(void)
+{
+   /* Instruction opcode for SERIALIZE; supported in binutils >= 2.35. */
+   asm volatile(".byte 0xf, 0x1, 0xe8" ::: "memory");
+}
+
 #endif /* __KERNEL__ */
 
 #endif /* _ASM_X86_SPECIAL_INSNS_H */
diff --git a/arch/x86/include/asm/sync_core.h b/arch/x86/include/asm/sync_core.h
index fdb5b356e59b..089712777fd9 100644
--- a/arch/x86/include/asm/sync_core.h
+++ b/arch/x86/include/asm/sync_core.h
@@ -5,6 +5,7 @@
 #include 
 #include 
 #include 
+#include <asm/special_insns.h>
 
 #ifdef CONFIG_X86_32
 static inline void iret_to_self(void)
@@ -54,14 +55,23 @@ static inline void iret_to_self(void)
 static inline void sync_core(void)
 {
/*
-* There are quite a few ways to do this.  IRET-to-self is nice
-* because it works on every CPU, at any CPL (so it's compatible
-* with paravirtualization), and it never exits to a hypervisor.
-* The only down sides are that it's a bit slow (it seems to be
-* a bit more than 2x slower than the fastest options) and that
-* it unmasks NMIs.  The "push %cs" is needed because, in
-* paravirtual environments, __KERNEL_CS may not be a valid CS
-* value when we do IRET directly.
+* The SERIALIZE instruction is the most straightforward way to
+* do this but it not universally available.
+*/
+   if (static_cpu_has(X86_FEATURE_SERIALIZE)) {
+   serialize();
+   return;
+   }
+
+   /*
+* For all other processors, there are quite a few ways to do this.
+* IRET-to-self is nice because it works on every CPU, at any CPL
+* (so it's compatible with paravirtualization), and it never exits
+* to a hypervisor. The only down sides are that it's a bit slow
+* (it seems to be a bit more than 2x slower than the fastest
+* options) and that it unmasks NMIs.  The "push %cs" is needed
+* because, in paravirtual environments, __KERNEL_CS may not be a
+* valid CS value when we do IRET directly.
 *
 * In case NMI unmasking or performance ever becomes a problem,
 * the next best option appears to be MOV-to-CR2 and an
-- 
2.17.1



Re: [PATCH v3] x86/cpu: Use SERIALIZE in sync_core() when available

2020-08-06 Thread Ricardo Neri
On Thu, Aug 06, 2020 at 04:08:47PM -0700, Dave Hansen wrote:
> On 8/6/20 4:04 PM, Ricardo Neri wrote:
> >  * CPUID is the conventional way, but it's nasty: it doesn't
> >  * exist on some 486-like CPUs, and it usually exits to a
> >  * hypervisor.
> >  *
> >  * The SERIALIZE instruction is the most straightforward way to
> >  * do this as it does not clobber registers or exit to a
> >  * hypervisor. However, it is not universally available.
> >  *
> >  * Like all of Linux's memory ordering operations, this is a
> >  * compiler barrier as well.
> >  */
> > 
> > What do you think?
> 
> I like what I suggested.  :)
> 
> SERIALIZE is best where available.  Do it first, comment it by itself.
> 
> Then, go into the long discussion of the other alternatives.  They only
> make sense when SERIALIZE isn't there, and the logic for selection there
> is substantially more complicated.

Sure Dave, I think this layout makes sense. I will rework the comments.

Thanks and BR,
Ricardo


Re: [PATCH v3] x86/cpu: Use SERIALIZE in sync_core() when available

2020-08-06 Thread Ricardo Neri
On Thu, Aug 06, 2020 at 12:57:26PM -0700, Dave Hansen wrote:
> On 8/6/20 12:25 PM, Ricardo Neri wrote:
> >  static inline void sync_core(void)
> >  {
> > /*
> > -* There are quite a few ways to do this.  IRET-to-self is nice
> > +* Hardware can do this for us if SERIALIZE is available. Otherwise,
> > +* there are quite a few ways to do this.  IRET-to-self is nice
> 
> This seems to imply that things other than SERIALIZE aren't the hardware
> doing this.  All of these methods are equally architecturally
> serializing *and* provided by the hardware.

Indeed, I can see how the wording may imply that.

> 
> I also don't quite get the point of separating the comments from the
> code.  Shouldn't this be:
> 
>   /*
>* The SERIALIZE instruction is the most straightforward way to
>* do this but it not universally available.
>*/

I regard the comment as describing all the considered options for
architectural serialization. What about if I add SERIALIZE as another
option? I propose to discuss it towards the end of the comment:

/*
 * There are quite a few ways to do this.  IRET-to-self is nice
 * because it works on every CPU, at any CPL (so it's compatible
 * with paravirtualization), and it never exits to a hypervisor.
 * The only down sides are that it's a bit slow (it seems to be
 * a bit more than 2x slower than the fastest options) and that
 * it unmasks NMIs.  The "push %cs" is needed because, in
 * paravirtual environments, __KERNEL_CS may not be a valid CS
 * value when we do IRET directly.
 *
 * In case NMI unmasking or performance ever becomes a problem,
 * the next best option appears to be MOV-to-CR2 and an
 * unconditional jump.  That sequence also works on all CPUs,
 * but it will fault at CPL3 (i.e. Xen PV).
 *
 * CPUID is the conventional way, but it's nasty: it doesn't
 * exist on some 486-like CPUs, and it usually exits to a
 * hypervisor.
 *
 * The SERIALIZE instruction is the most straightforward way to
 * do this as it does not clobber registers or exit to a
 * hypervisor. However, it is not universally available.
 *
 * Like all of Linux's memory ordering operations, this is a
 * compiler barrier as well.
 */

What do you think?

>   if (static_cpu_has(X86_FEATURE_SERIALIZE)) {
>   asm volatile(__ASM_SERIALIZE ::: "memory");
>   return;
>   }
> 
>   /*
>* For all other processors, IRET-to-self is nice ...
>*/
>   iret_to_self();



[PATCH v3] x86/cpu: Use SERIALIZE in sync_core() when available

2020-08-06 Thread Ricardo Neri
The SERIALIZE instruction gives software a way to force the processor to
complete all modifications to flags, registers and memory from previous
instructions and drain all buffered writes to memory before the next
instruction is fetched and executed. Thus, it serves the purpose of
sync_core(). Use it when available.

Cc: Andy Lutomirski 
Cc: Cathy Zhang 
Cc: Dave Hansen 
Cc: Fenghua Yu 
Cc: "H. Peter Anvin" 
Cc: Kyung Min Park 
Cc: Peter Zijlstra 
Cc: "Ravi V. Shankar" 
Cc: Sean Christopherson 
Cc: linux-e...@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Reviewed-by: Tony Luck 
Suggested-by: Andy Lutomirski 
Signed-off-by: Ricardo Neri 
---
This is a v3 from my two previous submissions [1], [2]. The first three
patches of the series have been merged into Linus' tree. Hence, I am
submitting only this patch for review.

[1]. https://lkml.org/lkml/2020/7/27/8
[2]. https://lkml.org/lkml/2020/8/4/1090

Changes since v2:
 * Support serialize with static_cpu_has() instead of using alternative
   runtime patching directly. (Borislav Petkov)

Changes since v1:
 * Support SERIALIZE using alternative runtime patching.
   (Peter Zijlstra, H. Peter Anvin)
 * Added a note to specify which version of binutils supports SERIALIZE.
   (Peter Zijlstra)
 * Verified that (::: "memory") is used. (H. Peter Anvin)
---
 arch/x86/include/asm/special_insns.h | 2 ++
 arch/x86/include/asm/sync_core.h | 9 -
 2 files changed, 10 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/special_insns.h 
b/arch/x86/include/asm/special_insns.h
index 59a3e13204c3..25cd67801dda 100644
--- a/arch/x86/include/asm/special_insns.h
+++ b/arch/x86/include/asm/special_insns.h
@@ -10,6 +10,8 @@
 #include 
 #include 
 
+/* Instruction opcode for SERIALIZE; supported in binutils >= 2.35. */
+#define __ASM_SERIALIZE ".byte 0xf, 0x1, 0xe8"
 /*
  * Volatile isn't enough to prevent the compiler from reordering the
  * read/write functions for the control registers and messing everything up.
diff --git a/arch/x86/include/asm/sync_core.h b/arch/x86/include/asm/sync_core.h
index fdb5b356e59b..b74580413a0b 100644
--- a/arch/x86/include/asm/sync_core.h
+++ b/arch/x86/include/asm/sync_core.h
@@ -5,6 +5,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #ifdef CONFIG_X86_32
 static inline void iret_to_self(void)
@@ -54,7 +55,8 @@ static inline void iret_to_self(void)
 static inline void sync_core(void)
 {
/*
-* There are quite a few ways to do this.  IRET-to-self is nice
+* Hardware can do this for us if SERIALIZE is available. Otherwise,
+* there are quite a few ways to do this.  IRET-to-self is nice
 * because it works on every CPU, at any CPL (so it's compatible
 * with paravirtualization), and it never exits to a hypervisor.
 * The only down sides are that it's a bit slow (it seems to be
@@ -75,6 +77,11 @@ static inline void sync_core(void)
 * Like all of Linux's memory ordering operations, this is a
 * compiler barrier as well.
 */
+   if (static_cpu_has(X86_FEATURE_SERIALIZE)) {
+   asm volatile(__ASM_SERIALIZE ::: "memory");
+   return;
+   }
+
iret_to_self();
 }
 
-- 
2.17.1



Re: [PATCH v2] x86/cpu: Use SERIALIZE in sync_core() when available

2020-08-05 Thread Ricardo Neri
On Wed, Aug 05, 2020 at 06:48:40AM +0200, Borislav Petkov wrote:
> On Tue, Aug 04, 2020 at 07:10:59PM -0700, Ricardo Neri wrote:
> > The SERIALIZE instruction gives software a way to force the processor to
> > complete all modifications to flags, registers and memory from previous
> > instructions and drain all buffered writes to memory before the next
> > instruction is fetched and executed. Thus, it serves the purpose of
> > sync_core(). Use it when available.
> > 
> > Commit 7117f16bf460 ("objtool: Fix ORC vs alternatives") enforced stack
> > invariance in alternatives. The iret-to-self does not comply with such
> > invariance. Thus, it cannot be used inside alternative code. Instead, use
> > an alternative that jumps to SERIALIZE when available.
> > 
> > Cc: Andy Lutomirski 
> > Cc: Cathy Zhang 
> > Cc: Dave Hansen 
> > Cc: Fenghua Yu 
> > Cc: "H. Peter Anvin" 
> > Cc: Kyung Min Park 
> > Cc: Peter Zijlstra 
> > Cc: "Ravi V. Shankar" 
> > Cc: Sean Christopherson 
> > Cc: linux-e...@vger.kernel.org
> > Cc: linux-kernel@vger.kernel.org
> > Suggested-by: Andy Lutomirski 
> > Signed-off-by: Ricardo Neri 
> > ---
> > This is a v2 from my initial submission [1]. The first three patches of
> > the series have been merged in Linus' tree. Hence, I am submitting only
> > this patch for review.
> > 
> > [1]. https://lkml.org/lkml/2020/7/27/8
> > 
> > Changes since v1:
> >  * Support SERIALIZE using alternative runtime patching.
> >(Peter Zijlstra, H. Peter Anvin)
> >  * Added a note to specify which version of binutils supports SERIALIZE.
> >(Peter Zijlstra)
> >  * Verified that (::: "memory") is used. (H. Peter Anvin)
> > ---
> >  arch/x86/include/asm/special_insns.h |  2 ++
> >  arch/x86/include/asm/sync_core.h | 10 +-
> >  2 files changed, 11 insertions(+), 1 deletion(-)
> > 
> > diff --git a/arch/x86/include/asm/special_insns.h 
> > b/arch/x86/include/asm/special_insns.h
> > index 59a3e13204c3..25cd67801dda 100644
> > --- a/arch/x86/include/asm/special_insns.h
> > +++ b/arch/x86/include/asm/special_insns.h
> > @@ -10,6 +10,8 @@
> >  #include 
> >  #include 
> >  
> > +/* Instruction opcode for SERIALIZE; supported in binutils >= 2.35. */
> > +#define __ASM_SERIALIZE ".byte 0xf, 0x1, 0xe8"
> >  /*
> >   * Volatile isn't enough to prevent the compiler from reordering the
> >   * read/write functions for the control registers and messing everything 
> > up.
> > diff --git a/arch/x86/include/asm/sync_core.h 
> > b/arch/x86/include/asm/sync_core.h
> > index fdb5b356e59b..201ea3d9a6bd 100644
> > --- a/arch/x86/include/asm/sync_core.h
> > +++ b/arch/x86/include/asm/sync_core.h
> > @@ -5,15 +5,19 @@
> >  #include 
> >  #include 
> >  #include 
> > +#include 
> >  
> >  #ifdef CONFIG_X86_32
> >  static inline void iret_to_self(void)
> >  {
> > asm volatile (
> > +   ALTERNATIVE("", "jmp 2f", X86_FEATURE_SERIALIZE)
> > "pushfl\n\t"
> > "pushl %%cs\n\t"
> > "pushl $1f\n\t"
> > "iret\n\t"
> > +   "2:\n\t"
> > +   __ASM_SERIALIZE "\n"
> > "1:"
> > : ASM_CALL_CONSTRAINT : : "memory");
> >  }
> > @@ -23,6 +27,7 @@ static inline void iret_to_self(void)
> > unsigned int tmp;
> >  
> > asm volatile (
> > +   ALTERNATIVE("", "jmp 2f", X86_FEATURE_SERIALIZE)
> 
> Why is this and above stuck inside the asm statement?
> 
> Why can't you simply do:
> 
>   if (static_cpu_has(X86_FEATURE_SERIALIZE)) {
>   asm volatile(__ASM_SERIALIZE ::: "memory");
>   return;
>   }
> 
> on function entry instead of making it more unreadable for no particular
> reason?

In my first submission I had implemented it as you describe. The only
difference was that I used boot_cpu_has() instead of static_cpu_has()
as the latter has a comment stating that:
"Use static_cpu_has() only in fast paths (...) boot_cpu_has() is
 already fast enough for the majority of cases..."

sync_core_before_usermode() already handles what I think are the
critical paths.

Thanks and BR,
Ricardo


Re: [PATCH v2] x86/cpu: Use SERIALIZE in sync_core() when available

2020-08-05 Thread Ricardo Neri
On Wed, Aug 05, 2020 at 07:08:08AM +0200, Borislav Petkov wrote:
> On Tue, Aug 04, 2020 at 09:58:25PM -0700, h...@zytor.com wrote:
> > Because why use an alternative to jump over one instruction?
> >
> > I personally would prefer to have the IRET put out of line
> 
> Can't yet - SERIALIZE CPUs are a minority at the moment.
> 
> > and have the call/jmp replaced by SERIALIZE inline.
> 
> Well, we could do:
> 
>   alternative_io("... IRET bunch", __ASM_SERIALIZE, 
> X86_FEATURE_SERIALIZE, ...);
> 
> and avoid all kinds of jumping. Alternatives get padded so there
> would be a couple of NOPs following when SERIALIZE gets patched in
> but it shouldn't be a problem. I guess one needs to look at what gcc
> generates...

But the IRET-TO-SELF code has instructions which modify the stack. This
would violate stack invariance in alternatives as enforced in commit
7117f16bf460 ("objtool: Fix ORC vs alternatives"). As a result, objtool
gives warnings as follows:

arch/x86/kernel/alternative.o: warning: objtool: do_sync_core()+0xe:
alternative modifies stack

Perhaps in this specific case it does not matter as the changes in the
stack will be undone by IRET. However, using alternative_io would require
adding the macro STACK_FRAME_NON_STANDARD to functions using sync_core().
IMHO, it wouldn't look good.

So maybe the best approach is to implement as you suggested using
static_cpu_has()?

Thanks and BR,
Ricardo


Re: [PATCH v2] x86/cpu: Use SERIALIZE in sync_core() when available

2020-08-05 Thread Ricardo Neri
On Wed, Aug 05, 2020 at 11:28:31AM -0700, Andy Lutomirski wrote:
> On Wed, Aug 5, 2020 at 10:07 AM Ricardo Neri
>  wrote:
> >
> > On Wed, Aug 05, 2020 at 07:08:08AM +0200, Borislav Petkov wrote:
> > > On Tue, Aug 04, 2020 at 09:58:25PM -0700, h...@zytor.com wrote:
> > > > Because why use an alternative to jump over one instruction?
> > > >
> > > > I personally would prefer to have the IRET put out of line
> > >
> > > Can't yet - SERIALIZE CPUs are a minority at the moment.
> > >
> > > > and have the call/jmp replaced by SERIALIZE inline.
> > >
> > > Well, we could do:
> > >
> > >   alternative_io("... IRET bunch", __ASM_SERIALIZE, 
> > > X86_FEATURE_SERIALIZE, ...);
> > >
> > > and avoid all kinds of jumping. Alternatives get padded so there
> > > would be a couple of NOPs following when SERIALIZE gets patched in
> > > but it shouldn't be a problem. I guess one needs to look at what gcc
> > > generates...
> >
> > But the IRET-TO-SELF code has instructions which modify the stack. This
> > would violate stack invariance in alternatives as enforced in commit
> > 7117f16bf460 ("objtool: Fix ORC vs alternatives"). As a result, objtool
> > gives warnings as follows:
> >
> > arch/x86/kernel/alternative.o: warning: objtool: do_sync_core()+0xe:
> > alternative modifies stack
> >
> > Perhaps in this specific case it does not matter as the changes in the
> > stack will be undone by IRET. However, using alternative_io would require
> > adding the macro STACK_FRAME_NON_STANDARD to functions using sync_core().
> > IMHO, it wouldn't look good.
> >
> > So maybe the best approach is to implement as you suggested using
> > static_cpu_has()?
> 
> I agree.  Let's keep it simple.
> 
> Honestly, I think the right solution is to have iret_to_self() in
> actual asm and invoke it from C as needed. 

Do you mean anything different from what we have already [1]? If I
understand your comment correctly, we have exactly that: an
iret_to_self() asm implementation invoked from C.

Thanks and BR,
Ricardo

[1]. 
https://lore.kernel.org/lkml/20200727043132.15082-4-ricardo.neri-calde...@linux.intel.com/



[PATCH v2] x86/cpu: Use SERIALIZE in sync_core() when available

2020-08-04 Thread Ricardo Neri
The SERIALIZE instruction gives software a way to force the processor to
complete all modifications to flags, registers and memory from previous
instructions and drain all buffered writes to memory before the next
instruction is fetched and executed. Thus, it serves the purpose of
sync_core(). Use it when available.

Commit 7117f16bf460 ("objtool: Fix ORC vs alternatives") enforced stack
invariance in alternatives. The iret-to-self does not comply with such
invariance. Thus, it cannot be used inside alternative code. Instead, use
an alternative that jumps to SERIALIZE when available.

Cc: Andy Lutomirski 
Cc: Cathy Zhang 
Cc: Dave Hansen 
Cc: Fenghua Yu 
Cc: "H. Peter Anvin" 
Cc: Kyung Min Park 
Cc: Peter Zijlstra 
Cc: "Ravi V. Shankar" 
Cc: Sean Christopherson 
Cc: linux-e...@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Suggested-by: Andy Lutomirski 
Signed-off-by: Ricardo Neri 
---
This is a v2 from my initial submission [1]. The first three patches of
the series have been merged in Linus' tree. Hence, I am submitting only
this patch for review.

[1]. https://lkml.org/lkml/2020/7/27/8

Changes since v1:
 * Support SERIALIZE using alternative runtime patching.
   (Peter Zijlstra, H. Peter Anvin)
 * Added a note to specify which version of binutils supports SERIALIZE.
   (Peter Zijlstra)
 * Verified that (::: "memory") is used. (H. Peter Anvin)
---
 arch/x86/include/asm/special_insns.h |  2 ++
 arch/x86/include/asm/sync_core.h | 10 +-
 2 files changed, 11 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/special_insns.h 
b/arch/x86/include/asm/special_insns.h
index 59a3e13204c3..25cd67801dda 100644
--- a/arch/x86/include/asm/special_insns.h
+++ b/arch/x86/include/asm/special_insns.h
@@ -10,6 +10,8 @@
 #include 
 #include 
 
+/* Instruction opcode for SERIALIZE; supported in binutils >= 2.35. */
+#define __ASM_SERIALIZE ".byte 0xf, 0x1, 0xe8"
 /*
  * Volatile isn't enough to prevent the compiler from reordering the
  * read/write functions for the control registers and messing everything up.
diff --git a/arch/x86/include/asm/sync_core.h b/arch/x86/include/asm/sync_core.h
index fdb5b356e59b..201ea3d9a6bd 100644
--- a/arch/x86/include/asm/sync_core.h
+++ b/arch/x86/include/asm/sync_core.h
@@ -5,15 +5,19 @@
 #include 
 #include 
 #include 
+#include 
 
 #ifdef CONFIG_X86_32
 static inline void iret_to_self(void)
 {
asm volatile (
+   ALTERNATIVE("", "jmp 2f", X86_FEATURE_SERIALIZE)
"pushfl\n\t"
"pushl %%cs\n\t"
"pushl $1f\n\t"
"iret\n\t"
+   "2:\n\t"
+   __ASM_SERIALIZE "\n"
"1:"
: ASM_CALL_CONSTRAINT : : "memory");
 }
@@ -23,6 +27,7 @@ static inline void iret_to_self(void)
unsigned int tmp;
 
asm volatile (
+   ALTERNATIVE("", "jmp 2f", X86_FEATURE_SERIALIZE)
"mov %%ss, %0\n\t"
"pushq %q0\n\t"
"pushq %%rsp\n\t"
@@ -32,6 +37,8 @@ static inline void iret_to_self(void)
"pushq %q0\n\t"
"pushq $1f\n\t"
"iretq\n\t"
+   "2:\n\t"
+   __ASM_SERIALIZE "\n"
"1:"
: "=" (tmp), ASM_CALL_CONSTRAINT : : "cc", "memory");
 }
@@ -54,7 +61,8 @@ static inline void iret_to_self(void)
 static inline void sync_core(void)
 {
/*
-* There are quite a few ways to do this.  IRET-to-self is nice
+* Hardware can do this for us if SERIALIZE is available. Otherwise,
+* there are quite a few ways to do this.  IRET-to-self is nice
 * because it works on every CPU, at any CPL (so it's compatible
 * with paravirtualization), and it never exits to a hypervisor.
 * The only down sides are that it's a bit slow (it seems to be
-- 
2.17.1



Re: [PATCH 4/4] x86/cpu: Use SERIALIZE in sync_core() when available

2020-07-27 Thread Ricardo Neri
On Mon, Jul 27, 2020 at 03:30:20PM +0200, pet...@infradead.org wrote:
> On Mon, Jul 27, 2020 at 03:05:36PM +0200, pet...@infradead.org wrote:
> > Yeah, I'm not sure.. the 'funny' thing is that we typically call
> > sync_core() from an IPI anyway. And the synchronous broadcast IPI is by
> > far the most expensive part of that.
> > 
> > Something like this...
> > 
> > diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
> > index 20e07feb4064..528e049ee1d9 100644
> > --- a/arch/x86/kernel/alternative.c
> > +++ b/arch/x86/kernel/alternative.c
> > @@ -989,12 +989,13 @@ void *text_poke_kgdb(void *addr, const void *opcode, 
> > size_t len)
> >  
> >  static void do_sync_core(void *info)
> >  {
> > -   sync_core();
> > +   /* IRET implies sync_core() */
> >  }
> >  
> >  void text_poke_sync(void)
> >  {
> > on_each_cpu(do_sync_core, NULL, 1);
> > +   sync_core();
> >  }
> >  
> >  struct text_poke_loc {
> 
> So 'people' have wanted to optimize this for NOHZ_FULL and I suppose
> virt as well.
> 
> IFF VMENTER is serializing, I suppose we can simply do something like:
> 
> bool text_poke_cond(int cpu, void *info)
> {
>   /*
>* If we observe the vCPU is preempted, it will do VMENTER;
>* no point in sending an IPI to SERIALIZE.
>*/
>   return !vcpu_is_preempted(cpu);
> }
> 
> void text_poke_sync(void)
> {
>   smp_call_function_many_cond(cpu_possible_mask,
>   do_sync_core, NULL, 1, text_poke_cond);
>   sync_core();
> }
> 
> The 'same' for NOHZ_FULL, except we need to cmpxchg a value such that
> if the cmpxchg() succeeds we know the CPU is in userspace and will
> SERIALIZE on the next entry. Much like kvm_flush_tlb_others().
> 
> 
> Anyway, that's all hand-wavey.. I'll let someone that cares about those
> things write actual patches :-)

I think I got a little lost here. If I understand correctly, there are
two alternatives to implement support for serialize better:

  a) alternative(IRET_TO_SELF, SERIALIZE, X86_FEATURE_SERIALIZE); or
  b) asm volatile("1:.byte 0xf, 0x1, 0xe8;2:" _ASM_EXTABLE(1b:2b)

a) would be the traditional and simpler solution. b) would rely on
causing an #UD and getting an IRET on existing hardware; it would need
some more optimization work when handling the exception and a few
reworks on the poke patching code.
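
For reference, a rough sketch of what I understand b) would look like
(my assumption of the intended shape, using the regular _ASM_EXTABLE
fixup, which resumes execution at the label after the instruction):

static inline void sync_core(void)
{
	/*
	 * On CPUs with SERIALIZE, the instruction itself serializes.
	 * On older CPUs it raises #UD; the exception table entry makes
	 * the #UD handler skip the instruction, and the IRET executed
	 * when returning from the handler provides the serialization.
	 */
	asm volatile("1: .byte 0xf, 0x1, 0xe8\n\t"
		     "2:\n\t"
		     _ASM_EXTABLE(1b, 2b)
		     : : : "memory");
}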

Which option should I focus on? Which option would be more desirable/better?

Thanks and BR,
Ricardo


Re: [PATCH 4/4] x86/cpu: Use SERIALIZE in sync_core() when available

2020-07-27 Thread Ricardo Neri
On Mon, Jul 27, 2020 at 05:47:32AM -0700, h...@zytor.com wrote:
> On July 27, 2020 1:20:03 AM PDT, pet...@infradead.org wrote:
> >On Sun, Jul 26, 2020 at 09:31:32PM -0700, Ricardo Neri wrote:
> >> +static inline void serialize(void)
> >> +{
> >> +  asm volatile(".byte 0xf, 0x1, 0xe8");
> >> +}
> >
> >Can we pretty please have a comment with the binutils version that has
> >the mnemonic? Such that when we increase the required binutils version we
> >can grep for this.
> 
> It also needs : : : "memory" to be a barrier.

Sure Peter, I will make this change.

Thanks and BR,
Ricardo


Re: [PATCH 4/4] x86/cpu: Use SERIALIZE in sync_core() when available

2020-07-27 Thread Ricardo Neri
On Mon, Jul 27, 2020 at 10:20:03AM +0200, pet...@infradead.org wrote:
> On Sun, Jul 26, 2020 at 09:31:32PM -0700, Ricardo Neri wrote:
> > +static inline void serialize(void)
> > +{
> > +   asm volatile(".byte 0xf, 0x1, 0xe8");
> > +}
> 
> Can we pretty please have a comment with the binutils version that has
> the mnemonic? Such that when we increase the required binutils version we
> can grep for this.

This is supported since binutils 2.35[1]. I'll add a comment with the
clarification.
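
For illustration, my understanding is that the two spellings below are
equivalent; only the second one needs the newer assembler:

	/* Emit the opcode bytes directly; works with any binutils. */
	asm volatile(".byte 0xf, 0x1, 0xe8" ::: "memory");

	/* Use the mnemonic; requires binutils >= 2.35. */
	asm volatile("serialize" ::: "memory");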

Thanks and BR,
Ricardo

[1]. 
https://sourceware.org/git/gitweb.cgi?p=binutils-gdb.git;a=blob_plain;f=gas/NEWS;;hb=refs/tags/binutils-2_35


Re: [PATCH 0/4] x86/cpu: Use SERIALIZE in sync_core()

2020-07-27 Thread Ricardo Neri
On Mon, Jul 27, 2020 at 01:07:08PM +0200, Ingo Molnar wrote:
> 
> * Ricardo Neri  wrote:
> 
> > A recent submission to LKML introduced a CPU feature flag for a new
> > Intel architecture Serializing Instruction, SERIALIZE [1]. Unlike the
> > existing Serializing Instructions, this new instruction does not have
> > side effects such as clobbering registers or exiting to a hypervisor.
> > 
> > As stated in the Intel "extensions" (ISE) manual [2], this instruction
> > will appear first in Sapphire Rapids and Alder Lake.
> > 
> > Andy Lutomirski suggested to use this instruction in sync_core() as it
> > serves the very purpose of this function [3].
> > 
> > For completeness, I picked patch #3 from Cathy's series (which has become
> > patch #1 here) [1]. Her series depends on that patch to build correctly.
> > Maybe it can be merged independently while the discussion continues?
> > 
> > Thanks and BR,
> > Ricardo
> > 
> > [1]. 
> > https://lore.kernel.org/kvm/1594088183-7187-1-git-send-email-cathy.zh...@intel.com/
> > [2]. 
> > https://software.intel.com/sites/default/files/managed/c5/15/architecture-instruction-set-extensions-programming-reference.pdf
> > [3]. 
> > https://lore.kernel.org/kvm/CALCETrWudiF8G8r57r5i4JefuP5biG1kHg==0o8yxb-bys-...@mail.gmail.com/
> > 
> > Ricardo Neri (4):
> >   x86/cpufeatures: Add enumeration for SERIALIZE instruction
> >   x86/cpu: Relocate sync_core() to sync_core.h
> >   x86/cpu: Refactor sync_core() for readability
> >   x86/cpu: Use SERIALIZE in sync_core() when available
> 
> I've picked up the first 3 preparatory patches into tip:x86/cpu, as 
> they are valid improvements even without the 4th patch. The actual 
> functionality in the 4th patch still needs work.

Thank you Ingo! I'll work on the fourth patch.

BR,
Ricardo


[tip: x86/cpu] x86/cpufeatures: Add enumeration for SERIALIZE instruction

2020-07-27 Thread tip-bot2 for Ricardo Neri
The following commit has been merged into the x86/cpu branch of tip:

Commit-ID: 85b23fbc7d88f8c6e3951721802d7845bc39663d
Gitweb:
https://git.kernel.org/tip/85b23fbc7d88f8c6e3951721802d7845bc39663d
Author:Ricardo Neri 
AuthorDate:Sun, 26 Jul 2020 21:31:29 -07:00
Committer: Ingo Molnar 
CommitterDate: Mon, 27 Jul 2020 12:42:06 +02:00

x86/cpufeatures: Add enumeration for SERIALIZE instruction

The Intel architecture defines a set of Serializing Instructions (a
detailed definition can be found in Vol.3 Section 8.3 of the Intel "main"
manual, SDM). However, these instructions do more than what is required,
have side effects and/or may be rather invasive. Furthermore, some of
these instructions are only available in kernel mode or may cause VMExits.
Thus, software using these instructions only to serialize execution (as
defined in the manual) must handle the undesired side effects.

As indicated in the name, SERIALIZE is a new Intel architecture
Serializing Instruction. Crucially, it does not have any of the mentioned
side effects. Also, it does not cause VMExit and can be used in user mode.

This new instruction is currently documented in the latest "extensions"
manual (ISE). It will appear in the "main" manual in the future.

Signed-off-by: Ricardo Neri 
Signed-off-by: Ingo Molnar 
Reviewed-by: Tony Luck 
Acked-by: Dave Hansen 
Link: 
https://lore.kernel.org/r/20200727043132.15082-2-ricardo.neri-calde...@linux.intel.com
---
 arch/x86/include/asm/cpufeatures.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/x86/include/asm/cpufeatures.h 
b/arch/x86/include/asm/cpufeatures.h
index 02dabc9..adf45cf 100644
--- a/arch/x86/include/asm/cpufeatures.h
+++ b/arch/x86/include/asm/cpufeatures.h
@@ -365,6 +365,7 @@
 #define X86_FEATURE_SRBDS_CTRL (18*32+ 9) /* "" SRBDS mitigation MSR 
available */
 #define X86_FEATURE_MD_CLEAR   (18*32+10) /* VERW clears CPU buffers */
 #define X86_FEATURE_TSX_FORCE_ABORT(18*32+13) /* "" TSX_FORCE_ABORT */
+#define X86_FEATURE_SERIALIZE  (18*32+14) /* SERIALIZE instruction */
 #define X86_FEATURE_PCONFIG(18*32+18) /* Intel PCONFIG */
 #define X86_FEATURE_SPEC_CTRL  (18*32+26) /* "" Speculation Control 
(IBRS + IBPB) */
 #define X86_FEATURE_INTEL_STIBP(18*32+27) /* "" Single Thread 
Indirect Branch Predictors */


[tip: x86/cpu] x86/cpu: Relocate sync_core() to sync_core.h

2020-07-27 Thread tip-bot2 for Ricardo Neri
The following commit has been merged into the x86/cpu branch of tip:

Commit-ID: 9998a9832c4027e907353e5e05fde730cf624b77
Gitweb:
https://git.kernel.org/tip/9998a9832c4027e907353e5e05fde730cf624b77
Author:Ricardo Neri 
AuthorDate:Sun, 26 Jul 2020 21:31:30 -07:00
Committer: Ingo Molnar 
CommitterDate: Mon, 27 Jul 2020 12:42:06 +02:00

x86/cpu: Relocate sync_core() to sync_core.h

Having sync_core() in processor.h is problematic since it is not possible
to check for hardware capabilities via the *cpu_has() family of macros.
The latter needs the definitions in processor.h.

It also looks more intuitive to relocate the function to sync_core.h.

This changeset does not make changes in functionality.

Signed-off-by: Ricardo Neri 
Signed-off-by: Ingo Molnar 
Reviewed-by: Tony Luck 
Link: 
https://lore.kernel.org/r/20200727043132.15082-3-ricardo.neri-calde...@linux.intel.com
---
 arch/x86/include/asm/processor.h| 64 +
 arch/x86/include/asm/sync_core.h| 64 -
 arch/x86/kernel/alternative.c   |  1 +-
 arch/x86/kernel/cpu/mce/core.c  |  1 +-
 drivers/misc/sgi-gru/grufault.c |  1 +-
 drivers/misc/sgi-gru/gruhandles.c   |  1 +-
 drivers/misc/sgi-gru/grukservices.c |  1 +-
 7 files changed, 69 insertions(+), 64 deletions(-)

diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index 03b7c4c..68ba42f 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -678,70 +678,6 @@ static inline unsigned int cpuid_edx(unsigned int op)
return edx;
 }
 
-/*
- * This function forces the icache and prefetched instruction stream to
- * catch up with reality in two very specific cases:
- *
- *  a) Text was modified using one virtual address and is about to be executed
- * from the same physical page at a different virtual address.
- *
- *  b) Text was modified on a different CPU, may subsequently be
- * executed on this CPU, and you want to make sure the new version
- * gets executed.  This generally means you're calling this in a IPI.
- *
- * If you're calling this for a different reason, you're probably doing
- * it wrong.
- */
-static inline void sync_core(void)
-{
-   /*
-* There are quite a few ways to do this.  IRET-to-self is nice
-* because it works on every CPU, at any CPL (so it's compatible
-* with paravirtualization), and it never exits to a hypervisor.
-* The only down sides are that it's a bit slow (it seems to be
-* a bit more than 2x slower than the fastest options) and that
-* it unmasks NMIs.  The "push %cs" is needed because, in
-* paravirtual environments, __KERNEL_CS may not be a valid CS
-* value when we do IRET directly.
-*
-* In case NMI unmasking or performance ever becomes a problem,
-* the next best option appears to be MOV-to-CR2 and an
-* unconditional jump.  That sequence also works on all CPUs,
-* but it will fault at CPL3 (i.e. Xen PV).
-*
-* CPUID is the conventional way, but it's nasty: it doesn't
-* exist on some 486-like CPUs, and it usually exits to a
-* hypervisor.
-*
-* Like all of Linux's memory ordering operations, this is a
-* compiler barrier as well.
-*/
-#ifdef CONFIG_X86_32
-   asm volatile (
-   "pushfl\n\t"
-   "pushl %%cs\n\t"
-   "pushl $1f\n\t"
-   "iret\n\t"
-   "1:"
-   : ASM_CALL_CONSTRAINT : : "memory");
-#else
-   unsigned int tmp;
-
-   asm volatile (
-   "mov %%ss, %0\n\t"
-   "pushq %q0\n\t"
-   "pushq %%rsp\n\t"
-   "addq $8, (%%rsp)\n\t"
-   "pushfq\n\t"
-   "mov %%cs, %0\n\t"
-   "pushq %q0\n\t"
-   "pushq $1f\n\t"
-   "iretq\n\t"
-   "1:"
-   : "=" (tmp), ASM_CALL_CONSTRAINT : : "cc", "memory");
-#endif
-}
-
 extern void select_idle_routine(const struct cpuinfo_x86 *c);
 extern void amd_e400_c1e_apic_setup(void);
 
diff --git a/arch/x86/include/asm/sync_core.h b/arch/x86/include/asm/sync_core.h
index c67caaf..9c5573f 100644
--- a/arch/x86/include/asm/sync_core.h
+++ b/arch/x86/include/asm/sync_core.h
@@ -7,6 +7,70 @@
 #include 
 
 /*
+ * This function forces the icache and prefetched instruction stream to
+ * catch up with reality in two very specific cases:
+ *
+ *  a) Text was modified using one virtual address and is about to be executed
+ * from the same physical page at a different virtual address.
+ *
+ *  b) Text was modified on a different CPU, may subsequently be
+ * executed on this CPU, and yo

[tip: x86/cpu] x86/cpu: Refactor sync_core() for readability

2020-07-27 Thread tip-bot2 for Ricardo Neri
The following commit has been merged into the x86/cpu branch of tip:

Commit-ID: f69ca629d89d65737537e05308ac531f7bb07d5c
Gitweb:
https://git.kernel.org/tip/f69ca629d89d65737537e05308ac531f7bb07d5c
Author:Ricardo Neri 
AuthorDate:Sun, 26 Jul 2020 21:31:31 -07:00
Committer: Ingo Molnar 
CommitterDate: Mon, 27 Jul 2020 12:42:06 +02:00

x86/cpu: Refactor sync_core() for readability

Instead of having #ifdef/#endif blocks inside sync_core() for X86_64 and
X86_32, implement the new function iret_to_self() with two versions.

In this manner, avoid having to use even more #ifdef/#endif blocks
when adding support for SERIALIZE in sync_core().

Co-developed-by: Tony Luck 
Signed-off-by: Tony Luck 
Signed-off-by: Ricardo Neri 
Signed-off-by: Ingo Molnar 
Link: 
https://lore.kernel.org/r/20200727043132.15082-4-ricardo.neri-calde...@linux.intel.com
---
 arch/x86/include/asm/special_insns.h |  1 +-
 arch/x86/include/asm/sync_core.h | 56 +++
 2 files changed, 32 insertions(+), 25 deletions(-)

diff --git a/arch/x86/include/asm/special_insns.h 
b/arch/x86/include/asm/special_insns.h
index eb8e781..59a3e13 100644
--- a/arch/x86/include/asm/special_insns.h
+++ b/arch/x86/include/asm/special_insns.h
@@ -234,7 +234,6 @@ static inline void clwb(volatile void *__p)
 
 #define nop() asm volatile ("nop")
 
-
 #endif /* __KERNEL__ */
 
 #endif /* _ASM_X86_SPECIAL_INSNS_H */
diff --git a/arch/x86/include/asm/sync_core.h b/arch/x86/include/asm/sync_core.h
index 9c5573f..fdb5b35 100644
--- a/arch/x86/include/asm/sync_core.h
+++ b/arch/x86/include/asm/sync_core.h
@@ -6,6 +6,37 @@
 #include 
 #include 
 
+#ifdef CONFIG_X86_32
+static inline void iret_to_self(void)
+{
+   asm volatile (
+   "pushfl\n\t"
+   "pushl %%cs\n\t"
+   "pushl $1f\n\t"
+   "iret\n\t"
+   "1:"
+   : ASM_CALL_CONSTRAINT : : "memory");
+}
+#else
+static inline void iret_to_self(void)
+{
+   unsigned int tmp;
+
+   asm volatile (
+   "mov %%ss, %0\n\t"
+   "pushq %q0\n\t"
+   "pushq %%rsp\n\t"
+   "addq $8, (%%rsp)\n\t"
+   "pushfq\n\t"
+   "mov %%cs, %0\n\t"
+   "pushq %q0\n\t"
+   "pushq $1f\n\t"
+   "iretq\n\t"
+   "1:"
+   : "=" (tmp), ASM_CALL_CONSTRAINT : : "cc", "memory");
+}
+#endif /* CONFIG_X86_32 */
+
 /*
  * This function forces the icache and prefetched instruction stream to
  * catch up with reality in two very specific cases:
@@ -44,30 +75,7 @@ static inline void sync_core(void)
 * Like all of Linux's memory ordering operations, this is a
 * compiler barrier as well.
 */
-#ifdef CONFIG_X86_32
-   asm volatile (
-   "pushfl\n\t"
-   "pushl %%cs\n\t"
-   "pushl $1f\n\t"
-   "iret\n\t"
-   "1:"
-   : ASM_CALL_CONSTRAINT : : "memory");
-#else
-   unsigned int tmp;
-
-   asm volatile (
-   "mov %%ss, %0\n\t"
-   "pushq %q0\n\t"
-   "pushq %%rsp\n\t"
-   "addq $8, (%%rsp)\n\t"
-   "pushfq\n\t"
-   "mov %%cs, %0\n\t"
-   "pushq %q0\n\t"
-   "pushq $1f\n\t"
-   "iretq\n\t"
-   "1:"
-   : "=" (tmp), ASM_CALL_CONSTRAINT : : "cc", "memory");
-#endif
+   iret_to_self();
 }
 
 /*


[PATCH 3/4] x86/cpu: Refactor sync_core() for readability

2020-07-26 Thread Ricardo Neri
Instead of having #ifdef/#endif blocks inside sync_core() for X86_64 and
X86_32, implement the new function iret_to_self() with two versions.

In this manner, avoid having to use even more #ifdef/#endif blocks
when adding support for SERIALIZE in sync_core().

Cc: Andy Lutomirski 
Cc: Cathy Zhang 
Cc: Dave Hansen 
Cc: Fenghua Yu 
Cc: "H. Peter Anvin" 
Cc: Kyung Min Park 
Cc: Peter Zijlstra 
Cc: "Ravi V. Shankar" 
Cc: Sean Christopherson 
Cc: linux-e...@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Co-developed-by: Tony Luck 
Signed-off-by: Tony Luck 
Signed-off-by: Ricardo Neri 
---
---
 arch/x86/include/asm/special_insns.h |  1 -
 arch/x86/include/asm/sync_core.h | 56 
 2 files changed, 32 insertions(+), 25 deletions(-)

diff --git a/arch/x86/include/asm/special_insns.h 
b/arch/x86/include/asm/special_insns.h
index eb8e781c4353..59a3e13204c3 100644
--- a/arch/x86/include/asm/special_insns.h
+++ b/arch/x86/include/asm/special_insns.h
@@ -234,7 +234,6 @@ static inline void clwb(volatile void *__p)
 
 #define nop() asm volatile ("nop")
 
-
 #endif /* __KERNEL__ */
 
 #endif /* _ASM_X86_SPECIAL_INSNS_H */
diff --git a/arch/x86/include/asm/sync_core.h b/arch/x86/include/asm/sync_core.h
index 9c5573f2c333..fdb5b356e59b 100644
--- a/arch/x86/include/asm/sync_core.h
+++ b/arch/x86/include/asm/sync_core.h
@@ -6,6 +6,37 @@
 #include 
 #include 
 
+#ifdef CONFIG_X86_32
+static inline void iret_to_self(void)
+{
+   asm volatile (
+   "pushfl\n\t"
+   "pushl %%cs\n\t"
+   "pushl $1f\n\t"
+   "iret\n\t"
+   "1:"
+   : ASM_CALL_CONSTRAINT : : "memory");
+}
+#else
+static inline void iret_to_self(void)
+{
+   unsigned int tmp;
+
+   asm volatile (
+   "mov %%ss, %0\n\t"
+   "pushq %q0\n\t"
+   "pushq %%rsp\n\t"
+   "addq $8, (%%rsp)\n\t"
+   "pushfq\n\t"
+   "mov %%cs, %0\n\t"
+   "pushq %q0\n\t"
+   "pushq $1f\n\t"
+   "iretq\n\t"
+   "1:"
+   : "=" (tmp), ASM_CALL_CONSTRAINT : : "cc", "memory");
+}
+#endif /* CONFIG_X86_32 */
+
 /*
  * This function forces the icache and prefetched instruction stream to
  * catch up with reality in two very specific cases:
@@ -44,30 +75,7 @@ static inline void sync_core(void)
 * Like all of Linux's memory ordering operations, this is a
 * compiler barrier as well.
 */
-#ifdef CONFIG_X86_32
-   asm volatile (
-   "pushfl\n\t"
-   "pushl %%cs\n\t"
-   "pushl $1f\n\t"
-   "iret\n\t"
-   "1:"
-   : ASM_CALL_CONSTRAINT : : "memory");
-#else
-   unsigned int tmp;
-
-   asm volatile (
-   "mov %%ss, %0\n\t"
-   "pushq %q0\n\t"
-   "pushq %%rsp\n\t"
-   "addq $8, (%%rsp)\n\t"
-   "pushfq\n\t"
-   "mov %%cs, %0\n\t"
-   "pushq %q0\n\t"
-   "pushq $1f\n\t"
-   "iretq\n\t"
-   "1:"
-   : "=" (tmp), ASM_CALL_CONSTRAINT : : "cc", "memory");
-#endif
+   iret_to_self();
 }
 
 /*
-- 
2.17.1



[PATCH 2/4] x86/cpu: Relocate sync_core() to sync_core.h

2020-07-26 Thread Ricardo Neri
Having sync_core() in processor.h is problematic since it is not possible
to check for hardware capabilities via the *cpu_has() family of macros.
The latter needs the definitions in processor.h.

It also looks more intuitive to relocate the function to sync_core.h.

This changeset does not make changes in functionality.

Cc: Andy Lutomirski 
Cc: Cathy Zhang 
Cc: Dave Hansen 
Cc: Dimitri Sivanich 
Cc: Fenghua Yu 
Cc: "H. Peter Anvin" 
Cc: Kyung Min Park 
Cc: Peter Zijlstra 
Cc: "Ravi V. Shankar" 
Cc: Sean Christopherson 
Cc: linux-e...@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Reviewed-by: Tony Luck 
Signed-off-by: Ricardo Neri 
---
---
 arch/x86/include/asm/processor.h| 64 -
 arch/x86/include/asm/sync_core.h| 64 +
 arch/x86/kernel/alternative.c   |  1 +
 arch/x86/kernel/cpu/mce/core.c  |  1 +
 drivers/misc/sgi-gru/grufault.c |  1 +
 drivers/misc/sgi-gru/gruhandles.c   |  1 +
 drivers/misc/sgi-gru/grukservices.c |  1 +
 7 files changed, 69 insertions(+), 64 deletions(-)

diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index 2a1f7e1d7151..97143d87994c 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -676,70 +676,6 @@ static inline unsigned int cpuid_edx(unsigned int op)
return edx;
 }
 
-/*
- * This function forces the icache and prefetched instruction stream to
- * catch up with reality in two very specific cases:
- *
- *  a) Text was modified using one virtual address and is about to be executed
- * from the same physical page at a different virtual address.
- *
- *  b) Text was modified on a different CPU, may subsequently be
- * executed on this CPU, and you want to make sure the new version
- * gets executed.  This generally means you're calling this in a IPI.
- *
- * If you're calling this for a different reason, you're probably doing
- * it wrong.
- */
-static inline void sync_core(void)
-{
-   /*
-* There are quite a few ways to do this.  IRET-to-self is nice
-* because it works on every CPU, at any CPL (so it's compatible
-* with paravirtualization), and it never exits to a hypervisor.
-* The only down sides are that it's a bit slow (it seems to be
-* a bit more than 2x slower than the fastest options) and that
-* it unmasks NMIs.  The "push %cs" is needed because, in
-* paravirtual environments, __KERNEL_CS may not be a valid CS
-* value when we do IRET directly.
-*
-* In case NMI unmasking or performance ever becomes a problem,
-* the next best option appears to be MOV-to-CR2 and an
-* unconditional jump.  That sequence also works on all CPUs,
-* but it will fault at CPL3 (i.e. Xen PV).
-*
-* CPUID is the conventional way, but it's nasty: it doesn't
-* exist on some 486-like CPUs, and it usually exits to a
-* hypervisor.
-*
-* Like all of Linux's memory ordering operations, this is a
-* compiler barrier as well.
-*/
-#ifdef CONFIG_X86_32
-   asm volatile (
-   "pushfl\n\t"
-   "pushl %%cs\n\t"
-   "pushl $1f\n\t"
-   "iret\n\t"
-   "1:"
-   : ASM_CALL_CONSTRAINT : : "memory");
-#else
-   unsigned int tmp;
-
-   asm volatile (
-   "mov %%ss, %0\n\t"
-   "pushq %q0\n\t"
-   "pushq %%rsp\n\t"
-   "addq $8, (%%rsp)\n\t"
-   "pushfq\n\t"
-   "mov %%cs, %0\n\t"
-   "pushq %q0\n\t"
-   "pushq $1f\n\t"
-   "iretq\n\t"
-   "1:"
-   : "=" (tmp), ASM_CALL_CONSTRAINT : : "cc", "memory");
-#endif
-}
-
 extern void select_idle_routine(const struct cpuinfo_x86 *c);
 extern void amd_e400_c1e_apic_setup(void);
 
diff --git a/arch/x86/include/asm/sync_core.h b/arch/x86/include/asm/sync_core.h
index c67caafd3381..9c5573f2c333 100644
--- a/arch/x86/include/asm/sync_core.h
+++ b/arch/x86/include/asm/sync_core.h
@@ -6,6 +6,70 @@
 #include 
 #include 
 
+/*
+ * This function forces the icache and prefetched instruction stream to
+ * catch up with reality in two very specific cases:
+ *
+ *  a) Text was modified using one virtual address and is about to be executed
+ * from the same physical page at a different virtual address.
+ *
+ *  b) Text was modified on a different CPU, may subsequently be
+ * executed on this CPU, and you want to make sure the new version
+ * gets executed.  This generally means you're calling this in a IPI.
+ *
+ * If you're calling this for a different reason, you're probably doing
+ * it wrong.
+ */
+static i

[PATCH 4/4] x86/cpu: Use SERIALIZE in sync_core() when available

2020-07-26 Thread Ricardo Neri
The SERIALIZE instruction gives software a way to force the processor to
complete all modifications to flags, registers and memory from previous
instructions and drain all buffered writes to memory before the next
instruction is fetched and executed. Thus, it serves the purpose of
sync_core(). Use it when available.

Use boot_cpu_has() and not static_cpu_has(); the most critical paths
(returning to user mode and from interrupt and NMI) will not reach
sync_core().

Cc: Andy Lutomirski 
Cc: Cathy Zhang 
Cc: Dave Hansen 
Cc: Fenghua Yu 
Cc: "H. Peter Anvin" 
Cc: Kyung Min Park 
Cc: Peter Zijlstra 
Cc: "Ravi V. Shankar" 
Cc: Sean Christopherson 
Cc: linux-e...@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Reviewed-by: Tony Luck 
Suggested-by: Andy Lutomirski 
Signed-off-by: Ricardo Neri 
---
---
 arch/x86/include/asm/special_insns.h |  5 +
 arch/x86/include/asm/sync_core.h | 10 +-
 2 files changed, 14 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/special_insns.h 
b/arch/x86/include/asm/special_insns.h
index 59a3e13204c3..0a2a60bba282 100644
--- a/arch/x86/include/asm/special_insns.h
+++ b/arch/x86/include/asm/special_insns.h
@@ -234,6 +234,11 @@ static inline void clwb(volatile void *__p)
 
 #define nop() asm volatile ("nop")
 
+static inline void serialize(void)
+{
+   asm volatile(".byte 0xf, 0x1, 0xe8");
+}
+
 #endif /* __KERNEL__ */
 
 #endif /* _ASM_X86_SPECIAL_INSNS_H */
diff --git a/arch/x86/include/asm/sync_core.h b/arch/x86/include/asm/sync_core.h
index fdb5b356e59b..bf132c09d61b 100644
--- a/arch/x86/include/asm/sync_core.h
+++ b/arch/x86/include/asm/sync_core.h
@@ -5,6 +5,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #ifdef CONFIG_X86_32
 static inline void iret_to_self(void)
@@ -54,7 +55,8 @@ static inline void iret_to_self(void)
 static inline void sync_core(void)
 {
/*
-* There are quite a few ways to do this.  IRET-to-self is nice
+* Hardware can do this for us if SERIALIZE is available. Otherwise,
+* there are quite a few ways to do this.  IRET-to-self is nice
 * because it works on every CPU, at any CPL (so it's compatible
 * with paravirtualization), and it never exits to a hypervisor.
 * The only down sides are that it's a bit slow (it seems to be
@@ -75,6 +77,12 @@ static inline void sync_core(void)
 * Like all of Linux's memory ordering operations, this is a
 * compiler barrier as well.
 */
+
+   if (boot_cpu_has(X86_FEATURE_SERIALIZE)) {
+   serialize();
+   return;
+   }
+
iret_to_self();
 }
 
-- 
2.17.1



[PATCH 1/4] x86/cpufeatures: Add enumeration for SERIALIZE instruction

2020-07-26 Thread Ricardo Neri
The Intel architecture defines a set of Serializing Instructions (a
detailed definition can be found in Vol.3 Section 8.3 of the Intel "main"
manual, SDM). However, these instructions do more than what is required,
have side effects and/or may be rather invasive. Furthermore, some of
these instructions are only available in kernel mode or may cause VMExits.
Thus, software using these instructions only to serialize execution (as
defined in the manual) must handle the undesired side effects.

As indicated in the name, SERIALIZE is a new Intel architecture
Serializing Instruction. Crucially, it does not have any of the mentioned
side effects. Also, it does not cause VMExit and can be used in user mode.

This new instruction is currently documented in the latest "extensions"
manual (ISE). It will appear in the "main" manual in the future.

Cc: Andy Lutomirski 
Cc: Cathy Zhang 
Cc: Fenghua Yu 
Cc: "H. Peter Anvin" 
Cc: Kyung Min Park 
Cc: Peter Zijlstra 
Cc: "Ravi V. Shankar" 
Cc: Sean Christopherson 
Cc: Tony Luck 
Cc: linux-e...@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Acked-by: Dave Hansen 
Reviewed-by: Tony Luck 
Signed-off-by: Ricardo Neri 
---
---
 arch/x86/include/asm/cpufeatures.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/x86/include/asm/cpufeatures.h 
b/arch/x86/include/asm/cpufeatures.h
index 03390a1ef8e7..2901d5df4366 100644
--- a/arch/x86/include/asm/cpufeatures.h
+++ b/arch/x86/include/asm/cpufeatures.h
@@ -367,6 +367,7 @@
 #define X86_FEATURE_SRBDS_CTRL (18*32+ 9) /* "" SRBDS mitigation MSR 
available */
 #define X86_FEATURE_MD_CLEAR   (18*32+10) /* VERW clears CPU buffers */
 #define X86_FEATURE_TSX_FORCE_ABORT(18*32+13) /* "" TSX_FORCE_ABORT */
+#define X86_FEATURE_SERIALIZE  (18*32+14) /* SERIALIZE instruction */
 #define X86_FEATURE_PCONFIG(18*32+18) /* Intel PCONFIG */
 #define X86_FEATURE_ARCH_LBR   (18*32+19) /* Intel ARCH LBR */
 #define X86_FEATURE_SPEC_CTRL  (18*32+26) /* "" Speculation Control 
(IBRS + IBPB) */
-- 
2.17.1



[PATCH 0/4] x86/cpu: Use SERIALIZE in sync_core()

2020-07-26 Thread Ricardo Neri
A recent submission to LKML introduced a CPU feature flag for a new
Intel architecture Serializing Instruction, SERIALIZE [1]. Unlike the
existing Serializing Instructions, this new instruction does not have
side effects such as clobbering registers or exiting to a hypervisor.

As stated in the Intel "extensions" (ISE) manual [2], this instruction
will appear first in Sapphire Rapids and Alder Lake.

Andy Lutomirski suggested to use this instruction in sync_core() as it
serves the very purpose of this function [3].

For completeness, I picked patch #3 from Cathy's series (which has become
patch #1 here) [1]. Her series depends on that patch to build correctly.
Maybe it can be merged independently while the discussion continues?

Thanks and BR,
Ricardo

[1]. 
https://lore.kernel.org/kvm/1594088183-7187-1-git-send-email-cathy.zh...@intel.com/
[2]. 
https://software.intel.com/sites/default/files/managed/c5/15/architecture-instruction-set-extensions-programming-reference.pdf
[3]. 
https://lore.kernel.org/kvm/CALCETrWudiF8G8r57r5i4JefuP5biG1kHg==0o8yxb-bys-...@mail.gmail.com/

Ricardo Neri (4):
  x86/cpufeatures: Add enumeration for SERIALIZE instruction
  x86/cpu: Relocate sync_core() to sync_core.h
  x86/cpu: Refactor sync_core() for readability
  x86/cpu: Use SERIALIZE in sync_core() when available

 arch/x86/include/asm/cpufeatures.h   |  1 +
 arch/x86/include/asm/processor.h | 64 --
 arch/x86/include/asm/special_insns.h |  4 ++
 arch/x86/include/asm/sync_core.h | 80 
 arch/x86/kernel/alternative.c|  1 +
 arch/x86/kernel/cpu/mce/core.c   |  1 +
 drivers/misc/sgi-gru/grufault.c  |  1 +
 drivers/misc/sgi-gru/gruhandles.c|  1 +
 drivers/misc/sgi-gru/grukservices.c  |  1 +
 9 files changed, 90 insertions(+), 64 deletions(-)

-- 
2.17.1



Re: [PATCH v2 1/4] x86/cpufeatures: Add enumeration for SERIALIZE instruction

2020-07-23 Thread Ricardo Neri
On Thu, Jul 23, 2020 at 01:02:43AM +0200, Thomas Gleixner wrote:
> Ricardo Neri  writes:
> > On Tue, Jul 07, 2020 at 09:36:15AM -0700, Andy Lutomirski wrote:
> >> On Mon, Jul 6, 2020 at 7:21 PM Cathy Zhang  wrote:
> >> >
> >> > This instruction gives software a way to force the processor to complete
> >> > all modifications to flags, registers and memory from previous 
> >> > instructions
> >> > and drain all buffered writes to memory before the next instruction is
> >> > fetched and executed.
> >> >
> >> > The same effect can be obtained using the cpuid instruction. However,
> >> > cpuid causes modification on the eax, ebx, ecx, and edx registers; it
> >> > also causes a VM exit.
> >> >
> >> > A processor supports SERIALIZE instruction if CPUID.0x07.0x0:EDX[14] is
> >> > present. The CPU feature flag is shown as "serialize" in /proc/cpuinfo.
> >> >
> >> > Detailed information on the instructions and CPUID feature flag SERIALIZE
> >> > can be found in the latest Intel Architecture Instruction Set Extensions
> >> > and Future Features Programming Reference and Intel 64 and IA-32
> >> > Architectures Software Developer's Manual.
> >> 
> >> Can you also wire this up so sync_core() uses it?
> >
> > I am cc'ing Fenghua, who has volunteered to work on this. Adding support
> > for SERIALIZE in sync_core() should not block merging these patches,
> > correct?
> 
> Come on. We are not serving KVM first before making this usable on bare
> metal.

Hi Thomas,

I ended up implementing support for SERIALIZE in sync_core() I will be
posting patches for this in the next few days.

Thanks and BR,
Ricardo


Re: [PATCH v5] x86/umip: Add emulation/spoofing for SLDT and STR instructions

2020-07-13 Thread Ricardo Neri
On Fri, Jul 10, 2020 at 03:45:25PM -0700, Brendan Shanks wrote:
> Add emulation/spoofing of SLDT and STR for both 32- and 64-bit
> processes.
> 
> Wine users have found a small number of Windows apps using SLDT that
> were crashing when run on UMIP-enabled systems.
> 
> Reported-by: Andreas Rammhold 
> Originally-by: Ricardo Neri 
> Signed-off-by: Brendan Shanks 

FWIW, tested on hardware with UMIP.

Reviewed-by: Ricardo Neri 
Tested-by: Ricardo Neri 

Thanks and BR,
Ricardo


Re: [PATCH v5] x86/umip: Add emulation/spoofing for SLDT and STR instructions

2020-07-13 Thread Ricardo Neri
On Sat, Jul 11, 2020 at 02:49:54PM -0700, h...@zytor.com wrote:
> On July 10, 2020 3:45:25 PM PDT, Brendan Shanks  
> wrote:
> >Add emulation/spoofing of SLDT and STR for both 32- and 64-bit
> >processes.
> >
> >Wine users have found a small number of Windows apps using SLDT that
> >were crashing when run on UMIP-enabled systems.
> >
> >Reported-by: Andreas Rammhold 
> >Originally-by: Ricardo Neri 
> >Signed-off-by: Brendan Shanks 
> >---
> >
> >v5: Capitalize instruction names in comments.
> >
> > arch/x86/kernel/umip.c | 40 +++-
> > 1 file changed, 27 insertions(+), 13 deletions(-)
> >
> >diff --git a/arch/x86/kernel/umip.c b/arch/x86/kernel/umip.c
> >index 8d5cbe1bbb3b..2c304fd0bb1a 100644
> >--- a/arch/x86/kernel/umip.c
> >+++ b/arch/x86/kernel/umip.c
> >@@ -45,11 +45,12 @@
> >* value that, lies close to the top of the kernel memory. The limit for
> >the GDT
> >  * and the IDT are set to zero.
> >  *
> >- * Given that SLDT and STR are not commonly used in programs that run
> >on WineHQ
> >- * or DOSEMU2, they are not emulated.
> >- *
> >- * The instruction smsw is emulated to return the value that the
> >register CR0
> >+ * The instruction SMSW is emulated to return the value that the
> >register CR0
> >  * has at boot time as set in the head_32.
> >+ * SLDT and STR are emulated to return the values that the kernel
> >programmatically
> >+ * assigns:
> >+ * - SLDT returns (GDT_ENTRY_LDT * 8) if an LDT has been set, 0 if
> >not.
> >+ * - STR returns (GDT_ENTRY_TSS * 8).
> >  *
> >  * Emulation is provided for both 32-bit and 64-bit processes.
> >  *
> >@@ -244,16 +245,34 @@ static int emulate_umip_insn(struct insn *insn,
> >int umip_inst,
> > *data_size += UMIP_GDT_IDT_LIMIT_SIZE;
> > memcpy(data, &dummy_limit, UMIP_GDT_IDT_LIMIT_SIZE);
> > 
> >-} else if (umip_inst == UMIP_INST_SMSW) {
> >-unsigned long dummy_value = CR0_STATE;
> >+} else if (umip_inst == UMIP_INST_SMSW || umip_inst == UMIP_INST_SLDT
> >||
> >+   umip_inst == UMIP_INST_STR) {
> >+unsigned long dummy_value;
> >+
> >+if (umip_inst == UMIP_INST_SMSW) {
> >+dummy_value = CR0_STATE;
> >+} else if (umip_inst == UMIP_INST_STR) {
> >+dummy_value = GDT_ENTRY_TSS * 8;
> >+} else if (umip_inst == UMIP_INST_SLDT) {
> >+#ifdef CONFIG_MODIFY_LDT_SYSCALL
> >+down_read(&current->mm->context.ldt_usr_sem);
> >+if (current->mm->context.ldt)
> >+dummy_value = GDT_ENTRY_LDT * 8;
> >+else
> >+dummy_value = 0;
> >+up_read(&current->mm->context.ldt_usr_sem);
> >+#else
> >+dummy_value = 0;
> >+#endif
> >+}
> > 
> > /*
> >- * Even though the CR0 register has 4 bytes, the number
> >+ * For these 3 instructions, the number
> >  * of bytes to be copied in the result buffer is determined
> >  * by whether the operand is a register or a memory location.
> >  * If operand is a register, return as many bytes as the operand
> >  * size. If operand is memory, return only the two least
> >- * siginificant bytes of CR0.
> >+ * siginificant bytes.
> >  */
> > if (X86_MODRM_MOD(insn->modrm.value) == 3)
> > *data_size = insn->opnd_bytes;
> >@@ -261,7 +280,6 @@ static int emulate_umip_insn(struct insn *insn, int
> >umip_inst,
> > *data_size = 2;
> > 
> > memcpy(data, &dummy_value, *data_size);
> >-/* STR and SLDT  are not emulated */
> > } else {
> > return -EINVAL;
> > }
> >@@ -383,10 +401,6 @@ bool fixup_umip_exception(struct pt_regs *regs)
> > umip_pr_warn(regs, "%s instruction cannot be used by applications.\n",
> > umip_insns[umip_inst]);
> > 
> >-/* Do not emulate (spoof) SLDT or STR. */
> >-if (umip_inst == UMIP_INST_STR || umip_inst == UMIP_INST_SLDT)
> >-return false;
> >-
> > umip_pr_warn(regs, "For now, expensive software emulation returns the
> >result.\n");
> > 
> > if (emulate_umip_insn(&insn, umip_inst, dummy_data, &dummy_data_size,
> 
> Is there any reason for SLDT to not *always* return a fixed value? "An LDT 
> has been assigned" is formally a kernel internal property, separate from the 
> property of whether there are user space entries in the LDT.

But isn't it true that sldt returns 0 if the application has not set an
LDT and non-zero otherwise?

In native_set_ldt() I see that the LDT register is set to 0 if the
table has no entries and to GDT_ENTRY_LDT*8 otherwise.
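
Roughly, the logic I am referring to looks like this (a paraphrased
sketch from memory, not the exact source; set_ldt_sketch() is just an
illustrative name):

static inline void set_ldt_sketch(const void *addr, unsigned int entries)
{
	if (entries == 0) {
		/* No LDT entries: load a null selector into LDTR. */
		asm volatile("lldt %w0" : : "q" (0));
	} else {
		/* Install the LDT descriptor in the GDT, then load it. */
		/* write_gdt_entry(..., GDT_ENTRY_LDT, ...); */
		asm volatile("lldt %w0" : : "q" (GDT_ENTRY_LDT * 8));
	}
}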

Please correct me if I understand this wrong.

Thanks and BR,
Ricardo


Re: [PATCH v2 1/4] x86/cpufeatures: Add enumeration for SERIALIZE instruction

2020-07-07 Thread Ricardo Neri
On Tue, Jul 07, 2020 at 09:36:15AM -0700, Andy Lutomirski wrote:
> On Mon, Jul 6, 2020 at 7:21 PM Cathy Zhang  wrote:
> >
> > This instruction gives software a way to force the processor to complete
> > all modifications to flags, registers and memory from previous instructions
> > and drain all buffered writes to memory before the next instruction is
> > fetched and executed.
> >
> > The same effect can be obtained using the cpuid instruction. However,
> > cpuid causes modification on the eax, ebx, ecx, and edx registers; it
> > also causes a VM exit.
> >
> > A processor supports SERIALIZE instruction if CPUID.0x07.0x0:EDX[14] is
> > present. The CPU feature flag is shown as "serialize" in /proc/cpuinfo.
> >
> > Detailed information on the instructions and CPUID feature flag SERIALIZE
> > can be found in the latest Intel Architecture Instruction Set Extensions
> > and Future Features Programming Reference and Intel 64 and IA-32
> > Architectures Software Developer's Manual.
> 
> Can you also wire this up so sync_core() uses it?

I am cc'ing Fenghua, who has volunteered to work on this. Adding support
for SERIALIZE in sync_core() should not block merging these patches,
correct?

Thanks and BR,
Ricardo


Re: [PATCH v4] x86/umip: Add emulation/spoofing for SLDT and STR instructions

2020-06-09 Thread Ricardo Neri
On Tue, Jun 09, 2020 at 10:54:23AM -0700, Brendan Shanks wrote:
> Add emulation/spoofing of SLDT and STR for both 32- and 64-bit
> processes.
> 
> Wine users have found a small number of Windows apps using SLDT that
> were crashing when run on UMIP-enabled systems.
> 
> Reported-by: Andreas Rammhold 
> Originally-by: Ricardo Neri 
> Signed-off-by: Brendan Shanks 
> ---
> 
> v4: Use braces for every clause of the conditional. I tried a switch(),
> but it takes more lines and looks more cluttered (especially with the
> #ifdef).
> Also replace out-of-date comment at top of file.
> 
>  arch/x86/kernel/umip.c | 38 ++
>  1 file changed, 26 insertions(+), 12 deletions(-)
> 
> diff --git a/arch/x86/kernel/umip.c b/arch/x86/kernel/umip.c
> index 8d5cbe1bbb3b..62f4f0afb979 100644
> --- a/arch/x86/kernel/umip.c
> +++ b/arch/x86/kernel/umip.c
> @@ -45,11 +45,12 @@
>   * value that, lies close to the top of the kernel memory. The limit for the 
> GDT
>   * and the IDT are set to zero.
>   *
> - * Given that SLDT and STR are not commonly used in programs that run on 
> WineHQ
> - * or DOSEMU2, they are not emulated.
> - *
>   * The instruction smsw is emulated to return the value that the register CR0
>   * has at boot time as set in the head_32.
> + * sldt and str are emulated to return the values that the kernel 
> programmatically
> + * assigns:
> + * - sldt returns (GDT_ENTRY_LDT * 8) if an LDT has been set, 0 if not.
> + * - str returns (GDT_ENTRY_TSS * 8).

Probably for consistency with the rest of the document the instruction
names should be capitalized (I know that smsw was not capitalized before
:/). Maybe maintainers can fix this when applying?

Other than that, FWIW:

Reviewed-by: Ricardo Neri 
Tested-by: Ricardo Neri 


Re: [PATCH v3] x86/umip: Add emulation/spoofing for SLDT and STR instructions

2020-06-08 Thread Ricardo Neri
On Mon, Jun 08, 2020 at 05:38:12PM -0700, Ricardo Neri wrote:
> On Mon, Jun 08, 2020 at 03:44:24PM -0700, Brendan Shanks wrote:
> > Add emulation/spoofing of SLDT and STR for both 32- and 64-bit
> > processes.
> > 
> > Wine users have found a small number of Windows apps using SLDT that
> > were crashing when run on UMIP-enabled systems.
> > 
> > Reported-by: Andreas Rammhold 
> > Originally-by: Ricardo Neri 
> > Signed-off-by: Brendan Shanks 
> > ---
> > 
> > v3: Use (GDT_ENTRY_TSS * 8) for task register selector instead of
> > harcoding 0x40.
> > 
> >  arch/x86/kernel/umip.c | 32 +++-
> >  1 file changed, 23 insertions(+), 9 deletions(-)
> > 
> > diff --git a/arch/x86/kernel/umip.c b/arch/x86/kernel/umip.c
> > index 8d5cbe1bbb3b..166c579b0273 100644
> > --- a/arch/x86/kernel/umip.c
> > +++ b/arch/x86/kernel/umip.c
> > @@ -244,16 +244,35 @@ static int emulate_umip_insn(struct insn *insn, int 
> > umip_inst,
> > *data_size += UMIP_GDT_IDT_LIMIT_SIZE;
> > memcpy(data, &dummy_limit, UMIP_GDT_IDT_LIMIT_SIZE);
> >  
> > -   } else if (umip_inst == UMIP_INST_SMSW) {
> > -   unsigned long dummy_value = CR0_STATE;
> > +   } else if (umip_inst == UMIP_INST_SMSW || umip_inst == UMIP_INST_SLDT ||
> > +  umip_inst == UMIP_INST_STR) {
> > +   unsigned long dummy_value;
> > +
> > +   if (umip_inst == UMIP_INST_SMSW)
> > +   dummy_value = CR0_STATE;
> > +   else if (umip_inst == UMIP_INST_STR)
> > +   dummy_value = GDT_ENTRY_TSS * 8;
> > +   else if (umip_inst == UMIP_INST_SLDT)
> > +   {
> 
> This brace should go in the previous line. Also, if you use braces in
> the last part of the conditional you should probably use them in the
> previous ones. I guess in this case it wouldn't improve readability.
> Instead, you can probably have a switch instead of the three ifs. That
> probably does improve readability and solves the dilemma of needing to
> put braces in all the one-line conditionals.
> 
> BTW, you should also delete the comment at the top of the file saying
> that str and sldt will not be emulated:
> 
> diff --git a/arch/x86/kernel/umip.c b/arch/x86/kernel/umip.c
> index 166c579b0273..0984a55eb8c0 100644
> --- a/arch/x86/kernel/umip.c
> +++ b/arch/x86/kernel/umip.c
> @@ -45,9 +45,6 @@
>   * value that, lies close to the top of the kernel memory. The limit for the 
> GDT
>   * and the IDT are set to zero.
>   *
> - * Given that SLDT and STR are not commonly used in programs that run on 
> WineHQ
> - * or DOSEMU2, they are not emulated.
> - *
>   * The instruction smsw is emulated to return the value that the register CR0
>   * has at boot time as set in the head_32.

... And also explain that the emulated values for str and sldt are
simply the values that Linux assigns programmatically.

Thanks and BR,
Ricardo


Re: [PATCH v3] x86/umip: Add emulation/spoofing for SLDT and STR instructions

2020-06-08 Thread Ricardo Neri
On Mon, Jun 08, 2020 at 03:44:24PM -0700, Brendan Shanks wrote:
> Add emulation/spoofing of SLDT and STR for both 32- and 64-bit
> processes.
> 
> Wine users have found a small number of Windows apps using SLDT that
> were crashing when run on UMIP-enabled systems.
> 
> Reported-by: Andreas Rammhold 
> Originally-by: Ricardo Neri 
> Signed-off-by: Brendan Shanks 
> ---
> 
> v3: Use (GDT_ENTRY_TSS * 8) for task register selector instead of
> hardcoding 0x40.
> 
>  arch/x86/kernel/umip.c | 32 +++-
>  1 file changed, 23 insertions(+), 9 deletions(-)
> 
> diff --git a/arch/x86/kernel/umip.c b/arch/x86/kernel/umip.c
> index 8d5cbe1bbb3b..166c579b0273 100644
> --- a/arch/x86/kernel/umip.c
> +++ b/arch/x86/kernel/umip.c
> @@ -244,16 +244,35 @@ static int emulate_umip_insn(struct insn *insn, int 
> umip_inst,
>   *data_size += UMIP_GDT_IDT_LIMIT_SIZE;
> >   memcpy(data, &dummy_limit, UMIP_GDT_IDT_LIMIT_SIZE);
>  
> - } else if (umip_inst == UMIP_INST_SMSW) {
> - unsigned long dummy_value = CR0_STATE;
> + } else if (umip_inst == UMIP_INST_SMSW || umip_inst == UMIP_INST_SLDT ||
> +umip_inst == UMIP_INST_STR) {
> + unsigned long dummy_value;
> +
> + if (umip_inst == UMIP_INST_SMSW)
> + dummy_value = CR0_STATE;
> + else if (umip_inst == UMIP_INST_STR)
> + dummy_value = GDT_ENTRY_TSS * 8;
> + else if (umip_inst == UMIP_INST_SLDT)
> + {

This brace should go on the previous line. Also, if you use braces in
the last part of the conditional, you should probably use them in the
previous ones too. I guess in this case it wouldn't improve readability.
Instead, you can probably use a switch instead of the three ifs. That
probably does improve readability and solves the dilemma of needing to
put braces in all the one-line conditionals.
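
For instance (untested sketch, just to show the shape; the SLDT handling
would be whatever you end up implementing):

		switch (umip_inst) {
		case UMIP_INST_SMSW:
			dummy_value = CR0_STATE;
			break;
		case UMIP_INST_STR:
			dummy_value = GDT_ENTRY_TSS * 8;
			break;
		case UMIP_INST_SLDT:
			/* LDT check goes here. */
			dummy_value = 0;
			break;
		default:
			/* Cannot happen given the enclosing condition. */
			return -EINVAL;
		}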

BTW, you should also delete the comment at the top of the file saying
that str and sldt will not be emulated:

diff --git a/arch/x86/kernel/umip.c b/arch/x86/kernel/umip.c
index 166c579b0273..0984a55eb8c0 100644
--- a/arch/x86/kernel/umip.c
+++ b/arch/x86/kernel/umip.c
@@ -45,9 +45,6 @@
  * value that, lies close to the top of the kernel memory. The limit for the 
GDT
  * and the IDT are set to zero.
  *
- * Given that SLDT and STR are not commonly used in programs that run on WineHQ
- * or DOSEMU2, they are not emulated.
- *
  * The instruction smsw is emulated to return the value that the register CR0
  * has at boot time as set in the head_32.
  *

Thanks and BR,
Ricardo


Re: [PATCH v2] x86/umip: Add emulation/spoofing for SLDT and STR instructions

2020-06-08 Thread Ricardo Neri
On Mon, Jun 08, 2020 at 11:14:54AM -0700, Brendan Shanks wrote:
> Add emulation/spoofing of SLDT and STR for both 32- and 64-bit
> processes.
> 
> Wine users have found a small number of Windows apps using SLDT that
> were crashing when run on UMIP-enabled systems.
> 
> Reported-by: Andreas Rammhold 
> Originally-by: Ricardo Neri 
> Signed-off-by: Brendan Shanks 
> ---
> 
> v2: Return (GDT_ENTRY_LDT * 8) for SLDT when an LDT is set.
> 
>  arch/x86/kernel/umip.c | 34 +-
>  1 file changed, 25 insertions(+), 9 deletions(-)
> 
> diff --git a/arch/x86/kernel/umip.c b/arch/x86/kernel/umip.c
> index 8d5cbe1bbb3b..a85f0b0ec2b9 100644
> --- a/arch/x86/kernel/umip.c
> +++ b/arch/x86/kernel/umip.c
> @@ -64,6 +64,8 @@
>  #define UMIP_DUMMY_GDT_BASE 0xfffffffffffe0000ULL
>  #define UMIP_DUMMY_IDT_BASE 0xffffffffffff0000ULL
>  
> +#define UMIP_DUMMY_TASK_REGISTER_SELECTOR 0x40

One more thing. How was this value selected? Would it be possible to use
GDT_ENTRY_TSS*8? Linux already uses this value.
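
That is, something along these lines (sketch only):

	else if (umip_inst == UMIP_INST_STR)
		dummy_value = GDT_ENTRY_TSS * 8; /* selector Linux loads into TR */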

Thanks and BR,
Ricardo
 


Re: [PATCH] x86/umip: Add emulation/spoofing for SLDT and STR instructions

2020-06-05 Thread Ricardo Neri
On Fri, Jun 05, 2020 at 11:58:13AM -0700, Brendan Shanks wrote:
> 
> > On Jun 3, 2020, at 9:39 PM, Andy Lutomirski  wrote:
> > 
> > On Wed, Jun 3, 2020 at 5:12 PM Ricardo Neri
> >  > <mailto:ricardo.neri-calde...@linux.intel.com>> wrote:
> >> 
> >> On Tue, Jun 02, 2020 at 11:42:12AM -0700, Brendan Shanks wrote:
> >>> Add emulation/spoofing of SLDT and STR for both 32- and 64-bit
> >>> processes.
> >>> 
> >>> Wine users have found a small number of Windows apps using SLDT that
> >>> were crashing when run on UMIP-enabled systems.
> >>> 
> >>> Reported-by: Andreas Rammhold 
> >>> Originally-by: Ricardo Neri 
> >>> Signed-off-by: Brendan Shanks 
> >>> ---
> >>> arch/x86/kernel/umip.c | 23 ++-
> >>> 1 file changed, 14 insertions(+), 9 deletions(-)
> >>> 
> >>> diff --git a/arch/x86/kernel/umip.c b/arch/x86/kernel/umip.c
> >>> index 8d5cbe1bbb3b..59dfceac5cc0 100644
> >>> --- a/arch/x86/kernel/umip.c
> >>> +++ b/arch/x86/kernel/umip.c
> >>> @@ -64,6 +64,8 @@
> >>> #define UMIP_DUMMY_GDT_BASE 0xfffffffffffe0000ULL
> >>> #define UMIP_DUMMY_IDT_BASE 0xffffffffffff0000ULL
> >>> 
> >>> +#define UMIP_DUMMY_TASK_REGISTER_SELECTOR 0x40
> >>> +
> >>> /*
> >>>  * The SGDT and SIDT instructions store the contents of the global 
> >>> descriptor
> >>>  * table and interrupt table registers, respectively. The destination is a
> >>> @@ -244,16 +246,24 @@ static int emulate_umip_insn(struct insn *insn, int 
> >>> umip_inst,
> >>>  *data_size += UMIP_GDT_IDT_LIMIT_SIZE;
> >>>  memcpy(data, &dummy_limit, UMIP_GDT_IDT_LIMIT_SIZE);
> >>> 
> >>> - } else if (umip_inst == UMIP_INST_SMSW) {
> >>> - unsigned long dummy_value = CR0_STATE;
> >>> + } else if (umip_inst == UMIP_INST_SMSW || umip_inst == 
> >>> UMIP_INST_SLDT ||
> >>> +umip_inst == UMIP_INST_STR) {
> >>> + unsigned long dummy_value;
> >>> +
> >>> + if (umip_inst == UMIP_INST_SMSW)
> >>> + dummy_value = CR0_STATE;
> >>> + else if (umip_inst == UMIP_INST_STR)
> >>> + dummy_value = UMIP_DUMMY_TASK_REGISTER_SELECTOR;
> >>> + else
> >>> + dummy_value = 0;
> >> 
> >> Perhaps you can return a non-zero value for SLDT if it has an LDT, as
> >> Andy had suggested. Maybe this can be implemented by looking at
> >> current->mm->context.ldt
> >> 
> >> I guess the non-zero value can be (GDT_ENTRY_LDT*8).
> > 
> > You could probably even get away with always returning a nonzero
> > value.  After all, an empty LDT is quite similar to no LDT.
> 
> 
> Is something like this what you both had in mind?

> I don’t have any software handy to test the LDT-present case though.

Perhaps you can insert a test in the kernel selftest. Something like
this (based on Andreas' test program):

--- a/tools/testing/selftests/x86/ldt_gdt.c
+++ b/tools/testing/selftests/x86/ldt_gdt.c
@@ -220,12 +220,23 @@ static void install_invalid(const struct user_desc *desc, 
bool oldmode)
}
 }

+unsigned long test(void)
+{
+unsigned char ldtr[5] = "\xef\xbe\xad\xde";
+unsigned long ldt = 0;
+asm("sldt %0\n" : "=m" (ldtr));
+ldt = *((unsigned long *)&ldtr[0]);
+printf ("LDT base: 0x%lx\n", ldt);
+return (ldt);
+}
+
 static int safe_modify_ldt(int func, struct user_desc *ptr,
   unsigned long bytecount)
 {
int ret = syscall(SYS_modify_ldt, 0x11, ptr, bytecount);
if (ret < -1)
errno = -ret;
+   test();
return ret;
 }

Thanks and BR,
Ricardo
> 
> 
> else if (umip_inst == UMIP_INST_STR)
> dummy_value = UMIP_DUMMY_TASK_REGISTER_SELECTOR;
> else if (umip_inst == UMIP_INST_SLDT)
> {
> #ifdef CONFIG_MODIFY_LDT_SYSCALL
> down_read(&current->mm->context.ldt_usr_sem);
> if (current->mm->context.ldt)
> dummy_value = GDT_ENTRY_LDT * 8;
> else
> dummy_value = 0;
> up_read(&current->mm->context.ldt_usr_sem);
> #else
> dummy_value = 0;
> #endif
> 

It looks fine to me. Perhaps Andy prefers a simpler, always-non-zero
implementation?

Thanks and BR,
Ricardo


Re: [PATCH] x86/umip: Add emulation/spoofing for SLDT and STR instructions

2020-06-03 Thread Ricardo Neri
On Tue, Jun 02, 2020 at 11:42:12AM -0700, Brendan Shanks wrote:
> Add emulation/spoofing of SLDT and STR for both 32- and 64-bit
> processes.
> 
> Wine users have found a small number of Windows apps using SLDT that
> were crashing when run on UMIP-enabled systems.
> 
> Reported-by: Andreas Rammhold 
> Originally-by: Ricardo Neri 
> Signed-off-by: Brendan Shanks 
> ---
>  arch/x86/kernel/umip.c | 23 ++-
>  1 file changed, 14 insertions(+), 9 deletions(-)
> 
> diff --git a/arch/x86/kernel/umip.c b/arch/x86/kernel/umip.c
> index 8d5cbe1bbb3b..59dfceac5cc0 100644
> --- a/arch/x86/kernel/umip.c
> +++ b/arch/x86/kernel/umip.c
> @@ -64,6 +64,8 @@
>  #define UMIP_DUMMY_GDT_BASE 0xfffffffffffe0000ULL
>  #define UMIP_DUMMY_IDT_BASE 0xffffffffffff0000ULL
>  
> +#define UMIP_DUMMY_TASK_REGISTER_SELECTOR 0x40
> +
>  /*
>   * The SGDT and SIDT instructions store the contents of the global descriptor
>   * table and interrupt table registers, respectively. The destination is a
> @@ -244,16 +246,24 @@ static int emulate_umip_insn(struct insn *insn, int 
> umip_inst,
>   *data_size += UMIP_GDT_IDT_LIMIT_SIZE;
> >   memcpy(data, &dummy_limit, UMIP_GDT_IDT_LIMIT_SIZE);
>  
> - } else if (umip_inst == UMIP_INST_SMSW) {
> - unsigned long dummy_value = CR0_STATE;
> + } else if (umip_inst == UMIP_INST_SMSW || umip_inst == UMIP_INST_SLDT ||
> +umip_inst == UMIP_INST_STR) {
> + unsigned long dummy_value;
> +
> + if (umip_inst == UMIP_INST_SMSW)
> + dummy_value = CR0_STATE;
> + else if (umip_inst == UMIP_INST_STR)
> + dummy_value = UMIP_DUMMY_TASK_REGISTER_SELECTOR;
> + else
> + dummy_value = 0;

Perhaps you can return a non-zero value for SLDT if it has an LDT, as
Andy had suggested. Maybe this can be implemented by looking at
current->mm->context.ldt

I guess the non-zero value can be (GDT_ENTRY_LDT*8).
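
For instance, a rough sketch (locking around context.ldt is omitted here
for brevity; the real code should take ldt_usr_sem and also consider
!CONFIG_MODIFY_LDT_SYSCALL):

	else if (umip_inst == UMIP_INST_SLDT) {
		/* Report a non-null selector only if the task has an LDT. */
		if (current->mm->context.ldt)
			dummy_value = GDT_ENTRY_LDT * 8;
		else
			dummy_value = 0;
	}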

Thanks and BR,
Ricardo


Re: [PATCH v2 2/3] x86, sched: Bail out of frequency invariance if turbo frequency is unknown

2020-06-01 Thread Ricardo Neri
On Sun, May 31, 2020 at 08:24:52PM +0200, Giovanni Gherdovich wrote:
> There may be CPUs that support turbo boost but don't declare any turbo
> ratio, i.e. their MSR_TURBO_RATIO_LIMIT is all zeroes. In that condition
> scale-invariant calculations can't be performed.
> 
> Signed-off-by: Giovanni Gherdovich 
> Suggested-by: Ricardo Neri 

FWIW,

Tested-by: Ricardo Neri 


Re: umip: AMD Ryzen 3900X, pagefault after emulate SLDT/SIDT instruction

2020-05-26 Thread Ricardo Neri
On Sat, May 23, 2020 at 04:17:39AM +0200, Andreas Rammhold wrote:
> On 12:43 19.05.20, Ricardo Neri wrote:
> > I have a patch for this already that I wrote for testing purposes:
> > https://github.com/ricardon/tip/commit/1692889cb3f8accb523d44b682458e234b93be50
> > Perhaps it can be used as a starting point? Not sure what the spoofing
> > value should be, though. Perhaps 0?
> 
> I tried the above patch (in modified/rebased version; hope that didn't
> kill it [0]). The results are negative, as without the patch.

Ah. My patch above is based on a rather old kernel. There is a check in
fixup_umip_exception() for SLDT and STR. I think this causes the
exception you see. Perhaps you can try removing that check:

diff --git a/arch/x86/kernel/umip.c b/arch/x86/kernel/umip.c
@@ -383,10 +389,6 @@ bool fixup_umip_exception(struct pt_regs *regs)
umip_pr_warn(regs, "%s instruction cannot be used by applications.\n",
umip_insns[umip_inst]);

-   /* Do not emulate (spoof) SLDT or STR. */
-   if (umip_inst == UMIP_INST_STR || umip_inst == UMIP_INST_SLDT)
-   return false;
-
umip_pr_warn(regs, "For now, expensive software emulation returns the 
result.\n");

if (emulate_umip_insn(&insn, umip_inst, dummy_data, &dummy_data_size,

You would still need my old patch.

Thanks and BR,
Ricardo


Re: umip: AMD Ryzen 3900X, pagefault after emulate SLDT/SIDT instruction

2020-05-20 Thread Ricardo Neri
On Tue, May 19, 2020 at 05:54:53PM -0700, Andy Lutomirski wrote:
> On Tue, May 19, 2020 at 12:43 PM Ricardo Neri
>  wrote:
> >
> > On Tue, May 19, 2020 at 11:56:40AM -0700, Brendan Shanks wrote:
> > >
> > > > On May 19, 2020, at 7:38 AM, Andreas Rammhold  
> > > > wrote:
> > > >
> > > > Hi,
> > > >
> > > > I've been running into a weird problem with UMIP on a current Ryzen
> > > > 3900x with kernel 5.6.11 where a process receives a page fault after the
> > > > kernel handled the SLDT (or SIDT) instruction (emulation).
> > > >
> > > > The program I am running is run through WINE in 32bit mode and tries to
> > > > figure out if it is running in a VMWare machine by comparing the results
> > > > of SLDT against well known constants (basically as shown in the
> > > > [example] linked below).
> > > >
> > > > In dmesg I see the following log lines:
> > > >> [99970.004756] umip: Program.exe[3080] ip:4373fb sp:32f3e0: SIDT 
> > > >> instruction cannot be used by applications.
> > > >> [99970.004757] umip: Program.exe[3080] ip:4373fb sp:32f3e0: For now, 
> > > >> expensive software emulation returns the result.
> > > >> [99970.004758] umip: Program.exe[3080] ip:437415 sp:32f3e0: SLDT 
> > > >> instruction cannot be used by applications.
> > > >
> > > > Following that the process terminates with a page fault:
> > > >> Unhandled exception: page fault on read access to 0x in 32-bit 
> > > >> code (0x00437415).
> > > >
> > > > Assembly at that address:
> > > >> 0x00437415: sldt0xffe8(%ebp)
> > > >
> > > > Running the same executable on the exact same kernel (and userland) but
> > > > on a Intel i7-8565U doesn't crash at this point. I am guessing the
> > > > emulation is supposed to do something different on AMD CPUs?
> >
> > I am surprised you don't see it on the Intel processor. Maybe it does
> > not have UMIP. Do you see umip when you do
> >
> > $ grep umip /proc/cpuinfo
> >
> > ?
> > > >
> > > > On the Ryzen the code executes successfully after setting 
> > > > CONFIG_X86_UMIP=n.
> > >
> > > Hi Andreas,
> > >
> > > The problem is that the kernel does not emulate/spoof the SLDT 
> > > instruction, only SGDT, SIDT, and SMSW.
> > > SLDT and STR weren't thought to be commonly used, so emulation/spoofing 
> > > wasn’t added.
> > > In the last few months I have seen reports of one or two (32-bit) Windows 
> > > games that use SLDT though.
> > > Can you share more information about the application you’re running?
> > >
> > > Maybe the best path is to add kernel emulation/spoofing for SLDT and STR 
> > > on 32 and 64-bit, just to cover all the cases. It should be a pretty 
> > > simple patch, I’ll start working on it.
> >
> > I have a patch for this already that I wrote for testing purposes:
> >
> > https://github.com/ricardon/tip/commit/1692889cb3f8accb523d44b682458e234b93be50
> >
> > Perhaps it can be used as a starting point? Not sure what the spoofing
> > value should be, though. Perhaps 0?
> 
> Possibly SLDT should return nonzero if there's an LDT.

I guess the value should be in the same hole of the x86_64 memory map,
right? Currently sgdt and sidt return 0xfffffffffffe0000 and
0xffffffffffff0000, respectively.

Thanks and BR,
Ricardo


Re: umip: AMD Ryzen 3900X, pagefault after emulate SLDT/SIDT instruction

2020-05-19 Thread Ricardo Neri
On Tue, May 19, 2020 at 11:56:40AM -0700, Brendan Shanks wrote:
> 
> > On May 19, 2020, at 7:38 AM, Andreas Rammhold  wrote:
> > 
> > Hi,
> > 
> > I've been running into a weird problem with UMIP on a current Ryzen
> > 3900x with kernel 5.6.11 where a process receives a page fault after the
> > kernel handled the SLDT (or SIDT) instruction (emulation).
> > 
> > The program I am running is run through WINE in 32bit mode and tries to
> > figure out if it is running in a VMWare machine by comparing the results
> > of SLDT against well known constants (basically as shown in the
> > [example] linked below).
> > 
> > In dmesg I see the following log lines:
> >> [99970.004756] umip: Program.exe[3080] ip:4373fb sp:32f3e0: SIDT 
> >> instruction cannot be used by applications.
> >> [99970.004757] umip: Program.exe[3080] ip:4373fb sp:32f3e0: For now, 
> >> expensive software emulation returns the result.
> >> [99970.004758] umip: Program.exe[3080] ip:437415 sp:32f3e0: SLDT 
> >> instruction cannot be used by applications.
> > 
> > Following that the process terminates with a page fault:
> >> Unhandled exception: page fault on read access to 0x in 32-bit 
> >> code (0x00437415).
> > 
> > Assembly at that address:
> >> 0x00437415: sldt0xffe8(%ebp)
> > 
> > Running the same executable on the exact same kernel (and userland) but
> > on a Intel i7-8565U doesn't crash at this point. I am guessing the
> > emulation is supposed to do something different on AMD CPUs?

I am surprised you don't see it on the Intel processor. Maybe it does
not have UMIP. Do you see umip when you do

$ grep umip /proc/cpuinfo 

?
> > 
> > On the Ryzen the code executes successfully after setting CONFIG_X86_UMIP=n.
> 
> Hi Andreas,
> 
> The problem is that the kernel does not emulate/spoof the SLDT instruction, 
> only SGDT, SIDT, and SMSW.
> SLDT and STR weren't thought to be commonly used, so emulation/spoofing 
> wasn’t added.
> In the last few months I have seen reports of one or two (32-bit) Windows 
> games that use SLDT though.
> Can you share more information about the application you’re running?
> 
> Maybe the best path is to add kernel emulation/spoofing for SLDT and STR on 
> 32 and 64-bit, just to cover all the cases. It should be a pretty simple 
> patch, I’ll start working on it.

I have a patch for this already that I wrote for testing purposes:

https://github.com/ricardon/tip/commit/1692889cb3f8accb523d44b682458e234b93be50

Perhaps it can be used as a starting point? Not sure what the spoofing
value should be, though. Perhaps 0?

Thanks and BR,
Ricardo


Re: [PATCH 1/2] x86, sched: Prevent divisions by zero in frequency invariant accounting

2020-05-18 Thread Ricardo Neri
On Sat, May 02, 2020 at 04:25:00PM +0200, Giovanni Gherdovich wrote:
> > 
> > I've changed the patch like so.. OK?
> > 
> > (ok, perhaps I went a little overboard with the paranoia ;-)
> 
> Right, I wasn't really checking for overflow, only for when the product
> "mcnt * arch_max_freq_ratio" becomes zero.
> 
> Thanks for your edit (I took note of the macros check_*_overflow, didn't know
> them). I fully subscribe to the paranoid approach.
> 
> I understand you've already edited the patches in your tree, so I am not
> resending, just confirming my
> 
> Signed-off-by: Giovanni Gherdovich 

Hi, have these changes been merged? I still don't see them in the tip or
Linus' tree.

Thanks and BR,
Ricardo


Re: [PATCH 2/2] x86, sched: Bail out of frequency invariance if turbo frequency is unknown

2020-05-01 Thread Ricardo Neri
On Fri, May 01, 2020 at 03:04:27PM +0200, Peter Zijlstra wrote:
> On Tue, Apr 28, 2020 at 03:24:50PM +0200, Giovanni Gherdovich wrote:
> > There may be CPUs that support turbo boost but don't declare any turbo
> > ratio, i.e. their MSR_TURBO_RATIO_LIMIT is all zeroes. In that condition
> > scale-invariant calculations can't be performed.
> > 
> > Signed-off-by: Giovanni Gherdovich 
> > Suggested-by: Ricardo Neri 
> > Fixes: 1567c3e3467c ("x86, sched: Add support for frequency invariance")
> > ---
> >  arch/x86/kernel/smpboot.c | 6 --
> >  1 file changed, 4 insertions(+), 2 deletions(-)
> > 
> > diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
> > index 4718f29a3065..ab2a0df7d1fb 100644
> > --- a/arch/x86/kernel/smpboot.c
> > +++ b/arch/x86/kernel/smpboot.c
> > @@ -1991,9 +1991,11 @@ static bool intel_set_max_freq_ratio(void)
> > /*
> >  * Some hypervisors advertise X86_FEATURE_APERFMPERF
> >  * but then fill all MSR's with zeroes.
> > +* Some CPUs have turbo boost but don't declare any turbo ratio
> > +* in MSR_TURBO_RATIO_LIMIT.
> >  */
> > -   if (!base_freq) {
> > -   pr_debug("Couldn't determine cpu base frequency, necessary for 
> > scale-invariant accounting.\n");
> > +   if (!base_freq || !turbo_freq) {
> > +   pr_debug("Couldn't determine cpu base or turbo frequency, 
> > necessary for scale-invariant accounting.\n");
> > return false;
> > }
> 
> I've added the below, imagine base_freq > turbo_freq *
> SCHED_CAPACITY_SCALE.
> 
> --- a/arch/x86/kernel/smpboot.c
> +++ b/arch/x86/kernel/smpboot.c
> @@ -1975,6 +1975,7 @@ static bool core_set_max_freq_ratio(u64
>  static bool intel_set_max_freq_ratio(void)
>  {
>   u64 base_freq, turbo_freq;
> + u64 turbo_ratio;
>  
>   if (slv_set_max_freq_ratio(&base_freq, &turbo_freq))
>   goto out;
> @@ -2008,9 +2009,15 @@ static bool intel_set_max_freq_ratio(voi
>   return false;
>   }
>  
> - arch_turbo_freq_ratio = div_u64(turbo_freq * SCHED_CAPACITY_SCALE,
> - base_freq);
> + turbo_ratio = div_u64(turbo_freq * SCHED_CAPACITY_SCALE, base_freq);
> + if (!turbo_ratio) {
> + pr_debug("Non-zero turbo and base frequencies led to a 0 
> ratio.\n");
> + return false;
> + }
> +
> + arch_turbo_freq_ratio = turbo_ratio;

I guess this covers more cases in which turbo_ratio can be zero.

Also, FWIW

Tested-by: Ricardo Neri 

Thanks and BR,
Ricardo


Re: [PATCH 2/2] x86, sched: Bail out of frequency invariance if turbo frequency is unknown

2020-05-01 Thread Ricardo Neri
On Tue, Apr 28, 2020 at 03:24:50PM +0200, Giovanni Gherdovich wrote:
> There may be CPUs that support turbo boost but don't declare any turbo
> ratio, i.e. their MSR_TURBO_RATIO_LIMIT is all zeroes. In that condition
> scale-invariant calculations can't be performed.
> 
> Signed-off-by: Giovanni Gherdovich 
> Suggested-by: Ricardo Neri 

Thanks for implementing this, Giovanni!

Tested-by: Ricardo Neri 


Re: [PATCH] x86/umip: Add emulation for 64-bit processes

2019-09-07 Thread Ricardo Neri
On Thu, Sep 05, 2019 at 04:22:21PM -0700, Brendan Shanks wrote:
> Add emulation of the sgdt, sidt, and smsw instructions for 64-bit
> processes.
> 
> Wine users have encountered a number of 64-bit Windows games that use
> these instructions (particularly sgdt), and were crashing when run on
> UMIP-enabled systems.

Emulation support for 64-bit processes was not initially included
because no use cases had been identified. Brendan has found one.

Here is the relevant e-mail thread: https://lkml.org/lkml/2017/1/26/12

FWIW,

Reviewed-by: Ricardo Neri 

Only one minor comment below...

> 
> Originally-by: Ricardo Neri 
> Signed-off-by: Brendan Shanks 
> ---
>  arch/x86/kernel/umip.c | 55 +-
>  1 file changed, 33 insertions(+), 22 deletions(-)
> 
> diff --git a/arch/x86/kernel/umip.c b/arch/x86/kernel/umip.c
> index 5b345add550f..1812e95d2f55 100644
> --- a/arch/x86/kernel/umip.c
> +++ b/arch/x86/kernel/umip.c
> @@ -51,9 +51,7 @@
>   * The instruction smsw is emulated to return the value that the register CR0
>   * has at boot time as set in the head_32.
>   *
> - * Also, emulation is provided only for 32-bit processes; 64-bit processes
> - * that attempt to use the instructions that UMIP protects will receive the
> - * SIGSEGV signal issued as a consequence of the general protection fault.
> + * Emulation is provided for both 32-bit and 64-bit processes.
>   *
>   * Care is taken to appropriately emulate the results when segmentation is
>   * used. That is, rather than relying on USER_DS and USER_CS, the function
> @@ -63,17 +61,18 @@
>   * application uses a local descriptor table.
>   */
>  
> -#define UMIP_DUMMY_GDT_BASE 0xfffe0000
> -#define UMIP_DUMMY_IDT_BASE 0xffff0000
> +#define UMIP_DUMMY_GDT_BASE 0xfffffffffffe0000ULL
> +#define UMIP_DUMMY_IDT_BASE 0xffffffffffff0000ULL
>  
>  /*
>   * The SGDT and SIDT instructions store the contents of the global descriptor
>   * table and interrupt table registers, respectively. The destination is a
>   * memory operand of X+2 bytes. X bytes are used to store the base address of
> - * the table and 2 bytes are used to store the limit. In 32-bit processes, 
> the
> - * only processes for which emulation is provided, X has a value of 4.
> + * the table and 2 bytes are used to store the limit. In 32-bit processes X
> + * has a value of 4, in 64-bit processes X has a value of 8.
>   */
> -#define UMIP_GDT_IDT_BASE_SIZE 4
> +#define UMIP_GDT_IDT_BASE_SIZE_64BIT 8
> +#define UMIP_GDT_IDT_BASE_SIZE_32BIT 4
>  #define UMIP_GDT_IDT_LIMIT_SIZE 2
>  
>  #define  UMIP_INST_SGDT  0   /* 0F 01 /0 */
> @@ -189,6 +188,7 @@ static int identify_insn(struct insn *insn)
>   * @umip_inst:   A constant indicating the instruction to emulate
>   * @data:Buffer into which the dummy result is stored
>   * @data_size:   Size of the emulated result
> + * @x86_64: true if process is 64-bit, false otherwise
>   *
>   * Emulate an instruction protected by UMIP and provide a dummy result. The
>   * result of the emulation is saved in @data. The size of the results depends
> @@ -202,11 +202,8 @@ static int identify_insn(struct insn *insn)
>   * 0 on success, -EINVAL on error while emulating.
>   */
>  static int emulate_umip_insn(struct insn *insn, int umip_inst,
> -  unsigned char *data, int *data_size)
> +  unsigned char *data, int *data_size, bool x86_64)
>  {
> - unsigned long dummy_base_addr, dummy_value;
> - unsigned short dummy_limit = 0;
> -
>   if (!data || !data_size || !insn)
>   return -EINVAL;
>   /*
> @@ -219,6 +216,9 @@ static int emulate_umip_insn(struct insn *insn, int 
> umip_inst,
>* is always returned irrespective of the operand size.
>*/
>   if (umip_inst == UMIP_INST_SGDT || umip_inst == UMIP_INST_SIDT) {
> + u64 dummy_base_addr;
> + u16 dummy_limit = 0;
> +
>   /* SGDT and SIDT do not use registers operands. */
>   if (X86_MODRM_MOD(insn->modrm.value) == 3)
>   return -EINVAL;
> @@ -228,13 +228,24 @@ static int emulate_umip_insn(struct insn *insn, int 
> umip_inst,
>   else
>   dummy_base_addr = UMIP_DUMMY_IDT_BASE;
>  
> - *data_size = UMIP_GDT_IDT_LIMIT_SIZE + UMIP_GDT_IDT_BASE_SIZE;

Maybe a blank line here?

Thanks and BR,
Ricardo


Re: [Kernel BUG?] SMSW operation get success on UMIP KVM guest

2019-07-01 Thread Ricardo Neri
On Mon, Jul 01, 2019 at 08:57:28PM +0800, Li Wang wrote:
> On Mon, Jul 1, 2019 at 8:02 PM Paolo Bonzini  wrote:
> 
> > On 01/07/19 09:50, Li Wang wrote:
> > > Hello there,
> > >
> > > LTP/umip_basic_test get failed on KVM UMIP
> > > system(kernel-v5.2-rc4.x86_64). The test is only trying to do
> > >  asm volatile("smsw %0\n" : "=m" (val));
> > > and expect to get SIGSEGV in this SMSW operation, but it exits with 0
> > > unexpectedly.
> >
> > In addition to what Thomas said, perhaps you are using a host that does
> > *not* have UMIP, and configuring KVM to emulate it(*).  In that case, it
> > is not possible to intercept SMSW, and therefore it will incorrectly
> > succeed.
> >
> 
> Right, I checked the host system, and confirmed that CPU doesn't support
> UMIP.
> 
> >
> > Paolo
> >
> > (*) before the x86 people jump at me, this won't happen unless you
> > explicitly pass an option to QEMU, such as "-cpu host,+umip". :)  The
> > incorrect emulation of SMSW when CR4.UMIP=1 is why.
> >
> Good to know this, is there any document for that declaration? It seems
> neither LTP issue nor kernel bug here. But anyway we'd better do something
> to avoid the error in the test.

The test case already checks for umip in /proc/cpuinfo, right? And in
long mode it always expects a SIGSEGV signal. If you did not add -cpu 
host,+umip,
how come umip was present in /proc/cpuinfo?
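
For reference, the shape of such a user-space check is roughly the
following (a minimal sketch, not the LTP code itself):

	#include <signal.h>
	#include <setjmp.h>
	#include <stdio.h>

	static sigjmp_buf env;

	static void segv_handler(int sig)
	{
		siglongjmp(env, 1);
	}

	int main(void)
	{
		unsigned long val = 0;

		signal(SIGSEGV, segv_handler);
		if (sigsetjmp(env, 1)) {
			puts("SMSW faulted as expected (UMIP enforced)");
			return 0;
		}
		asm volatile("smsw %0\n" : "=m" (val));
		printf("SMSW succeeded unexpectedly: 0x%lx\n", val);
		return 1;
	}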

Thanks and BR,
Ricardo


Re: [Kernel BUG?] SMSW operation get success on UMIP KVM guest

2019-07-01 Thread Ricardo Neri
On Mon, Jul 01, 2019 at 02:02:35PM +0200, Paolo Bonzini wrote:
> On 01/07/19 09:50, Li Wang wrote:
> > Hello there,
> > 
> > LTP/umip_basic_test get failed on KVM UMIP
> > system(kernel-v5.2-rc4.x86_64). The test is only trying to do
> >      asm volatile("smsw %0\n" : "=m" (val));
> > and expect to get SIGSEGV in this SMSW operation, but it exits with 0
> > unexpectedly.
> 
> In addition to what Thomas said, perhaps you are using a host that does
> *not* have UMIP, and configuring KVM to emulate it(*).  In that case, it
> is not possible to intercept SMSW, and therefore it will incorrectly
> succeed.

Also, emulation for SMSW, SIDT, and SGDT is done only for 32-bit
processes. As Thomas said, the purpose is to not break Wine. In 64-bit
processes, we should always see a #GP exception.
> 
> Paolo
> 
> (*) before the x86 people jump at me, this won't happen unless you
> explicitly pass an option to QEMU, such as "-cpu host,+umip". :)  The
> incorrect emulation of SMSW when CR4.UMIP=1 is why.

Paolo, what do you mean by the incorrect emulation of SMSW?

Thanks and BR,
Ricardo


[tip:x86/cpu] x86/mtrr: Skip cache flushes on CPUs with cache self-snooping

2019-06-27 Thread tip-bot for Ricardo Neri
Commit-ID:  fd329f276ecaad7a371d6f91b9bbea031d0c3440
Gitweb: https://git.kernel.org/tip/fd329f276ecaad7a371d6f91b9bbea031d0c3440
Author: Ricardo Neri 
AuthorDate: Thu, 27 Jun 2019 19:35:37 -0700
Committer:  Thomas Gleixner 
CommitDate: Fri, 28 Jun 2019 07:21:00 +0200

x86/mtrr: Skip cache flushes on CPUs with cache self-snooping

Programming MTRR registers in multi-processor systems is a rather lengthy
process. Furthermore, all processors must program these registers in lock
step and with interrupts disabled; the process also involves flushing
caches and TLBs twice. As a result, the process may take a considerable
amount of time.

On some platforms, this can lead to a large skew of the refined-jiffies
clock source. Early when booting, if no other clock is available (e.g.,
booting with hpet=disabled), the refined-jiffies clock source is used to
monitor the TSC clock source. If the skew of refined-jiffies is too large,
Linux wrongly assumes that the TSC is unstable:

  clocksource: timekeeping watchdog on CPU1: Marking clocksource
   'tsc-early' as unstable because the skew is too large:
  clocksource: 'refined-jiffies' wd_now: fffedc10 wd_last:
   fffedb90 mask: 
  clocksource: 'tsc-early' cs_now: 5eccfddebc cs_last: 5e7e3303d4
   mask: 
  tsc: Marking TSC unstable due to clocksource watchdog

As per measurements, around 98% of the time needed by the procedure to
program MTRRs in multi-processor systems is spent flushing caches with
wbinvd(). As per the Section 11.11.8 of the Intel 64 and IA 32
Architectures Software Developer's Manual, it is not necessary to flush
caches if the CPU supports cache self-snooping. Thus, skipping the cache
flushes can reduce by several tens of milliseconds the time needed to
complete the programming of the MTRR registers:

Platform                        Before  After
104-core (208 Threads) Skylake  1437ms  28ms
  2-core (  4 Threads) Haswell   114ms   2ms

Reported-by: Mohammad Etemadi 
Signed-off-by: Ricardo Neri 
Signed-off-by: Thomas Gleixner 
Cc: Borislav Petkov 
Cc: Alan Cox 
Cc: Tony Luck 
Cc: "H. Peter Anvin" 
Cc: Andy Shevchenko 
Cc: Andi Kleen 
Cc: Hans de Goede 
Cc: Greg Kroah-Hartman 
Cc: Jordan Borgner 
Cc: "Ravi V. Shankar" 
Cc: Ricardo Neri 
Cc: Andy Shevchenko 
Cc: Andi Kleen 
Cc: Peter Feiner 
Cc: "Rafael J. Wysocki" 
Link: 
https://lkml.kernel.org/r/1561689337-19390-3-git-send-email-ricardo.neri-calde...@linux.intel.com
---
 arch/x86/kernel/cpu/mtrr/generic.c | 15 +--
 1 file changed, 13 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/cpu/mtrr/generic.c 
b/arch/x86/kernel/cpu/mtrr/generic.c
index 9356c1c9024d..aa5c064a6a22 100644
--- a/arch/x86/kernel/cpu/mtrr/generic.c
+++ b/arch/x86/kernel/cpu/mtrr/generic.c
@@ -743,7 +743,15 @@ static void prepare_set(void) 
__acquires(set_atomicity_lock)
/* Enter the no-fill (CD=1, NW=0) cache mode and flush caches. */
cr0 = read_cr0() | X86_CR0_CD;
write_cr0(cr0);
-   wbinvd();
+
+   /*
+* Cache flushing is the most time-consuming step when programming
+* the MTRRs. Fortunately, as per the Intel Software Development
+* Manual, we can skip it if the processor supports cache self-
+* snooping.
+*/
+   if (!static_cpu_has(X86_FEATURE_SELFSNOOP))
+   wbinvd();
 
/* Save value of CR4 and clear Page Global Enable (bit 7) */
if (boot_cpu_has(X86_FEATURE_PGE)) {
@@ -760,7 +768,10 @@ static void prepare_set(void) 
__acquires(set_atomicity_lock)
 
/* Disable MTRRs, and set the default type to uncached */
mtrr_wrmsr(MSR_MTRRdefType, deftype_lo & ~0xcff, deftype_hi);
-   wbinvd();
+
+   /* Again, only flush caches if we have to. */
+   if (!static_cpu_has(X86_FEATURE_SELFSNOOP))
+   wbinvd();
 }
 
 static void post_set(void) __releases(set_atomicity_lock)


[tip:x86/cpu] x86/cpu/intel: Clear cache self-snoop capability in CPUs with known errata

2019-06-27 Thread tip-bot for Ricardo Neri
Commit-ID:  1e03bff3600101bd9158d005e4313132e55bdec8
Gitweb: https://git.kernel.org/tip/1e03bff3600101bd9158d005e4313132e55bdec8
Author: Ricardo Neri 
AuthorDate: Thu, 27 Jun 2019 19:35:36 -0700
Committer:  Thomas Gleixner 
CommitDate: Fri, 28 Jun 2019 07:20:48 +0200

x86/cpu/intel: Clear cache self-snoop capability in CPUs with known errata

Processors which have self-snooping capability can handle conflicting
memory type across CPUs by snooping its own cache. However, there exists
CPU models in which having conflicting memory types still leads to
unpredictable behavior, machine check errors, or hangs.

Clear this feature on affected CPUs to prevent its use.

Suggested-by: Alan Cox 
Signed-off-by: Ricardo Neri 
Signed-off-by: Thomas Gleixner 
Cc: Borislav Petkov 
Cc: Tony Luck 
Cc: "H. Peter Anvin" 
Cc: Andy Shevchenko 
Cc: Andi Kleen 
Cc: Hans de Goede 
Cc: Greg Kroah-Hartman 
Cc: Jordan Borgner 
Cc: "Ravi V. Shankar" 
Cc: Mohammad Etemadi 
Cc: Ricardo Neri 
Cc: Andy Shevchenko 
Cc: Andi Kleen 
Cc: Peter Feiner 
Cc: "Rafael J. Wysocki" 
Link: 
https://lkml.kernel.org/r/1561689337-19390-2-git-send-email-ricardo.neri-calde...@linux.intel.com

---
 arch/x86/kernel/cpu/intel.c | 27 +++
 1 file changed, 27 insertions(+)

diff --git a/arch/x86/kernel/cpu/intel.c b/arch/x86/kernel/cpu/intel.c
index f17c1a714779..8d6d92ebeb54 100644
--- a/arch/x86/kernel/cpu/intel.c
+++ b/arch/x86/kernel/cpu/intel.c
@@ -66,6 +66,32 @@ void check_mpx_erratum(struct cpuinfo_x86 *c)
}
 }
 
+/*
+ * Processors which have self-snooping capability can handle conflicting
+ * memory type across CPUs by snooping its own cache. However, there exists
+ * CPU models in which having conflicting memory types still leads to
+ * unpredictable behavior, machine check errors, or hangs. Clear this
+ * feature to prevent its use on machines with known erratas.
+ */
+static void check_memory_type_self_snoop_errata(struct cpuinfo_x86 *c)
+{
+   switch (c->x86_model) {
+   case INTEL_FAM6_CORE_YONAH:
+   case INTEL_FAM6_CORE2_MEROM:
+   case INTEL_FAM6_CORE2_MEROM_L:
+   case INTEL_FAM6_CORE2_PENRYN:
+   case INTEL_FAM6_CORE2_DUNNINGTON:
+   case INTEL_FAM6_NEHALEM:
+   case INTEL_FAM6_NEHALEM_G:
+   case INTEL_FAM6_NEHALEM_EP:
+   case INTEL_FAM6_NEHALEM_EX:
+   case INTEL_FAM6_WESTMERE:
+   case INTEL_FAM6_WESTMERE_EP:
+   case INTEL_FAM6_SANDYBRIDGE:
+   setup_clear_cpu_cap(X86_FEATURE_SELFSNOOP);
+   }
+}
+
 static bool ring3mwait_disabled __read_mostly;
 
 static int __init ring3mwait_disable(char *__unused)
@@ -304,6 +330,7 @@ static void early_init_intel(struct cpuinfo_x86 *c)
}
 
check_mpx_erratum(c);
+   check_memory_type_self_snoop_errata(c);
 
/*
 * Get the number of SMT siblings early from the extended topology


[PATCH v2 1/2] x86/cpu/intel: Clear cache self-snoop capability in CPUs with known errata

2019-06-27 Thread Ricardo Neri
Processors which have self-snooping capability can handle conflicting
memory type across CPUs by snooping its own cache. However, there exists
CPU models in which having conflicting memory types still leads to
unpredictable behavior, machine check errors, or hangs. Clear this feature
to prevent its use. 

Cc: Tony Luck 
Cc: "H. Peter Anvin" 
Cc: Andy Shevchenko 
Cc: Andi Kleen 
Cc: Hans de Goede 
Cc: Peter Feiner 
Cc: "Rafael J. Wysocki" 
Cc: Greg Kroah-Hartman 
Cc: Jordan Borgner 
Cc: "Ravi V. Shankar" 
Cc: x...@kernel.org
Cc: linux-kernel@vger.kernel.org
Suggested-by: Alan Cox 
Signed-off-by: Ricardo Neri 
---
 arch/x86/kernel/cpu/intel.c | 27 +++
 1 file changed, 27 insertions(+)

diff --git a/arch/x86/kernel/cpu/intel.c b/arch/x86/kernel/cpu/intel.c
index f17c1a714779..62e366ec0812 100644
--- a/arch/x86/kernel/cpu/intel.c
+++ b/arch/x86/kernel/cpu/intel.c
@@ -66,6 +66,32 @@ void check_mpx_erratum(struct cpuinfo_x86 *c)
}
 }
 
+/*
+ * Processors which have self-snooping capability can handle conflicting
+ * memory type across CPUs by snooping its own cache. However, there exists
+ * CPU models in which having conflicting memory types still leads to
+ * unpredictable behavior, machine check errors, or hangs. Clear this feature
+ * to prevent its use.
+ */
+static void check_memory_type_self_snoop_errata(struct cpuinfo_x86 *c)
+{
+   switch (c->x86_model) {
+   case INTEL_FAM6_CORE_YONAH:
+   case INTEL_FAM6_CORE2_MEROM:
+   case INTEL_FAM6_CORE2_MEROM_L:
+   case INTEL_FAM6_CORE2_PENRYN:
+   case INTEL_FAM6_CORE2_DUNNINGTON:
+   case INTEL_FAM6_NEHALEM:
+   case INTEL_FAM6_NEHALEM_G:
+   case INTEL_FAM6_NEHALEM_EP:
+   case INTEL_FAM6_NEHALEM_EX:
+   case INTEL_FAM6_WESTMERE:
+   case INTEL_FAM6_WESTMERE_EP:
+   case INTEL_FAM6_SANDYBRIDGE:
+   setup_clear_cpu_cap(X86_FEATURE_SELFSNOOP);
+   }
+}
+
 static bool ring3mwait_disabled __read_mostly;
 
 static int __init ring3mwait_disable(char *__unused)
@@ -304,6 +330,7 @@ static void early_init_intel(struct cpuinfo_x86 *c)
}
 
check_mpx_erratum(c);
+   check_memory_type_self_snoop_errata(c);
 
/*
 * Get the number of SMT siblings early from the extended topology
-- 
2.17.1



[PATCH v2 2/2] x86, mtrr: generic: Skip cache flushes on CPUs with cache self-snooping

2019-06-27 Thread Ricardo Neri
Programming MTRR registers in multi-processor systems is a rather lengthy
process. Furthermore, all processors must program these registers in lock
step and with interrupts disabled; the process also involves flushing
caches and TLBs twice. As a result, the process may take a considerable
amount of time.

In some platforms, this can lead to a large skew of the refined-jiffies
clock source. Early when booting, if no other clock is available (e.g.,
booting with hpet=disabled), the refined-jiffies clock source is used to
monitor the TSC clock source. If the skew of refined-jiffies is too large,
Linux wrongly assumes that the TSC is unstable:

  clocksource: timekeeping watchdog on CPU1: Marking clocksource
   'tsc-early' as unstable because the skew is too large:
  clocksource: 'refined-jiffies' wd_now: fffedc10 wd_last:
   fffedb90 mask: 
  clocksource: 'tsc-early' cs_now: 5eccfddebc cs_last: 5e7e3303d4
   mask: 
  tsc: Marking TSC unstable due to clocksource watchdog

As per my measurements, around 98% of the time needed by the procedure to
program MTRRs in multi-processor systems is spent flushing caches with
wbinvd(). As per the Section 11.11.8 of the Intel 64 and IA 32
Architectures Software Developer's Manual, it is not necessary to flush
caches if the CPU supports cache self-snooping. Thus, skipping the cache
flushes can reduce by several tens of milliseconds the time needed to
complete the programming of the MTRR registers.

Cc: Alan Cox 
Cc: Tony Luck 
Cc: "H. Peter Anvin" 
Cc: Andy Shevchenko 
Cc: Andi Kleen 
Cc: Hans de Goede 
Cc: Peter Feiner 
Cc: "Rafael J. Wysocki" 
Cc: Greg Kroah-Hartman 
Cc: Jordan Borgner 
Cc: x...@kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: "Ravi V. Shankar" 
Reported-by: Mohammad Etemadi 
Signed-off-by: Ricardo Neri 
---
 arch/x86/kernel/cpu/mtrr/generic.c | 15 +--
 1 file changed, 13 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/cpu/mtrr/generic.c 
b/arch/x86/kernel/cpu/mtrr/generic.c
index 9356c1c9024d..aa5c064a6a22 100644
--- a/arch/x86/kernel/cpu/mtrr/generic.c
+++ b/arch/x86/kernel/cpu/mtrr/generic.c
@@ -743,7 +743,15 @@ static void prepare_set(void) 
__acquires(set_atomicity_lock)
/* Enter the no-fill (CD=1, NW=0) cache mode and flush caches. */
cr0 = read_cr0() | X86_CR0_CD;
write_cr0(cr0);
-   wbinvd();
+
+   /*
+* Cache flushing is the most time-consuming step when programming
+* the MTRRs. Fortunately, as per the Intel Software Development
+* Manual, we can skip it if the processor supports cache self-
+* snooping.
+*/
+   if (!static_cpu_has(X86_FEATURE_SELFSNOOP))
+   wbinvd();
 
/* Save value of CR4 and clear Page Global Enable (bit 7) */
if (boot_cpu_has(X86_FEATURE_PGE)) {
@@ -760,7 +768,10 @@ static void prepare_set(void) 
__acquires(set_atomicity_lock)
 
/* Disable MTRRs, and set the default type to uncached */
mtrr_wrmsr(MSR_MTRRdefType, deftype_lo & ~0xcff, deftype_hi);
-   wbinvd();
+
+   /* Again, only flush caches if we have to. */
+   if (!static_cpu_has(X86_FEATURE_SELFSNOOP))
+   wbinvd();
 }
 
 static void post_set(void) __releases(set_atomicity_lock)
-- 
2.17.1



[PATCH v2 0/2] Speed MTRR programming up when we can

2019-06-27 Thread Ricardo Neri
This is the second iteration of this patchset. The first iteration can be
viewed here [1]. 

Programming MTRR registers in multi-processor systems is a rather lengthy
process. Furthermore, all processors must program these registers in lock
step and with interrupts disabled; the process also involves flushing
caches and TLBs twice. As a result, the process may take a considerable
amount of time.

In some platforms, this can lead to a large skew of the refined-jiffies
clock source. Early when booting, if no other clock is available (e.g.,
booting with hpet=disabled), the refined-jiffies clock source is used to
monitor the TSC clock source. If the skew of refined-jiffies is too large,
Linux wrongly assumes that the TSC is unstable:

 clocksource: timekeeping watchdog on CPU1: Marking clocksource
  'tsc-early' as unstable because the skew is too large:
 clocksource: 'refined-jiffies' wd_now: fffedc10 wd_last:
  fffedb90 mask: 
 clocksource: 'tsc-early' cs_now: 5eccfddebc cs_last: 5e7e3303d4
  mask: 
 tsc: Marking TSC unstable due to clocksource watchdog

As per my measurements, around 98% of the time needed by the procedure to
program MTRRs in multi-processor systems is spent flushing caches with
wbinvd(). As per the Section 11.11.8 of the Intel 64 and IA 32
Architectures Software Developer's Manual, it is not necessary to flush
caches if the CPU supports cache self-snooping. Thus, skipping the cache
flushes can reduce by several tens of milliseconds the time needed to
complete the programming of the MTRR registers.

However, there exist CPU models with errata that affect their self-
snooping capabilities. Such errata may cause unpredictable behavior,
machine check errors, or hangs. For instance:

 "Where two different logical processors have the same code page
  mapped with two different memory types Specifically if one code
  page is mapped by one logical processor as write back and by
  another as uncacheable and certain instruction timing conditions
  occur the system may experience unpredictable behaviour." [2].

Similar errata are reported in other processors as well [3], [4], [5],
[6], and [7].

Thus, in order to confidently leverage self-snooping for the MTRR
programming algorithm, we must first clear such feature in models with
known errata.

By measuring the execution time of mtrr_aps_init() (from which MTRRs
in all CPUs are programmed in lock-step at boot), I find savings in the
time required to program MTRRs as follows:

Platform                   time-with-wbinvd(ms)  time-no-wbinvd(ms)
104-core (208 LP) Skylake                  1437                 28
2-core (4 LP) Haswell                       114                  2

LP = Logical Processor

Thanks and BR,
Ricardo

Changes since v1:

 * Relocated comment on the utility of cache self-snooping from
   check_memory_type_self_snoop_errata() to the prepare_set() function
   of the generic MTRR programming ops (Thomas Gleixner).
 * In early_init_intel(), moved check_memory_type_self_snoop_errata()
   next to check_mpx_erratum() for improved readability.

[1]. https://lkml.org/lkml/2019/6/27/828
[2]. Erratum BF52, 
https://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/xeon-3600-specification-update.pdf
[3]. Erratum BK47, 
https://www.mouser.com/pdfdocs/2ndgencorefamilymobilespecificationupdate.pdf
[4]. Erratum AAO54, 
https://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/xeon-c5500-c3500-spec-update.pdf
[5]. Errata AZ39, AZ42, 
https://www.intel.com/content/dam/support/us/en/documents/processors/mobile/celeron/sb/320121.pdf
[6]. Errata AQ51, AQ102, AQ104, 
https://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/pentium-dual-core-desktop-e2000-specification-update.pdf
[7]. Errata AN107, AN109, 
https://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/pentium-dual-core-specification-update.pdf

Ricardo Neri (2):
  x86/cpu/intel: Clear cache self-snoop capability in CPUs with known
errata
  x86, mtrr: generic: Skip cache flushes on CPUs with cache
self-snooping

 arch/x86/kernel/cpu/intel.c| 27 +++
 arch/x86/kernel/cpu/mtrr/generic.c | 15 +--
 2 files changed, 40 insertions(+), 2 deletions(-)

-- 
2.17.1



Re: [PATCH 1/2] x86/cpu/intel: Clear cache self-snoop capability in CPUs with known errata

2019-06-27 Thread Ricardo Neri
On Thu, Jun 27, 2019 at 10:38:13PM +0200, Thomas Gleixner wrote:
> Ricardo,
> 
> On Thu, 27 Jun 2019, Ricardo Neri wrote:
> >  
> > +/*
> > + * Processors which have self-snooping capability can handle conflicting
> > + * memory type across CPUs by snooping its own cache. However, there exists
> > + * CPU models in which having conflicting memory types still leads to
> > + * unpredictable behavior, machine check errors, or hangs. Clear this 
> > feature
> > + * to prevent its use. For instance, the algorithm to program the Memory 
> > Type
> > + * Region Registers and the Page Attribute Table MSR can skip expensive 
> > cache
> > + * flushes if self-snooping is supported.
> 
> I appreciate informative comments, but this is the part which disables a
> feature on errata inflicted CPUs. So the whole information about what
> self-snooping helps with is not that interesting here. It's broken, we
> disable it and be done with it.

Sure, Thomas. I will move the comment about the usefulness of self-snooping
to the MTRR programming function, as you mention below.
> 
> > + */
> > +static void check_memory_type_self_snoop_errata(struct cpuinfo_x86 *c)
> > +{
> > +   switch (c->x86_model) {
> > +   case INTEL_FAM6_CORE_YONAH:
> > +   case INTEL_FAM6_CORE2_MEROM:
> > +   case INTEL_FAM6_CORE2_MEROM_L:
> > +   case INTEL_FAM6_CORE2_PENRYN:
> > +   case INTEL_FAM6_CORE2_DUNNINGTON:
> > +   case INTEL_FAM6_NEHALEM:
> > +   case INTEL_FAM6_NEHALEM_G:
> > +   case INTEL_FAM6_NEHALEM_EP:
> > +   case INTEL_FAM6_NEHALEM_EX:
> > +   case INTEL_FAM6_WESTMERE:
> > +   case INTEL_FAM6_WESTMERE_EP:
> > +   case INTEL_FAM6_SANDYBRIDGE:
> > +   setup_clear_cpu_cap(X86_FEATURE_SELFSNOOP);
> > +   }
> > +}
> > +
> 
> But looking at the actual interesting part of the 2nd patch:
> 
> > @@ -743,7 +743,9 @@ static void prepare_set(void) 
> > __acquires(set_atomicity_lock)
> >/* Enter the no-fill (CD=1, NW=0) cache mode and flush caches. */
> >cr0 = read_cr0() | X86_CR0_CD;
> >write_cr0(cr0);
> > -   wbinvd();
> > +
> > +   if (!static_cpu_has(X86_FEATURE_SELFSNOOP))
> > +   wbinvd();
> 
> This part lacks any form of explanation. So I'd rather have the comment
> about why we can avoid the wbindv() here. I''d surely never would look at
> that errata handling function to get that information.
> 
> Other than that detail, the patches are well done!

Thank you, Thomas!

BR,
Ricardo


[PATCH 1/2] x86/cpu/intel: Clear cache self-snoop capability in CPUs with known errata

2019-06-27 Thread Ricardo Neri
Processors which have self-snooping capability can handle conflicting
memory type across CPUs by snooping its own cache. However, there exists
CPU models in which having conflicting memory types still leads to
unpredictable behavior, machine check errors, or hangs. Clear this feature
to prevent its use. For instance, the algorithm to program the Memory Type
Region Registers and the Page Attribute Table MSR can skip expensive cache
flushes if self-snooping is supported.

Cc: Tony Luck 
Cc: "H. Peter Anvin" 
Cc: Andy Shevchenko 
Cc: Andi Kleen 
Cc: Hans de Goede 
Cc: Peter Feiner 
Cc: "Rafael J. Wysocki" 
Cc: Greg Kroah-Hartman 
Cc: Jordan Borgner 
Cc: "Ravi V. Shankar" 
Cc: x...@kernel.org
Cc: linux-kernel@vger.kernel.org
Suggested-by: Alan Cox 
Signed-off-by: Ricardo Neri 
---
 arch/x86/kernel/cpu/intel.c | 30 ++
 1 file changed, 30 insertions(+)

diff --git a/arch/x86/kernel/cpu/intel.c b/arch/x86/kernel/cpu/intel.c
index f17c1a714779..8684c66e71e0 100644
--- a/arch/x86/kernel/cpu/intel.c
+++ b/arch/x86/kernel/cpu/intel.c
@@ -66,6 +66,34 @@ void check_mpx_erratum(struct cpuinfo_x86 *c)
}
 }
 
+/*
+ * Processors which have self-snooping capability can handle conflicting
+ * memory type across CPUs by snooping its own cache. However, there exists
+ * CPU models in which having conflicting memory types still leads to
+ * unpredictable behavior, machine check errors, or hangs. Clear this feature
+ * to prevent its use. For instance, the algorithm to program the Memory Type
+ * Region Registers and the Page Attribute Table MSR can skip expensive cache
+ * flushes if self-snooping is supported.
+ */
+static void check_memory_type_self_snoop_errata(struct cpuinfo_x86 *c)
+{
+   switch (c->x86_model) {
+   case INTEL_FAM6_CORE_YONAH:
+   case INTEL_FAM6_CORE2_MEROM:
+   case INTEL_FAM6_CORE2_MEROM_L:
+   case INTEL_FAM6_CORE2_PENRYN:
+   case INTEL_FAM6_CORE2_DUNNINGTON:
+   case INTEL_FAM6_NEHALEM:
+   case INTEL_FAM6_NEHALEM_G:
+   case INTEL_FAM6_NEHALEM_EP:
+   case INTEL_FAM6_NEHALEM_EX:
+   case INTEL_FAM6_WESTMERE:
+   case INTEL_FAM6_WESTMERE_EP:
+   case INTEL_FAM6_SANDYBRIDGE:
+   setup_clear_cpu_cap(X86_FEATURE_SELFSNOOP);
+   }
+}
+
 static bool ring3mwait_disabled __read_mostly;
 
 static int __init ring3mwait_disable(char *__unused)
@@ -311,6 +339,8 @@ static void early_init_intel(struct cpuinfo_x86 *c)
 */
if (detect_extended_topology_early(c) < 0)
detect_ht_early(c);
+
+   check_memory_type_self_snoop_errata(c);
 }
 
 #ifdef CONFIG_X86_32
-- 
2.17.1



[PATCH 0/2] Speed MTRR programming up when we can

2019-06-27 Thread Ricardo Neri
Programming MTRR registers in multi-processor systems is a rather lengthy
process. Furthermore, all processors must program these registers in lock
step and with interrupts disabled; the process also involves flushing
caches and TLBs twice. As a result, the process may take a considerable
amount of time.

In some platforms, this can lead to a large skew of the refined-jiffies
clock source. Early when booting, if no other clock is available (e.g.,
booting with hpet=disabled), the refined-jiffies clock source is used to
monitor the TSC clock source. If the skew of refined-jiffies is too large,
Linux wrongly assumes that the TSC is unstable:

 clocksource: timekeeping watchdog on CPU1: Marking clocksource
  'tsc-early' as unstable because the skew is too large:
 clocksource: 'refined-jiffies' wd_now: fffedc10 wd_last:
  fffedb90 mask: 
 clocksource: 'tsc-early' cs_now: 5eccfddebc cs_last: 5e7e3303d4
  mask: 
 tsc: Marking TSC unstable due to clocksource watchdog

As per my measurements, around 98% of the time needed by the procedure to
program MTRRs in multi-processor systems is spent flushing caches with
wbinvd(). As per the Section 11.11.8 of the Intel 64 and IA 32
Architectures Software Developer's Manual, it is not necessary to flush
caches if the CPU supports cache self-snooping. Thus, skipping the cache
flushes can reduce by several tens of milliseconds the time needed to
complete the programming of the MTRR registers.

By measuring the execution time of mtrr_aps_init() (from which MTRRs
in all CPUs are programmed in lock-step at boot), I find savings in the
time required to program MTRRs as follows:

Platform                   time-with-wbinvd(ms)  time-no-wbinvd(ms)
104-core (208 LP) Skylake                  1437                 28
2-core (4 LP) Haswell                       114                  2

However, there exist CPU models with errata that affect their self-
snooping capability. Such errata may cause unpredictable behavior, machine
check errors, or hangs. For instance:

 "Where two different logical processors have the same code page
  mapped with two different memory types Specifically if one code
  page is mapped by one logical processor as write back and by
  another as uncacheable and certain instruction timing conditions
  occur the system may experience unpredictable behaviour." [1].

Similar errata are reported in other processors as well [2], [3], [4],
[5], and [6].

Thus, in order to confidently leverage self-snooping for the MTRR
programming algorithm, we must first clear such feature in models with
known errata.


Thanks and BR,
Ricardo

LP = Logical Processor

[1]. Erratum BF52, 
https://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/xeon-3600-specification-update.pdf
[2]. Erratum BK47, 
https://www.mouser.com/pdfdocs/2ndgencorefamilymobilespecificationupdate.pdf
[3]. Erratum AAO54, 
https://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/xeon-c5500-c3500-spec-update.pdf
[4]. Errata AZ39, AZ42, 
https://www.intel.com/content/dam/support/us/en/documents/processors/mobile/celeron/sb/320121.pdf
[5]. Errata AQ51, AQ102, AQ104, 
https://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/pentium-dual-core-desktop-e2000-specification-update.pdf
[6]. Errata AN107, AN109, 
https://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/pentium-dual-core-specification-update.pdf

Ricardo Neri (2):
  x86/cpu/intel: Clear cache self-snoop capability in CPUs with known
errata
  x86, mtrr: generic: Skip cache flushes on CPUs with cache
self-snooping

 arch/x86/kernel/cpu/intel.c| 30 ++
 arch/x86/kernel/cpu/mtrr/generic.c |  8 ++--
 2 files changed, 36 insertions(+), 2 deletions(-)

-- 
2.17.1



[PATCH 2/2] x86, mtrr: generic: Skip cache flushes on CPUs with cache self-snooping

2019-06-27 Thread Ricardo Neri
Programming MTRR registers in multi-processor systems is a rather lengthy
process. Furthermore, all processors must program these registers in lock
step and with interrupts disabled; the process also involves flushing
caches and TLBs twice. As a result, the process may take a considerable
amount of time.

In some platforms, this can lead to a large skew of the refined-jiffies
clock source. Early when booting, if no other clock is available (e.g.,
booting with hpet=disabled), the refined-jiffies clock source is used to
monitor the TSC clock source. If the skew of refined-jiffies is too large,
Linux wrongly assumes that the TSC is unstable:

  clocksource: timekeeping watchdog on CPU1: Marking clocksource
   'tsc-early' as unstable because the skew is too large:
  clocksource: 'refined-jiffies' wd_now: fffedc10 wd_last:
   fffedb90 mask: 
  clocksource: 'tsc-early' cs_now: 5eccfddebc cs_last: 5e7e3303d4
   mask: 
  tsc: Marking TSC unstable due to clocksource watchdog

As per my measurements, around 98% of the time needed by the procedure to
program MTRRs in multi-processor systems is spent flushing caches with
wbinvd(). As per the Section 11.11.8 of the Intel 64 and IA 32
Architectures Software Developer's Manual, it is not necessary to flush
caches if the CPU supports cache self-snooping. Thus, skipping the cache
flushes can reduce by several tens of milliseconds the time needed to
complete the programming of the MTRR registers.

Cc: Alan Cox 
Cc: Tony Luck 
Cc: "H. Peter Anvin" 
Cc: Andy Shevchenko 
Cc: Andi Kleen 
Cc: Hans de Goede 
Cc: Peter Feiner 
Cc: "Rafael J. Wysocki" 
Cc: Greg Kroah-Hartman 
Cc: Jordan Borgner 
Cc: x...@kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: "Ravi V. Shankar" 
Reported-by: Mohammad Etemadi 
Signed-off-by: Ricardo Neri 
---
 arch/x86/kernel/cpu/mtrr/generic.c | 8 ++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/cpu/mtrr/generic.c 
b/arch/x86/kernel/cpu/mtrr/generic.c
index 9356c1c9024d..169672a6935c 100644
--- a/arch/x86/kernel/cpu/mtrr/generic.c
+++ b/arch/x86/kernel/cpu/mtrr/generic.c
@@ -743,7 +743,9 @@ static void prepare_set(void) __acquires(set_atomicity_lock)
/* Enter the no-fill (CD=1, NW=0) cache mode and flush caches. */
cr0 = read_cr0() | X86_CR0_CD;
write_cr0(cr0);
-   wbinvd();
+
+   if (!static_cpu_has(X86_FEATURE_SELFSNOOP))
+   wbinvd();
 
/* Save value of CR4 and clear Page Global Enable (bit 7) */
if (boot_cpu_has(X86_FEATURE_PGE)) {
@@ -760,7 +762,9 @@ static void prepare_set(void) __acquires(set_atomicity_lock)
 
/* Disable MTRRs, and set the default type to uncached */
mtrr_wrmsr(MSR_MTRRdefType, deftype_lo & ~0xcff, deftype_hi);
-   wbinvd();
+
+   if (!static_cpu_has(X86_FEATURE_SELFSNOOP))
+   wbinvd();
 }
 
 static void post_set(void) __releases(set_atomicity_lock)
-- 
2.17.1



[RFC PATCH v4 06/21] x86/hpet: Configure the timer used by the hardlockup detector

2019-05-23 Thread Ricardo Neri
Implement the initial configuration of the timer to be used by the
hardlockup detector. Return a data structure with a description of the
timer; this information is subsequently used by the hardlockup detector.

Only provide the timer if it supports Front Side Bus interrupt delivery.
This condition greatly simplifies the implementation of the detector.
Specifically, it helps to avoid the complexities of routing the interrupt
via the IO-APIC (e.g., potential race conditions that arise from re-
programming the IO-APIC in NMI context).

Cc: "H. Peter Anvin" 
Cc: Ashok Raj 
Cc: Andi Kleen 
Cc: Tony Luck 
Cc: Clemens Ladisch 
Cc: Arnd Bergmann 
Cc: Philippe Ombredanne 
Cc: Kate Stewart 
Cc: "Rafael J. Wysocki" 
Cc: Stephane Eranian 
Cc: Suravee Suthikulpanit 
Cc: "Ravi V. Shankar" 
Cc: x...@kernel.org
Signed-off-by: Ricardo Neri 
---
 arch/x86/include/asm/hpet.h | 13 +
 arch/x86/kernel/hpet.c  | 35 +++
 2 files changed, 48 insertions(+)

diff --git a/arch/x86/include/asm/hpet.h b/arch/x86/include/asm/hpet.h
index 6f099e2781ce..20abdaa5372d 100644
--- a/arch/x86/include/asm/hpet.h
+++ b/arch/x86/include/asm/hpet.h
@@ -109,6 +109,19 @@ extern void hpet_set_comparator(int num, unsigned int cmp, 
unsigned int period);
 
 #endif /* CONFIG_HPET_EMULATE_RTC */
 
+#ifdef CONFIG_X86_HARDLOCKUP_DETECTOR_HPET
+struct hpet_hld_data {
+   boolhas_periodic;
+   u32 num;
+   u64 ticks_per_second;
+};
+
+extern struct hpet_hld_data *hpet_hardlockup_detector_assign_timer(void);
+#else
+static inline struct hpet_hld_data *hpet_hardlockup_detector_assign_timer(void)
+{ return NULL; }
+#endif /* CONFIG_X86_HARDLOCKUP_DETECTOR_HPET */
+
 #else /* CONFIG_HPET_TIMER */
 
 static inline int hpet_enable(void) { return 0; }
diff --git a/arch/x86/kernel/hpet.c b/arch/x86/kernel/hpet.c
index ff0250831786..5f9209949fc7 100644
--- a/arch/x86/kernel/hpet.c
+++ b/arch/x86/kernel/hpet.c
@@ -171,6 +171,41 @@ do {   
\
_hpet_print_config(__func__, __LINE__); \
 } while (0)
 
+#ifdef CONFIG_X86_HARDLOCKUP_DETECTOR_HPET
+struct hpet_hld_data *hpet_hardlockup_detector_assign_timer(void)
+{
+   struct hpet_hld_data *hdata;
+   u64 temp;
+   u32 cfg;
+
+   cfg = hpet_readl(HPET_Tn_CFG(HPET_WD_TIMER_NR));
+
+   if (!(cfg & HPET_TN_FSB_CAP))
+   return NULL;
+
+   hdata = kzalloc(sizeof(*hdata), GFP_KERNEL);
+   if (!hdata)
+   return NULL;
+
+   if (cfg & HPET_TN_PERIODIC_CAP)
+   hdata->has_periodic = true;
+
+   hdata->num = HPET_WD_TIMER_NR;
+
+   cfg = hpet_readl(HPET_PERIOD);
+
+   /*
+* hpet_get_ticks_per_sec() expects the contents of the general
+* capabilities register. The period is in the 32 most significant
+* bits.
+*/
+   temp = (u64)cfg << HPET_COUNTER_CLK_PERIOD_SHIFT;
+   hdata->ticks_per_second = hpet_get_ticks_per_sec(temp);
+
+   return hdata;
+}
+#endif /* CONFIG_X86_HARDLOCKUP_DETECTOR_HPET */
+
 /*
  * When the hpet driver (/dev/hpet) is enabled, we need to reserve
  * timer 0 and timer 1 in case of RTC emulation. Timer 2 is reserved in case
-- 
2.17.1



[RFC PATCH v4 13/21] x86/watchdog/hardlockup/hpet: Determine if HPET timer caused NMI

2019-05-23 Thread Ricardo Neri
The only direct method to determine whether an HPET timer caused an
interrupt is to read the Interrupt Status register. Unfortunately,
reading HPET registers is slow and, therefore, it is not recommended to
read them while in NMI context. Furthermore, status is not available if
the interrupt is generated via the Front Side Bus.

An indirect way to infer whether an observed non-maskable interrupt was
caused by the HPET timer is to use the time-stamp counter. Compute the
value that the time-stamp counter should have at the next interrupt of the
HPET timer. Since the hardlockup detector operates in seconds, high
precision is not needed. This implementation considers that the HPET
caused the NMI if the time-stamp counter reads the expected value +/- 1.5%.
This value is selected as it is equivalent to 1/64 and the division can be
performed using a bit shift operation. Experimentally, the error in the
estimation is consistently less than 1%.

The expected value of the time-stamp counter must be computed over an
interval of watchdog_thresh seconds divided by the number of monitored
CPUs. The TSC ticks per second divided by the number of monitored CPUs
is stored in tsc_ticks_per_cpu and must be updated whenever the number
of monitored CPUs changes.
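
For illustration only, a minimal sketch of the window check described above
(function and parameter names are hypothetical; the patch itself keeps the
expected value and the allowed error in tsc_next and tsc_next_error):

  /*
   * True when tsc_curr lies within [tsc_next - error, tsc_next + error).
   * The error is 1/64 (~1.5625%) of the programmed TSC delta, so it can
   * be computed with a right shift by 6 instead of a division.
   */
  static bool example_tsc_says_hpet_fired(u64 tsc_curr, u64 tsc_next,
                                          u64 tsc_delta)
  {
          u64 error = tsc_delta >> 6;

          return (tsc_curr - tsc_next) + error < 2 * error;
  }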

Cc: "H. Peter Anvin" 
Cc: Ashok Raj 
Cc: Andi Kleen 
Cc: Tony Luck 
Cc: Peter Zijlstra 
Cc: Clemens Ladisch 
Cc: Arnd Bergmann 
Cc: Philippe Ombredanne 
Cc: Kate Stewart 
Cc: "Rafael J. Wysocki" 
Cc: Mimi Zohar 
Cc: Jan Kiszka 
Cc: Nick Desaulniers 
Cc: Masahiro Yamada 
Cc: Nayna Jain 
Cc: Stephane Eranian 
Cc: Suravee Suthikulpanit 
Cc: "Ravi V. Shankar" 
Cc: x...@kernel.org
Suggested-by: Andi Kleen 
Signed-off-by: Ricardo Neri 
---
 arch/x86/include/asm/hpet.h |  2 ++
 arch/x86/kernel/watchdog_hld_hpet.c | 27 ++-
 2 files changed, 28 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/hpet.h b/arch/x86/include/asm/hpet.h
index 64acacce095d..fd99f2390714 100644
--- a/arch/x86/include/asm/hpet.h
+++ b/arch/x86/include/asm/hpet.h
@@ -115,6 +115,8 @@ struct hpet_hld_data {
u32 num;
u64 ticks_per_second;
u64 ticks_per_cpu;
+   u64 tsc_next;
+   u64 tsc_ticks_per_cpu;
u32 handling_cpu;
u32 enabled_cpus;
struct msi_msg  msi_msg;
diff --git a/arch/x86/kernel/watchdog_hld_hpet.c 
b/arch/x86/kernel/watchdog_hld_hpet.c
index 74aeb0535d08..dcc50cd29374 100644
--- a/arch/x86/kernel/watchdog_hld_hpet.c
+++ b/arch/x86/kernel/watchdog_hld_hpet.c
@@ -24,6 +24,7 @@
 
 static struct hpet_hld_data *hld_data;
 static bool hardlockup_use_hpet;
+static u64 tsc_next_error;
 
 /**
  * kick_timer() - Reprogram timer to expire in the future
@@ -33,11 +34,22 @@ static bool hardlockup_use_hpet;
  * Reprogram the timer to expire within watchdog_thresh seconds in the future.
  * If the timer supports periodic mode, it is not kicked unless @force is
  * true.
+ *
+ * Also, compute the expected value of the time-stamp counter at the time of
+ * expiration as well as a deviation from the expected value. The maximum
+ * deviation is of ~1.5%. This deviation can be easily computed by shifting
+ * by 6 positions the delta between the current and expected time-stamp values.
  */
 static void kick_timer(struct hpet_hld_data *hdata, bool force)
 {
+   u64 tsc_curr, tsc_delta, new_compare, count, period = 0;
bool kick_needed = force || !(hdata->has_periodic);
-   u64 new_compare, count, period = 0;
+
+   tsc_curr = rdtsc();
+
+   tsc_delta = (unsigned long)watchdog_thresh * hdata->tsc_ticks_per_cpu;
+   hdata->tsc_next = tsc_curr + tsc_delta;
+   tsc_next_error = tsc_delta >> 6;
 
/*
 * Update the comparator in increments of watch_thresh seconds relative
@@ -93,6 +105,15 @@ static void enable_timer(struct hpet_hld_data *hdata)
  */
 static bool is_hpet_wdt_interrupt(struct hpet_hld_data *hdata)
 {
+   if (smp_processor_id() == hdata->handling_cpu) {
+   u64 tsc_curr;
+
+   tsc_curr = rdtsc();
+
+   return (tsc_curr - hdata->tsc_next) + tsc_next_error <
+  2 * tsc_next_error;
+   }
+
return false;
 }
 
@@ -260,6 +281,10 @@ static void update_ticks_per_cpu(struct hpet_hld_data 
*hdata)
 
do_div(temp, hdata->enabled_cpus);
hdata->ticks_per_cpu = temp;
+
+   temp = (unsigned long)tsc_khz * 1000L;
+   do_div(temp, hdata->enabled_cpus);
+   hdata->tsc_ticks_per_cpu = temp;
 }
 
 /**
-- 
2.17.1



[RFC PATCH v4 04/21] x86/hpet: Add hpet_set_comparator() for periodic and one-shot modes

2019-05-23 Thread Ricardo Neri
Instead of setting the timer period directly in hpet_set_periodic(), add a
new helper function hpet_set_comparator() that only sets the accumulator
and comparator. hpet_set_periodic() will only prepare the timer for
periodic mode and leave the expiration programming to
hpet_set_comparator().

This new function can also be used by other components (e.g., the HPET-
based hardlockup detector) which also need to configure HPET timers. Thus,
add its declaration into the hpet header file.
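
For illustration, these are the two call forms of the helper ('num',
'deadline', 'cmp' and 'period' are hypothetical values; the enable,
periodic and FSB bits in HPET_Tn_CFG are still configured separately by
the caller):

  /* One-shot mode: expire once when the main counter reaches 'deadline'. */
  hpet_set_comparator(num, deadline, 0);

  /* Periodic mode: first expiration at 'cmp', then every 'period' ticks. */
  hpet_set_comparator(num, cmp, period);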

Cc: "H. Peter Anvin" 
Cc: Ashok Raj 
Cc: Andi Kleen 
Cc: Tony Luck 
Cc: Philippe Ombredanne 
Cc: Kate Stewart 
Cc: "Rafael J. Wysocki" 
Cc: Stephane Eranian 
Cc: Suravee Suthikulpanit 
Cc: "Ravi V. Shankar" 
Cc: x...@kernel.org
Originally-by: Suravee Suthikulpanit 
Signed-off-by: Ricardo Neri 
---
 arch/x86/include/asm/hpet.h |  1 +
 arch/x86/kernel/hpet.c  | 57 +
 2 files changed, 46 insertions(+), 12 deletions(-)

diff --git a/arch/x86/include/asm/hpet.h b/arch/x86/include/asm/hpet.h
index f132fbf984d4..e7098740f5ee 100644
--- a/arch/x86/include/asm/hpet.h
+++ b/arch/x86/include/asm/hpet.h
@@ -102,6 +102,7 @@ extern int hpet_rtc_timer_init(void);
 extern irqreturn_t hpet_rtc_interrupt(int irq, void *dev_id);
 extern int hpet_register_irq_handler(rtc_irq_handler handler);
 extern void hpet_unregister_irq_handler(rtc_irq_handler handler);
+extern void hpet_set_comparator(int num, unsigned int cmp, unsigned int 
period);
 
 #endif /* CONFIG_HPET_EMULATE_RTC */
 
diff --git a/arch/x86/kernel/hpet.c b/arch/x86/kernel/hpet.c
index 5e86e024c489..1723d55219e8 100644
--- a/arch/x86/kernel/hpet.c
+++ b/arch/x86/kernel/hpet.c
@@ -290,6 +290,47 @@ static void hpet_legacy_clockevent_register(void)
printk(KERN_DEBUG "hpet clockevent registered\n");
 }
 
+/**
+ * hpet_set_comparator() - Helper function for setting comparator register
+ * @num:   The timer ID
+ * @cmp:   The value to be written to the comparator/accumulator
+ * @period:The value to be written to the period (0 = oneshot mode)
+ *
+ * Helper function for updating comparator, accumulator and period values.
+ *
+ * In periodic mode, HPET needs HPET_TN_SETVAL to be set before writing
+ * to the Tn_CMP to update the accumulator. Then, HPET needs a second
+ * write (with HPET_TN_SETVAL cleared) to Tn_CMP to set the period.
+ * The HPET_TN_SETVAL bit is automatically cleared after the first write.
+ *
+ * For one-shot mode, HPET_TN_SETVAL does not need to be set.
+ *
+ * See the following documents:
+ *   - Intel IA-PC HPET (High Precision Event Timers) Specification
+ *   - AMD-8111 HyperTransport I/O Hub Data Sheet, Publication # 24674
+ */
+void hpet_set_comparator(int num, unsigned int cmp, unsigned int period)
+{
+   if (period) {
+   unsigned int v = hpet_readl(HPET_Tn_CFG(num));
+
+   hpet_writel(v | HPET_TN_SETVAL, HPET_Tn_CFG(num));
+   }
+
+   hpet_writel(cmp, HPET_Tn_CMP(num));
+
+   if (!period)
+   return;
+
+   /*
+* This delay is seldom used: never in one-shot mode and in periodic
+* only when reprogramming the timer.
+*/
+   udelay(1);
+   hpet_writel(period, HPET_Tn_CMP(num));
+}
+EXPORT_SYMBOL_GPL(hpet_set_comparator);
+
 static int hpet_set_periodic(struct clock_event_device *evt, int timer)
 {
unsigned int cfg, cmp, now;
@@ -301,19 +342,11 @@ static int hpet_set_periodic(struct clock_event_device 
*evt, int timer)
now = hpet_readl(HPET_COUNTER);
cmp = now + (unsigned int)delta;
cfg = hpet_readl(HPET_Tn_CFG(timer));
-   cfg |= HPET_TN_ENABLE | HPET_TN_PERIODIC | HPET_TN_SETVAL |
-  HPET_TN_32BIT;
+   cfg |= HPET_TN_ENABLE | HPET_TN_PERIODIC | HPET_TN_32BIT;
hpet_writel(cfg, HPET_Tn_CFG(timer));
-   hpet_writel(cmp, HPET_Tn_CMP(timer));
-   udelay(1);
-   /*
-* HPET on AMD 81xx needs a second write (with HPET_TN_SETVAL
-* cleared) to T0_CMP to set the period. The HPET_TN_SETVAL
-* bit is automatically cleared after the first write.
-* (See AMD-8111 HyperTransport I/O Hub Data Sheet,
-* Publication # 24674)
-*/
-   hpet_writel((unsigned int)delta, HPET_Tn_CMP(timer));
+
+   hpet_set_comparator(timer, cmp, (unsigned int)delta);
+
hpet_start_counter();
hpet_print_config();
 
-- 
2.17.1



[RFC PATCH v4 17/21] x86/tsc: Switch to perf-based hardlockup detector if TSC becomes unstable

2019-05-23 Thread Ricardo Neri
The HPET-based hardlockup detector relies on the TSC to determine whether
an observed NMI interrupt originated from the HPET timer. Hence, this
detector can no longer be used with an unstable TSC.

In such a case, permanently stop the HPET-based hardlockup detector and
start the perf-based detector.

Signed-off-by: Ricardo Neri 
---
 arch/x86/include/asm/hpet.h| 2 ++
 arch/x86/kernel/tsc.c  | 2 ++
 arch/x86/kernel/watchdog_hld.c | 7 +++
 3 files changed, 11 insertions(+)

diff --git a/arch/x86/include/asm/hpet.h b/arch/x86/include/asm/hpet.h
index fd99f2390714..a82cbe17479d 100644
--- a/arch/x86/include/asm/hpet.h
+++ b/arch/x86/include/asm/hpet.h
@@ -128,6 +128,7 @@ extern int hardlockup_detector_hpet_init(void);
 extern void hardlockup_detector_hpet_stop(void);
 extern void hardlockup_detector_hpet_enable(unsigned int cpu);
 extern void hardlockup_detector_hpet_disable(unsigned int cpu);
+extern void hardlockup_detector_switch_to_perf(void);
 #else
 static inline struct hpet_hld_data *hpet_hardlockup_detector_assign_timer(void)
 { return NULL; }
@@ -136,6 +137,7 @@ static inline int hardlockup_detector_hpet_init(void)
 static inline void hardlockup_detector_hpet_stop(void) {}
 static inline void hardlockup_detector_hpet_enable(unsigned int cpu) {}
 static inline void hardlockup_detector_hpet_disable(unsigned int cpu) {}
+static inline void hardlockup_detector_switch_to_perf(void) {}
 #endif /* CONFIG_X86_HARDLOCKUP_DETECTOR_HPET */
 
 #else /* CONFIG_HPET_TIMER */
diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index 59b57605e66c..b2210728ce3d 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -1158,6 +1158,8 @@ void mark_tsc_unstable(char *reason)
 
 clocksource_mark_unstable(&clocksource_tsc_early);
 clocksource_mark_unstable(&clocksource_tsc);
+
+   hardlockup_detector_switch_to_perf();
 }
 
 EXPORT_SYMBOL_GPL(mark_tsc_unstable);
diff --git a/arch/x86/kernel/watchdog_hld.c b/arch/x86/kernel/watchdog_hld.c
index c2512d4c79c5..c8547c227a41 100644
--- a/arch/x86/kernel/watchdog_hld.c
+++ b/arch/x86/kernel/watchdog_hld.c
@@ -76,3 +76,10 @@ void watchdog_nmi_stop(void)
if (detector_type == X86_HARDLOCKUP_DETECTOR_HPET)
hardlockup_detector_hpet_stop();
 }
+
+void hardlockup_detector_switch_to_perf(void)
+{
+   detector_type = X86_HARDLOCKUP_DETECTOR_PERF;
+   hardlockup_detector_hpet_stop();
+   hardlockup_start_all();
+}
-- 
2.17.1



[RFC PATCH v4 11/21] x86/watchdog/hardlockup: Add an HPET-based hardlockup detector

2019-05-23 Thread Ricardo Neri
This is the initial implementation of a hardlockup detector driven by an
HPET timer. It includes functions to control the timer via its registers.
It also requests such a timer, installs an NMI interrupt handler and
performs the initial configuration of the timer.

The detector is not functional at this stage. A subsequent changeset will
invoke the interfaces provided by this detector, as well as add the
functionality needed to determine whether the HPET timer caused the NMI.

In order to detect hardlockups in all the monitored CPUs, move the
interrupt to the next monitored CPU while handling the NMI interrupt; wrap
around when reaching the highest CPU in the mask. This rotation is
achieved by setting the affinity mask to only contain the next CPU to
monitor. A cpumask keeps track of all the CPUs that need to be monitored.
This cpumask is updated when the watchdog is enabled or disabled on a
particular CPU.
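
A minimal sketch of the rotation described above (illustrative only, not
verbatim patch code; it assumes the handling_cpu and cpu_monitored_mask
fields added to struct hpet_hld_data by this patch):

  static unsigned int example_pick_next_cpu(struct hpet_hld_data *hdata)
  {
          struct cpumask *mask = to_cpumask(hdata->cpu_monitored_mask);
          unsigned int cpu;

          cpu = cpumask_next(hdata->handling_cpu, mask);
          if (cpu >= nr_cpu_ids)  /* wrap around to the first monitored CPU */
                  cpu = cpumask_first(mask);

          return cpu;
  }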

This detector relies on an HPET timer that is capable of using Front Side
Bus interrupts. In order to avoid using the generic interrupt code, program
the MSI message register of the HPET timer directly.
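
For example, once an MSI message has been composed for the handling CPU,
it could be written straight into the timer's FSB interrupt route register
(MSI data in the low 32 bits, MSI address in the high 32 bits). This is an
illustrative sketch, not verbatim patch code; it assumes hpet_writel() is
usable outside hpet.c, as arranged earlier in this series:

  static void example_write_fsb_route(struct hpet_hld_data *hdata)
  {
          hpet_writel(hdata->msi_msg.data, HPET_Tn_ROUTE(hdata->num));
          hpet_writel(hdata->msi_msg.address_lo,
                      HPET_Tn_ROUTE(hdata->num) + 4);
  }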

HPET registers are only accessed to kick the timer after looking for
hardlockups. This happens every watchdog_thresh seconds. A subsequent
changeset will determine whether the HPET timer caused the interrupt based
on the value of the time-stamp counter. For now, just add a stub function.

Cc: "H. Peter Anvin" 
Cc: Ashok Raj 
Cc: Andi Kleen 
Cc: Tony Luck 
Cc: Peter Zijlstra 
Cc: Clemens Ladisch 
Cc: Arnd Bergmann 
Cc: Philippe Ombredanne 
Cc: Kate Stewart 
Cc: "Rafael J. Wysocki" 
Cc: "Ravi V. Shankar" 
Cc: Mimi Zohar 
Cc: Jan Kiszka 
Cc: Nick Desaulniers 
Cc: Masahiro Yamada 
Cc: Nayna Jain 
Cc: Stephane Eranian 
Cc: Suravee Suthikulpanit 
Cc: x...@kernel.org
Signed-off-by: Ricardo Neri 
---
 arch/x86/Kconfig.debug  |  11 +
 arch/x86/include/asm/hpet.h |  13 ++
 arch/x86/kernel/Makefile|   1 +
 arch/x86/kernel/hpet.c  |   3 +-
 arch/x86/kernel/watchdog_hld_hpet.c | 335 
 5 files changed, 362 insertions(+), 1 deletion(-)
 create mode 100644 arch/x86/kernel/watchdog_hld_hpet.c

diff --git a/arch/x86/Kconfig.debug b/arch/x86/Kconfig.debug
index f730680dc818..445bbb188f10 100644
--- a/arch/x86/Kconfig.debug
+++ b/arch/x86/Kconfig.debug
@@ -169,6 +169,17 @@ config IOMMU_LEAK
 config HAVE_MMIOTRACE_SUPPORT
def_bool y
 
+config X86_HARDLOCKUP_DETECTOR_HPET
+   bool "Use HPET Timer for Hard Lockup Detection"
+   select SOFTLOCKUP_DETECTOR
+   select HARDLOCKUP_DETECTOR
+   select HARDLOCKUP_DETECTOR_CORE
+   depends on HPET_TIMER && HPET && X86_64
+   help
+ Say y to enable a hardlockup detector that is driven by a High-
+ Precision Event Timer. This option is helpful to not use counters
+ from the Performance Monitoring Unit to drive the detector.
+
 config X86_DECODER_SELFTEST
bool "x86 instruction decoder selftest"
depends on DEBUG_KERNEL && KPROBES
diff --git a/arch/x86/include/asm/hpet.h b/arch/x86/include/asm/hpet.h
index 20abdaa5372d..31fc27508cf3 100644
--- a/arch/x86/include/asm/hpet.h
+++ b/arch/x86/include/asm/hpet.h
@@ -114,12 +114,25 @@ struct hpet_hld_data {
boolhas_periodic;
u32 num;
u64 ticks_per_second;
+   u32 handling_cpu;
+   u32 enabled_cpus;
+   struct msi_msg  msi_msg;
+   unsigned long   cpu_monitored_mask[0];
 };
 
 extern struct hpet_hld_data *hpet_hardlockup_detector_assign_timer(void);
+extern int hardlockup_detector_hpet_init(void);
+extern void hardlockup_detector_hpet_stop(void);
+extern void hardlockup_detector_hpet_enable(unsigned int cpu);
+extern void hardlockup_detector_hpet_disable(unsigned int cpu);
 #else
 static inline struct hpet_hld_data *hpet_hardlockup_detector_assign_timer(void)
 { return NULL; }
+static inline int hardlockup_detector_hpet_init(void)
+{ return -ENODEV; }
+static inline void hardlockup_detector_hpet_stop(void) {}
+static inline void hardlockup_detector_hpet_enable(unsigned int cpu) {}
+static inline void hardlockup_detector_hpet_disable(unsigned int cpu) {}
 #endif /* CONFIG_X86_HARDLOCKUP_DETECTOR_HPET */
 
 #else /* CONFIG_HPET_TIMER */
diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
index 3578ad248bc9..3ad55de67e8b 100644
--- a/arch/x86/kernel/Makefile
+++ b/arch/x86/kernel/Makefile
@@ -106,6 +106,7 @@ obj-$(CONFIG_VM86)  += vm86_32.o
 obj-$(CONFIG_EARLY_PRINTK) += early_printk.o
 
 obj-$(CONFIG_HPET_TIMER)   += hpet.o
+obj-$(CONFIG_X86_HARDLOCKUP_DETECTOR_HPET) += watchdog_hld_hpet.o
 obj-$(CONFIG_APB_TIMER)+= apb_timer.o
 
 obj-$(CONFIG_AMD_NB)   += amd_nb.o
diff --git a/arch/x86/kernel/hpet.c b/arch/x86/kernel/hpet.c
index 5f9209949fc7..dd3bb664a188 100644
--- a/arch

[RFC PATCH v4 19/21] iommu/vt-d: Rework prepare_irte() to support per-irq delivery mode

2019-05-23 Thread Ricardo Neri
A recent change introduced a new member to struct irq_cfg to specify the
delivery mode of an interrupt. Supporting the configuration of the
delivery mode would require adding a third argument to prepare_irte().
Instead, simply take a pointer to a irq_cfg data structure as a the only
argument.

Internally, configure the delivery mode of the Interrupt Remapping Table
Entry as specified in the irq_cfg data structure rather than from the APIC
setting.

This does not alter the existing behavior, as the delivery mode of the
APIC is still used to configure the irq_cfg data structure.

Cc: Ashok Raj 
Cc: Andi Kleen 
Cc: Tony Luck 
Cc: Borislav Petkov 
Cc: Jacob Pan 
Cc: Joerg Roedel 
Cc: Juergen Gross 
Cc: Bjorn Helgaas 
Cc: Wincy Van 
Cc: Kate Stewart 
Cc: Philippe Ombredanne 
Cc: "Eric W. Biederman" 
Cc: Baoquan He 
Cc: Jan Kiszka 
Cc: Lu Baolu 
Cc: Stephane Eranian 
Cc: Suravee Suthikulpanit 
Cc: "Ravi V. Shankar" 
Cc: x...@kernel.org
Cc: io...@lists.linux-foundation.org
Signed-off-by: Ricardo Neri 
---
 drivers/iommu/intel_irq_remapping.c | 10 +-
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/drivers/iommu/intel_irq_remapping.c 
b/drivers/iommu/intel_irq_remapping.c
index 4160aa9f3f80..2e61eaca7d7e 100644
--- a/drivers/iommu/intel_irq_remapping.c
+++ b/drivers/iommu/intel_irq_remapping.c
@@ -1072,7 +1072,7 @@ static int reenable_irq_remapping(int eim)
return -1;
 }
 
-static void prepare_irte(struct irte *irte, int vector, unsigned int dest)
+static void prepare_irte(struct irte *irte, struct irq_cfg *irq_cfg)
 {
memset(irte, 0, sizeof(*irte));
 
@@ -1086,9 +1086,9 @@ static void prepare_irte(struct irte *irte, int vector, 
unsigned int dest)
 * irq migration in the presence of interrupt-remapping.
*/
irte->trigger_mode = 0;
-   irte->dlvry_mode = apic->irq_delivery_mode;
-   irte->vector = vector;
-   irte->dest_id = IRTE_DEST(dest);
+   irte->dlvry_mode = irq_cfg->delivery_mode;
+   irte->vector = irq_cfg->vector;
+   irte->dest_id = IRTE_DEST(irq_cfg->dest_apicid);
irte->redir_hint = 1;
 }
 
@@ -1265,7 +1265,7 @@ static void intel_irq_remapping_prepare_irte(struct 
intel_ir_data *data,
 struct irte *irte = &data->irte_entry;
 struct msi_msg *msg = &data->msi_entry;
 
-   prepare_irte(irte, irq_cfg->vector, irq_cfg->dest_apicid);
+   prepare_irte(irte, irq_cfg);
switch (info->type) {
case X86_IRQ_ALLOC_TYPE_IOAPIC:
/* Set source-id of interrupt request */
-- 
2.17.1



[RFC PATCH v4 15/21] watchdog/hardlockup/hpet: Only enable the HPET watchdog via a boot parameter

2019-05-23 Thread Ricardo Neri
Keep the HPET-based hardlockup detector disabled unless explicitly enabled
via a command-line argument. If no such parameter is given, the
initialization of the HPET-based hardlockup detector fails and the NMI
watchdog falls back to the perf-based implementation.

Given that __setup("nmi_watchdog=") is already used to control the behavior
of the NMI watchdog (via hardlockup_panic_setup()), it cannot be used to
control the HPET-based implementation. Instead, use a new
early_param("nmi_watchdog").
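
For instance, a command line such as the following (illustrative) enables
the NMI watchdog and selects the HPET-based implementation:

  nmi_watchdog=1,hpet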

Cc: "H. Peter Anvin" 
Cc: Ashok Raj 
Cc: Andi Kleen 
Cc: Tony Luck 
Cc: Peter Zijlstra 
Cc: Clemens Ladisch 
Cc: Arnd Bergmann 
Cc: Philippe Ombredanne 
Cc: Kate Stewart 
Cc: "Rafael J. Wysocki" 
Cc: Mimi Zohar 
Cc: Jan Kiszka 
Cc: Nick Desaulniers 
Cc: Masahiro Yamada 
Cc: Nayna Jain 
Cc: Stephane Eranian 
Cc: Suravee Suthikulpanit 
Cc: "Ravi V. Shankar" 
Cc: x...@kernel.org
Signed-off-by: Ricardo Neri 

--
checkpatch gives the following warning:

CHECK: __setup appears un-documented -- check 
Documentation/admin-guide/kernel-parameters.rst
+__setup("nmi_watchdog=", hardlockup_detector_hpet_setup);

This is a false-positive as the option nmi_watchdog is already
documented. The option is re-evaluated in this file as well.
---
 .../admin-guide/kernel-parameters.txt |  8 ++-
 arch/x86/kernel/watchdog_hld_hpet.c   | 22 +++
 2 files changed, 29 insertions(+), 1 deletion(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt 
b/Documentation/admin-guide/kernel-parameters.txt
index 138f6664b2e2..17ed3dcda13e 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -2831,7 +2831,7 @@
Format: [state][,regs][,debounce][,die]
 
nmi_watchdog=   [KNL,BUGS=X86] Debugging features for SMP kernels
-   Format: [panic,][nopanic,][num]
+   Format: [panic,][nopanic,][num,][hpet]
Valid num: 0 or 1
0 - turn hardlockup detector in nmi_watchdog off
1 - turn hardlockup detector in nmi_watchdog on
@@ -2841,6 +2841,12 @@
please see 'nowatchdog'.
This is useful when you use a panic=... timeout and
need the box quickly up again.
+   When hpet is specified, the NMI watchdog will be driven
+   by an HPET timer, if available in the system. Otherwise,
+   it falls back to the default implementation (perf or
+   architecture-specific). Specifying hpet has no effect
+   if the NMI watchdog is not enabled (either at build time
+   or via the command line).
 
These settings can be accessed at runtime via
the nmi_watchdog and hardlockup_panic sysctls.
diff --git a/arch/x86/kernel/watchdog_hld_hpet.c 
b/arch/x86/kernel/watchdog_hld_hpet.c
index dcc50cd29374..76eed714a1cb 100644
--- a/arch/x86/kernel/watchdog_hld_hpet.c
+++ b/arch/x86/kernel/watchdog_hld_hpet.c
@@ -351,6 +351,28 @@ void hardlockup_detector_hpet_stop(void)
disable_timer(hld_data);
 }
 
+/**
+ * hardlockup_detector_hpet_setup() - Parse command-line parameters
+ * @str:   A string containing the kernel command line
+ *
+ * Parse the nmi_watchdog parameter from the kernel command line. If
+ * selected by the user, use this implementation to detect hardlockups.
+ */
+static int __init hardlockup_detector_hpet_setup(char *str)
+{
+   if (!str)
+   return -EINVAL;
+
+   if (parse_option_str(str, "hpet"))
+   hardlockup_use_hpet = true;
+
+   if (!nmi_watchdog_user_enabled && hardlockup_use_hpet)
+   pr_warn("Selecting HPET NMI watchdog has no effect with NMI 
watchdog disabled\n");
+
+   return 0;
+}
+early_param("nmi_watchdog", hardlockup_detector_hpet_setup);
+
 /**
  * hardlockup_detector_hpet_init() - Initialize the hardlockup detector
  *
-- 
2.17.1



Re: [RFC PATCH v2 12/14] x86/watchdog/hardlockup/hpet: Determine if HPET timer caused NMI

2019-04-10 Thread Ricardo Neri
On Wed, Apr 10, 2019 at 09:01:52AM +0200, Peter Zijlstra wrote:
> On Tue, Apr 09, 2019 at 06:19:57PM -0700, Ricardo Neri wrote:
> > On Tue, Apr 09, 2019 at 01:28:17PM +0200, Peter Zijlstra wrote:
> > > > @@ -147,6 +161,14 @@ static void set_periodic(struct hpet_hld_data 
> > > > *hdata)
> > > >   */
> > > >  static bool is_hpet_wdt_interrupt(struct hpet_hld_data *hdata)
> > > >  {
> > > > +   if (smp_processor_id() == hdata->handling_cpu) {
> > > > +   unsigned long tsc_curr;
> > > 
> > > TSC is u64
> > 
> > In x86_64, isn't u64 an unsigned long? Do you mean to consider the
> > 32-bit case?
> 
> Unless none of this code is available for x86_32, you pretty much have
> to consider 32bit.
> 
> But even then, using u64 for 64bit values is the right thing.

OK, I'll implement this change.

Thanks and BR,
Ricardo

