[tip: perf/core] x86/cpufeatures: Enumerate Intel Hybrid Technology feature bit
The following commit has been merged into the perf/core branch of tip:

Commit-ID:     a161545ab53b174c016b0eb63c289525d2f6
Gitweb:        https://git.kernel.org/tip/a161545ab53b174c016b0eb63c289525d2f6
Author:        Ricardo Neri
AuthorDate:    Mon, 12 Apr 2021 07:30:41 -07:00
Committer:     Peter Zijlstra
CommitterDate: Mon, 19 Apr 2021 20:03:23 +02:00

x86/cpufeatures: Enumerate Intel Hybrid Technology feature bit

Add feature enumeration to identify a processor with Intel Hybrid
Technology: one in which CPUs of more than one type are in the same
package.

On a hybrid processor, all CPUs support the same, homogeneous (i.e.,
symmetric) instruction set. All CPUs enumerate the same features in CPUID.
Thus, software (user space and kernel) can run and migrate to any CPU in
the system as well as utilize any of the enumerated features without any
change or special provisions. The main difference among CPUs in a hybrid
processor is their power and performance properties.

Signed-off-by: Ricardo Neri
Signed-off-by: Peter Zijlstra (Intel)
Reviewed-by: Tony Luck
Reviewed-by: Len Brown
Acked-by: Borislav Petkov
Link: https://lkml.kernel.org/r/1618237865-33448-2-git-send-email-kan.li...@linux.intel.com
---
 arch/x86/include/asm/cpufeatures.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
index cc96e26..1ba4a6e 100644
--- a/arch/x86/include/asm/cpufeatures.h
+++ b/arch/x86/include/asm/cpufeatures.h
@@ -374,6 +374,7 @@
 #define X86_FEATURE_MD_CLEAR		(18*32+10) /* VERW clears CPU buffers */
 #define X86_FEATURE_TSX_FORCE_ABORT	(18*32+13) /* "" TSX_FORCE_ABORT */
 #define X86_FEATURE_SERIALIZE		(18*32+14) /* SERIALIZE instruction */
+#define X86_FEATURE_HYBRID_CPU		(18*32+15) /* "" This part has CPUs of more than one type */
 #define X86_FEATURE_TSXLDTRK		(18*32+16) /* TSX Suspend Load Address Tracking */
 #define X86_FEATURE_PCONFIG		(18*32+18) /* Intel PCONFIG */
 #define X86_FEATURE_ARCH_LBR		(18*32+19) /* Intel ARCH LBR */
[tip: perf/core] x86/cpu: Add helper function to get the type of the current hybrid CPU
The following commit has been merged into the perf/core branch of tip:

Commit-ID:     250b3c0d79d1f4a55e54d8a9ef48058660483fef
Gitweb:        https://git.kernel.org/tip/250b3c0d79d1f4a55e54d8a9ef48058660483fef
Author:        Ricardo Neri
AuthorDate:    Mon, 12 Apr 2021 07:30:42 -07:00
Committer:     Peter Zijlstra
CommitterDate: Mon, 19 Apr 2021 20:03:23 +02:00

x86/cpu: Add helper function to get the type of the current hybrid CPU

On processors with Intel Hybrid Technology (i.e., ones having more than
one type of CPU in the same package), all CPUs support the same
instruction set and enumerate the same features in CPUID. Thus, all
software can run on any CPU without restrictions.

However, there may be model-specific differences among the types of CPUs.
For instance, each type of CPU may support a different number of
performance counters. Also, machine check error banks may be wired
differently. Even though most software will not care about these
differences, kernel subsystems dealing with them must know about them.

Add and expose a new helper function, get_this_hybrid_cpu_type(), to
query the type of the current hybrid CPU. The function will be used later
in the perf subsystem.

The Intel Software Developer's Manual defines the CPU type as an 8-bit
identifier.
Signed-off-by: Ricardo Neri
Signed-off-by: Peter Zijlstra (Intel)
Reviewed-by: Tony Luck
Reviewed-by: Len Brown
Acked-by: Borislav Petkov
Link: https://lkml.kernel.org/r/1618237865-33448-3-git-send-email-kan.li...@linux.intel.com
---
 arch/x86/include/asm/cpu.h  |  6 ++++++
 arch/x86/kernel/cpu/intel.c | 16 ++++++++++++++++
 2 files changed, 22 insertions(+)

diff --git a/arch/x86/include/asm/cpu.h b/arch/x86/include/asm/cpu.h
index da78ccb..610905d 100644
--- a/arch/x86/include/asm/cpu.h
+++ b/arch/x86/include/asm/cpu.h
@@ -45,6 +45,7 @@ extern void __init cpu_set_core_cap_bits(struct cpuinfo_x86 *c);
 extern void switch_to_sld(unsigned long tifn);
 extern bool handle_user_split_lock(struct pt_regs *regs, long error_code);
 extern bool handle_guest_split_lock(unsigned long ip);
+u8 get_this_hybrid_cpu_type(void);
 #else
 static inline void __init cpu_set_core_cap_bits(struct cpuinfo_x86 *c) {}
 static inline void switch_to_sld(unsigned long tifn) {}
@@ -57,6 +58,11 @@ static inline bool handle_guest_split_lock(unsigned long ip)
 {
 	return false;
 }
+
+static inline u8 get_this_hybrid_cpu_type(void)
+{
+	return 0;
+}
 #endif
 #ifdef CONFIG_IA32_FEAT_CTL
 void init_ia32_feat_ctl(struct cpuinfo_x86 *c);
diff --git a/arch/x86/kernel/cpu/intel.c b/arch/x86/kernel/cpu/intel.c
index 0e422a5..26fb626 100644
--- a/arch/x86/kernel/cpu/intel.c
+++ b/arch/x86/kernel/cpu/intel.c
@@ -1195,3 +1195,19 @@ void __init cpu_set_core_cap_bits(struct cpuinfo_x86 *c)
 	cpu_model_supports_sld = true;
 	split_lock_setup();
 }
+
+#define X86_HYBRID_CPU_TYPE_ID_SHIFT	24
+
+/**
+ * get_this_hybrid_cpu_type() - Get the type of this hybrid CPU
+ *
+ * Returns the CPU type [31:24] (i.e., Atom or Core) of a CPU in
+ * a hybrid processor. If the processor is not hybrid, returns 0.
+ */
+u8 get_this_hybrid_cpu_type(void)
+{
+	if (!cpu_feature_enabled(X86_FEATURE_HYBRID_CPU))
+		return 0;
+
+	return cpuid_eax(0x001a) >> X86_HYBRID_CPU_TYPE_ID_SHIFT;
+}
[PATCH v2 3/4] sched/fair: Consider SMT in ASYM_PACKING load balance
When deciding to pull tasks in ASYM_PACKING, it is necessary not only to
check for the idle state of the CPU doing the load balancing, but also of
its SMT siblings. If the CPU doing the balancing is idle but its SMT
siblings are busy, performance suffers if it pulls tasks from a medium
priority CPU that does not have SMT siblings.

The decision to label a group for asymmetric packing balancing is done in
update_sg_lb_stats(). However, for SMT, that code does not account for
idle SMT siblings. Implement asym_can_pull_tasks() to revisit the early
decision on whether the CPU doing the balancing can pull tasks once the
needed information is available. arch_sched_asym_prefer_early() and
arch_asym_check_smt_siblings() are used to preserve the legacy behavior.

Cc: Aubrey Li
Cc: Ben Segall
Cc: Daniel Bristot de Oliveira
Cc: Dietmar Eggemann
Cc: Joel Fernandes (Google)
Cc: Mel Gorman
Cc: Quentin Perret
Cc: Srinivas Pandruvada
Cc: Steven Rostedt
Cc: Tim Chen
Reviewed-by: Len Brown
Signed-off-by: Ricardo Neri
---
Changes since v1:
 * Don't bail out in update_sd_pick_busiest() if dst_cpu cannot pull
   tasks. Instead, reclassify the candidate busiest group, as it may
   still be selected. (PeterZ)
 * Avoid an expensive and unnecessary call to cpumask_weight() when
   determining if a sched_group is comprised of SMT siblings. (PeterZ)
---
 include/linux/sched/topology.h |   1 +
 kernel/sched/fair.c            | 127 +++++++++++++++++++++++++++++++++
 2 files changed, 128 insertions(+)

diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index 663b98959305..6487953b24e8 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -58,6 +58,7 @@ static inline int cpu_numa_flags(void)
 
 extern int arch_asym_cpu_priority(int cpu);
 extern bool arch_sched_asym_prefer_early(int a, int b);
+extern bool arch_asym_check_smt_siblings(void);
 
 struct sched_domain_attr {
 	int relax_domain_level;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e74da853b046..2a33f93646b2 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -115,6 +115,14 @@ bool __weak arch_sched_asym_prefer_early(int a, int b)
 	return sched_asym_prefer(a, b);
 }
 
+/*
+ * For asym packing, first check the state of SMT siblings before deciding to
+ * pull tasks.
+ */
+bool __weak arch_asym_check_smt_siblings(void)
+{
+	return false;
+}
+
 /*
  * The margin used when comparing utilization with CPU capacity.
  *
@@ -8483,6 +8491,107 @@ static inline void update_sg_lb_stats(struct lb_env *env,
 			sgs->group_capacity;
 }
 
+static bool cpu_group_is_smt(int cpu, struct sched_group *sg)
+{
+#ifdef CONFIG_SCHED_SMT
+	if (!static_branch_likely(&sched_smt_present))
+		return false;
+
+	if (sg->group_weight == 1)
+		return false;
+
+	return cpumask_equal(sched_group_span(sg), cpu_smt_mask(cpu));
+#else
+	return false;
+#endif
+}
+
+/**
+ * asym_can_pull_tasks - Check whether the load balancing CPU can pull tasks
+ * @dst_cpu:	CPU doing the load balancing
+ * @sds:	Load-balancing data with statistics of the local group
+ * @sgs:	Load-balancing statistics of the candidate busiest group
+ * @sg:		The candidate busiest group
+ *
+ * Check the state of the SMT siblings of both @sds::local and @sg and decide
+ * if @dst_cpu can pull tasks. If @dst_cpu does not have SMT siblings, it can
+ * pull tasks if two or more of the SMT siblings of @sg are busy. If only one
+ * CPU in @sg is busy, pull tasks only if @dst_cpu has higher priority.
+ *
+ * If both @dst_cpu and @sg have SMT siblings, even out the number of idle
+ * CPUs between @sds::local and @sg. Thus, pull tasks from @sg if the
+ * difference between the number of busy CPUs is 2 or more. If the difference
+ * is of 1, only pull if @dst_cpu has higher priority. If @sg does not have
+ * SMT siblings, only pull tasks if all of the SMT siblings of @dst_cpu are
+ * idle and @sg has lower priority.
+ */
+static bool asym_can_pull_tasks(int dst_cpu, struct sd_lb_stats *sds,
+				struct sg_lb_stats *sgs, struct sched_group *sg)
+{
+#ifdef CONFIG_SCHED_SMT
+	int cpu, local_busy_cpus, sg_busy_cpus;
+	bool local_is_smt, sg_is_smt;
+
+	if (!arch_asym_check_smt_siblings())
+		return true;
+
+	cpu = group_first_cpu(sg);
+	local_is_smt = cpu_group_is_smt(dst_cpu, sds->local);
+	sg_is_smt = cpu_group_is_smt(cpu, sg);
+
+	sg_busy_cpus = sgs->group_weight - sgs->idle_cpus;
+
+	if (!local_is_smt) {
+		/*
+		 * If we are here, @dst_cpu is idle and does not have SMT
+		 * siblings. Pull tasks if candidate group has two or more
+		 * busy CPUs.
+		 */
+		if (sg_is_smt && sg_busy_cpus >= 2)
+			return true;
+
+		/*
[PATCH v2 4/4] x86/sched: Enable checks of the state of SMT siblings in load balancing
ITMT relies on asymmetric packing of tasks to ensure CPUs are populated in
priority order. When balancing load, the scheduler compares scheduling
groups in pairs, and compares only the priority of the CPUs of highest
priority in the group. This may result in CPUs with medium priority being
overlooked.

A recent change introduced logic to also consider the idle state of the
SMT siblings of the CPU doing the load balance. Enable those checks for
x86 when using ITMT.

Cc: Aubrey Li
Cc: Ben Segall
Cc: Daniel Bristot de Oliveira
Cc: Dietmar Eggemann
Cc: Joel Fernandes (Google)
Cc: Mel Gorman
Cc: Quentin Perret
Cc: Srinivas Pandruvada
Cc: Steven Rostedt
Cc: Tim Chen
Reviewed-by: Len Brown
Signed-off-by: Ricardo Neri
---
Changes since v1:
 * None
---
 arch/x86/kernel/itmt.c | 15 +++++++++++++++
 1 file changed, 15 insertions(+)

diff --git a/arch/x86/kernel/itmt.c b/arch/x86/kernel/itmt.c
index 1afbdd1dd777..1407120af82d 100644
--- a/arch/x86/kernel/itmt.c
+++ b/arch/x86/kernel/itmt.c
@@ -28,6 +28,8 @@ DEFINE_PER_CPU_READ_MOSTLY(int, sched_core_priority);
 /* Boolean to track if system has ITMT capabilities */
 static bool __read_mostly sched_itmt_capable;
 
+/* Boolean to activate checks on the state of SMT siblings */
+static bool __read_mostly sched_itmt_smt_checks;
+
 /*
  * Boolean to control whether we want to move processes to cpu capable
@@ -124,6 +126,8 @@ int sched_set_itmt_support(void)
 
 	sysctl_sched_itmt_enabled = 1;
 
+	sched_itmt_smt_checks = true;
+
 	x86_topology_update = true;
 	rebuild_sched_domains();
 
@@ -160,6 +164,7 @@ void sched_clear_itmt_support(void)
 	if (sysctl_sched_itmt_enabled) {
 		/* disable sched_itmt if we are no longer ITMT capable */
 		sysctl_sched_itmt_enabled = 0;
+		sched_itmt_smt_checks = false;
 		x86_topology_update = true;
 		rebuild_sched_domains();
 	}
@@ -167,6 +172,16 @@ void sched_clear_itmt_support(void)
 	mutex_unlock(&itmt_update_mutex);
 }
 
+bool arch_asym_check_smt_siblings(void)
+{
+	return sched_itmt_smt_checks;
+}
+
+bool arch_sched_asym_prefer_early(int a, int b)
+{
+	return sched_itmt_smt_checks;
+}
+
 int arch_asym_cpu_priority(int cpu)
 {
 	return per_cpu(sched_core_priority, cpu);
-- 
2.17.1
[PATCH v2 2/4] sched/fair: Introduce arch_sched_asym_prefer_early()
Introduce arch_sched_asym_prefer_early() so that architectures with SMT
can delay the decision to label a candidate busiest group as
group_asym_packing.

When using asymmetric packing, high priority idle CPUs pull tasks from
scheduling groups with low priority CPUs. The decision on using
asymmetric packing for load balancing is done after collecting the
statistics of a candidate busiest group. However, this decision needs to
consider the state of the SMT siblings of dst_cpu.

Cc: Aubrey Li
Cc: Ben Segall
Cc: Daniel Bristot de Oliveira
Cc: Dietmar Eggemann
Cc: Joel Fernandes (Google)
Cc: Mel Gorman
Cc: Quentin Perret
Cc: Srinivas Pandruvada
Cc: Steven Rostedt
Cc: Tim Chen
Reviewed-by: Len Brown
Signed-off-by: Ricardo Neri
---
Changes since v1:
 * None
---
 include/linux/sched/topology.h |  1 +
 kernel/sched/fair.c            | 11 ++++++++++-
 2 files changed, 11 insertions(+), 1 deletion(-)

diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index 8f0f778b7c91..663b98959305 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -57,6 +57,7 @@ static inline int cpu_numa_flags(void)
 #endif
 
 extern int arch_asym_cpu_priority(int cpu);
+extern bool arch_sched_asym_prefer_early(int a, int b);
 
 struct sched_domain_attr {
 	int relax_domain_level;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 4ef3fa0d5e8d..e74da853b046 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -106,6 +106,15 @@ int __weak arch_asym_cpu_priority(int cpu)
 	return -cpu;
 }
 
+/*
+ * For asym packing, early check if CPUs with higher priority should be
+ * preferred. On some architectures, more data is needed to make a decision.
+ */
+bool __weak arch_sched_asym_prefer_early(int a, int b)
+{
+	return sched_asym_prefer(a, b);
+}
+
 /*
  * The margin used when comparing utilization with CPU capacity.
  *
@@ -8458,7 +8467,7 @@ static inline void update_sg_lb_stats(struct lb_env *env,
 	if (!local_group && env->sd->flags & SD_ASYM_PACKING &&
 	    env->idle != CPU_NOT_IDLE &&
 	    sgs->sum_h_nr_running &&
-	    sched_asym_prefer(env->dst_cpu, group->asym_prefer_cpu)) {
+	    arch_sched_asym_prefer_early(env->dst_cpu, group->asym_prefer_cpu)) {
 		sgs->group_asym_packing = 1;
 	}
-- 
2.17.1
[PATCH v2 0/4] sched/fair: Fix load balancing of SMT siblings with ASYM_PACKING
threads-sockets group-4     1.00 (  0.70)   +0.61 (  0.88)
threads-sockets group-8     1.00 (  1.20)   +0.29 (  2.20)

netperf
=======
case            load        baseline(std%)  compare%( std%)
TCP_RR          thread-2    1.00 (  0.72)   +3.48 (  0.40)
TCP_RR          thread-4    1.00 (  0.38)   +4.11 (  0.35)
TCP_RR          thread-8    1.00 (  9.29)   -0.67 ( 14.12)
TCP_RR          thread-16   1.00 ( 21.18)   -0.10 ( 20.92)
UDP_RR          thread-2    1.00 (  0.23)   +4.08 (  0.16)
UDP_RR          thread-4    1.00 (  0.19)   +3.53 (  0.27)
UDP_RR          thread-8    1.00 (  6.83)   -1.63 ( 11.09)
UDP_RR          thread-16   1.00 ( 20.85)   -1.02 ( 21.20)

tbench
======
case            load        baseline(std%)  compare%( std%)
loopback        thread-2    1.00 (  0.15)   +4.49 (  0.20)
loopback        thread-4    1.00 (  0.25)   +3.99 (  0.32)
loopback        thread-8    1.00 (  0.16)   +0.49 (  0.30)
loopback        thread-16   1.00 (  0.32)   +0.49 (  0.21)

schbench
========
case            load        baseline(std%)  compare%( std%)
normal          mthread-1   1.00 ( 15.87)  -17.76 (  8.38)
normal          mthread-2   1.00 (  0.10)   -0.05 (  0.00)
normal          mthread-4   1.00 (  0.00)   +0.00 (  0.00)
normal          mthread-8   1.00 (  4.81)   +5.57 (  2.61)

Table 2. Test results of patches with HWP
=========================================

hackbench
=========
case            load        baseline(std%)  compare%( std%)
process-pipe    group-1     1.00 (  0.76)   +0.14 (  1.11)
process-pipe    group-2     1.00 (  2.59)   +0.88 (  3.16)
process-pipe    group-4     1.00 (  4.05)   +1.13 (  4.86)
process-pipe    group-8     1.00 (  7.37)   -2.34 ( 14.43)
process-sockets group-1     1.00 ( 15.98)   +7.77 (  1.44)
process-sockets group-2     1.00 (  1.64)   +0.00 (  1.59)
process-sockets group-4     1.00 (  0.95)   -1.54 (  0.85)
process-sockets group-8     1.00 (  1.84)   -3.27 (  4.86)
threads-pipe    group-1     1.00 (  3.27)   +0.64 (  2.91)
threads-pipe    group-2     1.00 (  3.02)   -0.09 (  1.50)
threads-pipe    group-4     1.00 (  5.39)   -6.34 (  3.11)
threads-pipe    group-8     1.00 (  5.56)   +4.61 ( 14.91)
threads-sockets group-1     1.00 (  2.76)   +4.70 (  0.94)
threads-sockets group-2     1.00 (  1.10)   +3.56 (  1.41)
threads-sockets group-4     1.00 (  0.45)   +2.11 (  1.32)
threads-sockets group-8     1.00 (  3.56)   +3.62 (  2.43)

netperf
=======
case            load        baseline(std%)  compare%( std%)
TCP_RR          thread-2    1.00 (  0.36)   +9.85 (  2.09)
TCP_RR          thread-4    1.00 (  0.31)   +1.30 (  0.53)
TCP_RR          thread-8    1.00 ( 11.70)   -0.42 ( 13.31)
TCP_RR          thread-16   1.00 ( 23.49)   -0.55 ( 21.79)
UDP_RR          thread-2    1.00 (  0.19)  +13.11 (  7.48)
UDP_RR          thread-4    1.00 (  0.13)   +2.69 (  0.26)
UDP_RR          thread-8    1.00 (  8.95)   -1.39 ( 12.39)
UDP_RR          thread-16   1.00 ( 21.54)   -0.77 ( 20.97)

tbench
======
case            load        baseline(std%)  compare%( std%)
loopback        thread-2    1.00 (  0.22)   +5.26 (  0.46)
loopback        thread-4    1.00 (  2.56)  +52.11 (  0.73)
loopback        thread-8    1.00 (  0.41)   +0.53 (  0.42)
loopback        thread-16   1.00 (  0.60)   +1.39 (  0.33)

schbench
========
case            load        baseline(std%)  compare%( std%)
normal          mthread-1   1.00 ( 12.01)   -1.72 (  2.08)
normal          mthread-2   1.00 (  0.00)   +0.00 (  0.00)
normal          mthread-4   1.00 (  0.00)   +0.00 (  0.00)
normal          mthread-8   1.00 (  0.00)   +0.00 (  0.00)

[1]. https://lore.kernel.org/lkml/20210406041108.7416-1-ricardo.neri-calde...@linux.intel.com/

Ricardo Neri (4):
  sched/fair: Optimize checking for group_asym_packing
  sched/fair: Introduce arch_sched_asym_prefer_early()
  sched/fair: Consider SMT in ASYM_PACKING load balance
  x86/sched: Enable checks of the state of SMT siblings in load
    balancing

 arch/x86/kernel/itmt.c         |  15 ++++
 include/linux/sched/topology.h |   2 +
 kernel/sched/fair.c            | 140 ++++++++++++++++++++++++++++++++-
 3 files changed, 155 insertions(+), 2 deletions(-)

-- 
2.17.1
[PATCH v2 1/4] sched/fair: Optimize checking for group_asym_packing
By checking local_group, we can avoid additional checks and invoking
sched_asym_prefer() when it is not needed.

Cc: Aubrey Li
Cc: Ben Segall
Cc: Daniel Bristot de Oliveira
Cc: Dietmar Eggemann
Cc: Joel Fernandes (Google)
Cc: Mel Gorman
Cc: Quentin Perret
Cc: Srinivas Pandruvada
Cc: Steven Rostedt
Cc: Tim Chen
Reviewed-by: Len Brown
Signed-off-by: Ricardo Neri
---
Changes since v1:
 * None
---
 kernel/sched/fair.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 04a3ce20da67..4ef3fa0d5e8d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8455,7 +8455,7 @@ static inline void update_sg_lb_stats(struct lb_env *env,
 	}
 
 	/* Check if dst CPU is idle and preferred to this group */
-	if (env->sd->flags & SD_ASYM_PACKING &&
+	if (!local_group && env->sd->flags & SD_ASYM_PACKING &&
 	    env->idle != CPU_NOT_IDLE && sgs->sum_h_nr_running &&
 	    sched_asym_prefer(env->dst_cpu, group->asym_prefer_cpu)) {
-- 
2.17.1
Re: [PATCH 3/4] sched/fair: Consider SMT in ASYM_PACKING load balance
On Thu, Apr 08, 2021 at 01:21:22PM +0200, Peter Zijlstra wrote:
> On Tue, Apr 06, 2021 at 04:17:51PM -0700, Ricardo Neri wrote:
> > On Tue, Apr 06, 2021 at 01:18:09PM +0200, Peter Zijlstra wrote:
> > > On Mon, Apr 05, 2021 at 09:11:07PM -0700, Ricardo Neri wrote:
> > > > +static bool cpu_group_is_smt(int cpu, struct sched_group *sg)
> > > > +{
> > > > +#ifdef CONFIG_SCHED_SMT
> > > > +	if (!static_branch_likely(&sched_smt_present))
> > > > +		return false;
> > > > +
> > > > +	if (sg->group_weight == 1)
> > > > +		return false;
> > > > +
> > > > +	if (cpumask_weight(cpu_smt_mask(cpu)) == 1)
> > > > +		return false;
> > >
> > > Please explain this condition. Why is it required?
> >
> > Thank you for your quick review Peter!
> >
> > Probably this is not required since the previous check verifies the
> > group weight, and the subsequent check makes sure that @sg matches the
> > SMT siblings of @cpu.
>
> So the thing is that cpumask_weight() can be fairly expensive, depending
> on how large the machine is.
>
> Now I suppose this mixing of SMT and !SMT cores is typical for 'small'
> machines (for now), but this is enabled for everything with ITMT on,
> which might very well include large systems.
>
> So yes, if it can go away, that'd be good.

Sure Peter, I think this check can be removed. I'll post a v2 with the
updates.

Thanks and BR,
Ricardo
Re: [PATCH 3/4] sched/fair: Consider SMT in ASYM_PACKING load balance
On Thu, Apr 08, 2021 at 01:10:39PM +0200, Peter Zijlstra wrote:
> On Tue, Apr 06, 2021 at 04:17:10PM -0700, Ricardo Neri wrote:
> > On Tue, Apr 06, 2021 at 01:17:28PM +0200, Peter Zijlstra wrote:
> > > On Mon, Apr 05, 2021 at 09:11:07PM -0700, Ricardo Neri wrote:
> > > > @@ -8507,6 +8619,10 @@ static bool update_sd_pick_busiest(struct lb_env *env,
> > > >  	if (!sgs->sum_h_nr_running)
> > > >  		return false;
> > > >
> > > > +	if (sgs->group_type == group_asym_packing &&
> > > > +	    !asym_can_pull_tasks(env->dst_cpu, sds, sgs, sg))
> > > > +		return false;
> > >
> > > All of this makes my head hurt; but afaict this isn't right.
> > >
> > > Your update_sg_lb_stats() change makes that we unconditionally set
> > > sgs->group_asym_packing, and then this is to undo that. But it's not
> > > clear this covers all cases right.
> >
> > We could not make a decision to set sgs->group_asym_packing in
> > update_sg_lb_stats() because we don't have information about the dst_cpu
> > and its SMT siblings, if any. That is the reason I proposed to delay the
> > decision to update_sd_pick_busiest(), where we can compare local and
> > sgs.
>
> Yeah, I sorta got that.
>
> > > Even if !sched_asym_prefer(), we could end up selecting this sg as
> > > busiest, but you're just bailing out here.
> >
> > Even if sgs->group_asym_packing is unconditionally set, sgs can still
> > be classified as group_overloaded and group_imbalanced. In such cases
> > we wouldn't bail out. sgs could not be classified as group_fully_busy
> > or group_has_spare and we would bail out, though. Is your concern about
> > these? I can fix up these two cases.
>
> Yes. Either explain (in a comment) why those cases are not relevant, or
> handle them properly.
>
> Because when reading this, it wasn't at all obvious that this is correct
> or as intended.

Sure Peter, I will post a v2 handling the remaining cases properly.

Thanks and BR,
Ricardo
Re: [PATCH 2/4] sched/fair: Introduce arch_sched_asym_prefer_early()
On Tue, Apr 06, 2021 at 04:31:35PM +0200, Vincent Guittot wrote:
> On Tue, 6 Apr 2021 at 06:11, Ricardo Neri wrote:
> >
> > Introduce arch_sched_asym_prefer_early() so that architectures with SMT
> > can delay the decision to label a candidate busiest group as
> > group_asym_packing.
> >
> > When using asymmetric packing, high priority idle CPUs pull tasks from
> > scheduling groups with low priority CPUs. The decision on using
> > asymmetric packing for load balancing is done after collecting the
> > statistics of a candidate busiest group. However, this decision needs
> > to consider the state of SMT siblings of dst_cpu.
> >
> > Cc: Aubrey Li
> > Cc: Ben Segall
> > Cc: Daniel Bristot de Oliveira
> > Cc: Dietmar Eggemann
> > Cc: Joel Fernandes (Google)
> > Cc: Mel Gorman
> > Cc: Quentin Perret
> > Cc: Srinivas Pandruvada
> > Cc: Steven Rostedt
> > Cc: Tim Chen
> > Reviewed-by: Len Brown
> > Signed-off-by: Ricardo Neri
> > ---
> >  include/linux/sched/topology.h |  1 +
> >  kernel/sched/fair.c            | 11 ++++++++++-
> >  2 files changed, 11 insertions(+), 1 deletion(-)
> >
> > diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
> > index 8f0f778b7c91..663b98959305 100644
> > --- a/include/linux/sched/topology.h
> > +++ b/include/linux/sched/topology.h
> > @@ -57,6 +57,7 @@ static inline int cpu_numa_flags(void)
> >  #endif
> >
> >  extern int arch_asym_cpu_priority(int cpu);
> > +extern bool arch_sched_asym_prefer_early(int a, int b);
> >
> >  struct sched_domain_attr {
> >  	int relax_domain_level;
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 4ef3fa0d5e8d..e74da853b046 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -106,6 +106,15 @@ int __weak arch_asym_cpu_priority(int cpu)
> >  	return -cpu;
> >  }
> >
> > +/*
> > + * For asym packing, early check if CPUs with higher priority should be
> > + * preferred. On some architectures, more data is needed to make a
> > + * decision.
> > + */
> > +bool __weak arch_sched_asym_prefer_early(int a, int b)
> > +{
> > +	return sched_asym_prefer(a, b);
> > +}
> > +
> >  /*
> >   * The margin used when comparing utilization with CPU capacity.
> >   *
> > @@ -8458,7 +8467,7 @@ static inline void update_sg_lb_stats(struct lb_env *env,
> >  	if (!local_group && env->sd->flags & SD_ASYM_PACKING &&
> >  	    env->idle != CPU_NOT_IDLE &&
> >  	    sgs->sum_h_nr_running &&
> > -	    sched_asym_prefer(env->dst_cpu, group->asym_prefer_cpu)) {
> > +	    arch_sched_asym_prefer_early(env->dst_cpu, group->asym_prefer_cpu)) {
>
> If itmt sets arch_sched_asym_prefer_early() to true, all groups will be
> set as group_asym_packing unconditionally, which is wrong. The state
> has to be set only when we want asym packing migration.

Thanks for your quick review Vincent!

Indeed, sgs->group_asym_packing should only be set when we want
asym_packing migration. However, for ITMT we also need to know the state
of the SMT siblings of dst_cpu, if any. update_sg_lb_stats() does not
know anything about the local group. Thus, arch_sched_asym_prefer_early()
intends to give an early decision that can be revoked later, when the
statistics of the local group are known.

I also thought about checking the state of the SMT siblings from
update_sg_lb_stats() but that would duplicate the functionality of
update_sg_lb_stats() when called on the local group.

The group can still be classified as group_overloaded and
group_imbalanced, as they take precedence over group_asym_packing. We
would miss the group_has_spare and group_fully_busy classifications,
though. I can look into fixing that.

Thanks and BR,
Ricardo
Re: [PATCH 3/4] sched/fair: Consider SMT in ASYM_PACKING load balance
On Tue, Apr 06, 2021 at 01:18:09PM +0200, Peter Zijlstra wrote:
> On Mon, Apr 05, 2021 at 09:11:07PM -0700, Ricardo Neri wrote:
> > +static bool cpu_group_is_smt(int cpu, struct sched_group *sg)
> > +{
> > +#ifdef CONFIG_SCHED_SMT
> > +	if (!static_branch_likely(&sched_smt_present))
> > +		return false;
> > +
> > +	if (sg->group_weight == 1)
> > +		return false;
> > +
> > +	if (cpumask_weight(cpu_smt_mask(cpu)) == 1)
> > +		return false;
>
> Please explain this condition. Why is it required?

Thank you for your quick review Peter!

Probably this is not required since the previous check verifies the
group weight, and the subsequent check makes sure that @sg matches the
SMT siblings of @cpu.

Thanks and BR,
Ricardo
Re: [PATCH 3/4] sched/fair: Consider SMT in ASYM_PACKING load balance
On Tue, Apr 06, 2021 at 01:17:28PM +0200, Peter Zijlstra wrote:
> On Mon, Apr 05, 2021 at 09:11:07PM -0700, Ricardo Neri wrote:
> > @@ -8507,6 +8619,10 @@ static bool update_sd_pick_busiest(struct lb_env *env,
> >  	if (!sgs->sum_h_nr_running)
> >  		return false;
> >
> > +	if (sgs->group_type == group_asym_packing &&
> > +	    !asym_can_pull_tasks(env->dst_cpu, sds, sgs, sg))
> > +		return false;
>
> All of this makes my head hurt; but afaict this isn't right.
>
> Your update_sg_lb_stats() change makes that we unconditionally set
> sgs->group_asym_packing, and then this is to undo that. But it's not
> clear this covers all cases right.

We could not make a decision to set sgs->group_asym_packing in
update_sg_lb_stats() because we don't have information about the dst_cpu
and its SMT siblings, if any. That is the reason I proposed to delay the
decision to update_sd_pick_busiest(), where we can compare local and sgs.

> Even if !sched_asym_prefer(), we could end up selecting this sg as
> busiest, but you're just bailing out here.

Even if sgs->group_asym_packing is unconditionally set, sgs can still
be classified as group_overloaded and group_imbalanced. In such cases
we wouldn't bail out. sgs could not be classified as group_fully_busy
or group_has_spare and we would bail out, though. Is your concern about
these? I can fix up these two cases.

Thanks and BR,
Ricardo
[PATCH 4/4] x86/sched: Enable checks of the state of SMT siblings in load balancing
ITMT relies on asymmetric packing of tasks to ensure CPUs are populated in
priority order. When balancing load, the scheduler compares scheduling
groups in pairs, and compares only the priority of the CPUs of highest
priority in the group. This may result in CPUs with medium priority being
overlooked.

A recent change introduced logic to also consider the idle state of the
SMT siblings of the CPU doing the load balance. Enable those checks for
x86 when using ITMT.

Cc: Aubrey Li
Cc: Ben Segall
Cc: Daniel Bristot de Oliveira
Cc: Dietmar Eggemann
Cc: Joel Fernandes (Google)
Cc: Mel Gorman
Cc: Quentin Perret
Cc: Srinivas Pandruvada
Cc: Steven Rostedt
Cc: Tim Chen
Reviewed-by: Len Brown
Signed-off-by: Ricardo Neri
---
 arch/x86/kernel/itmt.c | 15 +++++++++++++++
 1 file changed, 15 insertions(+)

diff --git a/arch/x86/kernel/itmt.c b/arch/x86/kernel/itmt.c
index 1afbdd1dd777..1407120af82d 100644
--- a/arch/x86/kernel/itmt.c
+++ b/arch/x86/kernel/itmt.c
@@ -28,6 +28,8 @@ DEFINE_PER_CPU_READ_MOSTLY(int, sched_core_priority);
 /* Boolean to track if system has ITMT capabilities */
 static bool __read_mostly sched_itmt_capable;
 
+/* Boolean to activate checks on the state of SMT siblings */
+static bool __read_mostly sched_itmt_smt_checks;
+
 /*
  * Boolean to control whether we want to move processes to cpu capable
@@ -124,6 +126,8 @@ int sched_set_itmt_support(void)
 
 	sysctl_sched_itmt_enabled = 1;
 
+	sched_itmt_smt_checks = true;
+
 	x86_topology_update = true;
 	rebuild_sched_domains();
 
@@ -160,6 +164,7 @@ void sched_clear_itmt_support(void)
 	if (sysctl_sched_itmt_enabled) {
 		/* disable sched_itmt if we are no longer ITMT capable */
 		sysctl_sched_itmt_enabled = 0;
+		sched_itmt_smt_checks = false;
 		x86_topology_update = true;
 		rebuild_sched_domains();
 	}
@@ -167,6 +172,16 @@ void sched_clear_itmt_support(void)
 	mutex_unlock(&itmt_update_mutex);
 }
 
+bool arch_asym_check_smt_siblings(void)
+{
+	return sched_itmt_smt_checks;
+}
+
+bool arch_sched_asym_prefer_early(int a, int b)
+{
+	return sched_itmt_smt_checks;
+}
+
 int arch_asym_cpu_priority(int cpu)
{
	return per_cpu(sched_core_priority, cpu);
-- 
2.17.1
[PATCH 2/4] sched/fair: Introduce arch_sched_asym_prefer_early()
Introduce arch_sched_asym_prefer_early() so that architectures with SMT
can delay the decision to label a candidate busiest group as
group_asym_packing.

When using asymmetric packing, high priority idle CPUs pull tasks from
scheduling groups with low priority CPUs. The decision on using
asymmetric packing for load balancing is done after collecting the
statistics of a candidate busiest group. However, this decision needs to
consider the state of the SMT siblings of dst_cpu.

Cc: Aubrey Li
Cc: Ben Segall
Cc: Daniel Bristot de Oliveira
Cc: Dietmar Eggemann
Cc: Joel Fernandes (Google)
Cc: Mel Gorman
Cc: Quentin Perret
Cc: Srinivas Pandruvada
Cc: Steven Rostedt
Cc: Tim Chen
Reviewed-by: Len Brown
Signed-off-by: Ricardo Neri
---
 include/linux/sched/topology.h |  1 +
 kernel/sched/fair.c            | 11 ++++++++++-
 2 files changed, 11 insertions(+), 1 deletion(-)

diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index 8f0f778b7c91..663b98959305 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -57,6 +57,7 @@ static inline int cpu_numa_flags(void)
 #endif
 
 extern int arch_asym_cpu_priority(int cpu);
+extern bool arch_sched_asym_prefer_early(int a, int b);
 
 struct sched_domain_attr {
 	int relax_domain_level;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 4ef3fa0d5e8d..e74da853b046 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -106,6 +106,15 @@ int __weak arch_asym_cpu_priority(int cpu)
 	return -cpu;
 }
 
+/*
+ * For asym packing, early check if CPUs with higher priority should be
+ * preferred. On some architectures, more data is needed to make a decision.
+ */
+bool __weak arch_sched_asym_prefer_early(int a, int b)
+{
+	return sched_asym_prefer(a, b);
+}
+
 /*
  * The margin used when comparing utilization with CPU capacity.
  *
@@ -8458,7 +8467,7 @@ static inline void update_sg_lb_stats(struct lb_env *env,
 	if (!local_group && env->sd->flags & SD_ASYM_PACKING &&
 	    env->idle != CPU_NOT_IDLE &&
 	    sgs->sum_h_nr_running &&
-	    sched_asym_prefer(env->dst_cpu, group->asym_prefer_cpu)) {
+	    arch_sched_asym_prefer_early(env->dst_cpu, group->asym_prefer_cpu)) {
 		sgs->group_asym_packing = 1;
 	}
-- 
2.17.1
[PATCH 1/4] sched/fair: Optimize checking for group_asym_packing
By checking local_group, we can avoid additional checks and invoking
sched_asym_prefer() when it is not needed.

Cc: Aubrey Li
Cc: Ben Segall
Cc: Daniel Bristot de Oliveira
Cc: Dietmar Eggemann
Cc: Joel Fernandes (Google)
Cc: Mel Gorman
Cc: Quentin Perret
Cc: Srinivas Pandruvada
Cc: Steven Rostedt
Cc: Tim Chen
Reviewed-by: Len Brown
Signed-off-by: Ricardo Neri
---
 kernel/sched/fair.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 04a3ce20da67..4ef3fa0d5e8d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8455,7 +8455,7 @@ static inline void update_sg_lb_stats(struct lb_env *env,
 	}
 
 	/* Check if dst CPU is idle and preferred to this group */
-	if (env->sd->flags & SD_ASYM_PACKING &&
+	if (!local_group && env->sd->flags & SD_ASYM_PACKING &&
 	    env->idle != CPU_NOT_IDLE && sgs->sum_h_nr_running &&
 	    sched_asym_prefer(env->dst_cpu, group->asym_prefer_cpu)) {
-- 
2.17.1
[PATCH 3/4] sched/fair: Consider SMT in ASYM_PACKING load balance
When deciding to pull tasks in ASYM_PACKING, it is necessary not only to check for the idle state of the CPU doing the load balancing, but also of its SMT siblings. If the CPU doing the balancing is idle but its SMT siblings are not busy, performance suffers if it pulls tasks from a medium priority CPU that does not have SMT siblings. The decision to label a group for asymmetric packing balancing is done in update_sg_lb_stats(). However, for SMT, that code does not account for idle SMT siblings. Implement asym_can_pull_tasks() to revisit the early decision on whether the CPU doing the balance can pull tasks once the needed information is available. arch_sched_asym_prefer_early() and arch_asym_check_smt_siblings() are used to conserve the legacy behavior. Cc: Aubrey Li Cc: Ben Segall Cc: Daniel Bristot de Oliveira Cc: Dietmar Eggemann Cc: Joel Fernandes (Google) Cc: Mel Gorman Cc: Quentin Perret Cc: Srinivas Pandruvada Cc: Steven Rostedt Cc: Tim Chen Reviewed-by: Len Brown Signed-off-by: Ricardo Neri --- include/linux/sched/topology.h | 1 + kernel/sched/fair.c| 122 + 2 files changed, 123 insertions(+) diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h index 663b98959305..6487953b24e8 100644 --- a/include/linux/sched/topology.h +++ b/include/linux/sched/topology.h @@ -58,6 +58,7 @@ static inline int cpu_numa_flags(void) extern int arch_asym_cpu_priority(int cpu); extern bool arch_sched_asym_prefer_early(int a, int b); +extern bool arch_asym_check_smt_siblings(void); struct sched_domain_attr { int relax_domain_level; diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index e74da853b046..14b18dfea5b1 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -115,6 +115,14 @@ bool __weak arch_sched_asym_prefer_early(int a, int b) return sched_asym_prefer(a, b); } +/* + * For asym packing, first check the state of SMT siblings before deciding to + * pull tasks. 
+ */
+bool __weak arch_asym_check_smt_siblings(void)
+{
+	return false;
+}
+
 /*
  * The margin used when comparing utilization with CPU capacity.
  *
@@ -8483,6 +8491,110 @@ static inline void update_sg_lb_stats(struct lb_env *env,
 		sgs->group_capacity;
 }
 
+static bool cpu_group_is_smt(int cpu, struct sched_group *sg)
+{
+#ifdef CONFIG_SCHED_SMT
+	if (!static_branch_likely(&sched_smt_present))
+		return false;
+
+	if (sg->group_weight == 1)
+		return false;
+
+	if (cpumask_weight(cpu_smt_mask(cpu)) == 1)
+		return false;
+
+	return cpumask_equal(sched_group_span(sg), cpu_smt_mask(cpu));
+#else
+	return false;
+#endif
+}
+
+/**
+ * asym_can_pull_tasks - Check whether the load balancing CPU can pull tasks
+ * @dst_cpu:	CPU doing the load balancing
+ * @sds:	Load-balancing data with statistics of the local group
+ * @sgs:	Load-balancing statistics of the candidate busiest group
+ * @sg:		The candidate busiest group
+ *
+ * Check the state of the SMT siblings of both @sds::local and @sg and decide
+ * if @dst_cpu can pull tasks. If @dst_cpu does not have SMT siblings, it can
+ * pull tasks if two or more of the SMT siblings of @sg are busy. If only one
+ * CPU in @sg is busy, pull tasks only if @dst_cpu has higher priority.
+ *
+ * If both @dst_cpu and @sg have SMT siblings, even out the number of idle
+ * CPUs between @sds::local and @sg. Thus, pull tasks from @sg if the
+ * difference between the number of busy CPUs is 2 or more. If the difference
+ * is of 1, only pull if @dst_cpu has higher priority. If @sg does not have
+ * SMT siblings, only pull tasks if all of the SMT siblings of @dst_cpu are
+ * idle and @sg has lower priority.
+ */
+static bool asym_can_pull_tasks(int dst_cpu, struct sd_lb_stats *sds,
+				struct sg_lb_stats *sgs, struct sched_group *sg)
+{
+#ifdef CONFIG_SCHED_SMT
+	int cpu, local_busy_cpus, sg_busy_cpus;
+	bool local_is_smt, sg_is_smt;
+
+	if (!arch_asym_check_smt_siblings())
+		return true;
+
+	cpu = group_first_cpu(sg);
+	local_is_smt = cpu_group_is_smt(dst_cpu, sds->local);
+	sg_is_smt = cpu_group_is_smt(cpu, sg);
+
+	sg_busy_cpus = sgs->group_weight - sgs->idle_cpus;
+
+	if (!local_is_smt) {
+		/*
+		 * If we are here, @dst_cpu is idle and does not have SMT
+		 * siblings. Pull tasks if candidate group has two or more
+		 * busy CPUs.
+		 */
+		if (sg_is_smt && sg_busy_cpus >= 2)
+			return true;
+
+		/*
+		 * @dst_cpu does not have SMT siblings. @sg may have SMT
+		 * siblings and only one is busy. In such case, @dst_cpu
+		 * can help if it has higher priority and is idle.
+		 */
+		return !
[PATCH 0/4] sched/fair: Fix load balancing of SMT siblings with ASYM_PACKING
UDP_RR      thread-1   1.00 (  0.13)   -0.24 (  0.20)
UDP_RR      thread-3   1.00 (  0.32)   +3.82 (  0.46)
UDP_RR      thread-7   1.00 (  8.85)   +0.43 ( 13.60)
UDP_RR      thread-14  1.00 ( 20.71)   +0.58 ( 21.85)

tbench
======
case        load       baseline(std%)  compare%( std%)
loopback    thread-1   1.00 (  0.10)   +0.06 (  0.27)
loopback    thread-3   1.00 (  0.27)   +3.87 (  0.11)
loopback    thread-7   1.00 (  0.28)   -0.27 (  1.29)
loopback    thread-14  1.00 (  0.07)   +0.52 (  0.24)

schbench
========
case        load       baseline(std%)  compare%( std%)
normal      mthread-1  1.00 (  2.33)  +32.38 (  1.99)
normal      mthread-2  1.00 (  0.00)   +0.00 (  0.00)
normal      mthread-4  1.00 (  0.00)   +0.00 (  0.00)
normal      mthread-8  1.00 (  3.01)   -2.16 (  3.04)

Table 2. Test results of patches with HWP
=========================================

hackbench
=========
case             load     baseline(std%)  compare%( std%)
process-pipe     group-1  1.00 (  4.73)   -3.23 (  2.62)
process-pipe     group-2  1.00 (  2.32)   +0.13 (  0.36)
process-pipe     group-4  1.00 (  3.85)   +1.95 (  3.92)
process-pipe     group-8  1.00 ( 14.31)   +2.12 ( 13.44)
process-sockets  group-1  1.00 (  3.10)   -1.22 (  1.95)
process-sockets  group-2  1.00 (  0.91)   -1.30 (  1.20)
process-sockets  group-4  1.00 (  0.87)   -1.29 (  1.35)
process-sockets  group-8  1.00 (  4.98)   +1.21 (  4.57)
threads-pipe     group-1  1.00 (  0.76)   +2.87 (  0.89)
threads-pipe     group-2  1.00 (  1.41)   -3.19 (  1.18)
threads-pipe     group-4  1.00 (  2.37)   -1.22 (  1.41)
threads-pipe     group-8  1.00 ( 15.12)   +6.64 (  5.82)
threads-sockets  group-1  1.00 (  2.73)   +0.89 (  0.92)
threads-sockets  group-2  1.00 (  1.59)   -4.32 (  1.23)
threads-sockets  group-4  1.00 (  0.35)   -1.42 (  1.13)
threads-sockets  group-8  1.00 (  0.77)   +0.23 (  1.35)

netperf
=======
case        load       baseline(std%)  compare%( std%)
TCP_RR      thread-1   1.00 (  0.27)   +0.05 (  0.41)
TCP_RR      thread-3   1.00 (  4.71)   +8.43 (  0.45)
TCP_RR      thread-7   1.00 ( 10.15)   -1.16 ( 16.95)
TCP_RR      thread-14  1.00 ( 23.46)   +0.21 ( 23.69)
UDP_RR      thread-1   1.00 (  0.16)   +1.52 (  0.81)
UDP_RR      thread-3   1.00 (  4.00)   +8.33 (  0.50)
UDP_RR      thread-7   1.00 (  7.39)   -0.99 ( 11.34)
UDP_RR      thread-14  1.00 ( 23.74)   -0.51 ( 21.87)

tbench
======
case        load       baseline(std%)  compare%( std%)
loopback    thread-1   1.00 (  0.51)   -0.86 (  0.08)
loopback    thread-3   1.00 (  1.40)  +52.18 (  1.13)
loopback    thread-7   1.00 (  0.94)   +0.31 (  0.20)
loopback    thread-14  1.00 (  0.21)   -0.24 (  0.47)

schbench
========
case        load       baseline(std%)  compare%( std%)
normal      mthread-1  1.00 ( 20.14)  +61.19 (  5.44)
normal      mthread-2  1.00 (  0.00)   +0.00 (  0.00)
normal      mthread-4  1.00 (  0.00)   +0.00 (  0.00)
normal      mthread-8  1.00 (  3.05)   -2.16 (  2.99)

Ricardo Neri (4):
  sched/fair: Optimize checking for group_asym_packing
  sched/fair: Introduce arch_sched_asym_prefer_early()
  sched/fair: Consider SMT in ASYM_PACKING load balance
  x86/sched: Enable checks of the state of SMT siblings in load
    balancing

 arch/x86/kernel/itmt.c         |  15
 include/linux/sched/topology.h |   2 +
 kernel/sched/fair.c            | 135 -
 3 files changed, 150 insertions(+), 2 deletions(-)

-- 
2.17.1
Re: [PATCH V2 1/25] x86/cpufeatures: Enumerate Intel Hybrid Technology feature bit
On Wed, Mar 10, 2021 at 09:01:47PM +0100, Borislav Petkov wrote: > On Wed, Mar 10, 2021 at 11:46:44AM -0800, Ricardo Neri wrote: > > But this series provides the use case, right? Kan's patches handle PMU > > counters > > that may differ cross types of CPUs. In patch 2, get_hybrid_params() > > needs to check first if X86_FEATURE_HYBRID_CPU is enabled before > > querying the hybrid parameters. Otherwise, we would need to rely on the > > maximum level of CPUID, which may not be reliable. > > On Wed, Mar 10, 2021 at 11:33:54AM -0800, Srinivas Pandruvada wrote: > > We are working on changes to P-State driver for hybrid CPUs using this > > define. They are still work in progress. > > But this patch can be submitted later with our set of changes. > > Answering to both with a single mail: > > I don't have a problem with X86_FEATURE_HYBRID_CPU - I simply don't want > to show "hybrid_cpu" in /proc/cpuinfo unless there's a valid use case > for userspace to know that it is running on a hybrid CPU. Ah, I get your point now. You would like to see #define X86_FEATURE_HYBRID_CPU (18*32+15) /* "" This part has CPUs of more than one type */ Right? Now your first comment makes sense. Srinivas, Kan, I don't think we need to expose "hybrid_cpu" in /proc/cpuinfo, do we? Thanks and BR, Ricardo
Re: [PATCH V2 1/25] x86/cpufeatures: Enumerate Intel Hybrid Technology feature bit
On Wed, Mar 10, 2021 at 05:53:58PM +0100, Borislav Petkov wrote: > On Wed, Mar 10, 2021 at 08:37:37AM -0800, kan.li...@linux.intel.com wrote: > > From: Ricardo Neri > > > > Add feature enumeration to identify a processor with Intel Hybrid > > Technology: one in which CPUs of more than one type are the same package. > > On a hybrid processor, all CPUs support the same homogeneous (i.e., > > symmetric) instruction set. All CPUs enumerate the same features in CPUID. > > Thus, software (user space and kernel) can run and migrate to any CPU in > > the system as well as utilize any of the enumerated features without any > > change or special provisions. The main difference among CPUs in a hybrid > > processor are power and performance properties. > > > > Cc: Andi Kleen > > Cc: Kan Liang > > Cc: "Peter Zijlstra (Intel)" > > Cc: "Rafael J. Wysocki" > > Cc: "Ravi V. Shankar" > > Cc: Srinivas Pandruvada > > Cc: linux-kernel@vger.kernel.org > > Reviewed-by: Len Brown > > Reviewed-by: Tony Luck > > Signed-off-by: Ricardo Neri > > --- > > Changes since v1 (as part of patchset for perf change for Alderlake) > > * None > > > > Changes since v1 (in a separate posting): > > * Reworded commit message to clearly state what is Intel Hybrid > >Technology. Stress that all CPUs can run the same instruction > >set and support the same features. 
> > --- > > arch/x86/include/asm/cpufeatures.h | 1 + > > 1 file changed, 1 insertion(+) > > > > diff --git a/arch/x86/include/asm/cpufeatures.h > > b/arch/x86/include/asm/cpufeatures.h > > index cc96e26d69f7..e7cfc9eedf8d 100644 > > --- a/arch/x86/include/asm/cpufeatures.h > > +++ b/arch/x86/include/asm/cpufeatures.h > > @@ -374,6 +374,7 @@ > > #define X86_FEATURE_MD_CLEAR (18*32+10) /* VERW clears CPU > > buffers */ > > #define X86_FEATURE_TSX_FORCE_ABORT(18*32+13) /* "" > > TSX_FORCE_ABORT */ > > #define X86_FEATURE_SERIALIZE (18*32+14) /* SERIALIZE > > instruction */ > > +#define X86_FEATURE_HYBRID_CPU (18*32+15) /* This part has > > CPUs of more than one type */ > > /* "" This ... > > unless you have a valid use case for "hybrid_cpu" being present there. But this series provides the use case, right? Kan's patches handle PMU counters that may differ cross types of CPUs. In patch 2, get_hybrid_params() needs to check first if X86_FEATURE_HYBRID_CPU is enabled before querying the hybrid parameters. Otherwise, we would need to rely on the maximum level of CPUID, which may not be reliable. Thanks and BR, Ricardo
Re: [PATCH 1/4] drivers core: Introduce CPU type sysfs interface
On Wed, Oct 07, 2020 at 07:15:46AM +0200, Greg Kroah-Hartman wrote: > On Tue, Oct 06, 2020 at 08:14:47PM -0700, Ricardo Neri wrote: > > On Tue, Oct 06, 2020 at 09:37:44AM +0200, Greg Kroah-Hartman wrote: > > > On Mon, Oct 05, 2020 at 05:57:36PM -0700, Ricardo Neri wrote: > > > > On Sat, Oct 03, 2020 at 10:53:45AM +0200, Greg Kroah-Hartman wrote: > > > > > On Fri, Oct 02, 2020 at 06:17:42PM -0700, Ricardo Neri wrote: > > > > > > Hybrid CPU topologies combine CPUs of different microarchitectures > > > > > > in the > > > > > > same die. Thus, even though the instruction set is compatible among > > > > > > all > > > > > > CPUs, there may still be differences in features (e.g., some CPUs > > > > > > may > > > > > > have counters that others CPU do not). There may be applications > > > > > > interested in knowing the type of micro-architecture topology of the > > > > > > system to make decisions about process affinity. > > > > > > > > > > > > While the existing sysfs for capacity (/sys/devices/system/cpu/cpuX/ > > > > > > cpu_capacity) may be used to infer the types of micro-architecture > > > > > > of the > > > > > > CPUs in the platform, it may not be entirely accurate. For > > > > > > instance, two > > > > > > subsets of CPUs with different types of micro-architecture may have > > > > > > the > > > > > > same capacity due to power or thermal constraints. > > > > > > > > > > > > Create the new directory /sys/devices/system/cpu/types. Under such > > > > > > directory, create individual subdirectories for each type of CPU > > > > > > micro- > > > > > > architecture. Each subdirectory will have cpulist and cpumap files. > > > > > > This > > > > > > makes it convenient for user space to read all the CPUs of the same > > > > > > type > > > > > > at once without having to inspect each CPU individually. 
> > > > > > > > > > > > Implement a generic interface using weak functions that > > > > > > architectures can > > > > > > override to indicate a) support for CPU types, b) the CPU type > > > > > > number, and > > > > > > c) a string to identify the CPU vendor and type. > > > > > > > > > > > > For example, an x86 system with one Intel Core and four Intel Atom > > > > > > CPUs > > > > > > would look like this (other architectures have the hooks to use > > > > > > whatever > > > > > > directory naming convention below "types" that meets their needs): > > > > > > > > > > > > user@host:~$: ls /sys/devices/system/cpu/types > > > > > > intel_atom_0 intel_core_0 > > > > > > > > > > > > user@host:~$ ls /sys/devices/system/cpu/types/intel_atom_0 > > > > > > cpulist cpumap > > > > > > > > > > > > user@host:~$ ls /sys/devices/system/cpu/types/intel_core_0 > > > > > > cpulist cpumap > > > > > > > > > > > > user@host:~$ cat /sys/devices/system/cpu/types/intel_atom_0/cpumap > > > > > > 0f > > > > > > > > > > > > user@host:~$ cat /sys/devices/system/cpu/types/intel_atom_0/cpulist > > > > > > 0-3 > > > > > > > > > > > > user@ihost:~$ cat /sys/devices/system/cpu/types/intel_core_0/cpumap > > > > > > 10 > > > > > > > > > > > > user@host:~$ cat /sys/devices/system/cpu/types/intel_core_0/cpulist > > > > > > 4 > > > > > > > > Thank you for the quick and detailed Greg! > > > > > > > > > > > > > > The output of 'tree' sometimes makes it easier to see here, or: > > > > > grep -R . * > > > > > also works well. > > > > > > > > Indeed, this would definitely make it more readable. > > > > > > > > > > > > > > > On non-hybrid systems, the /sys/devices/system/cpu/types directory > > > > > > is not > > > > > > created. Add a hook for this purpose. > > > > > > > > > > Why
Re: [PATCH 1/4] drivers core: Introduce CPU type sysfs interface
On Tue, Oct 06, 2020 at 09:37:44AM +0200, Greg Kroah-Hartman wrote: > On Mon, Oct 05, 2020 at 05:57:36PM -0700, Ricardo Neri wrote: > > On Sat, Oct 03, 2020 at 10:53:45AM +0200, Greg Kroah-Hartman wrote: > > > On Fri, Oct 02, 2020 at 06:17:42PM -0700, Ricardo Neri wrote: > > > > Hybrid CPU topologies combine CPUs of different microarchitectures in > > > > the > > > > same die. Thus, even though the instruction set is compatible among all > > > > CPUs, there may still be differences in features (e.g., some CPUs may > > > > have counters that others CPU do not). There may be applications > > > > interested in knowing the type of micro-architecture topology of the > > > > system to make decisions about process affinity. > > > > > > > > While the existing sysfs for capacity (/sys/devices/system/cpu/cpuX/ > > > > cpu_capacity) may be used to infer the types of micro-architecture of > > > > the > > > > CPUs in the platform, it may not be entirely accurate. For instance, two > > > > subsets of CPUs with different types of micro-architecture may have the > > > > same capacity due to power or thermal constraints. > > > > > > > > Create the new directory /sys/devices/system/cpu/types. Under such > > > > directory, create individual subdirectories for each type of CPU micro- > > > > architecture. Each subdirectory will have cpulist and cpumap files. This > > > > makes it convenient for user space to read all the CPUs of the same type > > > > at once without having to inspect each CPU individually. > > > > > > > > Implement a generic interface using weak functions that architectures > > > > can > > > > override to indicate a) support for CPU types, b) the CPU type number, > > > > and > > > > c) a string to identify the CPU vendor and type. 
> > > > > > > > For example, an x86 system with one Intel Core and four Intel Atom CPUs > > > > would look like this (other architectures have the hooks to use whatever > > > > directory naming convention below "types" that meets their needs): > > > > > > > > user@host:~$: ls /sys/devices/system/cpu/types > > > > intel_atom_0 intel_core_0 > > > > > > > > user@host:~$ ls /sys/devices/system/cpu/types/intel_atom_0 > > > > cpulist cpumap > > > > > > > > user@host:~$ ls /sys/devices/system/cpu/types/intel_core_0 > > > > cpulist cpumap > > > > > > > > user@host:~$ cat /sys/devices/system/cpu/types/intel_atom_0/cpumap > > > > 0f > > > > > > > > user@host:~$ cat /sys/devices/system/cpu/types/intel_atom_0/cpulist > > > > 0-3 > > > > > > > > user@ihost:~$ cat /sys/devices/system/cpu/types/intel_core_0/cpumap > > > > 10 > > > > > > > > user@host:~$ cat /sys/devices/system/cpu/types/intel_core_0/cpulist > > > > 4 > > > > Thank you for the quick and detailed Greg! > > > > > > > > The output of 'tree' sometimes makes it easier to see here, or: > > > grep -R . * > > > also works well. > > > > Indeed, this would definitely make it more readable. > > > > > > > > > On non-hybrid systems, the /sys/devices/system/cpu/types directory is > > > > not > > > > created. Add a hook for this purpose. > > > > > > Why should these not show up if the system is not "hybrid"? > > > > My thinking was that on a non-hybrid system, it does not make sense to > > create this interface, as all the CPUs will be of the same type. > > Why not just have this an attribute type in the existing cpuX directory? > Why do this have to be a totally separate directory and userspace has to > figure out to look in two different spots for the same cpu to determine > what it is? But if the type is located under cpuX, userspace would need to traverse all the CPUs and create its own cpu masks. Under the types directory it would only need to look once for each type of CPU, IMHO. 
> > That feels wasteful, it should be much simpler to use the existing > object, right? > > That way, you also show the "type" of all cpus, no matter if they are > "hybrid" or not, again, making userspace deal with things in a much > simpler manner. Indeed, that would be simpler to implement, and perhaps a natural extension of the existing interface. Lastly, legacy and non-hybrid p
Re: [PATCH 0/4] drivers core: Introduce CPU type sysfs interface
On Tue, Oct 06, 2020 at 09:51:53AM +0100, Qais Yousef wrote: > Hi Ricardo Hi Qais, Thanks for chiming in. > > Adding some people who might be interested. > > On 10/02/20 18:17, Ricardo Neri wrote: > > Hybrid CPU topologies combine processors with more than one type of > > micro-architecture. Hence, it is possible that individual CPUs support > > slightly different features (e.g., performance counters) and different > > performance properties. Thus, there may be user space entities interested > > in knowing the topology of the system based on the types of available > > CPUs. > > > > Currently, there exists an interface for the CPU capacity (/sys/devices/ > > system/cpu/cpuX/cpu_capacity). However, CPU capacity does not always map > > to CPU types (by the way, I will submit a separate series to bring such > > interface to x86). > > Why do you need to do this mapping? > > > > > This series proposes the new interface /sys/devices/system/cpu/types > > which, in hybrid parts, creates a subdirectory for each type of CPU. > > Each subdirectory contains a CPU list and a CPU map that user space can > > query. > > Why user space needs to query this info? > > The rationale is missing the intention behind all of this. It seems you're > expecting software to parse this info and take decisions based on that? I propose this interface to be consumed for application or libraries wanting to know the topology of the system. Perhaps, a utility program that wants to depict CPUs by type. Another example is a policy manager wanting to create a cgroup cpuset of CPUs of a given type for performance reasons. Similarly, a program may want to affinitize itself to a type of CPU, if for instance certain performance events are only counted on a given CPU type. Also, an interface with cpumasks makes it convenient for userspace to not have to traverse each CPU individually. 
a) add the type of each CPU in /proc/cpuinfo
b) add the type of each CPU in /sys/devices/system/cpu/cpu#/type
c) for performance, derive the CPU type from
   /sys/devices/system/cpu/cpu#/cpu_capacity
d) add an interface similar to what I propose in this series.

None of this imply that certain instructions will be able to run on one
type of CPU but not on another. Instead, better performance may be
obtained on one type CPU vs another.

Thanks and BR,
Ricardo
Re: [PATCH 1/4] drivers core: Introduce CPU type sysfs interface
On Fri, Oct 02, 2020 at 08:27:41PM -0700, Randy Dunlap wrote: > On 10/2/20 6:17 PM, Ricardo Neri wrote: > > +/** > > + * arch_get_cpu_type() - Get the CPU type number > > + * @cpu: Index of the CPU of which the index is needed > > + * > > + * Get the CPU type number of @cpu, a non-zero unsigned 32-bit number that Thank you for your feedback Randy! > Are you sure that @cpu is non-zero? This is the intent. Maybe it can be reworked to return -EINVAL instead? I guess it is plausible but less likely that an arch defines a type as 0xffffffea; > > > + * uniquely identifies a type of CPU micro-architecture. All CPUs of the same > > + * type have the same type number. Type numbers are defined by each CPU > > + * architecture. > > + */ > > +u32 __weak arch_get_cpu_type(int cpu) > > +{ > > + return 0; > > +} > > arch_get_cpu_type() in patch 4/4 allows @cpu to be 0. It should not return 0 if the system does not have X86_FEATURE_HYBRID_CPU. The currently existing CPU types are all non-zero as per the Intel SDM. Am I missing anything? Thanks and BR, Ricardo
Re: [PATCH 1/4] drivers core: Introduce CPU type sysfs interface
On Sat, Oct 03, 2020 at 01:05:48PM +0200, Greg Kroah-Hartman wrote: > On Sat, Oct 03, 2020 at 10:53:45AM +0200, Greg Kroah-Hartman wrote: > > On Fri, Oct 02, 2020 at 06:17:42PM -0700, Ricardo Neri wrote: > > > +/** > > > + * arch_get_cpu_type_name() - Get the CPU type name > > > + * @cpu_type:Type of CPU micro-architecture. > > > + * > > > + * Returns a string name associated with the CPU micro-architecture type > > > as > > > + * indicated in @cpu_type. The format shall be _. > > > Returns > > > + * NULL if the CPU type is not known. > > > + */ > > > +const char __weak *arch_get_cpu_type_name(u32 cpu_type) > > > +{ > > > + return NULL; > > > +} > > > > Why is vendor part of this? Shouldn't it just be arch? > > > > I say this as "vendor" is kind of "interesting" when it comes to other > > arches... > > > > Speaking of other arches, we all know that other arches have this > > feature as well, have you worked with any other groups to verify that > > this interface will also work with them? > > Here's one set of patches for ARM64 for much the same type of cpu > design: > https://android-review.googlesource.com/c/kernel/common/+/1437098/3 > Yes, it's not been posted to any kernel lists, but this is public so you > need to work with the ARM developers to come up with an interface that > works for everyone please. Thanks for the pointer, Greg! I will study this proposal and work with the ARM engineers. BR, Ricardo
Re: [PATCH 4/4] x86/cpu/topology: Implement the CPU type sysfs interface
On Sat, Oct 03, 2020 at 10:55:06AM +0200, Greg Kroah-Hartman wrote: > On Fri, Oct 02, 2020 at 06:17:45PM -0700, Ricardo Neri wrote: > > Recent Intel processors combine CPUs with different types of micro- > > architecture in the same package. There may be applications interested in > > knowing the type topology of the system. For instance, it can be used > > to determine which subsets of CPUs share a common feature. > > > > Implement cpu_type sysfs interfaces for Intel processors. > > > > For example, in a system with four Intel Atom CPUs and one Intel Core CPU, > > these entries look as below. In this example, the native model IDs for > > both types of CPUs are 0: > > > > user@host:~$: ls /sys/devices/system/cpu/types > > intel_atom_0 intel_core_0 > > > > user@host:~$ ls /sys/devices/system/cpu/types/intel_atom_0 > > cpulist cpumap > > > > user@host:~$ ls /sys/devices/system/cpu/types/intel_core_0 > > cpulist cpumap > > > > user@host:~$ cat /sys/devices/system/cpu/types/intel_atom/cpumap > > 0f > > > > user@host:~$ cat /sys/devices/system/cpu/types/intel_atom/cpulist > > 0-3 > > > > user@host:~$ cat /sys/devices/system/cpu/types/intel_core/cpumap > > 10 > > > > user@host:~$ cat /sys/devices/system/cpu/types/intel_core/cpulist > > 4 > > You used the same changelog text here as you did in patch 1/4, why? In both changesets, if merged, somebody could conveniently do git show on either commit and quickly see the intent of the changeset. Would it make it better if in patch 1/4 I put a hypothetical generic example? Something like: user@host:~$: ls /sys/devices/system/cpu/types _ Thanks and BR, Ricardo
Re: [PATCH 1/4] drivers core: Introduce CPU type sysfs interface
On Sat, Oct 03, 2020 at 10:53:45AM +0200, Greg Kroah-Hartman wrote: > On Fri, Oct 02, 2020 at 06:17:42PM -0700, Ricardo Neri wrote: > > Hybrid CPU topologies combine CPUs of different microarchitectures in the > > same die. Thus, even though the instruction set is compatible among all > > CPUs, there may still be differences in features (e.g., some CPUs may > > have counters that others CPU do not). There may be applications > > interested in knowing the type of micro-architecture topology of the > > system to make decisions about process affinity. > > > > While the existing sysfs for capacity (/sys/devices/system/cpu/cpuX/ > > cpu_capacity) may be used to infer the types of micro-architecture of the > > CPUs in the platform, it may not be entirely accurate. For instance, two > > subsets of CPUs with different types of micro-architecture may have the > > same capacity due to power or thermal constraints. > > > > Create the new directory /sys/devices/system/cpu/types. Under such > > directory, create individual subdirectories for each type of CPU micro- > > architecture. Each subdirectory will have cpulist and cpumap files. This > > makes it convenient for user space to read all the CPUs of the same type > > at once without having to inspect each CPU individually. > > > > Implement a generic interface using weak functions that architectures can > > override to indicate a) support for CPU types, b) the CPU type number, and > > c) a string to identify the CPU vendor and type. 
> > > > For example, an x86 system with one Intel Core and four Intel Atom CPUs > > would look like this (other architectures have the hooks to use whatever > > directory naming convention below "types" that meets their needs): > > > > user@host:~$: ls /sys/devices/system/cpu/types > > intel_atom_0 intel_core_0 > > > > user@host:~$ ls /sys/devices/system/cpu/types/intel_atom_0 > > cpulist cpumap > > > > user@host:~$ ls /sys/devices/system/cpu/types/intel_core_0 > > cpulist cpumap > > > > user@host:~$ cat /sys/devices/system/cpu/types/intel_atom_0/cpumap > > 0f > > > > user@host:~$ cat /sys/devices/system/cpu/types/intel_atom_0/cpulist > > 0-3 > > > > user@ihost:~$ cat /sys/devices/system/cpu/types/intel_core_0/cpumap > > 10 > > > > user@host:~$ cat /sys/devices/system/cpu/types/intel_core_0/cpulist > > 4 Thank you for the quick and detailed Greg! > > The output of 'tree' sometimes makes it easier to see here, or: > grep -R . * > also works well. Indeed, this would definitely make it more readable. > > > On non-hybrid systems, the /sys/devices/system/cpu/types directory is not > > created. Add a hook for this purpose. > > Why should these not show up if the system is not "hybrid"? My thinking was that on a non-hybrid system, it does not make sense to create this interface, as all the CPUs will be of the same type. > > > > > Cc: Andi Kleen > > Cc: Dave Hansen > > Cc: "Gautham R. Shenoy" > > Cc: Kan Liang > > Cc: Len Brown > > Cc: "Rafael J. Wysocki" > > Cc: "Ravi V. 
Shankar" > > Cc: Srinivas Pandruvada > > Cc: linux-kernel@vger.kernel.org > > Reviewed-by: Tony Luck > > Suggested-by: Len Brown # Necessity of the interface > > Suggested-by: Dave Hansen # Details of the interface > > Signed-off-by: Ricardo Neri > > --- > > .../ABI/testing/sysfs-devices-system-cpu | 13 ++ > > drivers/base/cpu.c| 214 ++ > > include/linux/cpu.h | 12 + > > include/linux/cpuhotplug.h| 1 + > > 4 files changed, 240 insertions(+) > > > > diff --git a/Documentation/ABI/testing/sysfs-devices-system-cpu > > b/Documentation/ABI/testing/sysfs-devices-system-cpu > > index b555df825447..bcb3d7195102 100644 > > --- a/Documentation/ABI/testing/sysfs-devices-system-cpu > > +++ b/Documentation/ABI/testing/sysfs-devices-system-cpu > > @@ -614,3 +614,16 @@ Description: SPURR ticks for cpuX when it was idle. > > > > This sysfs interface exposes the number of SPURR ticks > > for cpuX when it was idle. > > + > > +What: /sys/devices/system/cpu/types > > + /sys/devices/system/cpu/types/_/cpumap > > + /sys/devices/system/cpu/types/_/cpulist > > Please split these up into different entries, because: Sure, I can split these into two entries. > > > +Date: Oct 2020 > > +
Re: [PATCH 0/4] drivers core: Introduce CPU type sysfs interface
On Sat, Oct 03, 2020 at 10:49:34AM +0200, Borislav Petkov wrote: > On Fri, Oct 02, 2020 at 06:17:41PM -0700, Ricardo Neri wrote: > > Patch 1 of the series proposes the generic interface, with hooks > > that architectures can override to suit their needs. The three patches > > implement such interface for x86 (as per request from Boris, > > I pulled patch 2 from a separate submission [1]). > > So I ask you to show me the whole thing, how this is supposed to be used > in a *real* use case and you're sending me a couple of patches which > report these heterogeneous or whatever they're gonna be called CPUs. > > Are you telling me that all this development effort was done so that > you can report heterogeneity in sysfs? Or you just had to come up with > *something*? > > Let me try again: please show me the *big* *picture* with all the code > how this is supposed to be used. In the patches I read a bunch of "may" > formulations of what is possible and what userspace could do and so on. > > Not that - show me the *full* and *real* use cases which you are > enabling and which justify all that churn. Instead of leaving it all to > userspace CPUID and the kernel not caring one bit. > > Does that make more sense? Yes Boris, thanks for the clarification. The proposed sysfs interface is one instance in which we use cpuinfo_x86.x86_cpu_type. I have other changes that use this new member. I will post them. > > > [1]. https://lkml.org/lkml/2020/10/2/1013 > > For supplying links, we use lore.kernel.org/r/ solely. > Please use that from now on. Sure Boris, I will use lore.kernel.org in the future. Thanks and BR, Ricardo
Re: [PATCH 0/3] x86: Add initial support to discover Intel hybrid CPUs
On Sat, Oct 03, 2020 at 11:04:29AM +0200, Borislav Petkov wrote: > On Fri, Oct 02, 2020 at 07:17:30PM -0700, Luck, Tony wrote: > > On Sat, Oct 03, 2020 at 03:39:29AM +0200, Thomas Gleixner wrote: > > > On Fri, Oct 02 2020 at 13:19, Ricardo Neri wrote: > > > > Add support to discover and enumerate CPUs in Intel hybrid parts. A > > > > hybrid > > > > part has CPUs with more than one type of micro-architecture. Thus, > > > > certain > > > > features may only be present in a specific CPU type. > > > > > > > > It is useful to know the type of CPUs present in a system. For instance, > > > > perf may need to handle CPUs differently depending on the type of micro- > > > > architecture. Decoding machine check error logs may need the additional > > > > micro-architecture type information, so include that in the log. > > > > > > 'It is useful' as justification just makes me barf. > > > > This isn't "hetero" ... all of the cores are architecturally the same. > > But it says above "A hybrid part has CPUs with more than one type of > micro-architecture." > > So which is it? Yes, even though they have different micro-architectures, all instructions and features will be the same across CPUs. > > > If CPUID says that some feature is supported, then it will be supported > > on all of the cores. > > Ok. > > > There might be some model specific performance counter events that only > > apply to some cores. > > That sounds like the perf counter scheduling code would have to pay > attention to what is supported. I think we have some functionality for > that due to some AMD parts but I'd prefer if Peter comments here. > > > Or a machine check error code that is logged in the model specific > > MSCOD field of IA32_MCi_STATUS. But any and all code can run on any > > core. > > As long as that is consumed only by userspace I guess that's ok. 
The > moment someone starts to want to differentiate on what kind of CPU > kernel code runs and acts accordingly, then it becomes ugly so we better > hash it out upfront. We are not planning to implement changes as such. Thanks and BR, Ricardo
Re: [PATCH 0/3] x86: Add initial support to discover Intel hybrid CPUs
On Sat, Oct 03, 2020 at 12:46:29PM +0200, Thomas Gleixner wrote: > On Fri, Oct 02 2020 at 19:17, Tony Luck wrote: > > > On Sat, Oct 03, 2020 at 03:39:29AM +0200, Thomas Gleixner wrote: > >> On Fri, Oct 02 2020 at 13:19, Ricardo Neri wrote: > >> > Add support to discover and enumerate CPUs in Intel hybrid parts. A > >> > hybrid > >> > part has CPUs with more than one type of micro-architecture. Thus, > >> > certain > >> > features may only be present in a specific CPU type. > >> > > >> > It is useful to know the type of CPUs present in a system. For instance, > >> > perf may need to handle CPUs differently depending on the type of micro- > >> > architecture. Decoding machine check error logs may need the additional > >> > micro-architecture type information, so include that in the log. > >> > >> 'It is useful' as justification just makes me barf. > > > > This isn't "hetero" ... all of the cores are architecturally the same. > > The above clearly says: > > >> > A hybrid part has CPUs with more than one type of micro-architecture. > > Can you folks talk to each other and chose non-ambigous wording in > changelogs and cover letters? > > > If CPUID says that some feature is supported, then it will be supported > > on all of the cores. > > That's a different story. Thank you for the quick reply, Thomas. I am sorry if my cover letter was not sufficiently clear. I see now that I should have done a more detailed discussion of the terminology. Yes, all features and instructions as enumerated in CPUID will be supported in all CPUs. Thus, the kernel will not have to check if a feature is supported before running. The hetero/hybrid part means to reflect that more than one types of CPUs will be present in the package, and the main difference is power and performance. The same kernel and user code will run on all CPUs without any other further check. > > > There might be some model specific performance counter events that only > > apply to some cores. 
Or a machine check error code that is logged in the > > model specific MSCOD field of IA32_MCi_STATUS. But any and all code can run > > on any core. > > Ok. The perf side should be doable, IIRC we already have something like > that, but Peter should know better. > > > Sure there will be some different power/performance tradeoffs on some > > cores. But we already have that with some cores able to achieve higher > > turbo frequencies than others. > > Right, that's not a problem. We are also working this front. Thanks and BR, Ricardo
[PATCH 4/4] x86/cpu/topology: Implement the CPU type sysfs interface
Recent Intel processors combine CPUs with different types of micro-architecture in the same package. There may be applications interested in knowing the type topology of the system. For instance, it can be used to determine which subsets of CPUs share a common feature. Implement cpu_type sysfs interfaces for Intel processors. For example, in a system with four Intel Atom CPUs and one Intel Core CPU, these entries look as below. In this example, the native model IDs for both types of CPUs are 0: user@host:~$ ls /sys/devices/system/cpu/types intel_atom_0 intel_core_0 user@host:~$ ls /sys/devices/system/cpu/types/intel_atom_0 cpulist cpumap user@host:~$ ls /sys/devices/system/cpu/types/intel_core_0 cpulist cpumap user@host:~$ cat /sys/devices/system/cpu/types/intel_atom_0/cpumap 0f user@host:~$ cat /sys/devices/system/cpu/types/intel_atom_0/cpulist 0-3 user@host:~$ cat /sys/devices/system/cpu/types/intel_core_0/cpumap 10 user@host:~$ cat /sys/devices/system/cpu/types/intel_core_0/cpulist 4 Cc: Andi Kleen Cc: Dave Hansen Cc: Kan Liang Cc: "Rafael J. Wysocki" Cc: "Ravi V. 
Shankar" Cc: Srinivas Pandruvada Cc: linux-kernel@vger.kernel.org Reviewed-by: Tony Luck Suggested-by: Len Brown # Necessity of the interface Suggested-by: Dave Hansen # Details of the interface Signed-off-by: Ricardo Neri --- arch/x86/include/asm/topology.h | 2 ++ arch/x86/kernel/cpu/topology.c | 23 +++ 2 files changed, 25 insertions(+) diff --git a/arch/x86/include/asm/topology.h b/arch/x86/include/asm/topology.h index f4234575f3fd..d4a3e1ce8338 100644 --- a/arch/x86/include/asm/topology.h +++ b/arch/x86/include/asm/topology.h @@ -218,4 +218,6 @@ static inline void arch_set_max_freq_ratio(bool turbo_disabled) } #endif +#define CPUTYPES_MAX_NR 2 + #endif /* _ASM_X86_TOPOLOGY_H */ diff --git a/arch/x86/kernel/cpu/topology.c b/arch/x86/kernel/cpu/topology.c index d3a0791bc052..709fc473f905 100644 --- a/arch/x86/kernel/cpu/topology.c +++ b/arch/x86/kernel/cpu/topology.c @@ -153,3 +153,26 @@ int detect_extended_topology(struct cpuinfo_x86 *c) #endif return 0; } + +u32 arch_get_cpu_type(int cpu) +{ + struct cpuinfo_x86 *c = &cpu_data(cpu); + + if (cpu < 0 || cpu >= nr_cpu_ids) + return 0; + + return c->x86_cpu_type; +} + +bool arch_has_cpu_type(void) +{ + return boot_cpu_has(X86_FEATURE_HYBRID_CPU); +} + +const char *arch_get_cpu_type_name(u32 cpu_type) +{ + if (boot_cpu_data.x86_vendor == X86_VENDOR_INTEL) + return intel_get_hybrid_cpu_type_name(cpu_type); + + return NULL; +} -- 2.17.1
[PATCH 1/4] drivers core: Introduce CPU type sysfs interface
Hybrid CPU topologies combine CPUs of different microarchitectures in the same die. Thus, even though the instruction set is compatible among all CPUs, there may still be differences in features (e.g., some CPUs may have counters that other CPUs do not). There may be applications interested in knowing the type of micro-architecture topology of the system to make decisions about process affinity. While the existing sysfs for capacity (/sys/devices/system/cpu/cpuX/cpu_capacity) may be used to infer the types of micro-architecture of the CPUs in the platform, it may not be entirely accurate. For instance, two subsets of CPUs with different types of micro-architecture may have the same capacity due to power or thermal constraints. Create the new directory /sys/devices/system/cpu/types. Under this directory, create individual subdirectories for each type of CPU micro-architecture. Each subdirectory will have cpulist and cpumap files. This makes it convenient for user space to read all the CPUs of the same type at once without having to inspect each CPU individually. Implement a generic interface using weak functions that architectures can override to indicate a) support for CPU types, b) the CPU type number, and c) a string to identify the CPU vendor and type. 
For example, an x86 system with one Intel Core and four Intel Atom CPUs would look like this (other architectures have the hooks to use whatever directory naming convention below "types" that meets their needs): user@host:~$ ls /sys/devices/system/cpu/types intel_atom_0 intel_core_0 user@host:~$ ls /sys/devices/system/cpu/types/intel_atom_0 cpulist cpumap user@host:~$ ls /sys/devices/system/cpu/types/intel_core_0 cpulist cpumap user@host:~$ cat /sys/devices/system/cpu/types/intel_atom_0/cpumap 0f user@host:~$ cat /sys/devices/system/cpu/types/intel_atom_0/cpulist 0-3 user@host:~$ cat /sys/devices/system/cpu/types/intel_core_0/cpumap 10 user@host:~$ cat /sys/devices/system/cpu/types/intel_core_0/cpulist 4 On non-hybrid systems, the /sys/devices/system/cpu/types directory is not created. Add a hook for this purpose. Cc: Andi Kleen Cc: Dave Hansen Cc: "Gautham R. Shenoy" Cc: Kan Liang Cc: Len Brown Cc: "Rafael J. Wysocki" Cc: "Ravi V. Shankar" Cc: Srinivas Pandruvada Cc: linux-kernel@vger.kernel.org Reviewed-by: Tony Luck Suggested-by: Len Brown # Necessity of the interface Suggested-by: Dave Hansen # Details of the interface Signed-off-by: Ricardo Neri --- .../ABI/testing/sysfs-devices-system-cpu | 13 ++ drivers/base/cpu.c| 214 ++ include/linux/cpu.h | 12 + include/linux/cpuhotplug.h| 1 + 4 files changed, 240 insertions(+) diff --git a/Documentation/ABI/testing/sysfs-devices-system-cpu b/Documentation/ABI/testing/sysfs-devices-system-cpu index b555df825447..bcb3d7195102 100644 --- a/Documentation/ABI/testing/sysfs-devices-system-cpu +++ b/Documentation/ABI/testing/sysfs-devices-system-cpu @@ -614,3 +614,16 @@ Description: SPURR ticks for cpuX when it was idle. This sysfs interface exposes the number of SPURR ticks for cpuX when it was idle. 
+ +What: /sys/devices/system/cpu/types + /sys/devices/system/cpu/types/_/cpumap + /sys/devices/system/cpu/types/_/cpulist +Date: Oct 2020 +Contact: Linux kernel mailing list +Description: Types of CPUs in hybrid systems + + If a platform has CPUs with more than one type of micro- + architecture, there is one directory for each type of CPU. + Inside each directory, the files cpumap and cpulist contain + a mask and a list representing CPUs of the same type of micro- + architecture. These files only contain CPUs currently online. diff --git a/drivers/base/cpu.c b/drivers/base/cpu.c index d2136ab9b14a..7e98b11b15a1 100644 --- a/drivers/base/cpu.c +++ b/drivers/base/cpu.c @@ -607,11 +607,225 @@ static void __init cpu_register_vulnerabilities(void) static inline void cpu_register_vulnerabilities(void) { } #endif +static ssize_t cpulist_show(struct device *device, + struct device_attribute *attr, + char *buf) +{ + struct cpumask *mask = dev_get_drvdata(device); + + if (!mask) + return -EINVAL; + + return cpumap_print_to_pagebuf(true, buf, mask); +} + +static ssize_t cpumap_show(struct device *device, + struct device_attribute *attr, + char *buf) +{ + struct cpumask *mask = dev_get_drvdata(device); + + if (!mask) + return -EINVAL; + + return cpumap_print_to_pagebuf(false, buf, mask); +} + +static DEVICE_ATTR_RO(cpumap); +static DEVICE_ATTR_RO(cpulist); + +static struct attribute *cpu_type_attrs[] = { + _attr
[PATCH 2/4] x86/cpu: Describe hybrid CPUs in cpuinfo_x86
When Linux runs on Intel hybrid parts (i.e., having more than one type of CPU in the same package), subsystems that deal with specific CPU features may need to know the type of CPU in which they run. Instead of having each subsystem inspect CPUID leaves on its own, add a new member to cpuinfo_x86 that can be queried to know the type of CPU. Also, hybrid parts have a native model ID to uniquely identify the micro-architecture of each CPU. Please note that the native model ID is not related to the existing x86_model_id read from CPUID leaf 0x1. In order to uniquely identify a CPU by type and micro-architecture, combine the aforementioned identifiers into a single new member, x86_cpu_type. The Intel manual (SDM) defines the CPU type and the CPU native model ID as 8-bit and 24-bit identifiers, respectively; they are packed in %eax when read from CPUID. Also define masks that subsystems can use to obtain the CPU type or the native model separately. The native model ID requires only a bit mask as it uses the 24 least significant bits of %eax. The CPU type identifier requires only a shift value as it uses the 8 most significant bits of %eax. Cc: Andi Kleen Cc: Andy Lutomirski Cc: Dave Hansen Cc: Kan Liang Cc: Len Brown Cc: "Peter Zijlstra (Intel)" Cc: "Rafael J. Wysocki" Cc: "Ravi V. Shankar" Cc: Sean Christopherson Cc: Srinivas Pandruvada Cc: linux-kernel@vger.kernel.org Reviewed-by: Tony Luck Signed-off-by: Ricardo Neri --- I pulled this patch from a separate series[1] as per request from Boris. Changes with respect to that series: * Use cpuid_eax() instead of cpuid_count() to read %eax result from CPUID. (Boris) [1]. 
https://lkml.org/lkml/2020/10/2/1013 --- arch/x86/include/asm/processor.h | 13 + arch/x86/kernel/cpu/common.c | 3 +++ 2 files changed, 16 insertions(+) diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h index f88c74d7dbd4..d86cdf2b1562 100644 --- a/arch/x86/include/asm/processor.h +++ b/arch/x86/include/asm/processor.h @@ -141,6 +141,16 @@ struct cpuinfo_x86 { u32 microcode; /* Address space bits used by the cache internally */ u8 x86_cache_bits; + /* +* In hybrid parts, there is a CPU type and a native model ID. The +* CPU type (x86_cpu_type[31:24]) describes the type of micro- +* architecture families. The native model ID (x86_cpu_type[23:0]) +* describes a specific microarchitecture version. Combining both +* allows to uniquely identify a CPU. +* +* Please note that the native model ID is not related to x86_model. +*/ + u32 x86_cpu_type; unsigned initialized : 1; } __randomize_layout; @@ -168,6 +178,9 @@ enum cpuid_regs_idx { #define X86_VENDOR_UNKNOWN 0xff +#define X86_HYBRID_CPU_TYPE_ID_SHIFT 24 +#define X86_HYBRID_CPU_NATIVE_MODEL_ID_MASK 0xffffff + /* * capabilities of CPUs */ diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c index 35ad8480c464..a66c1fdc0e27 100644 --- a/arch/x86/kernel/cpu/common.c +++ b/arch/x86/kernel/cpu/common.c @@ -932,6 +932,9 @@ void get_cpu_cap(struct cpuinfo_x86 *c) c->x86_capability[CPUID_D_1_EAX] = eax; } + if (cpu_has(c, X86_FEATURE_HYBRID_CPU)) + c->x86_cpu_type = cpuid_eax(0x0000001a); + /* AMD-defined flags: level 0x80000001 */ eax = cpuid_eax(0x80000000); c->extended_cpuid_level = eax; -- 2.17.1
[PATCH 3/4] x86/cpu/intel: Add function to get name of hybrid CPU types
Provide a human-friendly string name for each type of CPU micro-architecture in Intel hybrid parts. This string is to be used in the CPU type sysfs interface. In order to uniquely identify CPUs of the same type, compose the name string as <vendor>_<type>_<native model ID>. Cc: Andi Kleen Cc: Dave Hansen Cc: Kan Liang Cc: Len Brown Cc: "Rafael J. Wysocki" Cc: "Ravi V. Shankar" Cc: Srinivas Pandruvada Cc: linux-kernel@vger.kernel.org Reviewed-by: Tony Luck Signed-off-by: Ricardo Neri --- arch/x86/include/asm/intel-family.h | 4 arch/x86/kernel/cpu/cpu.h | 3 +++ arch/x86/kernel/cpu/intel.c | 23 +++ 3 files changed, 30 insertions(+) diff --git a/arch/x86/include/asm/intel-family.h b/arch/x86/include/asm/intel-family.h index 5e658ba2654a..4ec2272e0049 100644 --- a/arch/x86/include/asm/intel-family.h +++ b/arch/x86/include/asm/intel-family.h @@ -133,4 +133,8 @@ /* Family 5 */ #define INTEL_FAM5_QUARK_X1000 0x09 /* Quark X1000 SoC */ +/* Types of CPUs in hybrid parts. */ +#define INTEL_HYBRID_TYPE_ATOM 0x20 +#define INTEL_HYBRID_TYPE_CORE 0x40 + #endif /* _ASM_X86_INTEL_FAMILY_H */ diff --git a/arch/x86/kernel/cpu/cpu.h b/arch/x86/kernel/cpu/cpu.h index 9d033693519a..b4474238e1f3 100644 --- a/arch/x86/kernel/cpu/cpu.h +++ b/arch/x86/kernel/cpu/cpu.h @@ -56,8 +56,11 @@ extern __ro_after_init enum tsx_ctrl_states tsx_ctrl_state; extern void __init tsx_init(void); extern void tsx_enable(void); extern void tsx_disable(void); +extern const char *intel_get_hybrid_cpu_type_name(u32 cpu_type); #else static inline void tsx_init(void) { } +static inline const char *intel_get_hybrid_cpu_type_name(u32 cpu_type) +{ return NULL; } #endif /* CONFIG_CPU_SUP_INTEL */ extern void get_cpu_cap(struct cpuinfo_x86 *c); diff --git a/arch/x86/kernel/cpu/intel.c b/arch/x86/kernel/cpu/intel.c index 59a1e3ce3f14..e1dee382cf98 100644 --- a/arch/x86/kernel/cpu/intel.c +++ b/arch/x86/kernel/cpu/intel.c @@ -1191,3 +1191,26 @@ void __init cpu_set_core_cap_bits(struct cpuinfo_x86 *c) cpu_model_supports_sld = true; split_lock_setup(); 
} + +static char hybrid_name[64]; + +const char *intel_get_hybrid_cpu_type_name(u32 cpu_type) +{ + u32 native_model_id = cpu_type & X86_HYBRID_CPU_NATIVE_MODEL_ID_MASK; + u8 type = cpu_type >> X86_HYBRID_CPU_TYPE_ID_SHIFT; + + switch (type) { + case INTEL_HYBRID_TYPE_ATOM: + snprintf(hybrid_name, sizeof(hybrid_name), "intel_atom_%u", +native_model_id); + break; + case INTEL_HYBRID_TYPE_CORE: + snprintf(hybrid_name, sizeof(hybrid_name), "intel_core_%u", +native_model_id); + break; + default: + return NULL; + } + + return hybrid_name; +} -- 2.17.1
[PATCH 0/4] drivers core: Introduce CPU type sysfs interface
Hybrid CPU topologies combine processors with more than one type of micro-architecture. Hence, it is possible that individual CPUs support slightly different features (e.g., performance counters) and have different performance properties. Thus, there may be user space entities interested in knowing the topology of the system based on the types of available CPUs. Currently, there exists an interface for the CPU capacity (/sys/devices/system/cpu/cpuX/cpu_capacity). However, CPU capacity does not always map to CPU types (by the way, I will submit a separate series to bring such an interface to x86). This series proposes the new interface /sys/devices/system/cpu/types which, in hybrid parts, creates a subdirectory for each type of CPU. Each subdirectory contains a CPU list and a CPU map that user space can query. Patch 1 of the series proposes the generic interface, with hooks that architectures can override to suit their needs. The remaining three patches implement such an interface for x86 (as per request from Boris, I pulled patch 2 from a separate submission [1]). Thanks and BR, Ricardo [1]. https://lkml.org/lkml/2020/10/2/1013 Ricardo Neri (4): drivers core: Introduce CPU type sysfs interface x86/cpu: Describe hybrid CPUs in cpuinfo_x86 x86/cpu/intel: Add function to get name of hybrid CPU types x86/cpu/topology: Implement the CPU type sysfs interface .../ABI/testing/sysfs-devices-system-cpu | 13 ++ arch/x86/include/asm/intel-family.h | 4 + arch/x86/include/asm/processor.h | 13 ++ arch/x86/include/asm/topology.h | 2 + arch/x86/kernel/cpu/common.c | 3 + arch/x86/kernel/cpu/cpu.h | 3 + arch/x86/kernel/cpu/intel.c | 23 ++ arch/x86/kernel/cpu/topology.c| 23 ++ drivers/base/cpu.c| 214 ++ include/linux/cpu.h | 12 + include/linux/cpuhotplug.h| 1 + 11 files changed, 311 insertions(+) -- 2.17.1
Re: [PATCH 2/3] x86/cpu: Describe hybrid CPUs in cpuinfo_x86
On Fri, Oct 02, 2020 at 11:03:06PM +0200, Borislav Petkov wrote: > On Fri, Oct 02, 2020 at 02:02:31PM -0700, Ricardo Neri wrote: > > What about patches 1 and 3? Should I resubmit the series with only > > those? > > Why would you need to resubmit? They're good to go as is, AFAICT. Thanks for clarifying Boris. Just wanted to check if there was any action required from me regarding patches 1 & 3. Thanks and BR, Ricardo
Re: [PATCH 2/3] x86/cpu: Describe hybrid CPUs in cpuinfo_x86
On Fri, Oct 02, 2020 at 10:34:52PM +0200, Borislav Petkov wrote: > On Fri, Oct 02, 2020 at 01:19:30PM -0700, Ricardo Neri wrote: > > diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c > > index 35ad8480c464..0778b3ad26b3 100644 > > --- a/arch/x86/kernel/cpu/common.c > > +++ b/arch/x86/kernel/cpu/common.c > > @@ -932,6 +932,11 @@ void get_cpu_cap(struct cpuinfo_x86 *c) > > c->x86_capability[CPUID_D_1_EAX] = eax; > > } > > > > + if (cpu_has(c, X86_FEATURE_HYBRID_CPU)) { > > + cpuid_count(0x0000001a, 0, &eax, &ebx, &ecx, &edx); > > + c->x86_cpu_type = eax; > > c->x86_cpu_type = cpuid_eax(0x0000001a); > > should do too. Thank you for the quick feedback Boris! Sure I can implement it as you suggest. > > But send this patch together with the code that uses it. Sure I can submit it along with the code using it. What about patches 1 and 3? Should I resubmit the series with only those? Thanks and BR, Ricardo
[PATCH 2/3] x86/cpu: Describe hybrid CPUs in cpuinfo_x86
When Linux runs on Intel hybrid parts (i.e., having more than one type of CPU in the same package), subsystems that deal with specific CPU features may need to know the type of CPU in which they run. Instead of having each subsystem inspect CPUID leaves on its own, add a new member to cpuinfo_x86 that can be queried to know the type of CPU. Also, hybrid parts have a native model ID to uniquely identify the micro-architecture of each CPU. Please note that the native model ID is not related to the existing x86_model_id read from CPUID leaf 0x1. In order to uniquely identify a CPU by type and micro-architecture, combine the aforementioned identifiers into a single new member, x86_cpu_type. The Intel manual (SDM) defines the CPU type and the CPU native model ID as 8-bit and 24-bit identifiers, respectively; they are packed in %eax when read from CPUID. Also define masks that subsystems can use to obtain the CPU type or the native model separately. The native model ID requires only a bit mask as it uses the 24 least significant bits of %eax. The CPU type identifier requires only a shift value as it uses the 8 most significant bits of %eax. Cc: Andi Kleen Cc: Andy Lutomirski Cc: Dave Hansen Cc: Kan Liang Cc: Len Brown Cc: "Peter Zijlstra (Intel)" Cc: "Rafael J. Wysocki" Cc: "Ravi V. Shankar" Cc: Sean Christopherson Cc: Srinivas Pandruvada Cc: linux-kernel@vger.kernel.org Reviewed-by: Tony Luck Signed-off-by: Ricardo Neri --- arch/x86/include/asm/processor.h | 13 + arch/x86/kernel/cpu/common.c | 5 + 2 files changed, 18 insertions(+) diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h index f88c74d7dbd4..d86cdf2b1562 100644 --- a/arch/x86/include/asm/processor.h +++ b/arch/x86/include/asm/processor.h @@ -141,6 +141,16 @@ struct cpuinfo_x86 { u32 microcode; /* Address space bits used by the cache internally */ u8 x86_cache_bits; + /* +* In hybrid parts, there is a CPU type and a native model ID. 
The +* CPU type (x86_cpu_type[31:24]) describes the type of micro- +* architecture families. The native model ID (x86_cpu_type[23:0]) +* describes a specific microarchitecture version. Combining both +* allows to uniquely identify a CPU. +* +* Please note that the native model ID is not related to x86_model. +*/ + u32 x86_cpu_type; unsigned initialized : 1; } __randomize_layout; @@ -168,6 +178,9 @@ enum cpuid_regs_idx { #define X86_VENDOR_UNKNOWN 0xff +#define X86_HYBRID_CPU_TYPE_ID_SHIFT 24 +#define X86_HYBRID_CPU_NATIVE_MODEL_ID_MASK 0xffffff + /* * capabilities of CPUs */ diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c index 35ad8480c464..0778b3ad26b3 100644 --- a/arch/x86/kernel/cpu/common.c +++ b/arch/x86/kernel/cpu/common.c @@ -932,6 +932,11 @@ void get_cpu_cap(struct cpuinfo_x86 *c) c->x86_capability[CPUID_D_1_EAX] = eax; } + if (cpu_has(c, X86_FEATURE_HYBRID_CPU)) { + cpuid_count(0x0000001a, 0, &eax, &ebx, &ecx, &edx); + c->x86_cpu_type = eax; + } + /* AMD-defined flags: level 0x80000001 */ eax = cpuid_eax(0x80000000); c->extended_cpuid_level = eax; -- 2.17.1
[PATCH 3/3] x86/mce: include type of core when reporting a machine check error
In hybrid parts, each type of core reports different types of machine check errors as the machine check error blocks are tied to different parts of the hardware. Furthermore, errors may be different across micro-architecture versions. Thus, in order to decode errors, userspace tools need to know the type of core as well as the native model ID of the CPU which reported the error. This extra information is included in the error report only when running on hybrid parts. This preserves the existing behavior when running on non-hybrid parts. Hence, legacy userspace tools running on new kernels and hybrid hardware can still understand the format of the reported error. Cc: "Ravi V Shankar" Cc: linux-e...@vger.kernel.org Cc: linux-kernel@vger.kernel.org Reviewed-by: Tony Luck Signed-off-by: Ricardo Neri --- arch/x86/include/uapi/asm/mce.h | 1 + arch/x86/kernel/cpu/mce/core.c | 7 +++ 2 files changed, 8 insertions(+) diff --git a/arch/x86/include/uapi/asm/mce.h b/arch/x86/include/uapi/asm/mce.h index db9adc081c5a..e730572186d6 100644 --- a/arch/x86/include/uapi/asm/mce.h +++ b/arch/x86/include/uapi/asm/mce.h @@ -36,6 +36,7 @@ struct mce { __u64 ppin; /* Protected Processor Inventory Number */ __u32 microcode;/* Microcode revision */ __u64 kflags; /* Internal kernel use */ + __u32 hybrid_info; /* Type and native model ID in hybrid parts */ }; #define MCE_GET_RECORD_LEN _IOR('M', 1, int) diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c index a6ff407dec71..ecac8d9b6070 100644 --- a/arch/x86/kernel/cpu/mce/core.c +++ b/arch/x86/kernel/cpu/mce/core.c @@ -143,6 +143,9 @@ noinstr void mce_setup(struct mce *m) m->apicid = cpu_data(m->extcpu).initial_apicid; m->mcgcap = __rdmsr(MSR_IA32_MCG_CAP); + if (this_cpu_has(X86_FEATURE_HYBRID_CPU)) + m->hybrid_info = cpuid_eax(0x1a); + if (this_cpu_has(X86_FEATURE_INTEL_PPIN)) m->ppin = __rdmsr(MSR_PPIN); else if (this_cpu_has(X86_FEATURE_AMD_PPIN)) @@ -264,6 +267,10 @@ static void __print_mce(struct 
mce *m) pr_emerg(HW_ERR "PROCESSOR %u:%x TIME %llu SOCKET %u APIC %x microcode %x\n", m->cpuvendor, m->cpuid, m->time, m->socketid, m->apicid, m->microcode); + + if (this_cpu_has(X86_FEATURE_HYBRID_CPU)) + pr_emerg(HW_ERR "HYBRID_TYPE %x HYBRID_NATIVE_MODEL_ID %x\n", +m->hybrid_info >> 24, m->hybrid_info & 0xffffff); } static void print_mce(struct mce *m) -- 2.17.1
[PATCH 1/3] x86/cpufeatures: Enumerate hybrid CPU feature bit
Add feature enumeration to identify a hybrid part: one in which CPUs with more than one type of micro-architecture exists in the same package. Cc: Andi Kleen Cc: Kan Liang Cc: Len Brown Cc: "Peter Zijlstra (Intel)" Cc: "Rafael J. Wysocki" Cc: "Ravi V. Shankar" Cc: Srinivas Pandruvada Cc: linux-kernel@vger.kernel.org Reviewed-by: Tony Luck Signed-off-by: Ricardo Neri --- arch/x86/include/asm/cpufeatures.h | 1 + 1 file changed, 1 insertion(+) diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h index dad350d42ecf..26ecc0f2a6fd 100644 --- a/arch/x86/include/asm/cpufeatures.h +++ b/arch/x86/include/asm/cpufeatures.h @@ -371,6 +371,7 @@ #define X86_FEATURE_MD_CLEAR (18*32+10) /* VERW clears CPU buffers */ #define X86_FEATURE_TSX_FORCE_ABORT(18*32+13) /* "" TSX_FORCE_ABORT */ #define X86_FEATURE_SERIALIZE (18*32+14) /* SERIALIZE instruction */ +#define X86_FEATURE_HYBRID_CPU (18*32+15) /* This part has CPUs of more than one type */ #define X86_FEATURE_TSXLDTRK (18*32+16) /* TSX Suspend Load Address Tracking */ #define X86_FEATURE_PCONFIG(18*32+18) /* Intel PCONFIG */ #define X86_FEATURE_ARCH_LBR (18*32+19) /* Intel ARCH LBR */ -- 2.17.1
[PATCH 0/3] x86: Add initial support to discover Intel hybrid CPUs
Add support to discover and enumerate CPUs in Intel hybrid parts. A hybrid part has CPUs with more than one type of micro-architecture. Thus, certain features may only be present in a specific CPU type. It is useful to know the type of CPUs present in a system. For instance, perf may need to handle CPUs differently depending on the type of micro-architecture. Decoding machine check error logs may need the additional micro-architecture type information, so include that in the log. A hybrid part can be identified by reading a new CPUID feature bit. Likewise, CPUID contains information about the CPU type as well as a new native model ID. Details can be found in the Intel manual (SDM, [1]). This series adds support for Intel hybrid parts in two areas: a) adding the hybrid feature bit as well as a new member to struct cpuinfo_x86; and b) decoding machine check errors on hybrid parts. A later submission will use the proposed functionality to expose the CPU topology to user space. Thanks and BR, Ricardo [1]. https://software.intel.com/content/dam/develop/public/us/en/documents/325462-sdm-vol-1-2abcd-3abcd.pdf Vol 2, Section 3.2, CPUID leaf 0x1a Ricardo Neri (3): x86/cpufeatures: Enumerate hybrid CPU feature bit x86/cpu: Describe hybrid CPUs in cpuinfo_x86 x86/mce: include type of core when reporting a machine check error arch/x86/include/asm/cpufeatures.h | 1 + arch/x86/include/asm/processor.h | 13 + arch/x86/include/uapi/asm/mce.h| 1 + arch/x86/kernel/cpu/common.c | 5 + arch/x86/kernel/cpu/mce/core.c | 7 +++ 5 files changed, 27 insertions(+) -- 2.17.1
[PATCH] powercap/rapl: Add support for Lakefield
Simply add Lakefield model ID. No additional changes are needed. Cc: Zhang Rui Cc: "Rafael J. Wysocki" Cc: "Ravi V. Shankar" Signed-off-by: Ricardo Neri --- drivers/powercap/intel_rapl_common.c | 1 + 1 file changed, 1 insertion(+) diff --git a/drivers/powercap/intel_rapl_common.c b/drivers/powercap/intel_rapl_common.c index 6f55aaef8afc..d85a3d95ef20 100644 --- a/drivers/powercap/intel_rapl_common.c +++ b/drivers/powercap/intel_rapl_common.c @@ -1036,6 +1036,7 @@ static const struct x86_cpu_id rapl_ids[] __initconst = { X86_MATCH_INTEL_FAM6_MODEL(COMETLAKE, &rapl_defaults_core), X86_MATCH_INTEL_FAM6_MODEL(TIGERLAKE_L, &rapl_defaults_core), X86_MATCH_INTEL_FAM6_MODEL(SAPPHIRERAPIDS_X, &rapl_defaults_spr_server), + X86_MATCH_INTEL_FAM6_MODEL(LAKEFIELD, &rapl_defaults_core), X86_MATCH_INTEL_FAM6_MODEL(ATOM_SILVERMONT, &rapl_defaults_byt), X86_MATCH_INTEL_FAM6_MODEL(ATOM_AIRMONT, &rapl_defaults_cht), -- 2.17.1
Re: [PATCH] x86/cpu: Fix typos and improve the comments in sync_core()
On Tue, Aug 18, 2020 at 07:31:30AM +0200, Ingo Molnar wrote: > > * tip-bot2 for Ricardo Neri wrote: > > > --- a/arch/x86/include/asm/sync_core.h > > +++ b/arch/x86/include/asm/sync_core.h > > @@ -5,6 +5,7 @@ > > #include > > #include > > #include > > +#include > > > > #ifdef CONFIG_X86_32 > > static inline void iret_to_self(void) > > @@ -54,14 +55,23 @@ static inline void iret_to_self(void) > > static inline void sync_core(void) > > { > > /* > > +* The SERIALIZE instruction is the most straightforward way to > > +* do this but it not universally available. > > +*/ > > + if (static_cpu_has(X86_FEATURE_SERIALIZE)) { > > + serialize(); > > + return; > > + } > > + > > + /* > > +* For all other processors, there are quite a few ways to do this. > > +* IRET-to-self is nice because it works on every CPU, at any CPL > > +* (so it's compatible with paravirtualization), and it never exits > > +* to a hypervisor. The only down sides are that it's a bit slow > > +* (it seems to be a bit more than 2x slower than the fastest > > +* options) and that it unmasks NMIs. The "push %cs" is needed > > +* because, in paravirtual environments, __KERNEL_CS may not be a > > +* valid CS value when we do IRET directly. > > So there's two typos in the new comments, there are at least two > misapplied commas, it departs from existing style, and there's a typo > in the existing comments as well. > > Also, before this patch the 'compiler barrier' comment was valid for > the whole function (there was no branching), but after this patch it > reads of it was only valid for the legacy IRET-to-self branch. > > Which together broke my detector and triggered a bit of compulsive > bike-shed painting. ;-) See the resulting patch below. > > Thanks, > > Ingo > > > > From: Ingo Molnar > Date: Tue, 18 Aug 2020 07:24:05 +0200 > Subject: [PATCH] x86/cpu: Fix typos and improve the comments in sync_core() > > - Fix typos. 
> > - Move the compiler barrier comment to the top, because it's valid for the > whole function, not just the legacy branch. > > Signed-off-by: Ingo Molnar > --- > arch/x86/include/asm/sync_core.h | 16 > 1 file changed, 8 insertions(+), 8 deletions(-) > > diff --git a/arch/x86/include/asm/sync_core.h > b/arch/x86/include/asm/sync_core.h > index 4631c0f969d4..0fd4a9dfb29c 100644 > --- a/arch/x86/include/asm/sync_core.h > +++ b/arch/x86/include/asm/sync_core.h > @@ -47,16 +47,19 @@ static inline void iret_to_self(void) > * > * b) Text was modified on a different CPU, may subsequently be > * executed on this CPU, and you want to make sure the new version > - * gets executed. This generally means you're calling this in a IPI. > + * gets executed. This generally means you're calling this in an IPI. > * > * If you're calling this for a different reason, you're probably doing > * it wrong. > + * > + * Like all of Linux's memory ordering operations, this is a > + * compiler barrier as well. > */ > static inline void sync_core(void) > { > /* >* The SERIALIZE instruction is the most straightforward way to > - * do this but it not universally available. > + * do this, but it is not universally available. Indeed, I missed this grammar error. >*/ > if (static_cpu_has(X86_FEATURE_SERIALIZE)) { > serialize(); > @@ -67,10 +70,10 @@ static inline void sync_core(void) >* For all other processors, there are quite a few ways to do this. >* IRET-to-self is nice because it works on every CPU, at any CPL >* (so it's compatible with paravirtualization), and it never exits > - * to a hypervisor. The only down sides are that it's a bit slow > + * to a hypervisor. The only downsides are that it's a bit slow >* (it seems to be a bit more than 2x slower than the fastest > - * options) and that it unmasks NMIs. The "push %cs" is needed > - * because, in paravirtual environments, __KERNEL_CS may not be a > + * options) and that it unmasks NMIs. 
The "push %cs" is needed, > + * because in paravirtual environments __KERNEL_CS may not be a I didn't realize that the double spaces after the period were part of the style. FWIW, Reviewed-by: Ricardo Neri
[tip: x86/cpu] x86/cpu: Use SERIALIZE in sync_core() when available
The following commit has been merged into the x86/cpu branch of tip: Commit-ID: bf9c912f9a649776c2d741310486a6984edaac72 Gitweb: https://git.kernel.org/tip/bf9c912f9a649776c2d741310486a6984edaac72 Author:Ricardo Neri AuthorDate:Thu, 06 Aug 2020 20:28:33 -07:00 Committer: Borislav Petkov CommitterDate: Mon, 17 Aug 2020 17:23:04 +02:00 x86/cpu: Use SERIALIZE in sync_core() when available The SERIALIZE instruction gives software a way to force the processor to complete all modifications to flags, registers and memory from previous instructions and drain all buffered writes to memory before the next instruction is fetched and executed. Thus, it serves the purpose of sync_core(). Use it when available. Suggested-by: Andy Lutomirski Signed-off-by: Ricardo Neri Signed-off-by: Borislav Petkov Reviewed-by: Tony Luck Link: https://lkml.kernel.org/r/20200807032833.17484-1-ricardo.neri-calde...@linux.intel.com --- arch/x86/include/asm/special_insns.h | 6 ++ arch/x86/include/asm/sync_core.h | 26 ++ 2 files changed, 24 insertions(+), 8 deletions(-) diff --git a/arch/x86/include/asm/special_insns.h b/arch/x86/include/asm/special_insns.h index 59a3e13..5999b0b 100644 --- a/arch/x86/include/asm/special_insns.h +++ b/arch/x86/include/asm/special_insns.h @@ -234,6 +234,12 @@ static inline void clwb(volatile void *__p) #define nop() asm volatile ("nop") +static inline void serialize(void) +{ + /* Instruction opcode for SERIALIZE; supported in binutils >= 2.35. 
*/ + asm volatile(".byte 0xf, 0x1, 0xe8" ::: "memory"); +} + #endif /* __KERNEL__ */ #endif /* _ASM_X86_SPECIAL_INSNS_H */ diff --git a/arch/x86/include/asm/sync_core.h b/arch/x86/include/asm/sync_core.h index fdb5b35..4631c0f 100644 --- a/arch/x86/include/asm/sync_core.h +++ b/arch/x86/include/asm/sync_core.h @@ -5,6 +5,7 @@ #include #include #include +#include #ifdef CONFIG_X86_32 static inline void iret_to_self(void) @@ -54,14 +55,23 @@ static inline void iret_to_self(void) static inline void sync_core(void) { /* -* There are quite a few ways to do this. IRET-to-self is nice -* because it works on every CPU, at any CPL (so it's compatible -* with paravirtualization), and it never exits to a hypervisor. -* The only down sides are that it's a bit slow (it seems to be -* a bit more than 2x slower than the fastest options) and that -* it unmasks NMIs. The "push %cs" is needed because, in -* paravirtual environments, __KERNEL_CS may not be a valid CS -* value when we do IRET directly. +* The SERIALIZE instruction is the most straightforward way to +* do this but it not universally available. +*/ + if (static_cpu_has(X86_FEATURE_SERIALIZE)) { + serialize(); + return; + } + + /* +* For all other processors, there are quite a few ways to do this. +* IRET-to-self is nice because it works on every CPU, at any CPL +* (so it's compatible with paravirtualization), and it never exits +* to a hypervisor. The only down sides are that it's a bit slow +* (it seems to be a bit more than 2x slower than the fastest +* options) and that it unmasks NMIs. The "push %cs" is needed +* because, in paravirtual environments, __KERNEL_CS may not be a +* valid CS value when we do IRET directly. * * In case NMI unmasking or performance ever becomes a problem, * the next best option appears to be MOV-to-CR2 and an
[PATCH v4] x86/cpu: Use SERIALIZE in sync_core() when available
The SERIALIZE instruction gives software a way to force the processor to complete all modifications to flags, registers and memory from previous instructions and drain all buffered writes to memory before the next instruction is fetched and executed. Thus, it serves the purpose of sync_core(). Use it when available. Cc: Andy Lutomirski Cc: Cathy Zhang Cc: Dave Hansen Cc: Fenghua Yu Cc: "H. Peter Anvin" Cc: Kyung Min Park Cc: Peter Zijlstra Cc: "Ravi V. Shankar" Cc: Sean Christopherson Cc: linux-e...@vger.kernel.org Cc: linux-kernel@vger.kernel.org Reviewed-by: Tony Luck Suggested-by: Andy Lutomirski Signed-off-by: Ricardo Neri --- This is v4 from my three previous submission [1], [2], and [3]. The first three patches of the series have been merged in Linus' tree. Hence, I am submitting only this patch for review. [1]. https://lkml.org/lkml/2020/7/27/8 [2]. https://lkml.org/lkml/2020/8/4/1090 [3]. https://lkml.org/lkml/2020/8/6/808 Changes since v3: * Reworked comments in sync_core() for better readability. (Dave Hansen) * Reworked the implementation to align with the style in special_insns.h. No functional changes were introduced. (Tony Luck) Changes since v2: * Support serialize with static_cpu_has() instead of using alternative runtime patching directly. (Borislav Petkov) Changes since v1: * Support SERIALIZE using alternative runtime patching. (Peter Zijlstra, H. Peter Anvin) * Added a note to specify which version of binutils supports SERIALIZE. (Peter Zijlstra) * Verified that (::: "memory") is used. (H. 
Peter Anvin) --- arch/x86/include/asm/special_insns.h | 6 ++ arch/x86/include/asm/sync_core.h | 26 ++ 2 files changed, 24 insertions(+), 8 deletions(-) diff --git a/arch/x86/include/asm/special_insns.h b/arch/x86/include/asm/special_insns.h index 59a3e13204c3..5999b0b3dd4a 100644 --- a/arch/x86/include/asm/special_insns.h +++ b/arch/x86/include/asm/special_insns.h @@ -234,6 +234,12 @@ static inline void clwb(volatile void *__p) #define nop() asm volatile ("nop") +static inline void serialize(void) +{ + /* Instruction opcode for SERIALIZE; supported in binutils >= 2.35. */ + asm volatile(".byte 0xf, 0x1, 0xe8" ::: "memory"); +} + #endif /* __KERNEL__ */ #endif /* _ASM_X86_SPECIAL_INSNS_H */ diff --git a/arch/x86/include/asm/sync_core.h b/arch/x86/include/asm/sync_core.h index fdb5b356e59b..089712777fd9 100644 --- a/arch/x86/include/asm/sync_core.h +++ b/arch/x86/include/asm/sync_core.h @@ -5,6 +5,7 @@ #include #include #include +#include #ifdef CONFIG_X86_32 static inline void iret_to_self(void) @@ -54,14 +55,23 @@ static inline void iret_to_self(void) static inline void sync_core(void) { /* -* There are quite a few ways to do this. IRET-to-self is nice -* because it works on every CPU, at any CPL (so it's compatible -* with paravirtualization), and it never exits to a hypervisor. -* The only down sides are that it's a bit slow (it seems to be -* a bit more than 2x slower than the fastest options) and that -* it unmasks NMIs. The "push %cs" is needed because, in -* paravirtual environments, __KERNEL_CS may not be a valid CS -* value when we do IRET directly. +* The SERIALIZE instruction is the most straightforward way to +* do this but it not universally available. 
+*/ + if (static_cpu_has(X86_FEATURE_SERIALIZE)) { + serialize(); + return; + } + + /* +* For all other processors, there are quite a few ways to do this. +* IRET-to-self is nice because it works on every CPU, at any CPL +* (so it's compatible with paravirtualization), and it never exits +* to a hypervisor. The only down sides are that it's a bit slow +* (it seems to be a bit more than 2x slower than the fastest +* options) and that it unmasks NMIs. The "push %cs" is needed +* because, in paravirtual environments, __KERNEL_CS may not be a +* valid CS value when we do IRET directly. * * In case NMI unmasking or performance ever becomes a problem, * the next best option appears to be MOV-to-CR2 and an -- 2.17.1
Re: [PATCH v3] x86/cpu: Use SERIALIZE in sync_core() when available
On Thu, Aug 06, 2020 at 04:08:47PM -0700, Dave Hansen wrote: > On 8/6/20 4:04 PM, Ricardo Neri wrote: > > * CPUID is the conventional way, but it's nasty: it doesn't > > * exist on some 486-like CPUs, and it usually exits to a > > * hypervisor. > > * > > * The SERIALIZE instruction is the most straightforward way to > > * do this as it does not clobber registers or exit to a > > * hypervisor. However, it is not universally available. > > * > > * Like all of Linux's memory ordering operations, this is a > > * compiler barrier as well. > > */ > > > > What do you think? > > I like what I suggested. :) > > SERIALIZE is best where available. Do it first, comment it by itself. > > Then, go into the long discussion of the other alternatives. They only > make sense when SERIALIZE isn't there, and the logic for selection there > is substantially more complicated. Sure Dave, I think this layout makes sense. I will rework the comments. Thanks and BR, Ricardo
Re: [PATCH v3] x86/cpu: Use SERIALIZE in sync_core() when available
On Thu, Aug 06, 2020 at 12:57:26PM -0700, Dave Hansen wrote: > On 8/6/20 12:25 PM, Ricardo Neri wrote: > > static inline void sync_core(void) > > { > > /* > > -* There are quite a few ways to do this. IRET-to-self is nice > > +* Hardware can do this for us if SERIALIZE is available. Otherwise, > > +* there are quite a few ways to do this. IRET-to-self is nice > > This seems to imply that things other than SERIALIZE aren't the hardware > doing this. All of these methods are equally architecturally > serializing *and* provided by the hardware. Indeed, I can see how the wording may imply that. > > I also don't quite get the point of separating the comments from the > code. Shouldn't this be: > > /* >* The SERIALIZE instruction is the most straightforward way to >* do this but it not universally available. >*/ I regard the comment as describing all the considered options to for architectural serialization. What about if I add SERIALIZE as another option? I propose to discuss it towards the end of the comment: /* * There are quite a few ways to do this. IRET-to-self is nice * because it works on every CPU, at any CPL (so it's compatible * with paravirtualization), and it never exits to a hypervisor. * The only down sides are that it's a bit slow (it seems to be * a bit more than 2x slower than the fastest options) and that * it unmasks NMIs. The "push %cs" is needed because, in * paravirtual environments, __KERNEL_CS may not be a valid CS * value when we do IRET directly. * * In case NMI unmasking or performance ever becomes a problem, * the next best option appears to be MOV-to-CR2 and an * unconditional jump. That sequence also works on all CPUs, * but it will fault at CPL3 (i.e. Xen PV). * * CPUID is the conventional way, but it's nasty: it doesn't * exist on some 486-like CPUs, and it usually exits to a * hypervisor. * * The SERIALIZE instruction is the most straightforward way to * do this as it does not clobber registers or exit to a * hypervisor. 
However, it is not universally available. * * Like all of Linux's memory ordering operations, this is a * compiler barrier as well. */ What do you think? > if (static_cpu_has(X86_FEATURE_SERIALIZE)) { > asm volatile(__ASM_SERIALIZE ::: "memory"); > return; > } > > /* >* For all other processors, IRET-to-self is nice ... >*/ > iret_to_self();
[PATCH v3] x86/cpu: Use SERIALIZE in sync_core() when available
The SERIALIZE instruction gives software a way to force the processor to complete all modifications to flags, registers and memory from previous instructions and drain all buffered writes to memory before the next instruction is fetched and executed. Thus, it serves the purpose of sync_core(). Use it when available. Cc: Andy Lutomirski Cc: Cathy Zhang Cc: Dave Hansen Cc: Fenghua Yu Cc: "H. Peter Anvin" Cc: Kyung Min Park Cc: Peter Zijlstra Cc: "Ravi V. Shankar" Cc: Sean Christopherson Cc: linux-e...@vger.kernel.org Cc: linux-kernel@vger.kernel.org Reviewed-by: Tony Luck Suggested-by: Andy Lutomirski Signed-off-by: Ricardo Neri --- This is a v3 from my two previous submissions [1], [2]. The first three patches of the series have been merged into Linus' tree. Hence, I am submitting only this patch for review. [1]. https://lkml.org/lkml/2020/7/27/8 [2]. https://lkml.org/lkml/2020/8/4/1090 Changes since v2: * Support serialize with static_cpu_has() instead of using alternative runtime patching directly. (Borislav Petkov) Changes since v1: * Support SERIALIZE using alternative runtime patching. (Peter Zijlstra, H. Peter Anvin) * Added a note to specify which version of binutils supports SERIALIZE. (Peter Zijlstra) * Verified that (::: "memory") is used. (H. Peter Anvin) --- arch/x86/include/asm/special_insns.h | 2 ++ arch/x86/include/asm/sync_core.h | 9 - 2 files changed, 10 insertions(+), 1 deletion(-) diff --git a/arch/x86/include/asm/special_insns.h b/arch/x86/include/asm/special_insns.h index 59a3e13204c3..25cd67801dda 100644 --- a/arch/x86/include/asm/special_insns.h +++ b/arch/x86/include/asm/special_insns.h @@ -10,6 +10,8 @@ #include #include +/* Instruction opcode for SERIALIZE; supported in binutils >= 2.35. */ +#define __ASM_SERIALIZE ".byte 0xf, 0x1, 0xe8" /* * Volatile isn't enough to prevent the compiler from reordering the * read/write functions for the control registers and messing everything up. 
diff --git a/arch/x86/include/asm/sync_core.h b/arch/x86/include/asm/sync_core.h index fdb5b356e59b..b74580413a0b 100644 --- a/arch/x86/include/asm/sync_core.h +++ b/arch/x86/include/asm/sync_core.h @@ -5,6 +5,7 @@ #include #include #include +#include #ifdef CONFIG_X86_32 static inline void iret_to_self(void) @@ -54,7 +55,8 @@ static inline void iret_to_self(void) static inline void sync_core(void) { /* -* There are quite a few ways to do this. IRET-to-self is nice +* Hardware can do this for us if SERIALIZE is available. Otherwise, +* there are quite a few ways to do this. IRET-to-self is nice * because it works on every CPU, at any CPL (so it's compatible * with paravirtualization), and it never exits to a hypervisor. * The only down sides are that it's a bit slow (it seems to be @@ -75,6 +77,11 @@ static inline void sync_core(void) * Like all of Linux's memory ordering operations, this is a * compiler barrier as well. */ + if (static_cpu_has(X86_FEATURE_SERIALIZE)) { + asm volatile(__ASM_SERIALIZE ::: "memory"); + return; + } + iret_to_self(); } -- 2.17.1
Re: [PATCH v2] x86/cpu: Use SERIALIZE in sync_core() when available
On Wed, Aug 05, 2020 at 06:48:40AM +0200, Borislav Petkov wrote: > On Tue, Aug 04, 2020 at 07:10:59PM -0700, Ricardo Neri wrote: > > The SERIALIZE instruction gives software a way to force the processor to > > complete all modifications to flags, registers and memory from previous > > instructions and drain all buffered writes to memory before the next > > instruction is fetched and executed. Thus, it serves the purpose of > > sync_core(). Use it when available. > > > > Commit 7117f16bf460 ("objtool: Fix ORC vs alternatives") enforced stack > > invariance in alternatives. The iret-to-self does not comply with such > > invariance. Thus, it cannot be used inside alternative code. Instead, use > > an alternative that jumps to SERIALIZE when available. > > > > Cc: Andy Lutomirski > > Cc: Cathy Zhang > > Cc: Dave Hansen > > Cc: Fenghua Yu > > Cc: "H. Peter Anvin" > > Cc: Kyung Min Park > > Cc: Peter Zijlstra > > Cc: "Ravi V. Shankar" > > Cc: Sean Christopherson > > Cc: linux-e...@vger.kernel.org > > Cc: linux-kernel@vger.kernel.org > > Suggested-by: Andy Lutomirski > > Signed-off-by: Ricardo Neri > > --- > > This is a v2 from my initial submission [1]. The first three patches of > > the series have been merged in Linus' tree. Hence, I am submitting only > > this patch for review. > > > > [1]. https://lkml.org/lkml/2020/7/27/8 > > > > Changes since v1: > > * Support SERIALIZE using alternative runtime patching. > >(Peter Zijlstra, H. Peter Anvin) > > * Added a note to specify which version of binutils supports SERIALIZE. > >(Peter Zijlstra) > > * Verified that (::: "memory") is used. (H. 
Peter Anvin) > > --- > > arch/x86/include/asm/special_insns.h | 2 ++ > > arch/x86/include/asm/sync_core.h | 10 +- > > 2 files changed, 11 insertions(+), 1 deletion(-) > > > > diff --git a/arch/x86/include/asm/special_insns.h > > b/arch/x86/include/asm/special_insns.h > > index 59a3e13204c3..25cd67801dda 100644 > > --- a/arch/x86/include/asm/special_insns.h > > +++ b/arch/x86/include/asm/special_insns.h > > @@ -10,6 +10,8 @@ > > #include > > #include > > > > +/* Instruction opcode for SERIALIZE; supported in binutils >= 2.35. */ > > +#define __ASM_SERIALIZE ".byte 0xf, 0x1, 0xe8" > > /* > > * Volatile isn't enough to prevent the compiler from reordering the > > * read/write functions for the control registers and messing everything > > up. > > diff --git a/arch/x86/include/asm/sync_core.h > > b/arch/x86/include/asm/sync_core.h > > index fdb5b356e59b..201ea3d9a6bd 100644 > > --- a/arch/x86/include/asm/sync_core.h > > +++ b/arch/x86/include/asm/sync_core.h > > @@ -5,15 +5,19 @@ > > #include > > #include > > #include > > +#include > > > > #ifdef CONFIG_X86_32 > > static inline void iret_to_self(void) > > { > > asm volatile ( > > + ALTERNATIVE("", "jmp 2f", X86_FEATURE_SERIALIZE) > > "pushfl\n\t" > > "pushl %%cs\n\t" > > "pushl $1f\n\t" > > "iret\n\t" > > + "2:\n\t" > > + __ASM_SERIALIZE "\n" > > "1:" > > : ASM_CALL_CONSTRAINT : : "memory"); > > } > > @@ -23,6 +27,7 @@ static inline void iret_to_self(void) > > unsigned int tmp; > > > > asm volatile ( > > + ALTERNATIVE("", "jmp 2f", X86_FEATURE_SERIALIZE) > > Why is this and above stuck inside the asm statement? > > Why can't you simply do: > > if (static_cpu_has(X86_FEATURE_SERIALIZE)) { > asm volatile(__ASM_SERIALIZE ::: "memory"); > return; > } > > on function entry instead of making it more unreadable for no particular > reason? My my first submission I had implemented it as you describe. 
The only difference was that I used boot_cpu_has() instead of static_cpu_has() as the latter has a comment stating that: "Use static_cpu_has() only in fast paths (...) boot_cpu_has() is already fast enough for the majority of cases..." sync_core_before_usermode() already handles what I think are the critical paths. Thanks and BR, Ricardo
Re: [PATCH v2] x86/cpu: Use SERIALIZE in sync_core() when available
On Wed, Aug 05, 2020 at 07:08:08AM +0200, Borislav Petkov wrote: > On Tue, Aug 04, 2020 at 09:58:25PM -0700, h...@zytor.com wrote: > > Because why use an alternative to jump over one instruction? > > > > I personally would prefer to have the IRET put out of line > > Can't yet - SERIALIZE CPUs are a minority at the moment. > > > and have the call/jmp replaced by SERIALIZE inline. > > Well, we could do: > > alternative_io("... IRET bunch", __ASM_SERIALIZE, > X86_FEATURE_SERIALIZE, ...); > > and avoid all kinds of jumping. Alternatives get padded so there > would be a couple of NOPs following when SERIALIZE gets patched in > but it shouldn't be a problem. I guess one needs to look at what gcc > generates... But the IRET-TO-SELF code has instruction which modify the stack. This would violate stack invariance in alternatives as enforced in commit 7117f16bf460 ("objtool: Fix ORC vs alternatives"). As a result, objtool gives warnings as follows: arch/x86/kernel/alternative.o: warning: objtool: do_sync_core()+0xe: alternative modifies stack Perhaps in this specific case it does not matter as the changes in the stack will be undone by IRET. However, using alternative_io would require adding the macro STACK_FRAME_NON_STANDARD to functions using sync_core(). IMHO, it wouldn't look good. So maybe the best approach is to implement as you suggested using static_cpu_has()? Thanks and BR, Ricardo
Re: [PATCH v2] x86/cpu: Use SERIALIZE in sync_core() when available
On Wed, Aug 05, 2020 at 11:28:31AM -0700, Andy Lutomirski wrote: > On Wed, Aug 5, 2020 at 10:07 AM Ricardo Neri > wrote: > > > > On Wed, Aug 05, 2020 at 07:08:08AM +0200, Borislav Petkov wrote: > > > On Tue, Aug 04, 2020 at 09:58:25PM -0700, h...@zytor.com wrote: > > > > Because why use an alternative to jump over one instruction? > > > > > > > > I personally would prefer to have the IRET put out of line > > > > > > Can't yet - SERIALIZE CPUs are a minority at the moment. > > > > > > > and have the call/jmp replaced by SERIALIZE inline. > > > > > > Well, we could do: > > > > > > alternative_io("... IRET bunch", __ASM_SERIALIZE, > > > X86_FEATURE_SERIALIZE, ...); > > > > > > and avoid all kinds of jumping. Alternatives get padded so there > > > would be a couple of NOPs following when SERIALIZE gets patched in > > > but it shouldn't be a problem. I guess one needs to look at what gcc > > > generates... > > > > But the IRET-TO-SELF code has instruction which modify the stack. This > > would violate stack invariance in alternatives as enforced in commit > > 7117f16bf460 ("objtool: Fix ORC vs alternatives"). As a result, objtool > > gives warnings as follows: > > > > arch/x86/kernel/alternative.o: warning: objtool: do_sync_core()+0xe: > > alternative modifies stack > > > > Perhaps in this specific case it does not matter as the changes in the > > stack will be undone by IRET. However, using alternative_io would require > > adding the macro STACK_FRAME_NON_STANDARD to functions using sync_core(). > > IMHO, it wouldn't look good. > > > > So maybe the best approach is to implement as you suggested using > > static_cpu_has()? > > I agree. Let's keep it simple. > > Honestly, I think the right solution is to have iret_to_self() in > actual asm and invoke it from C as needed. Do you mean anything different from what we have already [1]? If I understand your comment correctly, we have exactly that: an iret_to_self() asm implementation invoked from C. 
Thanks and BR, Ricardo [1]. https://lore.kernel.org/lkml/20200727043132.15082-4-ricardo.neri-calde...@linux.intel.com/
[PATCH v2] x86/cpu: Use SERIALIZE in sync_core() when available
The SERIALIZE instruction gives software a way to force the processor to complete all modifications to flags, registers and memory from previous instructions and drain all buffered writes to memory before the next instruction is fetched and executed. Thus, it serves the purpose of sync_core(). Use it when available. Commit 7117f16bf460 ("objtool: Fix ORC vs alternatives") enforced stack invariance in alternatives. The iret-to-self does not comply with such invariance. Thus, it cannot be used inside alternative code. Instead, use an alternative that jumps to SERIALIZE when available. Cc: Andy Lutomirski Cc: Cathy Zhang Cc: Dave Hansen Cc: Fenghua Yu Cc: "H. Peter Anvin" Cc: Kyung Min Park Cc: Peter Zijlstra Cc: "Ravi V. Shankar" Cc: Sean Christopherson Cc: linux-e...@vger.kernel.org Cc: linux-kernel@vger.kernel.org Suggested-by: Andy Lutomirski Signed-off-by: Ricardo Neri --- This is a v2 from my initial submission [1]. The first three patches of the series have been merged in Linus' tree. Hence, I am submitting only this patch for review. [1]. https://lkml.org/lkml/2020/7/27/8 Changes since v1: * Support SERIALIZE using alternative runtime patching. (Peter Zijlstra, H. Peter Anvin) * Added a note to specify which version of binutils supports SERIALIZE. (Peter Zijlstra) * Verified that (::: "memory") is used. (H. Peter Anvin) --- arch/x86/include/asm/special_insns.h | 2 ++ arch/x86/include/asm/sync_core.h | 10 +- 2 files changed, 11 insertions(+), 1 deletion(-) diff --git a/arch/x86/include/asm/special_insns.h b/arch/x86/include/asm/special_insns.h index 59a3e13204c3..25cd67801dda 100644 --- a/arch/x86/include/asm/special_insns.h +++ b/arch/x86/include/asm/special_insns.h @@ -10,6 +10,8 @@ #include #include +/* Instruction opcode for SERIALIZE; supported in binutils >= 2.35. 
*/ +#define __ASM_SERIALIZE ".byte 0xf, 0x1, 0xe8" /* * Volatile isn't enough to prevent the compiler from reordering the * read/write functions for the control registers and messing everything up. diff --git a/arch/x86/include/asm/sync_core.h b/arch/x86/include/asm/sync_core.h index fdb5b356e59b..201ea3d9a6bd 100644 --- a/arch/x86/include/asm/sync_core.h +++ b/arch/x86/include/asm/sync_core.h @@ -5,15 +5,19 @@ #include #include #include +#include #ifdef CONFIG_X86_32 static inline void iret_to_self(void) { asm volatile ( + ALTERNATIVE("", "jmp 2f", X86_FEATURE_SERIALIZE) "pushfl\n\t" "pushl %%cs\n\t" "pushl $1f\n\t" "iret\n\t" + "2:\n\t" + __ASM_SERIALIZE "\n" "1:" : ASM_CALL_CONSTRAINT : : "memory"); } @@ -23,6 +27,7 @@ static inline void iret_to_self(void) unsigned int tmp; asm volatile ( + ALTERNATIVE("", "jmp 2f", X86_FEATURE_SERIALIZE) "mov %%ss, %0\n\t" "pushq %q0\n\t" "pushq %%rsp\n\t" @@ -32,6 +37,8 @@ static inline void iret_to_self(void) "pushq %q0\n\t" "pushq $1f\n\t" "iretq\n\t" + "2:\n\t" + __ASM_SERIALIZE "\n" "1:" : "=" (tmp), ASM_CALL_CONSTRAINT : : "cc", "memory"); } @@ -54,7 +61,8 @@ static inline void iret_to_self(void) static inline void sync_core(void) { /* -* There are quite a few ways to do this. IRET-to-self is nice +* Hardware can do this for us if SERIALIZE is available. Otherwise, +* there are quite a few ways to do this. IRET-to-self is nice * because it works on every CPU, at any CPL (so it's compatible * with paravirtualization), and it never exits to a hypervisor. * The only down sides are that it's a bit slow (it seems to be -- 2.17.1
Re: [PATCH 4/4] x86/cpu: Use SERIALIZE in sync_core() when available
On Mon, Jul 27, 2020 at 03:30:20PM +0200, pet...@infradead.org wrote: > On Mon, Jul 27, 2020 at 03:05:36PM +0200, pet...@infradead.org wrote: > > Yeah, I'm not sure.. the 'funny' thing is that typically call > > sync_core() from an IPI anyway. And the synchronous broadcast IPI is by > > far the most expensive part of that. > > > > Something like this... > > > > diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c > > index 20e07feb4064..528e049ee1d9 100644 > > --- a/arch/x86/kernel/alternative.c > > +++ b/arch/x86/kernel/alternative.c > > @@ -989,12 +989,13 @@ void *text_poke_kgdb(void *addr, const void *opcode, > > size_t len) > > > > static void do_sync_core(void *info) > > { > > - sync_core(); > > + /* IRET implies sync_core() */ > > } > > > > void text_poke_sync(void) > > { > > on_each_cpu(do_sync_core, NULL, 1); > > + sync_core(); > > } > > > > struct text_poke_loc { > > So 'people' have wanted to optimize this for NOHZ_FULL and I suppose > virt as well. > > IFF VMENTER is serializing, I suppose we can simply do something like: > > bool text_poke_cond(int cpu, void *info) > { > /* >* If we observe the vCPU is preempted, it will do VMENTER >* no point in sending an IPI to SERIALIZE. >*/ > return !vcpu_is_preempted(cpu); > } > > void text_poke_sync(void) > { > smp_call_function_many_cond(cpu_possible_mask, > do_sync_core, NULL, 1, text_poke_cond); > sync_core(); > } > > The 'same' for NOHZ_FULL, except we need to cmpxchg a value such that > if the cmpxchg() succeeds we know the CPU is in userspace and will > SERIALIZE on the next entry. Much like kvm_flush_tlb_others(). > > > Anyway, that's all hand-wavey.. I'll let someone that cares about those > things write actual patches :-) I think I got a little lost here. 
If I understand correctly, there are two alternatives to implement support for serialize better: a) alternative(IRET_TO_SELF, SERIALIZE, X86_FEATURE_SERIALIZE); or b) asm volatile("1:.byte 0xf, 0x1, 0xe8;2:" _ASM_EXTABLE(1b:2b) a) would be the traditional and simpler solution. b) would rely on causing an #UD and getting an IRET on existing hardware. b) would need some more optimization work when handling the exception and a few reworks on the poke patching code. Which option should I focus on? Which option would be more desirable/better? Thanks and BR, Ricardo
Re: [PATCH 4/4] x86/cpu: Use SERIALIZE in sync_core() when available
On Mon, Jul 27, 2020 at 05:47:32AM -0700, h...@zytor.com wrote: > On July 27, 2020 1:20:03 AM PDT, pet...@infradead.org wrote: > >On Sun, Jul 26, 2020 at 09:31:32PM -0700, Ricardo Neri wrote: > >> +static inline void serialize(void) > >> +{ > >> + asm volatile(".byte 0xf, 0x1, 0xe8"); > >> +} > > > >Can we pretty please have a comment with the binutils version that has > >the mnemonic? Such that when we increase the required binutils version we > >can grep for this. > > It also needs : : : "memory" to be a barrier. Sure Peter, I will make this change. Thanks and BR, Ricardo
Re: [PATCH 4/4] x86/cpu: Use SERIALIZE in sync_core() when available
On Mon, Jul 27, 2020 at 10:20:03AM +0200, pet...@infradead.org wrote: > On Sun, Jul 26, 2020 at 09:31:32PM -0700, Ricardo Neri wrote: > > +static inline void serialize(void) > > +{ > > + asm volatile(".byte 0xf, 0x1, 0xe8"); > > +} > Can we pretty please have a comment with the binutils version that has > the mnemonic? Such that when we increase the required binutils version we > can grep for this. This is supported since binutils 2.35[1]. I'll add a comment with the clarification. Thanks and BR, Ricardo [1]. https://sourceware.org/git/gitweb.cgi?p=binutils-gdb.git;a=blob_plain;f=gas/NEWS;;hb=refs/tags/binutils-2_35
Re: [PATCH 0/4] x86/cpu: Use SERIALIZE in sync_core()
On Mon, Jul 27, 2020 at 01:07:08PM +0200, Ingo Molnar wrote: > > * Ricardo Neri wrote: > > > A recent submission to LKML introduced a CPU feature flag for a new > > Intel architecture Serializing Instruction, SERIALIZE [1]. Unlike the > > existing Serializing Instructions, this new instruction does not have > > side effects such as clobbering registers or exiting to a hypervisor. > > > > As stated in the Intel "extensions" (ISE) manual [2], this instruction > > will appear first in Sapphire Rapids and Alder Lake. > > > > Andy Lutomirski suggested to use this instruction in sync_core() as it > > serves the very purpose of this function [3]. > > > > For completeness, I picked patch #3 from Cathy's series (and has become > > patch #1 here) [1]. Her series depends on such patch to build correctly. > > Maybe it can be merged independently while the discussion continues? > > > > Thanks and BR, > > Ricardo > > > > [1]. > > https://lore.kernel.org/kvm/1594088183-7187-1-git-send-email-cathy.zh...@intel.com/ > > [2]. > > https://software.intel.com/sites/default/files/managed/c5/15/architecture-instruction-set-extensions-programming-reference.pdf > > [3]. > > https://lore.kernel.org/kvm/CALCETrWudiF8G8r57r5i4JefuP5biG1kHg==0o8yxb-bys-...@mail.gmail.com/ > > > > Ricardo Neri (4): > > x86/cpufeatures: Add enumeration for SERIALIZE instruction > > x86/cpu: Relocate sync_core() to sync_core.h > > x86/cpu: Refactor sync_core() for readability > > x86/cpu: Use SERIALIZE in sync_core() when available > > I've picked up the first 3 preparatory patches into tip:x86/cpu, as > they are valid improvements even without the 4th patch. The actual > functionality in the 4th patch still needs work. Thank you Ingo! I'll work on the fourth patch. BR, Ricardo
[tip: x86/cpu] x86/cpufeatures: Add enumeration for SERIALIZE instruction
The following commit has been merged into the x86/cpu branch of tip: Commit-ID: 85b23fbc7d88f8c6e3951721802d7845bc39663d Gitweb: https://git.kernel.org/tip/85b23fbc7d88f8c6e3951721802d7845bc39663d Author:Ricardo Neri AuthorDate:Sun, 26 Jul 2020 21:31:29 -07:00 Committer: Ingo Molnar CommitterDate: Mon, 27 Jul 2020 12:42:06 +02:00 x86/cpufeatures: Add enumeration for SERIALIZE instruction The Intel architecture defines a set of Serializing Instructions (a detailed definition can be found in Vol.3 Section 8.3 of the Intel "main" manual, SDM). However, these instructions do more than what is required, have side effects and/or may be rather invasive. Furthermore, some of these instructions are only available in kernel mode or may cause VMExits. Thus, software using these instructions only to serialize execution (as defined in the manual) must handle the undesired side effects. As indicated in the name, SERIALIZE is a new Intel architecture Serializing Instruction. Crucially, it does not have any of the mentioned side effects. Also, it does not cause VMExit and can be used in user mode. This new instruction is currently documented in the latest "extensions" manual (ISE). It will appear in the "main" manual in the future. 
Signed-off-by: Ricardo Neri Signed-off-by: Ingo Molnar Reviewed-by: Tony Luck Acked-by: Dave Hansen Link: https://lore.kernel.org/r/20200727043132.15082-2-ricardo.neri-calde...@linux.intel.com --- arch/x86/include/asm/cpufeatures.h | 1 + 1 file changed, 1 insertion(+) diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h index 02dabc9..adf45cf 100644 --- a/arch/x86/include/asm/cpufeatures.h +++ b/arch/x86/include/asm/cpufeatures.h @@ -365,6 +365,7 @@ #define X86_FEATURE_SRBDS_CTRL (18*32+ 9) /* "" SRBDS mitigation MSR available */ #define X86_FEATURE_MD_CLEAR (18*32+10) /* VERW clears CPU buffers */ #define X86_FEATURE_TSX_FORCE_ABORT(18*32+13) /* "" TSX_FORCE_ABORT */ +#define X86_FEATURE_SERIALIZE (18*32+14) /* SERIALIZE instruction */ #define X86_FEATURE_PCONFIG(18*32+18) /* Intel PCONFIG */ #define X86_FEATURE_SPEC_CTRL (18*32+26) /* "" Speculation Control (IBRS + IBPB) */ #define X86_FEATURE_INTEL_STIBP(18*32+27) /* "" Single Thread Indirect Branch Predictors */
[tip: x86/cpu] x86/cpu: Relocate sync_core() to sync_core.h
The following commit has been merged into the x86/cpu branch of tip: Commit-ID: 9998a9832c4027e907353e5e05fde730cf624b77 Gitweb: https://git.kernel.org/tip/9998a9832c4027e907353e5e05fde730cf624b77 Author:Ricardo Neri AuthorDate:Sun, 26 Jul 2020 21:31:30 -07:00 Committer: Ingo Molnar CommitterDate: Mon, 27 Jul 2020 12:42:06 +02:00 x86/cpu: Relocate sync_core() to sync_core.h Having sync_core() in processor.h is problematic since it is not possible to check for hardware capabilities via the *cpu_has() family of macros. The latter needs the definitions in processor.h. It also looks more intuitive to relocate the function to sync_core.h. This changeset does not make changes in functionality. Signed-off-by: Ricardo Neri Signed-off-by: Ingo Molnar Reviewed-by: Tony Luck Link: https://lore.kernel.org/r/20200727043132.15082-3-ricardo.neri-calde...@linux.intel.com --- arch/x86/include/asm/processor.h| 64 + arch/x86/include/asm/sync_core.h| 64 - arch/x86/kernel/alternative.c | 1 +- arch/x86/kernel/cpu/mce/core.c | 1 +- drivers/misc/sgi-gru/grufault.c | 1 +- drivers/misc/sgi-gru/gruhandles.c | 1 +- drivers/misc/sgi-gru/grukservices.c | 1 +- 7 files changed, 69 insertions(+), 64 deletions(-) diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h index 03b7c4c..68ba42f 100644 --- a/arch/x86/include/asm/processor.h +++ b/arch/x86/include/asm/processor.h @@ -678,70 +678,6 @@ static inline unsigned int cpuid_edx(unsigned int op) return edx; } -/* - * This function forces the icache and prefetched instruction stream to - * catch up with reality in two very specific cases: - * - * a) Text was modified using one virtual address and is about to be executed - * from the same physical page at a different virtual address. - * - * b) Text was modified on a different CPU, may subsequently be - * executed on this CPU, and you want to make sure the new version - * gets executed. This generally means you're calling this in a IPI. 
- * - * If you're calling this for a different reason, you're probably doing - * it wrong. - */ -static inline void sync_core(void) -{ - /* -* There are quite a few ways to do this. IRET-to-self is nice -* because it works on every CPU, at any CPL (so it's compatible -* with paravirtualization), and it never exits to a hypervisor. -* The only down sides are that it's a bit slow (it seems to be -* a bit more than 2x slower than the fastest options) and that -* it unmasks NMIs. The "push %cs" is needed because, in -* paravirtual environments, __KERNEL_CS may not be a valid CS -* value when we do IRET directly. -* -* In case NMI unmasking or performance ever becomes a problem, -* the next best option appears to be MOV-to-CR2 and an -* unconditional jump. That sequence also works on all CPUs, -* but it will fault at CPL3 (i.e. Xen PV). -* -* CPUID is the conventional way, but it's nasty: it doesn't -* exist on some 486-like CPUs, and it usually exits to a -* hypervisor. -* -* Like all of Linux's memory ordering operations, this is a -* compiler barrier as well. 
-*/ -#ifdef CONFIG_X86_32 - asm volatile ( - "pushfl\n\t" - "pushl %%cs\n\t" - "pushl $1f\n\t" - "iret\n\t" - "1:" - : ASM_CALL_CONSTRAINT : : "memory"); -#else - unsigned int tmp; - - asm volatile ( - "mov %%ss, %0\n\t" - "pushq %q0\n\t" - "pushq %%rsp\n\t" - "addq $8, (%%rsp)\n\t" - "pushfq\n\t" - "mov %%cs, %0\n\t" - "pushq %q0\n\t" - "pushq $1f\n\t" - "iretq\n\t" - "1:" - : "=" (tmp), ASM_CALL_CONSTRAINT : : "cc", "memory"); -#endif -} - extern void select_idle_routine(const struct cpuinfo_x86 *c); extern void amd_e400_c1e_apic_setup(void); diff --git a/arch/x86/include/asm/sync_core.h b/arch/x86/include/asm/sync_core.h index c67caaf..9c5573f 100644 --- a/arch/x86/include/asm/sync_core.h +++ b/arch/x86/include/asm/sync_core.h @@ -7,6 +7,70 @@ #include /* + * This function forces the icache and prefetched instruction stream to + * catch up with reality in two very specific cases: + * + * a) Text was modified using one virtual address and is about to be executed + * from the same physical page at a different virtual address. + * + * b) Text was modified on a different CPU, may subsequently be + * executed on this CPU, and yo
[tip: x86/cpu] x86/cpu: Refactor sync_core() for readability
The following commit has been merged into the x86/cpu branch of tip: Commit-ID: f69ca629d89d65737537e05308ac531f7bb07d5c Gitweb: https://git.kernel.org/tip/f69ca629d89d65737537e05308ac531f7bb07d5c Author:Ricardo Neri AuthorDate:Sun, 26 Jul 2020 21:31:31 -07:00 Committer: Ingo Molnar CommitterDate: Mon, 27 Jul 2020 12:42:06 +02:00 x86/cpu: Refactor sync_core() for readability Instead of having #ifdef/#endif blocks inside sync_core() for X86_64 and X86_32, implement the new function iret_to_self() with two versions. In this manner, avoid having to use even more #ifdef/#endif blocks when adding support for SERIALIZE in sync_core(). Co-developed-by: Tony Luck Signed-off-by: Tony Luck Signed-off-by: Ricardo Neri Signed-off-by: Ingo Molnar Link: https://lore.kernel.org/r/20200727043132.15082-4-ricardo.neri-calde...@linux.intel.com --- arch/x86/include/asm/special_insns.h | 1 +- arch/x86/include/asm/sync_core.h | 56 +++ 2 files changed, 32 insertions(+), 25 deletions(-) diff --git a/arch/x86/include/asm/special_insns.h b/arch/x86/include/asm/special_insns.h index eb8e781..59a3e13 100644 --- a/arch/x86/include/asm/special_insns.h +++ b/arch/x86/include/asm/special_insns.h @@ -234,7 +234,6 @@ static inline void clwb(volatile void *__p) #define nop() asm volatile ("nop") - #endif /* __KERNEL__ */ #endif /* _ASM_X86_SPECIAL_INSNS_H */ diff --git a/arch/x86/include/asm/sync_core.h b/arch/x86/include/asm/sync_core.h index 9c5573f..fdb5b35 100644 --- a/arch/x86/include/asm/sync_core.h +++ b/arch/x86/include/asm/sync_core.h @@ -6,6 +6,37 @@ #include #include +#ifdef CONFIG_X86_32 +static inline void iret_to_self(void) +{ + asm volatile ( + "pushfl\n\t" + "pushl %%cs\n\t" + "pushl $1f\n\t" + "iret\n\t" + "1:" + : ASM_CALL_CONSTRAINT : : "memory"); +} +#else +static inline void iret_to_self(void) +{ + unsigned int tmp; + + asm volatile ( + "mov %%ss, %0\n\t" + "pushq %q0\n\t" + "pushq %%rsp\n\t" + "addq $8, (%%rsp)\n\t" + "pushfq\n\t" + "mov %%cs, %0\n\t" + "pushq %q0\n\t" + 
"pushq $1f\n\t" + "iretq\n\t" + "1:" + : "=" (tmp), ASM_CALL_CONSTRAINT : : "cc", "memory"); +} +#endif /* CONFIG_X86_32 */ + /* * This function forces the icache and prefetched instruction stream to * catch up with reality in two very specific cases: @@ -44,30 +75,7 @@ static inline void sync_core(void) * Like all of Linux's memory ordering operations, this is a * compiler barrier as well. */ -#ifdef CONFIG_X86_32 - asm volatile ( - "pushfl\n\t" - "pushl %%cs\n\t" - "pushl $1f\n\t" - "iret\n\t" - "1:" - : ASM_CALL_CONSTRAINT : : "memory"); -#else - unsigned int tmp; - - asm volatile ( - "mov %%ss, %0\n\t" - "pushq %q0\n\t" - "pushq %%rsp\n\t" - "addq $8, (%%rsp)\n\t" - "pushfq\n\t" - "mov %%cs, %0\n\t" - "pushq %q0\n\t" - "pushq $1f\n\t" - "iretq\n\t" - "1:" - : "=" (tmp), ASM_CALL_CONSTRAINT : : "cc", "memory"); -#endif + iret_to_self(); } /*
[PATCH 3/4] x86/cpu: Refactor sync_core() for readability
Instead of having #ifdef/#endif blocks inside sync_core() for X86_64 and X86_32, implement the new function iret_to_self() with two versions. In this manner, avoid having to use even more more #ifdef/#endif blocks when adding support for SERIALIZE in sync_core(). Cc: Andy Lutomirski Cc: Cathy Zhang Cc: Dave Hansen Cc: Fenghua Yu Cc: "H. Peter Anvin" Cc: Kyung Min Park Cc: Peter Zijlstra Cc: "Ravi V. Shankar" Cc: Sean Christopherson Cc: linux-e...@vger.kernel.org Cc: linux-kernel@vger.kernel.org Co-developed-by: Tony Luck Signed-off-by: Tony Luck Signed-off-by: Ricardo Neri --- --- arch/x86/include/asm/special_insns.h | 1 - arch/x86/include/asm/sync_core.h | 56 2 files changed, 32 insertions(+), 25 deletions(-) diff --git a/arch/x86/include/asm/special_insns.h b/arch/x86/include/asm/special_insns.h index eb8e781c4353..59a3e13204c3 100644 --- a/arch/x86/include/asm/special_insns.h +++ b/arch/x86/include/asm/special_insns.h @@ -234,7 +234,6 @@ static inline void clwb(volatile void *__p) #define nop() asm volatile ("nop") - #endif /* __KERNEL__ */ #endif /* _ASM_X86_SPECIAL_INSNS_H */ diff --git a/arch/x86/include/asm/sync_core.h b/arch/x86/include/asm/sync_core.h index 9c5573f2c333..fdb5b356e59b 100644 --- a/arch/x86/include/asm/sync_core.h +++ b/arch/x86/include/asm/sync_core.h @@ -6,6 +6,37 @@ #include #include +#ifdef CONFIG_X86_32 +static inline void iret_to_self(void) +{ + asm volatile ( + "pushfl\n\t" + "pushl %%cs\n\t" + "pushl $1f\n\t" + "iret\n\t" + "1:" + : ASM_CALL_CONSTRAINT : : "memory"); +} +#else +static inline void iret_to_self(void) +{ + unsigned int tmp; + + asm volatile ( + "mov %%ss, %0\n\t" + "pushq %q0\n\t" + "pushq %%rsp\n\t" + "addq $8, (%%rsp)\n\t" + "pushfq\n\t" + "mov %%cs, %0\n\t" + "pushq %q0\n\t" + "pushq $1f\n\t" + "iretq\n\t" + "1:" + : "=" (tmp), ASM_CALL_CONSTRAINT : : "cc", "memory"); +} +#endif /* CONFIG_X86_32 */ + /* * This function forces the icache and prefetched instruction stream to * catch up with reality in two very specific 
cases: @@ -44,30 +75,7 @@ static inline void sync_core(void) * Like all of Linux's memory ordering operations, this is a * compiler barrier as well. */ -#ifdef CONFIG_X86_32 - asm volatile ( - "pushfl\n\t" - "pushl %%cs\n\t" - "pushl $1f\n\t" - "iret\n\t" - "1:" - : ASM_CALL_CONSTRAINT : : "memory"); -#else - unsigned int tmp; - - asm volatile ( - "mov %%ss, %0\n\t" - "pushq %q0\n\t" - "pushq %%rsp\n\t" - "addq $8, (%%rsp)\n\t" - "pushfq\n\t" - "mov %%cs, %0\n\t" - "pushq %q0\n\t" - "pushq $1f\n\t" - "iretq\n\t" - "1:" - : "=" (tmp), ASM_CALL_CONSTRAINT : : "cc", "memory"); -#endif + iret_to_self(); } /* -- 2.17.1
[PATCH 2/4] x86/cpu: Relocate sync_core() to sync_core.h
Having sync_core() in processor.h is problematic since it is not possible to check for hardware capabilities via the *cpu_has() family of macros. The latter needs the definitions in processor.h. It also looks more intuitive to relocate the function to sync_core.h. This changeset does not make changes in functionality. Cc: Andy Lutomirski Cc: Cathy Zhang Cc: Dave Hansen Cc: Dimitri Sivanich Cc: Fenghua Yu Cc: "H. Peter Anvin" Cc: Kyung Min Park Cc: Peter Zijlstra Cc: "Ravi V. Shankar" Cc: Sean Christopherson Cc: linux-e...@vger.kernel.org Cc: linux-kernel@vger.kernel.org Reviewed-by: Tony Luck Signed-off-by: Ricardo Neri --- --- arch/x86/include/asm/processor.h| 64 - arch/x86/include/asm/sync_core.h| 64 + arch/x86/kernel/alternative.c | 1 + arch/x86/kernel/cpu/mce/core.c | 1 + drivers/misc/sgi-gru/grufault.c | 1 + drivers/misc/sgi-gru/gruhandles.c | 1 + drivers/misc/sgi-gru/grukservices.c | 1 + 7 files changed, 69 insertions(+), 64 deletions(-) diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h index 2a1f7e1d7151..97143d87994c 100644 --- a/arch/x86/include/asm/processor.h +++ b/arch/x86/include/asm/processor.h @@ -676,70 +676,6 @@ static inline unsigned int cpuid_edx(unsigned int op) return edx; } -/* - * This function forces the icache and prefetched instruction stream to - * catch up with reality in two very specific cases: - * - * a) Text was modified using one virtual address and is about to be executed - * from the same physical page at a different virtual address. - * - * b) Text was modified on a different CPU, may subsequently be - * executed on this CPU, and you want to make sure the new version - * gets executed. This generally means you're calling this in a IPI. - * - * If you're calling this for a different reason, you're probably doing - * it wrong. - */ -static inline void sync_core(void) -{ - /* -* There are quite a few ways to do this. 
IRET-to-self is nice -* because it works on every CPU, at any CPL (so it's compatible -* with paravirtualization), and it never exits to a hypervisor. -* The only down sides are that it's a bit slow (it seems to be -* a bit more than 2x slower than the fastest options) and that -* it unmasks NMIs. The "push %cs" is needed because, in -* paravirtual environments, __KERNEL_CS may not be a valid CS -* value when we do IRET directly. -* -* In case NMI unmasking or performance ever becomes a problem, -* the next best option appears to be MOV-to-CR2 and an -* unconditional jump. That sequence also works on all CPUs, -* but it will fault at CPL3 (i.e. Xen PV). -* -* CPUID is the conventional way, but it's nasty: it doesn't -* exist on some 486-like CPUs, and it usually exits to a -* hypervisor. -* -* Like all of Linux's memory ordering operations, this is a -* compiler barrier as well. -*/ -#ifdef CONFIG_X86_32 - asm volatile ( - "pushfl\n\t" - "pushl %%cs\n\t" - "pushl $1f\n\t" - "iret\n\t" - "1:" - : ASM_CALL_CONSTRAINT : : "memory"); -#else - unsigned int tmp; - - asm volatile ( - "mov %%ss, %0\n\t" - "pushq %q0\n\t" - "pushq %%rsp\n\t" - "addq $8, (%%rsp)\n\t" - "pushfq\n\t" - "mov %%cs, %0\n\t" - "pushq %q0\n\t" - "pushq $1f\n\t" - "iretq\n\t" - "1:" - : "=" (tmp), ASM_CALL_CONSTRAINT : : "cc", "memory"); -#endif -} - extern void select_idle_routine(const struct cpuinfo_x86 *c); extern void amd_e400_c1e_apic_setup(void); diff --git a/arch/x86/include/asm/sync_core.h b/arch/x86/include/asm/sync_core.h index c67caafd3381..9c5573f2c333 100644 --- a/arch/x86/include/asm/sync_core.h +++ b/arch/x86/include/asm/sync_core.h @@ -6,6 +6,70 @@ #include #include +/* + * This function forces the icache and prefetched instruction stream to + * catch up with reality in two very specific cases: + * + * a) Text was modified using one virtual address and is about to be executed + * from the same physical page at a different virtual address. 
+ * + * b) Text was modified on a different CPU, may subsequently be + * executed on this CPU, and you want to make sure the new version + * gets executed. This generally means you're calling this in a IPI. + * + * If you're calling this for a different reason, you're probably doing + * it wrong. + */ +static i
[PATCH 4/4] x86/cpu: Use SERIALIZE in sync_core() when available
The SERIALIZE instruction gives software a way to force the processor to complete all modifications to flags, registers and memory from previous instructions and drain all buffered writes to memory before the next instruction is fetched and executed. Thus, it serves the purpose of sync_core(). Use it when available. Use boot_cpu_has() and not static_cpu_has(); the most critical paths (returning to user mode and from interrupt and NMI) will not reach sync_core(). Cc: Andy Lutomirski Cc: Cathy Zhang Cc: Dave Hansen Cc: Fenghua Yu Cc: "H. Peter Anvin" Cc: Kyung Min Park Cc: Peter Zijlstra Cc: "Ravi V. Shankar" Cc: Sean Christopherson Cc: linux-e...@vger.kernel.org Cc: linux-kernel@vger.kernel.org Reviewed-by: Tony Luck Suggested-by: Andy Lutomirski Signed-off-by: Ricardo Neri --- --- arch/x86/include/asm/special_insns.h | 5 + arch/x86/include/asm/sync_core.h | 10 +- 2 files changed, 14 insertions(+), 1 deletion(-) diff --git a/arch/x86/include/asm/special_insns.h b/arch/x86/include/asm/special_insns.h index 59a3e13204c3..0a2a60bba282 100644 --- a/arch/x86/include/asm/special_insns.h +++ b/arch/x86/include/asm/special_insns.h @@ -234,6 +234,11 @@ static inline void clwb(volatile void *__p) #define nop() asm volatile ("nop") +static inline void serialize(void) +{ + asm volatile(".byte 0xf, 0x1, 0xe8"); +} + #endif /* __KERNEL__ */ #endif /* _ASM_X86_SPECIAL_INSNS_H */ diff --git a/arch/x86/include/asm/sync_core.h b/arch/x86/include/asm/sync_core.h index fdb5b356e59b..bf132c09d61b 100644 --- a/arch/x86/include/asm/sync_core.h +++ b/arch/x86/include/asm/sync_core.h @@ -5,6 +5,7 @@ #include #include #include +#include #ifdef CONFIG_X86_32 static inline void iret_to_self(void) @@ -54,7 +55,8 @@ static inline void iret_to_self(void) static inline void sync_core(void) { /* -* There are quite a few ways to do this. IRET-to-self is nice +* Hardware can do this for us if SERIALIZE is available. Otherwise, +* there are quite a few ways to do this. 
IRET-to-self is nice * because it works on every CPU, at any CPL (so it's compatible * with paravirtualization), and it never exits to a hypervisor. * The only down sides are that it's a bit slow (it seems to be @@ -75,6 +77,12 @@ static inline void sync_core(void) * Like all of Linux's memory ordering operations, this is a * compiler barrier as well. */ + + if (boot_cpu_has(X86_FEATURE_SERIALIZE)) { + serialize(); + return; + } + iret_to_self(); } -- 2.17.1
[PATCH 1/4] x86/cpufeatures: Add enumeration for SERIALIZE instruction
The Intel architecture defines a set of Serializing Instructions (a detailed definition can be found in Vol.3 Section 8.3 of the Intel "main" manual, SDM). However, these instructions do more than what is required, have side effects and/or may be rather invasive. Furthermore, some of these instructions are only available in kernel mode or may cause VMExits. Thus, software using these instructions only to serialize execution (as defined in the manual) must handle the undesired side effects. As indicated in the name, SERIALIZE is a new Intel architecture Serializing Instruction. Crucially, it does not have any of the mentioned side effects. Also, it does not cause VMExit and can be used in user mode. This new instruction is currently documented in the latest "extensions" manual (ISE). It will appear in the "main" manual in the future. Cc: Andy Lutomirski Cc: Cathy Zhang Cc: Fenghua Yu Cc: "H. Peter Anvin" Cc: Kyung Min Park Cc: Peter Zijlstra Cc: "Ravi V. Shankar" Cc: Sean Christopherson Cc: Tony Luck Cc: linux-e...@vger.kernel.org Cc: linux-kernel@vger.kernel.org Acked-by: Dave Hansen Reviewed-by: Tony Luck Signed-off-by: Ricardo Neri --- --- arch/x86/include/asm/cpufeatures.h | 1 + 1 file changed, 1 insertion(+) diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h index 03390a1ef8e7..2901d5df4366 100644 --- a/arch/x86/include/asm/cpufeatures.h +++ b/arch/x86/include/asm/cpufeatures.h @@ -367,6 +367,7 @@ #define X86_FEATURE_SRBDS_CTRL (18*32+ 9) /* "" SRBDS mitigation MSR available */ #define X86_FEATURE_MD_CLEAR (18*32+10) /* VERW clears CPU buffers */ #define X86_FEATURE_TSX_FORCE_ABORT(18*32+13) /* "" TSX_FORCE_ABORT */ +#define X86_FEATURE_SERIALIZE (18*32+14) /* SERIALIZE instruction */ #define X86_FEATURE_PCONFIG(18*32+18) /* Intel PCONFIG */ #define X86_FEATURE_ARCH_LBR (18*32+19) /* Intel ARCH LBR */ #define X86_FEATURE_SPEC_CTRL (18*32+26) /* "" Speculation Control (IBRS + IBPB) */ -- 2.17.1
[PATCH 0/4] x86/cpu: Use SERIALIZE in sync_core()
A recent submission to LKML introduced a CPU feature flag for a new Intel architecture Serializing Instruction, SERIALIZE [1]. Unlike the existing Serializing Instructions, this new instruction does not have side effects such as clobbering registers or exiting to a hypervisor. As stated in the Intel "extensions" (ISE) manual [2], this instruction will appear first in Sapphire Rapids and Alder Lake. Andy Lutomirski suggested to use this instruction in sync_core() as it serves the very purpose of this function [3]. For completeness, I picked patch #3 from Cathy's series (and has become patch #1 here) [1]. Her series depends on such patch to build correctly. Maybe it can be merged independently while the discussion continues? Thanks and BR, Ricardo [1]. https://lore.kernel.org/kvm/1594088183-7187-1-git-send-email-cathy.zh...@intel.com/ [2]. https://software.intel.com/sites/default/files/managed/c5/15/architecture-instruction-set-extensions-programming-reference.pdf [3]. https://lore.kernel.org/kvm/CALCETrWudiF8G8r57r5i4JefuP5biG1kHg==0o8yxb-bys-...@mail.gmail.com/ Ricardo Neri (4): x86/cpufeatures: Add enumeration for SERIALIZE instruction x86/cpu: Relocate sync_core() to sync_core.h x86/cpu: Refactor sync_core() for readability x86/cpu: Use SERIALIZE in sync_core() when available arch/x86/include/asm/cpufeatures.h | 1 + arch/x86/include/asm/processor.h | 64 -- arch/x86/include/asm/special_insns.h | 4 ++ arch/x86/include/asm/sync_core.h | 80 arch/x86/kernel/alternative.c| 1 + arch/x86/kernel/cpu/mce/core.c | 1 + drivers/misc/sgi-gru/grufault.c | 1 + drivers/misc/sgi-gru/gruhandles.c| 1 + drivers/misc/sgi-gru/grukservices.c | 1 + 9 files changed, 90 insertions(+), 64 deletions(-) -- 2.17.1
Re: [PATCH v2 1/4] x86/cpufeatures: Add enumeration for SERIALIZE instruction
On Thu, Jul 23, 2020 at 01:02:43AM +0200, Thomas Gleixner wrote: > Ricardo Neri writes: > > On Tue, Jul 07, 2020 at 09:36:15AM -0700, Andy Lutomirski wrote: > >> On Mon, Jul 6, 2020 at 7:21 PM Cathy Zhang wrote: > >> > > >> > This instruction gives software a way to force the processor to complete > >> > all modifications to flags, registers and memory from previous > >> > instructions > >> > and drain all buffered writes to memory before the next instruction is > >> > fetched and executed. > >> > > >> > The same effect can be obtained using the cpuid instruction. However, > >> > cpuid causes modification on the eax, ebx, ecx, and edx registers; it > >> > also causes a VM exit. > >> > > >> > A processor supports SERIALIZE instruction if CPUID.0x07.0x0:EDX[14] is > >> > present. The CPU feature flag is shown as "serialize" in /proc/cpuinfo. > >> > > >> > Detailed information on the instructions and CPUID feature flag SERIALIZE > >> > can be found in the latest Intel Architecture Instruction Set Extensions > >> > and Future Features Programming Reference and Intel 64 and IA-32 > >> > Architectures Software Developer's Manual. > >> > >> Can you also wire this up so sync_core() uses it? > > > > I am cc'ing Fenghua, who has volunteered to work on this. Adding support > > for SERIALIZE in sync_core() should not block merging these patches, > > correct? > > Come on. We are not serving KVM first before making this usable on bare > metal. Hi Thomas, I ended up implementing support for SERIALIZE in sync_core(). I will be posting patches for this in the next few days. Thanks and BR, Ricardo
Re: [PATCH v5] x86/umip: Add emulation/spoofing for SLDT and STR instructions
On Fri, Jul 10, 2020 at 03:45:25PM -0700, Brendan Shanks wrote: > Add emulation/spoofing of SLDT and STR for both 32- and 64-bit > processes. > > Wine users have found a small number of Windows apps using SLDT that > were crashing when run on UMIP-enabled systems. > > Reported-by: Andreas Rammhold > Originally-by: Ricardo Neri > Signed-off-by: Brendan Shanks FWIW, tested on hardware with UMIP. Reviewed-by: Ricardo Neri Tested-by: Ricardo Neri Thanks and BR, Ricardo
Re: [PATCH v5] x86/umip: Add emulation/spoofing for SLDT and STR instructions
On Sat, Jul 11, 2020 at 02:49:54PM -0700, h...@zytor.com wrote: > On July 10, 2020 3:45:25 PM PDT, Brendan Shanks > wrote: > >Add emulation/spoofing of SLDT and STR for both 32- and 64-bit > >processes. > > > >Wine users have found a small number of Windows apps using SLDT that > >were crashing when run on UMIP-enabled systems. > > > >Reported-by: Andreas Rammhold > >Originally-by: Ricardo Neri > >Signed-off-by: Brendan Shanks > >--- > > > >v5: Capitalize instruction names in comments. > > > > arch/x86/kernel/umip.c | 40 +++- > > 1 file changed, 27 insertions(+), 13 deletions(-) > > > >diff --git a/arch/x86/kernel/umip.c b/arch/x86/kernel/umip.c > >index 8d5cbe1bbb3b..2c304fd0bb1a 100644 > >--- a/arch/x86/kernel/umip.c > >+++ b/arch/x86/kernel/umip.c > >@@ -45,11 +45,12 @@ > >* value that, lies close to the top of the kernel memory. The limit for > >the GDT > > * and the IDT are set to zero. > > * > >- * Given that SLDT and STR are not commonly used in programs that run > >on WineHQ > >- * or DOSEMU2, they are not emulated. > >- * > >- * The instruction smsw is emulated to return the value that the > >register CR0 > >+ * The instruction SMSW is emulated to return the value that the > >register CR0 > > * has at boot time as set in the head_32. > >+ * SLDT and STR are emulated to return the values that the kernel > >programmatically > >+ * assigns: > >+ * - SLDT returns (GDT_ENTRY_LDT * 8) if an LDT has been set, 0 if > >not. > >+ * - STR returns (GDT_ENTRY_TSS * 8). > > * > > * Emulation is provided for both 32-bit and 64-bit processes. 
> > * > >@@ -244,16 +245,34 @@ static int emulate_umip_insn(struct insn *insn, > >int umip_inst, > > *data_size += UMIP_GDT_IDT_LIMIT_SIZE; > > memcpy(data, _limit, UMIP_GDT_IDT_LIMIT_SIZE); > > > >-} else if (umip_inst == UMIP_INST_SMSW) { > >-unsigned long dummy_value = CR0_STATE; > >+} else if (umip_inst == UMIP_INST_SMSW || umip_inst == UMIP_INST_SLDT > >|| > >+ umip_inst == UMIP_INST_STR) { > >+unsigned long dummy_value; > >+ > >+if (umip_inst == UMIP_INST_SMSW) { > >+dummy_value = CR0_STATE; > >+} else if (umip_inst == UMIP_INST_STR) { > >+dummy_value = GDT_ENTRY_TSS * 8; > >+} else if (umip_inst == UMIP_INST_SLDT) { > >+#ifdef CONFIG_MODIFY_LDT_SYSCALL > >+down_read(>mm->context.ldt_usr_sem); > >+if (current->mm->context.ldt) > >+dummy_value = GDT_ENTRY_LDT * 8; > >+else > >+dummy_value = 0; > >+up_read(>mm->context.ldt_usr_sem); > >+#else > >+dummy_value = 0; > >+#endif > >+} > > > > /* > >- * Even though the CR0 register has 4 bytes, the number > >+ * For these 3 instructions, the number > > * of bytes to be copied in the result buffer is determined > > * by whether the operand is a register or a memory location. > > * If operand is a register, return as many bytes as the operand > > * size. If operand is memory, return only the two least > >- * siginificant bytes of CR0. > >+ * siginificant bytes. > > */ > > if (X86_MODRM_MOD(insn->modrm.value) == 3) > > *data_size = insn->opnd_bytes; > >@@ -261,7 +280,6 @@ static int emulate_umip_insn(struct insn *insn, int > >umip_inst, > > *data_size = 2; > > > > memcpy(data, _value, *data_size); > >-/* STR and SLDT are not emulated */ > > } else { > > return -EINVAL; > > } > >@@ -383,10 +401,6 @@ bool fixup_umip_exception(struct pt_regs *regs) > > umip_pr_warn(regs, "%s instruction cannot be used by applications.\n", > > umip_insns[umip_inst]); > > > >-/* Do not emulate (spoof) SLDT or STR. 
*/ > >-if (umip_inst == UMIP_INST_STR || umip_inst == UMIP_INST_SLDT) > >-return false; > >- > > umip_pr_warn(regs, "For now, expensive software emulation returns the > >result.\n"); > > > > if (emulate_umip_insn(, umip_inst, dummy_data, _data_size, > > Is there any reason for SLDT to not *always* return a fixed value? "An LDT > has been assigned" is formally a kernel internal property, separate from the > property of whether there are user space entries in the LDT. But isn't it true that sldt returns 0 if the application has not set an LDT and non-zero otherwise? In native_set_ldt() I see that the LDT register is set to 0 if the table has no entries and to GDT_ENTRY_LDT*8 otherwise. Please correct me if I understand this wrong. Thanks and BR, Ricardo
Re: [PATCH v2 1/4] x86/cpufeatures: Add enumeration for SERIALIZE instruction
On Tue, Jul 07, 2020 at 09:36:15AM -0700, Andy Lutomirski wrote: > On Mon, Jul 6, 2020 at 7:21 PM Cathy Zhang wrote: > > > > This instruction gives software a way to force the processor to complete > > all modifications to flags, registers and memory from previous instructions > > and drain all buffered writes to memory before the next instruction is > > fetched and executed. > > > > The same effect can be obtained using the cpuid instruction. However, > > cpuid causes modification on the eax, ebx, ecx, and edx registers; it > > also causes a VM exit. > > > > A processor supports SERIALIZE instruction if CPUID.0x07.0x0:EDX[14] is > > present. The CPU feature flag is shown as "serialize" in /proc/cpuinfo. > > > > Detailed information on the instructions and CPUID feature flag SERIALIZE > > can be found in the latest Intel Architecture Instruction Set Extensions > > and Future Features Programming Reference and Intel 64 and IA-32 > > Architectures Software Developer's Manual. > > Can you also wire this up so sync_core() uses it? I am cc'ing Fenghua, who has volunteered to work on this. Adding support for SERIALIZE in sync_core() should not block merging these patches, correct? Thanks and BR, Ricardo
Re: [PATCH v4] x86/umip: Add emulation/spoofing for SLDT and STR instructions
On Tue, Jun 09, 2020 at 10:54:23AM -0700, Brendan Shanks wrote: > Add emulation/spoofing of SLDT and STR for both 32- and 64-bit > processes. > > Wine users have found a small number of Windows apps using SLDT that > were crashing when run on UMIP-enabled systems. > > Reported-by: Andreas Rammhold > Originally-by: Ricardo Neri > Signed-off-by: Brendan Shanks > --- > > v4: Use braces for every clause of the conditional. I tried a switch(), > but it takes more lines and looks more cluttered (especially with the > #ifdef). > Also replace out-of-date comment at top of file. > > arch/x86/kernel/umip.c | 38 ++ > 1 file changed, 26 insertions(+), 12 deletions(-) > > diff --git a/arch/x86/kernel/umip.c b/arch/x86/kernel/umip.c > index 8d5cbe1bbb3b..62f4f0afb979 100644 > --- a/arch/x86/kernel/umip.c > +++ b/arch/x86/kernel/umip.c > @@ -45,11 +45,12 @@ > * value that, lies close to the top of the kernel memory. The limit for the > GDT > * and the IDT are set to zero. > * > - * Given that SLDT and STR are not commonly used in programs that run on > WineHQ > - * or DOSEMU2, they are not emulated. > - * > * The instruction smsw is emulated to return the value that the register CR0 > * has at boot time as set in the head_32. > + * sldt and str are emulated to return the values that the kernel > programmatically > + * assigns: > + * - sldt returns (GDT_ENTRY_LDT * 8) if an LDT has been set, 0 if not. > + * - str returns (GDT_ENTRY_TSS * 8). Probably for consistency with the rest of the document the instruction names should be capitalized (I know that smsw was not capitalized before :/). Maybe maintainers can fix this when applying? Other than that, FWIW: Reviewed-by: Ricardo Neri Tested-by: Ricardo Neri
Re: [PATCH v3] x86/umip: Add emulation/spoofing for SLDT and STR instructions
On Mon, Jun 08, 2020 at 05:38:12PM -0700, Ricardo Neri wrote: > On Mon, Jun 08, 2020 at 03:44:24PM -0700, Brendan Shanks wrote: > > Add emulation/spoofing of SLDT and STR for both 32- and 64-bit > > processes. > > > > Wine users have found a small number of Windows apps using SLDT that > > were crashing when run on UMIP-enabled systems. > > > > Reported-by: Andreas Rammhold > > Originally-by: Ricardo Neri > > Signed-off-by: Brendan Shanks > > --- > > > > v3: Use (GDT_ENTRY_TSS * 8) for task register selector instead of > > harcoding 0x40. > > > > arch/x86/kernel/umip.c | 32 +++- > > 1 file changed, 23 insertions(+), 9 deletions(-) > > > > diff --git a/arch/x86/kernel/umip.c b/arch/x86/kernel/umip.c > > index 8d5cbe1bbb3b..166c579b0273 100644 > > --- a/arch/x86/kernel/umip.c > > +++ b/arch/x86/kernel/umip.c > > @@ -244,16 +244,35 @@ static int emulate_umip_insn(struct insn *insn, int > > umip_inst, > > *data_size += UMIP_GDT_IDT_LIMIT_SIZE; > > memcpy(data, _limit, UMIP_GDT_IDT_LIMIT_SIZE); > > > > - } else if (umip_inst == UMIP_INST_SMSW) { > > - unsigned long dummy_value = CR0_STATE; > > + } else if (umip_inst == UMIP_INST_SMSW || umip_inst == UMIP_INST_SLDT || > > + umip_inst == UMIP_INST_STR) { > > + unsigned long dummy_value; > > + > > + if (umip_inst == UMIP_INST_SMSW) > > + dummy_value = CR0_STATE; > > + else if (umip_inst == UMIP_INST_STR) > > + dummy_value = GDT_ENTRY_TSS * 8; > > + else if (umip_inst == UMIP_INST_SLDT) > > + { > > This brace should go in the previous line. Also, if you use braces in > the last part of the conditional you should probably use them in the > previous ones. I guess in this case it woudln't improve readability. > Instead, you can probably have a switch instead of the three ifs. That > probably does improve readability and solves the dilemma of needing to > put braces in all the one-line conditionals. 
> > BTW, you should also delete the comment at the top of the file saying > that str and sldt will not be emulated: > > diff --git a/arch/x86/kernel/umip.c b/arch/x86/kernel/umip.c > index 166c579b0273..0984a55eb8c0 100644 > --- a/arch/x86/kernel/umip.c > +++ b/arch/x86/kernel/umip.c > @@ -45,9 +45,6 @@ > * value that, lies close to the top of the kernel memory. The limit for the > GDT > * and the IDT are set to zero. > * > - * Given that SLDT and STR are not commonly used in programs that run on > WineHQ > - * or DOSEMU2, they are not emulated. > - * > * The instruction smsw is emulated to return the value that the register CR0 > * has at boot time as set in the head_32. ... And also explain that the emulated values for str and sldt are simply the values that Linux assigns programmatically. Thanks and BR, Ricardo
Re: [PATCH v3] x86/umip: Add emulation/spoofing for SLDT and STR instructions
On Mon, Jun 08, 2020 at 03:44:24PM -0700, Brendan Shanks wrote: > Add emulation/spoofing of SLDT and STR for both 32- and 64-bit > processes. > > Wine users have found a small number of Windows apps using SLDT that > were crashing when run on UMIP-enabled systems. > > Reported-by: Andreas Rammhold > Originally-by: Ricardo Neri > Signed-off-by: Brendan Shanks > --- > > v3: Use (GDT_ENTRY_TSS * 8) for task register selector instead of > hardcoding 0x40. > > arch/x86/kernel/umip.c | 32 +++- > 1 file changed, 23 insertions(+), 9 deletions(-) > > diff --git a/arch/x86/kernel/umip.c b/arch/x86/kernel/umip.c > index 8d5cbe1bbb3b..166c579b0273 100644 > --- a/arch/x86/kernel/umip.c > +++ b/arch/x86/kernel/umip.c > @@ -244,16 +244,35 @@ static int emulate_umip_insn(struct insn *insn, int > umip_inst, > *data_size += UMIP_GDT_IDT_LIMIT_SIZE; > memcpy(data, &dummy_limit, UMIP_GDT_IDT_LIMIT_SIZE); > > - } else if (umip_inst == UMIP_INST_SMSW) { > - unsigned long dummy_value = CR0_STATE; > + } else if (umip_inst == UMIP_INST_SMSW || umip_inst == UMIP_INST_SLDT || > +umip_inst == UMIP_INST_STR) { > + unsigned long dummy_value; > + > + if (umip_inst == UMIP_INST_SMSW) > + dummy_value = CR0_STATE; > + else if (umip_inst == UMIP_INST_STR) > + dummy_value = GDT_ENTRY_TSS * 8; > + else if (umip_inst == UMIP_INST_SLDT) > + { This brace should go in the previous line. Also, if you use braces in the last part of the conditional you should probably use them in the previous ones. I guess in this case it wouldn't improve readability. Instead, you can probably have a switch instead of the three ifs. That probably does improve readability and solves the dilemma of needing to put braces in all the one-line conditionals. 
BTW, you should also delete the comment at the top of the file saying that str and sldt will not be emulated: diff --git a/arch/x86/kernel/umip.c b/arch/x86/kernel/umip.c index 166c579b0273..0984a55eb8c0 100644 --- a/arch/x86/kernel/umip.c +++ b/arch/x86/kernel/umip.c @@ -45,9 +45,6 @@ * value that, lies close to the top of the kernel memory. The limit for the GDT * and the IDT are set to zero. * - * Given that SLDT and STR are not commonly used in programs that run on WineHQ - * or DOSEMU2, they are not emulated. - * * The instruction smsw is emulated to return the value that the register CR0 * has at boot time as set in the head_32. * Thanks and BR, Ricardo
Re: [PATCH v2] x86/umip: Add emulation/spoofing for SLDT and STR instructions
On Mon, Jun 08, 2020 at 11:14:54AM -0700, Brendan Shanks wrote: > Add emulation/spoofing of SLDT and STR for both 32- and 64-bit > processes. > > Wine users have found a small number of Windows apps using SLDT that > were crashing when run on UMIP-enabled systems. > > Reported-by: Andreas Rammhold > Originally-by: Ricardo Neri > Signed-off-by: Brendan Shanks > --- > > v2: Return (GDT_ENTRY_LDT * 8) for SLDT when an LDT is set. > > arch/x86/kernel/umip.c | 34 +- > 1 file changed, 25 insertions(+), 9 deletions(-) > > diff --git a/arch/x86/kernel/umip.c b/arch/x86/kernel/umip.c > index 8d5cbe1bbb3b..a85f0b0ec2b9 100644 > --- a/arch/x86/kernel/umip.c > +++ b/arch/x86/kernel/umip.c > @@ -64,6 +64,8 @@ > #define UMIP_DUMMY_GDT_BASE 0xfffeULL > #define UMIP_DUMMY_IDT_BASE 0xULL > > +#define UMIP_DUMMY_TASK_REGISTER_SELECTOR 0x40 One more thing. How was this value selected? Would it be possible to use GDT_ENTRY_TSS*8? Linux already uses this value. Thanks and BR, Ricardo
Re: [PATCH] x86/umip: Add emulation/spoofing for SLDT and STR instructions
On Fri, Jun 05, 2020 at 11:58:13AM -0700, Brendan Shanks wrote: > > > On Jun 3, 2020, at 9:39 PM, Andy Lutomirski wrote: > > > > On Wed, Jun 3, 2020 at 5:12 PM Ricardo Neri > > > <mailto:ricardo.neri-calde...@linux.intel.com>> wrote: > >> > >> On Tue, Jun 02, 2020 at 11:42:12AM -0700, Brendan Shanks wrote: > >>> Add emulation/spoofing of SLDT and STR for both 32- and 64-bit > >>> processes. > >>> > >>> Wine users have found a small number of Windows apps using SLDT that > >>> were crashing when run on UMIP-enabled systems. > >>> > >>> Reported-by: Andreas Rammhold > >>> Originally-by: Ricardo Neri > >>> Signed-off-by: Brendan Shanks > >>> --- > >>> arch/x86/kernel/umip.c | 23 ++- > >>> 1 file changed, 14 insertions(+), 9 deletions(-) > >>> > >>> diff --git a/arch/x86/kernel/umip.c b/arch/x86/kernel/umip.c > >>> index 8d5cbe1bbb3b..59dfceac5cc0 100644 > >>> --- a/arch/x86/kernel/umip.c > >>> +++ b/arch/x86/kernel/umip.c > >>> @@ -64,6 +64,8 @@ > >>> #define UMIP_DUMMY_GDT_BASE 0xfffeULL > >>> #define UMIP_DUMMY_IDT_BASE 0xULL > >>> > >>> +#define UMIP_DUMMY_TASK_REGISTER_SELECTOR 0x40 > >>> + > >>> /* > >>> * The SGDT and SIDT instructions store the contents of the global > >>> descriptor > >>> * table and interrupt table registers, respectively. 
The destination is a > >>> @@ -244,16 +246,24 @@ static int emulate_umip_insn(struct insn *insn, int > >>> umip_inst, > >>> *data_size += UMIP_GDT_IDT_LIMIT_SIZE; > >>> memcpy(data, _limit, UMIP_GDT_IDT_LIMIT_SIZE); > >>> > >>> - } else if (umip_inst == UMIP_INST_SMSW) { > >>> - unsigned long dummy_value = CR0_STATE; > >>> + } else if (umip_inst == UMIP_INST_SMSW || umip_inst == > >>> UMIP_INST_SLDT || > >>> +umip_inst == UMIP_INST_STR) { > >>> + unsigned long dummy_value; > >>> + > >>> + if (umip_inst == UMIP_INST_SMSW) > >>> + dummy_value = CR0_STATE; > >>> + else if (umip_inst == UMIP_INST_STR) > >>> + dummy_value = UMIP_DUMMY_TASK_REGISTER_SELECTOR; > >>> + else > >>> + dummy_value = 0; > >> > >> Perhaps you can return a non-zero value for SLDT if it has an LDT, as > >> Andy had suggested. Maybe this can be implemented by looking at > >> current->mm->context.ldt > >> > >> I guess the non-zero value can be (GDT_ENTRY_LDT*8). > > > > You could probably even get away with always returning a nonzero > > value. After all, an empty LDT is quite similar to no LDT. > > > Is something like this what you both had in mind? > I don’t have any software handy to test the LDT-present case though. Perhaps you can insert a test in the kernel selftest. 
Something like this (based on Andreas' test program): --- a/tools/testing/selftests/x86/ldt_gdt.c +++ b/tools/testing/selftests/x86/ldt_gdt.c @@ -220,12 +220,23 @@ static void install_invalid(const struct user_desc *desc, bool oldmode) } } +unsigned long test(void) +{ +unsigned char ldtr[5] = "\xef\xbe\xad\xde"; +unsigned long ldt = 0; +asm("sldt %0\n" : "=m" (ldtr)); +ldt = *((unsigned long *)&ldtr[0]); +printf ("LDT base: 0x%lx\n", ldt); +return (ldt); +} + static int safe_modify_ldt(int func, struct user_desc *ptr, unsigned long bytecount) { int ret = syscall(SYS_modify_ldt, 0x11, ptr, bytecount); if (ret < -1) errno = -ret; + test(); return ret; } Thanks and BR, Ricardo > > > else if (umip_inst == UMIP_INST_STR) > dummy_value = UMIP_DUMMY_TASK_REGISTER_SELECTOR; > else if (umip_inst == UMIP_INST_SLDT) > { > #ifdef CONFIG_MODIFY_LDT_SYSCALL > down_read(&current->mm->context.ldt_usr_sem); > if (current->mm->context.ldt) > dummy_value = GDT_ENTRY_LDT * 8; > else > dummy_value = 0; > up_read(&current->mm->context.ldt_usr_sem); > #else > dummy_value = 0; > #endif > It looks fine to me. Perhaps Andy prefers a simpler, always-non-zero implementation? Thanks and BR, Ricardo
Re: [PATCH] x86/umip: Add emulation/spoofing for SLDT and STR instructions
On Tue, Jun 02, 2020 at 11:42:12AM -0700, Brendan Shanks wrote: > Add emulation/spoofing of SLDT and STR for both 32- and 64-bit > processes. > > Wine users have found a small number of Windows apps using SLDT that > were crashing when run on UMIP-enabled systems. > > Reported-by: Andreas Rammhold > Originally-by: Ricardo Neri > Signed-off-by: Brendan Shanks > --- > arch/x86/kernel/umip.c | 23 ++- > 1 file changed, 14 insertions(+), 9 deletions(-) > > diff --git a/arch/x86/kernel/umip.c b/arch/x86/kernel/umip.c > index 8d5cbe1bbb3b..59dfceac5cc0 100644 > --- a/arch/x86/kernel/umip.c > +++ b/arch/x86/kernel/umip.c > @@ -64,6 +64,8 @@ > #define UMIP_DUMMY_GDT_BASE 0xfffeULL > #define UMIP_DUMMY_IDT_BASE 0xULL > > +#define UMIP_DUMMY_TASK_REGISTER_SELECTOR 0x40 > + > /* > * The SGDT and SIDT instructions store the contents of the global descriptor > * table and interrupt table registers, respectively. The destination is a > @@ -244,16 +246,24 @@ static int emulate_umip_insn(struct insn *insn, int > umip_inst, > *data_size += UMIP_GDT_IDT_LIMIT_SIZE; > memcpy(data, _limit, UMIP_GDT_IDT_LIMIT_SIZE); > > - } else if (umip_inst == UMIP_INST_SMSW) { > - unsigned long dummy_value = CR0_STATE; > + } else if (umip_inst == UMIP_INST_SMSW || umip_inst == UMIP_INST_SLDT || > +umip_inst == UMIP_INST_STR) { > + unsigned long dummy_value; > + > + if (umip_inst == UMIP_INST_SMSW) > + dummy_value = CR0_STATE; > + else if (umip_inst == UMIP_INST_STR) > + dummy_value = UMIP_DUMMY_TASK_REGISTER_SELECTOR; > + else > + dummy_value = 0; Perhaps you can return a non-zero value for SLDT if it has an LDT, as Andy had suggested. Maybe this can be implemented by looking at current->mm->context.ldt I guess the non-zero value can be (GDT_ENTRY_LDT*8). Thanks and BR, Ricardo
Re: [PATCH v2 2/3] x86, sched: Bail out of frequency invariance if turbo frequency is unknown
On Sun, May 31, 2020 at 08:24:52PM +0200, Giovanni Gherdovich wrote: > There may be CPUs that support turbo boost but don't declare any turbo > ratio, i.e. their MSR_TURBO_RATIO_LIMIT is all zeroes. In that condition > scale-invariant calculations can't be performed. > > Signed-off-by: Giovanni Gherdovich > Suggested-by: Ricardo Neri FWIW, Tested-by: Ricardo Neri
Re: umip: AMD Ryzen 3900X, pagefault after emulate SLDT/SIDT instruction
On Sat, May 23, 2020 at 04:17:39AM +0200, Andreas Rammhold wrote: > On 12:43 19.05.20, Ricardo Neri wrote: > > I have a patch for this already that I wrote for testing purposes: > > https://github.com/ricardon/tip/commit/1692889cb3f8accb523d44b682458e234b93be50 > > Perhaps it can be used as a starting point? Not sure what the spoofing > > value should be, though. Perhaps 0? > > I tried the above patch (in modified/rebased version; hope that didn't > kill it [0]). The results are negative, as without the patch. Ah. My patch above is based on a rather old kernel. There is a check in fixup_umip_exception() for SLDT and STR. I think this causes the exception you see. Perhaps you can try by removing such check: diff --git a/arch/x86/kernel/umip.c b/arch/x86/kernel/umip.c @@ -383,10 +389,6 @@ bool fixup_umip_exception(struct pt_regs *regs) umip_pr_warn(regs, "%s instruction cannot be used by applications.\n", umip_insns[umip_inst]); - /* Do not emulate (spoof) SLDT or STR. */ - if (umip_inst == UMIP_INST_STR || umip_inst == UMIP_INST_SLDT) - return false; - umip_pr_warn(regs, "For now, expensive software emulation returns the result.\n"); if (emulate_umip_insn(&insn, umip_inst, dummy_data, &dummy_data_size, You would still need my old patch. Thanks and BR, Ricardo
Re: umip: AMD Ryzen 3900X, pagefault after emulate SLDT/SIDT instruction
On Tue, May 19, 2020 at 05:54:53PM -0700, Andy Lutomirski wrote: > On Tue, May 19, 2020 at 12:43 PM Ricardo Neri > wrote: > > > > On Tue, May 19, 2020 at 11:56:40AM -0700, Brendan Shanks wrote: > > > > > > > On May 19, 2020, at 7:38 AM, Andreas Rammhold > > > > wrote: > > > > > > > > Hi, > > > > > > > > I've been running into a weird problem with UMIP on a current Ryzen > > > > 3900x with kernel 5.6.11 where a process receives a page fault after the > > > > kernel handled the SLDT (or SIDT) instruction (emulation). > > > > > > > > The program I am running is run through WINE in 32bit mode and tries to > > > > figure out if it is running in a VMWare machine by comparing the results > > > > of SLDT against well known constants (basically as shown in the > > > > [example] linked below). > > > > > > > > In dmesg I see the following log lines: > > > >> [99970.004756] umip: Program.exe[3080] ip:4373fb sp:32f3e0: SIDT > > > >> instruction cannot be used by applications. > > > >> [99970.004757] umip: Program.exe[3080] ip:4373fb sp:32f3e0: For now, > > > >> expensive software emulation returns the result. > > > >> [99970.004758] umip: Program.exe[3080] ip:437415 sp:32f3e0: SLDT > > > >> instruction cannot be used by applications. > > > > > > > > Following that the process terminates with a page fault: > > > >> Unhandled exception: page fault on read access to 0x in 32-bit > > > >> code (0x00437415). > > > > > > > > Assembly at that address: > > > >> 0x00437415: sldt0xffe8(%ebp) > > > > > > > > Running the same executable on the exact same kernel (and userland) but > > > > on a Intel i7-8565U doesn't crash at this point. I am guessing the > > > > emulation is supposed to do something different on AMD CPUs? > > > > I am surprised you don't see it on the Intel processor. Maybe it does > > not have UMIP. Do you see umip when you do > > > > $ grep umip /proc/cpuinfo > > > > ? > > > > > > > > On the Ryzen the code executes successfully after setting > > > > CONFIG_X86_UMIP=n. 
> > > > > > Hi Andreas, > > > > > > The problem is that the kernel does not emulate/spoof the SLDT > > > instruction, only SGDT, SIDT, and SMSW. > > > SLDT and STR weren't thought to be commonly used, so emulation/spoofing > > > wasn’t added. > > > In the last few months I have seen reports of one or two (32-bit) Windows > > > games that use SLDT though. > > > Can you share more information about the application you’re running? > > > > > > Maybe the best path is to add kernel emulation/spoofing for SLDT and STR > > > on 32 and 64-bit, just to cover all the cases. It should be a pretty > > > simple patch, I’ll start working on it. > > > > I have a patch for this already that I wrote for testing purposes: > > > > https://github.com/ricardon/tip/commit/1692889cb3f8accb523d44b682458e234b93be50 > > > > Perhaps it can be used as a starting point? Not sure what the spoofing > > value should be, though. Perhaps 0? > > Possibly SLDT should return nonzero if there's an LDT. I guess the value should be in the same hole of the x86_64 memory map, right? Currently sgdt and sidt return 0xfffe and 0x, respectively. Thanks and BR, Ricardo
Re: umip: AMD Ryzen 3900X, pagefault after emulate SLDT/SIDT instruction
On Tue, May 19, 2020 at 11:56:40AM -0700, Brendan Shanks wrote: > > > On May 19, 2020, at 7:38 AM, Andreas Rammhold wrote: > > > > Hi, > > > > I've been running into a weird problem with UMIP on a current Ryzen > > 3900x with kernel 5.6.11 where a process receives a page fault after the > > kernel handled the SLDT (or SIDT) instruction (emulation). > > > > The program I am running is run through WINE in 32bit mode and tries to > > figure out if it is running in a VMWare machine by comparing the results > > of SLDT against well known constants (basically as shown in the > > [example] linked below). > > > > In dmesg I see the following log lines: > >> [99970.004756] umip: Program.exe[3080] ip:4373fb sp:32f3e0: SIDT > >> instruction cannot be used by applications. > >> [99970.004757] umip: Program.exe[3080] ip:4373fb sp:32f3e0: For now, > >> expensive software emulation returns the result. > >> [99970.004758] umip: Program.exe[3080] ip:437415 sp:32f3e0: SLDT > >> instruction cannot be used by applications. > > > > Following that the process terminates with a page fault: > >> Unhandled exception: page fault on read access to 0x in 32-bit > >> code (0x00437415). > > > > Assembly at that address: > >> 0x00437415: sldt0xffe8(%ebp) > > > > Running the same executable on the exact same kernel (and userland) but > > on a Intel i7-8565U doesn't crash at this point. I am guessing the > > emulation is supposed to do something different on AMD CPUs? I am surprised you don't see it on the Intel processor. Maybe it does not have UMIP. Do you see umip when you do $ grep umip /proc/cpuinfo ? > > > > On the Ryzen the code executes successfully after setting CONFIG_X86_UMIP=n. > > Hi Andreas, > > The problem is that the kernel does not emulate/spoof the SLDT instruction, > only SGDT, SIDT, and SMSW. > SLDT and STR weren't thought to be commonly used, so emulation/spoofing > wasn’t added. 
> In the last few months I have seen reports of one or two (32-bit) Windows > games that use SLDT though. > Can you share more information about the application you’re running? > > Maybe the best path is to add kernel emulation/spoofing for SLDT and STR on > 32 and 64-bit, just to cover all the cases. It should be a pretty simple > patch, I’ll start working on it. I have a patch for this already that I wrote for testing purposes: https://github.com/ricardon/tip/commit/1692889cb3f8accb523d44b682458e234b93be50 Perhaps it can be used as a starting point? Not sure what the spoofing value should be, though. Perhaps 0? Thanks and BR, Ricardo
Re: [PATCH 1/2] x86, sched: Prevent divisions by zero in frequency invariant accounting
On Sat, May 02, 2020 at 04:25:00PM +0200, Giovanni Gherdovich wrote: > > > > I've changed the patch like so.. OK? > > > > (ok, perhaps I went a little overboard with the paranoia ;-) > > Right, I wasn't really checking for overflow, only for when the product > "mcnt * arch_max_freq_ratio" becomes zero. > > Thanks for your edit (I took note of the macros check_*_overflow, didn't know > them). I fully subscribe to the paranoid approach. > > I understand you've already edited the patches in your tree, so I am not > resending, just confirming my > > Signed-off-by: Giovanni Gherdovich Hi, have these changes been merged? I still don't see them in the tip or Linus' tree. Thanks and BR, Ricardo
Re: [PATCH 2/2] x86, sched: Bail out of frequency invariance if turbo frequency is unknown
On Fri, May 01, 2020 at 03:04:27PM +0200, Peter Zijlstra wrote: > On Tue, Apr 28, 2020 at 03:24:50PM +0200, Giovanni Gherdovich wrote: > > There may be CPUs that support turbo boost but don't declare any turbo > > ratio, i.e. their MSR_TURBO_RATIO_LIMIT is all zeroes. In that condition > > scale-invariant calculations can't be performed. > > > > Signed-off-by: Giovanni Gherdovich > > Suggested-by: Ricardo Neri > > Fixes: 1567c3e3467c ("x86, sched: Add support for frequency invariance") > > --- > > arch/x86/kernel/smpboot.c | 6 -- > > 1 file changed, 4 insertions(+), 2 deletions(-) > > > > diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c > > index 4718f29a3065..ab2a0df7d1fb 100644 > > --- a/arch/x86/kernel/smpboot.c > > +++ b/arch/x86/kernel/smpboot.c > > @@ -1991,9 +1991,11 @@ static bool intel_set_max_freq_ratio(void) > > /* > > * Some hypervisors advertise X86_FEATURE_APERFMPERF > > * but then fill all MSR's with zeroes. > > +* Some CPUs have turbo boost but don't declare any turbo ratio > > +* in MSR_TURBO_RATIO_LIMIT. > > */ > > - if (!base_freq) { > > - pr_debug("Couldn't determine cpu base frequency, necessary for > > scale-invariant accounting.\n"); > > + if (!base_freq || !turbo_freq) { > > + pr_debug("Couldn't determine cpu base or turbo frequency, > > necessary for scale-invariant accounting.\n"); > > return false; > > } > > I've added the below, imagine base_freq > turbo_freq * > SCHED_CAPACITY_SCALE. 
> > --- a/arch/x86/kernel/smpboot.c > +++ b/arch/x86/kernel/smpboot.c > @@ -1975,6 +1975,7 @@ static bool core_set_max_freq_ratio(u64 > static bool intel_set_max_freq_ratio(void) > { > u64 base_freq, turbo_freq; > + u64 turbo_ratio; > > if (slv_set_max_freq_ratio(&base_freq, &turbo_freq)) > goto out; > @@ -2008,9 +2009,15 @@ static bool intel_set_max_freq_ratio(voi > return false; > } > > - arch_turbo_freq_ratio = div_u64(turbo_freq * SCHED_CAPACITY_SCALE, > - base_freq); > + turbo_ratio = div_u64(turbo_freq * SCHED_CAPACITY_SCALE, base_freq); > + if (!turbo_ratio) { > + pr_debug("Non-zero turbo and base frequencies led to a 0 > ratio.\n"); > + return false; > + } > + > + arch_turbo_freq_ratio = turbo_ratio; I guess this covers more cases in which turbo_ratio can be zero. Also, FWIW Tested-by: Ricardo Neri Thanks and BR, Ricardo
Re: [PATCH 2/2] x86, sched: Bail out of frequency invariance if turbo frequency is unknown
On Tue, Apr 28, 2020 at 03:24:50PM +0200, Giovanni Gherdovich wrote: > There may be CPUs that support turbo boost but don't declare any turbo > ratio, i.e. their MSR_TURBO_RATIO_LIMIT is all zeroes. In that condition > scale-invariant calculations can't be performed. > > Signed-off-by: Giovanni Gherdovich > Suggested-by: Ricardo Neri Thanks for implementing this, Giovanni! Tested-by: Ricardo Neri
Re: [PATCH] x86/umip: Add emulation for 64-bit processes
On Thu, Sep 05, 2019 at 04:22:21PM -0700, Brendan Shanks wrote: > Add emulation of the sgdt, sidt, and smsw instructions for 64-bit > processes. > > Wine users have encountered a number of 64-bit Windows games that use > these instructions (particularly sgdt), and were crashing when run on > UMIP-enabled systems. Emulation support for 64-bit processes was not initially included because no use cases had been identified. Brendan has found one. Here is the relevant e-mail thread: https://lkml.org/lkml/2017/1/26/12 FWIW, Reviewed-by: Ricardo Neri Only one minor comment below... > > Originally-by: Ricardo Neri > Signed-off-by: Brendan Shanks > --- > arch/x86/kernel/umip.c | 55 +- > 1 file changed, 33 insertions(+), 22 deletions(-) > > diff --git a/arch/x86/kernel/umip.c b/arch/x86/kernel/umip.c > index 5b345add550f..1812e95d2f55 100644 > --- a/arch/x86/kernel/umip.c > +++ b/arch/x86/kernel/umip.c > @@ -51,9 +51,7 @@ > * The instruction smsw is emulated to return the value that the register CR0 > * has at boot time as set in the head_32. > * > - * Also, emulation is provided only for 32-bit processes; 64-bit processes > - * that attempt to use the instructions that UMIP protects will receive the > - * SIGSEGV signal issued as a consequence of the general protection fault. > + * Emulation is provided for both 32-bit and 64-bit processes. > * > * Care is taken to appropriately emulate the results when segmentation is > * used. That is, rather than relying on USER_DS and USER_CS, the function > @@ -63,17 +61,18 @@ > * application uses a local descriptor table. > */ > > -#define UMIP_DUMMY_GDT_BASE 0xfffe > -#define UMIP_DUMMY_IDT_BASE 0x > +#define UMIP_DUMMY_GDT_BASE 0xfffeULL > +#define UMIP_DUMMY_IDT_BASE 0xULL > > /* > * The SGDT and SIDT instructions store the contents of the global descriptor > * table and interrupt table registers, respectively. The destination is a > * memory operand of X+2 bytes. 
X bytes are used to store the base address of > - * the table and 2 bytes are used to store the limit. In 32-bit processes, > the > - * only processes for which emulation is provided, X has a value of 4. > + * the table and 2 bytes are used to store the limit. In 32-bit processes X > + * has a value of 4, in 64-bit processes X has a value of 8. > */ > -#define UMIP_GDT_IDT_BASE_SIZE 4 > +#define UMIP_GDT_IDT_BASE_SIZE_64BIT 8 > +#define UMIP_GDT_IDT_BASE_SIZE_32BIT 4 > #define UMIP_GDT_IDT_LIMIT_SIZE 2 > > #define UMIP_INST_SGDT 0 /* 0F 01 /0 */ > @@ -189,6 +188,7 @@ static int identify_insn(struct insn *insn) > * @umip_inst: A constant indicating the instruction to emulate > * @data:Buffer into which the dummy result is stored > * @data_size: Size of the emulated result > + * @x86_64: true if process is 64-bit, false otherwise > * > * Emulate an instruction protected by UMIP and provide a dummy result. The > * result of the emulation is saved in @data. The size of the results depends > @@ -202,11 +202,8 @@ static int identify_insn(struct insn *insn) > * 0 on success, -EINVAL on error while emulating. > */ > static int emulate_umip_insn(struct insn *insn, int umip_inst, > - unsigned char *data, int *data_size) > + unsigned char *data, int *data_size, bool x86_64) > { > - unsigned long dummy_base_addr, dummy_value; > - unsigned short dummy_limit = 0; > - > if (!data || !data_size || !insn) > return -EINVAL; > /* > @@ -219,6 +216,9 @@ static int emulate_umip_insn(struct insn *insn, int > umip_inst, >* is always returned irrespective of the operand size. >*/ > if (umip_inst == UMIP_INST_SGDT || umip_inst == UMIP_INST_SIDT) { > + u64 dummy_base_addr; > + u16 dummy_limit = 0; > + > /* SGDT and SIDT do not use registers operands. 
*/ > if (X86_MODRM_MOD(insn->modrm.value) == 3) > return -EINVAL; > @@ -228,13 +228,24 @@ static int emulate_umip_insn(struct insn *insn, int > umip_inst, > else > dummy_base_addr = UMIP_DUMMY_IDT_BASE; > > - *data_size = UMIP_GDT_IDT_LIMIT_SIZE + UMIP_GDT_IDT_BASE_SIZE; Maybe a blank line here? Thanks and BR, Ricardo
Re: [Kernel BUG?] SMSW operation get success on UMIP KVM guest
On Mon, Jul 01, 2019 at 08:57:28PM +0800, Li Wang wrote: > On Mon, Jul 1, 2019 at 8:02 PM Paolo Bonzini wrote: > > > On 01/07/19 09:50, Li Wang wrote: > > > Hello there, > > > > > > LTP/umip_basic_test get failed on KVM UMIP > > > system(kernel-v5.2-rc4.x86_64). The test is only trying to do > > > asm volatile("smsw %0\n" : "=m" (val)); > > > and expect to get SIGSEGV in this SMSW operation, but it exits with 0 > > > unexpectedly. > > > > In addition to what Thomas said, perhaps you are using a host that does > > *not* have UMIP, and configuring KVM to emulate it(*). In that case, it > > is not possible to intercept SMSW, and therefore it will incorrectly > > succeed. > > > > Right, I checked the host system, and confirmed that CPU doesn't support > UMIP. > > > > > Paolo > > > > (*) before the x86 people jump at me, this won't happen unless you > > explicitly pass an option to QEMU, such as "-cpu host,+umip". :) The > > incorrect emulation of SMSW when CR4.UMIP=1 is why. > > > Good to know this, is there any document for that declaration? It seems > neither LTP issue nor kernel bug here. But anyway we'd better do something > to avoid the error in the test. The test case already checks for umip in /proc/cpuinfo, right? And in long mode it always expects a SIGSEGV signal. If you did not add -cpu host,+umip, how come umip was present in /proc/cpuinfo? Thanks and BR, Ricardo
Re: [Kernel BUG?] SMSW operation get success on UMIP KVM guest
On Mon, Jul 01, 2019 at 02:02:35PM +0200, Paolo Bonzini wrote: > On 01/07/19 09:50, Li Wang wrote: > > Hello there, > > > > LTP/umip_basic_test get failed on KVM UMIP > > system(kernel-v5.2-rc4.x86_64). The test is only trying to do > > asm volatile("smsw %0\n" : "=m" (val)); > > and expect to get SIGSEGV in this SMSW operation, but it exits with 0 > > unexpectedly. > > In addition to what Thomas said, perhaps you are using a host that does > *not* have UMIP, and configuring KVM to emulate it(*). In that case, it > is not possible to intercept SMSW, and therefore it will incorrectly > succeed. Also, emulation for SMSW, SIDT, and SGDT is done only for 32-bit processes. As Thomas said, the purpose is not to break Wine. In 64-bit processes, we should always see a #GP exception. > > Paolo > > (*) before the x86 people jump at me, this won't happen unless you > explicitly pass an option to QEMU, such as "-cpu host,+umip". :) The > incorrect emulation of SMSW when CR4.UMIP=1 is why. Paolo, what do you mean by the incorrect emulation of SMSW? Thanks and BR, Ricardo
[tip:x86/cpu] x86/mtrr: Skip cache flushes on CPUs with cache self-snooping
Commit-ID: fd329f276ecaad7a371d6f91b9bbea031d0c3440 Gitweb: https://git.kernel.org/tip/fd329f276ecaad7a371d6f91b9bbea031d0c3440 Author: Ricardo Neri AuthorDate: Thu, 27 Jun 2019 19:35:37 -0700 Committer: Thomas Gleixner CommitDate: Fri, 28 Jun 2019 07:21:00 +0200 x86/mtrr: Skip cache flushes on CPUs with cache self-snooping Programming MTRR registers in multi-processor systems is a rather lengthy process. Furthermore, all processors must program these registers in lock step and with interrupts disabled; the process also involves flushing caches and TLBs twice. As a result, the process may take a considerable amount of time. On some platforms, this can lead to a large skew of the refined-jiffies clock source. Early when booting, if no other clock is available (e.g., booting with hpet=disabled), the refined-jiffies clock source is used to monitor the TSC clock source. If the skew of refined-jiffies is too large, Linux wrongly assumes that the TSC is unstable: clocksource: timekeeping watchdog on CPU1: Marking clocksource 'tsc-early' as unstable because the skew is too large: clocksource: 'refined-jiffies' wd_now: fffedc10 wd_last: fffedb90 mask: clocksource: 'tsc-early' cs_now: 5eccfddebc cs_last: 5e7e3303d4 mask: tsc: Marking TSC unstable due to clocksource watchdog As per measurements, around 98% of the time needed by the procedure to program MTRRs in multi-processor systems is spent flushing caches with wbinvd(). As per the Section 11.11.8 of the Intel 64 and IA 32 Architectures Software Developer's Manual, it is not necessary to flush caches if the CPU supports cache self-snooping. Thus, skipping the cache flushes can reduce by several tens of milliseconds the time needed to complete the programming of the MTRR registers: PlatformBefore After 104-core (208 Threads) Skylake 1437ms 28ms 2-core ( 4 Threads) Haswell 114ms 2ms Reported-by: Mohammad Etemadi Signed-off-by: Ricardo Neri Signed-off-by: Thomas Gleixner Cc: Borislav Petkov Cc: Alan Cox Cc: Tony Luck Cc: "H. 
Peter Anvin" Cc: Andy Shevchenko Cc: Andi Kleen Cc: Hans de Goede Cc: Greg Kroah-Hartman Cc: Jordan Borgner Cc: "Ravi V. Shankar" Cc: Ricardo Neri Cc: Andy Shevchenko Cc: Andi Kleen Cc: Peter Feiner Cc: "Rafael J. Wysocki" Link: https://lkml.kernel.org/r/1561689337-19390-3-git-send-email-ricardo.neri-calde...@linux.intel.com --- arch/x86/kernel/cpu/mtrr/generic.c | 15 +-- 1 file changed, 13 insertions(+), 2 deletions(-) diff --git a/arch/x86/kernel/cpu/mtrr/generic.c b/arch/x86/kernel/cpu/mtrr/generic.c index 9356c1c9024d..aa5c064a6a22 100644 --- a/arch/x86/kernel/cpu/mtrr/generic.c +++ b/arch/x86/kernel/cpu/mtrr/generic.c @@ -743,7 +743,15 @@ static void prepare_set(void) __acquires(set_atomicity_lock) /* Enter the no-fill (CD=1, NW=0) cache mode and flush caches. */ cr0 = read_cr0() | X86_CR0_CD; write_cr0(cr0); - wbinvd(); + + /* +* Cache flushing is the most time-consuming step when programming +* the MTRRs. Fortunately, as per the Intel Software Development +* Manual, we can skip it if the processor supports cache self- +* snooping. +*/ + if (!static_cpu_has(X86_FEATURE_SELFSNOOP)) + wbinvd(); /* Save value of CR4 and clear Page Global Enable (bit 7) */ if (boot_cpu_has(X86_FEATURE_PGE)) { @@ -760,7 +768,10 @@ static void prepare_set(void) __acquires(set_atomicity_lock) /* Disable MTRRs, and set the default type to uncached */ mtrr_wrmsr(MSR_MTRRdefType, deftype_lo & ~0xcff, deftype_hi); - wbinvd(); + + /* Again, only flush caches if we have to. */ + if (!static_cpu_has(X86_FEATURE_SELFSNOOP)) + wbinvd(); } static void post_set(void) __releases(set_atomicity_lock)
[tip:x86/cpu] x86/cpu/intel: Clear cache self-snoop capability in CPUs with known errata
Commit-ID:  1e03bff3600101bd9158d005e4313132e55bdec8
Gitweb:     https://git.kernel.org/tip/1e03bff3600101bd9158d005e4313132e55bdec8
Author:     Ricardo Neri
AuthorDate: Thu, 27 Jun 2019 19:35:36 -0700
Committer:  Thomas Gleixner
CommitDate: Fri, 28 Jun 2019 07:20:48 +0200

x86/cpu/intel: Clear cache self-snoop capability in CPUs with known errata

Processors which have self-snooping capability can handle conflicting memory types across CPUs by snooping their own cache. However, there exist CPU models in which having conflicting memory types still leads to unpredictable behavior, machine check errors, or hangs. Clear this feature on affected CPUs to prevent its use.

Suggested-by: Alan Cox Signed-off-by: Ricardo Neri Signed-off-by: Thomas Gleixner Cc: Borislav Petkov Cc: Tony Luck Cc: "H. Peter Anvin" Cc: Andy Shevchenko Cc: Andi Kleen Cc: Hans de Goede Cc: Greg Kroah-Hartman Cc: Jordan Borgner Cc: "Ravi V. Shankar" Cc: Mohammad Etemadi Cc: Ricardo Neri Cc: Peter Feiner Cc: "Rafael J. Wysocki" Link: https://lkml.kernel.org/r/1561689337-19390-2-git-send-email-ricardo.neri-calde...@linux.intel.com

---
 arch/x86/kernel/cpu/intel.c | 27 +++++++++++++++++++++++++++
 1 file changed, 27 insertions(+)

diff --git a/arch/x86/kernel/cpu/intel.c b/arch/x86/kernel/cpu/intel.c
index f17c1a714779..8d6d92ebeb54 100644
--- a/arch/x86/kernel/cpu/intel.c
+++ b/arch/x86/kernel/cpu/intel.c
@@ -66,6 +66,32 @@ void check_mpx_erratum(struct cpuinfo_x86 *c)
 	}
 }
 
+/*
+ * Processors which have self-snooping capability can handle conflicting
+ * memory type across CPUs by snooping its own cache. However, there exists
+ * CPU models in which having conflicting memory types still leads to
+ * unpredictable behavior, machine check errors, or hangs. Clear this
+ * feature to prevent its use on machines with known errata.
+ */
+static void check_memory_type_self_snoop_errata(struct cpuinfo_x86 *c)
+{
+	switch (c->x86_model) {
+	case INTEL_FAM6_CORE_YONAH:
+	case INTEL_FAM6_CORE2_MEROM:
+	case INTEL_FAM6_CORE2_MEROM_L:
+	case INTEL_FAM6_CORE2_PENRYN:
+	case INTEL_FAM6_CORE2_DUNNINGTON:
+	case INTEL_FAM6_NEHALEM:
+	case INTEL_FAM6_NEHALEM_G:
+	case INTEL_FAM6_NEHALEM_EP:
+	case INTEL_FAM6_NEHALEM_EX:
+	case INTEL_FAM6_WESTMERE:
+	case INTEL_FAM6_WESTMERE_EP:
+	case INTEL_FAM6_SANDYBRIDGE:
+		setup_clear_cpu_cap(X86_FEATURE_SELFSNOOP);
+	}
+}
+
 static bool ring3mwait_disabled __read_mostly;
 
 static int __init ring3mwait_disable(char *__unused)
@@ -304,6 +330,7 @@ static void early_init_intel(struct cpuinfo_x86 *c)
 	}
 
 	check_mpx_erratum(c);
+	check_memory_type_self_snoop_errata(c);
 
 	/*
 	 * Get the number of SMT siblings early from the extended topology
[PATCH v2 1/2] x86/cpu/intel: Clear cache self-snoop capability in CPUs with known errata
Processors which have self-snooping capability can handle conflicting memory types across CPUs by snooping their own cache. However, there exist CPU models in which having conflicting memory types still leads to unpredictable behavior, machine check errors, or hangs. Clear this feature to prevent its use.

Cc: Tony Luck Cc: "H. Peter Anvin" Cc: Andy Shevchenko Cc: Andi Kleen Cc: Hans de Goede Cc: Peter Feiner Cc: "Rafael J. Wysocki" Cc: Greg Kroah-Hartman Cc: Jordan Borgner Cc: "Ravi V. Shankar" Cc: x...@kernel.org Cc: linux-kernel@vger.kernel.org Suggested-by: Alan Cox Signed-off-by: Ricardo Neri

---
 arch/x86/kernel/cpu/intel.c | 27 +++++++++++++++++++++++++++
 1 file changed, 27 insertions(+)

diff --git a/arch/x86/kernel/cpu/intel.c b/arch/x86/kernel/cpu/intel.c
index f17c1a714779..62e366ec0812 100644
--- a/arch/x86/kernel/cpu/intel.c
+++ b/arch/x86/kernel/cpu/intel.c
@@ -66,6 +66,32 @@ void check_mpx_erratum(struct cpuinfo_x86 *c)
 	}
 }
 
+/*
+ * Processors which have self-snooping capability can handle conflicting
+ * memory type across CPUs by snooping its own cache. However, there exists
+ * CPU models in which having conflicting memory types still leads to
+ * unpredictable behavior, machine check errors, or hangs. Clear this feature
+ * to prevent its use.
+ */
+static void check_memory_type_self_snoop_errata(struct cpuinfo_x86 *c)
+{
+	switch (c->x86_model) {
+	case INTEL_FAM6_CORE_YONAH:
+	case INTEL_FAM6_CORE2_MEROM:
+	case INTEL_FAM6_CORE2_MEROM_L:
+	case INTEL_FAM6_CORE2_PENRYN:
+	case INTEL_FAM6_CORE2_DUNNINGTON:
+	case INTEL_FAM6_NEHALEM:
+	case INTEL_FAM6_NEHALEM_G:
+	case INTEL_FAM6_NEHALEM_EP:
+	case INTEL_FAM6_NEHALEM_EX:
+	case INTEL_FAM6_WESTMERE:
+	case INTEL_FAM6_WESTMERE_EP:
+	case INTEL_FAM6_SANDYBRIDGE:
+		setup_clear_cpu_cap(X86_FEATURE_SELFSNOOP);
+	}
+}
+
 static bool ring3mwait_disabled __read_mostly;
 
 static int __init ring3mwait_disable(char *__unused)
@@ -304,6 +330,7 @@ static void early_init_intel(struct cpuinfo_x86 *c)
 	}
 
 	check_mpx_erratum(c);
+	check_memory_type_self_snoop_errata(c);
 
 	/*
 	 * Get the number of SMT siblings early from the extended topology
--
2.17.1
[PATCH v2 2/2] x86, mtrr: generic: Skip cache flushes on CPUs with cache self-snooping
Programming MTRR registers in multi-processor systems is a rather lengthy process. Furthermore, all processors must program these registers in lock step and with interrupts disabled; the process also involves flushing caches and TLBs twice. As a result, the process may take a considerable amount of time.

On some platforms, this can lead to a large skew of the refined-jiffies clock source. Early in the boot process, if no other clock is available (e.g., when booting with hpet=disabled), the refined-jiffies clock source is used to monitor the TSC clock source. If the skew of refined-jiffies is too large, Linux wrongly assumes that the TSC is unstable:

  clocksource: timekeeping watchdog on CPU1: Marking clocksource 'tsc-early' as unstable because the skew is too large:
  clocksource: 'refined-jiffies' wd_now: fffedc10 wd_last: fffedb90 mask:
  clocksource: 'tsc-early' cs_now: 5eccfddebc cs_last: 5e7e3303d4 mask:
  tsc: Marking TSC unstable due to clocksource watchdog

As per my measurements, around 98% of the time needed by the procedure to program MTRRs in multi-processor systems is spent flushing caches with wbinvd(). As per Section 11.11.8 of the Intel 64 and IA-32 Architectures Software Developer's Manual, it is not necessary to flush caches if the CPU supports cache self-snooping. Thus, skipping the cache flushes can reduce the time needed to program the MTRR registers by several tens of milliseconds.

Cc: Alan Cox Cc: Tony Luck Cc: "H. Peter Anvin" Cc: Andy Shevchenko Cc: Andi Kleen Cc: Hans de Goede Cc: Peter Feiner Cc: "Rafael J. Wysocki" Cc: Greg Kroah-Hartman Cc: Jordan Borgner Cc: x...@kernel.org Cc: linux-kernel@vger.kernel.org Cc: "Ravi V. Shankar" Reported-by: Mohammad Etemadi Signed-off-by: Ricardo Neri

---
 arch/x86/kernel/cpu/mtrr/generic.c | 15 +++++++++++++--
 1 file changed, 13 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/cpu/mtrr/generic.c b/arch/x86/kernel/cpu/mtrr/generic.c
index 9356c1c9024d..aa5c064a6a22 100644
--- a/arch/x86/kernel/cpu/mtrr/generic.c
+++ b/arch/x86/kernel/cpu/mtrr/generic.c
@@ -743,7 +743,15 @@ static void prepare_set(void) __acquires(set_atomicity_lock)
 	/* Enter the no-fill (CD=1, NW=0) cache mode and flush caches. */
 	cr0 = read_cr0() | X86_CR0_CD;
 	write_cr0(cr0);
-	wbinvd();
+
+	/*
+	 * Cache flushing is the most time-consuming step when programming
+	 * the MTRRs. Fortunately, as per the Intel Software Development
+	 * Manual, we can skip it if the processor supports cache self-
+	 * snooping.
+	 */
+	if (!static_cpu_has(X86_FEATURE_SELFSNOOP))
+		wbinvd();
 
 	/* Save value of CR4 and clear Page Global Enable (bit 7) */
 	if (boot_cpu_has(X86_FEATURE_PGE)) {
@@ -760,7 +768,10 @@ static void prepare_set(void) __acquires(set_atomicity_lock)
 
 	/* Disable MTRRs, and set the default type to uncached */
 	mtrr_wrmsr(MSR_MTRRdefType, deftype_lo & ~0xcff, deftype_hi);
-	wbinvd();
+
+	/* Again, only flush caches if we have to. */
+	if (!static_cpu_has(X86_FEATURE_SELFSNOOP))
+		wbinvd();
 }
 
 static void post_set(void) __releases(set_atomicity_lock)
--
2.17.1
[PATCH v2 0/2] Speed MTRR programming up when we can
This is the second iteration of this patchset. The first iteration can be viewed here [1].

Programming MTRR registers in multi-processor systems is a rather lengthy process. Furthermore, all processors must program these registers in lock step and with interrupts disabled; the process also involves flushing caches and TLBs twice. As a result, the process may take a considerable amount of time.

On some platforms, this can lead to a large skew of the refined-jiffies clock source. Early in the boot process, if no other clock is available (e.g., when booting with hpet=disabled), the refined-jiffies clock source is used to monitor the TSC clock source. If the skew of refined-jiffies is too large, Linux wrongly assumes that the TSC is unstable:

  clocksource: timekeeping watchdog on CPU1: Marking clocksource 'tsc-early' as unstable because the skew is too large:
  clocksource: 'refined-jiffies' wd_now: fffedc10 wd_last: fffedb90 mask:
  clocksource: 'tsc-early' cs_now: 5eccfddebc cs_last: 5e7e3303d4 mask:
  tsc: Marking TSC unstable due to clocksource watchdog

As per my measurements, around 98% of the time needed by the procedure to program MTRRs in multi-processor systems is spent flushing caches with wbinvd(). As per Section 11.11.8 of the Intel 64 and IA-32 Architectures Software Developer's Manual, it is not necessary to flush caches if the CPU supports cache self-snooping. Thus, skipping the cache flushes can reduce the time needed to program the MTRR registers by several tens of milliseconds.

However, there exist CPU models with errata that affect their self-snooping capabilities. Such errata may cause unpredictable behavior, machine check errors, or hangs. For instance:

  "Where two different logical processors have the same code page mapped with two different memory types. Specifically, if one code page is mapped by one logical processor as write back and by another as uncacheable, and certain instruction timing conditions occur, the system may experience unpredictable behaviour." [2]

Similar errata are reported in other processors as well [3], [4], [5], [6], and [7]. Thus, in order to confidently leverage self-snooping for the MTRR programming algorithm, we must first clear this feature in models with known errata.

By measuring the execution time of mtrr_aps_init() (from which MTRRs in all CPUs are programmed in lock step at boot), I find savings in the time required to program the MTRRs as follows:

  Platform                   time-with-wbinvd(ms)  time-no-wbinvd(ms)
  104-core (208 LP) Skylake  1437                  28
  2-core   (  4 LP) Haswell   114                   2

  LP = Logical Processor

Thanks and BR,
Ricardo

Changes since v1:
 * Relocated the comment on the utility of cache self-snooping from check_memory_type_self_snoop_errata() to the prepare_set() function of the generic MTRR programming ops (Thomas Gleixner).
 * In early_init_intel(), moved check_memory_type_self_snoop_errata() next to check_mpx_erratum() for improved readability.

[1]. https://lkml.org/lkml/2019/6/27/828
[2]. Erratum BF52, https://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/xeon-3600-specification-update.pdf
[3]. Erratum BK47, https://www.mouser.com/pdfdocs/2ndgencorefamilymobilespecificationupdate.pdf
[4]. Erratum AAO54, https://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/xeon-c5500-c3500-spec-update.pdf
[5]. Errata AZ39, AZ42, https://www.intel.com/content/dam/support/us/en/documents/processors/mobile/celeron/sb/320121.pdf
[6]. Errata AQ51, AQ102, AQ104, https://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/pentium-dual-core-desktop-e2000-specification-update.pdf
[7]. Errata AN107, AN109, https://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/pentium-dual-core-specification-update.pdf

Ricardo Neri (2):
  x86/cpu/intel: Clear cache self-snoop capability in CPUs with known errata
  x86, mtrr: generic: Skip cache flushes on CPUs with cache self-snooping

 arch/x86/kernel/cpu/intel.c        | 27 +++++++++++++++++++++++++++
 arch/x86/kernel/cpu/mtrr/generic.c | 15 +++++++++++++--
 2 files changed, 40 insertions(+), 2 deletions(-)

--
2.17.1
Re: [PATCH 1/2] x86/cpu/intel: Clear cache self-snoop capability in CPUs with known errata
On Thu, Jun 27, 2019 at 10:38:13PM +0200, Thomas Gleixner wrote:
> Ricardo,
>
> On Thu, 27 Jun 2019, Ricardo Neri wrote:
> > +/*
> > + * Processors which have self-snooping capability can handle conflicting
> > + * memory type across CPUs by snooping its own cache. However, there exists
> > + * CPU models in which having conflicting memory types still leads to
> > + * unpredictable behavior, machine check errors, or hangs. Clear this feature
> > + * to prevent its use. For instance, the algorithm to program the Memory Type
> > + * Region Registers and the Page Attribute Table MSR can skip expensive cache
> > + * flushes if self-snooping is supported.
>
> I appreciate informative comments, but this is the part which disables a
> feature on errata inflicted CPUs. So the whole information about what
> self-snooping helps with is not that interesting here. It's broken, we
> disable it and be done with it.

Sure, Thomas. I will move the comment about the usefulness of self-snooping
to the MTRR programming function as you mention below.

> > + */
> > +static void check_memory_type_self_snoop_errata(struct cpuinfo_x86 *c)
> > +{
> > +	switch (c->x86_model) {
> > +	case INTEL_FAM6_CORE_YONAH:
> > +	case INTEL_FAM6_CORE2_MEROM:
> > +	case INTEL_FAM6_CORE2_MEROM_L:
> > +	case INTEL_FAM6_CORE2_PENRYN:
> > +	case INTEL_FAM6_CORE2_DUNNINGTON:
> > +	case INTEL_FAM6_NEHALEM:
> > +	case INTEL_FAM6_NEHALEM_G:
> > +	case INTEL_FAM6_NEHALEM_EP:
> > +	case INTEL_FAM6_NEHALEM_EX:
> > +	case INTEL_FAM6_WESTMERE:
> > +	case INTEL_FAM6_WESTMERE_EP:
> > +	case INTEL_FAM6_SANDYBRIDGE:
> > +		setup_clear_cpu_cap(X86_FEATURE_SELFSNOOP);
> > +	}
> > +}
> > +
>
> But looking at the actual interesting part of the 2nd patch:
>
> > @@ -743,7 +743,9 @@ static void prepare_set(void) __acquires(set_atomicity_lock)
> >	/* Enter the no-fill (CD=1, NW=0) cache mode and flush caches. */
> >	cr0 = read_cr0() | X86_CR0_CD;
> >	write_cr0(cr0);
> > -	wbinvd();
> > +
> > +	if (!static_cpu_has(X86_FEATURE_SELFSNOOP))
> > +		wbinvd();
>
> This part lacks any form of explanation. So I'd rather have the comment
> about why we can avoid the wbinvd() here. I'd surely never look at
> that errata handling function to get that information.
>
> Other than that detail, the patches are well done!

Thank you, Thomas!

BR,
Ricardo
[PATCH 1/2] x86/cpu/intel: Clear cache self-snoop capability in CPUs with known errata
Processors which have self-snooping capability can handle conflicting memory types across CPUs by snooping their own cache. However, there exist CPU models in which having conflicting memory types still leads to unpredictable behavior, machine check errors, or hangs. Clear this feature to prevent its use. For instance, the algorithm to program the Memory Type Region Registers and the Page Attribute Table MSR can skip expensive cache flushes if self-snooping is supported.

Cc: Tony Luck Cc: "H. Peter Anvin" Cc: Andy Shevchenko Cc: Andi Kleen Cc: Hans de Goede Cc: Peter Feiner Cc: "Rafael J. Wysocki" Cc: Greg Kroah-Hartman Cc: Jordan Borgner Cc: "Ravi V. Shankar" Cc: x...@kernel.org Cc: linux-kernel@vger.kernel.org Suggested-by: Alan Cox Signed-off-by: Ricardo Neri

---
 arch/x86/kernel/cpu/intel.c | 30 ++++++++++++++++++++++++++++++
 1 file changed, 30 insertions(+)

diff --git a/arch/x86/kernel/cpu/intel.c b/arch/x86/kernel/cpu/intel.c
index f17c1a714779..8684c66e71e0 100644
--- a/arch/x86/kernel/cpu/intel.c
+++ b/arch/x86/kernel/cpu/intel.c
@@ -66,6 +66,34 @@ void check_mpx_erratum(struct cpuinfo_x86 *c)
 	}
 }
 
+/*
+ * Processors which have self-snooping capability can handle conflicting
+ * memory type across CPUs by snooping its own cache. However, there exists
+ * CPU models in which having conflicting memory types still leads to
+ * unpredictable behavior, machine check errors, or hangs. Clear this feature
+ * to prevent its use. For instance, the algorithm to program the Memory Type
+ * Region Registers and the Page Attribute Table MSR can skip expensive cache
+ * flushes if self-snooping is supported.
+ */
+static void check_memory_type_self_snoop_errata(struct cpuinfo_x86 *c)
+{
+	switch (c->x86_model) {
+	case INTEL_FAM6_CORE_YONAH:
+	case INTEL_FAM6_CORE2_MEROM:
+	case INTEL_FAM6_CORE2_MEROM_L:
+	case INTEL_FAM6_CORE2_PENRYN:
+	case INTEL_FAM6_CORE2_DUNNINGTON:
+	case INTEL_FAM6_NEHALEM:
+	case INTEL_FAM6_NEHALEM_G:
+	case INTEL_FAM6_NEHALEM_EP:
+	case INTEL_FAM6_NEHALEM_EX:
+	case INTEL_FAM6_WESTMERE:
+	case INTEL_FAM6_WESTMERE_EP:
+	case INTEL_FAM6_SANDYBRIDGE:
+		setup_clear_cpu_cap(X86_FEATURE_SELFSNOOP);
+	}
+}
+
 static bool ring3mwait_disabled __read_mostly;
 
 static int __init ring3mwait_disable(char *__unused)
@@ -311,6 +339,8 @@ static void early_init_intel(struct cpuinfo_x86 *c)
 	 */
 	if (detect_extended_topology_early(c) < 0)
 		detect_ht_early(c);
+
+	check_memory_type_self_snoop_errata(c);
 }
 
 #ifdef CONFIG_X86_32
--
2.17.1
[PATCH 0/2] Speed MTRR programming up when we can
Programming MTRR registers in multi-processor systems is a rather lengthy process. Furthermore, all processors must program these registers in lock step and with interrupts disabled; the process also involves flushing caches and TLBs twice. As a result, the process may take a considerable amount of time.

On some platforms, this can lead to a large skew of the refined-jiffies clock source. Early in the boot process, if no other clock is available (e.g., when booting with hpet=disabled), the refined-jiffies clock source is used to monitor the TSC clock source. If the skew of refined-jiffies is too large, Linux wrongly assumes that the TSC is unstable:

  clocksource: timekeeping watchdog on CPU1: Marking clocksource 'tsc-early' as unstable because the skew is too large:
  clocksource: 'refined-jiffies' wd_now: fffedc10 wd_last: fffedb90 mask:
  clocksource: 'tsc-early' cs_now: 5eccfddebc cs_last: 5e7e3303d4 mask:
  tsc: Marking TSC unstable due to clocksource watchdog

As per my measurements, around 98% of the time needed by the procedure to program MTRRs in multi-processor systems is spent flushing caches with wbinvd(). As per Section 11.11.8 of the Intel 64 and IA-32 Architectures Software Developer's Manual, it is not necessary to flush caches if the CPU supports cache self-snooping. Thus, skipping the cache flushes can reduce the time needed to program the MTRR registers by several tens of milliseconds.

By measuring the execution time of mtrr_aps_init() (from which MTRRs in all CPUs are programmed in lock step at boot), I find savings in the time required to program the MTRRs as follows:

  Platform                   time-with-wbinvd(ms)  time-no-wbinvd(ms)
  104-core (208 LP) Skylake  1437                  28
  2-core   (  4 LP) Haswell   114                   2

  LP = Logical Processor

However, there exist CPU models with errata that affect their self-snooping capability. Such errata may cause unpredictable behavior, machine check errors, or hangs. For instance:

  "Where two different logical processors have the same code page mapped with two different memory types. Specifically, if one code page is mapped by one logical processor as write back and by another as uncacheable, and certain instruction timing conditions occur, the system may experience unpredictable behaviour." [1]

Similar errata are reported in other processors as well [2], [3], [4], [5], and [6]. Thus, in order to confidently leverage self-snooping for the MTRR programming algorithm, we must first clear this feature in models with known errata.

Thanks and BR,
Ricardo

[1]. Erratum BF52, https://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/xeon-3600-specification-update.pdf
[2]. Erratum BK47, https://www.mouser.com/pdfdocs/2ndgencorefamilymobilespecificationupdate.pdf
[3]. Erratum AAO54, https://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/xeon-c5500-c3500-spec-update.pdf
[4]. Errata AZ39, AZ42, https://www.intel.com/content/dam/support/us/en/documents/processors/mobile/celeron/sb/320121.pdf
[5]. Errata AQ51, AQ102, AQ104, https://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/pentium-dual-core-desktop-e2000-specification-update.pdf
[6]. Errata AN107, AN109, https://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/pentium-dual-core-specification-update.pdf

Ricardo Neri (2):
  x86/cpu/intel: Clear cache self-snoop capability in CPUs with known errata
  x86, mtrr: generic: Skip cache flushes on CPUs with cache self-snooping

 arch/x86/kernel/cpu/intel.c        | 30 ++++++++++++++++++++++++++++++
 arch/x86/kernel/cpu/mtrr/generic.c |  8 ++++++--
 2 files changed, 36 insertions(+), 2 deletions(-)

--
2.17.1
[PATCH 2/2] x86, mtrr: generic: Skip cache flushes on CPUs with cache self-snooping
Programming MTRR registers in multi-processor systems is a rather lengthy process. Furthermore, all processors must program these registers in lock step and with interrupts disabled; the process also involves flushing caches and TLBs twice. As a result, the process may take a considerable amount of time.

On some platforms, this can lead to a large skew of the refined-jiffies clock source. Early in the boot process, if no other clock is available (e.g., when booting with hpet=disabled), the refined-jiffies clock source is used to monitor the TSC clock source. If the skew of refined-jiffies is too large, Linux wrongly assumes that the TSC is unstable:

  clocksource: timekeeping watchdog on CPU1: Marking clocksource 'tsc-early' as unstable because the skew is too large:
  clocksource: 'refined-jiffies' wd_now: fffedc10 wd_last: fffedb90 mask:
  clocksource: 'tsc-early' cs_now: 5eccfddebc cs_last: 5e7e3303d4 mask:
  tsc: Marking TSC unstable due to clocksource watchdog

As per my measurements, around 98% of the time needed by the procedure to program MTRRs in multi-processor systems is spent flushing caches with wbinvd(). As per Section 11.11.8 of the Intel 64 and IA-32 Architectures Software Developer's Manual, it is not necessary to flush caches if the CPU supports cache self-snooping. Thus, skipping the cache flushes can reduce the time needed to program the MTRR registers by several tens of milliseconds.

Cc: Alan Cox Cc: Tony Luck Cc: "H. Peter Anvin" Cc: Andy Shevchenko Cc: Andi Kleen Cc: Hans de Goede Cc: Peter Feiner Cc: "Rafael J. Wysocki" Cc: Greg Kroah-Hartman Cc: Jordan Borgner Cc: x...@kernel.org Cc: linux-kernel@vger.kernel.org Cc: "Ravi V. Shankar" Reported-by: Mohammad Etemadi Signed-off-by: Ricardo Neri

---
 arch/x86/kernel/cpu/mtrr/generic.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/cpu/mtrr/generic.c b/arch/x86/kernel/cpu/mtrr/generic.c
index 9356c1c9024d..169672a6935c 100644
--- a/arch/x86/kernel/cpu/mtrr/generic.c
+++ b/arch/x86/kernel/cpu/mtrr/generic.c
@@ -743,7 +743,9 @@ static void prepare_set(void) __acquires(set_atomicity_lock)
 	/* Enter the no-fill (CD=1, NW=0) cache mode and flush caches. */
 	cr0 = read_cr0() | X86_CR0_CD;
 	write_cr0(cr0);
-	wbinvd();
+
+	if (!static_cpu_has(X86_FEATURE_SELFSNOOP))
+		wbinvd();
 
 	/* Save value of CR4 and clear Page Global Enable (bit 7) */
 	if (boot_cpu_has(X86_FEATURE_PGE)) {
@@ -760,7 +762,9 @@ static void prepare_set(void) __acquires(set_atomicity_lock)
 
 	/* Disable MTRRs, and set the default type to uncached */
 	mtrr_wrmsr(MSR_MTRRdefType, deftype_lo & ~0xcff, deftype_hi);
-	wbinvd();
+
+	if (!static_cpu_has(X86_FEATURE_SELFSNOOP))
+		wbinvd();
 }
 
 static void post_set(void) __releases(set_atomicity_lock)
--
2.17.1
[RFC PATCH v4 06/21] x86/hpet: Configure the timer used by the hardlockup detector
Implement the initial configuration of the timer to be used by the hardlockup detector. Return a data structure with a description of the timer; this information is subsequently used by the hardlockup detector.

Only provide the timer if it supports Front Side Bus interrupt delivery. This condition greatly simplifies the implementation of the detector. Specifically, it helps to avoid the complexities of routing the interrupt via the IO-APIC (e.g., potential race conditions that arise from re-programming the IO-APIC in NMI context).

Cc: "H. Peter Anvin" Cc: Ashok Raj Cc: Andi Kleen Cc: Tony Luck Cc: Clemens Ladisch Cc: Arnd Bergmann Cc: Philippe Ombredanne Cc: Kate Stewart Cc: "Rafael J. Wysocki" Cc: Stephane Eranian Cc: Suravee Suthikulpanit Cc: "Ravi V. Shankar" Cc: x...@kernel.org Signed-off-by: Ricardo Neri

---
 arch/x86/include/asm/hpet.h | 13 +++++++++++++
 arch/x86/kernel/hpet.c      | 35 +++++++++++++++++++++++++++++++++++
 2 files changed, 48 insertions(+)

diff --git a/arch/x86/include/asm/hpet.h b/arch/x86/include/asm/hpet.h
index 6f099e2781ce..20abdaa5372d 100644
--- a/arch/x86/include/asm/hpet.h
+++ b/arch/x86/include/asm/hpet.h
@@ -109,6 +109,19 @@ extern void hpet_set_comparator(int num, unsigned int cmp, unsigned int period);
 
 #endif /* CONFIG_HPET_EMULATE_RTC */
 
+#ifdef CONFIG_X86_HARDLOCKUP_DETECTOR_HPET
+struct hpet_hld_data {
+	bool	has_periodic;
+	u32	num;
+	u64	ticks_per_second;
+};
+
+extern struct hpet_hld_data *hpet_hardlockup_detector_assign_timer(void);
+#else
+static inline struct hpet_hld_data *hpet_hardlockup_detector_assign_timer(void)
+{ return NULL; }
+#endif /* CONFIG_X86_HARDLOCKUP_DETECTOR_HPET */
+
 #else /* CONFIG_HPET_TIMER */
 
 static inline int hpet_enable(void) { return 0; }
diff --git a/arch/x86/kernel/hpet.c b/arch/x86/kernel/hpet.c
index ff0250831786..5f9209949fc7 100644
--- a/arch/x86/kernel/hpet.c
+++ b/arch/x86/kernel/hpet.c
@@ -171,6 +171,41 @@ do {								\
 	_hpet_print_config(__func__, __LINE__);	\
 } while (0)
 
+#ifdef CONFIG_X86_HARDLOCKUP_DETECTOR_HPET
+struct hpet_hld_data *hpet_hardlockup_detector_assign_timer(void)
+{
+	struct hpet_hld_data *hdata;
+	u64 temp;
+	u32 cfg;
+
+	cfg = hpet_readl(HPET_Tn_CFG(HPET_WD_TIMER_NR));
+
+	if (!(cfg & HPET_TN_FSB_CAP))
+		return NULL;
+
+	hdata = kzalloc(sizeof(*hdata), GFP_KERNEL);
+	if (!hdata)
+		return NULL;
+
+	if (cfg & HPET_TN_PERIODIC_CAP)
+		hdata->has_periodic = true;
+
+	hdata->num = HPET_WD_TIMER_NR;
+
+	cfg = hpet_readl(HPET_PERIOD);
+
+	/*
+	 * hpet_get_ticks_per_sec() expects the contents of the general
+	 * capabilities register. The period is in the 32 most significant
+	 * bits.
+	 */
+	temp = (u64)cfg << HPET_COUNTER_CLK_PERIOD_SHIFT;
+	hdata->ticks_per_second = hpet_get_ticks_per_sec(temp);
+
+	return hdata;
+}
+#endif /* CONFIG_X86_HARDLOCKUP_DETECTOR_HPET */
+
 /*
  * When the hpet driver (/dev/hpet) is enabled, we need to reserve
  * timer 0 and timer 1 in case of RTC emulation. Timer 2 is reserved in case
--
2.17.1
[RFC PATCH v4 13/21] x86/watchdog/hardlockup/hpet: Determine if HPET timer caused NMI
The only direct method to determine whether an HPET timer caused an interrupt is to read the Interrupt Status register. Unfortunately, reading HPET registers is slow and, therefore, not recommended while in NMI context. Furthermore, the status is not available if the interrupt is generated via the Front Side Bus.

An indirect manner to infer whether the non-maskable interrupt we see was caused by the HPET timer is to use the time-stamp counter. Compute the value that the time-stamp counter should have at the next interrupt of the HPET timer. Since the hardlockup detector operates in seconds, high precision is not needed. This implementation considers that the HPET caused the NMI if the time-stamp counter reads the expected value +/- 1.5%. This value is selected as it is equivalent to 1/64, and the division can be performed using a bit shift operation. Experimentally, the error in the estimation is consistently less than 1%.

The computation of the expected value of the time-stamp counter must be performed in relation to watchdog_thresh divided by the number of monitored CPUs. This quantity is stored in tsc_ticks_per_cpu and must be updated whenever the number of monitored CPUs changes.

Cc: "H. Peter Anvin" Cc: Ashok Raj Cc: Andi Kleen Cc: Tony Luck Cc: Peter Zijlstra Cc: Clemens Ladisch Cc: Arnd Bergmann Cc: Philippe Ombredanne Cc: Kate Stewart Cc: "Rafael J. Wysocki" Cc: Mimi Zohar Cc: Jan Kiszka Cc: Nick Desaulniers Cc: Masahiro Yamada Cc: Nayna Jain Cc: Stephane Eranian Cc: Suravee Suthikulpanit Cc: "Ravi V. Shankar" Cc: x...@kernel.org Suggested-by: Andi Kleen Signed-off-by: Ricardo Neri

---
 arch/x86/include/asm/hpet.h         |  2 ++
 arch/x86/kernel/watchdog_hld_hpet.c | 27 ++++++++++++++++++++++++++-
 2 files changed, 28 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/hpet.h b/arch/x86/include/asm/hpet.h
index 64acacce095d..fd99f2390714 100644
--- a/arch/x86/include/asm/hpet.h
+++ b/arch/x86/include/asm/hpet.h
@@ -115,6 +115,8 @@ struct hpet_hld_data {
 	u32		num;
 	u64		ticks_per_second;
 	u64		ticks_per_cpu;
+	u64		tsc_next;
+	u64		tsc_ticks_per_cpu;
 	u32		handling_cpu;
 	u32		enabled_cpus;
 	struct msi_msg	msi_msg;
diff --git a/arch/x86/kernel/watchdog_hld_hpet.c b/arch/x86/kernel/watchdog_hld_hpet.c
index 74aeb0535d08..dcc50cd29374 100644
--- a/arch/x86/kernel/watchdog_hld_hpet.c
+++ b/arch/x86/kernel/watchdog_hld_hpet.c
@@ -24,6 +24,7 @@
 
 static struct hpet_hld_data *hld_data;
 static bool hardlockup_use_hpet;
+static u64 tsc_next_error;
 
 /**
  * kick_timer() - Reprogram timer to expire in the future
@@ -33,11 +34,22 @@ static bool hardlockup_use_hpet;
  * Reprogram the timer to expire within watchdog_thresh seconds in the future.
  * If the timer supports periodic mode, it is not kicked unless @force is
  * true.
+ *
+ * Also, compute the expected value of the time-stamp counter at the time of
+ * expiration as well as a deviation from the expected value. The maximum
+ * deviation is of ~1.5%. This deviation can be easily computed by shifting
+ * by 6 positions the delta between the current and expected time-stamp values.
  */
 static void kick_timer(struct hpet_hld_data *hdata, bool force)
 {
+	u64 tsc_curr, tsc_delta, new_compare, count, period = 0;
 	bool kick_needed = force || !(hdata->has_periodic);
-	u64 new_compare, count, period = 0;
+
+	tsc_curr = rdtsc();
+
+	tsc_delta = (unsigned long)watchdog_thresh * hdata->tsc_ticks_per_cpu;
+	hdata->tsc_next = tsc_curr + tsc_delta;
+	tsc_next_error = tsc_delta >> 6;
 
 	/*
 	 * Update the comparator in increments of watch_thresh seconds relative
@@ -93,6 +105,15 @@ static void enable_timer(struct hpet_hld_data *hdata)
  */
 static bool is_hpet_wdt_interrupt(struct hpet_hld_data *hdata)
 {
+	if (smp_processor_id() == hdata->handling_cpu) {
+		u64 tsc_curr;
+
+		tsc_curr = rdtsc();
+
+		return (tsc_curr - hdata->tsc_next) + tsc_next_error <
+		       2 * tsc_next_error;
+	}
+
 	return false;
 }
 
@@ -260,6 +281,10 @@ static void update_ticks_per_cpu(struct hpet_hld_data *hdata)
 	do_div(temp, hdata->enabled_cpus);
 
 	hdata->ticks_per_cpu = temp;
+
+	temp = (unsigned long)tsc_khz * 1000L;
+	do_div(temp, hdata->enabled_cpus);
+	hdata->tsc_ticks_per_cpu = temp;
 }
 
 /**
--
2.17.1
[RFC PATCH v4 04/21] x86/hpet: Add hpet_set_comparator() for periodic and one-shot modes
Instead of setting the timer period directly in hpet_set_periodic(), add a new helper function hpet_set_comparator() that only sets the accumulator and comparator. hpet_set_periodic() will only prepare the timer for periodic mode and leave the expiration programming to hpet_set_comparator(). This new function can also be used by other components (e.g., the HPET- based hardlockup detector) which also need to configure HPET timers. Thus, add its declaration into the hpet header file. Cc: "H. Peter Anvin" Cc: Ashok Raj Cc: Andi Kleen Cc: Tony Luck Cc: Philippe Ombredanne Cc: Kate Stewart Cc: "Rafael J. Wysocki" Cc: Stephane Eranian Cc: Suravee Suthikulpanit Cc: "Ravi V. Shankar" Cc: x...@kernel.org Originally-by: Suravee Suthikulpanit Signed-off-by: Ricardo Neri --- arch/x86/include/asm/hpet.h | 1 + arch/x86/kernel/hpet.c | 57 + 2 files changed, 46 insertions(+), 12 deletions(-) diff --git a/arch/x86/include/asm/hpet.h b/arch/x86/include/asm/hpet.h index f132fbf984d4..e7098740f5ee 100644 --- a/arch/x86/include/asm/hpet.h +++ b/arch/x86/include/asm/hpet.h @@ -102,6 +102,7 @@ extern int hpet_rtc_timer_init(void); extern irqreturn_t hpet_rtc_interrupt(int irq, void *dev_id); extern int hpet_register_irq_handler(rtc_irq_handler handler); extern void hpet_unregister_irq_handler(rtc_irq_handler handler); +extern void hpet_set_comparator(int num, unsigned int cmp, unsigned int period); #endif /* CONFIG_HPET_EMULATE_RTC */ diff --git a/arch/x86/kernel/hpet.c b/arch/x86/kernel/hpet.c index 5e86e024c489..1723d55219e8 100644 --- a/arch/x86/kernel/hpet.c +++ b/arch/x86/kernel/hpet.c @@ -290,6 +290,47 @@ static void hpet_legacy_clockevent_register(void) printk(KERN_DEBUG "hpet clockevent registered\n"); } +/** + * hpet_set_comparator() - Helper function for setting comparator register + * @num: The timer ID + * @cmp: The value to be written to the comparator/accumulator + * @period:The value to be written to the period (0 = oneshot mode) + * + * Helper function for updating 
comparator, accumulator and period values. + * + * In periodic mode, HPET needs HPET_TN_SETVAL to be set before writing + * to the Tn_CMP to update the accumulator. Then, HPET needs a second + * write (with HPET_TN_SETVAL cleared) to Tn_CMP to set the period. + * The HPET_TN_SETVAL bit is automatically cleared after the first write. + * + * For one-shot mode, HPET_TN_SETVAL does not need to be set. + * + * See the following documents: + * - Intel IA-PC HPET (High Precision Event Timers) Specification + * - AMD-8111 HyperTransport I/O Hub Data Sheet, Publication # 24674 + */ +void hpet_set_comparator(int num, unsigned int cmp, unsigned int period) +{ + if (period) { + unsigned int v = hpet_readl(HPET_Tn_CFG(num)); + + hpet_writel(v | HPET_TN_SETVAL, HPET_Tn_CFG(num)); + } + + hpet_writel(cmp, HPET_Tn_CMP(num)); + + if (!period) + return; + + /* +* This delay is seldom used: never in one-shot mode and in periodic +* only when reprogramming the timer. +*/ + udelay(1); + hpet_writel(period, HPET_Tn_CMP(num)); +} +EXPORT_SYMBOL_GPL(hpet_set_comparator); + static int hpet_set_periodic(struct clock_event_device *evt, int timer) { unsigned int cfg, cmp, now; @@ -301,19 +342,11 @@ static int hpet_set_periodic(struct clock_event_device *evt, int timer) now = hpet_readl(HPET_COUNTER); cmp = now + (unsigned int)delta; cfg = hpet_readl(HPET_Tn_CFG(timer)); - cfg |= HPET_TN_ENABLE | HPET_TN_PERIODIC | HPET_TN_SETVAL | - HPET_TN_32BIT; + cfg |= HPET_TN_ENABLE | HPET_TN_PERIODIC | HPET_TN_32BIT; hpet_writel(cfg, HPET_Tn_CFG(timer)); - hpet_writel(cmp, HPET_Tn_CMP(timer)); - udelay(1); - /* -* HPET on AMD 81xx needs a second write (with HPET_TN_SETVAL -* cleared) to T0_CMP to set the period. The HPET_TN_SETVAL -* bit is automatically cleared after the first write. 
-* (See AMD-8111 HyperTransport I/O Hub Data Sheet, -* Publication # 24674) -*/ - hpet_writel((unsigned int)delta, HPET_Tn_CMP(timer)); + + hpet_set_comparator(timer, cmp, (unsigned int)delta); + hpet_start_counter(); hpet_print_config(); -- 2.17.1
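The two-write protocol that hpet_set_comparator() implements (set HPET_TN_SETVAL, write the accumulator, then write the period once the bit has auto-cleared) can be exercised in a small userspace model. The write log, the REG_CFG/REG_CMP names, and the single-timer scope below are illustrative stand-ins, not the real HPET MMIO layout; only the control flow mirrors the patch:

```c
#include <assert.h>
#include <stdint.h>

#define HPET_TN_SETVAL	(1u << 6)	/* same bit value as the kernel define */

enum hpet_reg { REG_CFG, REG_CMP };

/* Log every register write so the ordering can be checked afterwards */
static struct { enum hpet_reg reg; uint32_t val; } wr_log[8];
static int wr_count;
static uint32_t cfg_shadow;

static uint32_t hpet_readl_cfg(void)
{
	return cfg_shadow;
}

static void hpet_writel(uint32_t val, enum hpet_reg reg)
{
	wr_log[wr_count].reg = reg;
	wr_log[wr_count].val = val;
	wr_count++;
	if (reg == REG_CFG)
		cfg_shadow = val;
}

/* Mirrors the control flow of the patch's hpet_set_comparator() */
static void hpet_set_comparator(uint32_t cmp, uint32_t period)
{
	if (period) {
		uint32_t v = hpet_readl_cfg();

		hpet_writel(v | HPET_TN_SETVAL, REG_CFG);
	}

	hpet_writel(cmp, REG_CMP);

	if (!period)
		return;

	/* the real code inserts udelay(1) here before the period write */
	hpet_writel(period, REG_CMP);
}
```

In periodic mode this produces exactly three writes in the order the AMD-8111 erratum requires; in one-shot mode only the single comparator write happens.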
[RFC PATCH v4 17/21] x86/tsc: Switch to perf-based hardlockup detector if TSC becomes unstable
The HPET-based hardlockup detector relies on the TSC to determine if an observed NMI interrupt originated from the HPET timer. Hence, this detector can no longer be used once the TSC becomes unstable. In such a case, permanently stop the HPET-based hardlockup detector and start the perf-based detector. Signed-off-by: Ricardo Neri --- arch/x86/include/asm/hpet.h | 2 ++ arch/x86/kernel/tsc.c | 2 ++ arch/x86/kernel/watchdog_hld.c | 7 +++ 3 files changed, 11 insertions(+) diff --git a/arch/x86/include/asm/hpet.h b/arch/x86/include/asm/hpet.h index fd99f2390714..a82cbe17479d 100644 --- a/arch/x86/include/asm/hpet.h +++ b/arch/x86/include/asm/hpet.h @@ -128,6 +128,7 @@ extern int hardlockup_detector_hpet_init(void); extern void hardlockup_detector_hpet_stop(void); extern void hardlockup_detector_hpet_enable(unsigned int cpu); extern void hardlockup_detector_hpet_disable(unsigned int cpu); +extern void hardlockup_detector_switch_to_perf(void); #else static inline struct hpet_hld_data *hpet_hardlockup_detector_assign_timer(void) { return NULL; } @@ -136,6 +137,7 @@ static inline int hardlockup_detector_hpet_init(void) static inline void hardlockup_detector_hpet_stop(void) {} static inline void hardlockup_detector_hpet_enable(unsigned int cpu) {} static inline void hardlockup_detector_hpet_disable(unsigned int cpu) {} +static inline void hardlockup_detector_switch_to_perf(void) {} #endif /* CONFIG_X86_HARDLOCKUP_DETECTOR_HPET */ #else /* CONFIG_HPET_TIMER */ diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c index 59b57605e66c..b2210728ce3d 100644 --- a/arch/x86/kernel/tsc.c +++ b/arch/x86/kernel/tsc.c @@ -1158,6 +1158,8 @@ void mark_tsc_unstable(char *reason) clocksource_mark_unstable(&clocksource_tsc_early); clocksource_mark_unstable(&clocksource_tsc); + + hardlockup_detector_switch_to_perf(); } EXPORT_SYMBOL_GPL(mark_tsc_unstable); diff --git a/arch/x86/kernel/watchdog_hld.c b/arch/x86/kernel/watchdog_hld.c index c2512d4c79c5..c8547c227a41 100644 --- a/arch/x86/kernel/watchdog_hld.c +++ 
b/arch/x86/kernel/watchdog_hld.c @@ -76,3 +76,10 @@ void watchdog_nmi_stop(void) if (detector_type == X86_HARDLOCKUP_DETECTOR_HPET) hardlockup_detector_hpet_stop(); } + +void hardlockup_detector_switch_to_perf(void) +{ + detector_type = X86_HARDLOCKUP_DETECTOR_PERF; + hardlockup_detector_hpet_stop(); + hardlockup_start_all(); +} -- 2.17.1
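The one-way fallback this patch introduces can be modeled in userspace: once mark_tsc_unstable() runs, the detector type flips to perf and the HPET timer is stopped for good. The function names follow the patch; the stub bodies and the tracking flags are hypothetical stand-ins for the kernel machinery:

```c
#include <assert.h>
#include <stdbool.h>

/* Userspace model of the fallback path; not the kernel implementation. */
enum detector { X86_HARDLOCKUP_DETECTOR_PERF, X86_HARDLOCKUP_DETECTOR_HPET };

static enum detector detector_type = X86_HARDLOCKUP_DETECTOR_HPET;
static bool hpet_stopped;	/* hypothetical flags for the model only */
static bool perf_started;

static void hardlockup_detector_hpet_stop(void)
{
	hpet_stopped = true;
}

static void hardlockup_start_all(void)
{
	perf_started = true;
}

/* Mirrors the new function added to watchdog_hld.c by the patch */
static void hardlockup_detector_switch_to_perf(void)
{
	detector_type = X86_HARDLOCKUP_DETECTOR_PERF;
	hardlockup_detector_hpet_stop();
	hardlockup_start_all();
}

static void mark_tsc_unstable(void)
{
	/* clocksource_mark_unstable() calls elided in this sketch */
	hardlockup_detector_switch_to_perf();
}
```

Note the switch is permanent by construction: nothing ever writes X86_HARDLOCKUP_DETECTOR_HPET back into detector_type.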
[RFC PATCH v4 11/21] x86/watchdog/hardlockup: Add an HPET-based hardlockup detector
This is the initial implementation of a hardlockup detector driven by an HPET timer. This initial implementation includes functions to control the timer via its registers. It also requests such a timer, installs an NMI interrupt handler and performs the initial configuration of the timer. The detector is not functional at this stage. A subsequent changeset will invoke the interfaces provided by this detector as well as the functionality to determine if the HPET timer caused the NMI. In order to detect hardlockups in all the monitored CPUs, move the interrupt to the next monitored CPU while handling the NMI interrupt; wrap around when reaching the highest CPU in the mask. This rotation is achieved by setting the affinity mask to only contain the next CPU to monitor. A cpumask keeps track of all the CPUs that need to be monitored. This cpumask is updated when the watchdog is enabled or disabled on a particular CPU. This detector relies on an HPET timer that is capable of using Front Side Bus interrupts. In order to avoid using the generic interrupt code, directly program the MSI message register of the HPET timer. HPET registers are only accessed to kick the timer after looking for hardlockups. This happens every watchdog_thresh seconds. A subsequent changeset will determine whether the HPET timer caused the interrupt based on the value of the time-stamp counter. For now, just add a stub function. Cc: "H. Peter Anvin" Cc: Ashok Raj Cc: Andi Kleen Cc: Tony Luck Cc: Peter Zijlstra Cc: Clemens Ladisch Cc: Arnd Bergmann Cc: Philippe Ombredanne Cc: Kate Stewart Cc: "Rafael J. Wysocki" Cc: "Ravi V. 
Shankar" Cc: Mimi Zohar Cc: Jan Kiszka Cc: Nick Desaulniers Cc: Masahiro Yamada Cc: Nayna Jain Cc: Stephane Eranian Cc: Suravee Suthikulpanit Cc: x...@kernel.org Signed-off-by: Ricardo Neri --- arch/x86/Kconfig.debug | 11 + arch/x86/include/asm/hpet.h | 13 ++ arch/x86/kernel/Makefile | 1 + arch/x86/kernel/hpet.c | 3 +- arch/x86/kernel/watchdog_hld_hpet.c | 335 5 files changed, 362 insertions(+), 1 deletion(-) create mode 100644 arch/x86/kernel/watchdog_hld_hpet.c diff --git a/arch/x86/Kconfig.debug b/arch/x86/Kconfig.debug index f730680dc818..445bbb188f10 100644 --- a/arch/x86/Kconfig.debug +++ b/arch/x86/Kconfig.debug @@ -169,6 +169,17 @@ config IOMMU_LEAK config HAVE_MMIOTRACE_SUPPORT def_bool y +config X86_HARDLOCKUP_DETECTOR_HPET + bool "Use HPET Timer for Hard Lockup Detection" + select SOFTLOCKUP_DETECTOR + select HARDLOCKUP_DETECTOR + select HARDLOCKUP_DETECTOR_CORE + depends on HPET_TIMER && HPET && X86_64 + help + Say y to enable a hardlockup detector that is driven by a High- + Precision Event Timer. This option is useful to avoid using counters + from the Performance Monitoring Unit to drive the detector. 
+ config X86_DECODER_SELFTEST bool "x86 instruction decoder selftest" depends on DEBUG_KERNEL && KPROBES diff --git a/arch/x86/include/asm/hpet.h b/arch/x86/include/asm/hpet.h index 20abdaa5372d..31fc27508cf3 100644 --- a/arch/x86/include/asm/hpet.h +++ b/arch/x86/include/asm/hpet.h @@ -114,12 +114,25 @@ struct hpet_hld_data { boolhas_periodic; u32 num; u64 ticks_per_second; + u32 handling_cpu; + u32 enabled_cpus; + struct msi_msg msi_msg; + unsigned long cpu_monitored_mask[0]; }; extern struct hpet_hld_data *hpet_hardlockup_detector_assign_timer(void); +extern int hardlockup_detector_hpet_init(void); +extern void hardlockup_detector_hpet_stop(void); +extern void hardlockup_detector_hpet_enable(unsigned int cpu); +extern void hardlockup_detector_hpet_disable(unsigned int cpu); #else static inline struct hpet_hld_data *hpet_hardlockup_detector_assign_timer(void) { return NULL; } +static inline int hardlockup_detector_hpet_init(void) +{ return -ENODEV; } +static inline void hardlockup_detector_hpet_stop(void) {} +static inline void hardlockup_detector_hpet_enable(unsigned int cpu) {} +static inline void hardlockup_detector_hpet_disable(unsigned int cpu) {} #endif /* CONFIG_X86_HARDLOCKUP_DETECTOR_HPET */ #else /* CONFIG_HPET_TIMER */ diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile index 3578ad248bc9..3ad55de67e8b 100644 --- a/arch/x86/kernel/Makefile +++ b/arch/x86/kernel/Makefile @@ -106,6 +106,7 @@ obj-$(CONFIG_VM86) += vm86_32.o obj-$(CONFIG_EARLY_PRINTK) += early_printk.o obj-$(CONFIG_HPET_TIMER) += hpet.o +obj-$(CONFIG_X86_HARDLOCKUP_DETECTOR_HPET) += watchdog_hld_hpet.o obj-$(CONFIG_APB_TIMER)+= apb_timer.o obj-$(CONFIG_AMD_NB) += amd_nb.o diff --git a/arch/x86/kernel/hpet.c b/arch/x86/kernel/hpet.c index 5f9209949fc7..dd3bb664a188 100644 --- a/arch
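The per-NMI rotation described in the commit message (advance to the next monitored CPU, wrap around at the highest CPU in the mask) can be sketched with a plain 64-bit mask standing in for the kernel cpumask. next_monitored_cpu() is a hypothetical helper written for this illustration, not a function from the patch; the two scan loops correspond to cpumask_next() and cpumask_first():

```c
#include <assert.h>
#include <stdint.h>

#define NR_CPUS	64	/* limit of the uint64_t stand-in mask */

/*
 * Return the next CPU set in @monitored_mask strictly after @current_cpu,
 * wrapping to the lowest set bit at the end of the mask; -1 if empty.
 */
static int next_monitored_cpu(uint64_t monitored_mask, int current_cpu)
{
	int cpu;

	/* like cpumask_next(): first set bit after current_cpu */
	for (cpu = current_cpu + 1; cpu < NR_CPUS; cpu++)
		if (monitored_mask & (1ULL << cpu))
			return cpu;

	/* wrap around, like cpumask_first() */
	for (cpu = 0; cpu <= current_cpu; cpu++)
		if (monitored_mask & (1ULL << cpu))
			return cpu;

	return -1;	/* no CPU is being monitored */
}
```

The NMI handler would then set the timer's interrupt affinity to contain only the CPU this helper returns, which is exactly the single-CPU-affinity rotation the commit message describes.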
[RFC PATCH v4 19/21] iommu/vt-d: Rework prepare_irte() to support per-irq delivery mode
A recent change introduced a new member to struct irq_cfg to specify the delivery mode of an interrupt. Supporting the configuration of the delivery mode would require adding a third argument to prepare_irte(). Instead, simply take a pointer to an irq_cfg data structure as the only argument. Internally, configure the delivery mode of the Interrupt Remapping Table Entry as specified in the irq_cfg data structure rather than from the APIC setting. This does not change the existing behavior, as the delivery mode of the APIC is used to configure the irq_cfg data structure. Cc: Ashok Raj Cc: Andi Kleen Cc: Tony Luck Cc: Borislav Petkov Cc: Jacob Pan Cc: Joerg Roedel Cc: Juergen Gross Cc: Bjorn Helgaas Cc: Wincy Van Cc: Kate Stewart Cc: Philippe Ombredanne Cc: "Eric W. Biederman" Cc: Baoquan He Cc: Jan Kiszka Cc: Lu Baolu Cc: Stephane Eranian Cc: Suravee Suthikulpanit Cc: "Ravi V. Shankar" Cc: x...@kernel.org Cc: io...@lists.linux-foundation.org Signed-off-by: Ricardo Neri --- drivers/iommu/intel_irq_remapping.c | 10 +- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/drivers/iommu/intel_irq_remapping.c b/drivers/iommu/intel_irq_remapping.c index 4160aa9f3f80..2e61eaca7d7e 100644 --- a/drivers/iommu/intel_irq_remapping.c +++ b/drivers/iommu/intel_irq_remapping.c @@ -1072,7 +1072,7 @@ static int reenable_irq_remapping(int eim) return -1; } -static void prepare_irte(struct irte *irte, int vector, unsigned int dest) +static void prepare_irte(struct irte *irte, struct irq_cfg *irq_cfg) { memset(irte, 0, sizeof(*irte)); @@ -1086,9 +1086,9 @@ static void prepare_irte(struct irte *irte, int vector, unsigned int dest) * irq migration in the presence of interrupt-remapping. 
*/ irte->trigger_mode = 0; - irte->dlvry_mode = apic->irq_delivery_mode; - irte->vector = vector; - irte->dest_id = IRTE_DEST(dest); + irte->dlvry_mode = irq_cfg->delivery_mode; + irte->vector = irq_cfg->vector; + irte->dest_id = IRTE_DEST(irq_cfg->dest_apicid); irte->redir_hint = 1; } @@ -1265,7 +1265,7 @@ static void intel_irq_remapping_prepare_irte(struct intel_ir_data *data, struct irte *irte = >irte_entry; struct msi_msg *msg = >msi_entry; - prepare_irte(irte, irq_cfg->vector, irq_cfg->dest_apicid); + prepare_irte(irte, irq_cfg); switch (info->type) { case X86_IRQ_ALLOC_TYPE_IOAPIC: /* Set source-id of interrupt request */ -- 2.17.1
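The effect of the rework can be mocked in userspace: prepare_irte() now reads everything it needs from the per-IRQ irq_cfg instead of taking separate vector/destination arguments plus the global APIC delivery mode. The struct layouts below are simplified stand-ins for the kernel types, and IRTE_DEST() is reduced to an identity macro (the real one depends on the x2APIC/EIM setting):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Simplified stand-ins for the kernel's struct irq_cfg and struct irte */
struct irq_cfg {
	uint8_t		delivery_mode;
	uint8_t		vector;
	uint32_t	dest_apicid;
};

struct irte {
	uint8_t		dlvry_mode;
	uint8_t		vector;
	uint32_t	dest_id;
	uint8_t		redir_hint;
	uint8_t		trigger_mode;
};

#define IRTE_DEST(d)	(d)	/* identity here; real macro depends on EIM */

/* Mirrors the reworked signature: one irq_cfg pointer, no loose arguments */
static void prepare_irte(struct irte *irte, struct irq_cfg *irq_cfg)
{
	memset(irte, 0, sizeof(*irte));

	irte->trigger_mode = 0;	/* edge: required for IRQ migration */
	irte->dlvry_mode = irq_cfg->delivery_mode;
	irte->vector = irq_cfg->vector;
	irte->dest_id = IRTE_DEST(irq_cfg->dest_apicid);
	irte->redir_hint = 1;
}
```

Because irq_cfg is filled from the APIC's delivery mode elsewhere, every field of the resulting IRTE is identical to what the old three-argument version produced, which is the "no behavior change" claim of the commit message.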
[RFC PATCH v4 15/21] watchdog/hardlockup/hpet: Only enable the HPET watchdog via a boot parameter
Keep the HPET-based hardlockup detector disabled unless explicitly enabled via a command-line argument. If such a parameter is not given, the initialization of the HPET-based hardlockup detector fails and the NMI watchdog falls back to the perf-based implementation. Given that __setup("nmi_watchdog=") is already used to control the behavior of the NMI watchdog (via hardlockup_panic_setup()), it cannot be used to control the HPET-based implementation. Instead, use a new early_param("nmi_watchdog"). Cc: "H. Peter Anvin" Cc: Ashok Raj Cc: Andi Kleen Cc: Tony Luck Cc: Peter Zijlstra Cc: Clemens Ladisch Cc: Arnd Bergmann Cc: Philippe Ombredanne Cc: Kate Stewart Cc: "Rafael J. Wysocki" Cc: Mimi Zohar Cc: Jan Kiszka Cc: Nick Desaulniers Cc: Masahiro Yamada Cc: Nayna Jain Cc: Stephane Eranian Cc: Suravee Suthikulpanit Cc: "Ravi V. Shankar" Cc: x...@kernel.org Signed-off-by: Ricardo Neri -- checkpatch gives the following warning: CHECK: __setup appears un-documented -- check Documentation/admin-guide/kernel-parameters.rst +__setup("nmi_watchdog=", hardlockup_detector_hpet_setup); This is a false positive, as the option nmi_watchdog is already documented. The option is re-evaluated in this file as well. --- .../admin-guide/kernel-parameters.txt | 8 ++- arch/x86/kernel/watchdog_hld_hpet.c | 22 +++ 2 files changed, 29 insertions(+), 1 deletion(-) diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt index 138f6664b2e2..17ed3dcda13e 100644 --- a/Documentation/admin-guide/kernel-parameters.txt +++ b/Documentation/admin-guide/kernel-parameters.txt @@ -2831,7 +2831,7 @@ Format: [state][,regs][,debounce][,die] nmi_watchdog= [KNL,BUGS=X86] Debugging features for SMP kernels - Format: [panic,][nopanic,][num] + Format: [panic,][nopanic,][num,][hpet] Valid num: 0 or 1 0 - turn hardlockup detector in nmi_watchdog off 1 - turn hardlockup detector in nmi_watchdog on @@ -2841,6 +2841,12 @@ please see 'nowatchdog'. 
This is useful when you use a panic=... timeout and need the box quickly up again. + When hpet is specified, the NMI watchdog will be driven + by an HPET timer, if available in the system. Otherwise, + it falls back to the default implementation (perf or + architecture-specific). Specifying hpet has no effect + if the NMI watchdog is not enabled (either at build time + or via the command line). These settings can be accessed at runtime via the nmi_watchdog and hardlockup_panic sysctls. diff --git a/arch/x86/kernel/watchdog_hld_hpet.c b/arch/x86/kernel/watchdog_hld_hpet.c index dcc50cd29374..76eed714a1cb 100644 --- a/arch/x86/kernel/watchdog_hld_hpet.c +++ b/arch/x86/kernel/watchdog_hld_hpet.c @@ -351,6 +351,28 @@ void hardlockup_detector_hpet_stop(void) disable_timer(hld_data); } +/** + * hardlockup_detector_hpet_setup() - Parse command-line parameters + * @str: A string containing the kernel command line + * + * Parse the nmi_watchdog parameter from the kernel command line. If + * selected by the user, use this implementation to detect hardlockups. + */ +static int __init hardlockup_detector_hpet_setup(char *str) +{ + if (!str) + return -EINVAL; + + if (parse_option_str(str, "hpet")) + hardlockup_use_hpet = true; + + if (!nmi_watchdog_user_enabled && hardlockup_use_hpet) + pr_warn("Selecting HPET NMI watchdog has no effect with NMI watchdog disabled\n"); + + return 0; +} +early_param("nmi_watchdog", hardlockup_detector_hpet_setup); + /** * hardlockup_detector_hpet_init() - Initialize the hardlockup detector * -- 2.17.1
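parse_option_str() treats the parameter value as a comma-separated token list, which is why "hpet" can coexist with "panic" and the numeric flag in nmi_watchdog=. Below is a userspace re-implementation of that token match, written as a sketch for illustration; the kernel's actual helper lives in lib/cmdline.c and may differ in details:

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/*
 * Return true when @option appears as a full comma-separated token in
 * @str, e.g. "hpet" matches "panic,hpet" but not "hpetx" or "1".
 */
static bool parse_option_str(const char *str, const char *option)
{
	size_t optlen = strlen(option);

	while (*str) {
		const char *end = strchr(str, ',');
		size_t len = end ? (size_t)(end - str) : strlen(str);

		if (len == optlen && !strncmp(str, option, optlen))
			return true;
		if (!end)
			break;
		str = end + 1;
	}
	return false;
}
```

With this matching, a command line such as nmi_watchdog=panic,hpet enables both the panic behavior and the HPET-driven implementation, exactly as the documentation hunk above describes.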
Re: [RFC PATCH v2 12/14] x86/watchdog/hardlockup/hpet: Determine if HPET timer caused NMI
On Wed, Apr 10, 2019 at 09:01:52AM +0200, Peter Zijlstra wrote: > On Tue, Apr 09, 2019 at 06:19:57PM -0700, Ricardo Neri wrote: > > On Tue, Apr 09, 2019 at 01:28:17PM +0200, Peter Zijlstra wrote: > > > > @@ -147,6 +161,14 @@ static void set_periodic(struct hpet_hld_data > > > > *hdata) > > > > */ > > > > static bool is_hpet_wdt_interrupt(struct hpet_hld_data *hdata) > > > > { > > > > + if (smp_processor_id() == hdata->handling_cpu) { > > > > + unsigned long tsc_curr; > > > > > > TSC is u64 > > > > In x86_64, isn't u64 an unsigned long? Do you mean to consider the > > 32-bit case? > > Unless none of this code is available for x86_32, you pretty much have > to consider 32bit. > > But even then, using u64 for 64bit values is the right thing. OK, I'll implement this change. Thanks and BR, Ricardo
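The review point above can be shown concretely: on a 32-bit x86 kernel, unsigned long is 32 bits wide, so assigning a TSC reading to it silently drops the high half. In this demo uint32_t plays the role of a 32-bit kernel's unsigned long; the helper name and sample value are made up for illustration:

```c
#include <assert.h>
#include <stdint.h>

/* A plausible TSC reading on a long-running machine: high bits are set */
static const uint64_t tsc_sample = 0x0123456789abcdefULL;

/*
 * Does a TSC value survive a round-trip through a 32-bit unsigned long?
 * uint32_t models "unsigned long" as compiled for a 32-bit kernel.
 */
int tsc_fits_unsigned_long_32(uint64_t tsc)
{
	uint32_t truncated = (uint32_t)tsc;	/* what a 32-bit store does */

	return (uint64_t)truncated == tsc;
}
```

Hence the recommendation to declare TSC values as u64 unconditionally: the type is correct on both 32-bit and 64-bit builds, rather than only happening to work on x86_64.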