Re: [PATCH v4 01/16] perf/x86/intel: Add x86_pmu.pebs_vmx for Ice Lake Servers
On 2021/4/15 10:49, Liuxiangdong wrote:
On 2021/4/15 9:38, Xu, Like wrote:
On 2021/4/14 22:49, Liuxiangdong wrote:
Hi Like,
On 2021/4/9 16:46, Like Xu wrote:
Hi Liuxiangdong,
On 2021/4/9 16:33, Liuxiangdong (Aven, Cloud Infrastructure Service Product Dept.) wrote:

Do you have any comments or ideas about it ? https://lore.kernel.org/kvm/606e5ef6.2060...@huawei.com/

My expectation is that there may be many fewer PEBS samples on Skylake without any soft lockup.

You may need to confirm the statement "All that matters is that the EPT pages don't get unmapped ever while PEBS is active" is true in the kernel level.

Try "-overcommit mem-lock=on" for your qemu.

Sorry, in fact, I don't quite understand "My expectation is that there may be many fewer PEBS samples on Skylake without any soft lockup."

For testcase: perf record -e instructions:pp ./workload

We can get 2242 samples on the ICX guest, but only 17 samples or less on the Skylake guest. In my testcase on Skylake, neither the host nor the guest triggered the soft lock.

Thanks for your explanation! Could you please show your complete qemu command and qemu version used on Skylake? I hope I can test it again according to your qemu cmd and version.

A new version is released and you may have a try.
qemu command: "-enable-kvm -cpu host,migratable=no"
qemu base commit: db55d2c9239d445cb7f1fa8ede8e42bd339058f4
kvm base commit: f96be2deac9bca3ef5a2b0b66b71fcef8bad586d

diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index 63c55f45ca92..727f55400eaf 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -5618,6 +5618,7 @@ __init int intel_pmu_init(void)
 	case INTEL_FAM6_KABYLAKE:
 	case INTEL_FAM6_COMETLAKE_L:
 	case INTEL_FAM6_COMETLAKE:
+		x86_pmu.pebs_vmx = 1;
 		x86_add_quirk(intel_pebs_isolation_quirk);
 		x86_pmu.late_ack = true;
 		memcpy(hw_cache_event_ids, skl_hw_cache_event_ids, sizeof(hw_cache_event_ids));
diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
index 100a749251b8..9e37e3dbe3ae 100644
--- a/arch/x86/kvm/pmu.c
+++ b/arch/x86/kvm/pmu.c
@@ -150,9 +150,8 @@ static void pmc_reprogram_counter(struct kvm_pmc *pmc, u32 type,
 		 * the accuracy of the PEBS profiling result, because the "event IP"
 		 * in the PEBS record is calibrated on the guest side.
 		 */
-		attr.precise_ip = 1;
-		if (x86_match_cpu(vmx_icl_pebs_cpu) && pmc->idx == 32)
-			attr.precise_ip = 3;
+		attr.precise_ip = x86_match_cpu(vmx_icl_pebs_cpu) ?
+			((pmc->idx == 32) ? 3 : 1) : ((pmc->idx == 1) ? 3 : 1);
 	}
 
 	event = perf_event_create_kernel_counter(&attr, -1, current,

And, I have used "-overcommit mem-lock=on" when the soft lockup happens.

I misunderstood the use of "mem-lock=on". It is not the same as pinning the guest memory, and I believe more kernel patches are needed.

Now, I have tried to configure 1G hugepages for a 2G-memory VM. Each guest NUMA node has 1G of memory. When I use PEBS (perf record -e cycles:pp) in the guest, PEBS samples arrive for a short while and then I cannot get any more. The host doesn't soft lockup in this process.

In the worst case, no samples are expected.

Is there something wrong on Skylake, given that we can only get a few samples? IRQ? Or is using hugepages not effective?

The few samples come from a hardware limitation. Skylake doesn't have this "EPT-Friendly PEBS" capability, and some PEBS records will be lost when used by guests.

Thanks!

On 2021/4/6 13:14, Xu, Like wrote:
Hi Xiangdong,
On 2021/4/6 11:24, Liuxiangdong (Aven, Cloud Infrastructure Service Product Dept.) wrote:

Hi, Like. Some questions about this new PEBS patch set: https://lore.kernel.org/kvm/20210329054137.120994-2-like...@linux.intel.com/

The new hardware facility supporting guest PEBS is only available on Intel Ice Lake Server platforms for now.

Yes, we have documented this "EPT-friendly PEBS" capability in the SDM 18.3.10.1 Processor Event Based Sampling (PEBS) Facility. And again, this patch set doesn't officially support guest PEBS on Skylake.

AFAIK, Icelake supports adaptive PEBS and extended PEBS, which Skylake doesn't. But we can still use the IA32_PEBS_ENABLE MSR to indicate a general-purpose counter on Skylake.

For Skylake, only PMC0-PMC3 are valid for PEBS and you may mask the other unsupported bits in pmu->pebs_enable_mask.

Is there anything else that only Icelake supports in this patch set?

The PDIR counter on Ice Lake is fixed counter 0, while the PDIR counter on Skylake is gp counter 1. You may also expose x86_pmu.pebs_vmx for Skylake in the 1st patch.

Besides, we have tried
[PATCH v5 16/16] KVM: x86/pmu: Expose CPUIDs feature bits PDCM, DS, DTES64
The CPUID features PDCM, DS and DTES64 are required for PEBS feature. KVM would expose CPUID feature PDCM, DS and DTES64 to guest when PEBS is supported in the KVM on the Ice Lake server platforms. Originally-by: Andi Kleen Co-developed-by: Kan Liang Signed-off-by: Kan Liang Co-developed-by: Luwei Kang Signed-off-by: Luwei Kang Signed-off-by: Like Xu --- arch/x86/kvm/vmx/capabilities.h | 26 ++ arch/x86/kvm/vmx/vmx.c | 15 +++ 2 files changed, 33 insertions(+), 8 deletions(-) diff --git a/arch/x86/kvm/vmx/capabilities.h b/arch/x86/kvm/vmx/capabilities.h index d1d77985e889..241e41221701 100644 --- a/arch/x86/kvm/vmx/capabilities.h +++ b/arch/x86/kvm/vmx/capabilities.h @@ -5,6 +5,7 @@ #include #include "lapic.h" +#include "pmu.h" extern bool __read_mostly enable_vpid; extern bool __read_mostly flexpriority_enabled; @@ -378,20 +379,29 @@ static inline bool vmx_pt_mode_is_host_guest(void) return pt_mode == PT_MODE_HOST_GUEST; } -static inline u64 vmx_get_perf_capabilities(void) +static inline bool vmx_pebs_supported(void) { - u64 perf_cap = 0; - - if (boot_cpu_has(X86_FEATURE_PDCM)) - rdmsrl(MSR_IA32_PERF_CAPABILITIES, perf_cap); - - perf_cap &= PMU_CAP_LBR_FMT; + return boot_cpu_has(X86_FEATURE_PEBS) && kvm_pmu_cap.pebs_vmx; +} +static inline u64 vmx_get_perf_capabilities(void) +{ /* * Since counters are virtualized, KVM would support full * width counting unconditionally, even if the host lacks it. 
*/ - return PMU_CAP_FW_WRITES | perf_cap; + u64 perf_cap = PMU_CAP_FW_WRITES; + u64 host_perf_cap = 0; + + if (boot_cpu_has(X86_FEATURE_PDCM)) + rdmsrl(MSR_IA32_PERF_CAPABILITIES, host_perf_cap); + + perf_cap |= host_perf_cap & PMU_CAP_LBR_FMT; + + if (vmx_pebs_supported()) + perf_cap |= host_perf_cap & PERF_CAP_PEBS_MASK; + + return perf_cap; } static inline u64 vmx_supported_debugctl(void) diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c index 5ad12bb76296..e44eb57706e2 100644 --- a/arch/x86/kvm/vmx/vmx.c +++ b/arch/x86/kvm/vmx/vmx.c @@ -2261,6 +2261,17 @@ static int vmx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info) if (!cpuid_model_is_consistent(vcpu)) return 1; } + if (data & PERF_CAP_PEBS_FORMAT) { + if ((data & PERF_CAP_PEBS_MASK) != + (vmx_get_perf_capabilities() & PERF_CAP_PEBS_MASK)) + return 1; + if (!guest_cpuid_has(vcpu, X86_FEATURE_DS)) + return 1; + if (!guest_cpuid_has(vcpu, X86_FEATURE_DTES64)) + return 1; + if (!cpuid_model_is_consistent(vcpu)) + return 1; + } ret = kvm_set_msr_common(vcpu, msr_info); break; @@ -7287,6 +7298,10 @@ static __init void vmx_set_cpu_caps(void) kvm_cpu_cap_clear(X86_FEATURE_INVPCID); if (vmx_pt_mode_is_host_guest()) kvm_cpu_cap_check_and_set(X86_FEATURE_INTEL_PT); + if (vmx_pebs_supported()) { + kvm_cpu_cap_check_and_set(X86_FEATURE_DS); + kvm_cpu_cap_check_and_set(X86_FEATURE_DTES64); + } if (vmx_umip_emulated()) kvm_cpu_cap_set(X86_FEATURE_UMIP); -- 2.30.2
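A rough userspace sketch of how the new vmx_get_perf_capabilities() composes the guest-visible value may help when reviewing the hunk above. The bit positions below are assumptions based on the usual IA32_PERF_CAPABILITIES layout, and guest_perf_capabilities() is a hypothetical stand-in that takes the host MSR value and the two feature checks as plain parameters instead of reading them from hardware.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed IA32_PERF_CAPABILITIES bit layout (not taken from the diff). */
#define PMU_CAP_LBR_FMT        0x3fULL       /* bits 5:0 */
#define PERF_CAP_PEBS_TRAP     (1ULL << 6)
#define PERF_CAP_ARCH_REG      (1ULL << 7)
#define PERF_CAP_PEBS_FORMAT   0xf00ULL      /* bits 11:8 */
#define PMU_CAP_FW_WRITES      (1ULL << 13)
#define PERF_CAP_PEBS_BASELINE (1ULL << 14)
#define PERF_CAP_PEBS_MASK     (PERF_CAP_PEBS_TRAP | PERF_CAP_ARCH_REG | \
                                PERF_CAP_PEBS_FORMAT | PERF_CAP_PEBS_BASELINE)

/*
 * Hypothetical model of the patched helper: full-width writes are always
 * advertised, the LBR format is copied from the host, and the PEBS fields
 * are copied only when guest PEBS is supported.
 */
static uint64_t guest_perf_capabilities(uint64_t host_perf_cap,
                                        bool host_has_pdcm,
                                        bool pebs_supported)
{
    uint64_t perf_cap = PMU_CAP_FW_WRITES;

    /* Without PDCM the host MSR is never read, so treat it as zero. */
    if (!host_has_pdcm)
        host_perf_cap = 0;

    perf_cap |= host_perf_cap & PMU_CAP_LBR_FMT;

    if (pebs_supported)
        perf_cap |= host_perf_cap & PERF_CAP_PEBS_MASK;

    return perf_cap;
}
```

Any host bit outside the LBR and PEBS fields is dropped, which matches the intent of only exposing what KVM can actually virtualize.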
[PATCH v5 15/16] KVM: x86/cpuid: Refactor host/guest CPU model consistency check
For the same purpose, the legacy intel_pmu_lbr_is_compatible() can be
renamed for reuse by more callers, and the comment about the LBR use case
can be deleted along the way.

Signed-off-by: Like Xu
---
 arch/x86/kvm/cpuid.h | 5 +
 arch/x86/kvm/vmx/pmu_intel.c | 12 +---
 arch/x86/kvm/vmx/vmx.c | 2 +-
 arch/x86/kvm/vmx/vmx.h | 1 -
 4 files changed, 7 insertions(+), 13 deletions(-)

diff --git a/arch/x86/kvm/cpuid.h b/arch/x86/kvm/cpuid.h
index ded84d244f19..3114ecff8080 100644
--- a/arch/x86/kvm/cpuid.h
+++ b/arch/x86/kvm/cpuid.h
@@ -278,6 +278,11 @@ static inline int guest_cpuid_model(struct kvm_vcpu *vcpu)
 	return x86_model(best->eax);
 }
 
+static inline bool cpuid_model_is_consistent(struct kvm_vcpu *vcpu)
+{
+	return boot_cpu_data.x86_model == guest_cpuid_model(vcpu);
+}
+
 static inline int guest_cpuid_stepping(struct kvm_vcpu *vcpu)
 {
 	struct kvm_cpuid_entry2 *best;
diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c
index fb297ffb5481..31e0e5e7d5a5 100644
--- a/arch/x86/kvm/vmx/pmu_intel.c
+++ b/arch/x86/kvm/vmx/pmu_intel.c
@@ -173,16 +173,6 @@ static inline struct kvm_pmc *get_fw_gp_pmc(struct kvm_pmu *pmu, u32 msr)
 	return get_gp_pmc(pmu, msr, MSR_IA32_PMC0);
 }
 
-bool intel_pmu_lbr_is_compatible(struct kvm_vcpu *vcpu)
-{
-	/*
-	 * As a first step, a guest could only enable LBR feature if its
-	 * cpu model is the same as the host because the LBR registers
-	 * would be pass-through to the guest and they're model specific.
-	 */
-	return boot_cpu_data.x86_model == guest_cpuid_model(vcpu);
-}
-
 bool intel_pmu_lbr_is_enabled(struct kvm_vcpu *vcpu)
 {
 	struct x86_pmu_lbr *lbr = vcpu_to_lbr_records(vcpu);
@@ -578,7 +568,7 @@ static void intel_pmu_refresh(struct kvm_vcpu *vcpu)
 
 	nested_vmx_pmu_entry_exit_ctls_update(vcpu);
 
-	if (intel_pmu_lbr_is_compatible(vcpu))
+	if (cpuid_model_is_consistent(vcpu))
 		x86_perf_get_lbr(&lbr_desc->records);
 	else
 		lbr_desc->records.nr = 0;
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 4f0e35a0cd0f..5ad12bb76296 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -2258,7 +2258,7 @@ static int vmx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 			if ((data & PMU_CAP_LBR_FMT) !=
 			    (vmx_get_perf_capabilities() & PMU_CAP_LBR_FMT))
 				return 1;
-			if (!intel_pmu_lbr_is_compatible(vcpu))
+			if (!cpuid_model_is_consistent(vcpu))
 				return 1;
 		}
 		ret = kvm_set_msr_common(vcpu, msr_info);
diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h
index 1311f67046aa..28a588d83a01 100644
--- a/arch/x86/kvm/vmx/vmx.h
+++ b/arch/x86/kvm/vmx/vmx.h
@@ -97,7 +97,6 @@ union vmx_exit_reason {
 #define vcpu_to_lbr_records(vcpu) (&to_vmx(vcpu)->lbr_desc.records)
 
 void intel_pmu_cross_mapped_check(struct kvm_pmu *pmu);
-bool intel_pmu_lbr_is_compatible(struct kvm_vcpu *vcpu);
 bool intel_pmu_lbr_is_enabled(struct kvm_vcpu *vcpu);
 
 int intel_pmu_create_guest_lbr_event(struct kvm_vcpu *vcpu);
-- 
2.30.2
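For readers who want to see what "same model" means in practice, here is a small userspace sketch of the comparison. It decodes the display model from a CPUID.01H:EAX signature the way the architecture defines it, then compares host and guest. The helper names cpuid_family(), cpuid_model() and model_is_consistent() are illustrative only and are not the kernel functions.

```c
#include <stdbool.h>
#include <stdint.h>

/* Display family: base family, plus extended family when base is 0xF. */
static unsigned int cpuid_family(uint32_t eax)
{
    unsigned int family = (eax >> 8) & 0xf;

    if (family == 0xf)
        family += (eax >> 20) & 0xff;
    return family;
}

/* Display model: extended model is folded in for family 6 and above. */
static unsigned int cpuid_model(uint32_t eax)
{
    unsigned int model = (eax >> 4) & 0xf;

    if (cpuid_family(eax) >= 0x6)
        model += ((eax >> 16) & 0xf) << 4;
    return model;
}

/* Illustrative equivalent of the check: stepping is ignored on purpose. */
static bool model_is_consistent(uint32_t host_eax, uint32_t guest_eax)
{
    return cpuid_model(host_eax) == cpuid_model(guest_eax);
}
```

Note that only the model is compared, so a guest CPUID that differs from the host in stepping alone still passes.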
[PATCH v5 14/16] KVM: x86/pmu: Add kvm_pmu_cap to optimize perf_get_x86_pmu_capability
The information obtained from the interface perf_get_x86_pmu_capability() doesn't change, so an exportable "struct x86_pmu_capability" is introduced for all guests in the KVM, and it's initialized before hardware_setup(). Signed-off-by: Like Xu --- arch/x86/kvm/cpuid.c | 24 +++- arch/x86/kvm/pmu.c | 3 +++ arch/x86/kvm/pmu.h | 20 arch/x86/kvm/vmx/pmu_intel.c | 17 - arch/x86/kvm/x86.c | 9 - 5 files changed, 42 insertions(+), 31 deletions(-) diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c index 6bd2f8b830e4..b3c751d425b7 100644 --- a/arch/x86/kvm/cpuid.c +++ b/arch/x86/kvm/cpuid.c @@ -680,32 +680,22 @@ static inline int __do_cpuid_func(struct kvm_cpuid_array *array, u32 function) case 9: break; case 0xa: { /* Architectural Performance Monitoring */ - struct x86_pmu_capability cap; union cpuid10_eax eax; union cpuid10_edx edx; - perf_get_x86_pmu_capability(&cap); + eax.split.version_id = kvm_pmu_cap.version; + eax.split.num_counters = kvm_pmu_cap.num_counters_gp; + eax.split.bit_width = kvm_pmu_cap.bit_width_gp; + eax.split.mask_length = kvm_pmu_cap.events_mask_len; - /* -* Only support guest architectural pmu on a host -* with architectural pmu. 
-*/ - if (!cap.version) - memset(&cap, 0, sizeof(cap)); - - eax.split.version_id = min(cap.version, 2); - eax.split.num_counters = cap.num_counters_gp; - eax.split.bit_width = cap.bit_width_gp; - eax.split.mask_length = cap.events_mask_len; - - edx.split.num_counters_fixed = min(cap.num_counters_fixed, MAX_FIXED_COUNTERS); - edx.split.bit_width_fixed = cap.bit_width_fixed; + edx.split.num_counters_fixed = kvm_pmu_cap.num_counters_fixed; + edx.split.bit_width_fixed = kvm_pmu_cap.bit_width_fixed; edx.split.anythread_deprecated = 1; edx.split.reserved1 = 0; edx.split.reserved2 = 0; entry->eax = eax.full; - entry->ebx = cap.events_mask; + entry->ebx = kvm_pmu_cap.events_mask; entry->ecx = 0; entry->edx = edx.full; break; diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c index 666a5e90a3cb..4798bf991b60 100644 --- a/arch/x86/kvm/pmu.c +++ b/arch/x86/kvm/pmu.c @@ -19,6 +19,9 @@ #include "lapic.h" #include "pmu.h" +struct x86_pmu_capability __read_mostly kvm_pmu_cap; +EXPORT_SYMBOL_GPL(kvm_pmu_cap); + /* This is enough to filter the vast majority of currently defined events. */ #define KVM_PMU_EVENT_FILTER_MAX_EVENTS 300 diff --git a/arch/x86/kvm/pmu.h b/arch/x86/kvm/pmu.h index 6c902b2d2d5a..e945cf604c13 100644 --- a/arch/x86/kvm/pmu.h +++ b/arch/x86/kvm/pmu.h @@ -160,6 +160,24 @@ static inline bool pmc_speculative_in_use(struct kvm_pmc *pmc) return pmc->eventsel & ARCH_PERFMON_EVENTSEL_ENABLE; } +extern struct x86_pmu_capability kvm_pmu_cap; + +static inline void kvm_init_pmu_capability(void) +{ + perf_get_x86_pmu_capability(&kvm_pmu_cap); + + /* +* Only support guest architectural pmu on +* a host with architectural pmu. 
+*/ + if (!kvm_pmu_cap.version) + memset(&kvm_pmu_cap, 0, sizeof(kvm_pmu_cap)); + + kvm_pmu_cap.version = min(kvm_pmu_cap.version, 2); + kvm_pmu_cap.num_counters_fixed = min(kvm_pmu_cap.num_counters_fixed, +MAX_FIXED_COUNTERS); +} + void reprogram_gp_counter(struct kvm_pmc *pmc, u64 eventsel); void reprogram_fixed_counter(struct kvm_pmc *pmc, u8 ctrl, int fixed_idx); void reprogram_counter(struct kvm_pmu *pmu, int pmc_idx); @@ -177,9 +195,11 @@ void kvm_pmu_init(struct kvm_vcpu *vcpu); void kvm_pmu_cleanup(struct kvm_vcpu *vcpu); void kvm_pmu_destroy(struct kvm_vcpu *vcpu); int kvm_vm_ioctl_set_pmu_event_filter(struct kvm *kvm, void __user *argp); +void kvm_init_pmu_capability(void); bool is_vmware_backdoor_pmc(u32 pmc_idx); extern struct kvm_pmu_ops intel_pmu_ops; extern struct kvm_pmu_ops amd_pmu_ops; + #endif /* __KVM_X86_PMU_H */ diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c index 989e7245d790..fb297ffb5481 100644 --- a/arch/x86/kvm/vmx/pmu_intel.c +++ b/arch/x86/kvm/vmx/pmu_intel.c @@ -504,8 +504,6 @@ static void intel_pmu_refresh(struct kvm_vcpu *vcpu) { struct kvm_pmu *pmu = vcpu_to_pmu(vcpu); struct lbr_desc *lbr_desc = vcpu_to_lbr_desc(vcpu); - - struct x86_pmu_capability x86_pmu; struct kvm_cpuid_entry2 *entry; union cpuid10_eax eax;
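The clamping done once in kvm_init_pmu_capability() can be modelled in a few lines of plain C. This is only a sketch: struct pmu_capability below is a cut-down, hypothetical version of the kernel structure, and min_int() stands in for the kernel's min() macro.

```c
#include <string.h>

#define MAX_FIXED_COUNTERS 3

/* Hypothetical, reduced stand-in for struct x86_pmu_capability. */
struct pmu_capability {
    int version;
    int num_counters_gp;
    int num_counters_fixed;
};

static int min_int(int a, int b)
{
    return a < b ? a : b;
}

/*
 * Normalize the host capability once, so that later CPUID and PMU
 * refresh paths can read the cached values without re-clamping.
 */
static void init_pmu_capability(struct pmu_capability *cap)
{
    /* No architectural PMU on the host: expose nothing to guests. */
    if (!cap->version)
        memset(cap, 0, sizeof(*cap));

    cap->version = min_int(cap->version, 2);
    cap->num_counters_fixed = min_int(cap->num_counters_fixed,
                                      MAX_FIXED_COUNTERS);
}
```

Doing the clamp at initialization is what lets the CPUID 0xA leaf in the diff above copy the fields directly.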
[PATCH v5 13/16] KVM: x86/pmu: Disable guest PEBS temporarily in two rare situations
The guest PEBS will be disabled when some users try to perf KVM and its user-space through the same PEBS facility OR when the host perf doesn't schedule the guest PEBS counter in a one-to-one mapping manner (neither of these are typical scenarios). The PEBS records in the guest DS buffer are still accurate and the above two restrictions will be checked before each vm-entry only if guest PEBS is deemed to be enabled. Suggested-by: Wei Wang Signed-off-by: Like Xu --- arch/x86/events/intel/core.c| 11 +-- arch/x86/include/asm/kvm_host.h | 9 + arch/x86/kvm/vmx/pmu_intel.c| 19 +++ arch/x86/kvm/vmx/vmx.c | 4 arch/x86/kvm/vmx/vmx.h | 1 + 5 files changed, 42 insertions(+), 2 deletions(-) diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c index dc6335a054ff..8786a1d39940 100644 --- a/arch/x86/events/intel/core.c +++ b/arch/x86/events/intel/core.c @@ -3895,8 +3895,15 @@ static struct perf_guest_switch_msr *intel_guest_get_msrs(int *nr, void *data) .guest = pebs_mask & ~cpuc->intel_ctrl_host_mask, }; - /* Set hw GLOBAL_CTRL bits for PEBS counter when it runs for guest */ - arr[0].guest |= arr[*nr].guest; + if (arr[*nr].host) { + /* Disable guest PEBS if host PEBS is enabled. */ + arr[*nr].guest = 0; + } else { + /* Disable guest PEBS for cross-mapped PEBS counters. */ + arr[*nr].guest &= ~pmu->host_cross_mapped_mask; + /* Set hw GLOBAL_CTRL bits for PEBS counter when it runs for guest */ + arr[0].guest |= arr[*nr].guest; + } ++(*nr); return arr; diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index e1a6b7c0537c..5aadf6060011 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -466,6 +466,15 @@ struct kvm_pmu { u64 pebs_data_cfg; u64 pebs_data_cfg_mask; + /* +* If a guest counter is cross-mapped to host counter with different +* index, its PEBS capability will be temporarily disabled. 
+* +* The user should make sure that this mask is updated +* after disabling interrupts and before perf_guest_get_msrs(); +*/ + u64 host_cross_mapped_mask; + /* * The gate to release perf_events not marked in * pmc_in_use only once in a vcpu time slice. diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c index c846d3eef7a7..989e7245d790 100644 --- a/arch/x86/kvm/vmx/pmu_intel.c +++ b/arch/x86/kvm/vmx/pmu_intel.c @@ -770,6 +770,25 @@ static void intel_pmu_cleanup(struct kvm_vcpu *vcpu) intel_pmu_release_guest_lbr_event(vcpu); } +void intel_pmu_cross_mapped_check(struct kvm_pmu *pmu) +{ + struct kvm_pmc *pmc = NULL; + int bit; + + for_each_set_bit(bit, (unsigned long *)&pmu->global_ctrl, +X86_PMC_IDX_MAX) { + pmc = kvm_x86_ops.pmu_ops->pmc_idx_to_pmc(pmu, bit); + + if (!pmc || !pmc_speculative_in_use(pmc) || + !pmc_is_enabled(pmc)) + continue; + + if (pmc->perf_event && (pmc->idx != pmc->perf_event->hw.idx)) + pmu->host_cross_mapped_mask |= + BIT_ULL(pmc->perf_event->hw.idx); + } +} + struct kvm_pmu_ops intel_pmu_ops = { .find_arch_event = intel_find_arch_event, .find_fixed_event = intel_find_fixed_event, diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c index 58673351c475..4f0e35a0cd0f 100644 --- a/arch/x86/kvm/vmx/vmx.c +++ b/arch/x86/kvm/vmx/vmx.c @@ -6539,6 +6539,10 @@ static void atomic_switch_perf_msrs(struct vcpu_vmx *vmx) struct perf_guest_switch_msr *msrs; struct kvm_pmu *pmu = vcpu_to_pmu(&vmx->vcpu); + pmu->host_cross_mapped_mask = 0; + if (pmu->pebs_enable & pmu->global_ctrl) + intel_pmu_cross_mapped_check(pmu); + /* Note, nr_msrs may be garbage if perf_guest_get_msrs() returns NULL. 
*/ msrs = perf_guest_get_msrs(&nr_msrs, (void *)pmu); if (!msrs) diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h index 7886a08505cc..1311f67046aa 100644 --- a/arch/x86/kvm/vmx/vmx.h +++ b/arch/x86/kvm/vmx/vmx.h @@ -96,6 +96,7 @@ union vmx_exit_reason { #define vcpu_to_lbr_desc(vcpu) (&to_vmx(vcpu)->lbr_desc) #define vcpu_to_lbr_records(vcpu) (&to_vmx(vcpu)->lbr_desc.records) +void intel_pmu_cross_mapped_check(struct kvm_pmu *pmu); bool intel_pmu_lbr_is_compatible(struct kvm_vcpu *vcpu); bool intel_pmu_lbr_is_enabled(struct kvm_vcpu *vcpu); -- 2.30.2
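To make the two restrictions easier to follow, here is a simplified model of the decision taken before each vm-entry. It is a sketch under stated assumptions: struct vpmc, cross_mapped_mask() and guest_pebs_enable() are hypothetical names, counter indices are assumed to be below 64, and the PEBS bits are expressed in host counter indices as in intel_guest_get_msrs().

```c
#include <stdint.h>

/* Hypothetical view of one guest counter and its backing host counter. */
struct vpmc {
    int guest_idx;  /* index the guest programmed */
    int host_idx;   /* index host perf assigned, -1 if no event */
    int in_use;     /* non-zero when the guest counter is enabled */
};

/* Collect host counters that back a guest counter of a different index. */
static uint64_t cross_mapped_mask(const struct vpmc *pmcs, int n)
{
    uint64_t mask = 0;

    for (int i = 0; i < n; i++) {
        if (!pmcs[i].in_use || pmcs[i].host_idx < 0)
            continue;
        if (pmcs[i].guest_idx != pmcs[i].host_idx)
            mask |= 1ULL << pmcs[i].host_idx;
    }
    return mask;
}

/*
 * Value loaded into PEBS_ENABLE for the guest: nothing while the host
 * itself uses PEBS, otherwise the guest bits minus cross-mapped ones.
 */
static uint64_t guest_pebs_enable(uint64_t host_pebs, uint64_t guest_pebs,
                                  uint64_t cross_mask)
{
    if (host_pebs)
        return 0;
    return guest_pebs & ~cross_mask;
}
```

Both cases only suppress new records for the affected counters; records already written to the guest DS buffer are left alone.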
[PATCH v5 10/16] KVM: x86: Set PEBS_UNAVAIL in IA32_MISC_ENABLE when PEBS is enabled
Bit 12 of IA32_MISC_ENABLE represents "Processor Event Based Sampling
Unavailable (RO)": 1 = PEBS is not supported, 0 = PEBS is supported.

A guest write that sets this PEBS_UNAVAIL bit will raise #GP(0) when
guest PEBS is enabled. Some PEBS drivers in the guest may care about this bit.

Signed-off-by: Like Xu
---
 arch/x86/kvm/vmx/pmu_intel.c | 2 ++
 arch/x86/kvm/x86.c | 4
 2 files changed, 6 insertions(+)

diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c
index 58f32a55cc2e..c846d3eef7a7 100644
--- a/arch/x86/kvm/vmx/pmu_intel.c
+++ b/arch/x86/kvm/vmx/pmu_intel.c
@@ -588,6 +588,7 @@ static void intel_pmu_refresh(struct kvm_vcpu *vcpu)
 	bitmap_set(pmu->all_valid_pmc_idx, INTEL_PMC_IDX_FIXED_VLBR, 1);
 
 	if (vcpu->arch.perf_capabilities & PERF_CAP_PEBS_FORMAT) {
+		vcpu->arch.ia32_misc_enable_msr &= ~MSR_IA32_MISC_ENABLE_PEBS_UNAVAIL;
 		if (vcpu->arch.perf_capabilities & PERF_CAP_PEBS_BASELINE) {
 			pmu->pebs_enable_mask = ~pmu->global_ctrl;
 			pmu->reserved_bits &= ~ICL_EVENTSEL_ADAPTIVE;
@@ -597,6 +598,7 @@ static void intel_pmu_refresh(struct kvm_vcpu *vcpu)
 			}
 			pmu->pebs_data_cfg_mask = ~0xff0full;
 		} else {
+			vcpu->arch.ia32_misc_enable_msr |= MSR_IA32_MISC_ENABLE_PEBS_UNAVAIL;
 			pmu->pebs_enable_mask =
 				~((1ull << pmu->nr_arch_gp_counters) - 1);
 		}
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 1a64e816e06d..ed38f1dada63 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -3126,6 +3126,10 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 		break;
 	case MSR_IA32_MISC_ENABLE:
 		data &= ~MSR_IA32_MISC_ENABLE_EMON;
+		if (!msr_info->host_initiated &&
+		    (vcpu->arch.perf_capabilities & PERF_CAP_PEBS_FORMAT) &&
+		    (data & MSR_IA32_MISC_ENABLE_PEBS_UNAVAIL))
+			return 1;
 		if (!kvm_check_has_quirk(vcpu->kvm, KVM_X86_QUIRK_MISC_ENABLE_NO_MWAIT) &&
 		    ((vcpu->arch.ia32_misc_enable_msr ^ data) & MSR_IA32_MISC_ENABLE_MWAIT)) {
 			if (!guest_cpuid_has(vcpu, X86_FEATURE_XMM3))
-- 
2.30.2
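The write check added to kvm_set_msr_common() boils down to a three-way condition, sketched below in standalone C. misc_enable_write_faults() is a hypothetical helper; the two booleans replace msr_info->host_initiated and the PERF_CAP_PEBS_FORMAT test on the vCPU.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit 12 of IA32_MISC_ENABLE, as described in the commit message. */
#define MISC_ENABLE_PEBS_UNAVAIL (1ULL << 12)

/*
 * Returns true when the write should be rejected (the guest sees #GP):
 * only guest-initiated writes that set the read-only bit while the vCPU
 * advertises PEBS are refused. Host (userspace) writes always pass.
 */
static bool misc_enable_write_faults(uint64_t data, bool host_initiated,
                                     bool guest_has_pebs)
{
    return !host_initiated && guest_has_pebs &&
           (data & MISC_ENABLE_PEBS_UNAVAIL) != 0;
}
```

Leaving host-initiated writes alone matters for save/restore, where userspace may legitimately write back whatever value it read earlier.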
[PATCH v5 11/16] KVM: x86/pmu: Adjust precise_ip to emulate Ice Lake guest PDIR counter
The PEBS-PDIR facility on Ice Lake server is supported on IA32_FIXED0 only.
If the guest configures counter 32 and PEBS is enabled, the PEBS-PDIR
facility is supposed to be used, in which case KVM adjusts attr.precise_ip
to 3 and requests host perf to assign the exactly requested counter or fail.

The cpu model check is also required since some platforms may place the
PEBS-PDIR facility in another counter index.

Signed-off-by: Like Xu
---
 arch/x86/kvm/pmu.c | 2 ++
 arch/x86/kvm/pmu.h | 7 +++
 2 files changed, 9 insertions(+)

diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
index 0f86c1142f17..d3f746877d1b 100644
--- a/arch/x86/kvm/pmu.c
+++ b/arch/x86/kvm/pmu.c
@@ -149,6 +149,8 @@ static void pmc_reprogram_counter(struct kvm_pmc *pmc, u32 type,
 		 * in the PEBS record is calibrated on the guest side.
 		 */
 		attr.precise_ip = 1;
+		if (x86_match_cpu(vmx_icl_pebs_cpu) && pmc->idx == 32)
+			attr.precise_ip = 3;
 	}
 
 	event = perf_event_create_kernel_counter(&attr, -1, current,
diff --git a/arch/x86/kvm/pmu.h b/arch/x86/kvm/pmu.h
index 7b30bc967af3..d9157128e6eb 100644
--- a/arch/x86/kvm/pmu.h
+++ b/arch/x86/kvm/pmu.h
@@ -4,6 +4,8 @@
 #include
 
+#include
+
 #define vcpu_to_pmu(vcpu) (&(vcpu)->arch.pmu)
 #define pmu_to_vcpu(pmu) (container_of((pmu), struct kvm_vcpu, arch.pmu))
 #define pmc_to_pmu(pmc) (&(pmc)->vcpu->arch.pmu)
@@ -16,6 +18,11 @@
 #define VMWARE_BACKDOOR_PMC_APPARENT_TIME 0x10002
 
 #define MAX_FIXED_COUNTERS 3
+static const struct x86_cpu_id vmx_icl_pebs_cpu[] = {
+	X86_MATCH_INTEL_FAM6_MODEL(ICELAKE_D, NULL),
+	X86_MATCH_INTEL_FAM6_MODEL(ICELAKE_X, NULL),
+	{}
+};
 
 struct kvm_event_hw_type_mapping {
 	u8 eventsel;
-- 
2.30.2
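The precise_ip selection in this patch can be read as a tiny pure function. The sketch below assumes that index 32 is fixed counter 0 (INTEL_PMC_IDX_FIXED) and replaces the x86_match_cpu() lookup with a boolean parameter; guest_pebs_precise_ip() is a made-up name for illustration.

```c
#include <stdbool.h>

/* Index of fixed counter 0 in the unified PMC index space (assumed 32). */
#define PMC_IDX_FIXED0 32

/*
 * precise_ip chosen for a guest PEBS event: level 3 asks host perf for
 * the PDIR counter, level 1 is the default for every other PEBS event.
 */
static int guest_pebs_precise_ip(bool host_is_icl_server, int pmc_idx)
{
    if (host_is_icl_server && pmc_idx == PMC_IDX_FIXED0)
        return 3;
    return 1;
}
```

On a host that is not in the Ice Lake server list the function never returns 3, which is why the model check is part of the condition.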
[PATCH v5 12/16] KVM: x86/pmu: Move pmc_speculative_in_use() to arch/x86/kvm/pmu.h
It allows this inline function to be reused by more callers in more files, such as pmu_intel.c. Signed-off-by: Like Xu --- arch/x86/kvm/pmu.c | 11 --- arch/x86/kvm/pmu.h | 11 +++ 2 files changed, 11 insertions(+), 11 deletions(-) diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c index d3f746877d1b..666a5e90a3cb 100644 --- a/arch/x86/kvm/pmu.c +++ b/arch/x86/kvm/pmu.c @@ -477,17 +477,6 @@ void kvm_pmu_init(struct kvm_vcpu *vcpu) kvm_pmu_refresh(vcpu); } -static inline bool pmc_speculative_in_use(struct kvm_pmc *pmc) -{ - struct kvm_pmu *pmu = pmc_to_pmu(pmc); - - if (pmc_is_fixed(pmc)) - return fixed_ctrl_field(pmu->fixed_ctr_ctrl, - pmc->idx - INTEL_PMC_IDX_FIXED) & 0x3; - - return pmc->eventsel & ARCH_PERFMON_EVENTSEL_ENABLE; -} - /* Release perf_events for vPMCs that have been unused for a full time slice. */ void kvm_pmu_cleanup(struct kvm_vcpu *vcpu) { diff --git a/arch/x86/kvm/pmu.h b/arch/x86/kvm/pmu.h index d9157128e6eb..6c902b2d2d5a 100644 --- a/arch/x86/kvm/pmu.h +++ b/arch/x86/kvm/pmu.h @@ -149,6 +149,17 @@ static inline u64 get_sample_period(struct kvm_pmc *pmc, u64 counter_value) return sample_period; } +static inline bool pmc_speculative_in_use(struct kvm_pmc *pmc) +{ + struct kvm_pmu *pmu = pmc_to_pmu(pmc); + + if (pmc_is_fixed(pmc)) + return fixed_ctrl_field(pmu->fixed_ctr_ctrl, + pmc->idx - INTEL_PMC_IDX_FIXED) & 0x3; + + return pmc->eventsel & ARCH_PERFMON_EVENTSEL_ENABLE; +} + void reprogram_gp_counter(struct kvm_pmc *pmc, u64 eventsel); void reprogram_fixed_counter(struct kvm_pmc *pmc, u8 ctrl, int fixed_idx); void reprogram_counter(struct kvm_pmu *pmu, int pmc_idx); -- 2.30.2
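As a reading aid, the moved helper can be exercised outside the kernel with a simplified model. The sketch below is an approximation: it decides "fixed counter" by index (the kernel checks the pmc type), takes the relevant register values as parameters, and assumes the enable bit of the event select register is bit 22 and that fixed counter indices start at 32.

```c
#include <stdbool.h>
#include <stdint.h>

#define INTEL_PMC_IDX_FIXED 32           /* assumed first fixed index */
#define EVENTSEL_ENABLE     (1ULL << 22) /* assumed EN bit of PERFEVTSELx */

/* Each fixed counter owns a 4-bit field in IA32_FIXED_CTR_CTRL. */
static unsigned int fixed_ctrl_field(uint64_t ctrl, int fixed_idx)
{
    return (unsigned int)((ctrl >> (fixed_idx * 4)) & 0xf);
}

/*
 * A fixed counter counts when either of its two enable bits (OS/USR) is
 * set; a general-purpose counter counts when its EN bit is set.
 */
static bool pmc_in_use(int idx, uint64_t eventsel, uint64_t fixed_ctr_ctrl)
{
    if (idx >= INTEL_PMC_IDX_FIXED)
        return (fixed_ctrl_field(fixed_ctr_ctrl,
                                 idx - INTEL_PMC_IDX_FIXED) & 0x3) != 0;

    return (eventsel & EVENTSEL_ENABLE) != 0;
}
```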
[PATCH v5 09/16] KVM: x86/pmu: Add PEBS_DATA_CFG MSR emulation to support adaptive PEBS
If IA32_PERF_CAPABILITIES.PEBS_BASELINE [bit 14] is set, the adaptive
PEBS is supported. The PEBS_DATA_CFG MSR and adaptive record enable bits
(IA32_PERFEVTSELx.Adaptive_Record and IA32_FIXED_CTR_CTRL.FCx_Adaptive_Record)
are also supported.

Adaptive PEBS provides software the capability to configure the PEBS
records to capture only the data of interest, keeping the record size
compact. An overflow of PMCx results in generation of an adaptive PEBS
record with state information based on the selections specified in
MSR_PEBS_DATA_CFG. By default, the record only contains the Basic group.

When guest adaptive PEBS is enabled, the IA32_PEBS_ENABLE MSR will
be added to the perf_guest_switch_msr() and switched during the VMX
transitions just like CORE_PERF_GLOBAL_CTRL MSR.

Co-developed-by: Luwei Kang
Signed-off-by: Luwei Kang
Signed-off-by: Like Xu
---
 arch/x86/events/intel/core.c| 8
 arch/x86/include/asm/kvm_host.h | 2 ++
 arch/x86/kvm/vmx/pmu_intel.c| 16
 3 files changed, 26 insertions(+)

diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index 6cd857066d69..dc6335a054ff 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -3881,6 +3881,14 @@ static struct perf_guest_switch_msr *intel_guest_get_msrs(int *nr, void *data)
 		.guest = pmu->ds_area,
 	};
 
+	if (x86_pmu.intel_cap.pebs_baseline) {
+		arr[(*nr)++] = (struct perf_guest_switch_msr){
+			.msr = MSR_PEBS_DATA_CFG,
+			.host = cpuc->pebs_data_cfg,
+			.guest = pmu->pebs_data_cfg,
+		};
+	}
+
 	arr[*nr] = (struct perf_guest_switch_msr){
 		.msr = MSR_IA32_PEBS_ENABLE,
 		.host = cpuc->pebs_enabled & ~cpuc->intel_ctrl_guest_mask,
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index c9bc8352b1f0..e1a6b7c0537c 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -463,6 +463,8 @@ struct kvm_pmu {
 	u64 ds_area;
 	u64 pebs_enable;
 	u64 pebs_enable_mask;
+	u64 pebs_data_cfg;
+	u64 pebs_data_cfg_mask;
 
 	/*
 	 * The gate to release perf_events not marked in
diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c
index 5584b8dfadb3..58f32a55cc2e 100644
--- a/arch/x86/kvm/vmx/pmu_intel.c
+++ b/arch/x86/kvm/vmx/pmu_intel.c
@@ -226,6 +226,9 @@ static bool intel_is_valid_msr(struct kvm_vcpu *vcpu, u32 msr)
 	case MSR_IA32_DS_AREA:
 		ret = guest_cpuid_has(vcpu, X86_FEATURE_DS);
 		break;
+	case MSR_PEBS_DATA_CFG:
+		ret = vcpu->arch.perf_capabilities & PERF_CAP_PEBS_BASELINE;
+		break;
 	default:
 		ret = get_gp_pmc(pmu, msr, MSR_IA32_PERFCTR0) ||
 			get_gp_pmc(pmu, msr, MSR_P6_EVNTSEL0) ||
@@ -379,6 +382,9 @@ static int intel_pmu_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 	case MSR_IA32_DS_AREA:
 		msr_info->data = pmu->ds_area;
 		return 0;
+	case MSR_PEBS_DATA_CFG:
+		msr_info->data = pmu->pebs_data_cfg;
+		return 0;
 	default:
 		if ((pmc = get_gp_pmc(pmu, msr, MSR_IA32_PERFCTR0)) ||
 		    (pmc = get_gp_pmc(pmu, msr, MSR_IA32_PMC0))) {
@@ -452,6 +458,14 @@ static int intel_pmu_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 			return 1;
 		pmu->ds_area = data;
 		return 0;
+	case MSR_PEBS_DATA_CFG:
+		if (pmu->pebs_data_cfg == data)
+			return 0;
+		if (!(data & pmu->pebs_data_cfg_mask)) {
+			pmu->pebs_data_cfg = data;
+			return 0;
+		}
+		break;
 	default:
 		if ((pmc = get_gp_pmc(pmu, msr, MSR_IA32_PERFCTR0)) ||
 		    (pmc = get_gp_pmc(pmu, msr, MSR_IA32_PMC0))) {
@@ -505,6 +519,7 @@ static void intel_pmu_refresh(struct kvm_vcpu *vcpu)
 	pmu->reserved_bits = 0x0020ull;
 	pmu->fixed_ctr_ctrl_mask = ~0ull;
 	pmu->pebs_enable_mask = ~0ull;
+	pmu->pebs_data_cfg_mask = ~0ull;
 
 	entry = kvm_find_cpuid_entry(vcpu, 0xa, 0);
 	if (!entry)
@@ -580,6 +595,7 @@ static void intel_pmu_refresh(struct kvm_vcpu *vcpu)
 				pmu->fixed_ctr_ctrl_mask &=
 					~(1ULL << (INTEL_PMC_IDX_FIXED + i * 4));
 			}
+			pmu->pebs_data_cfg_mask = ~0xff0full;
 		} else {
 			pmu->pebs_enable_mask =
 				~((1ull << pmu->nr_arch_gp_counters) - 1);
-- 
2.30.2
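The WRMSR path for MSR_PEBS_DATA_CFG follows the usual reserved-bit pattern, which the following standalone sketch reproduces. Everything here is illustrative: struct vpmu_cfg and set_pebs_data_cfg() are hypothetical, and PEBS_DATA_CFG_VALID is an assumption taken from the SDM field layout (group-select bits 3:0 and the LBR entry count in bits 31:24), not a value copied from the diff as archived.

```c
#include <stdint.h>

/* Assumed writable bits: bits 3:0 and bits 31:24 (SDM layout). */
#define PEBS_DATA_CFG_VALID 0xff00000fULL

/* Hypothetical slice of the vPMU state touched by this MSR. */
struct vpmu_cfg {
    uint64_t pebs_data_cfg;       /* last value accepted from the guest */
    uint64_t pebs_data_cfg_mask;  /* reserved bits, 1 = must be zero */
};

/* Returns 0 on success, 1 when the write must fault. */
static int set_pebs_data_cfg(struct vpmu_cfg *pmu, uint64_t data)
{
    /* Rewriting the current value is always harmless. */
    if (pmu->pebs_data_cfg == data)
        return 0;

    /* Accept only values that leave every reserved bit clear. */
    if (!(data & pmu->pebs_data_cfg_mask)) {
        pmu->pebs_data_cfg = data;
        return 0;
    }
    return 1;
}
```

A rejected write leaves the stored value unchanged, so the value later switched in on vm-entry is always one the guest was allowed to set.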
[PATCH v5 08/16] KVM: x86/pmu: Add IA32_DS_AREA MSR emulation to support guest DS
When CPUID.01H:EDX.DS[21] is set, the IA32_DS_AREA MSR exists and points to the linear address of the first byte of the DS buffer management area, which is used to manage the PEBS records. When guest PEBS is enabled, the MSR_IA32_DS_AREA MSR will be added to the perf_guest_switch_msr() and switched during the VMX transitions just like CORE_PERF_GLOBAL_CTRL MSR. The WRMSR to IA32_DS_AREA MSR brings a #GP(0) if the source register contains a non-canonical address. Originally-by: Andi Kleen Co-developed-by: Kan Liang Signed-off-by: Kan Liang Signed-off-by: Like Xu --- arch/x86/events/intel/core.c| 11 ++- arch/x86/include/asm/kvm_host.h | 1 + arch/x86/kvm/vmx/pmu_intel.c| 11 +++ 3 files changed, 22 insertions(+), 1 deletion(-) diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c index 4e5ed12cb52d..6cd857066d69 100644 --- a/arch/x86/events/intel/core.c +++ b/arch/x86/events/intel/core.c @@ -21,6 +21,7 @@ #include #include #include +#include #include "../perf_event.h" @@ -3838,6 +3839,8 @@ static struct perf_guest_switch_msr *intel_guest_get_msrs(int *nr, void *data) { struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events); struct perf_guest_switch_msr *arr = cpuc->guest_switch_msrs; + struct debug_store *ds = __this_cpu_read(cpu_hw_events.ds); + struct kvm_pmu *pmu = (struct kvm_pmu *)data; u64 pebs_mask = (x86_pmu.flags & PMU_FL_PEBS_ALL) ? 
cpuc->pebs_enabled : (cpuc->pebs_enabled & PEBS_COUNTER_MASK); @@ -3849,7 +3852,7 @@ static struct perf_guest_switch_msr *intel_guest_get_msrs(int *nr, void *data) (~cpuc->intel_ctrl_host_mask | ~pebs_mask), }; - if (!x86_pmu.pebs) + if (!pmu || !x86_pmu.pebs_vmx) return arr; /* @@ -3872,6 +3875,12 @@ static struct perf_guest_switch_msr *intel_guest_get_msrs(int *nr, void *data) if (!x86_pmu.pebs_vmx) return arr; + arr[(*nr)++] = (struct perf_guest_switch_msr){ + .msr = MSR_IA32_DS_AREA, + .host = (unsigned long)ds, + .guest = pmu->ds_area, + }; + arr[*nr] = (struct perf_guest_switch_msr){ .msr = MSR_IA32_PEBS_ENABLE, .host = cpuc->pebs_enabled & ~cpuc->intel_ctrl_guest_mask, diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index a48abcad3329..c9bc8352b1f0 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -460,6 +460,7 @@ struct kvm_pmu { DECLARE_BITMAP(all_valid_pmc_idx, X86_PMC_IDX_MAX); DECLARE_BITMAP(pmc_in_use, X86_PMC_IDX_MAX); + u64 ds_area; u64 pebs_enable; u64 pebs_enable_mask; diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c index 9938b485c31c..5584b8dfadb3 100644 --- a/arch/x86/kvm/vmx/pmu_intel.c +++ b/arch/x86/kvm/vmx/pmu_intel.c @@ -223,6 +223,9 @@ static bool intel_is_valid_msr(struct kvm_vcpu *vcpu, u32 msr) case MSR_IA32_PEBS_ENABLE: ret = vcpu->arch.perf_capabilities & PERF_CAP_PEBS_FORMAT; break; + case MSR_IA32_DS_AREA: + ret = guest_cpuid_has(vcpu, X86_FEATURE_DS); + break; default: ret = get_gp_pmc(pmu, msr, MSR_IA32_PERFCTR0) || get_gp_pmc(pmu, msr, MSR_P6_EVNTSEL0) || @@ -373,6 +376,9 @@ static int intel_pmu_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info) case MSR_IA32_PEBS_ENABLE: msr_info->data = pmu->pebs_enable; return 0; + case MSR_IA32_DS_AREA: + msr_info->data = pmu->ds_area; + return 0; default: if ((pmc = get_gp_pmc(pmu, msr, MSR_IA32_PERFCTR0)) || (pmc = get_gp_pmc(pmu, msr, MSR_IA32_PMC0))) { @@ -441,6 +447,11 @@ static int 
intel_pmu_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info) return 0; } break; + case MSR_IA32_DS_AREA: + if (is_noncanonical_address(data, vcpu)) + return 1; + pmu->ds_area = data; + return 0; default: if ((pmc = get_gp_pmc(pmu, msr, MSR_IA32_PERFCTR0)) || (pmc = get_gp_pmc(pmu, msr, MSR_IA32_PMC0))) { -- 2.30.2
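The canonical-address rule behind the #GP(0) on IA32_DS_AREA writes is easy to check in isolation. The sketch below assumes 48-bit virtual addresses (4-level paging); the kernel helper derives the width from the vCPU, so is_noncanonical_48() and set_ds_area() are simplified, hypothetical stand-ins.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * With 48-bit virtual addresses, bits 63:47 must all equal bit 47.
 * Shifting right by 47 leaves exactly those 17 bits.
 */
static bool is_noncanonical_48(uint64_t addr)
{
    uint64_t top = addr >> 47;

    return top != 0 && top != 0x1ffffULL;
}

/* Returns 0 and stores the value, or 1 when the write must fault. */
static int set_ds_area(uint64_t *ds_area, uint64_t data)
{
    if (is_noncanonical_48(data))
        return 1;
    *ds_area = data;
    return 0;
}
```

Rejecting bad values at write time matters because the stored value is later loaded into the real MSR on vm-entry through the switch list.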
[PATCH v5 06/16] KVM: x86/pmu: Reprogram PEBS event to emulate guest PEBS counter
When a guest counter is configured as a PEBS counter through
IA32_PEBS_ENABLE, a guest PEBS event will be reprogrammed by
configuring a non-zero precision level in the perf_event_attr.

The guest PEBS overflow PMI bit is set in the guest GLOBAL_STATUS MSR
when the PEBS facility generates a PEBS overflow PMI based on the guest
IA32_DS_AREA MSR.

Even with the same counter index and the same event code and mask,
guest PEBS events will not be reused for non-PEBS events.

Originally-by: Andi Kleen
Co-developed-by: Kan Liang
Signed-off-by: Kan Liang
Signed-off-by: Like Xu
---
 arch/x86/kvm/pmu.c | 34 --
 1 file changed, 32 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
index 827886c12c16..0f86c1142f17 100644
--- a/arch/x86/kvm/pmu.c
+++ b/arch/x86/kvm/pmu.c
@@ -74,11 +74,21 @@ static void kvm_perf_overflow_intr(struct perf_event *perf_event,
 {
 	struct kvm_pmc *pmc = perf_event->overflow_handler_context;
 	struct kvm_pmu *pmu = pmc_to_pmu(pmc);
+	bool skip_pmi = false;
 
 	if (!test_and_set_bit(pmc->idx, pmu->reprogram_pmi)) {
-		__set_bit(pmc->idx, (unsigned long *)&pmu->global_status);
+		if (perf_event->attr.precise_ip) {
+			/* Indicate PEBS overflow PMI to guest. */
+			skip_pmi = __test_and_set_bit(GLOBAL_STATUS_BUFFER_OVF_BIT,
+						      (unsigned long *)&pmu->global_status);
+		} else {
+			__set_bit(pmc->idx, (unsigned long *)&pmu->global_status);
+		}
 		kvm_make_request(KVM_REQ_PMU, pmc->vcpu);
 
+		if (skip_pmi)
+			return;
+
 		/*
 		 * Inject PMI. If vcpu was in a guest mode during NMI PMI
 		 * can be ejected on a guest mode re-entry. Otherwise we can't
@@ -99,6 +109,7 @@ static void pmc_reprogram_counter(struct kvm_pmc *pmc, u32 type,
 				  bool exclude_kernel, bool intr,
 				  bool in_tx, bool in_tx_cp)
 {
+	struct kvm_pmu *pmu = vcpu_to_pmu(pmc->vcpu);
 	struct perf_event *event;
 	struct perf_event_attr attr = {
 		.type = type,
@@ -110,6 +121,7 @@ static void pmc_reprogram_counter(struct kvm_pmc *pmc, u32 type,
 		.exclude_kernel = exclude_kernel,
 		.config = config,
 	};
+	bool pebs = test_bit(pmc->idx, (unsigned long *)&pmu->pebs_enable);
 
 	attr.sample_period = get_sample_period(pmc, pmc->counter);
 
@@ -124,9 +136,23 @@ static void pmc_reprogram_counter(struct kvm_pmc *pmc, u32 type,
 		attr.sample_period = 0;
 		attr.config |= HSW_IN_TX_CHECKPOINTED;
 	}
+	if (pebs) {
+		/*
+		 * The non-zero precision level of guest event makes the ordinary
+		 * guest event becomes a guest PEBS event and triggers the host
+		 * PEBS PMI handler to determine whether the PEBS overflow PMI
+		 * comes from the host counters or the guest.
+		 *
+		 * For most PEBS hardware events, the difference in the software
+		 * precision levels of guest and host PEBS events will not affect
+		 * the accuracy of the PEBS profiling result, because the "event IP"
+		 * in the PEBS record is calibrated on the guest side.
+		 */
+		attr.precise_ip = 1;
+	}
 
 	event = perf_event_create_kernel_counter(&attr, -1, current,
-						 intr ? kvm_perf_overflow_intr :
+						 (intr || pebs) ? kvm_perf_overflow_intr :
 						 kvm_perf_overflow, pmc);
 	if (IS_ERR(event)) {
 		pr_debug_ratelimited("kvm_pmu: event creation failed %ld for pmc->idx = %d\n",
@@ -161,6 +187,10 @@ static bool pmc_resume_counter(struct kvm_pmc *pmc)
 			      get_sample_period(pmc, pmc->counter)))
 		return false;
 
+	if (!test_bit(pmc->idx, (unsigned long *)&pmc_to_pmu(pmc)->pebs_enable) &&
+	    pmc->perf_event->attr.precise_ip)
+		return false;
+
 	/* reuse perf_event to serve as pmc_reprogram_counter() does*/
 	perf_event_enable(pmc->perf_event);
 
-- 
2.30.2
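Note on the overflow handler in this patch: a PEBS event reports through
the single buffer-overflow bit (bit 62 of GLOBAL_STATUS) instead of its
own counter bit, and a second PEBS overflow that finds the bit already
set does not queue another PMI. The following is a rough user-space model
of just that decision, assuming a simplified PMU with one status word. It
leaves out the reprogram_pmi guard and the irq_work/NMI handling, and the
names toy_pmu, toy_test_and_set() and toy_overflow() are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* PEBS buffer-overflow flag position in GLOBAL_STATUS. */
#define TOY_BUFFER_OVF_BIT 62

/* Hypothetical, heavily reduced stand-in for struct kvm_pmu. */
struct toy_pmu {
	uint64_t global_status;
	unsigned int pmi_injected;   /* counts PMIs the model would inject */
};

/* Non-atomic test-and-set: returns the previous value of the bit. */
static bool toy_test_and_set(uint64_t *word, unsigned int bit)
{
	bool old = (*word >> bit) & 1;

	*word |= 1ULL << bit;
	return old;
}

/* Model of one counter overflow; 'precise' mirrors attr.precise_ip != 0. */
static void toy_overflow(struct toy_pmu *pmu, unsigned int idx, bool precise)
{
	bool skip_pmi = false;

	if (precise)
		skip_pmi = toy_test_and_set(&pmu->global_status,
					    TOY_BUFFER_OVF_BIT);
	else
		pmu->global_status |= 1ULL << idx;

	if (!skip_pmi)
		pmu->pmi_injected++;
}
```

In this model two back-to-back PEBS overflows produce one PMI, while a
later non-PEBS overflow on counter 3 sets bit 3 and produces a second.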
[PATCH v5 07/16] KVM: x86/pmu: Add IA32_PEBS_ENABLE MSR emulation for extended PEBS
If IA32_PERF_CAPABILITIES.PEBS_BASELINE [bit 14] is set, the
IA32_PEBS_ENABLE MSR exists and all architecturally enumerated fixed
and general purpose counters have corresponding bits in IA32_PEBS_ENABLE
that enable generation of PEBS records. The general-purpose counter bits
start at bit IA32_PEBS_ENABLE[0], and the fixed counter bits start at
bit IA32_PEBS_ENABLE[32].

When guest PEBS is enabled, the IA32_PEBS_ENABLE MSR will be added to
the perf_guest_switch_msr() and atomically switched during the VMX
transitions just like CORE_PERF_GLOBAL_CTRL MSR.

Based on whether the platform supports x86_pmu.pebs_vmx, this patch
also refactors the way more MSRs are added to arr[] in
intel_guest_get_msrs() for extensibility.

Originally-by: Andi Kleen
Co-developed-by: Kan Liang
Signed-off-by: Kan Liang
Co-developed-by: Luwei Kang
Signed-off-by: Luwei Kang
Signed-off-by: Like Xu
---
 arch/x86/events/intel/core.c     | 61 +---
 arch/x86/include/asm/kvm_host.h  |  3 ++
 arch/x86/include/asm/msr-index.h |  6
 arch/x86/kvm/vmx/pmu_intel.c     | 31
 4 files changed, 80 insertions(+), 21 deletions(-)

diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index 2f8ac53fe594..4e5ed12cb52d 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -3838,31 +3838,50 @@ static struct perf_guest_switch_msr *intel_guest_get_msrs(int *nr, void *data)
 {
 	struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
 	struct perf_guest_switch_msr *arr = cpuc->guest_switch_msrs;
+	u64 pebs_mask = (x86_pmu.flags & PMU_FL_PEBS_ALL) ?
+		cpuc->pebs_enabled : (cpuc->pebs_enabled & PEBS_COUNTER_MASK);
+
+	*nr = 0;
+	arr[(*nr)++] = (struct perf_guest_switch_msr){
+		.msr = MSR_CORE_PERF_GLOBAL_CTRL,
+		.host = x86_pmu.intel_ctrl & ~cpuc->intel_ctrl_guest_mask,
+		.guest = x86_pmu.intel_ctrl &
+			(~cpuc->intel_ctrl_host_mask | ~pebs_mask),
+	};
 
-	arr[0].msr = MSR_CORE_PERF_GLOBAL_CTRL;
-	arr[0].host = x86_pmu.intel_ctrl & ~cpuc->intel_ctrl_guest_mask;
-	arr[0].guest = x86_pmu.intel_ctrl & ~cpuc->intel_ctrl_host_mask;
-	if (x86_pmu.flags & PMU_FL_PEBS_ALL)
-		arr[0].guest &= ~cpuc->pebs_enabled;
-	else
-		arr[0].guest &= ~(cpuc->pebs_enabled & PEBS_COUNTER_MASK);
-	*nr = 1;
+	if (!x86_pmu.pebs)
+		return arr;
 
-	if (x86_pmu.pebs && x86_pmu.pebs_no_isolation) {
-		/*
-		 * If PMU counter has PEBS enabled it is not enough to
-		 * disable counter on a guest entry since PEBS memory
-		 * write can overshoot guest entry and corrupt guest
-		 * memory. Disabling PEBS solves the problem.
-		 *
-		 * Don't do this if the CPU already enforces it.
-		 */
-		arr[1].msr = MSR_IA32_PEBS_ENABLE;
-		arr[1].host = cpuc->pebs_enabled;
-		arr[1].guest = 0;
-		*nr = 2;
+	/*
+	 * If PMU counter has PEBS enabled it is not enough to
+	 * disable counter on a guest entry since PEBS memory
+	 * write can overshoot guest entry and corrupt guest
+	 * memory. Disabling PEBS solves the problem.
+	 *
+	 * Don't do this if the CPU already enforces it.
+	 */
+	if (x86_pmu.pebs_no_isolation) {
+		arr[(*nr)++] = (struct perf_guest_switch_msr){
+			.msr = MSR_IA32_PEBS_ENABLE,
+			.host = cpuc->pebs_enabled,
+			.guest = 0,
+		};
+		return arr;
 	}
 
+	if (!x86_pmu.pebs_vmx)
+		return arr;
+
+	arr[*nr] = (struct perf_guest_switch_msr){
+		.msr = MSR_IA32_PEBS_ENABLE,
+		.host = cpuc->pebs_enabled & ~cpuc->intel_ctrl_guest_mask,
+		.guest = pebs_mask & ~cpuc->intel_ctrl_host_mask,
+	};
+
+	/* Set hw GLOBAL_CTRL bits for PEBS counter when it runs for guest */
+	arr[0].guest |= arr[*nr].guest;
+
+	++(*nr);
 	return arr;
 }
 
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 5b9692397350..a48abcad3329 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -460,6 +460,9 @@ struct kvm_pmu {
 	DECLARE_BITMAP(all_valid_pmc_idx, X86_PMC_IDX_MAX);
 	DECLARE_BITMAP(pmc_in_use, X86_PMC_IDX_MAX);
 
+	u64 pebs_enable;
+	u64 pebs_enable_mask;
+
 	/*
 	 * The gate to release perf_events not marked in
 	 * pmc_in_use only once in a vcpu time slice.
diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index 546d6ecf0a35..2e997c8c79bf 100644
--- a/arch/x
[PATCH v5 05/16] KVM: x86/pmu: Introduce the ctrl_mask value for fixed counter
The mask value of the fixed counter control register should be
dynamically adjusted to the number of fixed counters. This patch
introduces a variable that holds the reserved bits of the fixed counter
control register. This is needed for later Ice Lake fixed counter
changes.

Co-developed-by: Luwei Kang
Signed-off-by: Luwei Kang
Signed-off-by: Like Xu
---
 arch/x86/include/asm/kvm_host.h | 1 +
 arch/x86/kvm/vmx/pmu_intel.c    | 6 +-
 2 files changed, 6 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 44f893043a3c..5b9692397350 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -444,6 +444,7 @@ struct kvm_pmu {
 	unsigned nr_arch_fixed_counters;
 	unsigned available_event_types;
 	u64 fixed_ctr_ctrl;
+	u64 fixed_ctr_ctrl_mask;
 	u64 global_ctrl;
 	u64 global_status;
 	u64 global_ovf_ctrl;
diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c
index d9dbebe03cae..ac7fe714e6c1 100644
--- a/arch/x86/kvm/vmx/pmu_intel.c
+++ b/arch/x86/kvm/vmx/pmu_intel.c
@@ -400,7 +400,7 @@ static int intel_pmu_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 	case MSR_CORE_PERF_FIXED_CTR_CTRL:
 		if (pmu->fixed_ctr_ctrl == data)
 			return 0;
-		if (!(data & 0xfffffffffffff444ull)) {
+		if (!(data & pmu->fixed_ctr_ctrl_mask)) {
 			reprogram_fixed_counters(pmu, data);
 			return 0;
 		}
@@ -470,6 +470,7 @@ static void intel_pmu_refresh(struct kvm_vcpu *vcpu)
 	struct kvm_cpuid_entry2 *entry;
 	union cpuid10_eax eax;
 	union cpuid10_edx edx;
+	int i;
 
 	pmu->nr_arch_gp_counters = 0;
 	pmu->nr_arch_fixed_counters = 0;
@@ -477,6 +478,7 @@ static void intel_pmu_refresh(struct kvm_vcpu *vcpu)
 	pmu->counter_bitmask[KVM_PMC_FIXED] = 0;
 	pmu->version = 0;
 	pmu->reserved_bits = 0xffffffff00200000ull;
+	pmu->fixed_ctr_ctrl_mask = ~0ull;
 
 	entry = kvm_find_cpuid_entry(vcpu, 0xa, 0);
 	if (!entry)
@@ -511,6 +513,8 @@ static void intel_pmu_refresh(struct kvm_vcpu *vcpu)
 			((u64)1 << edx.split.bit_width_fixed) - 1;
 	}
 
+	for (i = 0; i < pmu->nr_arch_fixed_counters; i++)
+		pmu->fixed_ctr_ctrl_mask &= ~(0xbull << (i * 4));
 	pmu->global_ctrl = ((1ull << pmu->nr_arch_gp_counters) - 1) |
 		(((1ull << pmu->nr_arch_fixed_counters) - 1) << INTEL_PMC_IDX_FIXED);
 	pmu->global_ctrl_mask = ~pmu->global_ctrl;
-- 
2.30.2
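Note on the mask computation in this patch: each fixed counter owns a
4-bit field in the fixed counter control register, and the loop clears
the field bits selected by 0xb (bits 0, 1 and 3 of each field) from the
reserved mask, leaving bit 2 of every field reserved. A small standalone
sketch of the same loop, with the hypothetical helper name
toy_fixed_ctr_ctrl_mask(), makes the resulting values easy to check:

```c
#include <stdint.h>

/*
 * Reserved-bit mask of the fixed counter control register for a given
 * number of fixed counters. Start with every bit reserved, then clear
 * the bits covered by 0xb in each counter's 4-bit field.
 */
static uint64_t toy_fixed_ctr_ctrl_mask(unsigned int nr_fixed)
{
	uint64_t mask = ~0ULL;
	unsigned int i;

	for (i = 0; i < nr_fixed; i++)
		mask &= ~(0xbULL << (i * 4));

	return mask;
}
```

For three fixed counters this gives 0xfffffffffffff444, and for four it
gives 0xffffffffffff4444, so a guest write that touches the fourth field
is no longer rejected once the fourth counter is enumerated.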
[PATCH v5 02/16] perf/x86/intel: Handle guest PEBS overflow PMI for KVM guest
With PEBS virtualization, the guest PEBS records get delivered to the
guest DS, and the host PMI handler uses perf_guest_cbs->is_in_guest()
to distinguish whether the PMI comes from the guest code, like Intel PT.

No matter how many guest PEBS counters have overflowed, triggering one
fake event is enough. The fake event causes the KVM PMI callback to be
called, thereby injecting the PEBS overflow PMI into the guest.

KVM may inject the PMI with BUFFER_OVF set even if the guest DS is
empty. That should be harmless: the guest PEBS handler retrieves the
correct information from its own PEBS records buffer.

Originally-by: Andi Kleen
Co-developed-by: Kan Liang
Signed-off-by: Kan Liang
Signed-off-by: Like Xu
---
 arch/x86/events/intel/core.c | 40
 1 file changed, 40 insertions(+)

diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index 591d60cc8436..021658df1feb 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -2747,6 +2747,43 @@ static void intel_pmu_reset(void)
 	local_irq_restore(flags);
 }
 
+/*
+ * We may be running with guest PEBS events created by KVM, and the
+ * PEBS records are logged into the guest's DS and invisible to host.
+ *
+ * In the case of guest PEBS overflow, we only trigger a fake event
+ * to emulate the PEBS overflow PMI for guest PBES counters in KVM.
+ * The guest will then vm-entry and check the guest DS area to read
+ * the guest PEBS records.
+ *
+ * The contents and other behavior of the guest event do not matter.
+ */
+static void x86_pmu_handle_guest_pebs(struct pt_regs *regs,
+				      struct perf_sample_data *data)
+{
+	struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
+	u64 guest_pebs_idxs = cpuc->pebs_enabled & ~cpuc->intel_ctrl_host_mask;
+	struct perf_event *event = NULL;
+	int bit;
+
+	if (!x86_pmu.pebs_active || !guest_pebs_idxs)
+		return;
+
+	for_each_set_bit(bit, (unsigned long *)&guest_pebs_idxs,
+			 INTEL_PMC_IDX_FIXED + x86_pmu.num_counters_fixed) {
+		event = cpuc->events[bit];
+		if (!event->attr.precise_ip)
+			continue;
+
+		perf_sample_data_init(data, 0, event->hw.last_period);
+		if (perf_event_overflow(event, data, regs))
+			x86_pmu_stop(event, 0);
+
+		/* Inject one fake event is enough. */
+		break;
+	}
+}
+
 static int handle_pmi_common(struct pt_regs *regs, u64 status)
 {
 	struct perf_sample_data data;
@@ -2797,6 +2834,9 @@ static int handle_pmi_common(struct pt_regs *regs, u64 status)
 		u64 pebs_enabled = cpuc->pebs_enabled;
 
 		handled++;
+		if (x86_pmu.pebs_vmx && perf_guest_cbs &&
+		    perf_guest_cbs->is_in_guest())
+			x86_pmu_handle_guest_pebs(regs, &data);
 		x86_pmu.drain_pebs(regs, &data);
 		status &= x86_pmu.intel_ctrl | GLOBAL_STATUS_TRACE_TOPAPMI;
 
-- 
2.30.2
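Note on the "one fake event is enough" logic in this patch: the handler
masks out host-only counters from the enabled PEBS set, walks the
remaining bits, and stops at the first counter whose event is precise.
The sketch below models only that selection in user space. The names
toy_event and toy_pick_guest_pebs_event() are hypothetical, and the
'present' flag is an addition of this sketch (the patch dereferences
cpuc->events[bit] directly).

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical reduced view of a perf event slot. */
struct toy_event {
	bool present;   /* sketch-only guard for empty slots */
	bool precise;   /* mirrors attr.precise_ip != 0 */
};

/*
 * Return the index of the first guest PEBS counter that would receive
 * the fake overflow, or -1 if none qualifies. Assumes nr_bits <= 64 and
 * that 'events' has at least nr_bits entries.
 */
static int toy_pick_guest_pebs_event(uint64_t pebs_enabled,
				     uint64_t host_mask,
				     const struct toy_event *events,
				     unsigned int nr_bits)
{
	uint64_t guest_idxs = pebs_enabled & ~host_mask;
	unsigned int bit;

	for (bit = 0; bit < nr_bits; bit++) {
		if (!(guest_idxs & (1ULL << bit)))
			continue;
		if (!events[bit].present || !events[bit].precise)
			continue;
		return (int)bit;   /* one fake event is enough */
	}

	return -1;
}
```

With PEBS enabled on counters 1, 2 and 5, counter 1 not precise and
counter 2 marked host-only, the model picks counter 5.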
[PATCH v5 04/16] KVM: x86/pmu: Set MSR_IA32_MISC_ENABLE_EMON bit when vPMU is enabled
On Intel platforms, software can use the IA32_MISC_ENABLE[7] bit to
detect whether the processor supports the performance monitoring
facility.

Whether the bit is set depends on whether the PMU is enabled for the
guest, and a software write to this bit will be ignored.

Cc: Yao Yuan
Signed-off-by: Like Xu
---
 arch/x86/kvm/vmx/pmu_intel.c | 1 +
 arch/x86/kvm/x86.c           | 1 +
 2 files changed, 2 insertions(+)

diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c
index 9efc1a6b8693..d9dbebe03cae 100644
--- a/arch/x86/kvm/vmx/pmu_intel.c
+++ b/arch/x86/kvm/vmx/pmu_intel.c
@@ -488,6 +488,7 @@ static void intel_pmu_refresh(struct kvm_vcpu *vcpu)
 	if (!pmu->version)
 		return;
 
+	vcpu->arch.ia32_misc_enable_msr |= MSR_IA32_MISC_ENABLE_EMON;
 	perf_get_x86_pmu_capability(&x86_pmu);
 
 	pmu->nr_arch_gp_counters = min_t(int, eax.split.num_counters,
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 16fb39503296..1a64e816e06d 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -3125,6 +3125,7 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 		}
 		break;
 	case MSR_IA32_MISC_ENABLE:
+		data &= ~MSR_IA32_MISC_ENABLE_EMON;
 		if (!kvm_check_has_quirk(vcpu->kvm, KVM_X86_QUIRK_MISC_ENABLE_NO_MWAIT) &&
 		    ((vcpu->arch.ia32_misc_enable_msr ^ data) & MSR_IA32_MISC_ENABLE_MWAIT)) {
 			if (!guest_cpuid_has(vcpu, X86_FEATURE_XMM3))
-- 
2.30.2
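Note on "a software write to this bit will be ignored": one way to model
that semantic is to drop the incoming value of bit 7 and carry over the
value the MSR already holds. This is a sketch of the described behaviour
under that assumption, not a copy of the hunk above, which only clears
the bit in the incoming data. The names TOY_MISC_ENABLE_EMON and
toy_write_misc_enable() are hypothetical.

```c
#include <stdint.h>

/* IA32_MISC_ENABLE[7]: performance monitoring available. */
#define TOY_MISC_ENABLE_EMON (1ULL << 7)

/*
 * Compute the new MSR value for a guest write of 'data' when the MSR
 * currently holds 'cur'. The EMON bit keeps its current value whatever
 * the guest wrote; all other bits take the written value.
 */
static uint64_t toy_write_misc_enable(uint64_t cur, uint64_t data)
{
	data &= ~TOY_MISC_ENABLE_EMON;
	data |= cur & TOY_MISC_ENABLE_EMON;

	return data;
}
```

A guest that writes 0 while the bit is set still reads the bit back as
set, and a guest that tries to set the bit while the vPMU is disabled
reads it back as clear.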
[PATCH v5 03/16] perf/x86/core: Pass "struct kvm_pmu *" to determine the guest values
Splitting the logic for determining the guest values is unnecessarily
confusing, and potentially fragile. Perf should have full knowledge and
control of what values are loaded for the guest.

If we change .guest_get_msrs() to take a struct kvm_pmu pointer, then it
can generate the full set of guest values by grabbing guest ds_area and
pebs_data_cfg. Alternatively, .guest_get_msrs() could take the desired
guest MSR values directly (ds_area and pebs_data_cfg), but kvm_pmu is
vendor agnostic, so we don't see any reason to not just pass the pointer.

Suggested-by: Sean Christopherson
Signed-off-by: Like Xu
---
 arch/x86/events/core.c            | 4 ++--
 arch/x86/events/intel/core.c      | 4 ++--
 arch/x86/events/perf_event.h      | 2 +-
 arch/x86/include/asm/perf_event.h | 4 ++--
 arch/x86/kvm/vmx/vmx.c            | 3 ++-
 5 files changed, 9 insertions(+), 8 deletions(-)

diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index 06bef6ba8a9b..7e2264a8c3f7 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -673,9 +673,9 @@ void x86_pmu_disable_all(void)
 	}
 }
 
-struct perf_guest_switch_msr *perf_guest_get_msrs(int *nr)
+struct perf_guest_switch_msr *perf_guest_get_msrs(int *nr, void *data)
 {
-	return static_call(x86_pmu_guest_get_msrs)(nr);
+	return static_call(x86_pmu_guest_get_msrs)(nr, data);
 }
 EXPORT_SYMBOL_GPL(perf_guest_get_msrs);
 
diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index 021658df1feb..2f8ac53fe594 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -3834,7 +3834,7 @@ static int intel_pmu_hw_config(struct perf_event *event)
 	return 0;
 }
 
-static struct perf_guest_switch_msr *intel_guest_get_msrs(int *nr)
+static struct perf_guest_switch_msr *intel_guest_get_msrs(int *nr, void *data)
 {
 	struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
 	struct perf_guest_switch_msr *arr = cpuc->guest_switch_msrs;
@@ -3866,7 +3866,7 @@ static struct perf_guest_switch_msr *intel_guest_get_msrs(int *nr)
 	return arr;
 }
 
-static struct perf_guest_switch_msr *core_guest_get_msrs(int *nr)
+static struct perf_guest_switch_msr *core_guest_get_msrs(int *nr, void *data)
 {
 	struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
 	struct perf_guest_switch_msr *arr = cpuc->guest_switch_msrs;
diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
index 85dc4e1d4514..e52b35333e1f 100644
--- a/arch/x86/events/perf_event.h
+++ b/arch/x86/events/perf_event.h
@@ -809,7 +809,7 @@ struct x86_pmu {
 	/*
 	 * Intel host/guest support (KVM)
 	 */
-	struct perf_guest_switch_msr *(*guest_get_msrs)(int *nr);
+	struct perf_guest_switch_msr *(*guest_get_msrs)(int *nr, void *data);
 
 	/*
 	 * Check period value for PERF_EVENT_IOC_PERIOD ioctl.
diff --git a/arch/x86/include/asm/perf_event.h b/arch/x86/include/asm/perf_event.h
index 6a6e707905be..d5957b68906b 100644
--- a/arch/x86/include/asm/perf_event.h
+++ b/arch/x86/include/asm/perf_event.h
@@ -491,10 +491,10 @@ static inline void perf_check_microcode(void) { }
 #endif
 
 #if defined(CONFIG_PERF_EVENTS) && defined(CONFIG_CPU_SUP_INTEL)
-extern struct perf_guest_switch_msr *perf_guest_get_msrs(int *nr);
+extern struct perf_guest_switch_msr *perf_guest_get_msrs(int *nr, void *data);
 extern int x86_perf_get_lbr(struct x86_pmu_lbr *lbr);
 #else
-struct perf_guest_switch_msr *perf_guest_get_msrs(int *nr);
+struct perf_guest_switch_msr *perf_guest_get_msrs(int *nr, void *data);
 static inline int x86_perf_get_lbr(struct x86_pmu_lbr *lbr)
 {
 	return -1;
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index c05e6e2854b5..58673351c475 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -6537,9 +6537,10 @@ static void atomic_switch_perf_msrs(struct vcpu_vmx *vmx)
 {
 	int i, nr_msrs;
 	struct perf_guest_switch_msr *msrs;
+	struct kvm_pmu *pmu = vcpu_to_pmu(&vmx->vcpu);
 
 	/* Note, nr_msrs may be garbage if perf_guest_get_msrs() returns NULL. */
-	msrs = perf_guest_get_msrs(&nr_msrs);
+	msrs = perf_guest_get_msrs(&nr_msrs, (void *)pmu);
 	if (!msrs)
 		return;
 
-- 
2.30.2
[PATCH v5 01/16] perf/x86/intel: Add EPT-Friendly PEBS for Ice Lake Server
The new hardware facility supporting guest PEBS is only available
on Intel Ice Lake Server platforms for now. KVM will check the new
x86_pmu.pebs_vmx field through perf_get_x86_pmu_capability() instead
of hard-coding the CPU models in the KVM code. If it is supported, the
guest PEBS capability will be exposed to the guest.

Signed-off-by: Like Xu
---
 arch/x86/events/core.c            | 1 +
 arch/x86/events/intel/core.c      | 1 +
 arch/x86/events/perf_event.h      | 3 ++-
 arch/x86/include/asm/perf_event.h | 1 +
 4 files changed, 5 insertions(+), 1 deletion(-)

diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index 18df17129695..06bef6ba8a9b 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -2776,5 +2776,6 @@ void perf_get_x86_pmu_capability(struct x86_pmu_capability *cap)
 	cap->bit_width_fixed	= x86_pmu.cntval_bits;
 	cap->events_mask	= (unsigned int)x86_pmu.events_maskl;
 	cap->events_mask_len	= x86_pmu.events_mask_len;
+	cap->pebs_vmx		= x86_pmu.pebs_vmx;
 }
 EXPORT_SYMBOL_GPL(perf_get_x86_pmu_capability);
diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index 7bbb5bb98d8c..591d60cc8436 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -5574,6 +5574,7 @@ __init int intel_pmu_init(void)
 
 	case INTEL_FAM6_ICELAKE_X:
 	case INTEL_FAM6_ICELAKE_D:
+		x86_pmu.pebs_vmx = 1;
 		pmem = true;
 		fallthrough;
 	case INTEL_FAM6_ICELAKE_L:
diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
index 53b2b5fc23bc..85dc4e1d4514 100644
--- a/arch/x86/events/perf_event.h
+++ b/arch/x86/events/perf_event.h
@@ -729,7 +729,8 @@ struct x86_pmu {
 			pebs_prec_dist		:1,
 			pebs_no_tlb		:1,
 			pebs_no_isolation	:1,
-			pebs_block		:1;
+			pebs_block		:1,
+			pebs_vmx		:1;
 	int		pebs_record_size;
 	int		pebs_buffer_size;
 	int		max_pebs_events;
diff --git a/arch/x86/include/asm/perf_event.h b/arch/x86/include/asm/perf_event.h
index 544f41a179fb..6a6e707905be 100644
--- a/arch/x86/include/asm/perf_event.h
+++ b/arch/x86/include/asm/perf_event.h
@@ -192,6 +192,7 @@ struct x86_pmu_capability {
 	int		bit_width_fixed;
 	unsigned int	events_mask;
 	int		events_mask_len;
+	unsigned int	pebs_vmx:1;
 };
 
 /*
-- 
2.30.2
[PATCH v5 00/16] KVM: x86/pmu: Add basic support to enable guest PEBS via DS
The guest Precise Event Based Sampling (PEBS) feature can provide an
architectural state of the instruction executed after the guest
instruction that exactly caused the event. It needs a new hardware
facility only available on Intel Ice Lake Server platforms. This patch
set enables the basic PEBS feature for KVM guests on ICX.

We can use the PEBS feature on the Linux guest like native:

  # perf record -e instructions:ppp ./br_instr a
  # perf record -c 10 -e instructions:pp ./br_instr a

To emulate the guest PEBS facility for the above perf usages,
we need to implement 2 code paths:

1) Fast path

This is when the host assigned physical PMC has an identical index as
the virtual PMC (e.g. using physical PMC0 to emulate virtual PMC0).
This path is used in most common use cases.

2) Slow path

This is when the host assigned physical PMC has a different index
from the virtual PMC (e.g. using physical PMC1 to emulate virtual PMC0).
In this case, KVM needs to rewrite the PEBS records to change the
applicable counter indexes to the virtual PMC indexes, which would
otherwise contain the physical counter index written by the PEBS
facility, and switch the counter reset values to the offset
corresponding to the physical counter indexes in the DS data structure.

The previous version [0] enables both fast path and slow path, which
seems a bit too complex as the first step. In this patchset, we want to
start with the fast path to get the basic guest PEBS enabled while
keeping the slow path disabled. More focused discussion on the slow
path [1] is planned to be put to another patchset in the next step.

Compared to later versions in subsequent steps, the functionality to
support host and guest PEBS both enabled, and the functionality to
emulate guest PEBS when the counter is cross-mapped, are missing in this
patch set (neither of these is a typical scenario).

With the basic support, the guest can retrieve the correct PEBS
information from its own PEBS records on the Ice Lake servers. We expect
it to work when migrating to another Ice Lake, and no regression in host
perf is expected.

Here are the results of the pebs test from guest/host for the same
workload:

perf report on guest:
# Samples: 2K of event 'instructions:ppp', # Event count (approx.): 1473377250
# Overhead  Command   Shared Object      Symbol
    57.74%  br_instr  br_instr           [.] lfsr_cond
    41.40%  br_instr  br_instr           [.] cmp_end
     0.21%  br_instr  [kernel.kallsyms]  [k] __lock_acquire

perf report on host:
# Samples: 2K of event 'instructions:ppp', # Event count (approx.): 1462721386
# Overhead  Command   Shared Object      Symbol
    57.90%  br_instr  br_instr           [.] lfsr_cond
    41.95%  br_instr  br_instr           [.] cmp_end
     0.05%  br_instr  [kernel.vmlinux]   [k] lock_acquire

Conclusion: the profiling results on the guest are similar to those on
the host.

The minimum guest kernel version may be v5.4, or a backported version
that supports Ice Lake server PEBS.

Please check more details in each commit and feel free to comment.

Previous:
https://lore.kernel.org/kvm/20210329054137.120994-1-like...@linux.intel.com/

[0] https://lore.kernel.org/kvm/20210104131542.495413-1-like...@linux.intel.com/
[1] https://lore.kernel.org/kvm/20210115191113.nktlnmivc3eds...@two.firstfloor.org/

v4->v5 Changelog:
- Rewrite intel_guest_get_msrs() to address Peter's comments;
- Fix coding style including indentation and {};
- Use __test_and_set_bit in the kvm_perf_overflow_intr();
- Return void for x86_pmu_handle_guest_pebs();
- Always drain pebs buffer on the host side;

Like Xu (16):
  perf/x86/intel: Add EPT-Friendly PEBS for Ice Lake Server
  perf/x86/intel: Handle guest PEBS overflow PMI for KVM guest
  perf/x86/core: Pass "struct kvm_pmu *" to determine the guest values
  KVM: x86/pmu: Set MSR_IA32_MISC_ENABLE_EMON bit when vPMU is enabled
  KVM: x86/pmu: Introduce the ctrl_mask value for fixed counter
  KVM: x86/pmu: Reprogram PEBS event to emulate guest PEBS counter
  KVM: x86/pmu: Add IA32_PEBS_ENABLE MSR emulation for extended PEBS
  KVM: x86/pmu: Add IA32_DS_AREA MSR emulation to support guest DS
  KVM: x86/pmu: Add PEBS_DATA_CFG MSR emulation to support adaptive PEBS
  KVM: x86: Set PEBS_UNAVAIL in IA32_MISC_ENABLE when PEBS is enabled
  KVM: x86/pmu: Adjust precise_ip to emulate Ice Lake guest PDIR counter
  KVM: x86/pmu: Move pmc_speculative_in_use() to arch/x86/kvm/pmu.h
  KVM: x86/pmu: Disable guest PEBS temporarily in two rare situations
  KVM: x86/pmu: Add kvm_pmu_cap to optimize perf_get_x86_pmu_capability
  KVM: x86/cpuid: Refactor host/guest CPU model consistency check
  KVM: x86/pmu: Expose CPUIDs feature bits PDCM, DS, DTES64

 arch/x86/events/core.c           |   5 +-
 arch/x86/events/intel/core.c     | 130 --
 arch/x86/events/perf_event.h     |   5 +-
 arch/x86/include/asm/kvm_host.h  |  16
 arch/x86/include/asm/msr-index.h |   6 ++
 arch/x86/include/asm/perf_
Re: [PATCH v5 0/5] perf/x86: Some minor changes to support guest Arch LBR
Em, does anyone want to review these minor changes?
I believe some of them solve the real problem.

On 2021/4/6 11:20, Like Xu wrote:
Hi all, do we have any comments on this patch set?

On 2021/3/26 9:19, Like Xu wrote:
Hi Peter,

Please help review these minor perf/x86 changes in this patch set,
and we need some of them to support Guest Architectural LBR in KVM.

This version keeps reserve_lbr_buffers() as is because the LBR xsave
buffer is a per-CPU buffer, not a per-event buffer. We only need to
allocate the buffer once when initializing the first event.

If you are interested in the KVM emulation, please check
https://lore.kernel.org/kvm/20210314155225.206661-1-like...@linux.intel.com/

Please check more details in each commit and feel free to comment.

Previous:
https://lore.kernel.org/lkml/20210322060635.821531-1-like...@linux.intel.com/

v4->v5 Changelog:
- Add "Tested-by: Kan Liang"
- Make the commit message simpler
- Make check_msr() to ignore msr==0
- Use kmem_cache_alloc_node() [Namhyung]

Like Xu (5):
  perf/x86/intel: Fix the comment about guest LBR support on KVM
  perf/x86/lbr: Simplify the exposure check for the LBR_INFO registers
  perf/x86: Skip checking MSR for MSR 0x000
  perf/x86/lbr: Move cpuc->lbr_xsave allocation out of sleeping region
  perf/x86: Move ARCH_LBR_CTL_MASK definition to include/asm/msr-index.h

 arch/x86/events/core.c           |  8 +---
 arch/x86/events/intel/bts.c      |  2 +-
 arch/x86/events/intel/core.c     |  7 +++
 arch/x86/events/intel/lbr.c      | 29 ++---
 arch/x86/events/perf_event.h     |  8 +++-
 arch/x86/include/asm/msr-index.h |  1 +
 6 files changed, 35 insertions(+), 20 deletions(-)
Re: [PATCH v4 01/16] perf/x86/intel: Add x86_pmu.pebs_vmx for Ice Lake Servers
Hi Liuxiangdong, On 2021/4/9 16:33, Liuxiangdong (Aven, Cloud Infrastructure Service Product Dept.) wrote: Do you have any comments or ideas about it ? https://lore.kernel.org/kvm/606e5ef6.2060...@huawei.com/ My expectation is that there may be many fewer PEBS samples on Skylake without any soft lockup. You may need to confirm the statement "All that matters is that the EPT pages don't get unmapped ever while PEBS is active" is true in the kernel level. Try "-overcommit mem-lock=on" for your qemu. On 2021/4/6 13:14, Xu, Like wrote: Hi Xiangdong, On 2021/4/6 11:24, Liuxiangdong (Aven, Cloud Infrastructure Service Product Dept.) wrote: Hi,like. Some questions about this new pebs patches set: https://lore.kernel.org/kvm/20210329054137.120994-2-like...@linux.intel.com/ The new hardware facility supporting guest PEBS is only available on Intel Ice Lake Server platforms for now. Yes, we have documented this "EPT-friendly PEBS" capability in the SDM 18.3.10.1 Processor Event Based Sampling (PEBS) Facility And again, this patch set doesn't officially support guest PEBS on the Skylake. AFAIK, Icelake supports adaptive PEBS and extended PEBS which Skylake doesn't. But we can still use IA32_PEBS_ENABLE MSR to indicate general-purpose counter in Skylake. For Skylake, only the PMC0-PMC3 are valid for PEBS and you may mask the other unsupported bits in the pmu->pebs_enable_mask. Is there anything else that only Icelake supports in this patches set? The PDIR counter on the Ice Lake is the fixed counter 0 while the PDIR counter on the Sky Lake is the gp counter 1. You may also expose x86_pmu.pebs_vmx for Skylake in the 1st patch. Besides, we have tried this patches set in Icelake. We can use pebs(eg: "perf record -e cycles:pp") when guest is kernel-5.11, but can't when kernel-4.18. Is there a minimum guest kernel version requirement? The Ice Lake CPU model has been added since v5.4. 
You may double check whether the stable tree(s) code has INTEL_FAM6_ICELAKE in the arch/x86/include/asm/intel-family.h. Thanks, Xiangdong Liu
Re: [PATCH v5 0/5] perf/x86: Some minor changes to support guest Arch LBR
Hi all, do we have any comments on this patch set?

On 2021/3/26 9:19, Like Xu wrote:
Hi Peter,

Please help review these minor perf/x86 changes in this patch set,
and we need some of them to support Guest Architectural LBR in KVM.

This version keeps reserve_lbr_buffers() as is because the LBR xsave
buffer is a per-CPU buffer, not a per-event buffer. We only need to
allocate the buffer once when initializing the first event.

If you are interested in the KVM emulation, please check
https://lore.kernel.org/kvm/20210314155225.206661-1-like...@linux.intel.com/

Please check more details in each commit and feel free to comment.

Previous:
https://lore.kernel.org/lkml/20210322060635.821531-1-like...@linux.intel.com/

v4->v5 Changelog:
- Add "Tested-by: Kan Liang"
- Make the commit message simpler
- Make check_msr() to ignore msr==0
- Use kmem_cache_alloc_node() [Namhyung]

Like Xu (5):
  perf/x86/intel: Fix the comment about guest LBR support on KVM
  perf/x86/lbr: Simplify the exposure check for the LBR_INFO registers
  perf/x86: Skip checking MSR for MSR 0x000
  perf/x86/lbr: Move cpuc->lbr_xsave allocation out of sleeping region
  perf/x86: Move ARCH_LBR_CTL_MASK definition to include/asm/msr-index.h

 arch/x86/events/core.c           |  8 +---
 arch/x86/events/intel/bts.c      |  2 +-
 arch/x86/events/intel/core.c     |  7 +++
 arch/x86/events/intel/lbr.c      | 29 ++---
 arch/x86/events/perf_event.h     |  8 +++-
 arch/x86/include/asm/msr-index.h |  1 +
 6 files changed, 35 insertions(+), 20 deletions(-)
[PATCH v4 15/16] KVM: x86/cpuid: Refactor host/guest CPU model consistency check
For the same purpose, the leagcy intel_pmu_lbr_is_compatible() could be renamed for reuse by more callers for the same purpose and remove the comment about LBR use case incidentally. Signed-off-by: Like Xu --- arch/x86/kvm/cpuid.h | 5 + arch/x86/kvm/vmx/pmu_intel.c | 12 +--- arch/x86/kvm/vmx/vmx.c | 2 +- arch/x86/kvm/vmx/vmx.h | 1 - 4 files changed, 7 insertions(+), 13 deletions(-) diff --git a/arch/x86/kvm/cpuid.h b/arch/x86/kvm/cpuid.h index 2a0c5064497f..fb478fb45b9e 100644 --- a/arch/x86/kvm/cpuid.h +++ b/arch/x86/kvm/cpuid.h @@ -270,6 +270,11 @@ static inline int guest_cpuid_model(struct kvm_vcpu *vcpu) return x86_model(best->eax); } +static inline bool cpuid_model_is_consistent(struct kvm_vcpu *vcpu) +{ + return boot_cpu_data.x86_model == guest_cpuid_model(vcpu); +} + static inline int guest_cpuid_stepping(struct kvm_vcpu *vcpu) { struct kvm_cpuid_entry2 *best; diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c index 3c1ee59571d9..4fe13cf80bb5 100644 --- a/arch/x86/kvm/vmx/pmu_intel.c +++ b/arch/x86/kvm/vmx/pmu_intel.c @@ -173,16 +173,6 @@ static inline struct kvm_pmc *get_fw_gp_pmc(struct kvm_pmu *pmu, u32 msr) return get_gp_pmc(pmu, msr, MSR_IA32_PMC0); } -bool intel_pmu_lbr_is_compatible(struct kvm_vcpu *vcpu) -{ - /* -* As a first step, a guest could only enable LBR feature if its -* cpu model is the same as the host because the LBR registers -* would be pass-through to the guest and they're model specific. 
-*/ - return boot_cpu_data.x86_model == guest_cpuid_model(vcpu); -} - bool intel_pmu_lbr_is_enabled(struct kvm_vcpu *vcpu) { struct x86_pmu_lbr *lbr = vcpu_to_lbr_records(vcpu); @@ -576,7 +566,7 @@ static void intel_pmu_refresh(struct kvm_vcpu *vcpu) nested_vmx_pmu_entry_exit_ctls_update(vcpu); - if (intel_pmu_lbr_is_compatible(vcpu)) + if (cpuid_model_is_consistent(vcpu)) x86_perf_get_lbr(&lbr_desc->records); else lbr_desc->records.nr = 0; diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c index 966fa7962808..b0f2cb790359 100644 --- a/arch/x86/kvm/vmx/vmx.c +++ b/arch/x86/kvm/vmx/vmx.c @@ -2259,7 +2259,7 @@ static int vmx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info) if ((data & PMU_CAP_LBR_FMT) != (vmx_get_perf_capabilities() & PMU_CAP_LBR_FMT)) return 1; - if (!intel_pmu_lbr_is_compatible(vcpu)) + if (!cpuid_model_is_consistent(vcpu)) return 1; } ret = kvm_set_msr_common(vcpu, msr_info); diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h index 0029aaad8eda..d214b6c43886 100644 --- a/arch/x86/kvm/vmx/vmx.h +++ b/arch/x86/kvm/vmx/vmx.h @@ -97,7 +97,6 @@ union vmx_exit_reason { #define vcpu_to_lbr_records(vcpu) (&to_vmx(vcpu)->lbr_desc.records) void intel_pmu_cross_mapped_check(struct kvm_pmu *pmu); -bool intel_pmu_lbr_is_compatible(struct kvm_vcpu *vcpu); bool intel_pmu_lbr_is_enabled(struct kvm_vcpu *vcpu); int intel_pmu_create_guest_lbr_event(struct kvm_vcpu *vcpu); -- 2.29.2
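As a rough illustration of what the consistency check compares, the sketch below decodes the display model from a CPUID.01H:EAX signature and compares a host and a guest value. It is a userspace approximation only: sketch_x86_family(), sketch_x86_model() and sketch_model_is_consistent() are names made up for this note, and the real helper reads boot_cpu_data and the guest's CPUID entry rather than taking raw signatures.

```c
#include <stdbool.h>
#include <stdint.h>

/* Display family: base family, plus the extended family when base is 0xf. */
static unsigned int sketch_x86_family(uint32_t sig)
{
	unsigned int fam = (sig >> 8) & 0xf;

	if (fam == 0xf)
		fam += (sig >> 20) & 0xff;
	return fam;
}

/* Display model: base model, plus the extended model for family >= 6. */
static unsigned int sketch_x86_model(uint32_t sig)
{
	unsigned int model = (sig >> 4) & 0xf;

	if (sketch_x86_family(sig) >= 0x6)
		model += ((sig >> 16) & 0xf) << 4;
	return model;
}

/* Hypothetical stand-in for the host/guest model comparison. */
static bool sketch_model_is_consistent(uint32_t host_sig, uint32_t guest_sig)
{
	return sketch_x86_model(host_sig) == sketch_x86_model(guest_sig);
}
```

Note that only the model is compared, so two signatures that differ in stepping alone still count as consistent.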
[PATCH v4 16/16] KVM: x86/pmu: Expose CPUIDs feature bits PDCM, DS, DTES64
The CPUID features PDCM, DS and DTES64 are required for the PEBS feature. KVM exposes PDCM, DS and DTES64 to the guest when PEBS is supported by KVM on Ice Lake server platforms. Originally-by: Andi Kleen Co-developed-by: Kan Liang Signed-off-by: Kan Liang Co-developed-by: Luwei Kang Signed-off-by: Luwei Kang Signed-off-by: Like Xu --- arch/x86/kvm/vmx/capabilities.h | 26 -- arch/x86/kvm/vmx/vmx.c | 15 +++ 2 files changed, 35 insertions(+), 6 deletions(-) diff --git a/arch/x86/kvm/vmx/capabilities.h b/arch/x86/kvm/vmx/capabilities.h index d1d77985e889..df06da09f84c 100644 --- a/arch/x86/kvm/vmx/capabilities.h +++ b/arch/x86/kvm/vmx/capabilities.h @@ -5,6 +5,7 @@ #include #include "lapic.h" +#include "pmu.h" extern bool __read_mostly enable_vpid; extern bool __read_mostly flexpriority_enabled; @@ -378,20 +379,33 @@ static inline bool vmx_pt_mode_is_host_guest(void) return pt_mode == PT_MODE_HOST_GUEST; } -static inline u64 vmx_get_perf_capabilities(void) +static inline bool vmx_pebs_supported(void) { - u64 perf_cap = 0; + struct x86_pmu_capability x86_pmu; - if (boot_cpu_has(X86_FEATURE_PDCM)) - rdmsrl(MSR_IA32_PERF_CAPABILITIES, perf_cap); + perf_get_x86_pmu_capability(&x86_pmu); - perf_cap &= PMU_CAP_LBR_FMT; + return boot_cpu_has(X86_FEATURE_PEBS) && x86_pmu.pebs_vmx; +} +static inline u64 vmx_get_perf_capabilities(void) +{ /* * Since counters are virtualized, KVM would support full * width counting unconditionally, even if the host lacks it. 
*/ - return PMU_CAP_FW_WRITES | perf_cap; + u64 perf_cap = PMU_CAP_FW_WRITES; + u64 host_perf_cap = 0; + + if (boot_cpu_has(X86_FEATURE_PDCM)) + rdmsrl(MSR_IA32_PERF_CAPABILITIES, host_perf_cap); + + perf_cap |= host_perf_cap & PMU_CAP_LBR_FMT; + + if (vmx_pebs_supported()) + perf_cap |= host_perf_cap & PERF_CAP_PEBS_MASK; + + return perf_cap; } static inline u64 vmx_supported_debugctl(void) diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c index b0f2cb790359..7cd9370357f9 100644 --- a/arch/x86/kvm/vmx/vmx.c +++ b/arch/x86/kvm/vmx/vmx.c @@ -2262,6 +2262,17 @@ static int vmx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info) if (!cpuid_model_is_consistent(vcpu)) return 1; } + if (data & PERF_CAP_PEBS_FORMAT) { + if ((data & PERF_CAP_PEBS_MASK) != + (vmx_get_perf_capabilities() & PERF_CAP_PEBS_MASK)) + return 1; + if (!guest_cpuid_has(vcpu, X86_FEATURE_DS)) + return 1; + if (!guest_cpuid_has(vcpu, X86_FEATURE_DTES64)) + return 1; + if (!cpuid_model_is_consistent(vcpu)) + return 1; + } ret = kvm_set_msr_common(vcpu, msr_info); break; @@ -7264,6 +7275,10 @@ static __init void vmx_set_cpu_caps(void) kvm_cpu_cap_clear(X86_FEATURE_INVPCID); if (vmx_pt_mode_is_host_guest()) kvm_cpu_cap_check_and_set(X86_FEATURE_INTEL_PT); + if (vmx_pebs_supported()) { + kvm_cpu_cap_check_and_set(X86_FEATURE_DS); + kvm_cpu_cap_check_and_set(X86_FEATURE_DTES64); + } if (vmx_umip_emulated()) kvm_cpu_cap_set(X86_FEATURE_UMIP); -- 2.29.2
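To make the composition of the guest-visible IA32_PERF_CAPABILITIES value easier to follow, here is a small userspace sketch of the same masking steps. The SK_* bit definitions are assumptions taken from the kernel headers of this era (only PEBS_BASELINE as bit 14 is stated in this series), and the boolean parameters are hypothetical stand-ins for boot_cpu_has(X86_FEATURE_PDCM) and vmx_pebs_supported().

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed bit layout; verify against msr-index.h and capabilities.h. */
#define SK_PMU_CAP_LBR_FMT		0x3fULL
#define SK_PMU_CAP_FW_WRITES		(1ULL << 13)
#define SK_PERF_CAP_PEBS_TRAP		(1ULL << 6)
#define SK_PERF_CAP_ARCH_REG		(1ULL << 7)
#define SK_PERF_CAP_PEBS_FORMAT		0xf00ULL
#define SK_PERF_CAP_PEBS_BASELINE	(1ULL << 14)
#define SK_PERF_CAP_PEBS_MASK \
	(SK_PERF_CAP_PEBS_TRAP | SK_PERF_CAP_ARCH_REG | \
	 SK_PERF_CAP_PEBS_FORMAT | SK_PERF_CAP_PEBS_BASELINE)

/*
 * host_msr stands in for the RDMSR result. Full-width writes are always
 * offered; LBR format and PEBS bits are passed through from the host only
 * when the corresponding condition holds.
 */
static uint64_t sketch_guest_perf_cap(uint64_t host_msr, bool has_pdcm,
				      bool pebs_ok)
{
	uint64_t perf_cap = SK_PMU_CAP_FW_WRITES;
	uint64_t host_perf_cap = has_pdcm ? host_msr : 0;

	perf_cap |= host_perf_cap & SK_PMU_CAP_LBR_FMT;
	if (pebs_ok)
		perf_cap |= host_perf_cap & SK_PERF_CAP_PEBS_MASK;
	return perf_cap;
}
```

The point of the restructuring is visible here: without PDCM the host value is treated as zero, so only the full-width-write bit survives.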
[PATCH v4 14/16] KVM: x86/pmu: Add kvm_pmu_cap to optimize perf_get_x86_pmu_capability
The information obtained from the interface perf_get_x86_pmu_capability() doesn't change, so an exported "struct x86_pmu_capability" is introduced for all guests in the KVM, and it's initialized before hardware_setup(). Signed-off-by: Like Xu --- arch/x86/kvm/cpuid.c | 24 +++- arch/x86/kvm/pmu.c | 3 +++ arch/x86/kvm/pmu.h | 19 +++ arch/x86/kvm/vmx/pmu_intel.c | 13 + arch/x86/kvm/x86.c | 9 - 5 files changed, 38 insertions(+), 30 deletions(-) diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c index 6bd2f8b830e4..b3c751d425b7 100644 --- a/arch/x86/kvm/cpuid.c +++ b/arch/x86/kvm/cpuid.c @@ -680,32 +680,22 @@ static inline int __do_cpuid_func(struct kvm_cpuid_array *array, u32 function) case 9: break; case 0xa: { /* Architectural Performance Monitoring */ - struct x86_pmu_capability cap; union cpuid10_eax eax; union cpuid10_edx edx; - perf_get_x86_pmu_capability(&cap); + eax.split.version_id = kvm_pmu_cap.version; + eax.split.num_counters = kvm_pmu_cap.num_counters_gp; + eax.split.bit_width = kvm_pmu_cap.bit_width_gp; + eax.split.mask_length = kvm_pmu_cap.events_mask_len; - /* -* Only support guest architectural pmu on a host -* with architectural pmu. 
-*/ - if (!cap.version) - memset(&cap, 0, sizeof(cap)); - - eax.split.version_id = min(cap.version, 2); - eax.split.num_counters = cap.num_counters_gp; - eax.split.bit_width = cap.bit_width_gp; - eax.split.mask_length = cap.events_mask_len; - - edx.split.num_counters_fixed = min(cap.num_counters_fixed, MAX_FIXED_COUNTERS); - edx.split.bit_width_fixed = cap.bit_width_fixed; + edx.split.num_counters_fixed = kvm_pmu_cap.num_counters_fixed; + edx.split.bit_width_fixed = kvm_pmu_cap.bit_width_fixed; edx.split.anythread_deprecated = 1; edx.split.reserved1 = 0; edx.split.reserved2 = 0; entry->eax = eax.full; - entry->ebx = cap.events_mask; + entry->ebx = kvm_pmu_cap.events_mask; entry->ecx = 0; entry->edx = edx.full; break; diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c index 0081cb742743..28deb51242e1 100644 --- a/arch/x86/kvm/pmu.c +++ b/arch/x86/kvm/pmu.c @@ -19,6 +19,9 @@ #include "lapic.h" #include "pmu.h" +struct x86_pmu_capability __read_mostly kvm_pmu_cap; +EXPORT_SYMBOL_GPL(kvm_pmu_cap); + /* This is enough to filter the vast majority of currently defined events. */ #define KVM_PMU_EVENT_FILTER_MAX_EVENTS 300 diff --git a/arch/x86/kvm/pmu.h b/arch/x86/kvm/pmu.h index 6c902b2d2d5a..3f84640d8f8c 100644 --- a/arch/x86/kvm/pmu.h +++ b/arch/x86/kvm/pmu.h @@ -160,6 +160,23 @@ static inline bool pmc_speculative_in_use(struct kvm_pmc *pmc) return pmc->eventsel & ARCH_PERFMON_EVENTSEL_ENABLE; } +extern struct x86_pmu_capability kvm_pmu_cap; + +static inline void kvm_init_pmu_capability(void) +{ + perf_get_x86_pmu_capability(&kvm_pmu_cap); + + /* +* Only support guest architectural pmu on +* a host with architectural pmu. 
+*/ + if (!kvm_pmu_cap.version) + memset(&kvm_pmu_cap, 0, sizeof(kvm_pmu_cap)); + + kvm_pmu_cap.version = min(kvm_pmu_cap.version, 2); + kvm_pmu_cap.num_counters_fixed = min(kvm_pmu_cap.num_counters_fixed, MAX_FIXED_COUNTERS); +} + void reprogram_gp_counter(struct kvm_pmc *pmc, u64 eventsel); void reprogram_fixed_counter(struct kvm_pmc *pmc, u8 ctrl, int fixed_idx); void reprogram_counter(struct kvm_pmu *pmu, int pmc_idx); @@ -177,9 +194,11 @@ void kvm_pmu_init(struct kvm_vcpu *vcpu); void kvm_pmu_cleanup(struct kvm_vcpu *vcpu); void kvm_pmu_destroy(struct kvm_vcpu *vcpu); int kvm_vm_ioctl_set_pmu_event_filter(struct kvm *kvm, void __user *argp); +void kvm_init_pmu_capability(void); bool is_vmware_backdoor_pmc(u32 pmc_idx); extern struct kvm_pmu_ops intel_pmu_ops; extern struct kvm_pmu_ops amd_pmu_ops; + #endif /* __KVM_X86_PMU_H */ diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c index 55caa941e336..3c1ee59571d9 100644 --- a/arch/x86/kvm/vmx/pmu_intel.c +++ b/arch/x86/kvm/vmx/pmu_intel.c @@ -504,8 +504,6 @@ static void intel_pmu_refresh(struct kvm_vcpu *vcpu) { struct kvm_pmu *pmu = vcpu_to_pmu(vcpu); struct lbr_desc *lbr_desc = vcpu_to_lbr_desc(vcpu); - - struct x86_pmu_capability x86_pmu; struct kvm_cpuid_entry2 *entry; union cpuid10_eax eax; union cpuid10_edx edx; @@ -532,13 +530,12
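The one-time clamping done by kvm_init_pmu_capability() can be modelled in userspace as follows. This is only a sketch: struct sk_pmu_cap is a trimmed-down, hypothetical stand-in for struct x86_pmu_capability, and SK_MAX_FIXED_COUNTERS mirrors the MAX_FIXED_COUNTERS value of 3 defined in pmu.h.

```c
#include <string.h>

#define SK_MAX_FIXED_COUNTERS 3

/* Hypothetical subset of struct x86_pmu_capability. */
struct sk_pmu_cap {
	int version;
	int num_counters_gp;
	int num_counters_fixed;
};

static int sk_min(int a, int b)
{
	return a < b ? a : b;
}

/* Clamp the host capability once, before any guest CPUID is built. */
static void sketch_init_pmu_capability(struct sk_pmu_cap *cap)
{
	/* No architectural PMU on the host: expose nothing to guests. */
	if (!cap->version)
		memset(cap, 0, sizeof(*cap));

	cap->version = sk_min(cap->version, 2);
	cap->num_counters_fixed = sk_min(cap->num_counters_fixed,
					 SK_MAX_FIXED_COUNTERS);
}
```

Doing the clamp once means later consumers, such as the CPUID 0xa leaf, can copy the fields directly instead of repeating the min() calls.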
[PATCH v4 12/16] KVM: x86/pmu: Move pmc_speculative_in_use() to arch/x86/kvm/pmu.h
It allows this inline function to be reused by more callers in more files, such as pmu_intel.c. Signed-off-by: Like Xu --- arch/x86/kvm/pmu.c | 11 --- arch/x86/kvm/pmu.h | 11 +++ 2 files changed, 11 insertions(+), 11 deletions(-) diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c index 8d2873cfec69..0081cb742743 100644 --- a/arch/x86/kvm/pmu.c +++ b/arch/x86/kvm/pmu.c @@ -476,17 +476,6 @@ void kvm_pmu_init(struct kvm_vcpu *vcpu) kvm_pmu_refresh(vcpu); } -static inline bool pmc_speculative_in_use(struct kvm_pmc *pmc) -{ - struct kvm_pmu *pmu = pmc_to_pmu(pmc); - - if (pmc_is_fixed(pmc)) - return fixed_ctrl_field(pmu->fixed_ctr_ctrl, - pmc->idx - INTEL_PMC_IDX_FIXED) & 0x3; - - return pmc->eventsel & ARCH_PERFMON_EVENTSEL_ENABLE; -} - /* Release perf_events for vPMCs that have been unused for a full time slice. */ void kvm_pmu_cleanup(struct kvm_vcpu *vcpu) { diff --git a/arch/x86/kvm/pmu.h b/arch/x86/kvm/pmu.h index d9157128e6eb..6c902b2d2d5a 100644 --- a/arch/x86/kvm/pmu.h +++ b/arch/x86/kvm/pmu.h @@ -149,6 +149,17 @@ static inline u64 get_sample_period(struct kvm_pmc *pmc, u64 counter_value) return sample_period; } +static inline bool pmc_speculative_in_use(struct kvm_pmc *pmc) +{ + struct kvm_pmu *pmu = pmc_to_pmu(pmc); + + if (pmc_is_fixed(pmc)) + return fixed_ctrl_field(pmu->fixed_ctr_ctrl, + pmc->idx - INTEL_PMC_IDX_FIXED) & 0x3; + + return pmc->eventsel & ARCH_PERFMON_EVENTSEL_ENABLE; +} + void reprogram_gp_counter(struct kvm_pmc *pmc, u64 eventsel); void reprogram_fixed_counter(struct kvm_pmc *pmc, u8 ctrl, int fixed_idx); void reprogram_counter(struct kvm_pmu *pmu, int pmc_idx); -- 2.29.2
[PATCH v4 13/16] KVM: x86/pmu: Disable guest PEBS before vm-entry in two cases
Guest PEBS will be disabled when some users try to profile KVM and its user space through the same PEBS facility, or when host perf doesn't schedule the guest PEBS counter in a one-to-one mapping manner (neither of these is a typical scenario). The PEBS records in the guest DS buffer are still accurate, and the above two restrictions are checked before each vm-entry only if guest PEBS is deemed to be enabled. Signed-off-by: Like Xu --- arch/x86/events/intel/core.c| 8 +++- arch/x86/include/asm/kvm_host.h | 9 + arch/x86/kvm/vmx/pmu_intel.c| 16 arch/x86/kvm/vmx/vmx.c | 4 arch/x86/kvm/vmx/vmx.h | 1 + 5 files changed, 37 insertions(+), 1 deletion(-) diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c index 3bbdfc4f6931..20ee1b3fd06b 100644 --- a/arch/x86/events/intel/core.c +++ b/arch/x86/events/intel/core.c @@ -3858,7 +3858,13 @@ static struct perf_guest_switch_msr *intel_guest_get_msrs(int *nr, void *data) if (pmu && x86_pmu.pebs) { arr[1].msr = MSR_IA32_PEBS_ENABLE; arr[1].host = cpuc->pebs_enabled & ~cpuc->intel_ctrl_guest_mask; - arr[1].guest = cpuc->pebs_enabled & ~cpuc->intel_ctrl_host_mask; + if (!arr[1].host) { + arr[1].guest = cpuc->pebs_enabled & ~cpuc->intel_ctrl_host_mask; + /* Disable guest PEBS for cross-mapped PEBS counters. */ + arr[1].guest &= ~pmu->host_cross_mapped_mask; + } else + /* Disable guest PEBS if host PEBS is enabled. */ + arr[1].guest = 0; arr[2].msr = MSR_IA32_DS_AREA; arr[2].host = (unsigned long)ds; diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index 94366da2dfee..cfb5467be7e6 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -466,6 +466,15 @@ struct kvm_pmu { u64 pebs_data_cfg; u64 pebs_data_cfg_mask; + /* +* If a guest counter is cross-mapped to host counter with different +* index, its PEBS capability will be temporarily disabled. 
+* +* The user should make sure that this mask is updated +* after disabling interrupts and before perf_guest_get_msrs(); +*/ + u64 host_cross_mapped_mask; + /* * The gate to release perf_events not marked in * pmc_in_use only once in a vcpu time slice. diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c index 4dcf66e6c398..55caa941e336 100644 --- a/arch/x86/kvm/vmx/pmu_intel.c +++ b/arch/x86/kvm/vmx/pmu_intel.c @@ -767,6 +767,22 @@ static void intel_pmu_cleanup(struct kvm_vcpu *vcpu) intel_pmu_release_guest_lbr_event(vcpu); } +void intel_pmu_cross_mapped_check(struct kvm_pmu *pmu) +{ + struct kvm_pmc *pmc = NULL; + int bit; + + for_each_set_bit(bit, (unsigned long *)&pmu->global_ctrl, X86_PMC_IDX_MAX) { + pmc = kvm_x86_ops.pmu_ops->pmc_idx_to_pmc(pmu, bit); + + if (!pmc || !pmc_speculative_in_use(pmc) || !pmc_is_enabled(pmc)) + continue; + + if (pmc->perf_event && (pmc->idx != pmc->perf_event->hw.idx)) + pmu->host_cross_mapped_mask |= BIT_ULL(pmc->perf_event->hw.idx); + } +} + struct kvm_pmu_ops intel_pmu_ops = { .find_arch_event = intel_find_arch_event, .find_fixed_event = intel_find_fixed_event, diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c index 594c058f6f0f..966fa7962808 100644 --- a/arch/x86/kvm/vmx/vmx.c +++ b/arch/x86/kvm/vmx/vmx.c @@ -6516,6 +6516,10 @@ static void atomic_switch_perf_msrs(struct vcpu_vmx *vmx) struct perf_guest_switch_msr *msrs; struct kvm_pmu *pmu = vcpu_to_pmu(&vmx->vcpu); + pmu->host_cross_mapped_mask = 0; + if (pmu->pebs_enable & pmu->global_ctrl) + intel_pmu_cross_mapped_check(pmu); + /* Note, nr_msrs may be garbage if perf_guest_get_msrs() returns NULL. 
*/ msrs = perf_guest_get_msrs(&nr_msrs, (void *)pmu); if (!msrs) diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h index 0fb3236b0283..0029aaad8eda 100644 --- a/arch/x86/kvm/vmx/vmx.h +++ b/arch/x86/kvm/vmx/vmx.h @@ -96,6 +96,7 @@ union vmx_exit_reason { #define vcpu_to_lbr_desc(vcpu) (&to_vmx(vcpu)->lbr_desc) #define vcpu_to_lbr_records(vcpu) (&to_vmx(vcpu)->lbr_desc.records) +void intel_pmu_cross_mapped_check(struct kvm_pmu *pmu); bool intel_pmu_lbr_is_compatible(struct kvm_vcpu *vcpu); bool intel_pmu_lbr_is_enabled(struct kvm_vcpu *vcpu); -- 2.29.2
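The two restrictions in this patch can be sketched as a pair of pure functions. This is a simplified userspace model, not the kernel code: struct sk_mapping and both function names are hypothetical, and the guest_only_mask and host_only_mask parameters stand in for cpuc->intel_ctrl_guest_mask and cpuc->intel_ctrl_host_mask.

```c
#include <stdint.h>

/* One guest counter and the host counter that perf actually assigned. */
struct sk_mapping {
	int guest_idx;
	int host_idx;
};

/* Collect host counters that back a guest counter with a different index. */
static uint64_t sketch_cross_mapped_mask(const struct sk_mapping *map, int n)
{
	uint64_t mask = 0;
	int i;

	for (i = 0; i < n; i++)
		if (map[i].guest_idx != map[i].host_idx)
			mask |= 1ULL << map[i].host_idx;
	return mask;
}

/* Value that would be loaded into IA32_PEBS_ENABLE for the guest. */
static uint64_t sketch_guest_pebs_enable(uint64_t pebs_enabled,
					 uint64_t host_only_mask,
					 uint64_t guest_only_mask,
					 uint64_t cross_mapped)
{
	uint64_t host = pebs_enabled & ~guest_only_mask;

	if (host)
		return 0;	/* host PEBS in use: guest PEBS stays off */
	return pebs_enabled & ~host_only_mask & ~cross_mapped;
}
```

In this model a guest counter 1 that perf placed on host counter 3 contributes bit 3 to the mask, so PEBS stays enabled only for the counters that are mapped one-to-one.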
[PATCH v4 10/16] KVM: x86: Set PEBS_UNAVAIL in IA32_MISC_ENABLE when PEBS is enabled
Bit 12 of IA32_MISC_ENABLE represents "Processor Event Based Sampling Unavailable (RO)": 1 = PEBS is not supported, 0 = PEBS is supported. A guest write that sets this PEBS_UNAVAIL bit will raise #GP(0) when guest PEBS is enabled. Some PEBS drivers in the guest may care about this bit. Signed-off-by: Like Xu --- arch/x86/kvm/vmx/pmu_intel.c | 2 ++ arch/x86/kvm/x86.c | 4 2 files changed, 6 insertions(+) diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c index 7f18c760dbae..4dcf66e6c398 100644 --- a/arch/x86/kvm/vmx/pmu_intel.c +++ b/arch/x86/kvm/vmx/pmu_intel.c @@ -588,6 +588,7 @@ static void intel_pmu_refresh(struct kvm_vcpu *vcpu) bitmap_set(pmu->all_valid_pmc_idx, INTEL_PMC_IDX_FIXED_VLBR, 1); if (vcpu->arch.perf_capabilities & PERF_CAP_PEBS_FORMAT) { + vcpu->arch.ia32_misc_enable_msr &= ~MSR_IA32_MISC_ENABLE_PEBS_UNAVAIL; if (vcpu->arch.perf_capabilities & PERF_CAP_PEBS_BASELINE) { pmu->pebs_enable_mask = ~pmu->global_ctrl; pmu->reserved_bits &= ~ICL_EVENTSEL_ADAPTIVE; @@ -598,6 +599,7 @@ static void intel_pmu_refresh(struct kvm_vcpu *vcpu) } else pmu->pebs_enable_mask = ~((1ull << pmu->nr_arch_gp_counters) - 1); } else { + vcpu->arch.ia32_misc_enable_msr |= MSR_IA32_MISC_ENABLE_PEBS_UNAVAIL; vcpu->arch.perf_capabilities &= ~PERF_CAP_PEBS_MASK; } } diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 536b64360b75..888f2c3cc288 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -3126,6 +3126,10 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info) break; case MSR_IA32_MISC_ENABLE: data &= ~MSR_IA32_MISC_ENABLE_EMON; + if (!msr_info->host_initiated && + (vcpu->arch.perf_capabilities & PERF_CAP_PEBS_FORMAT) && + (data & MSR_IA32_MISC_ENABLE_PEBS_UNAVAIL)) + return 1; if (!kvm_check_has_quirk(vcpu->kvm, KVM_X86_QUIRK_MISC_ENABLE_NO_MWAIT) && ((vcpu->arch.ia32_misc_enable_msr ^ data) & MSR_IA32_MISC_ENABLE_MWAIT)) { if (!guest_cpuid_has(vcpu, X86_FEATURE_XMM3)) -- 2.29.2
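The write check added to kvm_set_msr_common() boils down to a three-way condition, which the following userspace sketch isolates. The function name is made up for this note, and guest_has_pebs is a hypothetical stand-in for the PERF_CAP_PEBS_FORMAT test on the vCPU's perf_capabilities.

```c
#include <stdbool.h>
#include <stdint.h>

#define SK_MISC_ENABLE_PEBS_UNAVAIL	(1ULL << 12)

/*
 * Returns 1 when the write should fault (KVM turns a non-zero return
 * into #GP(0) for the guest) and 0 when it may proceed.
 */
static int sketch_misc_enable_write(uint64_t data, bool host_initiated,
				    bool guest_has_pebs)
{
	if (!host_initiated && guest_has_pebs &&
	    (data & SK_MISC_ENABLE_PEBS_UNAVAIL))
		return 1;
	return 0;
}
```

Host-initiated writes (for example from userspace restoring state) are deliberately exempt, as are guests without PEBS.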
[PATCH v4 09/16] KVM: x86/pmu: Add PEBS_DATA_CFG MSR emulation to support adaptive PEBS
If IA32_PERF_CAPABILITIES.PEBS_BASELINE [bit 14] is set, the adaptive PEBS is supported. The PEBS_DATA_CFG MSR and adaptive record enable bits (IA32_PERFEVTSELx.Adaptive_Record and IA32_FIXED_CTR_CTRL. FCx_Adaptive_Record) are also supported. Adaptive PEBS provides software the capability to configure the PEBS records to capture only the data of interest, keeping the record size compact. An overflow of PMCx results in generation of an adaptive PEBS record with state information based on the selections specified in MSR_PEBS_DATA_CFG (Memory Info [bit 0], GPRs [bit 1], XMMs [bit 2], and LBRs [bit 3], LBR Entries [bit 31:24]). By default, the PEBS record will only contain the Basic group. When guest adaptive PEBS is enabled, the IA32_PEBS_ENABLE MSR will be added to the perf_guest_switch_msr() and switched during the VMX transitions just like CORE_PERF_GLOBAL_CTRL MSR. Co-developed-by: Luwei Kang Signed-off-by: Luwei Kang Signed-off-by: Like Xu --- arch/x86/events/intel/core.c| 11 ++- arch/x86/include/asm/kvm_host.h | 2 ++ arch/x86/kvm/vmx/pmu_intel.c| 16 3 files changed, 28 insertions(+), 1 deletion(-) diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c index 7f3821a59b84..3bbdfc4f6931 100644 --- a/arch/x86/events/intel/core.c +++ b/arch/x86/events/intel/core.c @@ -3844,6 +3844,7 @@ static struct perf_guest_switch_msr *intel_guest_get_msrs(int *nr, void *data) struct perf_guest_switch_msr *arr = cpuc->guest_switch_msrs; struct debug_store *ds = __this_cpu_read(cpu_hw_events.ds); struct kvm_pmu *pmu = (struct kvm_pmu *)data; + bool baseline = x86_pmu.intel_cap.pebs_baseline; arr[0].msr = MSR_CORE_PERF_GLOBAL_CTRL; arr[0].host = x86_pmu.intel_ctrl & ~cpuc->intel_ctrl_guest_mask; @@ -3863,6 +3864,12 @@ static struct perf_guest_switch_msr *intel_guest_get_msrs(int *nr, void *data) arr[2].host = (unsigned long)ds; arr[2].guest = pmu->ds_area; + if (baseline) { + arr[3].msr = MSR_PEBS_DATA_CFG; + arr[3].host = cpuc->pebs_data_cfg; + arr[3].guest = 
pmu->pebs_data_cfg; + } + /* * If PMU counter has PEBS enabled it is not enough to * disable counter on a guest entry since PEBS memory @@ -3879,9 +3886,11 @@ static struct perf_guest_switch_msr *intel_guest_get_msrs(int *nr, void *data) else { arr[1].guest = arr[1].host; arr[2].guest = arr[2].host; + if (baseline) + arr[3].guest = arr[3].host; } - *nr = 3; + *nr = baseline ? 4 : 3; } return arr; diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index 2275cc144f58..94366da2dfee 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -463,6 +463,8 @@ struct kvm_pmu { u64 ds_area; u64 pebs_enable; u64 pebs_enable_mask; + u64 pebs_data_cfg; + u64 pebs_data_cfg_mask; /* * The gate to release perf_events not marked in diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c index 77d30106abca..7f18c760dbae 100644 --- a/arch/x86/kvm/vmx/pmu_intel.c +++ b/arch/x86/kvm/vmx/pmu_intel.c @@ -226,6 +226,9 @@ static bool intel_is_valid_msr(struct kvm_vcpu *vcpu, u32 msr) case MSR_IA32_DS_AREA: ret = guest_cpuid_has(vcpu, X86_FEATURE_DS); break; + case MSR_PEBS_DATA_CFG: + ret = vcpu->arch.perf_capabilities & PERF_CAP_PEBS_BASELINE; + break; default: ret = get_gp_pmc(pmu, msr, MSR_IA32_PERFCTR0) || get_gp_pmc(pmu, msr, MSR_P6_EVNTSEL0) || @@ -379,6 +382,9 @@ static int intel_pmu_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info) case MSR_IA32_DS_AREA: msr_info->data = pmu->ds_area; return 0; + case MSR_PEBS_DATA_CFG: + msr_info->data = pmu->pebs_data_cfg; + return 0; default: if ((pmc = get_gp_pmc(pmu, msr, MSR_IA32_PERFCTR0)) || (pmc = get_gp_pmc(pmu, msr, MSR_IA32_PMC0))) { @@ -452,6 +458,14 @@ static int intel_pmu_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info) return 1; pmu->ds_area = data; return 0; + case MSR_PEBS_DATA_CFG: + if (pmu->pebs_data_cfg == data) + return 0; + if (!(data & pmu->pebs_data_cfg_mask)) { + pmu->pebs_data_cfg = data; + return 0; + } + break; default:
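The MSR_PEBS_DATA_CFG write path follows the usual "reject reserved bits" pattern. In the sketch below the valid-bit mask is an assumption derived from the bit list in the commit message (bits 3:0 for the record groups and bits 31:24 for the LBR entry count); the hunk that initializes pebs_data_cfg_mask is not visible above, so the value the series actually uses may differ.

```c
#include <stdint.h>

/* Assumed valid bits: [3:0] record groups, [31:24] LBR entries. */
#define SK_PEBS_DATA_CFG_VALID	0xff00000fULL
#define SK_PEBS_DATA_CFG_RSVD	(~SK_PEBS_DATA_CFG_VALID)

/* Returns 0 and updates *cfg on success, 1 if a reserved bit is set. */
static int sketch_set_pebs_data_cfg(uint64_t *cfg, uint64_t data)
{
	if (*cfg == data)
		return 0;
	if (!(data & SK_PEBS_DATA_CFG_RSVD)) {
		*cfg = data;
		return 0;
	}
	return 1;
}
```

A rejected write leaves the stored value untouched, which matches the break-without-store path in the hunk above.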
[PATCH v4 11/16] KVM: x86/pmu: Adjust precise_ip to emulate Ice Lake guest PDIR counter
The PEBS-PDIR facility on Ice Lake server is supported on IA32_FIXED0 only. If the guest configures counter 32 and PEBS is enabled, the PEBS-PDIR facility is supposed to be used, in which case KVM adjusts attr.precise_ip to 3 and requests host perf to assign exactly the requested counter or fail. The CPU model check is also required since some platforms may place the PEBS-PDIR facility at another counter index. Signed-off-by: Like Xu --- arch/x86/kvm/pmu.c | 2 ++ arch/x86/kvm/pmu.h | 7 +++ 2 files changed, 9 insertions(+) diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c index 3509b18478b9..8d2873cfec69 100644 --- a/arch/x86/kvm/pmu.c +++ b/arch/x86/kvm/pmu.c @@ -148,6 +148,8 @@ static void pmc_reprogram_counter(struct kvm_pmc *pmc, u32 type, * in the PEBS record is calibrated on the guest side. */ attr.precise_ip = 1; + if (x86_match_cpu(vmx_icl_pebs_cpu) && pmc->idx == 32) + attr.precise_ip = 3; } event = perf_event_create_kernel_counter(&attr, -1, current, diff --git a/arch/x86/kvm/pmu.h b/arch/x86/kvm/pmu.h index 7b30bc967af3..d9157128e6eb 100644 --- a/arch/x86/kvm/pmu.h +++ b/arch/x86/kvm/pmu.h @@ -4,6 +4,8 @@ #include +#include + #define vcpu_to_pmu(vcpu) (&(vcpu)->arch.pmu) #define pmu_to_vcpu(pmu) (container_of((pmu), struct kvm_vcpu, arch.pmu)) #define pmc_to_pmu(pmc) (&(pmc)->vcpu->arch.pmu) @@ -16,6 +18,11 @@ #define VMWARE_BACKDOOR_PMC_APPARENT_TIME 0x10002 #define MAX_FIXED_COUNTERS 3 +static const struct x86_cpu_id vmx_icl_pebs_cpu[] = { + X86_MATCH_INTEL_FAM6_MODEL(ICELAKE_D, NULL), + X86_MATCH_INTEL_FAM6_MODEL(ICELAKE_X, NULL), + {} +}; struct kvm_event_hw_type_mapping { u8 eventsel; -- 2.29.2
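The resulting precise_ip selection can be written as a small decision function. This is a sketch with hypothetical names: host_is_icl_server stands in for x86_match_cpu(vmx_icl_pebs_cpu), and the non-PEBS case returning 0 reflects that attr.precise_ip is simply left unset for ordinary events.

```c
#include <stdbool.h>

#define SK_ICL_PDIR_IDX 32	/* guest index of fixed counter 0 */

static int sketch_precise_ip(bool pebs, bool host_is_icl_server, int pmc_idx)
{
	if (!pebs)
		return 0;	/* ordinary, non-PEBS event */
	if (host_is_icl_server && pmc_idx == SK_ICL_PDIR_IDX)
		return 3;	/* PDIR: perf must grant this exact counter */
	return 1;
}
```

Any other PEBS counter, or counter 32 on a host outside the Ice Lake server list, keeps the default level of 1.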
[PATCH v4 08/16] KVM: x86/pmu: Add IA32_DS_AREA MSR emulation to manage guest DS buffer
When CPUID.01H:EDX.DS[21] is set, the IA32_DS_AREA MSR exists and points to the linear address of the first byte of the DS buffer management area, which is used to manage the PEBS records. When guest PEBS is enabled and the guest's value differs from the host's, KVM adds the IA32_DS_AREA MSR to the msr-switch list. The guest's DS value is then loaded into the real hardware before VM-entry, and the MSR is removed from the list when guest PEBS is disabled. A WRMSR to the IA32_DS_AREA MSR raises #GP(0) if the source register contains a non-canonical address. Switching the IA32_DS_AREA MSR also sets up a quiescent period so that host PEBS records (if any) are written to the host DS area rather than the guest DS area. When guest PEBS is enabled, MSR_IA32_DS_AREA is added to the perf_guest_switch_msr list and switched during VMX transitions, just like the CORE_PERF_GLOBAL_CTRL MSR. Originally-by: Andi Kleen Co-developed-by: Kan Liang Signed-off-by: Kan Liang Signed-off-by: Like Xu --- arch/x86/events/intel/core.c| 15 --- arch/x86/include/asm/kvm_host.h | 1 + arch/x86/kvm/vmx/pmu_intel.c| 11 +++ arch/x86/kvm/vmx/vmx.c | 1 + 4 files changed, 25 insertions(+), 3 deletions(-) diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c index 2ca8ed61f444..7f3821a59b84 100644 --- a/arch/x86/events/intel/core.c +++ b/arch/x86/events/intel/core.c @@ -21,6 +21,7 @@ #include #include #include +#include #include "../perf_event.h" @@ -3841,6 +3842,8 @@ static struct perf_guest_switch_msr *intel_guest_get_msrs(int *nr, void *data) { struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events); struct perf_guest_switch_msr *arr = cpuc->guest_switch_msrs; + struct debug_store *ds = __this_cpu_read(cpu_hw_events.ds); + struct kvm_pmu *pmu = (struct kvm_pmu *)data; arr[0].msr = MSR_CORE_PERF_GLOBAL_CTRL; arr[0].host = x86_pmu.intel_ctrl & ~cpuc->intel_ctrl_guest_mask; @@ -3851,11 +3854,15 @@ static struct perf_guest_switch_msr *intel_guest_get_msrs(int *nr, void *data) arr[0].guest &= ~(cpuc->pebs_enabled 
& PEBS_COUNTER_MASK); *nr = 1; - if (x86_pmu.pebs) { + if (pmu && x86_pmu.pebs) { arr[1].msr = MSR_IA32_PEBS_ENABLE; arr[1].host = cpuc->pebs_enabled & ~cpuc->intel_ctrl_guest_mask; arr[1].guest = cpuc->pebs_enabled & ~cpuc->intel_ctrl_host_mask; + arr[2].msr = MSR_IA32_DS_AREA; + arr[2].host = (unsigned long)ds; + arr[2].guest = pmu->ds_area; + /* * If PMU counter has PEBS enabled it is not enough to * disable counter on a guest entry since PEBS memory @@ -3869,10 +3876,12 @@ static struct perf_guest_switch_msr *intel_guest_get_msrs(int *nr, void *data) if (arr[1].guest) arr[0].guest |= arr[1].guest; - else + else { arr[1].guest = arr[1].host; + arr[2].guest = arr[2].host; + } - *nr = 2; + *nr = 3; } return arr; diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index f620485d7836..2275cc144f58 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -460,6 +460,7 @@ struct kvm_pmu { DECLARE_BITMAP(all_valid_pmc_idx, X86_PMC_IDX_MAX); DECLARE_BITMAP(pmc_in_use, X86_PMC_IDX_MAX); + u64 ds_area; u64 pebs_enable; u64 pebs_enable_mask; diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c index 0700d6d739f7..77d30106abca 100644 --- a/arch/x86/kvm/vmx/pmu_intel.c +++ b/arch/x86/kvm/vmx/pmu_intel.c @@ -223,6 +223,9 @@ static bool intel_is_valid_msr(struct kvm_vcpu *vcpu, u32 msr) case MSR_IA32_PEBS_ENABLE: ret = vcpu->arch.perf_capabilities & PERF_CAP_PEBS_FORMAT; break; + case MSR_IA32_DS_AREA: + ret = guest_cpuid_has(vcpu, X86_FEATURE_DS); + break; default: ret = get_gp_pmc(pmu, msr, MSR_IA32_PERFCTR0) || get_gp_pmc(pmu, msr, MSR_P6_EVNTSEL0) || @@ -373,6 +376,9 @@ static int intel_pmu_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info) case MSR_IA32_PEBS_ENABLE: msr_info->data = pmu->pebs_enable; return 0; + case MSR_IA32_DS_AREA: + msr_info->data = pmu->ds_area; + return 0; default: if ((pmc = get_gp_pmc(pmu, msr, MSR_IA32_PERFCTR0)) || (pmc = get_gp_pmc(pmu, msr, MSR_IA32_PMC0))) { @@ 
-441,6 +447,11 @@ static int intel_pmu_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info) return 0;
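The non-canonical address rule mentioned in the commit message can be illustrated in userspace as below. This is a sketch under assumptions: it hard-codes 48 virtual address bits (4-level paging), whereas the kernel derives the width from the vCPU, and both function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* True if bits [63:vaddr_bits-1] of addr are all zeros or all ones. */
static bool sketch_is_canonical(uint64_t addr, unsigned int vaddr_bits)
{
	uint64_t upper, all_ones;

	assert(vaddr_bits >= 1 && vaddr_bits <= 64);
	upper = addr >> (vaddr_bits - 1);
	all_ones = ~0ULL >> (vaddr_bits - 1);
	return upper == 0 || upper == all_ones;
}

/* Returns 1 (fault) for a non-canonical value, otherwise stores it. */
static int sketch_set_ds_area(uint64_t *ds_area, uint64_t data)
{
	if (!sketch_is_canonical(data, 48))
		return 1;	/* assumed 48-bit virtual addresses */
	*ds_area = data;
	return 0;
}
```

With 48 bits, 0x00007fffffffffff and 0xffff800000000000 are the edges of the two canonical halves, and everything between them faults.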
[PATCH v4 06/16] KVM: x86/pmu: Reprogram guest PEBS event to emulate guest PEBS counter
When a guest counter is configured as a PEBS counter through IA32_PEBS_ENABLE, a guest PEBS event will be reprogrammed by configuring a non-zero precision level in the perf_event_attr. The guest PEBS overflow PMI bit would be set in the guest GLOBAL_STATUS MSR when PEBS facility generates a PEBS overflow PMI based on guest IA32_DS_AREA MSR. The attr.precise_ip would be adjusted to a special precision level when the new PEBS-PDIR feature is supported later which would affect the host counters scheduling. The guest PEBS event would not be reused for non-PEBS guest event even with the same guest counter index. Originally-by: Andi Kleen Co-developed-by: Kan Liang Signed-off-by: Kan Liang Signed-off-by: Like Xu --- arch/x86/include/asm/kvm_host.h | 2 ++ arch/x86/kvm/pmu.c | 33 +++-- 2 files changed, 33 insertions(+), 2 deletions(-) diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index c560960544a3..9b814bdc9137 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -460,6 +460,8 @@ struct kvm_pmu { DECLARE_BITMAP(all_valid_pmc_idx, X86_PMC_IDX_MAX); DECLARE_BITMAP(pmc_in_use, X86_PMC_IDX_MAX); + u64 pebs_enable; + /* * The gate to release perf_events not marked in * pmc_in_use only once in a vcpu time slice. diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c index 827886c12c16..3509b18478b9 100644 --- a/arch/x86/kvm/pmu.c +++ b/arch/x86/kvm/pmu.c @@ -74,11 +74,20 @@ static void kvm_perf_overflow_intr(struct perf_event *perf_event, { struct kvm_pmc *pmc = perf_event->overflow_handler_context; struct kvm_pmu *pmu = pmc_to_pmu(pmc); + bool skip_pmi = false; if (!test_and_set_bit(pmc->idx, pmu->reprogram_pmi)) { - __set_bit(pmc->idx, (unsigned long *)&pmu->global_status); + if (perf_event->attr.precise_ip) { + /* Indicate PEBS overflow PMI to guest. 
*/ + skip_pmi = test_and_set_bit(GLOBAL_STATUS_BUFFER_OVF_BIT, + (unsigned long *)&pmu->global_status); + } else + __set_bit(pmc->idx, (unsigned long *)&pmu->global_status); kvm_make_request(KVM_REQ_PMU, pmc->vcpu); + if (skip_pmi) + return; + /* * Inject PMI. If vcpu was in a guest mode during NMI PMI * can be ejected on a guest mode re-entry. Otherwise we can't @@ -99,6 +108,7 @@ static void pmc_reprogram_counter(struct kvm_pmc *pmc, u32 type, bool exclude_kernel, bool intr, bool in_tx, bool in_tx_cp) { + struct kvm_pmu *pmu = vcpu_to_pmu(pmc->vcpu); struct perf_event *event; struct perf_event_attr attr = { .type = type, @@ -110,6 +120,7 @@ static void pmc_reprogram_counter(struct kvm_pmc *pmc, u32 type, .exclude_kernel = exclude_kernel, .config = config, }; + bool pebs = test_bit(pmc->idx, (unsigned long *)&pmu->pebs_enable); attr.sample_period = get_sample_period(pmc, pmc->counter); @@ -124,9 +135,23 @@ static void pmc_reprogram_counter(struct kvm_pmc *pmc, u32 type, attr.sample_period = 0; attr.config |= HSW_IN_TX_CHECKPOINTED; } + if (pebs) { + /* +* The non-zero precision level of guest event makes the ordinary +* guest event becomes a guest PEBS event and triggers the host +* PEBS PMI handler to determine whether the PEBS overflow PMI +* comes from the host counters or the guest. +* +* For most PEBS hardware events, the difference in the software +* precision levels of guest and host PEBS events will not affect +* the accuracy of the PEBS profiling result, because the "event IP" +* in the PEBS record is calibrated on the guest side. +*/ + attr.precise_ip = 1; + } event = perf_event_create_kernel_counter(&attr, -1, current, -intr ? kvm_perf_overflow_intr : +(intr || pebs) ? 
kvm_perf_overflow_intr : kvm_perf_overflow, pmc); if (IS_ERR(event)) { pr_debug_ratelimited("kvm_pmu: event creation failed %ld for pmc->idx = %d\n", @@ -161,6 +186,10 @@ static bool pmc_resume_counter(struct kvm_pmc *pmc) get_sample_period(pmc, pmc->counter))) return false; + if (!test_bit(pmc->idx, (unsigned long *)&pm
[PATCH v4 05/16] KVM: x86/pmu: Introduce the ctrl_mask value for fixed counter
The mask value of the fixed counter control register should be dynamically adjusted according to the number of fixed counters. This patch introduces a variable that holds the reserved bits of the fixed counter control register. This is needed for later Ice Lake fixed counter changes. Co-developed-by: Luwei Kang Signed-off-by: Luwei Kang Signed-off-by: Like Xu --- arch/x86/include/asm/kvm_host.h | 1 + arch/x86/kvm/vmx/pmu_intel.c| 6 +- 2 files changed, 6 insertions(+), 1 deletion(-) diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index a52f973bdff6..c560960544a3 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -444,6 +444,7 @@ struct kvm_pmu { unsigned nr_arch_fixed_counters; unsigned available_event_types; u64 fixed_ctr_ctrl; + u64 fixed_ctr_ctrl_mask; u64 global_ctrl; u64 global_status; u64 global_ovf_ctrl; diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c index d9dbebe03cae..ac7fe714e6c1 100644 --- a/arch/x86/kvm/vmx/pmu_intel.c +++ b/arch/x86/kvm/vmx/pmu_intel.c @@ -400,7 +400,7 @@ static int intel_pmu_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info) case MSR_CORE_PERF_FIXED_CTR_CTRL: if (pmu->fixed_ctr_ctrl == data) return 0; - if (!(data & 0xf444ull)) { + if (!(data & pmu->fixed_ctr_ctrl_mask)) { reprogram_fixed_counters(pmu, data); return 0; } @@ -470,6 +470,7 @@ static void intel_pmu_refresh(struct kvm_vcpu *vcpu) struct kvm_cpuid_entry2 *entry; union cpuid10_eax eax; union cpuid10_edx edx; + int i; pmu->nr_arch_gp_counters = 0; pmu->nr_arch_fixed_counters = 0; @@ -477,6 +478,7 @@ static void intel_pmu_refresh(struct kvm_vcpu *vcpu) pmu->counter_bitmask[KVM_PMC_FIXED] = 0; pmu->version = 0; pmu->reserved_bits = 0x0020ull; + pmu->fixed_ctr_ctrl_mask = ~0ull; entry = kvm_find_cpuid_entry(vcpu, 0xa, 0); if (!entry) @@ -511,6 +513,8 @@ static void intel_pmu_refresh(struct kvm_vcpu *vcpu) ((u64)1 << edx.split.bit_width_fixed) - 1; } + for (i = 0; i < pmu->nr_arch_fixed_counters; 
i++) + pmu->fixed_ctr_ctrl_mask &= ~(0xbull << (i * 4)); pmu->global_ctrl = ((1ull << pmu->nr_arch_gp_counters) - 1) | (((1ull << pmu->nr_arch_fixed_counters) - 1) << INTEL_PMC_IDX_FIXED); pmu->global_ctrl_mask = ~pmu->global_ctrl; -- 2.29.2
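As a rough illustration of the loop that patch 05/16 adds to intel_pmu_refresh(), the following user-space sketch computes the same kind of mask outside the kernel. The helper names fixed_ctr_ctrl_mask() and fixed_ctr_ctrl_write_ok(), and the bound of 16 counters, are assumptions made for this example and are not part of the patch.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical user-space helper (not from the patch): start with every
 * bit reserved, then clear bits 0, 1 and 3 of each 4-bit field that
 * belongs to an implemented fixed counter. Bit 2 of each field stays
 * reserved, which is what the 0xb pattern in the patch expresses.
 */
uint64_t fixed_ctr_ctrl_mask(unsigned int nr_fixed)
{
	uint64_t mask = ~0ull;
	unsigned int i;

	assert(nr_fixed <= 16);	/* 16 fields of 4 bits fill a 64-bit value */

	for (i = 0; i < nr_fixed; i++)
		mask &= ~(0xbull << (i * 4));

	return mask;
}

/* A write is acceptable only if it touches no reserved bit. */
int fixed_ctr_ctrl_write_ok(uint64_t data, unsigned int nr_fixed)
{
	return !(data & fixed_ctr_ctrl_mask(nr_fixed));
}
```

With three fixed counters the sketch yields 0xfffffffffffff444, which ends in the same f444 pattern as the constant the patch removes. With four counters it yields 0xffffffffffff4444, so a value such as 0x3000 that programs the fourth counter is rejected under the three-counter mask and accepted under the four-counter mask.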
[PATCH v4 07/16] KVM: x86/pmu: Add IA32_PEBS_ENABLE MSR emulation for extended PEBS
If IA32_PERF_CAPABILITIES.PEBS_BASELINE [bit 14] is set, the IA32_PEBS_ENABLE MSR exists and all architecturally enumerated fixed and general purpose counters have corresponding bits in IA32_PEBS_ENABLE that enable generation of PEBS records. The general-purpose counter bits start at bit IA32_PEBS_ENABLE[0], and the fixed counter bits start at bit IA32_PEBS_ENABLE[32]. When guest PEBS is enabled, the IA32_PEBS_ENABLE MSR will be added to the perf_guest_switch_msr() and switched during the VMX transitions just like CORE_PERF_GLOBAL_CTRL MSR. Originally-by: Andi Kleen Co-developed-by: Kan Liang Signed-off-by: Kan Liang Co-developed-by: Luwei Kang Signed-off-by: Luwei Kang Signed-off-by: Like Xu --- arch/x86/events/intel/core.c | 17 + arch/x86/include/asm/kvm_host.h | 1 + arch/x86/include/asm/msr-index.h | 6 ++ arch/x86/kvm/vmx/pmu_intel.c | 28 4 files changed, 48 insertions(+), 4 deletions(-) diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c index e8fee7cf767f..2ca8ed61f444 100644 --- a/arch/x86/events/intel/core.c +++ b/arch/x86/events/intel/core.c @@ -3851,7 +3851,11 @@ static struct perf_guest_switch_msr *intel_guest_get_msrs(int *nr, void *data) arr[0].guest &= ~(cpuc->pebs_enabled & PEBS_COUNTER_MASK); *nr = 1; - if (x86_pmu.pebs && x86_pmu.pebs_no_isolation) { + if (x86_pmu.pebs) { + arr[1].msr = MSR_IA32_PEBS_ENABLE; + arr[1].host = cpuc->pebs_enabled & ~cpuc->intel_ctrl_guest_mask; + arr[1].guest = cpuc->pebs_enabled & ~cpuc->intel_ctrl_host_mask; + /* * If PMU counter has PEBS enabled it is not enough to * disable counter on a guest entry since PEBS memory @@ -3860,9 +3864,14 @@ static struct perf_guest_switch_msr *intel_guest_get_msrs(int *nr, void *data) * * Don't do this if the CPU already enforces it. 
*/ - arr[1].msr = MSR_IA32_PEBS_ENABLE; - arr[1].host = cpuc->pebs_enabled; - arr[1].guest = 0; + if (x86_pmu.pebs_no_isolation) + arr[1].guest = 0; + + if (arr[1].guest) + arr[0].guest |= arr[1].guest; + else + arr[1].guest = arr[1].host; + *nr = 2; } diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index 9b814bdc9137..f620485d7836 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -461,6 +461,7 @@ struct kvm_pmu { DECLARE_BITMAP(pmc_in_use, X86_PMC_IDX_MAX); u64 pebs_enable; + u64 pebs_enable_mask; /* * The gate to release perf_events not marked in diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h index 546d6ecf0a35..9afcad882f4f 100644 --- a/arch/x86/include/asm/msr-index.h +++ b/arch/x86/include/asm/msr-index.h @@ -186,6 +186,12 @@ #define MSR_IA32_DS_AREA 0x0600 #define MSR_IA32_PERF_CAPABILITIES 0x0345 #define MSR_PEBS_LD_LAT_THRESHOLD 0x03f6 +#define PERF_CAP_PEBS_TRAP BIT_ULL(6) +#define PERF_CAP_ARCH_REG BIT_ULL(7) +#define PERF_CAP_PEBS_FORMAT 0xf00 +#define PERF_CAP_PEBS_BASELINE BIT_ULL(14) +#define PERF_CAP_PEBS_MASK (PERF_CAP_PEBS_TRAP | PERF_CAP_ARCH_REG | \ + PERF_CAP_PEBS_FORMAT | PERF_CAP_PEBS_BASELINE) #define MSR_IA32_RTIT_CTL 0x0570 #define RTIT_CTL_TRACEEN BIT(0) diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c index ac7fe714e6c1..0700d6d739f7 100644 --- a/arch/x86/kvm/vmx/pmu_intel.c +++ b/arch/x86/kvm/vmx/pmu_intel.c @@ -220,6 +220,9 @@ static bool intel_is_valid_msr(struct kvm_vcpu *vcpu, u32 msr) case MSR_CORE_PERF_GLOBAL_OVF_CTRL: ret = pmu->version > 1; break; + case MSR_IA32_PEBS_ENABLE: + ret = vcpu->arch.perf_capabilities & PERF_CAP_PEBS_FORMAT; + break; default: ret = get_gp_pmc(pmu, msr, MSR_IA32_PERFCTR0) || get_gp_pmc(pmu, msr, MSR_P6_EVNTSEL0) || @@ -367,6 +370,9 @@ static int intel_pmu_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info) case MSR_CORE_PERF_GLOBAL_OVF_CTRL: msr_info->data = 
pmu->global_ovf_ctrl; return 0; + case MSR_IA32_PEBS_ENABLE: + msr_info->data = pmu->pebs_enable; + return 0; default: if ((pmc = get_gp_pmc(pmu, msr, MSR_IA32_PERFCTR0)) || (pmc = get_gp_pmc(pmu, msr, MSR_IA32_PMC0))) { @@ -427,6 +433,14 @@ static int intel_pmu_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
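The host/guest value selection in the intel_guest_get_msrs() hunk of patch 07/16 can be sketched in user space as follows. This is only a model of the assignments visible in the diff: struct switch_vals and pebs_switch_vals() are hypothetical names, and the parameters stand in for cpuc->pebs_enabled, cpuc->intel_ctrl_host_mask, cpuc->intel_ctrl_guest_mask and x86_pmu.pebs_no_isolation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* One host/guest value pair, loosely modelled on perf_guest_switch_msr. */
struct switch_vals {
	uint64_t host;
	uint64_t guest;
};

/*
 * Hypothetical helper: applies the same assignments, in the same order,
 * as the hunk. 'global_ctrl' plays the role of arr[0] and the returned
 * pair plays the role of arr[1] (IA32_PEBS_ENABLE).
 */
struct switch_vals pebs_switch_vals(uint64_t pebs_enabled,
				    uint64_t host_mask, uint64_t guest_mask,
				    bool no_isolation,
				    struct switch_vals *global_ctrl)
{
	struct switch_vals pebs;

	assert(global_ctrl != NULL);

	pebs.host = pebs_enabled & ~guest_mask;
	pebs.guest = pebs_enabled & ~host_mask;

	/* Without hardware isolation, PEBS must stay off in the guest. */
	if (no_isolation)
		pebs.guest = 0;

	if (pebs.guest)
		/* Guest PEBS counters must also be enabled in global ctrl. */
		global_ctrl->guest |= pebs.guest;
	else
		/* Nothing for the guest: keep the host value. */
		pebs.guest = pebs.host;

	return pebs;
}
```

For a guest-only PEBS event on counter 0 (pebs_enabled = 0x1, guest_mask = 0x1), the sketch returns host = 0 and guest = 0x1 and sets bit 0 in the guest global control value. For a host-only event (host_mask = 0x1) the guest value falls back to the host value, so both sides of the pair are equal.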
[PATCH v4 04/16] KVM: x86/pmu: Set MSR_IA32_MISC_ENABLE_EMON bit when vPMU is enabled
On Intel platforms, software may use the IA32_MISC_ENABLE[7] bit to detect whether the performance monitoring facility is supported in the processor. The bit is set only when the PMU is enabled for the guest, and guest writes to this PMU-available bit are ignored. Cc: Yao Yuan Signed-off-by: Like Xu --- arch/x86/kvm/vmx/pmu_intel.c | 1 + arch/x86/kvm/x86.c | 1 + 2 files changed, 2 insertions(+) diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c index 9efc1a6b8693..d9dbebe03cae 100644 --- a/arch/x86/kvm/vmx/pmu_intel.c +++ b/arch/x86/kvm/vmx/pmu_intel.c @@ -488,6 +488,7 @@ static void intel_pmu_refresh(struct kvm_vcpu *vcpu) if (!pmu->version) return; + vcpu->arch.ia32_misc_enable_msr |= MSR_IA32_MISC_ENABLE_EMON; perf_get_x86_pmu_capability(&x86_pmu); pmu->nr_arch_gp_counters = min_t(int, eax.split.num_counters, diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index a9d95f90a048..536b64360b75 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -3125,6 +3125,7 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info) } break; case MSR_IA32_MISC_ENABLE: + data &= ~MSR_IA32_MISC_ENABLE_EMON; if (!kvm_check_has_quirk(vcpu->kvm, KVM_X86_QUIRK_MISC_ENABLE_NO_MWAIT) && ((vcpu->arch.ia32_misc_enable_msr ^ data) & MSR_IA32_MISC_ENABLE_MWAIT)) { if (!guest_cpuid_has(vcpu, X86_FEATURE_XMM3)) -- 2.29.2
[PATCH v4 00/16] KVM: x86/pmu: Add basic support to enable Guest PEBS via DS
The guest Precise Event Based Sampling (PEBS) feature can provide the architectural state of the instruction executed after the guest instruction that exactly caused the event. It needs a new hardware facility that is only available on Intel Ice Lake Server platforms. This patch set enables the basic PEBS via DS feature for KVM guests on ICX.

We can use the PEBS feature in a Linux guest just as on native hardware:

  # perf record -e instructions:ppp ./br_instr a
  # perf record -c 10 -e instructions:pp ./br_instr a

To emulate the guest PEBS facility for the above perf usages, we need to implement 2 code paths:

1) Fast path

This is when the host assigned physical PMC has the same index as the virtual PMC (e.g. using physical PMC0 to emulate virtual PMC0). This path is used in most common use cases.

2) Slow path

This is when the host assigned physical PMC has a different index from the virtual PMC (e.g. using physical PMC1 to emulate virtual PMC0). In this case, KVM needs to rewrite the PEBS records to change the applicable counter indexes to the virtual PMC indexes, which would otherwise contain the physical counter index written by the PEBS facility, and switch the counter reset values to the offset corresponding to the physical counter indexes in the DS data structure.

The previous version [0] enables both the fast path and the slow path, which seems a bit too complex for a first step. In this patch set, we want to start with the fast path to get the basic guest PEBS enabled while keeping the slow path disabled. More focused discussion on the slow path [1] is planned for another patch set in the next step.

Compared with the versions planned for subsequent steps, this patch set lacks two pieces of functionality: support for host and guest PEBS being enabled at the same time, and emulation of guest PEBS when the counter is cross-mapped (neither is a typical scenario).

With the basic support, the guest can retrieve the correct PEBS information from its own PEBS records on Ice Lake servers. We expect it to work when migrating to another Ice Lake server, and no regression in host perf is expected.

Here are the results of the PEBS test from guest and host for the same workload:

perf report on guest:
# Samples: 2K of event 'instructions:ppp', # Event count (approx.): 1473377250
# Overhead  Command   Shared Object      Symbol
  57.74%    br_instr  br_instr           [.] lfsr_cond
  41.40%    br_instr  br_instr           [.] cmp_end
   0.21%    br_instr  [kernel.kallsyms]  [k] __lock_acquire

perf report on host:
# Samples: 2K of event 'instructions:ppp', # Event count (approx.): 1462721386
# Overhead  Command   Shared Object      Symbol
  57.90%    br_instr  br_instr           [.] lfsr_cond
  41.95%    br_instr  br_instr           [.] cmp_end
   0.05%    br_instr  [kernel.vmlinux]   [k] lock_acquire

Conclusion: the profiling results on the guest are similar to those on the host.

Please check more details in each commit and feel free to comment.

Previous:
[0] https://lore.kernel.org/kvm/20210104131542.495413-1-like...@linux.intel.com/
[1] https://lore.kernel.org/kvm/20210115191113.nktlnmivc3eds...@two.firstfloor.org/

v3->v4 Changelog:
- Update this cover letter and propose a new upstream plan;
[PERF]
- Drop check host DS and move handler to handle_pmi_common();
- Pass "struct kvm_pmu *" to intel_guest_get_msrs();
- Propose new assignment logic for perf_guest_switch_msr();
- Introduce x86_pmu.pebs_vmx for future capability maintenance;
[KVM]
- Add kvm_pmu_cap to optimize perf_get_x86_pmu_capability;
- Raising PEBS PMI only when OVF_BIT 62 is not set;
- Make vmx_icl_pebs_cpu specific for PEBS-PDIR emulation;
- Fix a bug for fixed_ctr_ctrl_mask;
- Add two minor refactoring patches for reuse;

Like Xu (16):
  perf/x86/intel: Add x86_pmu.pebs_vmx for Ice Lake Servers
  perf/x86/intel: Handle guest PEBS overflow PMI for KVM guest
  perf/x86/core: Pass "struct kvm_pmu *" to determine the guest values
  KVM: x86/pmu: Set MSR_IA32_MISC_ENABLE_EMON bit when vPMU is enabled
  KVM: x86/pmu: Introduce the ctrl_mask value for fixed counter
  KVM: x86/pmu: Reprogram guest PEBS event to emulate guest PEBS counter
  KVM: x86/pmu: Add IA32_PEBS_ENABLE MSR emulation for extended PEBS
  KVM: x86/pmu: Add IA32_DS_AREA MSR emulation to manage guest DS buffer
  KVM: x86/pmu: Add PEBS_DATA_CFG MSR emulation to support adaptive PEBS
  KVM: x86: Set PEBS_UNAVAIL in IA32_MISC_ENABLE when PEBS is enabled
  KVM: x86/pmu: Adjust precise_ip to emulate Ice Lake guest PDIR counter
  KVM: x86/pmu: Move pmc_speculative_in_use() to arch/x86/kvm/pmu.h
  KVM: x86/pmu: Disable guest PEBS before vm-entry in two cases
  KVM: x86/pmu: Add kvm_pmu_cap to optimize perf_get_x86_pmu_capability
  KVM: x86/cpuid: Refactor host/guest CPU model consistency check
  KVM: x86/pmu: Expose CPUIDs feature bits PDCM, DS, DTES64

 arch/x86/events/core.c| 5 +-
 arch/x86/events/intel/core.c | 93 +++---
 arch/x86/events/perf_
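To make the fast path / slow path distinction in this cover letter more concrete, here is a small user-space sketch. It is not code from the series: NR_PMC, PMC_UNUSED, is_fast_path() and remap_applicable_counters() are hypothetical names, and the virt_to_phys[] table is an assumed representation of which physical PMC backs each virtual PMC.

```c
#include <assert.h>
#include <stdint.h>

#define NR_PMC     8	/* assumed number of counters for this sketch */
#define PMC_UNUSED (-1)	/* virtual PMC with no physical counter assigned */

/*
 * The fast path applies when every in-use virtual PMC is backed by the
 * physical PMC with the same index.
 */
int is_fast_path(const int virt_to_phys[NR_PMC])
{
	int v;

	for (v = 0; v < NR_PMC; v++)
		if (virt_to_phys[v] != PMC_UNUSED && virt_to_phys[v] != v)
			return 0;
	return 1;
}

/*
 * Slow-path idea: translate a bitmap of physical counter indexes (as the
 * hardware would report them) into the matching virtual counter indexes.
 */
uint64_t remap_applicable_counters(uint64_t phys_bits,
				   const int virt_to_phys[NR_PMC])
{
	uint64_t virt_bits = 0;
	int v;

	for (v = 0; v < NR_PMC; v++) {
		int p = virt_to_phys[v];

		assert(p == PMC_UNUSED || (p >= 0 && p < 64));
		if (p != PMC_UNUSED && (phys_bits & (1ull << p)))
			virt_bits |= 1ull << v;
	}
	return virt_bits;
}
```

With the cover letter's slow-path example (physical PMC1 emulating virtual PMC0), a physical bitmap of 0x2 is translated to the virtual bitmap 0x1. When the indexes are identical the bitmap passes through unchanged, which is why the fast path needs no record rewriting.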
[PATCH v4 03/16] perf/x86/core: Pass "struct kvm_pmu *" to determine the guest values
Splitting the logic for determining the guest values is unnecessarily confusing, and potentially fragile. Perf should have full knowledge and control of what values are loaded for the guest. If we change .guest_get_msrs() to take a struct kvm_pmu pointer, then it can generate the full set of guest values by grabbing guest ds_area and pebs_data_cfg. Alternatively, .guest_get_msrs() could take the desired guest MSR values directly (ds_area and pebs_data_cfg), but kvm_pmu is vendor agnostic, so we don't see any reason to not just pass the pointer. Suggested-by: Sean Christopherson Signed-off-by: Like Xu --- arch/x86/events/core.c| 4 ++-- arch/x86/events/intel/core.c | 4 ++-- arch/x86/events/perf_event.h | 2 +- arch/x86/include/asm/perf_event.h | 4 ++-- arch/x86/kvm/vmx/vmx.c| 3 ++- 5 files changed, 9 insertions(+), 8 deletions(-) diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c index 06bef6ba8a9b..7e2264a8c3f7 100644 --- a/arch/x86/events/core.c +++ b/arch/x86/events/core.c @@ -673,9 +673,9 @@ void x86_pmu_disable_all(void) } } -struct perf_guest_switch_msr *perf_guest_get_msrs(int *nr) +struct perf_guest_switch_msr *perf_guest_get_msrs(int *nr, void *data) { - return static_call(x86_pmu_guest_get_msrs)(nr); + return static_call(x86_pmu_guest_get_msrs)(nr, data); } EXPORT_SYMBOL_GPL(perf_guest_get_msrs); diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c index af9ac48fe840..e8fee7cf767f 100644 --- a/arch/x86/events/intel/core.c +++ b/arch/x86/events/intel/core.c @@ -3837,7 +3837,7 @@ static int intel_pmu_hw_config(struct perf_event *event) return 0; } -static struct perf_guest_switch_msr *intel_guest_get_msrs(int *nr) +static struct perf_guest_switch_msr *intel_guest_get_msrs(int *nr, void *data) { struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events); struct perf_guest_switch_msr *arr = cpuc->guest_switch_msrs; @@ -3869,7 +3869,7 @@ static struct perf_guest_switch_msr *intel_guest_get_msrs(int *nr) return arr; } -static struct 
perf_guest_switch_msr *core_guest_get_msrs(int *nr) +static struct perf_guest_switch_msr *core_guest_get_msrs(int *nr, void *data) { struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events); struct perf_guest_switch_msr *arr = cpuc->guest_switch_msrs; diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h index 85dc4e1d4514..e52b35333e1f 100644 --- a/arch/x86/events/perf_event.h +++ b/arch/x86/events/perf_event.h @@ -809,7 +809,7 @@ struct x86_pmu { /* * Intel host/guest support (KVM) */ - struct perf_guest_switch_msr *(*guest_get_msrs)(int *nr); + struct perf_guest_switch_msr *(*guest_get_msrs)(int *nr, void *data); /* * Check period value for PERF_EVENT_IOC_PERIOD ioctl. diff --git a/arch/x86/include/asm/perf_event.h b/arch/x86/include/asm/perf_event.h index 6a6e707905be..d5957b68906b 100644 --- a/arch/x86/include/asm/perf_event.h +++ b/arch/x86/include/asm/perf_event.h @@ -491,10 +491,10 @@ static inline void perf_check_microcode(void) { } #endif #if defined(CONFIG_PERF_EVENTS) && defined(CONFIG_CPU_SUP_INTEL) -extern struct perf_guest_switch_msr *perf_guest_get_msrs(int *nr); +extern struct perf_guest_switch_msr *perf_guest_get_msrs(int *nr, void *data); extern int x86_perf_get_lbr(struct x86_pmu_lbr *lbr); #else -struct perf_guest_switch_msr *perf_guest_get_msrs(int *nr); +struct perf_guest_switch_msr *perf_guest_get_msrs(int *nr, void *data); static inline int x86_perf_get_lbr(struct x86_pmu_lbr *lbr) { return -1; diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c index c8a4a548e96b..8063cb7e8387 100644 --- a/arch/x86/kvm/vmx/vmx.c +++ b/arch/x86/kvm/vmx/vmx.c @@ -6513,9 +6513,10 @@ static void atomic_switch_perf_msrs(struct vcpu_vmx *vmx) { int i, nr_msrs; struct perf_guest_switch_msr *msrs; + struct kvm_pmu *pmu = vcpu_to_pmu(&vmx->vcpu); /* Note, nr_msrs may be garbage if perf_guest_get_msrs() returns NULL. 
*/ - msrs = perf_guest_get_msrs(&nr_msrs); + msrs = perf_guest_get_msrs(&nr_msrs, (void *)pmu); if (!msrs) return; -- 2.29.2
[PATCH v4 02/16] perf/x86/intel: Handle guest PEBS overflow PMI for KVM guest
With PEBS virtualization, the guest PEBS records are delivered to the guest DS, and the host PMI handler uses perf_guest_cbs->is_in_guest() to determine whether the PMI comes from guest code, as is done for Intel PT. No matter how many guest PEBS counters have overflowed, triggering a single fake event is enough. The fake event causes the KVM PMI callback to be called, thereby injecting the PEBS overflow PMI into the guest. KVM will inject the PMI with BUFFER_OVF set, even if the guest DS is empty. That should really be harmless. Thus the guest PEBS handler would retrieve the correct information from its own PEBS records buffer. Originally-by: Andi Kleen Co-developed-by: Kan Liang Signed-off-by: Kan Liang Signed-off-by: Like Xu --- arch/x86/events/intel/core.c | 45 +++- 1 file changed, 44 insertions(+), 1 deletion(-) diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c index 591d60cc8436..af9ac48fe840 100644 --- a/arch/x86/events/intel/core.c +++ b/arch/x86/events/intel/core.c @@ -2747,6 +2747,46 @@ static void intel_pmu_reset(void) local_irq_restore(flags); } +/* + * We may be running with guest PEBS events created by KVM, and the + * PEBS records are logged into the guest's DS and invisible to host. + * + * In the case of guest PEBS overflow, we only trigger a fake event + * to emulate the PEBS overflow PMI for guest PEBS counters in KVM. + * The guest will then vm-entry and check the guest DS area to read + * the guest PEBS records. + * + * The contents and other behavior of the guest event do not matter. 
+ */ +static int x86_pmu_handle_guest_pebs(struct pt_regs *regs, + struct perf_sample_data *data) +{ + struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events); + u64 guest_pebs_idxs = cpuc->pebs_enabled & ~cpuc->intel_ctrl_host_mask; + struct perf_event *event = NULL; + int bit; + + if (!x86_pmu.pebs_active || !guest_pebs_idxs) + return 0; + + for_each_set_bit(bit, (unsigned long *)&guest_pebs_idxs, + INTEL_PMC_IDX_FIXED + x86_pmu.num_counters_fixed) { + + event = cpuc->events[bit]; + if (!event->attr.precise_ip) + continue; + + perf_sample_data_init(data, 0, event->hw.last_period); + if (perf_event_overflow(event, data, regs)) + x86_pmu_stop(event, 0); + + /* Inject one fake event is enough. */ + return 1; + } + + return 0; +} + static int handle_pmi_common(struct pt_regs *regs, u64 status) { struct perf_sample_data data; @@ -2797,7 +2837,10 @@ static int handle_pmi_common(struct pt_regs *regs, u64 status) u64 pebs_enabled = cpuc->pebs_enabled; handled++; - x86_pmu.drain_pebs(regs, &data); + if (x86_pmu.pebs_vmx && perf_guest_cbs && perf_guest_cbs->is_in_guest()) + x86_pmu_handle_guest_pebs(regs, &data); + else + x86_pmu.drain_pebs(regs, &data); status &= x86_pmu.intel_ctrl | GLOBAL_STATUS_TRACE_TOPAPMI; /* -- 2.29.2
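The selection logic in x86_pmu_handle_guest_pebs() from patch 02/16 can be approximated in user space as below. This is only a sketch under stated assumptions: pick_guest_pebs_counter() is a hypothetical name, the precise_ip[] array stands in for cpuc->events[bit]->attr.precise_ip, and the perf_event_overflow() call is reduced to returning the chosen index.

```c
#include <assert.h>
#include <stdint.h>

#define NR_IDX 64	/* one slot per possible counter index */

/*
 * Hypothetical model: return the index of the counter that would receive
 * the single fake overflow event, or -1 if no guest PEBS counter
 * qualifies. Only the first qualifying counter is used, matching the
 * "one fake event is enough" comment in the patch.
 */
int pick_guest_pebs_counter(uint64_t pebs_enabled, uint64_t host_mask,
			    const unsigned char precise_ip[NR_IDX],
			    unsigned int nr_bits)
{
	/* PEBS-enabled counters that are not host-only. */
	uint64_t guest_pebs_idxs = pebs_enabled & ~host_mask;
	unsigned int bit;

	assert(nr_bits <= NR_IDX);

	for (bit = 0; bit < nr_bits; bit++) {
		if (!(guest_pebs_idxs & (1ull << bit)))
			continue;
		if (!precise_ip[bit])
			continue;	/* not a precise event, skip it */
		return (int)bit;
	}
	return -1;
}
```

In the patch the loop bound is INTEL_PMC_IDX_FIXED + x86_pmu.num_counters_fixed. A caller of this sketch would pass the equivalent value as nr_bits, for example 35 for three fixed counters starting at index 32.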
[PATCH v4 01/16] perf/x86/intel: Add x86_pmu.pebs_vmx for Ice Lake Servers
The new hardware facility supporting guest PEBS is only available on Intel Ice Lake Server platforms for now. KVM will check this field through perf_get_x86_pmu_capability() instead of hard coding the cpu models in the KVM code. If it is supported, the guest PEBS capability will be exposed to the guest. Signed-off-by: Like Xu --- arch/x86/events/core.c| 1 + arch/x86/events/intel/core.c | 1 + arch/x86/events/perf_event.h | 3 ++- arch/x86/include/asm/perf_event.h | 1 + 4 files changed, 5 insertions(+), 1 deletion(-) diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c index 18df17129695..06bef6ba8a9b 100644 --- a/arch/x86/events/core.c +++ b/arch/x86/events/core.c @@ -2776,5 +2776,6 @@ void perf_get_x86_pmu_capability(struct x86_pmu_capability *cap) cap->bit_width_fixed = x86_pmu.cntval_bits; cap->events_mask = (unsigned int)x86_pmu.events_maskl; cap->events_mask_len = x86_pmu.events_mask_len; + cap->pebs_vmx = x86_pmu.pebs_vmx; } EXPORT_SYMBOL_GPL(perf_get_x86_pmu_capability); diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c index 7bbb5bb98d8c..591d60cc8436 100644 --- a/arch/x86/events/intel/core.c +++ b/arch/x86/events/intel/core.c @@ -5574,6 +5574,7 @@ __init int intel_pmu_init(void) case INTEL_FAM6_ICELAKE_X: case INTEL_FAM6_ICELAKE_D: + x86_pmu.pebs_vmx = 1; pmem = true; fallthrough; case INTEL_FAM6_ICELAKE_L: diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h index 53b2b5fc23bc..85dc4e1d4514 100644 --- a/arch/x86/events/perf_event.h +++ b/arch/x86/events/perf_event.h @@ -729,7 +729,8 @@ struct x86_pmu { pebs_prec_dist :1, pebs_no_tlb :1, pebs_no_isolation :1, - pebs_block :1; + pebs_block :1, + pebs_vmx:1; int pebs_record_size; int pebs_buffer_size; int max_pebs_events; diff --git a/arch/x86/include/asm/perf_event.h b/arch/x86/include/asm/perf_event.h index 544f41a179fb..6a6e707905be 100644 --- a/arch/x86/include/asm/perf_event.h +++ b/arch/x86/include/asm/perf_event.h @@ -192,6 +192,7 @@ struct 
x86_pmu_capability { int bit_width_fixed; unsigned int events_mask; int events_mask_len; + unsigned int pebs_vmx:1; }; /* -- 2.29.2
[PATCH v5 4/5] perf/x86/lbr: Move cpuc->lbr_xsave allocation out of sleeping region
If the kernel is compiled with the CONFIG_LOCKDEP option, the conditional might_sleep_if() deep in kmem_cache_alloc() will generate the following trace, and potentially cause a deadlock when another LBR event is added: [ 243.115549] BUG: sleeping function called from invalid context at include/linux/sched/mm.h:196 [ 243.117576] in_atomic(): 1, irqs_disabled(): 1, non_block: 0, pid: 839, name: perf [ 243.119326] INFO: lockdep is turned off. [ 243.120249] irq event stamp: 0 [ 243.120967] hardirqs last enabled at (0): [<>] 0x0 [ 243.122415] hardirqs last disabled at (0): [] copy_process+0xa45/0x1dc0 [ 243.124302] softirqs last enabled at (0): [] copy_process+0xa45/0x1dc0 [ 243.126255] softirqs last disabled at (0): [<>] 0x0 [ 243.128119] CPU: 0 PID: 839 Comm: perf Not tainted 5.11.0-rc4-guest+ #8 [ 243.129654] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 0.0.0 02/06/2015 [ 243.131520] Call Trace: [ 243.132112] dump_stack+0x8d/0xb5 [ 243.132896] ___might_sleep.cold.106+0xb3/0xc3 [ 243.133984] slab_pre_alloc_hook.constprop.85+0x96/0xd0 [ 243.135208] ? intel_pmu_lbr_add+0x152/0x170 [ 243.136207] kmem_cache_alloc+0x36/0x250 [ 243.137126] intel_pmu_lbr_add+0x152/0x170 [ 243.138088] x86_pmu_add+0x83/0xd0 [ 243.138889] ? lock_acquire+0x158/0x350 [ 243.139791] ? lock_acquire+0x158/0x350 [ 243.140694] ? lock_acquire+0x158/0x350 [ 243.141625] ? lock_acquired+0x1e3/0x360 [ 243.142544] ? lock_release+0x1bf/0x340 [ 243.143726] ? trace_hardirqs_on+0x1a/0xd0 [ 243.144823] ? lock_acquired+0x1e3/0x360 [ 243.145742] ? lock_release+0x1bf/0x340 [ 243.147107] ? __slab_free+0x49/0x540 [ 243.147966] ? trace_hardirqs_on+0x1a/0xd0 [ 243.148924] event_sched_in.isra.129+0xf8/0x2a0 [ 243.149989] merge_sched_in+0x261/0x3e0 [ 243.150889] ? trace_hardirqs_on+0x1a/0xd0 [ 243.151869] visit_groups_merge.constprop.135+0x130/0x4a0 [ 243.153122] ? 
sched_clock_cpu+0xc/0xb0 [ 243.154023] ctx_sched_in+0x101/0x210 [ 243.154884] ctx_resched+0x6f/0xc0 [ 243.155686] perf_event_exec+0x21e/0x2e0 [ 243.156641] begin_new_exec+0x5e5/0xbd0 [ 243.157540] load_elf_binary+0x6af/0x1770 [ 243.158478] ? __kernel_read+0x19d/0x2b0 [ 243.159977] ? lock_acquire+0x158/0x350 [ 243.160876] ? __kernel_read+0x19d/0x2b0 [ 243.161796] bprm_execve+0x3c8/0x840 [ 243.162638] do_execveat_common.isra.38+0x1a5/0x1c0 [ 243.163776] __x64_sys_execve+0x32/0x40 [ 243.164676] do_syscall_64+0x33/0x40 [ 243.165514] entry_SYSCALL_64_after_hwframe+0x44/0xa9 [ 243.166746] RIP: 0033:0x7f6180a26feb [ 243.167590] Code: Unable to access opcode bytes at RIP 0x7f6180a26fc1. [ 243.169097] RSP: 002b:7ffc6558ce18 EFLAGS: 0202 ORIG_RAX: 003b [ 243.170844] RAX: ffda RBX: 7ffc65592d30 RCX: 7f6180a26feb [ 243.172514] RDX: 55657f408dc0 RSI: 7ffc65592410 RDI: 7ffc65592d30 [ 243.174162] RBP: 7ffc6558ce80 R08: 7ffc6558cde0 R09: [ 243.176042] R10: 0008 R11: 0202 R12: 7ffc65592410 [ 243.177696] R13: 55657f408dc0 R14: 0001 R15: 7ffc65592410 One solution is to use GFP_ATOMIC, but that would make the code less reliable under memory pressure. Let's move the memory allocation out of the sleeping region and put it into x86_reserve_hardware(). The LBR xsave buffer is a per-CPU buffer, not a per-event buffer. This buffer is allocated once when initializing the first event. The disadvantage of this fix is that the cpuc->lbr_xsave memory will be allocated for each CPU, like the legacy ds_buffer. 
Fixes: c085fb8774 ("perf/x86/intel/lbr: Support XSAVES for arch LBR read") Suggested-by: Kan Liang Tested-by: Kan Liang Signed-off-by: Like Xu --- arch/x86/events/core.c | 8 +--- arch/x86/events/intel/bts.c | 2 +- arch/x86/events/intel/lbr.c | 23 +-- arch/x86/events/perf_event.h | 8 +++- 4 files changed, 30 insertions(+), 11 deletions(-) diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c index 18df17129695..a4ce669cc78d 100644 --- a/arch/x86/events/core.c +++ b/arch/x86/events/core.c @@ -373,7 +373,7 @@ set_ext_hw_attr(struct hw_perf_event *hwc, struct perf_event *event) return x86_pmu_extra_regs(val, event); } -int x86_reserve_hardware(void) +int x86_reserve_hardware(struct perf_event *event) { int err = 0; @@ -382,8 +382,10 @@ int x86_reserve_hardware(void) if (atomic_read(&pmc_refcount) == 0) { if (!reserve_pmc_hardware()) err = -EBUSY; - else + else { reserve_ds_buffers(); + reserve_lbr_buffers(event); + } } if (!err) atomic_inc(&pmc_refcount); @@ -634,7 +636,7 @@ static int __x86_p
[PATCH v5 3/5] perf/x86: Skip checking MSR for MSR 0x000
The Architecture LBR does not have MSR_LBR_TOS (0x01c9). With Arch LBR, lbr_tos is not set, so check_msr() ends up probing MSR 0x000. That probe fails and sets x86_pmu.lbr_nr = 0, thereby preventing the initialization of the guest LBR. Fixes: 47125db27e47 ("perf/x86/intel/lbr: Support Architectural LBR") Signed-off-by: Like Xu Reviewed-by: Kan Liang --- arch/x86/events/intel/core.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c index 382dd3994463..564c9851dd34 100644 --- a/arch/x86/events/intel/core.c +++ b/arch/x86/events/intel/core.c @@ -4593,10 +4593,10 @@ static bool check_msr(unsigned long msr, u64 mask) u64 val_old, val_new, val_tmp; /* -* Disable the check for real HW, so we don't +* Disable the check for real HW or non-sense msr, so we don't * mess with potentionaly enabled registers: */ - if (!boot_cpu_has(X86_FEATURE_HYPERVISOR)) + if (!boot_cpu_has(X86_FEATURE_HYPERVISOR) || !msr) return true; /* -- 2.29.2
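The effect of the early return added by patch 3/5 can be modelled in user space as below. This is a simplified sketch and not the kernel function: check_msr_sketch(), probe_always_fails() and lbr_nr_after_init() are hypothetical names, and the probe callback merely stands in for the rdmsr/wrmsr test that the real check_msr() performs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Stand-in for the MSR read/write probing done by the real check_msr(). */
typedef bool (*msr_probe_fn)(unsigned long msr);

/*
 * Skip the probe on bare metal, and (with this patch) also when the MSR
 * address is 0, which is what an unset lbr_tos looks like.
 */
bool check_msr_sketch(bool on_hypervisor, unsigned long msr,
		      msr_probe_fn probe)
{
	assert(probe != NULL);

	if (!on_hypervisor || !msr)
		return true;

	return probe(msr);
}

/* A probe that always fails, as if the access were rejected. */
bool probe_always_fails(unsigned long msr)
{
	(void)msr;
	return false;
}

/* Mirrors the caller's pattern: a failed check disables LBR. */
unsigned int lbr_nr_after_init(unsigned int lbr_nr, unsigned long lbr_tos,
			       bool on_hypervisor)
{
	if (lbr_nr && !check_msr_sketch(on_hypervisor, lbr_tos,
					probe_always_fails))
		lbr_nr = 0;
	return lbr_nr;
}
```

In this model a guest with lbr_tos == 0 keeps its lbr_nr even though the probe would fail, while a guest with lbr_tos == 0x1c9 and a failing probe still ends up with lbr_nr == 0.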
[PATCH v5 2/5] perf/x86/lbr: Simplify the exposure check for the LBR_INFO registers
The x86_pmu.lbr_info is 0 unless explicitly initialized, so there's no point checking x86_pmu.intel_cap.lbr_format. Signed-off-by: Like Xu Reviewed-by: Kan Liang Reviewed-by: Andi Kleen --- arch/x86/events/intel/lbr.c | 4 +--- 1 file changed, 1 insertion(+), 3 deletions(-) diff --git a/arch/x86/events/intel/lbr.c b/arch/x86/events/intel/lbr.c index 21890dacfcfe..355ea70f1879 100644 --- a/arch/x86/events/intel/lbr.c +++ b/arch/x86/events/intel/lbr.c @@ -1832,12 +1832,10 @@ void __init intel_pmu_arch_lbr_init(void) */ int x86_perf_get_lbr(struct x86_pmu_lbr *lbr) { - int lbr_fmt = x86_pmu.intel_cap.lbr_format; - lbr->nr = x86_pmu.lbr_nr; lbr->from = x86_pmu.lbr_from; lbr->to = x86_pmu.lbr_to; - lbr->info = (lbr_fmt == LBR_FORMAT_INFO) ? x86_pmu.lbr_info : 0; + lbr->info = x86_pmu.lbr_info; return 0; } -- 2.29.2
[PATCH v5 5/5] perf/x86: Move ARCH_LBR_CTL_MASK definition to include/asm/msr-index.h
The ARCH_LBR_CTL_MASK will be reused for LBR emulation in the KVM. Signed-off-by: Like Xu --- arch/x86/events/intel/lbr.c | 2 -- arch/x86/include/asm/msr-index.h | 1 + 2 files changed, 1 insertion(+), 2 deletions(-) diff --git a/arch/x86/events/intel/lbr.c b/arch/x86/events/intel/lbr.c index 6df9a802613f..70afda9d4878 100644 --- a/arch/x86/events/intel/lbr.c +++ b/arch/x86/events/intel/lbr.c @@ -168,8 +168,6 @@ enum { ARCH_LBR_RETURN|\ ARCH_LBR_OTHER_BRANCH) -#define ARCH_LBR_CTL_MASK 0x7f000e - static void intel_pmu_lbr_filter(struct cpu_hw_events *cpuc); static __always_inline bool is_lbr_call_stack_bit_set(u64 config) diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h index 546d6ecf0a35..8f3375961efc 100644 --- a/arch/x86/include/asm/msr-index.h +++ b/arch/x86/include/asm/msr-index.h @@ -169,6 +169,7 @@ #define LBR_INFO_BR_TYPE (0xfull << LBR_INFO_BR_TYPE_OFFSET) #define MSR_ARCH_LBR_CTL 0x14ce +#define ARCH_LBR_CTL_MASK 0x7f000e #define ARCH_LBR_CTL_LBREN BIT(0) #define ARCH_LBR_CTL_CPL_OFFSET1 #define ARCH_LBR_CTL_CPL (0x3ull << ARCH_LBR_CTL_CPL_OFFSET) -- 2.29.2
[PATCH v5 0/5] perf/x86: Some minor changes to support guest Arch LBR
Hi Peter, Please help review these minor perf/x86 changes in this patch set, and we need some of them to support Guest Architectural LBR in KVM. This version keeps reserve_lbr_buffers() as is because the LBR xsave buffer is a per-CPU buffer, not a per-event buffer. We only need to allocate the buffer once when initializing the first event. If you are interested in the KVM emulation, please check https://lore.kernel.org/kvm/20210314155225.206661-1-like...@linux.intel.com/ Please check more details in each commit and feel free to comment. Previous: https://lore.kernel.org/lkml/20210322060635.821531-1-like...@linux.intel.com/ v4->v5 Changelog: - Add "Tested-by: Kan Liang" - Make the commit message simpler - Make check_msr() to ignore msr==0 - Use kmem_cache_alloc_node() [Namhyung] Like Xu (5): perf/x86/intel: Fix the comment about guest LBR support on KVM perf/x86/lbr: Simplify the exposure check for the LBR_INFO registers perf/x86: Skip checking MSR for MSR 0x000 perf/x86/lbr: Move cpuc->lbr_xsave allocation out of sleeping region perf/x86: Move ARCH_LBR_CTL_MASK definition to include/asm/msr-index.h arch/x86/events/core.c | 8 +--- arch/x86/events/intel/bts.c | 2 +- arch/x86/events/intel/core.c | 7 +++ arch/x86/events/intel/lbr.c | 29 ++--- arch/x86/events/perf_event.h | 8 +++- arch/x86/include/asm/msr-index.h | 1 + 6 files changed, 35 insertions(+), 20 deletions(-) -- 2.29.2
[PATCH v5 1/5] perf/x86/intel: Fix the comment about guest LBR support on KVM
Starting from v5.12, KVM reports guest LBR and extra_regs support when the host has relevant support. Just delete this part of the comment and fix a typo incidentally. Signed-off-by: Like Xu Reviewed-by: Kan Liang Reviewed-by: Andi Kleen --- arch/x86/events/intel/core.c | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c index 37ce38403cb8..382dd3994463 100644 --- a/arch/x86/events/intel/core.c +++ b/arch/x86/events/intel/core.c @@ -5737,8 +5737,7 @@ __init int intel_pmu_init(void) /* * Access LBR MSR may cause #GP under certain circumstances. -* E.g. KVM doesn't support LBR MSR -* Check all LBT MSR here. +* Check all LBR MSR here. * Disable LBR access if any LBR MSRs can not be accessed. */ if (x86_pmu.lbr_nr && !check_msr(x86_pmu.lbr_tos, 0x3UL)) -- 2.29.2
Re: [PATCH v4 RESEND 3/5] perf/x86/lbr: Move cpuc->lbr_xsave allocation out of sleeping region
On 2021/3/24 12:04, Namhyung Kim wrote: On Wed, Mar 24, 2021 at 12:47 PM Like Xu wrote: Hi Namhyung, On 2021/3/24 9:32, Namhyung Kim wrote: Hello, On Mon, Mar 22, 2021 at 3:14 PM Like Xu wrote: +void reserve_lbr_buffers(struct perf_event *event) +{ + struct kmem_cache *kmem_cache = x86_get_pmu()->task_ctx_cache; + struct cpu_hw_events *cpuc; + int cpu; + + if (!static_cpu_has(X86_FEATURE_ARCH_LBR)) + return; + + for_each_possible_cpu(cpu) { + cpuc = per_cpu_ptr(&cpu_hw_events, cpu); + if (kmem_cache && !cpuc->lbr_xsave && !event->attr.precise_ip) + cpuc->lbr_xsave = kmem_cache_alloc(kmem_cache, GFP_KERNEL); + } +} I think we should use kmem_cache_alloc_node(). "kmem_cache_alloc_node - Allocate an object on the specified node" The reserve_lbr_buffers() is called in __x86_pmu_event_init(). When the LBR perf_event is scheduled to another node, it seems that we will not call init() and allocate again. Do you mean use kmem_cache_alloc_node() for each numa_nodes_parsed ? I assume cpuc->lbr_xsave will be accessed for that cpu only. Then it needs to allocate it in the node that cpu belongs to. Something like below.. cpuc->lbr_xsave = kmem_cache_alloc_node(kmem_cache, GFP_KERNEL, cpu_to_node(cpu)); Thanks, it helps and I will apply it in the next version. Thanks, Namhyung
Re: [PATCH v4 RESEND 3/5] perf/x86/lbr: Move cpuc->lbr_xsave allocation out of sleeping region
Hi Namhyung, On 2021/3/24 9:32, Namhyung Kim wrote: Hello, On Mon, Mar 22, 2021 at 3:14 PM Like Xu wrote: +void reserve_lbr_buffers(struct perf_event *event) +{ + struct kmem_cache *kmem_cache = x86_get_pmu()->task_ctx_cache; + struct cpu_hw_events *cpuc; + int cpu; + + if (!static_cpu_has(X86_FEATURE_ARCH_LBR)) + return; + + for_each_possible_cpu(cpu) { + cpuc = per_cpu_ptr(&cpu_hw_events, cpu); + if (kmem_cache && !cpuc->lbr_xsave && !event->attr.precise_ip) + cpuc->lbr_xsave = kmem_cache_alloc(kmem_cache, GFP_KERNEL); + } +} I think we should use kmem_cache_alloc_node(). "kmem_cache_alloc_node - Allocate an object on the specified node" The reserve_lbr_buffers() is called in __x86_pmu_event_init(). When the LBR perf_event is scheduled to another node, it seems that we will not call init() and allocate again. Do you mean use kmem_cache_alloc_node() for each numa_nodes_parsed ? Thanks, Namhyung
Re: [PATCH v4 RESEND 4/5] perf/x86/lbr: Skip checking for the existence of LBR_TOS for Arch LBR
On 2021/3/24 5:49, Peter Zijlstra wrote: On Mon, Mar 22, 2021 at 02:06:34PM +0800, Like Xu wrote: The Architecture LBR does not have MSR_LBR_TOS (0x01c9). KVM will generate #GP for this MSR access, thereby preventing the initialization of the guest LBR. Fixes: 47125db27e47 ("perf/x86/intel/lbr: Support Architectural LBR") Signed-off-by: Like Xu Reviewed-by: Kan Liang Reviewed-by: Andi Kleen --- arch/x86/events/intel/core.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c index 382dd3994463..7f6d748421f2 100644 --- a/arch/x86/events/intel/core.c +++ b/arch/x86/events/intel/core.c @@ -5740,7 +5740,8 @@ __init int intel_pmu_init(void) * Check all LBR MSR here. * Disable LBR access if any LBR MSRs can not be accessed. */ - if (x86_pmu.lbr_nr && !check_msr(x86_pmu.lbr_tos, 0x3UL)) + if (x86_pmu.lbr_nr && !boot_cpu_has(X86_FEATURE_ARCH_LBR) && + !check_msr(x86_pmu.lbr_tos, 0x3UL)) x86_pmu.lbr_nr = 0; But when ARCH_LBR we don't set lbr_tos, so we check MSR 0x000, not 0x1c9. It's true. Do we want check_msr() to ignore msr==0 ? Considering another target of check_msr() is for uncore msrs, how about this change: diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c index 759226919a36..06fa31a01a5b 100644 --- a/arch/x86/events/intel/core.c +++ b/arch/x86/events/intel/core.c @@ -4704,10 +4704,10 @@ static bool check_msr(unsigned long msr, u64 mask) u64 val_old, val_new, val_tmp; /* -* Disable the check for real HW, so we don't +* Disable the check for real HW or non-sense msr, so we don't * mess with potentionaly enabled registers: */ - if (!boot_cpu_has(X86_FEATURE_HYPERVISOR)) + if (!boot_cpu_has(X86_FEATURE_HYPERVISOR) || !msr) return true; /* Additionally, do we want a check for lbr_info ? I am not inclined to do this because we may have virtualized model-specific guest LBR support which may break the cpu_model assumption. 
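To compare the two approaches discussed above, here is a small userspace model of the LBR_TOS probe. It is a sketch under stated assumptions, not the kernel implementation: struct model_pmu and model_msr_accessible() are hypothetical, and the model simply assumes that probing MSR 0 in a guest fails while MSR_LBR_TOS (0x1c9) succeeds.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical platform description used only by this sketch. */
struct model_pmu {
	bool hypervisor;	/* models X86_FEATURE_HYPERVISOR */
	bool arch_lbr;		/* models X86_FEATURE_ARCH_LBR */
	unsigned int lbr_nr;	/* number of LBR entries */
	uint32_t lbr_tos;	/* stays 0 when Arch LBR is in use */
};

/* Assumption: in this model only MSR_LBR_TOS (0x1c9) can be accessed. */
static bool model_msr_accessible(uint32_t msr)
{
	return msr == 0x1c9;
}

/* check_msr() model; ignore_zero selects the proposed "|| !msr" exit. */
static bool model_check_msr(const struct model_pmu *pmu, uint32_t msr,
			    bool ignore_zero)
{
	if (!pmu->hypervisor || (ignore_zero && !msr))
		return true;
	return model_msr_accessible(msr);
}

/* Before the patch: lbr_tos is probed unconditionally. */
static unsigned int lbr_nr_original(const struct model_pmu *pmu)
{
	if (pmu->lbr_nr && !model_check_msr(pmu, pmu->lbr_tos, false))
		return 0;
	return pmu->lbr_nr;
}

/* Posted patch: the caller skips the probe when Arch LBR is present. */
static unsigned int lbr_nr_posted(const struct model_pmu *pmu)
{
	if (pmu->lbr_nr && !pmu->arch_lbr &&
	    !model_check_msr(pmu, pmu->lbr_tos, false))
		return 0;
	return pmu->lbr_nr;
}

/* Follow-up proposal: caller unchanged, check_msr() ignores msr == 0. */
static unsigned int lbr_nr_proposed(const struct model_pmu *pmu)
{
	if (pmu->lbr_nr && !model_check_msr(pmu, pmu->lbr_tos, true))
		return 0;
	return pmu->lbr_nr;
}
```

Under these assumptions both fixes keep lbr_nr intact for an Arch LBR guest, and neither changes the result for a legacy LBR guest.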
for (i = 0; i < x86_pmu.lbr_nr; i++) { if (!(check_msr(x86_pmu.lbr_from + i, 0xUL) && -- 2.29.2
Re: [PATCH v4 RESEND 2/5] perf/x86/lbr: Simplify the exposure check for the LBR_INFO registers
On 2021/3/24 5:38, Peter Zijlstra wrote: On Mon, Mar 22, 2021 at 02:06:32PM +0800, Like Xu wrote: If the platform supports LBR_INFO register, the x86_pmu.lbr_info will be assigned in intel_pmu_?_lbr_init_?() and it's safe to expose LBR_INFO You mean: intel_pmu_lbr_*init*(). '?' is a single character glob and you've got too many '_'s. in the x86_perf_get_lbr() directly, instead of relying on lbr_format check. But, afaict, not every model calls one of those. CORE_YONAH for example doesn't. Also Architectural LBR has IA32_LBR_x_INFO instead of LBR_FORMAT_INFO_x to hold metadata for the operation, including mispredict, TSX, and elapsed cycle time information. Relevance? Wouldn't it be much simpler to simple say something like: "x86_pmu.lbr_info is 0 unless explicitly initialized, so there's no point checking lbr_fmt" Yes, it is simpler and I will apply it in the next version. Signed-off-by: Like Xu Reviewed-by: Kan Liang Reviewed-by: Andi Kleen --- arch/x86/events/intel/lbr.c | 4 +--- 1 file changed, 1 insertion(+), 3 deletions(-) diff --git a/arch/x86/events/intel/lbr.c b/arch/x86/events/intel/lbr.c index 21890dacfcfe..355ea70f1879 100644 --- a/arch/x86/events/intel/lbr.c +++ b/arch/x86/events/intel/lbr.c @@ -1832,12 +1832,10 @@ void __init intel_pmu_arch_lbr_init(void) */ int x86_perf_get_lbr(struct x86_pmu_lbr *lbr) { - int lbr_fmt = x86_pmu.intel_cap.lbr_format; - lbr->nr = x86_pmu.lbr_nr; lbr->from = x86_pmu.lbr_from; lbr->to = x86_pmu.lbr_to; - lbr->info = (lbr_fmt == LBR_FORMAT_INFO) ? x86_pmu.lbr_info : 0; + lbr->info = x86_pmu.lbr_info; return 0; } -- 2.29.2
[PATCH v4 RESEND 5/5] perf/x86: Move ARCH_LBR_CTL_MASK definition to include/asm/msr-index.h
The ARCH_LBR_CTL_MASK will be reused for Arch LBR emulation in KVM. Signed-off-by: Like Xu --- arch/x86/events/intel/lbr.c | 2 -- arch/x86/include/asm/msr-index.h | 1 + 2 files changed, 1 insertion(+), 2 deletions(-) diff --git a/arch/x86/events/intel/lbr.c b/arch/x86/events/intel/lbr.c index 237876733e12..f60339ff0c13 100644 --- a/arch/x86/events/intel/lbr.c +++ b/arch/x86/events/intel/lbr.c @@ -168,8 +168,6 @@ enum { ARCH_LBR_RETURN|\ ARCH_LBR_OTHER_BRANCH) -#define ARCH_LBR_CTL_MASK 0x7f000e - static void intel_pmu_lbr_filter(struct cpu_hw_events *cpuc); static __always_inline bool is_lbr_call_stack_bit_set(u64 config) diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h index 546d6ecf0a35..8f3375961efc 100644 --- a/arch/x86/include/asm/msr-index.h +++ b/arch/x86/include/asm/msr-index.h @@ -169,6 +169,7 @@ #define LBR_INFO_BR_TYPE (0xfull << LBR_INFO_BR_TYPE_OFFSET) #define MSR_ARCH_LBR_CTL 0x14ce +#define ARCH_LBR_CTL_MASK 0x7f000e #define ARCH_LBR_CTL_LBREN BIT(0) #define ARCH_LBR_CTL_CPL_OFFSET 1 #define ARCH_LBR_CTL_CPL (0x3ull << ARCH_LBR_CTL_CPL_OFFSET) -- 2.29.2
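As a worked check of where 0x7f000e comes from, the sketch below rebuilds the mask from individual MSR_ARCH_LBR_CTL fields. The LBREN and CPL definitions are taken from the hunk above. The STACK and FILTER offsets are not shown in this patch and are written here from memory of msr-index.h, so treat them as assumptions. The two helper functions are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define ARCH_LBR_CTL_LBREN		(1ull << 0)
#define ARCH_LBR_CTL_CPL_OFFSET		1
#define ARCH_LBR_CTL_CPL		(0x3ull << ARCH_LBR_CTL_CPL_OFFSET)
/* Assumed field layout (not part of the quoted hunk): */
#define ARCH_LBR_CTL_STACK_OFFSET	3
#define ARCH_LBR_CTL_STACK		(0x1ull << ARCH_LBR_CTL_STACK_OFFSET)
#define ARCH_LBR_CTL_FILTER_OFFSET	16
#define ARCH_LBR_CTL_FILTER		(0x7full << ARCH_LBR_CTL_FILTER_OFFSET)

#define ARCH_LBR_CTL_MASK		0x7f000eull

/* Hypothetical helper: every configuration field except the enable bit. */
static uint64_t arch_lbr_ctl_mask_from_fields(void)
{
	return ARCH_LBR_CTL_CPL | ARCH_LBR_CTL_STACK | ARCH_LBR_CTL_FILTER;
}

/* Hypothetical helper: true if ctl sets nothing outside mask and LBREN. */
static int arch_lbr_ctl_is_valid(uint64_t ctl)
{
	return (ctl & ~(ARCH_LBR_CTL_MASK | ARCH_LBR_CTL_LBREN)) == 0;
}
```

With that layout, 0x6 (CPL) + 0x8 (call stack) + 0x7f0000 (branch filters) gives 0x7f000e, and adding LBREN gives 0x7f000f, the value the kvm-unit-tests patch in this thread writes to MSR_ARCH_LBR_CTL.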
[PATCH v4 RESEND 0/5] x86: The perf/x86 changes to support guest Arch LBR
Hi Peter, Please help review these minor perf/x86 changes in this patch set, and we need some of them to support Guest Architectural LBR in KVM. If you are interested in the KVM emulation, please check https://lore.kernel.org/kvm/20210314155225.206661-1-like...@linux.intel.com/ Please check more details in each commit and feel free to comment. Like Xu (5): perf/x86/intel: Fix the comment about guest LBR support on KVM perf/x86/lbr: Simplify the exposure check for the LBR_INFO registers perf/x86/lbr: Move cpuc->lbr_xsave allocation out of sleeping region perf/x86/lbr: Skip checking for the existence of LBR_TOS for Arch LBR perf/x86: Move ARCH_LBR_CTL_MASK definition to include/asm/msr-index.h arch/x86/events/core.c | 8 +--- arch/x86/events/intel/bts.c | 2 +- arch/x86/events/intel/core.c | 6 +++--- arch/x86/events/intel/lbr.c | 28 +--- arch/x86/events/perf_event.h | 8 +++- arch/x86/include/asm/msr-index.h | 1 + 6 files changed, 34 insertions(+), 19 deletions(-) -- 2.29.2
[PATCH v4 RESEND 1/5] perf/x86/intel: Fix the comment about guest LBR support on KVM
Starting from v5.12, KVM reports guest LBR and extra_regs support when the host has relevant support. Just delete this part of the comment and fix a typo incidentally. Signed-off-by: Like Xu Reviewed-by: Kan Liang Reviewed-by: Andi Kleen --- arch/x86/events/intel/core.c | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c index 37ce38403cb8..382dd3994463 100644 --- a/arch/x86/events/intel/core.c +++ b/arch/x86/events/intel/core.c @@ -5737,8 +5737,7 @@ __init int intel_pmu_init(void) /* * Access LBR MSR may cause #GP under certain circumstances. -* E.g. KVM doesn't support LBR MSR -* Check all LBT MSR here. +* Check all LBR MSR here. * Disable LBR access if any LBR MSRs can not be accessed. */ if (x86_pmu.lbr_nr && !check_msr(x86_pmu.lbr_tos, 0x3UL)) -- 2.29.2
[PATCH v4 RESEND 4/5] perf/x86/lbr: Skip checking for the existence of LBR_TOS for Arch LBR
The Architecture LBR does not have MSR_LBR_TOS (0x01c9). KVM will generate #GP for this MSR access, thereby preventing the initialization of the guest LBR. Fixes: 47125db27e47 ("perf/x86/intel/lbr: Support Architectural LBR") Signed-off-by: Like Xu Reviewed-by: Kan Liang Reviewed-by: Andi Kleen --- arch/x86/events/intel/core.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c index 382dd3994463..7f6d748421f2 100644 --- a/arch/x86/events/intel/core.c +++ b/arch/x86/events/intel/core.c @@ -5740,7 +5740,8 @@ __init int intel_pmu_init(void) * Check all LBR MSR here. * Disable LBR access if any LBR MSRs can not be accessed. */ - if (x86_pmu.lbr_nr && !check_msr(x86_pmu.lbr_tos, 0x3UL)) + if (x86_pmu.lbr_nr && !boot_cpu_has(X86_FEATURE_ARCH_LBR) && + !check_msr(x86_pmu.lbr_tos, 0x3UL)) x86_pmu.lbr_nr = 0; for (i = 0; i < x86_pmu.lbr_nr; i++) { if (!(check_msr(x86_pmu.lbr_from + i, 0xUL) && -- 2.29.2
[PATCH v4 RESEND 2/5] perf/x86/lbr: Simplify the exposure check for the LBR_INFO registers
If the platform supports LBR_INFO register, the x86_pmu.lbr_info will be assigned in intel_pmu_?_lbr_init_?() and it's safe to expose LBR_INFO in the x86_perf_get_lbr() directly, instead of relying on lbr_format check. Also Architectural LBR has IA32_LBR_x_INFO instead of LBR_FORMAT_INFO_x to hold metadata for the operation, including mispredict, TSX, and elapsed cycle time information. Signed-off-by: Like Xu Reviewed-by: Kan Liang Reviewed-by: Andi Kleen --- arch/x86/events/intel/lbr.c | 4 +--- 1 file changed, 1 insertion(+), 3 deletions(-) diff --git a/arch/x86/events/intel/lbr.c b/arch/x86/events/intel/lbr.c index 21890dacfcfe..355ea70f1879 100644 --- a/arch/x86/events/intel/lbr.c +++ b/arch/x86/events/intel/lbr.c @@ -1832,12 +1832,10 @@ void __init intel_pmu_arch_lbr_init(void) */ int x86_perf_get_lbr(struct x86_pmu_lbr *lbr) { - int lbr_fmt = x86_pmu.intel_cap.lbr_format; - lbr->nr = x86_pmu.lbr_nr; lbr->from = x86_pmu.lbr_from; lbr->to = x86_pmu.lbr_to; - lbr->info = (lbr_fmt == LBR_FORMAT_INFO) ? x86_pmu.lbr_info : 0; + lbr->info = x86_pmu.lbr_info; return 0; } -- 2.29.2
[PATCH v4 RESEND 3/5] perf/x86/lbr: Move cpuc->lbr_xsave allocation out of sleeping region
If the kernel is compiled with the CONFIG_LOCKDEP option, the conditional might_sleep_if() deep in kmem_cache_alloc() will generate the following trace, and potentially cause a deadlock when another LBR event is added: [ 243.115549] BUG: sleeping function called from invalid context at include/linux/sched/mm.h:196 [ 243.117576] in_atomic(): 1, irqs_disabled(): 1, non_block: 0, pid: 839, name: perf [ 243.119326] INFO: lockdep is turned off. [ 243.120249] irq event stamp: 0 [ 243.120967] hardirqs last enabled at (0): [<>] 0x0 [ 243.122415] hardirqs last disabled at (0): [] copy_process+0xa45/0x1dc0 [ 243.124302] softirqs last enabled at (0): [] copy_process+0xa45/0x1dc0 [ 243.126255] softirqs last disabled at (0): [<>] 0x0 [ 243.128119] CPU: 0 PID: 839 Comm: perf Not tainted 5.11.0-rc4-guest+ #8 [ 243.129654] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 0.0.0 02/06/2015 [ 243.131520] Call Trace: [ 243.132112] dump_stack+0x8d/0xb5 [ 243.132896] ___might_sleep.cold.106+0xb3/0xc3 [ 243.133984] slab_pre_alloc_hook.constprop.85+0x96/0xd0 [ 243.135208] ? intel_pmu_lbr_add+0x152/0x170 [ 243.136207] kmem_cache_alloc+0x36/0x250 [ 243.137126] intel_pmu_lbr_add+0x152/0x170 [ 243.138088] x86_pmu_add+0x83/0xd0 [ 243.138889] ? lock_acquire+0x158/0x350 [ 243.139791] ? lock_acquire+0x158/0x350 [ 243.140694] ? lock_acquire+0x158/0x350 [ 243.141625] ? lock_acquired+0x1e3/0x360 [ 243.142544] ? lock_release+0x1bf/0x340 [ 243.143726] ? trace_hardirqs_on+0x1a/0xd0 [ 243.144823] ? lock_acquired+0x1e3/0x360 [ 243.145742] ? lock_release+0x1bf/0x340 [ 243.147107] ? __slab_free+0x49/0x540 [ 243.147966] ? trace_hardirqs_on+0x1a/0xd0 [ 243.148924] event_sched_in.isra.129+0xf8/0x2a0 [ 243.149989] merge_sched_in+0x261/0x3e0 [ 243.150889] ? trace_hardirqs_on+0x1a/0xd0 [ 243.151869] visit_groups_merge.constprop.135+0x130/0x4a0 [ 243.153122] ? 
sched_clock_cpu+0xc/0xb0 [ 243.154023] ctx_sched_in+0x101/0x210 [ 243.154884] ctx_resched+0x6f/0xc0 [ 243.155686] perf_event_exec+0x21e/0x2e0 [ 243.156641] begin_new_exec+0x5e5/0xbd0 [ 243.157540] load_elf_binary+0x6af/0x1770 [ 243.158478] ? __kernel_read+0x19d/0x2b0 [ 243.159977] ? lock_acquire+0x158/0x350 [ 243.160876] ? __kernel_read+0x19d/0x2b0 [ 243.161796] bprm_execve+0x3c8/0x840 [ 243.162638] do_execveat_common.isra.38+0x1a5/0x1c0 [ 243.163776] __x64_sys_execve+0x32/0x40 [ 243.164676] do_syscall_64+0x33/0x40 [ 243.165514] entry_SYSCALL_64_after_hwframe+0x44/0xa9 [ 243.166746] RIP: 0033:0x7f6180a26feb [ 243.167590] Code: Unable to access opcode bytes at RIP 0x7f6180a26fc1. [ 243.169097] RSP: 002b:7ffc6558ce18 EFLAGS: 0202 ORIG_RAX: 003b [ 243.170844] RAX: ffda RBX: 7ffc65592d30 RCX: 7f6180a26feb [ 243.172514] RDX: 55657f408dc0 RSI: 7ffc65592410 RDI: 7ffc65592d30 [ 243.174162] RBP: 7ffc6558ce80 R08: 7ffc6558cde0 R09: [ 243.176042] R10: 0008 R11: 0202 R12: 7ffc65592410 [ 243.177696] R13: 55657f408dc0 R14: 0001 R15: 7ffc65592410 One solution is to use GFP_ATOMIC, but it will make the code less reliable under memory pressure. Let's move the memory allocation out of the sleeping region and put it into x86_reserve_hardware(). The disadvantage of this fix is that the cpuc->lbr_xsave memory will be allocated for each CPU, like the legacy ds_buffer. 
Fixes: c085fb8774 ("perf/x86/intel/lbr: Support XSAVES for arch LBR read") Suggested-by: Kan Liang Signed-off-by: Like Xu --- arch/x86/events/core.c | 8 +--- arch/x86/events/intel/bts.c | 2 +- arch/x86/events/intel/lbr.c | 22 -- arch/x86/events/perf_event.h | 8 +++- 4 files changed, 29 insertions(+), 11 deletions(-) diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c index 18df17129695..a4ce669cc78d 100644 --- a/arch/x86/events/core.c +++ b/arch/x86/events/core.c @@ -373,7 +373,7 @@ set_ext_hw_attr(struct hw_perf_event *hwc, struct perf_event *event) return x86_pmu_extra_regs(val, event); } -int x86_reserve_hardware(void) +int x86_reserve_hardware(struct perf_event *event) { int err = 0; @@ -382,8 +382,10 @@ int x86_reserve_hardware(void) if (atomic_read(&pmc_refcount) == 0) { if (!reserve_pmc_hardware()) err = -EBUSY; - else + else { reserve_ds_buffers(); + reserve_lbr_buffers(event); + } } if (!err) atomic_inc(&pmc_refcount); @@ -634,7 +636,7 @@ static int __x86_pmu_event_init(struct perf_event *event) if (!x86_pmu_initialized()) return -ENODEV; - err = x86_reserve_hardware(); +
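The change described in this patch can be pictured with a small userspace model: allocate the per-CPU buffer early, in a context that may sleep, so the scheduling path only consumes it. This is a sketch under stated assumptions. in_atomic_ctx, atomic_alloc_violations and every model_* function are hypothetical stand-ins; the counter plays the role of the might_sleep() splat shown in the trace.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

#define NR_CPUS		2
#define XSAVE_BUF_SIZE	64	/* arbitrary size for this sketch */

static void *lbr_xsave[NR_CPUS];	/* stand-in for cpuc->lbr_xsave */
static bool in_atomic_ctx;		/* true while "scheduling in" */
static int atomic_alloc_violations;	/* what might_sleep() would flag */

/* Models a GFP_KERNEL allocation: only legal where sleeping is allowed. */
static void *model_alloc_may_sleep(size_t size)
{
	if (in_atomic_ctx)
		atomic_alloc_violations++;
	return malloc(size);
}

/* Old behaviour: allocate lazily from the (atomic) event-add path. */
static bool model_lbr_add_old(int cpu)
{
	in_atomic_ctx = true;
	if (!lbr_xsave[cpu])
		lbr_xsave[cpu] = model_alloc_may_sleep(XSAVE_BUF_SIZE);
	in_atomic_ctx = false;
	return lbr_xsave[cpu] != NULL;
}

/* New behaviour, part 1: reserve for every CPU at event init time. */
static void model_reserve_lbr_buffers(void)
{
	for (int cpu = 0; cpu < NR_CPUS; cpu++) {
		if (!lbr_xsave[cpu])
			lbr_xsave[cpu] = model_alloc_may_sleep(XSAVE_BUF_SIZE);
	}
}

/* New behaviour, part 2: the event-add path only uses the buffer. */
static bool model_lbr_add_new(int cpu)
{
	bool ok;

	in_atomic_ctx = true;
	ok = lbr_xsave[cpu] != NULL;
	in_atomic_ctx = false;
	return ok;
}
```

The cost of this layout is the one the commit message names: every CPU gets a buffer up front, whether or not an LBR event ever runs there.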
[PATCH v2] x86: Update guest LBR tests for Architectural LBR
This unit test exercises KVM's basic support for Architectural LBRs, an architectural performance monitoring unit (PMU) feature on Intel processors, and includes negative testing of MSR_ARCH_LBR_DEPTH values. If the LBREN bit is set to 1 in MSR_ARCH_LBR_CTL, the processor records a running trace of the most recent branches taken by the guest in the LBR entries, which the guest can then read. Signed-off-by: Like Xu --- x86/pmu_lbr.c | 88 +-- 1 file changed, 79 insertions(+), 9 deletions(-) diff --git a/x86/pmu_lbr.c b/x86/pmu_lbr.c index 3bd9e9f..8cde208 100644 --- a/x86/pmu_lbr.c +++ b/x86/pmu_lbr.c @@ -6,6 +6,7 @@ #define MAX_NUM_LBR_ENTRY 32 #define DEBUGCTLMSR_LBR (1UL << 0) #define PMU_CAP_LBR_FMT 0x3f +#define KVM_ARCH_LBR_CTL_MASK 0x7f000f #define MSR_LBR_NHM_FROM 0x0680 #define MSR_LBR_NHM_TO 0x06c0 @@ -13,6 +14,10 @@ #define MSR_LBR_CORE_TO 0x0060 #define MSR_LBR_TOS 0x01c9 #define MSR_LBR_SELECT 0x01c8 +#define MSR_ARCH_LBR_CTL 0x14ce +#define MSR_ARCH_LBR_DEPTH 0x14cf +#define MSR_ARCH_LBR_FROM_0 0x1500 +#define MSR_ARCH_LBR_TO_0 0x1600 volatile int count; @@ -61,11 +66,26 @@ static bool test_init_lbr_from_exception(u64 index) return test_for_exception(GP_VECTOR, init_lbr, &index); } +static void change_archlbr_depth(void *depth) +{ + wrmsr(MSR_ARCH_LBR_DEPTH, *(u64 *)depth); +} + +static bool test_change_archlbr_depth_from_exception(u64 depth) +{ + return test_for_exception(GP_VECTOR, change_archlbr_depth, &depth); +} + int main(int ac, char **av) { struct cpuid id = cpuid(10); + struct cpuid id_7 = cpuid(7); + struct cpuid id_1c; u64 perf_cap; int max, i; + bool arch_lbr = false; + u32 ctl_msr = MSR_IA32_DEBUGCTLMSR; + u64 ctl_value = DEBUGCTLMSR_LBR; setup_vm(); perf_cap = rdmsr(MSR_IA32_PERF_CAPABILITIES); @@ -80,8 +100,19 @@ int main(int ac, char **av) return report_summary(); } + if (id_7.d & (1UL << 19)) { + arch_lbr = true; + ctl_msr = MSR_ARCH_LBR_CTL; + /* DEPTH defaults to the maximum number of LBR entries. 
*/ + max = rdmsr(MSR_ARCH_LBR_DEPTH) - 1; + ctl_value = KVM_ARCH_LBR_CTL_MASK; + } + printf("PMU version: %d\n", eax.split.version_id); - printf("LBR version: %ld\n", perf_cap & PMU_CAP_LBR_FMT); + if (!arch_lbr) + printf("LBR version: %ld\n", perf_cap & PMU_CAP_LBR_FMT); + else + printf("Architectural LBR depth: %d\n", max + 1); /* Look for LBR from and to MSRs */ lbr_from = MSR_LBR_CORE_FROM; @@ -90,32 +121,71 @@ int main(int ac, char **av) lbr_from = MSR_LBR_NHM_FROM; lbr_to = MSR_LBR_NHM_TO; } + if (test_init_lbr_from_exception(0)) { + lbr_from = MSR_ARCH_LBR_FROM_0; + lbr_to = MSR_ARCH_LBR_TO_0; + } if (test_init_lbr_from_exception(0)) { printf("LBR on this platform is not supported!\n"); return report_summary(); } - wrmsr(MSR_LBR_SELECT, 0); - wrmsr(MSR_LBR_TOS, 0); - for (max = 0; max < MAX_NUM_LBR_ENTRY; max++) { - if (test_init_lbr_from_exception(max)) - break; + if (arch_lbr) { + /* +* On processors that support Architectural LBRs, +* IA32_PERF_CAPABILITIES.LBR_FMT will have the value 03FH. +*/ + report(0x3f == (perf_cap & PMU_CAP_LBR_FMT), "The guest LBR_FMT value is good."); } + /* Reset the guest LBR entries. */ + if (arch_lbr) { + /* On a software write to IA32_LBR_DEPTH, all LBR entries are reset to 0.*/ + wrmsr(MSR_ARCH_LBR_DEPTH, max + 1); + } else { + wrmsr(MSR_LBR_SELECT, 0); + wrmsr(MSR_LBR_TOS, 0); + for (max = 0; max < MAX_NUM_LBR_ENTRY; max++) { + if (test_init_lbr_from_exception(max)) + break; + } + } report(max > 0, "The number of guest LBR entries is good."); + /* Check the guest LBR entries are initialized. */ + for (i = 0; i < max; ++i) { + if (rdmsr(lbr_to + i) || rdmsr(lbr_from + i)) + break; + } + report(i == max, "The guest LBR initialized FROM_IP/TO_IP values are good."); + /* Do some branch instructions. */ - wrmsr(MSR_IA32_DEBUGCTLMSR, DEBUGCTLMSR_LBR); + wrmsr(ctl_msr, ctl_value);
[PATCH v4 10/11] KVM: x86: Refine the matching and clearing logic for supported_xss
Refine the existing code path that clears supported_xss: initialize supported_xss from the host value filtered through the KVM_SUPPORTED_XSS mask, and then update it by clearing bits (rather than setting them). Suggested-by: Sean Christopherson Signed-off-by: Like Xu --- arch/x86/kvm/vmx/vmx.c | 5 +++-- arch/x86/kvm/x86.c | 6 +- 2 files changed, 8 insertions(+), 3 deletions(-) diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c index 03c0faf16a7d..14ed3251376f 100644 --- a/arch/x86/kvm/vmx/vmx.c +++ b/arch/x86/kvm/vmx/vmx.c @@ -7302,9 +7302,10 @@ static __init void vmx_set_cpu_caps(void) kvm_cpu_cap_set(X86_FEATURE_UMIP); /* CPUID 0xD.1 */ - supported_xss = 0; - if (!cpu_has_vmx_xsaves()) + if (!cpu_has_vmx_xsaves()) { kvm_cpu_cap_clear(X86_FEATURE_XSAVES); + supported_xss = 0; + } /* CPUID 0x8001 */ if (!cpu_has_vmx_rdtscp()) diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 4bcf5b130e38..171605dcbd65 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -205,6 +205,8 @@ static struct kvm_user_return_msrs __percpu *user_return_msrs; | XFEATURE_MASK_BNDCSR | XFEATURE_MASK_AVX512 \ | XFEATURE_MASK_PKRU) +#define KVM_SUPPORTED_XSS 0 + u64 __read_mostly host_efer; EXPORT_SYMBOL_GPL(host_efer); @@ -10450,8 +10452,10 @@ int kvm_arch_hardware_setup(void *opaque) rdmsrl_safe(MSR_EFER, &host_efer); - if (boot_cpu_has(X86_FEATURE_XSAVES)) + if (boot_cpu_has(X86_FEATURE_XSAVES)) { rdmsrl(MSR_IA32_XSS, host_xss); + supported_xss = host_xss & KVM_SUPPORTED_XSS; + } r = ops->hardware_setup(); if (r != 0) -- 2.29.2
[PATCH v4 11/11] KVM: x86: Add XSAVE Support for Architectural LBRs
On processors whose XSAVE feature set supports XSAVES and XRSTORS, the availability of support for Architectural LBR configuration state save and restore can be determined from CPUID.(EAX=0DH, ECX=1):EDX:ECX[bit 15]. The detailed leaf for Arch LBRs is enumerated in CPUID.(EAX=0DH, ECX=0FH). XSAVES provides a faster means than RDMSR for the guest to read all LBRs. When guest IA32_XSS[bit 15] is set, the Arch LBR state can be saved using XSAVES and restored by XRSTORS with the appropriate RFBM. If KVM fails to pass through the LBR MSRs to the guest, the LBR MSRs will be reset to prevent host records from leaking via XSAVES. In this case, the guest results may be inaccurate, as with the legacy LBR. Signed-off-by: Like Xu --- arch/x86/kvm/vmx/pmu_intel.c | 2 ++ arch/x86/kvm/vmx/vmx.c | 4 +++- arch/x86/kvm/x86.c | 2 +- 3 files changed, 6 insertions(+), 2 deletions(-) diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c index 9199d3974d57..7666292094ec 100644 --- a/arch/x86/kvm/vmx/pmu_intel.c +++ b/arch/x86/kvm/vmx/pmu_intel.c @@ -772,6 +772,8 @@ void vmx_passthrough_lbr_msrs(struct kvm_vcpu *vcpu) return; warn: + if (kvm_cpu_cap_has(X86_FEATURE_ARCH_LBR)) + wrmsrl(MSR_ARCH_LBR_DEPTH, lbr_desc->records.nr); pr_warn_ratelimited("kvm: vcpu-%d: fail to passthrough LBR.\n", vcpu->vcpu_id); } diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c index 14ed3251376f..659be0d708ac 100644 --- a/arch/x86/kvm/vmx/vmx.c +++ b/arch/x86/kvm/vmx/vmx.c @@ -7295,8 +7295,10 @@ static __init void vmx_set_cpu_caps(void) kvm_cpu_cap_clear(X86_FEATURE_INVPCID); if (vmx_pt_mode_is_host_guest()) kvm_cpu_cap_check_and_set(X86_FEATURE_INTEL_PT); - if (!cpu_has_vmx_arch_lbr()) + if (!cpu_has_vmx_arch_lbr()) { kvm_cpu_cap_clear(X86_FEATURE_ARCH_LBR); + supported_xss &= ~XFEATURE_MASK_LBR; + } if (vmx_umip_emulated()) kvm_cpu_cap_set(X86_FEATURE_UMIP); diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 171605dcbd65..2e0935795502 100644 --- a/arch/x86/kvm/x86.c +++ 
b/arch/x86/kvm/x86.c @@ -205,7 +205,7 @@ static struct kvm_user_return_msrs __percpu *user_return_msrs; | XFEATURE_MASK_BNDCSR | XFEATURE_MASK_AVX512 \ | XFEATURE_MASK_PKRU) -#define KVM_SUPPORTED_XSS 0 +#define KVM_SUPPORTED_XSS XFEATURE_MASK_LBR u64 __read_mostly host_efer; EXPORT_SYMBOL_GPL(host_efer); -- 2.29.2
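Taken together, patches 10/11 and 11/11 compute supported_xss in two stages: start from the bits that both the host and KVM support, then only clear bits as VMX capabilities turn out to be missing. The userspace sketch below models that flow. struct model_caps and model_supported_xss() are hypothetical, and XFEATURE_MASK_LBR is written as bit 15 because the commit message refers to IA32_XSS[bit 15].

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define XFEATURE_MASK_LBR	(1ull << 15)	/* IA32_XSS bit 15 */
#define KVM_SUPPORTED_XSS	XFEATURE_MASK_LBR

/* Hypothetical capability snapshot used only by this sketch. */
struct model_caps {
	bool host_has_xsaves;	/* boot_cpu_has(X86_FEATURE_XSAVES) */
	bool vmx_has_xsaves;	/* cpu_has_vmx_xsaves() */
	bool vmx_has_arch_lbr;	/* cpu_has_vmx_arch_lbr() */
	uint64_t host_xss;	/* value read from MSR_IA32_XSS */
};

static uint64_t model_supported_xss(const struct model_caps *c)
{
	uint64_t supported_xss = 0;

	/* kvm_arch_hardware_setup(): keep what host and KVM both support. */
	if (c->host_has_xsaves)
		supported_xss = c->host_xss & KVM_SUPPORTED_XSS;

	/* vmx_set_cpu_caps(): from here on, bits are only ever cleared. */
	if (!c->vmx_has_arch_lbr)
		supported_xss &= ~XFEATURE_MASK_LBR;
	if (!c->vmx_has_xsaves)
		supported_xss = 0;

	return supported_xss;
}
```

Because later stages can only remove bits, a feature that the host lacks can never reappear in supported_xss by accident.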
[PATCH v4 09/11] KVM: x86: Expose Architectural LBR CPUID leaf
If CPUID.(EAX=07H, ECX=0):EDX[19] is set to 1, then KVM supports Arch LBRs and CPUID leaf 01CH indicates details of the Arch LBRs capabilities. Currently, KVM only supports the current host LBR depth for guests, which is also the maximum supported depth on the host. Signed-off-by: Like Xu --- arch/x86/kvm/cpuid.c | 25 - arch/x86/kvm/vmx/vmx.c | 2 ++ 2 files changed, 26 insertions(+), 1 deletion(-) diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c index b4247f821277..4473324fe7be 100644 --- a/arch/x86/kvm/cpuid.c +++ b/arch/x86/kvm/cpuid.c @@ -450,7 +450,7 @@ void kvm_set_cpu_caps(void) F(AVX512_4VNNIW) | F(AVX512_4FMAPS) | F(SPEC_CTRL) | F(SPEC_CTRL_SSBD) | F(ARCH_CAPABILITIES) | F(INTEL_STIBP) | F(MD_CLEAR) | F(AVX512_VP2INTERSECT) | F(FSRM) | - F(SERIALIZE) | F(TSXLDTRK) | F(AVX512_FP16) + F(SERIALIZE) | F(TSXLDTRK) | F(AVX512_FP16) | F(ARCH_LBR) ); /* TSC_ADJUST and ARCH_CAPABILITIES are emulated in software. */ @@ -805,6 +805,29 @@ static inline int __do_cpuid_func(struct kvm_cpuid_array *array, u32 function) goto out; } break; + /* Architectural LBR */ + case 0x1c: + { + u64 lbr_depth_mask = entry->eax & 0xff; + + if (!lbr_depth_mask || !kvm_cpu_cap_has(X86_FEATURE_ARCH_LBR)) { + entry->eax = entry->ebx = entry->ecx = entry->edx = 0; + break; + } + + /* +* KVM only exposes the maximum supported depth, +* which is also the fixed value used on the host. +* +* KVM doesn't allow VMM user space to adjust depth +* per guest, because the guest LBR emulation depends +* on the implementation of the host LBR driver. 
+*/ + lbr_depth_mask = 1UL << (fls(lbr_depth_mask) - 1); + entry->eax &= ~0xff; + entry->eax |= lbr_depth_mask; + break; + } case KVM_CPUID_SIGNATURE: { static const char signature[12] = "KVMKVMKVM\0\0"; const u32 *sigptr = (const u32 *)signature; diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c index 43e73ea12ba6..03c0faf16a7d 100644 --- a/arch/x86/kvm/vmx/vmx.c +++ b/arch/x86/kvm/vmx/vmx.c @@ -7295,6 +7295,8 @@ static __init void vmx_set_cpu_caps(void) kvm_cpu_cap_clear(X86_FEATURE_INVPCID); if (vmx_pt_mode_is_host_guest()) kvm_cpu_cap_check_and_set(X86_FEATURE_INTEL_PT); + if (!cpu_has_vmx_arch_lbr()) + kvm_cpu_cap_clear(X86_FEATURE_ARCH_LBR); if (vmx_umip_emulated()) kvm_cpu_cap_set(X86_FEATURE_UMIP); -- 2.29.2
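The depth handling in the 0x1c case above keeps only the highest set bit of EAX[7:0], so the guest sees a single supported depth. A userspace sketch of that arithmetic follows. model_fls() is a portable stand-in for the kernel's fls(), the other two functions are hypothetical, and only EAX is modelled (the patch also zeroes EBX, ECX and EDX when the leaf is hidden).

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for fls(): 1-based index of the highest set bit, 0 if none. */
static int model_fls(uint32_t x)
{
	int pos = 0;

	while (x) {
		pos++;
		x >>= 1;
	}
	return pos;
}

/* Models the EAX handling of the "case 0x1c" block. */
static uint32_t model_cpuid_1c_eax(uint32_t host_eax, int arch_lbr_supported)
{
	uint32_t depth_mask = host_eax & 0xff;

	if (!depth_mask || !arch_lbr_supported)
		return 0;

	/* Keep only the largest supported depth. */
	depth_mask = 1u << (model_fls(depth_mask) - 1);
	return (host_eax & ~0xffu) | depth_mask;
}

/* Bit n in EAX[7:0] means depth 8 * (n + 1); for one-hot masks fls = n + 1. */
static unsigned int model_depth_from_bit(uint32_t one_hot_mask)
{
	return 8u * (unsigned int)model_fls(one_hot_mask);
}
```

For example, a host mask of 0x0b (depths 8, 16 and 32) is reduced to 0x08, which corresponds to a depth of 32.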
[PATCH v4 08/11] KVM: vmx/pmu: Add Arch LBR emulation and its VMCS field
New VMX control bits for Arch LBR are added. When bit 21 in vmentry_ctrl is set, VM entry will write the value from the "Guest IA32_LBR_CTL" guest state field to IA32_LBR_CTL. When bit 26 in vmexit_ctrl is set, VM exit will clear IA32_LBR_CTL after the value has been saved to the "Guest IA32_LBR_CTL" guest state field. The host value is saved before VM entry and restored after VM exit, like the legacy host_debugctlmsr. To enable guest Arch LBR, KVM should set both the "Load Guest IA32_LBR_CTL" entry control and the "Clear IA32_LBR_CTL" exit control bits. If these two conditions cannot be met, KVM will clear the LBR_FMT bits and will not expose the Arch LBR feature. If Arch LBR is exposed by KVM, the guest should set both the ARCH_LBR CPUID bit and the same LBR_FMT value as the host via MSR_IA32_PERF_CAPABILITIES to enable guest Arch LBR. KVM will bypass the host/guest x86 CPU model check, and the record MSRs can still be passed through to the guest as usual and work like a model-specific LBR. KVM is consistent with the host and does not support the LER entry. 
Signed-off-by: Like Xu --- arch/x86/include/asm/vmx.h | 2 ++ arch/x86/kvm/vmx/capabilities.h | 25 + arch/x86/kvm/vmx/pmu_intel.c| 27 ++- arch/x86/kvm/vmx/vmx.c | 32 ++-- arch/x86/kvm/vmx/vmx.h | 1 + 5 files changed, 72 insertions(+), 15 deletions(-) diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h index 6826fd0e8d1a..973bf16720c2 100644 --- a/arch/x86/include/asm/vmx.h +++ b/arch/x86/include/asm/vmx.h @@ -95,6 +95,7 @@ #define VM_EXIT_CLEAR_BNDCFGS 0x0080 #define VM_EXIT_PT_CONCEAL_PIP 0x0100 #define VM_EXIT_CLEAR_IA32_RTIT_CTL0x0200 +#define VM_EXIT_CLEAR_IA32_LBR_CTL 0x0400 #define VM_EXIT_ALWAYSON_WITHOUT_TRUE_MSR 0x00036dff @@ -108,6 +109,7 @@ #define VM_ENTRY_LOAD_BNDCFGS 0x0001 #define VM_ENTRY_PT_CONCEAL_PIP0x0002 #define VM_ENTRY_LOAD_IA32_RTIT_CTL0x0004 +#define VM_ENTRY_LOAD_IA32_LBR_CTL 0x0020 #define VM_ENTRY_ALWAYSON_WITHOUT_TRUE_MSR 0x11ff diff --git a/arch/x86/kvm/vmx/capabilities.h b/arch/x86/kvm/vmx/capabilities.h index d1d77985e889..73fceb534c7c 100644 --- a/arch/x86/kvm/vmx/capabilities.h +++ b/arch/x86/kvm/vmx/capabilities.h @@ -378,20 +378,29 @@ static inline bool vmx_pt_mode_is_host_guest(void) return pt_mode == PT_MODE_HOST_GUEST; } -static inline u64 vmx_get_perf_capabilities(void) +static inline bool cpu_has_vmx_arch_lbr(void) { - u64 perf_cap = 0; - - if (boot_cpu_has(X86_FEATURE_PDCM)) - rdmsrl(MSR_IA32_PERF_CAPABILITIES, perf_cap); - - perf_cap &= PMU_CAP_LBR_FMT; + return (vmcs_config.vmexit_ctrl & VM_EXIT_CLEAR_IA32_LBR_CTL) && + (vmcs_config.vmentry_ctrl & VM_ENTRY_LOAD_IA32_LBR_CTL); +} +static inline u64 vmx_get_perf_capabilities(void) +{ /* * Since counters are virtualized, KVM would support full * width counting unconditionally, even if the host lacks it. 
*/ - return PMU_CAP_FW_WRITES | perf_cap; + u64 perf_cap = PMU_CAP_FW_WRITES; + u64 host_perf_cap = 0; + + if (boot_cpu_has(X86_FEATURE_PDCM)) + rdmsrl(MSR_IA32_PERF_CAPABILITIES, host_perf_cap); + + perf_cap |= host_perf_cap & PMU_CAP_LBR_FMT; + if (boot_cpu_has(X86_FEATURE_ARCH_LBR) && !cpu_has_vmx_arch_lbr()) + perf_cap &= ~PMU_CAP_LBR_FMT; + + return perf_cap; } static inline u64 vmx_supported_debugctl(void) diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c index 15490d31b828..9199d3974d57 100644 --- a/arch/x86/kvm/vmx/pmu_intel.c +++ b/arch/x86/kvm/vmx/pmu_intel.c @@ -181,12 +181,16 @@ static inline struct kvm_pmc *get_fw_gp_pmc(struct kvm_pmu *pmu, u32 msr) bool intel_pmu_lbr_is_compatible(struct kvm_vcpu *vcpu) { + if (kvm_cpu_cap_has(X86_FEATURE_ARCH_LBR)) + return guest_cpuid_has(vcpu, X86_FEATURE_ARCH_LBR); + /* * As a first step, a guest could only enable LBR feature if its * cpu model is the same as the host because the LBR registers * would be pass-through to the guest and they're model specific. */ - return boot_cpu_data.x86_model == guest_cpuid_model(vcpu); + return !boot_cpu_has(X86_FEATURE_ARCH_LBR) && + boot_cpu_data.x86_model == guest_cpuid_model(vcpu); } bool intel_pmu_lbr_is_enabled(struct kvm_vcpu *vcpu) @@ -204,8 +208,11 @@ static bool intel_pmu_is_valid_lbr_msr(struct kvm_vcpu *vcpu, u32 index) if (!intel_pmu_lbr_is_enabled(vcpu)) return ret; - ret = (index == MSR_LBR_SELECT) || (index == MSR_LBR_TOS) || -
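The capability logic in this patch can be modelled in userspace as follows. The bit positions for the two VMX controls come from the commit message (bit 21 of vmentry_ctrl, bit 26 of vmexit_ctrl), and PMU_CAP_LBR_FMT is 0x3f as elsewhere in this thread. PMU_CAP_FW_WRITES is assumed to be bit 13, which is not shown in the diff. struct model_host and the model_* functions are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define VM_ENTRY_LOAD_IA32_LBR_CTL	(1u << 21)	/* per commit message */
#define VM_EXIT_CLEAR_IA32_LBR_CTL	(1u << 26)	/* per commit message */
#define PMU_CAP_LBR_FMT			0x3full
#define PMU_CAP_FW_WRITES		(1ull << 13)	/* assumed bit position */

/* Hypothetical host description used only by this sketch. */
struct model_host {
	bool has_pdcm;			/* X86_FEATURE_PDCM */
	bool has_arch_lbr;		/* X86_FEATURE_ARCH_LBR */
	uint32_t vmentry_ctrl;
	uint32_t vmexit_ctrl;
	uint64_t perf_capabilities;	/* MSR_IA32_PERF_CAPABILITIES */
};

/* Both the entry "load" and the exit "clear" control must be available. */
static bool model_cpu_has_vmx_arch_lbr(const struct model_host *h)
{
	return (h->vmexit_ctrl & VM_EXIT_CLEAR_IA32_LBR_CTL) &&
	       (h->vmentry_ctrl & VM_ENTRY_LOAD_IA32_LBR_CTL);
}

/* Mirrors the reworked vmx_get_perf_capabilities() from the diff. */
static uint64_t model_get_perf_capabilities(const struct model_host *h)
{
	uint64_t perf_cap = PMU_CAP_FW_WRITES;
	uint64_t host_perf_cap = 0;

	if (h->has_pdcm)
		host_perf_cap = h->perf_capabilities;

	perf_cap |= host_perf_cap & PMU_CAP_LBR_FMT;
	/* Arch LBR host without both VMX controls: hide LBR_FMT entirely. */
	if (h->has_arch_lbr && !model_cpu_has_vmx_arch_lbr(h))
		perf_cap &= ~PMU_CAP_LBR_FMT;

	return perf_cap;
}
```

A legacy LBR host is unaffected by the new check: its LBR_FMT value is passed along regardless of the two Arch LBR controls.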
[PATCH v4 06/11] KVM: vmx/pmu: Add MSR_ARCH_LBR_DEPTH emulation for Arch LBR
The number of Arch LBR entries available for recording operations is dictated by the value in MSR_ARCH_LBR_DEPTH.DEPTH. The supported LBR depth values can be found in CPUID.(EAX=01CH, ECX=0):EAX[7:0] and for each bit "n" set in this field, the MSR_ARCH_LBR_DEPTH.DEPTH value of "8*(n+1)" is supported. On a guest write to MSR_ARCH_LBR_DEPTH, all LBR entries are reset to 0. KVM emulates the reset behavior by introducing lbr_desc->arch_lbr_reset. KVM writes the guest requested value to the native ARCH_LBR_DEPTH MSR (this is safe because the two values will be the same) when the Arch LBR records MSRs are pass-through to the guest. Signed-off-by: Like Xu --- arch/x86/kvm/vmx/pmu_intel.c | 43 arch/x86/kvm/vmx/vmx.h | 3 +++ 2 files changed, 46 insertions(+) diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c index 9efc1a6b8693..d9c9cb6c9a4b 100644 --- a/arch/x86/kvm/vmx/pmu_intel.c +++ b/arch/x86/kvm/vmx/pmu_intel.c @@ -220,6 +220,9 @@ static bool intel_is_valid_msr(struct kvm_vcpu *vcpu, u32 msr) case MSR_CORE_PERF_GLOBAL_OVF_CTRL: ret = pmu->version > 1; break; + case MSR_ARCH_LBR_DEPTH: + ret = guest_cpuid_has(vcpu, X86_FEATURE_ARCH_LBR); + break; default: ret = get_gp_pmc(pmu, msr, MSR_IA32_PERFCTR0) || get_gp_pmc(pmu, msr, MSR_P6_EVNTSEL0) || @@ -250,6 +253,7 @@ static inline void intel_pmu_release_guest_lbr_event(struct kvm_vcpu *vcpu) if (lbr_desc->event) { perf_event_release_kernel(lbr_desc->event); lbr_desc->event = NULL; + lbr_desc->arch_lbr_reset = false; vcpu_to_pmu(vcpu)->event_count--; } } @@ -348,10 +352,26 @@ static bool intel_pmu_handle_lbr_msrs_access(struct kvm_vcpu *vcpu, return true; } +/* + * Check if the requested depth values is supported + * based on the bits [0:7] of the guest cpuid.1c.eax. 
+ */ +static bool arch_lbr_depth_is_valid(struct kvm_vcpu *vcpu, u64 depth) +{ + struct kvm_cpuid_entry2 *best; + + best = kvm_find_cpuid_entry(vcpu, 0x1c, 0); + if (best && depth && (depth < 65) && !(depth & 7)) + return best->eax & BIT_ULL(depth / 8 - 1); + + return false; +} + static int intel_pmu_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info) { struct kvm_pmu *pmu = vcpu_to_pmu(vcpu); struct kvm_pmc *pmc; + struct lbr_desc *lbr_desc = vcpu_to_lbr_desc(vcpu); u32 msr = msr_info->index; switch (msr) { @@ -367,6 +387,9 @@ static int intel_pmu_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info) case MSR_CORE_PERF_GLOBAL_OVF_CTRL: msr_info->data = pmu->global_ovf_ctrl; return 0; + case MSR_ARCH_LBR_DEPTH: + msr_info->data = lbr_desc->records.nr; + return 0; default: if ((pmc = get_gp_pmc(pmu, msr, MSR_IA32_PERFCTR0)) || (pmc = get_gp_pmc(pmu, msr, MSR_IA32_PMC0))) { @@ -393,6 +416,7 @@ static int intel_pmu_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info) { struct kvm_pmu *pmu = vcpu_to_pmu(vcpu); struct kvm_pmc *pmc; + struct lbr_desc *lbr_desc = vcpu_to_lbr_desc(vcpu); u32 msr = msr_info->index; u64 data = msr_info->data; @@ -427,6 +451,12 @@ static int intel_pmu_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info) return 0; } break; + case MSR_ARCH_LBR_DEPTH: + if (!arch_lbr_depth_is_valid(vcpu, data)) + return 1; + lbr_desc->records.nr = data; + lbr_desc->arch_lbr_reset = true; + return 0; default: if ((pmc = get_gp_pmc(pmu, msr, MSR_IA32_PERFCTR0)) || (pmc = get_gp_pmc(pmu, msr, MSR_IA32_PMC0))) { @@ -566,6 +596,7 @@ static void intel_pmu_init(struct kvm_vcpu *vcpu) lbr_desc->records.nr = 0; lbr_desc->event = NULL; lbr_desc->msr_passthrough = false; + lbr_desc->arch_lbr_reset = false; } static void intel_pmu_reset(struct kvm_vcpu *vcpu) @@ -623,6 +654,15 @@ static void intel_pmu_deliver_pmi(struct kvm_vcpu *vcpu) intel_pmu_legacy_freezing_lbrs_on_pmi(vcpu); } +static void intel_pmu_arch_lbr_reset(struct kvm_vcpu *vcpu) +{ + struct 
lbr_desc *lbr_desc = vcpu_to_lbr_desc(vcpu); + + /* On a software write to IA32_LBR_DEPTH, all LBR entries are reset to 0. */ + wrmsrl(MSR_ARCH_LBR_DEPTH, lbr_desc->records.nr); + lbr_desc->arch_lbr_reset = false; +} + static void vmx_update_intercept_for_lbr_msrs(struct kvm_vcpu *vcpu, bool set) { struct x86_pmu_lbr
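The depth check in arch_lbr_depth_is_valid() above can be modeled in user space. This is only a minimal sketch: lbr_depth_is_valid() and lbr_depth_demo() are hypothetical names, and the depth_bitmap argument stands in for the low byte of CPUID.1CH:EAX that the kernel code reads through kvm_find_cpuid_entry(). As in the diff, bit n of the bitmap is taken to mean that a depth of 8 * (n + 1) entries is supported.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical user-space model of the depth check.
 * depth_bitmap: assumed to hold CPUID.1CH:EAX[7:0].
 * depth: the value a guest tries to write to the depth MSR.
 */
static bool lbr_depth_is_valid(uint32_t depth_bitmap, uint64_t depth)
{
	/* Depth must be a non-zero multiple of 8 and at most 64. */
	if (depth == 0 || depth > 64 || (depth & 7))
		return false;

	/* Bit (depth / 8 - 1) must be set in the enumerated bitmap. */
	return (depth_bitmap >> (depth / 8 - 1)) & 1;
}

/* Usage: a CPU that enumerates only a depth of 32 (bit 3 set). */
void lbr_depth_demo(void)
{
	assert(lbr_depth_is_valid(0x08, 32));
	assert(!lbr_depth_is_valid(0x08, 16));	/* bit 1 is clear */
	assert(!lbr_depth_is_valid(0x08, 0));	/* zero is never valid */
}
```

In the patch, a rejected depth makes the MSR write handler return 1, which KVM turns into a fault for the guest.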
[PATCH v4 03/11] perf/x86/lbr: Skip checking for the existence of LBR_TOS for Arch LBR
The Architecture LBR does not have MSR_LBR_TOS (0x01c9). KVM will
generate #GP for this MSR access, thereby preventing the initialization
of the guest LBR.

Cc: Kan Liang
Cc: Peter Zijlstra
Cc: Borislav Petkov
Cc: Ingo Molnar
Fixes: 47125db27e47 ("perf/x86/intel/lbr: Support Architectural LBR")
Signed-off-by: Like Xu
Reviewed-by: Kan Liang
Reviewed-by: Andi Kleen
---
 arch/x86/events/intel/core.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index 7bb96ac87615..0338e354826d 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -5568,7 +5568,8 @@ __init int intel_pmu_init(void)
 	 * Check all LBR MSR here.
 	 * Disable LBR access if any LBR MSRs can not be accessed.
 	 */
-	if (x86_pmu.lbr_nr && !check_msr(x86_pmu.lbr_tos, 0x3UL))
+	if (x86_pmu.lbr_nr && !boot_cpu_has(X86_FEATURE_ARCH_LBR) &&
+	    !check_msr(x86_pmu.lbr_tos, 0x3UL))
 		x86_pmu.lbr_nr = 0;
 	for (i = 0; i < x86_pmu.lbr_nr; i++) {
 		if (!(check_msr(x86_pmu.lbr_from + i, 0xUL) &&
--
2.29.2
[PATCH v4 05/11] perf/x86: Move ARCH_LBR_CTL_MASK definition to include/asm/msr-index.h
The ARCH_LBR_CTL_MASK will be reused for LBR emulation in the KVM.

Cc: Kan Liang
Cc: Peter Zijlstra
Cc: Borislav Petkov
Cc: Ingo Molnar
Signed-off-by: Like Xu
---
 arch/x86/events/intel/lbr.c      | 2 --
 arch/x86/include/asm/msr-index.h | 1 +
 2 files changed, 1 insertion(+), 2 deletions(-)

diff --git a/arch/x86/events/intel/lbr.c b/arch/x86/events/intel/lbr.c
index 237876733e12..f60339ff0c13 100644
--- a/arch/x86/events/intel/lbr.c
+++ b/arch/x86/events/intel/lbr.c
@@ -168,8 +168,6 @@ enum {
 	ARCH_LBR_RETURN|\
 	ARCH_LBR_OTHER_BRANCH)
 
-#define ARCH_LBR_CTL_MASK 0x7f000e
-
 static void intel_pmu_lbr_filter(struct cpu_hw_events *cpuc);
 
 static __always_inline bool is_lbr_call_stack_bit_set(u64 config)
diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index 546d6ecf0a35..8f3375961efc 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -169,6 +169,7 @@
 #define LBR_INFO_BR_TYPE (0xfull << LBR_INFO_BR_TYPE_OFFSET)
 
 #define MSR_ARCH_LBR_CTL 0x14ce
+#define ARCH_LBR_CTL_MASK 0x7f000e
 #define ARCH_LBR_CTL_LBREN BIT(0)
 #define ARCH_LBR_CTL_CPL_OFFSET 1
 #define ARCH_LBR_CTL_CPL (0x3ull << ARCH_LBR_CTL_CPL_OFFSET)
--
2.29.2
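To illustrate how the shared mask is meant to be reused, here is a minimal user-space sketch of the reserved-bit check that the MSR_ARCH_LBR_CTL emulation patch in this series performs. The two mask values are copied from the patches; arch_lbr_ctl_is_valid(), arch_lbr_ctl_enabled() and arch_lbr_ctl_demo() are hypothetical names used only for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define ARCH_LBR_CTL_MASK	0x7f000eULL	/* value from msr-index.h */
#define LBR_CTL_EN		(1ULL << 0)	/* LBREn, bit 0 */
#define KVM_ARCH_LBR_CTL_MASK	(ARCH_LBR_CTL_MASK | LBR_CTL_EN)

/* Hypothetical helper: a write is valid only if no reserved bit is set. */
static bool arch_lbr_ctl_is_valid(uint64_t data)
{
	return !(data & ~KVM_ARCH_LBR_CTL_MASK);
}

/* Hypothetical helper: does this value turn LBR recording on? */
static bool arch_lbr_ctl_enabled(uint64_t data)
{
	return data & LBR_CTL_EN;
}

void arch_lbr_ctl_demo(void)
{
	assert(arch_lbr_ctl_is_valid(0x7f000f));	/* every allowed bit */
	assert(!arch_lbr_ctl_is_valid(0x10));		/* bit 4 is reserved */
	assert(arch_lbr_ctl_enabled(0x1));
	assert(!arch_lbr_ctl_enabled(0x7f000e));
}
```

Keeping the enable bit out of ARCH_LBR_CTL_MASK and OR-ing it in on the KVM side follows the series; the combined value 0x7f000f is the one the v3 patch and the unit test hard-code.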
[PATCH v4 07/11] KVM: vmx/pmu: Add MSR_ARCH_LBR_CTL emulation for Arch LBR
Arch LBRs are enabled by setting MSR_ARCH_LBR_CTL.LBREn to 1. A new guest state field named "Guest IA32_LBR_CTL" is added to enhance guest LBR usage. When guest Arch LBR is enabled, a guest LBR event will be created like the model-specific LBR does. On processors that support Arch LBR, MSR_IA32_DEBUGCTLMSR[bit 0] has no meaning. It can be written to 0 or 1, but reads will always return 0. Like IA32_DEBUGCTL, IA32_ARCH_LBR_CTL msr is also reserved on INIT. Signed-off-by: Like Xu --- arch/x86/include/asm/vmx.h | 2 ++ arch/x86/kvm/vmx/pmu_intel.c | 31 ++- arch/x86/kvm/vmx/vmx.c | 9 + 3 files changed, 37 insertions(+), 5 deletions(-) diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h index 358707f60d99..6826fd0e8d1a 100644 --- a/arch/x86/include/asm/vmx.h +++ b/arch/x86/include/asm/vmx.h @@ -245,6 +245,8 @@ enum vmcs_field { GUEST_BNDCFGS_HIGH = 0x2813, GUEST_IA32_RTIT_CTL = 0x2814, GUEST_IA32_RTIT_CTL_HIGH= 0x2815, + GUEST_IA32_LBR_CTL = 0x2816, + GUEST_IA32_LBR_CTL_HIGH = 0x2817, HOST_IA32_PAT = 0x2c00, HOST_IA32_PAT_HIGH = 0x2c01, HOST_IA32_EFER = 0x2c02, diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c index d9c9cb6c9a4b..15490d31b828 100644 --- a/arch/x86/kvm/vmx/pmu_intel.c +++ b/arch/x86/kvm/vmx/pmu_intel.c @@ -19,6 +19,12 @@ #include "pmu.h" #define MSR_PMC_FULL_WIDTH_BIT (MSR_IA32_PMC0 - MSR_IA32_PERFCTR0) +/* + * Regardless of the Arch LBR or legacy LBR, when the LBREn bit 0 of the + * corresponding control MSR is set to 1, LBR recording will be enabled. 
+ */ +#define LBR_CTL_EN BIT(0) +#define KVM_ARCH_LBR_CTL_MASK (ARCH_LBR_CTL_MASK | LBR_CTL_EN) static struct kvm_event_hw_type_mapping intel_arch_events[] = { /* Index must match CPUID 0x0A.EBX bit vector */ @@ -221,6 +227,7 @@ static bool intel_is_valid_msr(struct kvm_vcpu *vcpu, u32 msr) ret = pmu->version > 1; break; case MSR_ARCH_LBR_DEPTH: + case MSR_ARCH_LBR_CTL: ret = guest_cpuid_has(vcpu, X86_FEATURE_ARCH_LBR); break; default: @@ -390,6 +397,9 @@ static int intel_pmu_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info) case MSR_ARCH_LBR_DEPTH: msr_info->data = lbr_desc->records.nr; return 0; + case MSR_ARCH_LBR_CTL: + msr_info->data = vmcs_read64(GUEST_IA32_LBR_CTL); + return 0; default: if ((pmc = get_gp_pmc(pmu, msr, MSR_IA32_PERFCTR0)) || (pmc = get_gp_pmc(pmu, msr, MSR_IA32_PMC0))) { @@ -457,6 +467,14 @@ static int intel_pmu_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info) lbr_desc->records.nr = data; lbr_desc->arch_lbr_reset = true; return 0; + case MSR_ARCH_LBR_CTL: + if (data & ~KVM_ARCH_LBR_CTL_MASK) + break; + vmcs_write64(GUEST_IA32_LBR_CTL, data); + if (intel_pmu_lbr_is_enabled(vcpu) && !lbr_desc->event && + (data & ARCH_LBR_CTL_LBREN)) + intel_pmu_create_guest_lbr_event(vcpu); + return 0; default: if ((pmc = get_gp_pmc(pmu, msr, MSR_IA32_PERFCTR0)) || (pmc = get_gp_pmc(pmu, msr, MSR_IA32_PMC0))) { @@ -635,12 +653,15 @@ static void intel_pmu_reset(struct kvm_vcpu *vcpu) */ static void intel_pmu_legacy_freezing_lbrs_on_pmi(struct kvm_vcpu *vcpu) { - u64 data = vmcs_read64(GUEST_IA32_DEBUGCTL); + u32 lbr_ctl_field = GUEST_IA32_DEBUGCTL; - if (data & DEBUGCTLMSR_FREEZE_LBRS_ON_PMI) { - data &= ~DEBUGCTLMSR_LBR; - vmcs_write64(GUEST_IA32_DEBUGCTL, data); - } + if (!(vmcs_read64(GUEST_IA32_DEBUGCTL) & DEBUGCTLMSR_FREEZE_LBRS_ON_PMI)) + return; + + if (guest_cpuid_has(vcpu, X86_FEATURE_ARCH_LBR)) + lbr_ctl_field = GUEST_IA32_LBR_CTL; + + vmcs_write64(lbr_ctl_field, vmcs_read64(lbr_ctl_field) & ~LBR_CTL_EN); } static void 
intel_pmu_deliver_pmi(struct kvm_vcpu *vcpu) diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c index ef826594365f..38007daba935 100644 --- a/arch/x86/kvm/vmx/vmx.c +++ b/arch/x86/kvm/vmx/vmx.c @@ -2054,6 +2054,13 @@ static int vmx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info) VM_EXIT_SAVE_DEBUG_CONTROLS) get_vmcs12(vcpu)->guest_ia32_debugctl = data; + /* +* For Arch LBR, IA32_DEBUGCTL[bit 0] has no meaning. +* It can be written to 0 or 1, but r
[PATCH v4 04/11] perf/x86/lbr: Move cpuc->lbr_xsave allocation out of sleeping region
If the kernel is compiled with the CONFIG_LOCKDEP option, the conditional
might_sleep_if() deep in kmem_cache_alloc() will generate the following
trace, and potentially cause a deadlock when another LBR event is added:

[ 243.115549] BUG: sleeping function called from invalid context at include/linux/sched/mm.h:196
[ 243.117576] in_atomic(): 1, irqs_disabled(): 1, non_block: 0, pid: 839, name: perf
[ 243.119326] INFO: lockdep is turned off.
[ 243.120249] irq event stamp: 0
[ 243.120967] hardirqs last enabled at (0): [<>] 0x0
[ 243.122415] hardirqs last disabled at (0): [] copy_process+0xa45/0x1dc0
[ 243.124302] softirqs last enabled at (0): [] copy_process+0xa45/0x1dc0
[ 243.126255] softirqs last disabled at (0): [<>] 0x0
[ 243.128119] CPU: 0 PID: 839 Comm: perf Not tainted 5.11.0-rc4-guest+ #8
[ 243.129654] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 0.0.0 02/06/2015
[ 243.131520] Call Trace:
[ 243.132112] dump_stack+0x8d/0xb5
[ 243.132896] ___might_sleep.cold.106+0xb3/0xc3
[ 243.133984] slab_pre_alloc_hook.constprop.85+0x96/0xd0
[ 243.135208] ? intel_pmu_lbr_add+0x152/0x170
[ 243.136207] kmem_cache_alloc+0x36/0x250
[ 243.137126] intel_pmu_lbr_add+0x152/0x170
[ 243.138088] x86_pmu_add+0x83/0xd0
[ 243.138889] ? lock_acquire+0x158/0x350
[ 243.139791] ? lock_acquire+0x158/0x350
[ 243.140694] ? lock_acquire+0x158/0x350
[ 243.141625] ? lock_acquired+0x1e3/0x360
[ 243.142544] ? lock_release+0x1bf/0x340
[ 243.143726] ? trace_hardirqs_on+0x1a/0xd0
[ 243.144823] ? lock_acquired+0x1e3/0x360
[ 243.145742] ? lock_release+0x1bf/0x340
[ 243.147107] ? __slab_free+0x49/0x540
[ 243.147966] ? trace_hardirqs_on+0x1a/0xd0
[ 243.148924] event_sched_in.isra.129+0xf8/0x2a0
[ 243.149989] merge_sched_in+0x261/0x3e0
[ 243.150889] ? trace_hardirqs_on+0x1a/0xd0
[ 243.151869] visit_groups_merge.constprop.135+0x130/0x4a0
[ 243.153122] ? sched_clock_cpu+0xc/0xb0
[ 243.154023] ctx_sched_in+0x101/0x210
[ 243.154884] ctx_resched+0x6f/0xc0
[ 243.155686] perf_event_exec+0x21e/0x2e0
[ 243.156641] begin_new_exec+0x5e5/0xbd0
[ 243.157540] load_elf_binary+0x6af/0x1770
[ 243.158478] ? __kernel_read+0x19d/0x2b0
[ 243.159977] ? lock_acquire+0x158/0x350
[ 243.160876] ? __kernel_read+0x19d/0x2b0
[ 243.161796] bprm_execve+0x3c8/0x840
[ 243.162638] do_execveat_common.isra.38+0x1a5/0x1c0
[ 243.163776] __x64_sys_execve+0x32/0x40
[ 243.164676] do_syscall_64+0x33/0x40
[ 243.165514] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 243.166746] RIP: 0033:0x7f6180a26feb
[ 243.167590] Code: Unable to access opcode bytes at RIP 0x7f6180a26fc1.
[ 243.169097] RSP: 002b:7ffc6558ce18 EFLAGS: 0202 ORIG_RAX: 003b
[ 243.170844] RAX: ffda RBX: 7ffc65592d30 RCX: 7f6180a26feb
[ 243.172514] RDX: 55657f408dc0 RSI: 7ffc65592410 RDI: 7ffc65592d30
[ 243.174162] RBP: 7ffc6558ce80 R08: 7ffc6558cde0 R09:
[ 243.176042] R10: 0008 R11: 0202 R12: 7ffc65592410
[ 243.177696] R13: 55657f408dc0 R14: 0001 R15: 7ffc65592410

One solution is to use GFP_ATOMIC, but it will make the code less
reliable under memory pressure. Let's move the memory allocation out of
the sleeping region and put it into the x86_reserve_hardware().

The disadvantage of this fix is that the cpuc->lbr_xsave memory will be
allocated for each cpu like the legacy ds_buffer.

Cc: Kan Liang
Cc: Peter Zijlstra
Cc: Borislav Petkov
Cc: Ingo Molnar
Fixes: c085fb8774 ("perf/x86/intel/lbr: Support XSAVES for arch LBR read")
Suggested-by: Kan Liang
Signed-off-by: Like Xu
---
 arch/x86/events/core.c       | 8 +---
 arch/x86/events/intel/bts.c  | 2 +-
 arch/x86/events/intel/lbr.c  | 22 --
 arch/x86/events/perf_event.h | 8 +++-
 4 files changed, 29 insertions(+), 11 deletions(-)

diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index e37de298a495..b55f43481272 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -365,7 +365,7 @@ set_ext_hw_attr(struct hw_perf_event *hwc, struct perf_event *event)
 	return x86_pmu_extra_regs(val, event);
 }
 
-int x86_reserve_hardware(void)
+int x86_reserve_hardware(struct perf_event *event)
 {
 	int err = 0;
 
@@ -374,8 +374,10 @@ int x86_reserve_hardware(void)
 		if (atomic_read(&pmc_refcount) == 0) {
 			if (!reserve_pmc_hardware())
 				err = -EBUSY;
-			else
+			else {
 				reserve_ds_buffers();
+				reserve_lbr_buffers(event);
+			}
 		}
 		if (!err)
 			atomic_inc(&pmc_refcount);
@@ -626,7 +628,7 @@ static int __x86_pmu_event_init(struct perf_event *event)
 	if (!x86_pmu_initial
[PATCH v4 02/11] perf/x86/lbr: Simplify the exposure check for the LBR_INFO registers
If the platform supports LBR_INFO register, the x86_pmu.lbr_info will
be assigned in intel_pmu_?_lbr_init_?() and it's safe to expose LBR_INFO
in the x86_perf_get_lbr() directly, instead of relying on lbr_format check.

Also Architectural LBR has IA32_LBR_x_INFO instead of LBR_FORMAT_INFO_x
to hold metadata for the operation, including mispredict, TSX, and
elapsed cycle time information.

Cc: Kan Liang
Cc: Peter Zijlstra
Cc: Borislav Petkov
Cc: Ingo Molnar
Signed-off-by: Like Xu
Reviewed-by: Kan Liang
Reviewed-by: Andi Kleen
---
 arch/x86/events/intel/lbr.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/arch/x86/events/intel/lbr.c b/arch/x86/events/intel/lbr.c
index 21890dacfcfe..355ea70f1879 100644
--- a/arch/x86/events/intel/lbr.c
+++ b/arch/x86/events/intel/lbr.c
@@ -1832,12 +1832,10 @@ void __init intel_pmu_arch_lbr_init(void)
  */
 int x86_perf_get_lbr(struct x86_pmu_lbr *lbr)
 {
-	int lbr_fmt = x86_pmu.intel_cap.lbr_format;
-
 	lbr->nr = x86_pmu.lbr_nr;
 	lbr->from = x86_pmu.lbr_from;
 	lbr->to = x86_pmu.lbr_to;
-	lbr->info = (lbr_fmt == LBR_FORMAT_INFO) ? x86_pmu.lbr_info : 0;
+	lbr->info = x86_pmu.lbr_info;
 
 	return 0;
 }
--
2.29.2
[PATCH v4 01/11] perf/x86/intel: Fix the comment about guest LBR support on KVM
Starting from v5.12, KVM reports guest LBR and extra_regs support
when the host has relevant support. Just delete this part of the
comment and fix a typo.

Cc: Kan Liang
Cc: Peter Zijlstra
Cc: Borislav Petkov
Cc: Ingo Molnar
Signed-off-by: Like Xu
Reviewed-by: Kan Liang
Reviewed-by: Andi Kleen
---
 arch/x86/events/intel/core.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index d4569bfa83e3..7bb96ac87615 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -5565,8 +5565,7 @@ __init int intel_pmu_init(void)
 
 	/*
 	 * Access LBR MSR may cause #GP under certain circumstances.
-	 * E.g. KVM doesn't support LBR MSR
-	 * Check all LBT MSR here.
+	 * Check all LBR MSR here.
 	 * Disable LBR access if any LBR MSRs can not be accessed.
 	 */
 	if (x86_pmu.lbr_nr && !check_msr(x86_pmu.lbr_tos, 0x3UL))
--
2.29.2
[PATCH v4 00/11] KVM: x86/pmu: Guest Architectural LBR Enabling
Hi geniuses,

Please help review the new version of the Arch LBR enabling patch set.

The Architectural Last Branch Records (LBRs) feature is published in the
319433-040 release of the Intel Architecture Instruction Set Extensions
and Future Features Programming Reference[0].

The main advantages for the Arch LBR users are [1]:
- Faster context switching due to XSAVES support and faster reset of
  LBR MSRs via the new DEPTH MSR
- Faster LBR read for a non-PEBS event due to XSAVES support, which
  lowers the overhead of the NMI handler.
- Linux kernel can support the LBR features without knowing the model
  number of the current CPU.

It's based on the kvm/queue tree plus two commits from kvm/intel tree:
- 'fea4ab260645 ("KVM: x86: Refresh CPUID on writes to MSR_IA32_XSS")'
- '0ccd14126cb2 ("KVM: x86: Report XSS as an MSR to be saved if there are supported features")'

Please check more details in each commit and feel free to comment.

[0] https://software.intel.com/content/www/us/en/develop/download/intel-architecture-instruction-set-extensions-and-future-features-programming-reference.html
[1] https://lore.kernel.org/lkml/1593780569-62993-1-git-send-email-kan.li...@linux.intel.com/

---
v3->v4 Changelog:
- Add one more host patch to reuse ARCH_LBR_CTL_MASK;
- Add reserve_lbr_buffers() instead of using GFP_ATOMIC;
- Fix a bug in the arch_lbr_depth_is_valid();
- Add LBR_CTL_EN to unify DEBUGCTLMSR_LBR and ARCH_LBR_CTL_LBREN;
- Add vmx->host_lbrctlmsr to save/restore host values;
- Add KVM_SUPPORTED_XSS to refactor supported_xss;
- Clear Arch_LBR and its XSS bit if it's not supported;
- Add negative testing to the related kvm-unit-tests;
- Refine code and commit messages;

Previous:
https://lore.kernel.org/kvm/20210303135756.1546253-1-like...@linux.intel.com/

Like Xu (11):
  perf/x86/intel: Fix the comment about guest LBR support on KVM
  perf/x86/lbr: Simplify the exposure check for the LBR_INFO registers
  perf/x86/lbr: Skip checking for the existence of LBR_TOS for Arch LBR
  perf/x86/lbr: Move cpuc->lbr_xsave allocation out of sleeping region
  perf/x86: Move ARCH_LBR_CTL_MASK definition to include/asm/msr-index.h
  KVM: vmx/pmu: Add MSR_ARCH_LBR_DEPTH emulation for Arch LBR
  KVM: vmx/pmu: Add MSR_ARCH_LBR_CTL emulation for Arch LBR
  KVM: vmx/pmu: Add Arch LBR emulation and its VMCS field
  KVM: x86: Expose Architectural LBR CPUID leaf
  KVM: x86: Refine the matching and clearing logic for supported_xss
  KVM: x86: Add XSAVE Support for Architectural LBRs

 arch/x86/events/core.c           | 8 ++-
 arch/x86/events/intel/bts.c      | 2 +-
 arch/x86/events/intel/core.c     | 6 +-
 arch/x86/events/intel/lbr.c      | 28 +
 arch/x86/events/perf_event.h     | 8 ++-
 arch/x86/include/asm/msr-index.h | 1 +
 arch/x86/include/asm/vmx.h       | 4 ++
 arch/x86/kvm/cpuid.c             | 25 +++-
 arch/x86/kvm/vmx/capabilities.h  | 25 +---
 arch/x86/kvm/vmx/pmu_intel.c     | 103 ---
 arch/x86/kvm/vmx/vmx.c           | 50 +--
 arch/x86/kvm/vmx/vmx.h           | 4 ++
 arch/x86/kvm/x86.c               | 6 +-
 13 files changed, 227 insertions(+), 43 deletions(-)

--
2.29.2
Re: [PATCH] x86/perf: Fix guest_get_msrs static call if there is no PMU
On 2021/3/8 15:12, Dmitry Vyukov wrote:
On Mon, Mar 8, 2021 at 3:26 AM Xu, Like wrote:
On 2021/3/6 6:33, Sean Christopherson wrote:

Handle a NULL x86_pmu.guest_get_msrs at invocation instead of patching
in perf_guest_get_msrs_nop() during setup. If there is no PMU, setup

"If there is no PMU" ... How to set up this kind of environment, and
what changes are needed in .config or boot parameters ?

Hi Xu, This can be reproduced in qemu with "-cpu max,-pmu" flag using
this reproducer:
https://groups.google.com/g/syzkaller-bugs/c/D8eHw3LIOd0/m/L2G0lVkVBAAJ

Sorry, I couldn't reproduce any VMX abort with "-cpu max,-pmu". Does
this patch fix this "unexpected kernel reboot" issue ? If so, you may
add "Tested-by" for more attention.

bails before updating the static calls, leaving x86_pmu.guest_get_msrs
NULL and thus a complete nop. Ultimately, this causes VMX abort on
VM-Exit due to KVM putting random garbage from the stack into the MSR
load list.

Fixes: abd562df94d1 ("x86/perf: Use static_call for x86_pmu.guest_get_msrs")
Cc: Like Xu
Cc: Paolo Bonzini
Cc: Jim Mattson
Cc: k...@vger.kernel.org
Reported-by: Dmitry Vyukov
Signed-off-by: Sean Christopherson
---
 arch/x86/events/core.c | 16 +---
 1 file changed, 5 insertions(+), 11 deletions(-)

diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index 6ddeed3cd2ac..ff874461f14c 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -671,7 +671,11 @@ void x86_pmu_disable_all(void)
 
 struct perf_guest_switch_msr *perf_guest_get_msrs(int *nr)
 {
-	return static_call(x86_pmu_guest_get_msrs)(nr);
+	if (x86_pmu.guest_get_msrs)
+		return static_call(x86_pmu_guest_get_msrs)(nr);

How about using "static_call_cond" per commit "452cddbff7" ?

+
+	*nr = 0;
+	return NULL;
 }
 EXPORT_SYMBOL_GPL(perf_guest_get_msrs);
 
@@ -1944,13 +1948,6 @@ static void _x86_pmu_read(struct perf_event *event)
 	x86_perf_event_update(event);
 }
 
-static inline struct perf_guest_switch_msr *
-perf_guest_get_msrs_nop(int *nr)
-{
-	*nr = 0;
-	return NULL;
-}
-
 static int __init init_hw_perf_events(void)
 {
 	struct x86_pmu_quirk *quirk;
@@ -2024,9 +2021,6 @@ static int __init init_hw_perf_events(void)
 	if (!x86_pmu.read)
 		x86_pmu.read = _x86_pmu_read;
 
-	if (!x86_pmu.guest_get_msrs)
-		x86_pmu.guest_get_msrs = perf_guest_get_msrs_nop;
-
 	x86_pmu_static_call_update();
 
 	/*
Re: [PATCH v3 9/9] KVM: x86: Add XSAVE Support for Architectural LBRs
On 2021/3/4 2:03, Sean Christopherson wrote: On Wed, Mar 03, 2021, Like Xu wrote: diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c index 034708a3df20..ec4593e0ee6d 100644 --- a/arch/x86/kvm/vmx/vmx.c +++ b/arch/x86/kvm/vmx/vmx.c @@ -7268,6 +7268,8 @@ static __init void vmx_set_cpu_caps(void) supported_xss = 0; if (!cpu_has_vmx_xsaves()) kvm_cpu_cap_clear(X86_FEATURE_XSAVES); + else if (kvm_cpu_cap_has(X86_FEATURE_ARCH_LBR)) + supported_xss |= XFEATURE_MASK_LBR; /* CPUID 0x8001 */ if (!cpu_has_vmx_rdtscp()) diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index d773836ceb7a..bca2e318ff24 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -10433,6 +10433,8 @@ int kvm_arch_hardware_setup(void *opaque) if (!kvm_cpu_cap_has(X86_FEATURE_XSAVES)) supported_xss = 0; + else + supported_xss &= host_xss; Not your fault by any means, but I would prefer to have matching logic for XSS and XCR0. The existing clearing of supported_xss here is pointless. E.g. I'd prefer something like the following, though Paolo may have a different opinion. I have no preference for where to do rdmsrl() in kvm_arch_init() or kvm_arch_hardware_setup(). It's true the assignment of supported_xss in the kvm/intel tree is redundant and introducing KVM_SUPPORTED_XSS is also fine to me. diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c index 6d7e760fdfa0..c781034463e5 100644 --- a/arch/x86/kvm/vmx/vmx.c +++ b/arch/x86/kvm/vmx/vmx.c @@ -7244,12 +7244,15 @@ static __init void vmx_set_cpu_caps(void) kvm_cpu_cap_clear(X86_FEATURE_INVPCID); if (vmx_pt_mode_is_host_guest()) kvm_cpu_cap_check_and_set(X86_FEATURE_INTEL_PT); + if (!cpu_has_vmx_arch_lbr()) { + kvm_cpu_cap_clear(X86_FEATURE_ARCH_LBR); + supported_xss &= ~XFEATURE_MASK_LBR; + } I will move the above part to the LBR patch and leave the left part as a pre-patch for Paolo's review. 
if (vmx_umip_emulated()) kvm_cpu_cap_set(X86_FEATURE_UMIP); /* CPUID 0xD.1 */ - supported_xss = 0; if (!cpu_has_vmx_xsaves()) kvm_cpu_cap_clear(X86_FEATURE_XSAVES); if (!cpu_has_vmx_xsaves()) supported_xss = 0; kvm_cpu_cap_clear(X86_FEATURE_XSAVES); diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 7b0adebec1ef..5f9eb1f5b840 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -205,6 +205,8 @@ static struct kvm_user_return_msrs __percpu *user_return_msrs; | XFEATURE_MASK_BNDCSR | XFEATURE_MASK_AVX512 \ | XFEATURE_MASK_PKRU) +#define KVM_SUPPORTED_XSS XFEATURE_MASK_LBR + u64 __read_mostly host_efer; EXPORT_SYMBOL_GPL(host_efer); @@ -8037,6 +8039,11 @@ int kvm_arch_init(void *opaque) supported_xcr0 = host_xcr0 & KVM_SUPPORTED_XCR0; } + if (boot_cpu_has(X86_FEATURE_XSAVES)) { + rdmsrl(MSR_IA32_XSS, host_xss); + supported_xss = host_xss & KVM_SUPPORTED_XSS; + } + if (pi_inject_timer == -1) pi_inject_timer = housekeeping_enabled(HK_FLAG_TIMER); #ifdef CONFIG_X86_64 @@ -10412,9 +10419,6 @@ int kvm_arch_hardware_setup(void *opaque) rdmsrl_safe(MSR_EFER, &host_efer); - if (boot_cpu_has(X86_FEATURE_XSAVES)) - rdmsrl(MSR_IA32_XSS, host_xss); - r = ops->hardware_setup(); if (r != 0) return r; @@ -10422,9 +10426,6 @@ int kvm_arch_hardware_setup(void *opaque) memcpy(&kvm_x86_ops, ops->runtime_ops, sizeof(kvm_x86_ops)); kvm_ops_static_call_update(); - if (!kvm_cpu_cap_has(X86_FEATURE_XSAVES)) - supported_xss = 0; - #define __kvm_cpu_cap_has(UNUSED_, f) kvm_cpu_cap_has(f) cr4_reserved_bits = __cr4_reserved_bits(__kvm_cpu_cap_has, UNUSED_); #undef __kvm_cpu_cap_has
[PATCH v3 8/9] KVM: x86: Expose Architectural LBR CPUID leaf
If CPUID.(EAX=07H, ECX=0):EDX[19] is set to 1, then KVM supports Arch
LBRs and CPUID leaf 01CH indicates details of the Arch LBRs capabilities.
Currently, KVM only supports the current host LBR depth for guests,
which is also the maximum supported depth on the host.

Signed-off-by: Like Xu
---
 arch/x86/kvm/cpuid.c   | 25 -
 arch/x86/kvm/vmx/vmx.c | 2 ++
 2 files changed, 26 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index b4247f821277..4473324fe7be 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -450,7 +450,7 @@ void kvm_set_cpu_caps(void)
 		F(AVX512_4VNNIW) | F(AVX512_4FMAPS) | F(SPEC_CTRL) |
 		F(SPEC_CTRL_SSBD) | F(ARCH_CAPABILITIES) | F(INTEL_STIBP) |
 		F(MD_CLEAR) | F(AVX512_VP2INTERSECT) | F(FSRM) |
-		F(SERIALIZE) | F(TSXLDTRK) | F(AVX512_FP16)
+		F(SERIALIZE) | F(TSXLDTRK) | F(AVX512_FP16) | F(ARCH_LBR)
 	);
 
 	/* TSC_ADJUST and ARCH_CAPABILITIES are emulated in software. */
@@ -805,6 +805,29 @@ static inline int __do_cpuid_func(struct kvm_cpuid_array *array, u32 function)
 			goto out;
 		}
 		break;
+	/* Architectural LBR */
+	case 0x1c:
+	{
+		u64 lbr_depth_mask = entry->eax & 0xff;
+
+		if (!lbr_depth_mask || !kvm_cpu_cap_has(X86_FEATURE_ARCH_LBR)) {
+			entry->eax = entry->ebx = entry->ecx = entry->edx = 0;
+			break;
+		}
+
+		/*
+		 * KVM only exposes the maximum supported depth,
+		 * which is also the fixed value used on the host.
+		 *
+		 * KVM doesn't allow VMM user space to adjust depth
+		 * per guest, because the guest LBR emulation depends
+		 * on the implementation of the host LBR driver.
+		 */
+		lbr_depth_mask = 1UL << (fls(lbr_depth_mask) - 1);
+		entry->eax &= ~0xff;
+		entry->eax |= lbr_depth_mask;
+		break;
+	}
 	case KVM_CPUID_SIGNATURE: {
 		static const char signature[12] = "KVMKVMKVM\0\0";
 		const u32 *sigptr = (const u32 *)signature;
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 2f307689a14b..034708a3df20 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -7258,6 +7258,8 @@ static __init void vmx_set_cpu_caps(void)
 		kvm_cpu_cap_clear(X86_FEATURE_INVPCID);
 	if (vmx_pt_mode_is_host_guest())
 		kvm_cpu_cap_check_and_set(X86_FEATURE_INTEL_PT);
+	if (cpu_has_vmx_arch_lbr())
+		kvm_cpu_cap_check_and_set(X86_FEATURE_ARCH_LBR);
 
 	if (vmx_umip_emulated())
 		kvm_cpu_cap_set(X86_FEATURE_UMIP);
--
2.29.2
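The depth-mask trimming in the 0x1c case above keeps only the highest set bit of EAX[7:0]. A small user-space sketch of that step follows; it uses a plain loop instead of the kernel's fls(), and keep_max_lbr_depth(), lbr_depth_from_bit() and lbr_cpuid_demo() are hypothetical names. The mapping from bit n to a depth of 8 * (n + 1) entries is assumed from the depth validity check elsewhere in this series.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: keep only the deepest supported depth in EAX[7:0]. */
static uint32_t keep_max_lbr_depth(uint32_t eax)
{
	uint32_t mask = eax & 0xff;
	uint32_t bit = 0x80;

	/* No depth enumerated; the patch zeroes the whole leaf here. */
	if (!mask)
		return 0;

	/* Walk down from bit 7 until a supported depth is found. */
	while (!(mask & bit))
		bit >>= 1;

	return (eax & ~0xffu) | bit;
}

/* Hypothetical helper: convert a single-bit mask into an entry count. */
static unsigned int lbr_depth_from_bit(uint32_t bit)
{
	unsigned int depth = 0;

	while (bit) {
		depth += 8;
		bit >>= 1;
	}
	return depth;
}

void lbr_cpuid_demo(void)
{
	/* Host supports depths 8, 16 and 24; only 24 is exposed. */
	assert(keep_max_lbr_depth(0x07) == 0x04);
	assert(lbr_depth_from_bit(0x04) == 24);
}
```

Exposing a single depth matches the commit message: the guest sees only the maximum depth, which is also the fixed value the host driver uses.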
[kvm-unit-tests PATCH] x86: Update guest LBR tests for Architectural LBR
This unit test is intended to test KVM's support for the Architectural
LBRs, an architectural performance monitoring unit (PMU) feature on
Intel processors.

If the LBR bit is set to 1 in the MSR_ARCH_LBR_CTL, the processor will
record a running trace of the most recent branches taken by the guest
in the LBR entries for the guest to read.

Signed-off-by: Like Xu
---
 x86/pmu_lbr.c | 62 ++-
 1 file changed, 52 insertions(+), 10 deletions(-)

diff --git a/x86/pmu_lbr.c b/x86/pmu_lbr.c
index 3bd9e9f..588aec8 100644
--- a/x86/pmu_lbr.c
+++ b/x86/pmu_lbr.c
@@ -6,6 +6,7 @@
 #define MAX_NUM_LBR_ENTRY 32
 #define DEBUGCTLMSR_LBR (1UL << 0)
 #define PMU_CAP_LBR_FMT 0x3f
+#define KVM_ARCH_LBR_CTL_MASK 0x7f000f
 
 #define MSR_LBR_NHM_FROM 0x0680
 #define MSR_LBR_NHM_TO 0x06c0
@@ -13,6 +14,10 @@
 #define MSR_LBR_CORE_TO 0x0060
 #define MSR_LBR_TOS 0x01c9
 #define MSR_LBR_SELECT 0x01c8
+#define MSR_ARCH_LBR_CTL 0x14ce
+#define MSR_ARCH_LBR_DEPTH 0x14cf
+#define MSR_ARCH_LBR_FROM_0 0x1500
+#define MSR_ARCH_LBR_TO_0 0x1600
 
 volatile int count;
 
@@ -66,6 +71,9 @@ int main(int ac, char **av)
 	struct cpuid id = cpuid(10);
 	u64 perf_cap;
 	int max, i;
+	bool arch_lbr = false;
+	u32 ctl_msr = MSR_IA32_DEBUGCTLMSR;
+	u64 ctl_value = DEBUGCTLMSR_LBR;
 
 	setup_vm();
 	perf_cap = rdmsr(MSR_IA32_PERF_CAPABILITIES);
@@ -80,8 +88,23 @@ int main(int ac, char **av)
 		return report_summary();
 	}
 
+	/*
+	 * On processors that support Architectural LBRs,
+	 * IA32_PERF_CAPABILITIES.LBR_FMT will have the value 03FH.
+	 */
+	if (0x3f == (perf_cap & PMU_CAP_LBR_FMT)) {
+		arch_lbr = true;
+		ctl_msr = MSR_ARCH_LBR_CTL;
+		/* DEPTH defaults to the maximum number of LBRs entries. */
+		max = rdmsr(MSR_ARCH_LBR_DEPTH) - 1;
+		ctl_value = KVM_ARCH_LBR_CTL_MASK;
+	}
+
 	printf("PMU version: %d\n", eax.split.version_id);
-	printf("LBR version: %ld\n", perf_cap & PMU_CAP_LBR_FMT);
+	if (!arch_lbr)
+		printf("LBR version: %ld\n", perf_cap & PMU_CAP_LBR_FMT);
+	else
+		printf("Architectural LBR depth: %d\n", max + 1);
 
 	/* Look for LBR from and to MSRs */
 	lbr_from = MSR_LBR_CORE_FROM;
@@ -90,27 +113,46 @@ int main(int ac, char **av)
 		lbr_from = MSR_LBR_NHM_FROM;
 		lbr_to = MSR_LBR_NHM_TO;
 	}
+	if (test_init_lbr_from_exception(0)) {
+		lbr_from = MSR_ARCH_LBR_FROM_0;
+		lbr_to = MSR_ARCH_LBR_TO_0;
+	}
 
 	if (test_init_lbr_from_exception(0)) {
 		printf("LBR on this platform is not supported!\n");
 		return report_summary();
 	}
 
-	wrmsr(MSR_LBR_SELECT, 0);
-	wrmsr(MSR_LBR_TOS, 0);
-	for (max = 0; max < MAX_NUM_LBR_ENTRY; max++) {
-		if (test_init_lbr_from_exception(max))
-			break;
+	/* Reset the guest LBR entries. */
+	if (arch_lbr) {
+		/* On a software write to IA32_LBR_DEPTH, all LBR entries are reset to 0.*/
+		wrmsr(MSR_ARCH_LBR_DEPTH, max + 1);
+	} else {
+		wrmsr(MSR_LBR_SELECT, 0);
+		wrmsr(MSR_LBR_TOS, 0);
+		for (max = 0; max < MAX_NUM_LBR_ENTRY; max++) {
+			if (test_init_lbr_from_exception(max))
+				break;
+		}
 	}
 
 	report(max > 0, "The number of guest LBR entries is good.");
 
+	/* Check the guest LBR entries are initialized. */
+	for (i = 0; i < max; ++i) {
+		if (rdmsr(lbr_to + i) || rdmsr(lbr_from + i))
+			break;
+	}
+	report(i == max, "The guest LBR initialized FROM_IP/TO_IP values are good.");
+
 	/* Do some branch instructions. */
-	wrmsr(MSR_IA32_DEBUGCTLMSR, DEBUGCTLMSR_LBR);
+	wrmsr(ctl_msr, ctl_value);
 	lbr_test();
-	wrmsr(MSR_IA32_DEBUGCTLMSR, 0);
+	wrmsr(ctl_msr, 0);
 
-	report(rdmsr(MSR_LBR_TOS) != 0, "The guest LBR MSR_LBR_TOS value is good.");
+	/* Check if the guest LBR has recorded some branches. */
+	if (!arch_lbr) {
+		report(rdmsr(MSR_LBR_TOS) != 0, "The guest LBR MSR_LBR_TOS value is good.");
+	}
 	for (i = 0; i < max; ++i) {
 		if (!rdmsr(lbr_to + i) || !rdmsr(lbr_from + i))
 			break;
--
2.29.2
[PATCH v3 9/9] KVM: x86: Add XSAVE Support for Architectural LBRs
On processors whose XSAVE feature set supports XSAVES and XRSTORS, the availability of support for Architectural LBR configuration state save and restore can be determined from CPUID.(EAX=0DH, ECX=1):EDX:ECX[bit 15]. The detailed leaf for Arch LBRs is enumerated in CPUID.(EAX=0DH, ECX=0FH). XSAVES provides a faster means than RDMSR for guest to read all LBRs. When guest IA32_XSS[bit 15] is set, the Arch LBRs state can be saved using XSAVES and restored by XRSTORS with the appropriate RFBM. If the KVM fails to pass-through the LBR msrs to the guest, the LBR msrs will be reset to prevent the leakage of host records via XSAVES. In this case, the guest results may be inaccurate as the legacy LBR. Signed-off-by: Like Xu --- arch/x86/kvm/vmx/pmu_intel.c | 2 ++ arch/x86/kvm/vmx/vmx.c | 2 ++ arch/x86/kvm/x86.c | 2 ++ 3 files changed, 6 insertions(+) diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c index 48a817be60ab..08114f70c496 100644 --- a/arch/x86/kvm/vmx/pmu_intel.c +++ b/arch/x86/kvm/vmx/pmu_intel.c @@ -768,6 +768,8 @@ void vmx_passthrough_lbr_msrs(struct kvm_vcpu *vcpu) return; warn: + if (kvm_cpu_cap_has(X86_FEATURE_ARCH_LBR)) + wrmsrl(MSR_ARCH_LBR_DEPTH, lbr_desc->records.nr); pr_warn_ratelimited("kvm: vcpu-%d: fail to passthrough LBR.\n", vcpu->vcpu_id); } diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c index 034708a3df20..ec4593e0ee6d 100644 --- a/arch/x86/kvm/vmx/vmx.c +++ b/arch/x86/kvm/vmx/vmx.c @@ -7268,6 +7268,8 @@ static __init void vmx_set_cpu_caps(void) supported_xss = 0; if (!cpu_has_vmx_xsaves()) kvm_cpu_cap_clear(X86_FEATURE_XSAVES); + else if (kvm_cpu_cap_has(X86_FEATURE_ARCH_LBR)) + supported_xss |= XFEATURE_MASK_LBR; /* CPUID 0x8001 */ if (!cpu_has_vmx_rdtscp()) diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index d773836ceb7a..bca2e318ff24 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -10433,6 +10433,8 @@ int kvm_arch_hardware_setup(void *opaque) if (!kvm_cpu_cap_has(X86_FEATURE_XSAVES)) 
supported_xss = 0; + else + supported_xss &= host_xss; #define __kvm_cpu_cap_has(UNUSED_, f) kvm_cpu_cap_has(f) cr4_reserved_bits = __cr4_reserved_bits(__kvm_cpu_cap_has, UNUSED_); -- 2.29.2
[PATCH v3 4/9] perf/x86/lbr: Use GFP_ATOMIC for cpuc->lbr_xsave memory allocation
When allocating the cpuc->lbr_xsave memory in the guest Arch LBR driver, we may get a stacktrace due to relatively slow execution like below: [ 54.283563] BUG: sleeping function called from invalid context at include/linux/sched/mm.h:196 [ 54.285218] in_atomic(): 1, irqs_disabled(): 1, non_block: 0, pid: 830, name: perf [ 54.286684] INFO: lockdep is turned off. [ 54.287448] irq event stamp: 8644 [ 54.288098] hardirqs last enabled at (8643): [] __local_bh_enable_ip+0x82/0xd0 [ 54.289806] hardirqs last disabled at (8644): [] perf_event_exec+0x1c7/0x3c0 [ 54.291418] softirqs last enabled at (8642): [] fpu__clear+0x92/0x190 [ 54.292921] softirqs last disabled at (8638): [] fpu__clear+0x5/0x190 [ 54.294418] CPU: 3 PID: 830 Comm: perf Not tainted 5.11.0-guest+ #1145 [ 54.295635] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 0.0.0 02/06/2015 [ 54.297136] Call Trace: [ 54.297603] dump_stack+0x8b/0xb0 [ 54.298246] ___might_sleep.cold+0xb6/0xc6 [ 54.299022] ? intel_pmu_lbr_add+0x147/0x160 [ 54.299823] kmem_cache_alloc+0x26d/0x2f0 [ 54.300587] intel_pmu_lbr_add+0x147/0x160 [ 54.301358] x86_pmu_add+0x85/0xe0 [ 54.302009] ? check_irq_usage+0x147/0x460 [ 54.302793] ? __bfs+0x210/0x210 [ 54.303420] ? stack_trace_save+0x3b/0x50 [ 54.304190] ? check_noncircular+0x66/0xf0 [ 54.304978] ? save_trace+0x3f/0x2f0 [ 54.305670] event_sched_in+0xf5/0x2a0 [ 54.306401] merge_sched_in+0x1a0/0x3b0 [ 54.307141] visit_groups_merge.constprop.0.isra.0+0x16e/0x490 [ 54.308255] ctx_sched_in+0xcc/0x200 [ 54.308948] ctx_resched+0x84/0xe0 [ 54.309606] perf_event_exec+0x2c0/0x3c0 [ 54.310370] begin_new_exec+0x627/0xbc0 [ 54.311096] load_elf_binary+0x734/0x17a0 [ 54.311853] ? lock_acquire+0xbc/0x360 [ 54.312562] ? bprm_execve+0x346/0x860 [ 54.313272] ? kvm_sched_clock_read+0x14/0x30 [ 54.314095] ? sched_clock+0x5/0x10 [ 54.314760] ? 
sched_clock_cpu+0xc/0xb0 [ 54.315492] bprm_execve+0x337/0x860 [ 54.316176] do_execveat_common+0x164/0x1d0 [ 54.316971] __x64_sys_execve+0x39/0x50 [ 54.317698] do_syscall_64+0x33/0x40 [ 54.318390] entry_SYSCALL_64_after_hwframe+0x44/0xa9 Fix it by allocating this part of memory with GFP_ATOMIC mask. Cc: Peter Zijlstra Fixes: c085fb8774 ("perf/x86/intel/lbr: Support XSAVES for arch LBR read") Suggested-by: Kan Liang Signed-off-by: Like Xu --- arch/x86/events/intel/lbr.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/arch/x86/events/intel/lbr.c b/arch/x86/events/intel/lbr.c index 355ea70f1879..495466b12480 100644 --- a/arch/x86/events/intel/lbr.c +++ b/arch/x86/events/intel/lbr.c @@ -700,7 +700,7 @@ void intel_pmu_lbr_add(struct perf_event *event) if (static_cpu_has(X86_FEATURE_ARCH_LBR) && kmem_cache && !cpuc->lbr_xsave && (cpuc->lbr_users != cpuc->lbr_pebs_users)) - cpuc->lbr_xsave = kmem_cache_alloc(kmem_cache, GFP_KERNEL); + cpuc->lbr_xsave = kmem_cache_alloc(kmem_cache, GFP_ATOMIC); } void release_lbr_buffers(void) -- 2.29.2
[PATCH v3 6/9] KVM: vmx/pmu: Add MSR_ARCH_LBR_CTL emulation for Arch LBR
Arch LBRs are enabled by setting MSR_ARCH_LBR_CTL.LBREn to 1. A new guest state field named "Guest IA32_LBR_CTL" is added to enhance guest LBR usage. When guest Arch LBR is enabled, a guest LBR event will be created like the model-specific LBR does. On processors that support Arch LBR, MSR_IA32_DEBUGCTLMSR[bit 0] has no meaning. It can be written to 0 or 1, but reads will always return 0. On the vmx_vcpu_reset(), the IA32_LBR_CTL will be cleared to 0. Signed-off-by: Like Xu --- arch/x86/include/asm/vmx.h | 2 ++ arch/x86/kvm/vmx/pmu_intel.c | 27 ++- arch/x86/kvm/vmx/vmx.c | 9 + 3 files changed, 33 insertions(+), 5 deletions(-) diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h index 358707f60d99..8ec7bc24b37a 100644 --- a/arch/x86/include/asm/vmx.h +++ b/arch/x86/include/asm/vmx.h @@ -245,6 +245,8 @@ enum vmcs_field { GUEST_BNDCFGS_HIGH = 0x2813, GUEST_IA32_RTIT_CTL = 0x2814, GUEST_IA32_RTIT_CTL_HIGH= 0x2815, + GUEST_IA32_LBR_CTL = 0x2816, + GUEST_IA32_LBR_CTL_HIGH = 0x2817, HOST_IA32_PAT = 0x2c00, HOST_IA32_PAT_HIGH = 0x2c01, HOST_IA32_EFER = 0x2c02, diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c index 25d620685ae7..d14a14eb712d 100644 --- a/arch/x86/kvm/vmx/pmu_intel.c +++ b/arch/x86/kvm/vmx/pmu_intel.c @@ -19,6 +19,7 @@ #include "pmu.h" #define MSR_PMC_FULL_WIDTH_BIT (MSR_IA32_PMC0 - MSR_IA32_PERFCTR0) +#define KVM_ARCH_LBR_CTL_MASK 0x7f000f static struct kvm_event_hw_type_mapping intel_arch_events[] = { /* Index must match CPUID 0x0A.EBX bit vector */ @@ -221,6 +222,7 @@ static bool intel_is_valid_msr(struct kvm_vcpu *vcpu, u32 msr) ret = pmu->version > 1; break; case MSR_ARCH_LBR_DEPTH: + case MSR_ARCH_LBR_CTL: ret = guest_cpuid_has(vcpu, X86_FEATURE_ARCH_LBR); break; default: @@ -390,6 +392,9 @@ static int intel_pmu_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info) case MSR_ARCH_LBR_DEPTH: msr_info->data = lbr_desc->records.nr; return 0; + case MSR_ARCH_LBR_CTL: + msr_info->data = 
vmcs_read64(GUEST_IA32_LBR_CTL); + return 0; default: if ((pmc = get_gp_pmc(pmu, msr, MSR_IA32_PERFCTR0)) || (pmc = get_gp_pmc(pmu, msr, MSR_IA32_PMC0))) { @@ -457,6 +462,15 @@ static int intel_pmu_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info) lbr_desc->records.nr = data; lbr_desc->arch_lbr_reset = true; return 0; + case MSR_ARCH_LBR_CTL: + if (!(data & ~KVM_ARCH_LBR_CTL_MASK)) { + vmcs_write64(GUEST_IA32_LBR_CTL, data); + if (intel_pmu_lbr_is_enabled(vcpu) && !lbr_desc->event && + (data & ARCH_LBR_CTL_LBREN)) + intel_pmu_create_guest_lbr_event(vcpu); + return 0; + } + break; default: if ((pmc = get_gp_pmc(pmu, msr, MSR_IA32_PERFCTR0)) || (pmc = get_gp_pmc(pmu, msr, MSR_IA32_PMC0))) { @@ -635,12 +649,15 @@ static void intel_pmu_reset(struct kvm_vcpu *vcpu) */ static void intel_pmu_legacy_freezing_lbrs_on_pmi(struct kvm_vcpu *vcpu) { - u64 data = vmcs_read64(GUEST_IA32_DEBUGCTL); + u32 lbr_ctl_field = GUEST_IA32_DEBUGCTL; - if (data & DEBUGCTLMSR_FREEZE_LBRS_ON_PMI) { - data &= ~DEBUGCTLMSR_LBR; - vmcs_write64(GUEST_IA32_DEBUGCTL, data); - } + if (!(vmcs_read64(GUEST_IA32_DEBUGCTL) & DEBUGCTLMSR_FREEZE_LBRS_ON_PMI)) + return; + + if (guest_cpuid_has(vcpu, X86_FEATURE_ARCH_LBR)) + lbr_ctl_field = GUEST_IA32_LBR_CTL; + + vmcs_write64(lbr_ctl_field, vmcs_read64(lbr_ctl_field) & ~BIT(0)); } static void intel_pmu_deliver_pmi(struct kvm_vcpu *vcpu) diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c index 6d7e760fdfa0..a0660b9934c6 100644 --- a/arch/x86/kvm/vmx/vmx.c +++ b/arch/x86/kvm/vmx/vmx.c @@ -2036,6 +2036,13 @@ static int vmx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info) VM_EXIT_SAVE_DEBUG_CONTROLS) get_vmcs12(vcpu)->guest_ia32_debugctl = data; + /* +* For Arch LBR, IA32_DEBUGCTL[bit 0] has no meaning. +* It can be written to 0 or 1, but reads will always return 0. +*/ + if (guest_cpuid_has(vcpu, X86_FEATURE_ARCH_LBR)) + data &= ~DEB
[PATCH v3 7/9] KVM: vmx/pmu: Add Arch LBR emulation and its VMCS field
New VMX controls bits for Arch LBR are added. When bit 21 in vmentry_ctrl is set, VM entry will write the value from the "Guest IA32_LBR_CTL" guest state field to IA32_LBR_CTL. When bit 26 in vmexit_ctrl is set, VM exit will clear IA32_LBR_CTL after the value has been saved to the "Guest IA32_LBR_CTL" guest state field. To enable guest Arch LBR, KVM should set both the "Load Guest IA32_LBR_CTL" entry control and the "Clear IA32_LBR_CTL" exit control bits. If these two conditions cannot be met, KVM will clear the LBR_FMT bits and will not expose the Arch LBR feature. If Arch LBR is exposed on KVM, the guest should set both the ARCH_LBR CPUID and the same LBR_FMT value as the host via MSR_IA32_PERF_CAPABILITIES to enable guest Arch LBR. KVM will bypass the host/guest x86 cpu model check and the records msrs can still be pass-through to guest as usual and work like a model-specific LBR. KVM is consistent with the host and does not support the LER entry. Signed-off-by: Like Xu --- arch/x86/include/asm/vmx.h | 2 ++ arch/x86/kvm/vmx/capabilities.h | 25 + arch/x86/kvm/vmx/pmu_intel.c| 27 ++- arch/x86/kvm/vmx/vmx.c | 9 +++-- 4 files changed, 48 insertions(+), 15 deletions(-) diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h index 8ec7bc24b37a..c8186ec46fca 100644 --- a/arch/x86/include/asm/vmx.h +++ b/arch/x86/include/asm/vmx.h @@ -95,6 +95,7 @@ #define VM_EXIT_CLEAR_BNDCFGS 0x0080 #define VM_EXIT_PT_CONCEAL_PIP 0x0100 #define VM_EXIT_CLEAR_IA32_RTIT_CTL0x0200 +#define VM_EXIT_CLEAR_IA32_LBR_CTL 0x0400 #define VM_EXIT_ALWAYSON_WITHOUT_TRUE_MSR 0x00036dff @@ -108,6 +109,7 @@ #define VM_ENTRY_LOAD_BNDCFGS 0x0001 #define VM_ENTRY_PT_CONCEAL_PIP0x0002 #define VM_ENTRY_LOAD_IA32_RTIT_CTL0x0004 +#define VM_ENTRY_LOAD_IA32_LBR_CTL 0x0020 #define VM_ENTRY_ALWAYSON_WITHOUT_TRUE_MSR 0x11ff diff --git a/arch/x86/kvm/vmx/capabilities.h b/arch/x86/kvm/vmx/capabilities.h index d1d77985e889..73fceb534c7c 100644 --- a/arch/x86/kvm/vmx/capabilities.h +++ 
b/arch/x86/kvm/vmx/capabilities.h @@ -378,20 +378,29 @@ static inline bool vmx_pt_mode_is_host_guest(void) return pt_mode == PT_MODE_HOST_GUEST; } -static inline u64 vmx_get_perf_capabilities(void) +static inline bool cpu_has_vmx_arch_lbr(void) { - u64 perf_cap = 0; - - if (boot_cpu_has(X86_FEATURE_PDCM)) - rdmsrl(MSR_IA32_PERF_CAPABILITIES, perf_cap); - - perf_cap &= PMU_CAP_LBR_FMT; + return (vmcs_config.vmexit_ctrl & VM_EXIT_CLEAR_IA32_LBR_CTL) && + (vmcs_config.vmentry_ctrl & VM_ENTRY_LOAD_IA32_LBR_CTL); +} +static inline u64 vmx_get_perf_capabilities(void) +{ /* * Since counters are virtualized, KVM would support full * width counting unconditionally, even if the host lacks it. */ - return PMU_CAP_FW_WRITES | perf_cap; + u64 perf_cap = PMU_CAP_FW_WRITES; + u64 host_perf_cap = 0; + + if (boot_cpu_has(X86_FEATURE_PDCM)) + rdmsrl(MSR_IA32_PERF_CAPABILITIES, host_perf_cap); + + perf_cap |= host_perf_cap & PMU_CAP_LBR_FMT; + if (boot_cpu_has(X86_FEATURE_ARCH_LBR) && !cpu_has_vmx_arch_lbr()) + perf_cap &= ~PMU_CAP_LBR_FMT; + + return perf_cap; } static inline u64 vmx_supported_debugctl(void) diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c index d14a14eb712d..48a817be60ab 100644 --- a/arch/x86/kvm/vmx/pmu_intel.c +++ b/arch/x86/kvm/vmx/pmu_intel.c @@ -176,12 +176,16 @@ static inline struct kvm_pmc *get_fw_gp_pmc(struct kvm_pmu *pmu, u32 msr) bool intel_pmu_lbr_is_compatible(struct kvm_vcpu *vcpu) { + if (kvm_cpu_cap_has(X86_FEATURE_ARCH_LBR)) + return guest_cpuid_has(vcpu, X86_FEATURE_ARCH_LBR); + /* * As a first step, a guest could only enable LBR feature if its * cpu model is the same as the host because the LBR registers * would be pass-through to the guest and they're model specific. 
*/ - return boot_cpu_data.x86_model == guest_cpuid_model(vcpu); + return !boot_cpu_has(X86_FEATURE_ARCH_LBR) && + boot_cpu_data.x86_model == guest_cpuid_model(vcpu); } bool intel_pmu_lbr_is_enabled(struct kvm_vcpu *vcpu) @@ -199,8 +203,11 @@ static bool intel_pmu_is_valid_lbr_msr(struct kvm_vcpu *vcpu, u32 index) if (!intel_pmu_lbr_is_enabled(vcpu)) return ret; - ret = (index == MSR_LBR_SELECT) || (index == MSR_LBR_TOS) || - (index >= records->from && index < records->from + records->nr) || + if (!guest_cpuid_has(vcpu, X86_FEATURE_ARCH_LBR)) +
[PATCH v3 5/9] KVM: vmx/pmu: Add MSR_ARCH_LBR_DEPTH emulation for Arch LBR
The number of Arch LBR entries available for recording operations is dictated by the value in MSR_ARCH_LBR_DEPTH.DEPTH. The supported LBR depth values can be found in CPUID.(EAX=01CH, ECX=0):EAX[7:0] and for each bit "n" set in this field, the MSR_ARCH_LBR_DEPTH.DEPTH value of "8*(n+1)" is supported. On a guest write to MSR_ARCH_LBR_DEPTH, all LBR entries are reset to 0. KVM emulates the reset behavior by introducing lbr_desc->arch_lbr_reset. KVM writes the guest requested value to the native ARCH_LBR_DEPTH MSR (this is safe because the two values will be the same) when the Arch LBR records MSRs are pass-through to the guest. Signed-off-by: Like Xu --- arch/x86/kvm/vmx/pmu_intel.c | 43 arch/x86/kvm/vmx/vmx.h | 3 +++ 2 files changed, 46 insertions(+) diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c index 9efc1a6b8693..25d620685ae7 100644 --- a/arch/x86/kvm/vmx/pmu_intel.c +++ b/arch/x86/kvm/vmx/pmu_intel.c @@ -220,6 +220,9 @@ static bool intel_is_valid_msr(struct kvm_vcpu *vcpu, u32 msr) case MSR_CORE_PERF_GLOBAL_OVF_CTRL: ret = pmu->version > 1; break; + case MSR_ARCH_LBR_DEPTH: + ret = guest_cpuid_has(vcpu, X86_FEATURE_ARCH_LBR); + break; default: ret = get_gp_pmc(pmu, msr, MSR_IA32_PERFCTR0) || get_gp_pmc(pmu, msr, MSR_P6_EVNTSEL0) || @@ -250,6 +253,7 @@ static inline void intel_pmu_release_guest_lbr_event(struct kvm_vcpu *vcpu) if (lbr_desc->event) { perf_event_release_kernel(lbr_desc->event); lbr_desc->event = NULL; + lbr_desc->arch_lbr_reset = false; vcpu_to_pmu(vcpu)->event_count--; } } @@ -348,10 +352,26 @@ static bool intel_pmu_handle_lbr_msrs_access(struct kvm_vcpu *vcpu, return true; } +/* + * Check if the requested depth values is supported + * based on the bits [0:7] of the guest cpuid.1c.eax. 
+ */ +static bool arch_lbr_depth_is_valid(struct kvm_vcpu *vcpu, u64 depth) +{ + struct kvm_cpuid_entry2 *best; + + best = kvm_find_cpuid_entry(vcpu, 0x1c, 0); + if (best && depth && !(depth % 8)) + return (best->eax & 0xff) & (1ULL << (depth / 8 - 1)); + + return false; +} + static int intel_pmu_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info) { struct kvm_pmu *pmu = vcpu_to_pmu(vcpu); struct kvm_pmc *pmc; + struct lbr_desc *lbr_desc = vcpu_to_lbr_desc(vcpu); u32 msr = msr_info->index; switch (msr) { @@ -367,6 +387,9 @@ static int intel_pmu_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info) case MSR_CORE_PERF_GLOBAL_OVF_CTRL: msr_info->data = pmu->global_ovf_ctrl; return 0; + case MSR_ARCH_LBR_DEPTH: + msr_info->data = lbr_desc->records.nr; + return 0; default: if ((pmc = get_gp_pmc(pmu, msr, MSR_IA32_PERFCTR0)) || (pmc = get_gp_pmc(pmu, msr, MSR_IA32_PMC0))) { @@ -393,6 +416,7 @@ static int intel_pmu_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info) { struct kvm_pmu *pmu = vcpu_to_pmu(vcpu); struct kvm_pmc *pmc; + struct lbr_desc *lbr_desc = vcpu_to_lbr_desc(vcpu); u32 msr = msr_info->index; u64 data = msr_info->data; @@ -427,6 +451,12 @@ static int intel_pmu_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info) return 0; } break; + case MSR_ARCH_LBR_DEPTH: + if (!arch_lbr_depth_is_valid(vcpu, data)) + return 1; + lbr_desc->records.nr = data; + lbr_desc->arch_lbr_reset = true; + return 0; default: if ((pmc = get_gp_pmc(pmu, msr, MSR_IA32_PERFCTR0)) || (pmc = get_gp_pmc(pmu, msr, MSR_IA32_PMC0))) { @@ -566,6 +596,7 @@ static void intel_pmu_init(struct kvm_vcpu *vcpu) lbr_desc->records.nr = 0; lbr_desc->event = NULL; lbr_desc->msr_passthrough = false; + lbr_desc->arch_lbr_reset = false; } static void intel_pmu_reset(struct kvm_vcpu *vcpu) @@ -623,6 +654,15 @@ static void intel_pmu_deliver_pmi(struct kvm_vcpu *vcpu) intel_pmu_legacy_freezing_lbrs_on_pmi(vcpu); } +static void intel_pmu_arch_lbr_reset(struct kvm_vcpu *vcpu) +{ + struct 
lbr_desc *lbr_desc = vcpu_to_lbr_desc(vcpu); + + /* On a software write to IA32_LBR_DEPTH, all LBR entries are reset to 0. */ + wrmsrl(MSR_ARCH_LBR_DEPTH, lbr_desc->records.nr); + lbr_desc->arch_lbr_reset = false; +} + static void vmx_update_intercept_for_lbr_msrs(struct kvm_vcpu *vcpu, bool set) { struct x86_pmu_lbr *lbr
[PATCH v3 3/9] perf/x86/lbr: Skip checking for the existence of LBR_TOS for Arch LBR
The Architecture LBR does not have MSR_LBR_TOS (0x01c9). KVM will generate #GP for this MSR access, thereby preventing the initialization of the guest LBR. Cc: Peter Zijlstra Fixes: 47125db27e47 ("perf/x86/intel/lbr: Support Architectural LBR") Signed-off-by: Like Xu Reviewed-by: Kan Liang --- arch/x86/events/intel/core.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c index a32acc7733a7..3cf065185fb0 100644 --- a/arch/x86/events/intel/core.c +++ b/arch/x86/events/intel/core.c @@ -5569,7 +5569,8 @@ __init int intel_pmu_init(void) * Check all LBT MSR here. * Disable LBR access if any LBR MSRs can not be accessed. */ - if (x86_pmu.lbr_nr && !check_msr(x86_pmu.lbr_tos, 0x3UL)) + if (x86_pmu.lbr_nr && !boot_cpu_has(X86_FEATURE_ARCH_LBR) && + !check_msr(x86_pmu.lbr_tos, 0x3UL)) x86_pmu.lbr_nr = 0; for (i = 0; i < x86_pmu.lbr_nr; i++) { if (!(check_msr(x86_pmu.lbr_from + i, 0xUL) && -- 2.29.2
[PATCH v3 0/9] KVM: x86/pmu: Guest Architectural LBR Enabling
Hi geniuses,

Please help review the new version of the Arch LBR enabling patch set.

The Architectural Last Branch Records (LBRs) feature was published in the 319433-040 release of the Intel Architecture Instruction Set Extensions and Future Features Programming Reference [0].

The main advantages for Arch LBR users are [1]:
- Faster context switching due to XSAVES support and faster reset of LBR MSRs via the new DEPTH MSR
- Faster LBR read for a non-PEBS event due to XSAVES support, which lowers the overhead of the NMI handler.
- Linux kernel can support the LBR features without knowing the model number of the current CPU.

It's based on the kvm/queue tree plus two commits from the kvm/intel tree:
- 'fea4ab260645 ("KVM: x86: Refresh CPUID on writes to MSR_IA32_XSS")'
- '0ccd14126cb2 ("KVM: x86: Report XSS as an MSR to be saved if there are supported features")'

Please check more details in each commit and feel free to comment.

[0] https://software.intel.com/content/www/us/en/develop/download/intel-architecture-instruction-set-extensions-and-future-features-programming-reference.html
[1] https://lore.kernel.org/lkml/1593780569-62993-1-git-send-email-kan.li...@linux.intel.com/

---
v2->v3 Changelog:
- Add host patches (0001-0004) to support guest Arch LBR;
- Fix arch_lbr_depth_is_valid() check condition; [Sean]
- Fix usage of KVM_ARCH_LBR_CTL_MASK;
- Fix intel_pmu_legacy_freezing_lbrs_on_pmi();
- Reset GUEST_IA32_LBR_CTL in the vmx_vcpu_reset();
- Refine intel_pmu_lbr_is_compatible();
- Simplify lbr_enable check and its usage;
- Add Arch LBR msrs to is_valid_passthrough_msr();
- Make XSAVE support for Arch LBR as a separate patch;

Previous:
https://lore.kernel.org/kvm/20210203135714.318356-1-like...@linux.intel.com/

Like Xu (9):
  perf/x86/intel: Fix a comment about guest LBR support
  perf/x86/lbr: Simplify the exposure check for the LBR_INFO registers
  perf/x86/lbr: Skip checking for the existence of LBR_TOS for Arch LBR
  perf/x86/lbr: Use GFP_ATOMIC for cpuc->lbr_xsave memory allocation
  KVM: vmx/pmu: Add MSR_ARCH_LBR_DEPTH emulation for Arch LBR
  KVM: vmx/pmu: Add MSR_ARCH_LBR_CTL emulation for Arch LBR
  KVM: vmx/pmu: Add Arch LBR emulation and its VMCS field
  KVM: x86: Expose Architectural LBR CPUID leaf
  KVM: x86: Add XSAVE Support for Architectural LBRs

 arch/x86/events/intel/core.c    |  5 +-
 arch/x86/events/intel/lbr.c     |  6 +-
 arch/x86/include/asm/vmx.h      |  4 ++
 arch/x86/kvm/cpuid.c            | 25 -
 arch/x86/kvm/vmx/capabilities.h | 25 ++---
 arch/x86/kvm/vmx/pmu_intel.c    | 99 +
 arch/x86/kvm/vmx/vmx.c          | 22 +++-
 arch/x86/kvm/vmx/vmx.h          |  3 +
 arch/x86/kvm/x86.c              |  2 +
 9 files changed, 164 insertions(+), 27 deletions(-)

--
2.29.2
[PATCH v3 2/9] perf/x86/lbr: Simplify the exposure check for the LBR_INFO registers
If the platform supports LBR_INFO register, the x86_pmu.lbr_info will be assigned in intel_pmu_?_lbr_init_?() and it's safe to expose LBR_INFO in the x86_perf_get_lbr() directly, instead of relying on lbr_format check. Also Architectural LBR has IA32_LBR_x_INFO instead of LBR_FORMAT_INFO_x to hold metadata for the operation, including mispredict, TSX, and elapsed cycle time information. Cc: Peter Zijlstra Reviewed-by: Kan Liang Signed-off-by: Like Xu --- arch/x86/events/intel/lbr.c | 4 +--- 1 file changed, 1 insertion(+), 3 deletions(-) diff --git a/arch/x86/events/intel/lbr.c b/arch/x86/events/intel/lbr.c index 21890dacfcfe..355ea70f1879 100644 --- a/arch/x86/events/intel/lbr.c +++ b/arch/x86/events/intel/lbr.c @@ -1832,12 +1832,10 @@ void __init intel_pmu_arch_lbr_init(void) */ int x86_perf_get_lbr(struct x86_pmu_lbr *lbr) { - int lbr_fmt = x86_pmu.intel_cap.lbr_format; - lbr->nr = x86_pmu.lbr_nr; lbr->from = x86_pmu.lbr_from; lbr->to = x86_pmu.lbr_to; - lbr->info = (lbr_fmt == LBR_FORMAT_INFO) ? x86_pmu.lbr_info : 0; + lbr->info = x86_pmu.lbr_info; return 0; } -- 2.29.2
[PATCH v3 1/9] perf/x86/intel: Fix a comment about guest LBR support
Starting from v5.12, KVM reports guest LBR and extra_regs support when the host has relevant support. Cc: Peter Zijlstra Reviewed-by: Kan Liang Signed-off-by: Like Xu --- arch/x86/events/intel/core.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c index d4569bfa83e3..a32acc7733a7 100644 --- a/arch/x86/events/intel/core.c +++ b/arch/x86/events/intel/core.c @@ -5565,7 +5565,7 @@ __init int intel_pmu_init(void) /* * Access LBR MSR may cause #GP under certain circumstances. -* E.g. KVM doesn't support LBR MSR +* E.g. KVM doesn't support LBR MSR before v5.12. * Check all LBT MSR here. * Disable LBR access if any LBR MSRs can not be accessed. */ -- 2.29.2
Re: [PATCH v2 1/4] KVM: vmx/pmu: Add MSR_ARCH_LBR_DEPTH emulation for Arch LBR
On 2021/3/2 6:34, Sean Christopherson wrote:
On Wed, Feb 03, 2021, Like Xu wrote:

@@ -348,10 +352,26 @@ static bool intel_pmu_handle_lbr_msrs_access(struct kvm_vcpu *vcpu,
 	return true;
 }

+/*
+ * Check if the requested depth values is supported
+ * based on the bits [0:7] of the guest cpuid.1c.eax.
+ */
+static bool arch_lbr_depth_is_valid(struct kvm_vcpu *vcpu, u64 depth)
+{
+	struct kvm_cpuid_entry2 *best;
+
+	best = kvm_find_cpuid_entry(vcpu, 0x1c, 0);
+	if (depth && best)
+		return (best->eax & 0xff) & (1ULL << (depth / 8 - 1));

I believe this will generate undefined behavior if depth > 64. Or if depth < 8. And I believe this check also needs to enforce that depth is a multiple of 8.

For each bit n set in this field, the IA32_LBR_DEPTH.DEPTH value 8*(n+1) is supported.

Thus it's impossible for 0-7, 9-15, etc... to be legal depths.

Thank you! How about:

	best = kvm_find_cpuid_entry(vcpu, 0x1c, 0);
	if (best && depth && !(depth % 8))
		return (best->eax & 0xff) & (1ULL << (depth / 8 - 1));

	return false;

+
+	return false;
+}
+
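To make the depth rule discussed above concrete, here is a minimal user-space sketch of the check. It is not the KVM code: the helper name `lbr_depth_is_supported` is hypothetical, the CPUID.1CH:EAX value is passed in directly instead of being looked up per vCPU, and the explicit `depth > 64` bound is an extra assumption added here so that the shift amount always stays within the 8 architectural bits.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical user-space helper (not a KVM function): decide whether
 * "depth" is an LBR depth advertised by a CPUID.(EAX=1CH,ECX=0):EAX value.
 * Bit n set in EAX[7:0] means that depth 8 * (n + 1) is supported.
 */
static bool lbr_depth_is_supported(uint32_t cpuid_1c_eax, uint64_t depth)
{
	/* Reject 0, non-multiples of 8, and anything above 8 * 8 = 64. */
	if (depth == 0 || depth % 8 != 0 || depth > 64)
		return false;

	/* depth / 8 - 1 is now in the range 0..7, so the shift is safe. */
	return ((cpuid_1c_eax & 0xff) >> (depth / 8 - 1)) & 1;
}
```

With EAX[7:0] = 0x07, for instance, depths 8, 16 and 24 pass the check, while 0, 12 and 32 are rejected.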
[PATCH 2/2] KVM: vmx/pmu: Clear DEBUGCTLMSR_LBR bit on the debug breakpoint event
When a processor that supports model-specific LBR generates a debug breakpoint event, it automatically clears the LBR flag. This action does not clear previously stored LBR stack MSRs. (Intel SDM 17.4.2) Signed-off-by: Like Xu --- arch/x86/kvm/vmx/vmx.c | 5 + 1 file changed, 5 insertions(+) diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c index e0a3a9be654b..4951b535eb7f 100644 --- a/arch/x86/kvm/vmx/vmx.c +++ b/arch/x86/kvm/vmx/vmx.c @@ -4795,6 +4795,7 @@ static int handle_exception_nmi(struct kvm_vcpu *vcpu) u32 intr_info, ex_no, error_code; unsigned long cr2, rip, dr6; u32 vect_info; + u64 lbr_ctl; vect_info = vmx->idt_vectoring_info; intr_info = vmx_get_intr_info(vcpu); @@ -4886,6 +4887,10 @@ static int handle_exception_nmi(struct kvm_vcpu *vcpu) rip = kvm_rip_read(vcpu); kvm_run->debug.arch.pc = vmcs_readl(GUEST_CS_BASE) + rip; kvm_run->debug.arch.exception = ex_no; + /* On the debug breakpoint event, the LBREn bit is cleared. */ + lbr_ctl = vmcs_read64(GUEST_IA32_DEBUGCTL); + if (lbr_ctl & DEBUGCTLMSR_LBR) + vmcs_write64(GUEST_IA32_DEBUGCTL, lbr_ctl & ~DEBUGCTLMSR_LBR); break; case AC_VECTOR: if (guest_inject_ac(vcpu)) { -- 2.29.2
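The behavior this patch emulates can be modeled in a few lines of user-space C. This is only a sketch under stated assumptions: `struct sketch_vcpu`, `SKETCH_DEBUGCTL_LBR` and `sketch_on_debug_breakpoint()` are hypothetical names invented for illustration, with `SKETCH_DEBUGCTL_LBR` standing in for the kernel's DEBUGCTLMSR_LBR (bit 0). It shows the two properties the commit message states: the LBR enable bit is cleared, and the previously recorded LBR entries are left alone.

```c
#include <stdint.h>

/* Stands in for DEBUGCTLMSR_LBR, i.e. bit 0 of IA32_DEBUGCTL. */
#define SKETCH_DEBUGCTL_LBR (1ULL << 0)

/* Hypothetical, heavily simplified guest state for this sketch only. */
struct sketch_vcpu {
	uint64_t debugctl;    /* models the guest IA32_DEBUGCTL value */
	uint64_t lbr_from[4]; /* models a few recorded LBR "from" entries */
};

/*
 * Model of the debug breakpoint side effect: clear only the LBR enable
 * bit. Other DEBUGCTL bits and the stored LBR records stay untouched.
 */
static void sketch_on_debug_breakpoint(struct sketch_vcpu *vcpu)
{
	vcpu->debugctl &= ~SKETCH_DEBUGCTL_LBR;
}
```

A guest that wants to keep recording branches after the breakpoint would have to set the enable bit again itself.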
[PATCH 1/2] KVM: vmx/pmu: Fix dummy check if lbr_desc->event is created
If lbr_desc->event is successfully created, intel_pmu_create_guest_lbr_event() returns 0; otherwise it returns -ENOENT, and the caller jumps to the dummy handling of the LBR MSRs. Fixes: 1b5ac3226a1a ("KVM: vmx/pmu: Pass-through LBR msrs when the guest LBR event is ACTIVE") Signed-off-by: Like Xu --- arch/x86/kvm/vmx/pmu_intel.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c index d1df618cb7de..d6a5fe19ff09 100644 --- a/arch/x86/kvm/vmx/pmu_intel.c +++ b/arch/x86/kvm/vmx/pmu_intel.c @@ -320,7 +320,7 @@ static bool intel_pmu_handle_lbr_msrs_access(struct kvm_vcpu *vcpu, if (!intel_pmu_is_valid_lbr_msr(vcpu, index)) return false; - if (!lbr_desc->event && !intel_pmu_create_guest_lbr_event(vcpu)) + if (!lbr_desc->event && intel_pmu_create_guest_lbr_event(vcpu)) goto dummy; /* -- 2.29.2
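The one-character fix is easier to see with the return convention spelled out. The following user-space sketch assumes only what the commit message states (0 on success, -ENOENT on failure); `struct sketch_lbr_desc`, `sketch_create_event()` and `sketch_handle_access()` are hypothetical stand-ins, and the `can_create` flag exists only to simulate success or failure.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the per-vCPU LBR descriptor. */
struct sketch_lbr_desc {
	void *event;
};

/* Stand-in for the event creation helper: 0 on success, -ENOENT on failure. */
static int sketch_create_event(struct sketch_lbr_desc *desc, bool can_create)
{
	static int placeholder_event;

	if (!can_create)
		return -ENOENT;

	desc->event = &placeholder_event;
	return 0;
}

/*
 * Returns true when the access can take the real path and false when it
 * must fall back to dummy handling. Because the helper returns 0 on
 * success, a non-zero (true) result is the failure case, so the call
 * must not be negated.
 */
static bool sketch_handle_access(struct sketch_lbr_desc *desc, bool can_create)
{
	if (!desc->event && sketch_create_event(desc, can_create))
		return false; /* dummy handling */

	return true;
}
```

With the negation from the unfixed code, a successful creation would have been treated as a failure and a failed creation as a success.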
Re: [PATCH] perf/x86/lbr: Simplify the exposure check for the LBR_INFO registers
Hi Peter, Would you help pick up this patch so that we can enable guest Arch LBR? --- thx,likexu On 2021/2/3 15:03, Like Xu wrote: If the platform supports LBR_INFO register, the x86_pmu.lbr_info will be assigned in intel_pmu_?_lbr_init_?() and it's safe to expose LBR_INFO in the x86_perf_get_lbr() directly, instead of relying on lbr_format check. Also Architectural LBR has IA32_LBR_x_INFO instead of LBR_FORMAT_INFO_x to hold metadata for the operation, including mispredict, TSX, and elapsed cycle time information. Cc: Kan Liang Cc: Peter Zijlstra (Intel) Signed-off-by: Like Xu --- arch/x86/events/intel/lbr.c | 4 +--- 1 file changed, 1 insertion(+), 3 deletions(-) diff --git a/arch/x86/events/intel/lbr.c b/arch/x86/events/intel/lbr.c index 21890dacfcfe..355ea70f1879 100644 --- a/arch/x86/events/intel/lbr.c +++ b/arch/x86/events/intel/lbr.c @@ -1832,12 +1832,10 @@ void __init intel_pmu_arch_lbr_init(void) */ int x86_perf_get_lbr(struct x86_pmu_lbr *lbr) { - int lbr_fmt = x86_pmu.intel_cap.lbr_format; - lbr->nr = x86_pmu.lbr_nr; lbr->from = x86_pmu.lbr_from; lbr->to = x86_pmu.lbr_to; - lbr->info = (lbr_fmt == LBR_FORMAT_INFO) ? x86_pmu.lbr_info : 0; + lbr->info = x86_pmu.lbr_info; return 0; }
[PATCH v2 4/4] KVM: x86: Expose Architectural LBR CPUID and its XSAVES bit
If CPUID.(EAX=07H, ECX=0):EDX[19] is set to 1, KVM supports Arch LBRs, and CPUID leaf 01CH describes the Arch LBR capabilities. As a first step, KVM exposes to the guest only the current LBR depth of the host, which is likely to be the maximum value the host supports. If KVM supports XSAVES, CPUID.(EAX=0DH, ECX=1):EDX:ECX[bit 15] is also set to 1, which indicates support for saving and restoring the Arch LBR configuration state. When available, guest software operating at CPL=0 can use XSAVES/XRSTORS to manage the Arch LBR supervisor state component for its own purposes once IA32_XSS[bit 15] is set. XSAVE support for Arch LBRs is enumerated in CPUID.(EAX=0DH, ECX=0FH). Signed-off-by: Like Xu --- arch/x86/kvm/cpuid.c | 23 +++ arch/x86/kvm/vmx/vmx.c | 2 ++ arch/x86/kvm/x86.c | 10 +- 3 files changed, 34 insertions(+), 1 deletion(-) diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c index 944f518ca91b..900149eec42d 100644 --- a/arch/x86/kvm/cpuid.c +++ b/arch/x86/kvm/cpuid.c @@ -778,6 +778,29 @@ static inline int __do_cpuid_func(struct kvm_cpuid_array *array, u32 function) entry->edx = 0; } break; + /* Architectural LBR */ + case 0x1c: + { + u64 lbr_depth_mask = 0; + + if (!kvm_cpu_cap_has(X86_FEATURE_ARCH_LBR)) { + entry->eax = entry->ebx = entry->ecx = entry->edx = 0; + break; + } + + /* +* KVM only exposes the maximum supported depth, +* which is also the fixed value used on the host. +* +* KVM doesn't allow VMM user space to adjust depth +* per guest, because the guest LBR emulation depends +* on the implementation of the host LBR driver.
+*/ + lbr_depth_mask = 1UL << fls(entry->eax & 0xff); + entry->eax &= ~0xff; + entry->eax |= lbr_depth_mask; + break; + } /* Intel PT */ case 0x14: if (!kvm_cpu_cap_has(X86_FEATURE_INTEL_PT)) { diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c index 9ddf0a14d75c..c22175d9564e 100644 --- a/arch/x86/kvm/vmx/vmx.c +++ b/arch/x86/kvm/vmx/vmx.c @@ -7498,6 +7498,8 @@ static __init void vmx_set_cpu_caps(void) kvm_cpu_cap_check_and_set(X86_FEATURE_INVPCID); if (vmx_pt_mode_is_host_guest()) kvm_cpu_cap_check_and_set(X86_FEATURE_INTEL_PT); + if (cpu_has_vmx_arch_lbr()) + kvm_cpu_cap_check_and_set(X86_FEATURE_ARCH_LBR); if (vmx_umip_emulated()) kvm_cpu_cap_set(X86_FEATURE_UMIP); diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 667d0042d0b7..107f2e72f526 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -10385,8 +10385,16 @@ int kvm_arch_hardware_setup(void *opaque) if (!kvm_cpu_cap_has(X86_FEATURE_XSAVES)) supported_xss = 0; - else + else { supported_xss &= host_xss; + /* +* The host doesn't always set the ARCH_LBR bit in host_xss since this +* Arch_LBR component is used on demand in the Arch LBR driver. +* Check e649b3f0188f "Support dynamic supervisor feature for LBR". +*/ + if (kvm_cpu_cap_has(X86_FEATURE_ARCH_LBR)) + supported_xss |= XFEATURE_MASK_LBR; + } /* Update CET features now that supported_xss is finalized. */ if (!kvm_cet_supported()) { -- 2.29.2
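For readers following the CPUID.1CH handling in this patch, the sketch below works through "keep only the maximum supported depth" in user space. It is an illustration, not the patch's code: `sketch_fls()`, `sketch_max_depth_mask()` and `sketch_mask_to_depth()` are hypothetical helpers. `sketch_fls()` follows the kernel's fls() convention, which is 1-based (fls(1) == 1, fls(0) == 0), so the highest set bit sits at index fls(x) - 1. The sketch subtracts one for that reason, and a reader may want to compare that with the shift used in the expression quoted above.

```c
#include <stdint.h>

/*
 * 1-based position of the most significant set bit, 0 when x == 0.
 * Same convention as the Linux kernel's fls().
 */
static int sketch_fls(uint32_t x)
{
	int pos = 0;

	while (x) {
		pos++;
		x >>= 1;
	}
	return pos;
}

/* Keep only the highest bit of EAX[7:0], i.e. the largest advertised depth. */
static uint32_t sketch_max_depth_mask(uint32_t eax)
{
	uint32_t supported = eax & 0xff;

	if (!supported)
		return 0;

	return 1u << (sketch_fls(supported) - 1);
}

/* Bit n set means depth 8 * (n + 1); for a one-bit mask that is 8 * fls(mask). */
static uint32_t sketch_mask_to_depth(uint32_t mask)
{
	return mask ? 8 * (uint32_t)sketch_fls(mask) : 0;
}
```

For a host that advertises EAX[7:0] = 0x07 (depths 8, 16 and 24), this yields the mask 0x04, which corresponds to a depth of 24.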
[PATCH v2 3/4] KVM: vmx/pmu: Add Arch LBR emulation and its VMCS field
When set bit 21 in vmentry_ctrl, VM entry will write the value from the "Guest IA32_LBR_CTL" guest state field to IA32_LBR_CTL. When set bit 26 in vmexit_ctrl, VM exit will clear IA32_LBR_CTL after the value has been saved to the "Guest IA32_LBR_CTL" guest state field. To enable guest Arch LBR, KVM should set both the "Load Guest IA32_LBR_CTL" entry control and the "Clear IA32_LBR_CTL" exit control. If these two conditions cannot be met, the vmx_get_perf_capabilities() will clear the LBR_FMT bits. If Arch LBR is exposed on KVM, the guest could set X86_FEATURE_ARCH_LBR to enable guest LBR, which is equivalent to the legacy LBR_FMT setting. The Arch LBR feature could bypass the host/guest x86_model check and the records msrs can still be pass-through to guest as usual and work like the legacy LBR. Signed-off-by: Like Xu --- arch/x86/include/asm/vmx.h | 2 ++ arch/x86/kvm/vmx/capabilities.h | 25 + arch/x86/kvm/vmx/pmu_intel.c| 17 ++--- arch/x86/kvm/vmx/vmx.c | 6 -- 4 files changed, 37 insertions(+), 13 deletions(-) diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h index c099c3d17612..755179c0a5da 100644 --- a/arch/x86/include/asm/vmx.h +++ b/arch/x86/include/asm/vmx.h @@ -95,6 +95,7 @@ #define VM_EXIT_CLEAR_BNDCFGS 0x0080 #define VM_EXIT_PT_CONCEAL_PIP 0x0100 #define VM_EXIT_CLEAR_IA32_RTIT_CTL0x0200 +#define VM_EXIT_CLEAR_IA32_LBR_CTL 0x0400 #define VM_EXIT_LOAD_CET_STATE 0x1000 #define VM_EXIT_ALWAYSON_WITHOUT_TRUE_MSR 0x00036dff @@ -110,6 +111,7 @@ #define VM_ENTRY_PT_CONCEAL_PIP0x0002 #define VM_ENTRY_LOAD_IA32_RTIT_CTL0x0004 #define VM_ENTRY_LOAD_CET_STATE 0x0010 +#define VM_ENTRY_LOAD_IA32_LBR_CTL 0x0020 #define VM_ENTRY_ALWAYSON_WITHOUT_TRUE_MSR 0x11ff diff --git a/arch/x86/kvm/vmx/capabilities.h b/arch/x86/kvm/vmx/capabilities.h index 473c55c824b1..d84af64314fc 100644 --- a/arch/x86/kvm/vmx/capabilities.h +++ b/arch/x86/kvm/vmx/capabilities.h @@ -383,20 +383,29 @@ static inline bool vmx_pt_mode_is_host_guest(void) return pt_mode == 
PT_MODE_HOST_GUEST; } -static inline u64 vmx_get_perf_capabilities(void) +static inline bool cpu_has_vmx_arch_lbr(void) { - u64 perf_cap = 0; - - if (boot_cpu_has(X86_FEATURE_PDCM)) - rdmsrl(MSR_IA32_PERF_CAPABILITIES, perf_cap); - - perf_cap &= PMU_CAP_LBR_FMT; + return (vmcs_config.vmexit_ctrl & VM_EXIT_CLEAR_IA32_LBR_CTL) && + (vmcs_config.vmentry_ctrl & VM_ENTRY_LOAD_IA32_LBR_CTL); +} +static inline u64 vmx_get_perf_capabilities(void) +{ /* * Since counters are virtualized, KVM would support full * width counting unconditionally, even if the host lacks it. */ - return PMU_CAP_FW_WRITES | perf_cap; + u64 perf_cap = PMU_CAP_FW_WRITES; + u64 host_perf_cap = 0; + + if (boot_cpu_has(X86_FEATURE_PDCM)) + rdmsrl(MSR_IA32_PERF_CAPABILITIES, host_perf_cap); + + perf_cap |= host_perf_cap & PMU_CAP_LBR_FMT; + if (boot_cpu_has(X86_FEATURE_ARCH_LBR) && !cpu_has_vmx_arch_lbr()) + perf_cap &= ~PMU_CAP_LBR_FMT; + + return perf_cap; } static inline u64 vmx_supported_debugctl(void) diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c index a00d89c93eb7..7f20a8e75306 100644 --- a/arch/x86/kvm/vmx/pmu_intel.c +++ b/arch/x86/kvm/vmx/pmu_intel.c @@ -176,12 +176,17 @@ static inline struct kvm_pmc *get_fw_gp_pmc(struct kvm_pmu *pmu, u32 msr) bool intel_pmu_lbr_is_compatible(struct kvm_vcpu *vcpu) { + if (kvm_cpu_cap_has(X86_FEATURE_ARCH_LBR) != + guest_cpuid_has(vcpu, X86_FEATURE_ARCH_LBR)) + return false; + /* * As a first step, a guest could only enable LBR feature if its * cpu model is the same as the host because the LBR registers * would be pass-through to the guest and they're model specific. 
*/ - return boot_cpu_data.x86_model == guest_cpuid_model(vcpu); + return !guest_cpuid_has(vcpu, X86_FEATURE_ARCH_LBR) && + boot_cpu_data.x86_model == guest_cpuid_model(vcpu); } bool intel_pmu_lbr_is_enabled(struct kvm_vcpu *vcpu) @@ -199,8 +204,11 @@ static bool intel_pmu_is_valid_lbr_msr(struct kvm_vcpu *vcpu, u32 index) if (!intel_pmu_lbr_is_enabled(vcpu)) return ret; - ret = (index == MSR_LBR_SELECT) || (index == MSR_LBR_TOS) || - (index >= records->from && index < records->from + records->nr) || + if (!guest_cpuid_has(vcpu, X86_FEATURE_ARCH_LBR)) + ret = (index == MSR_LBR_SELECT) || (index == MSR_LBR_TOS); + + if (!ret)
[PATCH v2 1/4] KVM: vmx/pmu: Add MSR_ARCH_LBR_DEPTH emulation for Arch LBR
The number of Arch LBR entries available for recording operations is dictated by the value in MSR_ARCH_LBR_DEPTH.DEPTH. The supported LBR depth values can be found in CPUID.(EAX=01CH, ECX=0):EAX[7:0] and for each bit n set in this field, the MSR_ARCH_LBR_DEPTH.DEPTH value 8*(n+1) is supported. On a software write to MSR_ARCH_LBR_DEPTH, all LBR entries are reset to 0. Emulate the reset behavior by introducing lbr_desc->arch_lbr_reset and sync it to the host MSR_ARCH_LBR_DEPTH msr when the guest LBR event is ACTIVE and the LBR records msrs are pass-through to the guest. Signed-off-by: Like Xu --- arch/x86/kvm/vmx/pmu_intel.c | 43 arch/x86/kvm/vmx/vmx.h | 3 +++ 2 files changed, 46 insertions(+) diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c index d1df618cb7de..b550c4a6ce33 100644 --- a/arch/x86/kvm/vmx/pmu_intel.c +++ b/arch/x86/kvm/vmx/pmu_intel.c @@ -220,6 +220,9 @@ static bool intel_is_valid_msr(struct kvm_vcpu *vcpu, u32 msr) case MSR_CORE_PERF_GLOBAL_OVF_CTRL: ret = pmu->version > 1; break; + case MSR_ARCH_LBR_DEPTH: + ret = guest_cpuid_has(vcpu, X86_FEATURE_ARCH_LBR); + break; default: ret = get_gp_pmc(pmu, msr, MSR_IA32_PERFCTR0) || get_gp_pmc(pmu, msr, MSR_P6_EVNTSEL0) || @@ -250,6 +253,7 @@ static inline void intel_pmu_release_guest_lbr_event(struct kvm_vcpu *vcpu) if (lbr_desc->event) { perf_event_release_kernel(lbr_desc->event); lbr_desc->event = NULL; + lbr_desc->arch_lbr_reset = false; vcpu_to_pmu(vcpu)->event_count--; } } @@ -348,10 +352,26 @@ static bool intel_pmu_handle_lbr_msrs_access(struct kvm_vcpu *vcpu, return true; } +/* + * Check if the requested depth values is supported + * based on the bits [0:7] of the guest cpuid.1c.eax. 
+ */ +static bool arch_lbr_depth_is_valid(struct kvm_vcpu *vcpu, u64 depth) +{ + struct kvm_cpuid_entry2 *best; + + best = kvm_find_cpuid_entry(vcpu, 0x1c, 0); + if (depth && best) + return (best->eax & 0xff) & (1ULL << (depth / 8 - 1)); + + return false; +} + static int intel_pmu_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info) { struct kvm_pmu *pmu = vcpu_to_pmu(vcpu); struct kvm_pmc *pmc; + struct lbr_desc *lbr_desc = vcpu_to_lbr_desc(vcpu); u32 msr = msr_info->index; switch (msr) { @@ -367,6 +387,9 @@ static int intel_pmu_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info) case MSR_CORE_PERF_GLOBAL_OVF_CTRL: msr_info->data = pmu->global_ovf_ctrl; return 0; + case MSR_ARCH_LBR_DEPTH: + msr_info->data = lbr_desc->records.nr; + return 0; default: if ((pmc = get_gp_pmc(pmu, msr, MSR_IA32_PERFCTR0)) || (pmc = get_gp_pmc(pmu, msr, MSR_IA32_PMC0))) { @@ -393,6 +416,7 @@ static int intel_pmu_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info) { struct kvm_pmu *pmu = vcpu_to_pmu(vcpu); struct kvm_pmc *pmc; + struct lbr_desc *lbr_desc = vcpu_to_lbr_desc(vcpu); u32 msr = msr_info->index; u64 data = msr_info->data; @@ -427,6 +451,13 @@ static int intel_pmu_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info) return 0; } break; + case MSR_ARCH_LBR_DEPTH: + if (!arch_lbr_depth_is_valid(vcpu, data)) + return 1; + lbr_desc->records.nr = data; + lbr_desc->arch_lbr_reset = true; + __set_bit(INTEL_PMC_IDX_FIXED_VLBR, pmu->pmc_in_use); + return 0; default: if ((pmc = get_gp_pmc(pmu, msr, MSR_IA32_PERFCTR0)) || (pmc = get_gp_pmc(pmu, msr, MSR_IA32_PMC0))) { @@ -566,6 +597,7 @@ static void intel_pmu_init(struct kvm_vcpu *vcpu) lbr_desc->records.nr = 0; lbr_desc->event = NULL; lbr_desc->msr_passthrough = false; + lbr_desc->arch_lbr_reset = false; } static void intel_pmu_reset(struct kvm_vcpu *vcpu) @@ -623,6 +655,14 @@ static void intel_pmu_deliver_pmi(struct kvm_vcpu *vcpu) intel_pmu_legacy_freezing_lbrs_on_pmi(vcpu); } +static void 
intel_pmu_arch_lbr_reset(struct kvm_vcpu *vcpu) +{ + struct lbr_desc *lbr_desc = vcpu_to_lbr_desc(vcpu); + + wrmsrl(MSR_ARCH_LBR_DEPTH, lbr_desc->records.nr); + lbr_desc->arch_lbr_reset = false; +} + static void vmx_update_intercept_for_lbr_msrs(struct kvm_vcpu *vcpu, bool set) { struct x86_pmu_lbr *lbr = vcpu_to_lbr_records(vcpu); @@ -654,6 +694,9 @@ static inline void vmx_enable_lbr_msrs_passthrough(struct kvm_vcpu *vcp
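The depth rule from the commit message (bit n of CPUID.(EAX=01CH, ECX=0):EAX[7:0] set means depth 8*(n+1) is supported) can be sketched as a standalone check. This is only an illustration: lbr_depth_supported() and its parameters are hypothetical names, not the helper in the patch, and the extra rejection of depths that are not a multiple of 8 or exceed 64 is an assumption added here.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical standalone helper, not the function from the patch.
 * 'supported' stands for CPUID.(EAX=0x1C, ECX=0):EAX[7:0]; bit n set
 * means that a depth of 8 * (n + 1) entries is supported.
 */
static bool lbr_depth_supported(uint8_t supported, uint64_t depth)
{
	/* Assumption: only non-zero multiples of 8 up to 8 * 8 = 64 are valid. */
	if (depth == 0 || depth % 8 != 0 || depth > 64)
		return false;

	/* Map the depth back to its bit index n = depth / 8 - 1. */
	return supported & (1u << (depth / 8 - 1));
}
```

With supported = 0x07, depths 8, 16 and 24 pass, while 32 (bit 3 clear) and 4 (not a multiple of 8) are rejected.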
[PATCH v2 2/4] KVM: vmx/pmu: Add MSR_ARCH_LBR_CTL emulation for Arch LBR
Arch LBRs are enabled by setting MSR_ARCH_LBR_CTL.LBREn to 1. On processors that support Arch LBR, MSR_IA32_DEBUGCTLMSR[bit 0] has no meaning. It can be written to 0 or 1, but reads will always return 0. A new guest state field named "Guest IA32_LBR_CTL" has been added to enhance guest LBR usage and the guest value of MSR_ARCH_LBR_CTL is written to this field on all VM exits. Signed-off-by: Like Xu --- arch/x86/include/asm/vmx.h | 2 ++ arch/x86/kvm/vmx/pmu_intel.c | 14 ++ arch/x86/kvm/vmx/vmx.c | 7 +++ 3 files changed, 23 insertions(+) diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h index 1b387713eddd..c099c3d17612 100644 --- a/arch/x86/include/asm/vmx.h +++ b/arch/x86/include/asm/vmx.h @@ -247,6 +247,8 @@ enum vmcs_field { GUEST_BNDCFGS_HIGH = 0x2813, GUEST_IA32_RTIT_CTL = 0x2814, GUEST_IA32_RTIT_CTL_HIGH= 0x2815, + GUEST_IA32_LBR_CTL = 0x2816, + GUEST_IA32_LBR_CTL_HIGH = 0x2817, HOST_IA32_PAT = 0x2c00, HOST_IA32_PAT_HIGH = 0x2c01, HOST_IA32_EFER = 0x2c02, diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c index b550c4a6ce33..a00d89c93eb7 100644 --- a/arch/x86/kvm/vmx/pmu_intel.c +++ b/arch/x86/kvm/vmx/pmu_intel.c @@ -19,6 +19,7 @@ #include "pmu.h" #define MSR_PMC_FULL_WIDTH_BIT (MSR_IA32_PMC0 - MSR_IA32_PERFCTR0) +#define ARCH_LBR_CTL_MASK 0x7f000e static struct kvm_event_hw_type_mapping intel_arch_events[] = { /* Index must match CPUID 0x0A.EBX bit vector */ @@ -221,6 +222,7 @@ static bool intel_is_valid_msr(struct kvm_vcpu *vcpu, u32 msr) ret = pmu->version > 1; break; case MSR_ARCH_LBR_DEPTH: + case MSR_ARCH_LBR_CTL: ret = guest_cpuid_has(vcpu, X86_FEATURE_ARCH_LBR); break; default: @@ -390,6 +392,9 @@ static int intel_pmu_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info) case MSR_ARCH_LBR_DEPTH: msr_info->data = lbr_desc->records.nr; return 0; + case MSR_ARCH_LBR_CTL: + msr_info->data = vmcs_read64(GUEST_IA32_LBR_CTL); + return 0; default: if ((pmc = get_gp_pmc(pmu, msr, MSR_IA32_PERFCTR0)) || (pmc = 
get_gp_pmc(pmu, msr, MSR_IA32_PMC0))) { @@ -458,6 +463,15 @@ static int intel_pmu_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info) lbr_desc->arch_lbr_reset = true; __set_bit(INTEL_PMC_IDX_FIXED_VLBR, pmu->pmc_in_use); return 0; + case MSR_ARCH_LBR_CTL: + if (!(data & ARCH_LBR_CTL_MASK)) { + vmcs_write64(GUEST_IA32_LBR_CTL, data); + if (intel_pmu_lbr_is_enabled(vcpu) && !lbr_desc->event && + (data & DEBUGCTLMSR_LBR)) + intel_pmu_create_guest_lbr_event(vcpu); + return 0; + } + break; default: if ((pmc = get_gp_pmc(pmu, msr, MSR_IA32_PERFCTR0)) || (pmc = get_gp_pmc(pmu, msr, MSR_IA32_PMC0))) { diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c index beb5a912014d..edecf2961924 100644 --- a/arch/x86/kvm/vmx/vmx.c +++ b/arch/x86/kvm/vmx/vmx.c @@ -2109,6 +2109,13 @@ static int vmx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info) VM_EXIT_SAVE_DEBUG_CONTROLS) get_vmcs12(vcpu)->guest_ia32_debugctl = data; + /* +* For Arch LBR, IA32_DEBUGCTL[bit 0] has no meaning. +* It can be written to 0 or 1, but reads will always return 0. +*/ + if (guest_cpuid_has(vcpu, X86_FEATURE_ARCH_LBR)) + data &= ~DEBUGCTLMSR_LBR; + vmcs_write64(GUEST_IA32_DEBUGCTL, data); if (intel_pmu_lbr_is_enabled(vcpu) && !to_vmx(vcpu)->lbr_desc.event && (data & DEBUGCTLMSR_LBR)) -- 2.29.2
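The IA32_DEBUGCTL handling described above (bit 0 may be written as 0 or 1 on an Arch LBR guest, but always reads back as 0) could be modelled roughly as follows. The names debugctl_store_value() and DEBUGCTL_LBR_BIT are made up for this sketch; it only mirrors the masking step in the vmx_set_msr() hunk and is not KVM code.

```c
#include <stdbool.h>
#include <stdint.h>

#define DEBUGCTL_LBR_BIT 0x1ULL	/* bit 0 of IA32_DEBUGCTL */

/*
 * Hypothetical model: compute the value that gets stored for a guest
 * write to IA32_DEBUGCTL. With Arch LBR, bit 0 is accepted on write
 * but dropped, so a later read returns 0 in that position.
 */
static uint64_t debugctl_store_value(uint64_t data, bool guest_has_arch_lbr)
{
	if (guest_has_arch_lbr)
		data &= ~DEBUGCTL_LBR_BIT;

	return data;
}
```

A guest without Arch LBR keeps bit 0 as written, so legacy LBR enabling through IA32_DEBUGCTL is unaffected in this model.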
[PATCH v2 0/4] KVM: x86/pmu: Guest Architectural LBR Enabling
Hi geniuses, Please help review the new version of Arch LBR enabling on KVM based on the latest kvm/queue tree. The Architectural Last Branch Records (Arch LBR) feature is published in the 319433-040 release of Intel Architecture Instruction Set Extensions and Future Features Programming Reference[0]. The main advantages for the Arch LBR users are [1]: - Faster context switching due to XSAVES support and faster reset of LBR MSRs via the new DEPTH MSR - Faster LBR read for a non-PEBS event due to XSAVES support, which lowers the overhead of the NMI handler. (For a PEBS event, the LBR information is recorded in the PEBS records. There is no impact on the PEBS event.) - Linux kernel can support the LBR features without knowing the model number of the current CPU. Please check more details in each commit and feel free to comment. [0] https://software.intel.com/content/www/us/en/develop/download/intel-architecture-instruction-set-extensions-and-future-features-programming-reference.html [1] https://lore.kernel.org/lkml/1593780569-62993-1-git-send-email-kan.li...@linux.intel.com/ --- v1->v2 Changelog: - rebased on the latest kvm/queue tree; - refined some comments for guest usage; Previous: https://lore.kernel.org/kvm/20200731074402.8879-1-like...@linux.intel.com/ Like Xu (4): KVM: vmx/pmu: Add MSR_ARCH_LBR_DEPTH emulation for Arch LBR KVM: vmx/pmu: Add MSR_ARCH_LBR_CTL emulation for Arch LBR KVM: vmx/pmu: Add Arch LBR emulation and its VMCS field KVM: x86: Expose Architectural LBR CPUID and its XSAVES bit arch/x86/include/asm/vmx.h | 4 ++ arch/x86/kvm/cpuid.c| 23 ++ arch/x86/kvm/vmx/capabilities.h | 25 +++ arch/x86/kvm/vmx/pmu_intel.c| 74 +++-- arch/x86/kvm/vmx/vmx.c | 15 ++- arch/x86/kvm/vmx/vmx.h | 3 ++ arch/x86/kvm/x86.c | 10 - 7 files changed, 140 insertions(+), 14 deletions(-) -- 2.29.2
[PATCH] perf/x86/lbr: Simplify the exposure check for the LBR_INFO registers
If the platform supports LBR_INFO register, the x86_pmu.lbr_info will be assigned in intel_pmu_?_lbr_init_?() and it's safe to expose LBR_INFO in the x86_perf_get_lbr() directly, instead of relying on lbr_format check. Also Architectural LBR has IA32_LBR_x_INFO instead of LBR_FORMAT_INFO_x to hold metadata for the operation, including mispredict, TSX, and elapsed cycle time information. Cc: Kan Liang Cc: Peter Zijlstra (Intel) Signed-off-by: Like Xu --- arch/x86/events/intel/lbr.c | 4 +--- 1 file changed, 1 insertion(+), 3 deletions(-) diff --git a/arch/x86/events/intel/lbr.c b/arch/x86/events/intel/lbr.c index 21890dacfcfe..355ea70f1879 100644 --- a/arch/x86/events/intel/lbr.c +++ b/arch/x86/events/intel/lbr.c @@ -1832,12 +1832,10 @@ void __init intel_pmu_arch_lbr_init(void) */ int x86_perf_get_lbr(struct x86_pmu_lbr *lbr) { - int lbr_fmt = x86_pmu.intel_cap.lbr_format; - lbr->nr = x86_pmu.lbr_nr; lbr->from = x86_pmu.lbr_from; lbr->to = x86_pmu.lbr_to; - lbr->info = (lbr_fmt == LBR_FORMAT_INFO) ? x86_pmu.lbr_info : 0; + lbr->info = x86_pmu.lbr_info; return 0; } -- 2.29.2
[PATCH] KVM: vmx/pmu: Add VMCS fields check before exposing LBR_FMT
Before KVM exposes guest LBR_FMT perf capabilities, it needs to check whether VMCS has GUEST_IA32_DEBUGCTL guest status field and vmx switch support on IA32_DEBUGCTL MSR (including VM_EXIT_SAVE_DEBUG_CONTROLS and VM_ENTRY_LOAD_DEBUG_CONTROLS). It helps nested LBR enablement. Signed-off-by: Like Xu --- arch/x86/kvm/vmx/capabilities.h | 9 - 1 file changed, 8 insertions(+), 1 deletion(-) diff --git a/arch/x86/kvm/vmx/capabilities.h b/arch/x86/kvm/vmx/capabilities.h index d1d77985e889..ac3af06953a8 100644 --- a/arch/x86/kvm/vmx/capabilities.h +++ b/arch/x86/kvm/vmx/capabilities.h @@ -378,6 +378,12 @@ static inline bool vmx_pt_mode_is_host_guest(void) return pt_mode == PT_MODE_HOST_GUEST; } +static inline bool cpu_has_vmx_lbr(void) +{ + return (vmcs_config.vmexit_ctrl & VM_EXIT_SAVE_DEBUG_CONTROLS) && + (vmcs_config.vmentry_ctrl & VM_ENTRY_LOAD_DEBUG_CONTROLS); +} + static inline u64 vmx_get_perf_capabilities(void) { u64 perf_cap = 0; @@ -385,7 +391,8 @@ static inline u64 vmx_get_perf_capabilities(void) if (boot_cpu_has(X86_FEATURE_PDCM)) rdmsrl(MSR_IA32_PERF_CAPABILITIES, perf_cap); - perf_cap &= PMU_CAP_LBR_FMT; + if (cpu_has_vmx_lbr()) + perf_cap &= PMU_CAP_LBR_FMT; /* * Since counters are virtualized, KVM would support full -- 2.29.2
[PATCH v14 07/11] KVM: vmx/pmu: Reduce the overhead of LBR pass-through or cancellation
When the LBR record MSRs are already passed through to the guest, there is no need to call vmx_update_intercept_for_lbr_msrs() again, and the same holds once the pass-through has already been cancelled. Signed-off-by: Like Xu Reviewed-by: Andi Kleen --- arch/x86/kvm/vmx/pmu_intel.c | 13 + arch/x86/kvm/vmx/vmx.h | 3 +++ 2 files changed, 16 insertions(+) diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c index 287fc14f0445..60f395e18446 100644 --- a/arch/x86/kvm/vmx/pmu_intel.c +++ b/arch/x86/kvm/vmx/pmu_intel.c @@ -550,6 +550,7 @@ static void intel_pmu_init(struct kvm_vcpu *vcpu) vcpu->arch.perf_capabilities = 0; lbr_desc->records.nr = 0; lbr_desc->event = NULL; + lbr_desc->msr_passthrough = false; } static void intel_pmu_reset(struct kvm_vcpu *vcpu) @@ -596,12 +597,24 @@ static void vmx_update_intercept_for_lbr_msrs(struct kvm_vcpu *vcpu, bool set) static inline void vmx_disable_lbr_msrs_passthrough(struct kvm_vcpu *vcpu) { + struct lbr_desc *lbr_desc = vcpu_to_lbr_desc(vcpu); + + if (!lbr_desc->msr_passthrough) + return; + vmx_update_intercept_for_lbr_msrs(vcpu, true); + lbr_desc->msr_passthrough = false; } static inline void vmx_enable_lbr_msrs_passthrough(struct kvm_vcpu *vcpu) { + struct lbr_desc *lbr_desc = vcpu_to_lbr_desc(vcpu); + + if (lbr_desc->msr_passthrough) + return; + vmx_update_intercept_for_lbr_msrs(vcpu, false); + lbr_desc->msr_passthrough = true; } /* diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h index 863bb3fe73d4..4d6a2624a204 100644 --- a/arch/x86/kvm/vmx/vmx.h +++ b/arch/x86/kvm/vmx/vmx.h @@ -90,6 +90,9 @@ struct lbr_desc { * The records may be inaccurate if the host reclaims the LBR. */ struct perf_event *event; + + /* True if LBRs are marked as not intercepted in the MSR bitmap */ + bool msr_passthrough; }; /* -- 2.29.2
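The idea of caching the pass-through state so the expensive bitmap update runs only on a real transition can be shown in isolation. In this sketch, struct lbr_state, update_intercept() and the updates counter are hypothetical stand-ins for lbr_desc and vmx_update_intercept_for_lbr_msrs(); the counter exists only to make the saved work visible.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the relevant part of struct lbr_desc. */
struct lbr_state {
	bool msr_passthrough;	/* cached MSR bitmap state */
	unsigned int updates;	/* number of (expensive) bitmap updates */
};

/* Stand-in for vmx_update_intercept_for_lbr_msrs(); only counts calls. */
static void update_intercept(struct lbr_state *s, bool intercept)
{
	(void)intercept;
	s->updates++;
}

static void enable_passthrough(struct lbr_state *s)
{
	if (s->msr_passthrough)
		return;		/* already passed through, nothing to do */

	update_intercept(s, false);
	s->msr_passthrough = true;
}

static void disable_passthrough(struct lbr_state *s)
{
	if (!s->msr_passthrough)
		return;		/* already intercepted, nothing to do */

	update_intercept(s, true);
	s->msr_passthrough = false;
}
```

Calling enable_passthrough() twice in a row triggers a single update; the second call returns early on the cached flag.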
[PATCH v14 11/11] selftests: kvm/x86: add test for pmu msr MSR_IA32_PERF_CAPABILITIES
This test will check the effect of various CPUID settings on the MSR_IA32_PERF_CAPABILITIES MSR, check that whatever user space writes with KVM_SET_MSR is _not_ modified from the guest and can be retrieved with KVM_GET_MSR, and check that invalid LBR formats are rejected. Signed-off-by: Like Xu --- tools/testing/selftests/kvm/.gitignore| 1 + tools/testing/selftests/kvm/Makefile | 1 + .../selftests/kvm/x86_64/vmx_pmu_msrs_test.c | 149 ++ 3 files changed, 151 insertions(+) create mode 100644 tools/testing/selftests/kvm/x86_64/vmx_pmu_msrs_test.c diff --git a/tools/testing/selftests/kvm/.gitignore b/tools/testing/selftests/kvm/.gitignore index ce8f4ad39684..28b71efe52a0 100644 --- a/tools/testing/selftests/kvm/.gitignore +++ b/tools/testing/selftests/kvm/.gitignore @@ -25,6 +25,7 @@ /x86_64/vmx_set_nested_state_test /x86_64/vmx_tsc_adjust_test /x86_64/xss_msr_test +/x86_64/vmx_pmu_msrs_test /demand_paging_test /dirty_log_test /dirty_log_perf_test diff --git a/tools/testing/selftests/kvm/Makefile b/tools/testing/selftests/kvm/Makefile index fe41c6a0fa67..cf8737828dd4 100644 --- a/tools/testing/selftests/kvm/Makefile +++ b/tools/testing/selftests/kvm/Makefile @@ -59,6 +59,7 @@ TEST_GEN_PROGS_x86_64 += x86_64/vmx_tsc_adjust_test TEST_GEN_PROGS_x86_64 += x86_64/xss_msr_test TEST_GEN_PROGS_x86_64 += x86_64/debug_regs TEST_GEN_PROGS_x86_64 += x86_64/tsc_msrs_test +TEST_GEN_PROGS_x86_64 += x86_64/vmx_pmu_msrs_test TEST_GEN_PROGS_x86_64 += demand_paging_test TEST_GEN_PROGS_x86_64 += dirty_log_test TEST_GEN_PROGS_x86_64 += dirty_log_perf_test diff --git a/tools/testing/selftests/kvm/x86_64/vmx_pmu_msrs_test.c b/tools/testing/selftests/kvm/x86_64/vmx_pmu_msrs_test.c new file mode 100644 index ..b3ad63e6ff12 --- /dev/null +++ b/tools/testing/selftests/kvm/x86_64/vmx_pmu_msrs_test.c @@ -0,0 +1,149 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * VMX-pmu related msrs test + * + * Copyright (C) 2021 Intel Corporation + * + * Test to check the effect of various CPUID settings + * on 
the MSR_IA32_PERF_CAPABILITIES MSR, and check that + * whatever we write with KVM_SET_MSR is _not_ modified + * in the guest and test it can be retrieved with KVM_GET_MSR. + * + * Test to check that invalid LBR formats are rejected. + */ + +#define _GNU_SOURCE /* for program_invocation_short_name */ +#include + +#include "kvm_util.h" +#include "vmx.h" + +#define VCPU_ID 0 + +#define X86_FEATURE_PDCM (1<<15) +#define PMU_CAP_FW_WRITES (1ULL << 13) +#define PMU_CAP_LBR_FMT0x3f + +union cpuid10_eax { + struct { + unsigned int version_id:8; + unsigned int num_counters:8; + unsigned int bit_width:8; + unsigned int mask_length:8; + } split; + unsigned int full; +}; + +union perf_capabilities { + struct { + u64 lbr_format:6; + u64 pebs_trap:1; + u64 pebs_arch_reg:1; + u64 pebs_format:4; + u64 smm_freeze:1; + u64 full_width_write:1; + u64 pebs_baseline:1; + u64 perf_metrics:1; + u64 pebs_output_pt_available:1; + u64 anythread_deprecated:1; + }; + u64 capabilities; +}; + +uint64_t rdmsr_on_cpu(uint32_t reg) +{ + uint64_t data; + int fd; + char msr_file[64]; + + sprintf(msr_file, "/dev/cpu/%d/msr", 0); + fd = open(msr_file, O_RDONLY); + if (fd < 0) + exit(KSFT_SKIP); + + if (pread(fd, &data, sizeof(data), reg) != sizeof(data)) + exit(KSFT_SKIP); + + close(fd); + return data; +} + +static void guest_code(void) +{ + wrmsr(MSR_IA32_PERF_CAPABILITIES, PMU_CAP_LBR_FMT); +} + +int main(int argc, char *argv[]) +{ + struct kvm_cpuid2 *cpuid; + struct kvm_cpuid_entry2 *entry_1_0; + struct kvm_cpuid_entry2 *entry_a_0; + bool pdcm_supported = false; + struct kvm_vm *vm; + int ret; + union cpuid10_eax eax; + union perf_capabilities host_cap; + + host_cap.capabilities = rdmsr_on_cpu(MSR_IA32_PERF_CAPABILITIES); + host_cap.capabilities &= (PMU_CAP_FW_WRITES | PMU_CAP_LBR_FMT); + + /* Create VM */ + vm = vm_create_default(VCPU_ID, 0, guest_code); + cpuid = kvm_get_supported_cpuid(); + + if (kvm_get_cpuid_max_basic() >= 0xa) { + entry_1_0 = kvm_get_supported_cpuid_index(1, 0); + entry_a_0 = 
kvm_get_supported_cpuid_index(0xa, 0); + pdcm_supported = entry_1_0 && !!(entry_1_0->ecx & X86_FEATURE_PDCM); + eax.full = entry_a_0->eax; + } + if (!pdcm_supported) { + print_skip("MSR_IA32_PERF_CAPABILITIES is not supported by the vCPU"); + exit(KSFT_SKIP
[PATCH v14 10/11] KVM: vmx/pmu: Expose LBR_FMT in the MSR_IA32_PERF_CAPABILITIES
Userspace can enable the guest LBR feature when the exact supported LBR format value is written to MSR_IA32_PERF_CAPABILITIES and the LBR is also compatible with the vPMU version and the host cpu model. The LBR can be enabled on the guest if host perf supports LBR (checked via x86_perf_get_lbr()) and the vcpu model is compatible with the host one. Signed-off-by: Like Xu --- arch/x86/kvm/vmx/capabilities.h | 9 - 1 file changed, 8 insertions(+), 1 deletion(-) diff --git a/arch/x86/kvm/vmx/capabilities.h b/arch/x86/kvm/vmx/capabilities.h index 57b940c613ab..c49f3ee8eca8 100644 --- a/arch/x86/kvm/vmx/capabilities.h +++ b/arch/x86/kvm/vmx/capabilities.h @@ -374,11 +374,18 @@ static inline bool vmx_pt_mode_is_host_guest(void) static inline u64 vmx_get_perf_capabilities(void) { + u64 perf_cap = 0; + + if (boot_cpu_has(X86_FEATURE_PDCM)) + rdmsrl(MSR_IA32_PERF_CAPABILITIES, perf_cap); + + perf_cap &= PMU_CAP_LBR_FMT; + /* * Since counters are virtualized, KVM would support full * width counting unconditionally, even if the host lacks it. */ - return PMU_CAP_FW_WRITES; + return PMU_CAP_FW_WRITES | perf_cap; } static inline u64 vmx_supported_debugctl(void) -- 2.29.2
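How the supported capability value is composed (full-width writes always, the LBR format field only when the host exposes MSR_IA32_PERF_CAPABILITIES) can be sketched as a pure function. guest_perf_capabilities() and its two parameters are hypothetical; host_cap stands for the raw host MSR value, and the two macros copy the values shown in capabilities.h.

```c
#include <stdbool.h>
#include <stdint.h>

#define PMU_CAP_FW_WRITES	(1ULL << 13)
#define PMU_CAP_LBR_FMT		0x3f

/*
 * Hypothetical model of the capability composition: 'host_has_pdcm'
 * stands for boot_cpu_has(X86_FEATURE_PDCM) and 'host_cap' for the
 * host value of MSR_IA32_PERF_CAPABILITIES.
 */
static uint64_t guest_perf_capabilities(bool host_has_pdcm, uint64_t host_cap)
{
	uint64_t perf_cap = 0;	/* stays 0 when the MSR is not readable */

	if (host_has_pdcm)
		perf_cap = host_cap & PMU_CAP_LBR_FMT;

	return PMU_CAP_FW_WRITES | perf_cap;
}
```

Starting from perf_cap = 0 matters in this model: without PDCM the result is exactly PMU_CAP_FW_WRITES, with no stray LBR format bits.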
[PATCH v14 08/11] KVM: vmx/pmu: Emulate legacy freezing LBRs on virtual PMI
The current vPMU only supports Architecture Version 2. According to Intel SDM "17.4.7 Freezing LBR and Performance Counters on PMI", if IA32_DEBUGCTL.Freeze_LBR_On_PMI = 1, the LBR is frozen on the virtual PMI and the KVM would emulate to clear the LBR bit (bit 0) in IA32_DEBUGCTL. Also, guest needs to re-enable IA32_DEBUGCTL.LBR to resume recording branches. Signed-off-by: Like Xu Reviewed-by: Andi Kleen --- arch/x86/kvm/pmu.c | 5 - arch/x86/kvm/pmu.h | 1 + arch/x86/kvm/vmx/capabilities.h | 4 +++- arch/x86/kvm/vmx/pmu_intel.c| 30 ++ arch/x86/kvm/vmx/vmx.c | 2 +- 5 files changed, 39 insertions(+), 3 deletions(-) diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c index 67741d2a0308..405890c723a1 100644 --- a/arch/x86/kvm/pmu.c +++ b/arch/x86/kvm/pmu.c @@ -383,8 +383,11 @@ int kvm_pmu_rdpmc(struct kvm_vcpu *vcpu, unsigned idx, u64 *data) void kvm_pmu_deliver_pmi(struct kvm_vcpu *vcpu) { - if (lapic_in_kernel(vcpu)) + if (lapic_in_kernel(vcpu)) { + if (kvm_x86_ops.pmu_ops->deliver_pmi) + kvm_x86_ops.pmu_ops->deliver_pmi(vcpu); kvm_apic_local_deliver(vcpu->arch.apic, APIC_LVTPC); + } } bool kvm_pmu_is_valid_msr(struct kvm_vcpu *vcpu, u32 msr) diff --git a/arch/x86/kvm/pmu.h b/arch/x86/kvm/pmu.h index 067fef51760c..742a4e98df8c 100644 --- a/arch/x86/kvm/pmu.h +++ b/arch/x86/kvm/pmu.h @@ -39,6 +39,7 @@ struct kvm_pmu_ops { void (*refresh)(struct kvm_vcpu *vcpu); void (*init)(struct kvm_vcpu *vcpu); void (*reset)(struct kvm_vcpu *vcpu); + void (*deliver_pmi)(struct kvm_vcpu *vcpu); }; static inline u64 pmc_bitmask(struct kvm_pmc *pmc) diff --git a/arch/x86/kvm/vmx/capabilities.h b/arch/x86/kvm/vmx/capabilities.h index 62aa7a701ebb..57b940c613ab 100644 --- a/arch/x86/kvm/vmx/capabilities.h +++ b/arch/x86/kvm/vmx/capabilities.h @@ -21,6 +21,8 @@ extern int __read_mostly pt_mode; #define PMU_CAP_FW_WRITES (1ULL << 13) #define PMU_CAP_LBR_FMT0x3f +#define DEBUGCTLMSR_LBR_MASK (DEBUGCTLMSR_LBR | DEBUGCTLMSR_FREEZE_LBRS_ON_PMI) + struct nested_vmx_msrs { /* * We only store 
the "true" versions of the VMX capability MSRs. We @@ -384,7 +386,7 @@ static inline u64 vmx_supported_debugctl(void) u64 debugctl = DEBUGCTLMSR_BTF; if (vmx_get_perf_capabilities() & PMU_CAP_LBR_FMT) - debugctl |= DEBUGCTLMSR_LBR; + debugctl |= DEBUGCTLMSR_LBR_MASK; return debugctl; } diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c index 60f395e18446..51edd9c1adfa 100644 --- a/arch/x86/kvm/vmx/pmu_intel.c +++ b/arch/x86/kvm/vmx/pmu_intel.c @@ -579,6 +579,35 @@ static void intel_pmu_reset(struct kvm_vcpu *vcpu) intel_pmu_release_guest_lbr_event(vcpu); } +/* + * Emulate LBR_On_PMI behavior for 1 < pmu.version < 4. + * + * If Freeze_LBR_On_PMI = 1, the LBR is frozen on PMI and + * the KVM emulates to clear the LBR bit (bit 0) in IA32_DEBUGCTL. + * + * Guest needs to re-enable LBR to resume branches recording. + */ +static void intel_pmu_legacy_freezing_lbrs_on_pmi(struct kvm_vcpu *vcpu) +{ + u64 data = vmcs_read64(GUEST_IA32_DEBUGCTL); + + if (data & DEBUGCTLMSR_FREEZE_LBRS_ON_PMI) { + data &= ~DEBUGCTLMSR_LBR; + vmcs_write64(GUEST_IA32_DEBUGCTL, data); + } +} + +static void intel_pmu_deliver_pmi(struct kvm_vcpu *vcpu) +{ + u8 version = vcpu_to_pmu(vcpu)->version; + + if (!intel_pmu_lbr_is_enabled(vcpu)) + return; + + if (version > 1 && version < 4) + intel_pmu_legacy_freezing_lbrs_on_pmi(vcpu); +} + static void vmx_update_intercept_for_lbr_msrs(struct kvm_vcpu *vcpu, bool set) { struct x86_pmu_lbr *lbr = vcpu_to_lbr_records(vcpu); @@ -665,4 +694,5 @@ struct kvm_pmu_ops intel_pmu_ops = { .refresh = intel_pmu_refresh, .init = intel_pmu_init, .reset = intel_pmu_reset, + .deliver_pmi = intel_pmu_deliver_pmi, }; diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c index 40fdeb394328..5389032ca4ad 100644 --- a/arch/x86/kvm/vmx/vmx.c +++ b/arch/x86/kvm/vmx/vmx.c @@ -1963,7 +1963,7 @@ static u64 vcpu_supported_debugctl(struct kvm_vcpu *vcpu) u64 debugctl = vmx_supported_debugctl(); if (!intel_pmu_lbr_is_enabled(vcpu)) - debugctl &= 
~DEBUGCTLMSR_LBR; + debugctl &= ~DEBUGCTLMSR_LBR_MASK; return debugctl; } -- 2.29.2
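The legacy freeze behaviour emulated by this patch reduces to a small state transition on IA32_DEBUGCTL at PMI delivery. The sketch below is a simplified model, not the KVM code: debugctl_after_pmi() is a hypothetical name, the bit positions (LBR at bit 0, Freeze_LBRs_On_PMI at bit 11) follow the kernel's DEBUGCTLMSR_* definitions, and the intel_pmu_lbr_is_enabled() check is left out.

```c
#include <stdint.h>

#define DEBUGCTL_LBR			(1ULL << 0)
#define DEBUGCTL_FREEZE_LBRS_ON_PMI	(1ULL << 11)

/*
 * Hypothetical model: return the guest IA32_DEBUGCTL value after a
 * virtual PMI. For 1 < version < 4, a set Freeze_LBRs_On_PMI bit
 * causes the LBR enable bit to be cleared; the guest must set it
 * again to resume branch recording.
 */
static uint64_t debugctl_after_pmi(uint64_t debugctl, unsigned int pmu_version)
{
	if (pmu_version > 1 && pmu_version < 4 &&
	    (debugctl & DEBUGCTL_FREEZE_LBRS_ON_PMI))
		debugctl &= ~DEBUGCTL_LBR;

	return debugctl;
}
```

The freeze bit itself is left untouched, so the same behaviour applies to the next PMI once the guest re-enables LBR.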
[PATCH v14 09/11] KVM: vmx/pmu: Release guest LBR event via lazy release mechanism
The vPMU uses GUEST_LBR_IN_USE_IDX (bit 58) in 'pmu->pmc_in_use' to indicate whether a guest LBR event is still needed by the vcpu. If the vcpu no longer accesses LBR related registers within a scheduling time slice, and the enable bit of LBR has been unset, vPMU will treat the guest LBR event as a bland event of a vPMC counter and release it as usual. Also, the pass-through state of LBR records msrs is cancelled. Signed-off-by: Like Xu --- arch/x86/kvm/pmu.c | 3 +++ arch/x86/kvm/pmu.h | 1 + arch/x86/kvm/vmx/pmu_intel.c | 21 - 3 files changed, 24 insertions(+), 1 deletion(-) diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c index 405890c723a1..136dc2f3c5d3 100644 --- a/arch/x86/kvm/pmu.c +++ b/arch/x86/kvm/pmu.c @@ -476,6 +476,9 @@ void kvm_pmu_cleanup(struct kvm_vcpu *vcpu) pmc_stop_counter(pmc); } + if (kvm_x86_ops.pmu_ops->cleanup) + kvm_x86_ops.pmu_ops->cleanup(vcpu); + bitmap_zero(pmu->pmc_in_use, X86_PMC_IDX_MAX); } diff --git a/arch/x86/kvm/pmu.h b/arch/x86/kvm/pmu.h index 742a4e98df8c..7b30bc967af3 100644 --- a/arch/x86/kvm/pmu.h +++ b/arch/x86/kvm/pmu.h @@ -40,6 +40,7 @@ struct kvm_pmu_ops { void (*init)(struct kvm_vcpu *vcpu); void (*reset)(struct kvm_vcpu *vcpu); void (*deliver_pmi)(struct kvm_vcpu *vcpu); + void (*cleanup)(struct kvm_vcpu *vcpu); }; static inline u64 pmc_bitmask(struct kvm_pmc *pmc) diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c index 51edd9c1adfa..23cd31b849f4 100644 --- a/arch/x86/kvm/vmx/pmu_intel.c +++ b/arch/x86/kvm/vmx/pmu_intel.c @@ -288,8 +288,10 @@ int intel_pmu_create_guest_lbr_event(struct kvm_vcpu *vcpu) PERF_SAMPLE_BRANCH_USER, }; - if (unlikely(lbr_desc->event)) + if (unlikely(lbr_desc->event)) { + __set_bit(INTEL_PMC_IDX_FIXED_VLBR, pmu->pmc_in_use); return 0; + } event = perf_event_create_kernel_counter(&attr, -1, current, NULL, NULL); @@ -300,6 +302,7 @@ int intel_pmu_create_guest_lbr_event(struct kvm_vcpu *vcpu) } lbr_desc->event = event; pmu->event_count++; + 
__set_bit(INTEL_PMC_IDX_FIXED_VLBR, pmu->pmc_in_use); return 0; } @@ -332,9 +335,11 @@ static bool intel_pmu_handle_lbr_msrs_access(struct kvm_vcpu *vcpu, rdmsrl(index, msr_info->data); else wrmsrl(index, msr_info->data); + __set_bit(INTEL_PMC_IDX_FIXED_VLBR, vcpu_to_pmu(vcpu)->pmc_in_use); local_irq_enable(); return true; } + clear_bit(INTEL_PMC_IDX_FIXED_VLBR, vcpu_to_pmu(vcpu)->pmc_in_use); local_irq_enable(); dummy: @@ -463,6 +468,7 @@ static void intel_pmu_refresh(struct kvm_vcpu *vcpu) struct kvm_cpuid_entry2 *entry; union cpuid10_eax eax; union cpuid10_edx edx; + struct lbr_desc *lbr_desc = vcpu_to_lbr_desc(vcpu); pmu->nr_arch_gp_counters = 0; pmu->nr_arch_fixed_counters = 0; @@ -482,6 +488,8 @@ static void intel_pmu_refresh(struct kvm_vcpu *vcpu) return; perf_get_x86_pmu_capability(&x86_pmu); + if (lbr_desc->records.nr) + bitmap_set(pmu->all_valid_pmc_idx, INTEL_PMC_IDX_FIXED_VLBR, 1); pmu->nr_arch_gp_counters = min_t(int, eax.split.num_counters, x86_pmu.num_counters_gp); @@ -658,17 +666,21 @@ static inline void vmx_enable_lbr_msrs_passthrough(struct kvm_vcpu *vcpu) */ void vmx_passthrough_lbr_msrs(struct kvm_vcpu *vcpu) { + struct kvm_pmu *pmu = vcpu_to_pmu(vcpu); struct lbr_desc *lbr_desc = vcpu_to_lbr_desc(vcpu); if (!lbr_desc->event) { vmx_disable_lbr_msrs_passthrough(vcpu); if (vmcs_read64(GUEST_IA32_DEBUGCTL) & DEBUGCTLMSR_LBR) goto warn; + if (test_bit(INTEL_PMC_IDX_FIXED_VLBR, pmu->pmc_in_use)) + goto warn; return; } if (lbr_desc->event->state < PERF_EVENT_STATE_ACTIVE) { vmx_disable_lbr_msrs_passthrough(vcpu); + __clear_bit(INTEL_PMC_IDX_FIXED_VLBR, pmu->pmc_in_use); goto warn; } else vmx_enable_lbr_msrs_passthrough(vcpu); @@ -680,6 +692,12 @@ void vmx_passthrough_lbr_msrs(struct kvm_vcpu *vcpu) vcpu->vcpu_id); } +static void intel_pmu_cleanup(struct kvm_vcpu *vcpu) +{ + if (!(vmcs_read64(GUEST_IA32_DEBUGCTL) & DEBUGCTLMSR_LBR)) + intel_pmu_release_guest_lbr_event(vcpu); +} + struct kvm_pmu_ops intel_pmu_ops = { .find_arch_event = 
intel_find_arch_event, .find_fixe
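The lazy release described in the commit message can be modelled with an in-use bit that every LBR access sets and every cleanup pass clears. This sketch follows the wording of the commit message rather than the exact code in the diff; struct vpmu, lbr_msr_access() and vpmu_cleanup() are hypothetical names, and bit 58 is taken from the message.

```c
#include <stdbool.h>
#include <stdint.h>

#define VLBR_IDX 58	/* in-use bit for the guest LBR event, per the commit message */

/* Hypothetical, heavily reduced model of the vPMU state. */
struct vpmu {
	uint64_t pmc_in_use;	/* bitmap of resources touched in this slice */
	bool lbr_event;		/* true while the guest LBR event exists */
	bool lbr_enabled;	/* guest LBR enable bit */
};

/* Any guest access to an LBR MSR marks the event as still needed. */
static void lbr_msr_access(struct vpmu *p)
{
	p->pmc_in_use |= 1ULL << VLBR_IDX;
}

/* Runs once per scheduling time slice. */
static void vpmu_cleanup(struct vpmu *p)
{
	bool used = p->pmc_in_use & (1ULL << VLBR_IDX);

	/* Release only if LBR was neither accessed nor left enabled. */
	if (!used && !p->lbr_enabled)
		p->lbr_event = false;

	p->pmc_in_use = 0;	/* start a fresh observation window */
}
```

An access followed by a cleanup keeps the event; a second cleanup with no access in between, and LBR disabled, releases it.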
[PATCH v14 11/11] selftests: kvm/x86: add test for pmu msr MSR_IA32_PERF_CAPABILITIES
This test will check the effect of various CPUID settings on the MSR_IA32_PERF_CAPABILITIES MSR, check that whatever user space writes with KVM_SET_MSR is _not_ modified from the guest and can be retrieved with KVM_GET_MSR, and check that invalid LBR formats are rejected. Signed-off-by: Like Xu --- tools/testing/selftests/kvm/.gitignore| 1 + tools/testing/selftests/kvm/Makefile | 1 + .../selftests/kvm/x86_64/vmx_pmu_msrs_test.c | 149 ++ 3 files changed, 151 insertions(+) create mode 100644 tools/testing/selftests/kvm/x86_64/vmx_pmu_msrs_test.c diff --git a/tools/testing/selftests/kvm/.gitignore b/tools/testing/selftests/kvm/.gitignore index ce8f4ad39684..28b71efe52a0 100644 --- a/tools/testing/selftests/kvm/.gitignore +++ b/tools/testing/selftests/kvm/.gitignore @@ -25,6 +25,7 @@ /x86_64/vmx_set_nested_state_test /x86_64/vmx_tsc_adjust_test /x86_64/xss_msr_test +/x86_64/vmx_pmu_msrs_test /demand_paging_test /dirty_log_test /dirty_log_perf_test diff --git a/tools/testing/selftests/kvm/Makefile b/tools/testing/selftests/kvm/Makefile index fe41c6a0fa67..cf8737828dd4 100644 --- a/tools/testing/selftests/kvm/Makefile +++ b/tools/testing/selftests/kvm/Makefile @@ -59,6 +59,7 @@ TEST_GEN_PROGS_x86_64 += x86_64/vmx_tsc_adjust_test TEST_GEN_PROGS_x86_64 += x86_64/xss_msr_test TEST_GEN_PROGS_x86_64 += x86_64/debug_regs TEST_GEN_PROGS_x86_64 += x86_64/tsc_msrs_test +TEST_GEN_PROGS_x86_64 += x86_64/vmx_pmu_msrs_test TEST_GEN_PROGS_x86_64 += demand_paging_test TEST_GEN_PROGS_x86_64 += dirty_log_test TEST_GEN_PROGS_x86_64 += dirty_log_perf_test diff --git a/tools/testing/selftests/kvm/x86_64/vmx_pmu_msrs_test.c b/tools/testing/selftests/kvm/x86_64/vmx_pmu_msrs_test.c new file mode 100644 index ..b3ad63e6ff12 --- /dev/null +++ b/tools/testing/selftests/kvm/x86_64/vmx_pmu_msrs_test.c @@ -0,0 +1,149 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * VMX-pmu related msrs test + * + * Copyright (C) 2021 Intel Corporation + * + * Test to check the effect of various CPUID settings + * on 
the MSR_IA32_PERF_CAPABILITIES MSR, and check that + * whatever we write with KVM_SET_MSR is _not_ modified + * in the guest and test it can be retrieved with KVM_GET_MSR. + * + * Test to check that invalid LBR formats are rejected. + */ + +#define _GNU_SOURCE /* for program_invocation_short_name */ +#include + +#include "kvm_util.h" +#include "vmx.h" + +#define VCPU_ID 0 + +#define X86_FEATURE_PDCM (1<<15) +#define PMU_CAP_FW_WRITES (1ULL << 13) +#define PMU_CAP_LBR_FMT0x3f + +union cpuid10_eax { + struct { + unsigned int version_id:8; + unsigned int num_counters:8; + unsigned int bit_width:8; + unsigned int mask_length:8; + } split; + unsigned int full; +}; + +union perf_capabilities { + struct { + u64 lbr_format:6; + u64 pebs_trap:1; + u64 pebs_arch_reg:1; + u64 pebs_format:4; + u64 smm_freeze:1; + u64 full_width_write:1; + u64 pebs_baseline:1; + u64 perf_metrics:1; + u64 pebs_output_pt_available:1; + u64 anythread_deprecated:1; + }; + u64 capabilities; +}; + +uint64_t rdmsr_on_cpu(uint32_t reg) +{ + uint64_t data; + int fd; + char msr_file[64]; + + sprintf(msr_file, "/dev/cpu/%d/msr", 0); + fd = open(msr_file, O_RDONLY); + if (fd < 0) + exit(KSFT_SKIP); + + if (pread(fd, &data, sizeof(data), reg) != sizeof(data)) + exit(KSFT_SKIP); + + close(fd); + return data; +} + +static void guest_code(void) +{ + wrmsr(MSR_IA32_PERF_CAPABILITIES, PMU_CAP_LBR_FMT); +} + +int main(int argc, char *argv[]) +{ + struct kvm_cpuid2 *cpuid; + struct kvm_cpuid_entry2 *entry_1_0; + struct kvm_cpuid_entry2 *entry_a_0; + bool pdcm_supported = false; + struct kvm_vm *vm; + int ret; + union cpuid10_eax eax; + union perf_capabilities host_cap; + + host_cap.capabilities = rdmsr_on_cpu(MSR_IA32_PERF_CAPABILITIES); + host_cap.capabilities &= (PMU_CAP_FW_WRITES | PMU_CAP_LBR_FMT); + + /* Create VM */ + vm = vm_create_default(VCPU_ID, 0, guest_code); + cpuid = kvm_get_supported_cpuid(); + + if (kvm_get_cpuid_max_basic() >= 0xa) { + entry_1_0 = kvm_get_supported_cpuid_index(1, 0); + entry_a_0 = 
kvm_get_supported_cpuid_index(0xa, 0); + pdcm_supported = entry_1_0 && !!(entry_1_0->ecx & X86_FEATURE_PDCM); + eax.full = entry_a_0->eax; + } + if (!pdcm_supported) { + print_skip("MSR_IA32_PERF_CAPABILITIES is not supported by the vCPU"); + exit(KSFT_SKIP
[PATCH v14 08/11] KVM: vmx/pmu: Emulate legacy freezing LBRs on virtual PMI
The current vPMU only supports Architecture Version 2. According to
Intel SDM "17.4.7 Freezing LBR and Performance Counters on PMI", if
IA32_DEBUGCTL.Freeze_LBR_On_PMI = 1, the LBR is frozen on the virtual
PMI, and KVM emulates this by clearing the LBR bit (bit 0) in
IA32_DEBUGCTL. The guest then needs to re-enable IA32_DEBUGCTL.LBR to
resume recording branches.

Signed-off-by: Like Xu
Reviewed-by: Andi Kleen
---
 arch/x86/kvm/pmu.c              |  5 -
 arch/x86/kvm/pmu.h              |  1 +
 arch/x86/kvm/vmx/capabilities.h |  4 +++-
 arch/x86/kvm/vmx/pmu_intel.c    | 30 ++
 arch/x86/kvm/vmx/vmx.c          |  2 +-
 5 files changed, 39 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
index 67741d2a0308..405890c723a1 100644
--- a/arch/x86/kvm/pmu.c
+++ b/arch/x86/kvm/pmu.c
@@ -383,8 +383,11 @@ int kvm_pmu_rdpmc(struct kvm_vcpu *vcpu, unsigned idx, u64 *data)
 
 void kvm_pmu_deliver_pmi(struct kvm_vcpu *vcpu)
 {
-	if (lapic_in_kernel(vcpu))
+	if (lapic_in_kernel(vcpu)) {
+		if (kvm_x86_ops.pmu_ops->deliver_pmi)
+			kvm_x86_ops.pmu_ops->deliver_pmi(vcpu);
 		kvm_apic_local_deliver(vcpu->arch.apic, APIC_LVTPC);
+	}
 }
 
 bool kvm_pmu_is_valid_msr(struct kvm_vcpu *vcpu, u32 msr)
diff --git a/arch/x86/kvm/pmu.h b/arch/x86/kvm/pmu.h
index 067fef51760c..742a4e98df8c 100644
--- a/arch/x86/kvm/pmu.h
+++ b/arch/x86/kvm/pmu.h
@@ -39,6 +39,7 @@ struct kvm_pmu_ops {
 	void (*refresh)(struct kvm_vcpu *vcpu);
 	void (*init)(struct kvm_vcpu *vcpu);
 	void (*reset)(struct kvm_vcpu *vcpu);
+	void (*deliver_pmi)(struct kvm_vcpu *vcpu);
 };
 
 static inline u64 pmc_bitmask(struct kvm_pmc *pmc)
diff --git a/arch/x86/kvm/vmx/capabilities.h b/arch/x86/kvm/vmx/capabilities.h
index 62aa7a701ebb..57b940c613ab 100644
--- a/arch/x86/kvm/vmx/capabilities.h
+++ b/arch/x86/kvm/vmx/capabilities.h
@@ -21,6 +21,8 @@ extern int __read_mostly pt_mode;
 #define PMU_CAP_FW_WRITES	(1ULL << 13)
 #define PMU_CAP_LBR_FMT		0x3f
 
+#define DEBUGCTLMSR_LBR_MASK		(DEBUGCTLMSR_LBR | DEBUGCTLMSR_FREEZE_LBRS_ON_PMI)
+
 struct nested_vmx_msrs {
 	/*
 	 * We only store the "true" versions of the VMX capability MSRs. We
@@ -384,7 +386,7 @@ static inline u64 vmx_supported_debugctl(void)
 	u64 debugctl = DEBUGCTLMSR_BTF;
 
 	if (vmx_get_perf_capabilities() & PMU_CAP_LBR_FMT)
-		debugctl |= DEBUGCTLMSR_LBR;
+		debugctl |= DEBUGCTLMSR_LBR_MASK;
 
 	return debugctl;
 }
diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c
index 60f395e18446..51edd9c1adfa 100644
--- a/arch/x86/kvm/vmx/pmu_intel.c
+++ b/arch/x86/kvm/vmx/pmu_intel.c
@@ -579,6 +579,35 @@ static void intel_pmu_reset(struct kvm_vcpu *vcpu)
 	intel_pmu_release_guest_lbr_event(vcpu);
 }
 
+/*
+ * Emulate LBR_On_PMI behavior for 1 < pmu.version < 4.
+ *
+ * If Freeze_LBR_On_PMI = 1, the LBR is frozen on PMI and
+ * the KVM emulates to clear the LBR bit (bit 0) in IA32_DEBUGCTL.
+ *
+ * Guest needs to re-enable LBR to resume branches recording.
+ */
+static void intel_pmu_legacy_freezing_lbrs_on_pmi(struct kvm_vcpu *vcpu)
+{
+	u64 data = vmcs_read64(GUEST_IA32_DEBUGCTL);
+
+	if (data & DEBUGCTLMSR_FREEZE_LBRS_ON_PMI) {
+		data &= ~DEBUGCTLMSR_LBR;
+		vmcs_write64(GUEST_IA32_DEBUGCTL, data);
+	}
+}
+
+static void intel_pmu_deliver_pmi(struct kvm_vcpu *vcpu)
+{
+	u8 version = vcpu_to_pmu(vcpu)->version;
+
+	if (!intel_pmu_lbr_is_enabled(vcpu))
+		return;
+
+	if (version > 1 && version < 4)
+		intel_pmu_legacy_freezing_lbrs_on_pmi(vcpu);
+}
+
 static void vmx_update_intercept_for_lbr_msrs(struct kvm_vcpu *vcpu, bool set)
 {
 	struct x86_pmu_lbr *lbr = vcpu_to_lbr_records(vcpu);
@@ -665,4 +694,5 @@ struct kvm_pmu_ops intel_pmu_ops = {
 	.refresh = intel_pmu_refresh,
 	.init = intel_pmu_init,
 	.reset = intel_pmu_reset,
+	.deliver_pmi = intel_pmu_deliver_pmi,
 };
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 40fdeb394328..5389032ca4ad 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -1963,7 +1963,7 @@ static u64 vcpu_supported_debugctl(struct kvm_vcpu *vcpu)
 	u64 debugctl = vmx_supported_debugctl();
 
 	if (!intel_pmu_lbr_is_enabled(vcpu))
-		debugctl &= ~DEBUGCTLMSR_LBR;
+		debugctl &= ~DEBUGCTLMSR_LBR_MASK;
 
 	return debugctl;
 }
-- 
2.29.2
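For readers who want to experiment with the freeze-on-PMI rule outside the kernel, the following is a minimal user-space sketch of the logic in patch 08/11. It is not the KVM code: struct model_vcpu, model_deliver_pmi() and model_self_check() are hypothetical names introduced here, and a plain struct field stands in for the GUEST_IA32_DEBUGCTL VMCS field. Only the two DEBUGCTL bit positions (bit 0 and bit 11) are taken from the kernel's msr-index.h.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Bit positions as defined in the Linux msr-index.h header. */
#define DEBUGCTLMSR_LBR                 (1ULL << 0)
#define DEBUGCTLMSR_FREEZE_LBRS_ON_PMI  (1ULL << 11)

/* Hypothetical stand-in for the vCPU state the patch touches. */
struct model_vcpu {
	uint64_t guest_debugctl;  /* models the GUEST_IA32_DEBUGCTL field */
	uint8_t pmu_version;      /* architectural PMU version seen by guest */
	bool lbr_enabled;         /* models intel_pmu_lbr_is_enabled() */
};

/*
 * Model of the deliver_pmi hook: for 1 < version < 4, a PMI clears
 * DEBUGCTL.LBR when the guest has set Freeze_LBR_On_PMI.
 */
static void model_deliver_pmi(struct model_vcpu *vcpu)
{
	if (!vcpu->lbr_enabled)
		return;

	if (vcpu->pmu_version <= 1 || vcpu->pmu_version >= 4)
		return;

	if (vcpu->guest_debugctl & DEBUGCTLMSR_FREEZE_LBRS_ON_PMI)
		vcpu->guest_debugctl &= ~DEBUGCTLMSR_LBR;
}

/* Small usage example; call it from main() to exercise the model. */
static void model_self_check(void)
{
	const uint64_t both = DEBUGCTLMSR_LBR | DEBUGCTLMSR_FREEZE_LBRS_ON_PMI;
	struct model_vcpu v = {
		.guest_debugctl = both, .pmu_version = 2, .lbr_enabled = true,
	};

	/* Freeze requested: the LBR bit is cleared, the freeze bit stays. */
	model_deliver_pmi(&v);
	assert(v.guest_debugctl == DEBUGCTLMSR_FREEZE_LBRS_ON_PMI);

	/* Guest re-enables LBR without the freeze bit: the PMI leaves it alone. */
	v.guest_debugctl = DEBUGCTLMSR_LBR;
	model_deliver_pmi(&v);
	assert(v.guest_debugctl == DEBUGCTLMSR_LBR);

	/* Version 4 is outside the legacy range, so nothing is emulated. */
	v.guest_debugctl = both;
	v.pmu_version = 4;
	model_deliver_pmi(&v);
	assert(v.guest_debugctl == both);
}
```

The sketch only models the state transition. In the patch itself the hook runs from kvm_pmu_deliver_pmi() before the APIC_LVTPC interrupt is delivered.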
[PATCH v14 07/11] KVM: vmx/pmu: Reduce the overhead of LBR pass-through or cancellation
When the LBR record MSRs have already been passed through, there is no
need to call vmx_update_intercept_for_lbr_msrs() again, and likewise
when they are already intercepted.

Signed-off-by: Like Xu
Reviewed-by: Andi Kleen
---
 arch/x86/kvm/vmx/pmu_intel.c | 13 +
 arch/x86/kvm/vmx/vmx.h       |  3 +++
 2 files changed, 16 insertions(+)

diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c
index 287fc14f0445..60f395e18446 100644
--- a/arch/x86/kvm/vmx/pmu_intel.c
+++ b/arch/x86/kvm/vmx/pmu_intel.c
@@ -550,6 +550,7 @@ static void intel_pmu_init(struct kvm_vcpu *vcpu)
 	vcpu->arch.perf_capabilities = 0;
 	lbr_desc->records.nr = 0;
 	lbr_desc->event = NULL;
+	lbr_desc->msr_passthrough = false;
 }
 
 static void intel_pmu_reset(struct kvm_vcpu *vcpu)
@@ -596,12 +597,24 @@ static void vmx_update_intercept_for_lbr_msrs(struct kvm_vcpu *vcpu, bool set)
 
 static inline void vmx_disable_lbr_msrs_passthrough(struct kvm_vcpu *vcpu)
 {
+	struct lbr_desc *lbr_desc = vcpu_to_lbr_desc(vcpu);
+
+	if (!lbr_desc->msr_passthrough)
+		return;
+
 	vmx_update_intercept_for_lbr_msrs(vcpu, true);
+	lbr_desc->msr_passthrough = false;
 }
 
 static inline void vmx_enable_lbr_msrs_passthrough(struct kvm_vcpu *vcpu)
 {
+	struct lbr_desc *lbr_desc = vcpu_to_lbr_desc(vcpu);
+
+	if (lbr_desc->msr_passthrough)
+		return;
+
 	vmx_update_intercept_for_lbr_msrs(vcpu, false);
+	lbr_desc->msr_passthrough = true;
 }
 
 /*
diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h
index 863bb3fe73d4..4d6a2624a204 100644
--- a/arch/x86/kvm/vmx/vmx.h
+++ b/arch/x86/kvm/vmx/vmx.h
@@ -90,6 +90,9 @@ struct lbr_desc {
 	 * The records may be inaccurate if the host reclaims the LBR.
 	 */
 	struct perf_event *event;
+
+	/* True if LBRs are marked as not intercepted in the MSR bitmap */
+	bool msr_passthrough;
 };
 
 /*
-- 
2.29.2
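The caching idea in patch 07/11 can be illustrated with a small user-space sketch. It is not the KVM code: struct model_lbr_desc, the bitmap_updates counter and all model_* functions are hypothetical names introduced here. The counter stands in for the cost of walking the MSR bitmap, so the effect of the msr_passthrough flag becomes observable.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct lbr_desc from the patch. */
struct model_lbr_desc {
	bool msr_passthrough;         /* true if LBR MSRs are not intercepted */
	unsigned int bitmap_updates;  /* counts the expensive bitmap rewrites */
};

/* Models vmx_update_intercept_for_lbr_msrs(): only the cost is tracked. */
static void model_update_intercept(struct model_lbr_desc *desc, bool intercept)
{
	(void)intercept;
	desc->bitmap_updates++;
}

/* Switch to interception only if the MSRs are currently passed through. */
static void model_disable_passthrough(struct model_lbr_desc *desc)
{
	if (!desc->msr_passthrough)
		return;

	model_update_intercept(desc, true);
	desc->msr_passthrough = false;
}

/* Switch to pass-through only if the MSRs are currently intercepted. */
static void model_enable_passthrough(struct model_lbr_desc *desc)
{
	if (desc->msr_passthrough)
		return;

	model_update_intercept(desc, false);
	desc->msr_passthrough = true;
}

/* Small usage example; call it from main() to exercise the model. */
static void model_passthrough_self_check(void)
{
	struct model_lbr_desc desc = { .msr_passthrough = false };

	/* Repeated enables cost a single bitmap update. */
	model_enable_passthrough(&desc);
	model_enable_passthrough(&desc);
	model_enable_passthrough(&desc);
	assert(desc.bitmap_updates == 1);

	/* Repeated disables likewise cost a single update. */
	model_disable_passthrough(&desc);
	model_disable_passthrough(&desc);
	assert(desc.bitmap_updates == 2);
	assert(!desc.msr_passthrough);
}
```

Initializing the flag to false mirrors the intel_pmu_init() hunk, where the patch starts from the intercepted state.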