On Wed, Jun 21, 2023 at 9:39 AM Dongli Zhang <dongli.zh...@oracle.com> wrote:
>
> The QEMU side calls kvm_get_msrs() to save the pmu registers from the KVM
> side to QEMU, and calls kvm_put_msrs() to store the pmu registers back to
> the KVM side.
>
> However, only the Intel gp/fixed/global pmu registers are involved. There
> is not any implementation for AMD pmu registers. The
> 'has_architectural_pmu_version' and 'num_architectural_pmu_gp_counters'
> are calculated at kvm_arch_init_vcpu() via cpuid(0xa). This does not work
> for AMD. Before AMD PerfMonV2, the number of gp registers is decided based
> on the CPU version.
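(Just for context, the existing detection reads CPUID leaf 0xa, roughly like
the sketch below; this is a simplified illustration of the current
kvm_arch_init_vcpu() logic, not the literal code. AMD reports leaf 0xa as
all zeros, so both values stay 0 there.)

    uint32_t eax, edx, unused;

    /* Leaf 0xa is the Intel architectural PMU leaf; AMD returns zeros. */
    cpu_x86_cpuid(env, 0xa, 0, &eax, &unused, &unused, &edx);

    has_architectural_pmu_version = eax & 0xff;       /* EAX[7:0]: version */
    if (has_architectural_pmu_version > 0) {
        /* EAX[15:8]: number of general-purpose counters */
        num_architectural_pmu_gp_counters = (eax >> 8) & 0xff;
    }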
Updating the relevant documentation to clarify this part of the deficiency
would be a good first step.

>
> This patch is to add the support for AMD version=1 pmu, to get and put AMD
> pmu registers. Otherwise, there will be a bug:

AMD version=1? AMD does not have a version 1 pmu; it goes directly to
version 2, perhaps for x86 compatibility. AMD also does not have the
so-called architectural pmu, so has_architectural_pmu_version may need to
be renamed for the AMD case. It might be more helpful to add similar
support for AMD PerfMonV2 (a rough sketch is at the end of this reply).

>
> 1. The VM resets (e.g., via QEMU system_reset or VM kdump/kexec) while it
> is running "perf top". The pmu registers are not disabled gracefully.
>
> 2. Although the x86_cpu_reset() resets many registers to zero, the
> kvm_put_msrs() does not put AMD pmu registers to the KVM side. As a
> result, some pmu events are still enabled at the KVM side.

I agree that we should have done that, especially if the guest pmu is
enabled on AMD platforms.

>
> 3. The KVM pmc_speculative_in_use() always returns true so that the events
> will not be reclaimed. The kvm_pmc->perf_event is still active.
>
> 4. After the reboot, the VM kernel reports below error:
>
> [    0.092011] Performance Events: Fam17h+ core perfctr, Broken BIOS
> detected, complain to your hardware vendor.
> [    0.092023] [Firmware Bug]: the BIOS has corrupted hw-PMU resources (MSR
> c0010200 is 530076)
>
> 5. In a worse case, the active kvm_pmc->perf_event is still able to
> inject unknown NMIs randomly to the VM kernel.
>
> [...] Uhhuh. NMI received for unknown reason 30 on CPU 0.
>
> The patch is to fix the issue by resetting AMD pmu registers during the
> reset.

I'm not sure if qemu_reset or a VM kexec will necessarily trigger
kvm::amd_pmu_reset().

>
> Cc: Joe Jin <joe....@oracle.com>
> Cc: Like Xu <lik...@tencent.com>
> Signed-off-by: Dongli Zhang <dongli.zh...@oracle.com>
> ---
>  target/i386/cpu.h     |  5 +++
>  target/i386/kvm/kvm.c | 83 +++++++++++++++++++++++++++++++++++++++++--
>  2 files changed, 86 insertions(+), 2 deletions(-)
>
> diff --git a/target/i386/cpu.h b/target/i386/cpu.h
> index cd047e0410..b8ba72e87a 100644
> --- a/target/i386/cpu.h
> +++ b/target/i386/cpu.h
> @@ -471,6 +471,11 @@ typedef enum X86Seg {
>  #define MSR_CORE_PERF_GLOBAL_CTRL       0x38f
>  #define MSR_CORE_PERF_GLOBAL_OVF_CTRL   0x390
>
> +#define MSR_K7_EVNTSEL0                 0xc0010000
> +#define MSR_K7_PERFCTR0                 0xc0010004
> +#define MSR_F15H_PERF_CTL0              0xc0010200
> +#define MSR_F15H_PERF_CTR0              0xc0010201
> +
>  #define MSR_MC0_CTL                     0x400
>  #define MSR_MC0_STATUS                  0x401
>  #define MSR_MC0_ADDR                    0x402
> diff --git a/target/i386/kvm/kvm.c b/target/i386/kvm/kvm.c
> index bf4136fa1b..a0f7273dad 100644
> --- a/target/i386/kvm/kvm.c
> +++ b/target/i386/kvm/kvm.c
> @@ -2084,6 +2084,32 @@ int kvm_arch_init_vcpu(CPUState *cs)
>          }
>      }
>
> +    /*
> +     * If KVM_CAP_PMU_CAPABILITY is not supported, there is no way to
> +     * disable the AMD pmu virtualization.
> +     *
> +     * If KVM_CAP_PMU_CAPABILITY is supported, kvm_state->pmu_cap_disabled
> +     * indicates the KVM side has already disabled the pmu virtualization.
> +     */
> +    if (IS_AMD_CPU(env) && !cs->kvm_state->pmu_cap_disabled) {
> +        int64_t family;
> +
> +        family = (env->cpuid_version >> 8) & 0xf;
> +        if (family == 0xf) {
> +            family += (env->cpuid_version >> 20) & 0xff;
> +        }
> +
> +        if (family >= 6) {
> +            has_architectural_pmu_version = 1;
> +
> +            if (env->features[FEAT_8000_0001_ECX] & CPUID_EXT3_PERFCORE) {
> +                num_architectural_pmu_gp_counters = 6;

Please make the code a little more readable with some macro definitions.

#define AMD64_NUM_COUNTERS              4
#define AMD64_NUM_COUNTERS_CORE         6
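For example, with those macros the detection above could read something
like this (an illustrative rewrite of the quoted hunk only, untested):

    if (family >= 6) {
        has_architectural_pmu_version = 1;

        if (env->features[FEAT_8000_0001_ECX] & CPUID_EXT3_PERFCORE) {
            /* Fam15h core extensions: six MSR_F15H_PERF_CTLx/CTRx pairs */
            num_architectural_pmu_gp_counters = AMD64_NUM_COUNTERS_CORE;
        } else {
            /* Legacy MSR_K7_EVNTSELx/MSR_K7_PERFCTRx: four counters */
            num_architectural_pmu_gp_counters = AMD64_NUM_COUNTERS;
        }
    }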
> +            } else {
> +                num_architectural_pmu_gp_counters = 4;
> +            }
> +        }
> +    }
> +
>      cpu_x86_cpuid(env, 0x80000000, 0, &limit, &unused, &unused, &unused);
>
>      for (i = 0x80000000; i <= limit; i++) {
> @@ -3438,7 +3464,7 @@ static int kvm_put_msrs(X86CPU *cpu, int level)
>              kvm_msr_entry_add(cpu, MSR_KVM_POLL_CONTROL,
>                                env->poll_control_msr);
>          }
>
> -        if (has_architectural_pmu_version > 0) {
> +        if (has_architectural_pmu_version > 0 && IS_INTEL_CPU(env)) {

It seems to be saying here that it is possible to perform
kvm_msr_entry_add() for the Intel-related pmu MSRs on AMD platforms, so the
reverse is expected to be feasible as well. Why is the distinction needed?

>              if (has_architectural_pmu_version > 1) {
>                  /* Stop the counter. */
>                  kvm_msr_entry_add(cpu, MSR_CORE_PERF_FIXED_CTR_CTRL, 0);
> @@ -3469,6 +3495,26 @@ static int kvm_put_msrs(X86CPU *cpu, int level)
>                                    env->msr_global_ctrl);
>              }
>          }
> +
> +        if (has_architectural_pmu_version > 0 && IS_AMD_CPU(env)) {
> +            uint32_t sel_base = MSR_K7_EVNTSEL0;
> +            uint32_t ctr_base = MSR_K7_PERFCTR0;
> +            uint32_t step = 1;
> +
> +            if (num_architectural_pmu_gp_counters == 6) {
> +                sel_base = MSR_F15H_PERF_CTL0;
> +                ctr_base = MSR_F15H_PERF_CTR0;
> +                step = 2;
> +            }
> +
> +            for (i = 0; i < num_architectural_pmu_gp_counters; i++) {
> +                kvm_msr_entry_add(cpu, ctr_base + i * step,
> +                                  env->msr_gp_counters[i]);
> +                kvm_msr_entry_add(cpu, sel_base + i * step,
> +                                  env->msr_gp_evtsel[i]);
> +            }
> +        }
> +
>          /*
>           * Hyper-V partition-wide MSRs: to avoid clearing them on cpu hot-add,
>           * only sync them to KVM on the first cpu
> @@ -3929,7 +3975,7 @@ static int kvm_get_msrs(X86CPU *cpu)
>      if (env->features[FEAT_KVM] & (1 << KVM_FEATURE_POLL_CONTROL)) {
>          kvm_msr_entry_add(cpu, MSR_KVM_POLL_CONTROL, 1);
>      }
> -    if (has_architectural_pmu_version > 0) {
> +    if (has_architectural_pmu_version > 0 && IS_INTEL_CPU(env)) {
>          if (has_architectural_pmu_version > 1) {
>              kvm_msr_entry_add(cpu, MSR_CORE_PERF_FIXED_CTR_CTRL, 0);
>              kvm_msr_entry_add(cpu, MSR_CORE_PERF_GLOBAL_CTRL, 0);
> @@ -3945,6 +3991,25 @@ static int kvm_get_msrs(X86CPU *cpu)
>          }
>      }
>
> +    if (has_architectural_pmu_version > 0 && IS_AMD_CPU(env)) {
> +        uint32_t sel_base = MSR_K7_EVNTSEL0;
> +        uint32_t ctr_base = MSR_K7_PERFCTR0;
> +        uint32_t step = 1;
> +
> +        if (num_architectural_pmu_gp_counters == 6) {
> +            sel_base = MSR_F15H_PERF_CTL0;
> +            ctr_base = MSR_F15H_PERF_CTR0;
> +            step = 2;

Perhaps it would be more helpful to add a little commentary on why a step
of 2 is needed.
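Something like the comment below might be enough (my own wording; the MSR
numbers come from the #defines added in this patch):

    /*
     * With the core performance counter extensions (CPUID_EXT3_PERFCORE),
     * the event select and counter MSRs are interleaved:
     * MSR_F15H_PERF_CTL0 is 0xc0010200, MSR_F15H_PERF_CTR0 is 0xc0010201,
     * CTL1 is 0xc0010202, and so on. Counter i therefore lives at
     * base + 2 * i, hence step = 2. The legacy MSR_K7_EVNTSELx and
     * MSR_K7_PERFCTRx banks are each contiguous, so step stays 1 there.
     */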
> +        }
> +
> +        for (i = 0; i < num_architectural_pmu_gp_counters; i++) {
> +            kvm_msr_entry_add(cpu, ctr_base + i * step,
> +                              env->msr_gp_counters[i]);
> +            kvm_msr_entry_add(cpu, sel_base + i * step,
> +                              env->msr_gp_evtsel[i]);
> +        }
> +    }
> +
>      if (env->mcg_cap) {
>          kvm_msr_entry_add(cpu, MSR_MCG_STATUS, 0);
>          kvm_msr_entry_add(cpu, MSR_MCG_CTL, 0);
> @@ -4230,6 +4295,20 @@ static int kvm_get_msrs(X86CPU *cpu)
>          case MSR_P6_EVNTSEL0 ... MSR_P6_EVNTSEL0 + MAX_GP_COUNTERS - 1:
>              env->msr_gp_evtsel[index - MSR_P6_EVNTSEL0] = msrs[i].data;
>              break;
> +        case MSR_K7_EVNTSEL0 ... MSR_K7_EVNTSEL0 + 3:
> +            env->msr_gp_evtsel[index - MSR_K7_EVNTSEL0] = msrs[i].data;
> +            break;
> +        case MSR_K7_PERFCTR0 ... MSR_K7_PERFCTR0 + 3:
> +            env->msr_gp_counters[index - MSR_K7_PERFCTR0] = msrs[i].data;
> +            break;
> +        case MSR_F15H_PERF_CTL0 ... MSR_F15H_PERF_CTL0 + 0xb:
> +            index = index - MSR_F15H_PERF_CTL0;
> +            if (index & 0x1) {
> +                env->msr_gp_counters[index] = msrs[i].data;
> +            } else {
> +                env->msr_gp_evtsel[index] = msrs[i].data;
> +            }
> +            break;
>          case HV_X64_MSR_HYPERCALL:
>             env->msr_hv_hypercall = msrs[i].data;
>             break;
> --
> 2.34.1
>
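Back to the PerfMonV2 point above: a rough sketch of what the extra
detection could look like, assuming the guest exposes CPUID 0x80000022 and
assuming my reading of the AMD documentation is right (EAX bit 0 reports
PerfMonV2, EBX[3:0] the number of core counters). This is only an
illustration, not a request for this particular patch:

    uint32_t limit, eax, ebx, unused;

    cpu_x86_cpuid(env, 0x80000000, 0, &limit, &unused, &unused, &unused);
    if (limit >= 0x80000022) {
        cpu_x86_cpuid(env, 0x80000022, 0, &eax, &ebx, &unused, &unused);
        if (eax & 0x1) {                                   /* PerfMonV2 */
            has_architectural_pmu_version = 2;
            num_architectural_pmu_gp_counters = ebx & 0xf; /* NumCorePmc */
        }
    }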