Hi Zide,

On 1/6/26 1:03 PM, Chen, Zide wrote:
> 
> 
> On 1/5/2026 12:21 PM, Dongli Zhang wrote:
>> Hi Zide,
>>
>> On 1/2/26 2:59 PM, Chen, Zide wrote:
>>>
>>>
>>> On 12/29/2025 11:42 PM, Dongli Zhang wrote:
>>
>> [snip]
>>
>>>>  
>>>>  static struct kvm_cpuid2 *cpuid_cache;
>>>>  static struct kvm_cpuid2 *hv_cpuid_cache;
>>>> @@ -2068,23 +2072,30 @@ int kvm_arch_pre_create_vcpu(CPUState *cpu, Error **errp)
>>>>      if (first) {
>>>>          first = false;
>>>>  
>>>> -        /*
>>>> -         * Since Linux v5.18, KVM provides a VM-level capability to easily
>>>> -         * disable PMUs; however, QEMU has been providing PMU property per
>>>> -         * CPU since v1.6. In order to accommodate both, have to configure
>>>> -         * the VM-level capability here.
>>>> -         *
>>>> -         * KVM_PMU_CAP_DISABLE doesn't change the PMU
>>>> -         * behavior on Intel platform because current "pmu" property works
>>>> -         * as expected.
>>>> -         */
>>>> -        if ((pmu_cap & KVM_PMU_CAP_DISABLE) && !X86_CPU(cpu)->enable_pmu) {
>>>> -            ret = kvm_vm_enable_cap(kvm_state, KVM_CAP_PMU_CAPABILITY, 0,
>>>> -                                    KVM_PMU_CAP_DISABLE);
>>>> -            if (ret < 0) {
>>>> -                error_setg_errno(errp, -ret,
>>>> -                                 "Failed to set KVM_PMU_CAP_DISABLE");
>>>> -                return ret;
>>>> +        if (X86_CPU(cpu)->enable_pmu) {
>>>> +            if (kvm_pmu_disabled) {
>>>> +                warn_report("Failed to enable PMU since "
>>>> +                            "KVM's enable_pmu parameter is disabled");
>>>
>>> I'm wondering about the intended value of this patch?
>>>
>>> If enable_pmu is true in QEMU but the corresponding KVM parameter is
>>> false, then KVM_GET_SUPPORTED_CPUID or KVM_GET_MSRS should be able to
>>> tell that the PMU feature is not supported by host.
>>>
>>> The logic implemented in this patch seems somewhat redundant.
>>
>> For Intel, the QEMU userspace can determine if the vPMU is disabled by KVM
>> through the use of KVM_GET_SUPPORTED_CPUID.
>>
>> However, this approach does not apply to AMD. Unlike Intel, AMD does not rely
>> on CPUID to detect whether PMU is supported. By default, we can assume that
>> PMU is always available, except for the recent PerfMonV2 feature.
>>
>> The main objective of this PATCH 4/7 is to introduce the variable
>> 'kvm_pmu_disabled', which will be reused in PATCH 5/7 to skip any PMU
>> initialization if the parameter is set to 'N'.
>>
>> +static void kvm_init_pmu_info(struct kvm_cpuid2 *cpuid, X86CPU *cpu)
>> +{
>> +    CPUX86State *env = &cpu->env;
>> +
>> +    /*
>> +     * The PMU virtualization is disabled by kvm.enable_pmu=N.
>> +     */
>> +    if (kvm_pmu_disabled) {
>> +        return;
>> +    }
> 
> Thanks for explanation.
> 
>> The 'kvm_pmu_disabled' variable is used to differentiate between the
>> following two scenarios on AMD:
>>
>> (1) A newer KVM with KVM_PMU_CAP_DISABLE support, but explicitly disabled via
>> the KVM parameter ('N').
>>
>> (2) An older KVM without KVM_CAP_PMU_CAPABILITY support.
>>
>> In both cases, the call to KVM_CAP_PMU_CAPABILITY extension support check may
>> return 0.
>>
>> By reading the file "/sys/module/kvm/parameters/enable_pmu", we can
>> distinguish between these two scenarios.
> 
> As described in PATCH 1/7, without issuing KVM_PMU_CAP_DISABLE, KVM has
> no way to know that userspace does not intend to enable vPMU in AMD
> platforms, and therefore does not fault guest accesses to PMU MSRs.
> 
> My understanding is that the issue being addressed here is basically the
> opposite: QEMU does not know that vPMU is disabled by KVM.

Exactly.

Otherwise, QEMU issues unwanted MSR writes for every vCPU during QEMU reset.

> 
> IIUC, one difference between Intel and AMD is that AMD lacks a CPUID
> leaf to indicate the availability of PMU version 1. But Intel
> potentially could be in the same situation that KVM advertises PMU
> availability but it's not actually supported. (e.g. kvm->arch.enable_pmu
> is false while modules parameter enable_pmu is true).
> 
> From the guest’s point of view, it probes PMU MSRs to determine whether
> PMU support is present and it's fine in this situation.
> 
> In userspace, QEMU may issue KVM_SET_MSRS / KVM_GET_MSRS to KVM without
> knowing that vPMU has been disabled by KVM.  I think these IOCTLs should
> not fail, since KVM states that “Userspace is allowed to read MSRs, and
> write ‘0’ to MSRs, that KVM advertises to userspace, even if an MSR
> isn’t fully supported.”
> 
> My current understanding is that AMD should be fine even without
> kvm_pmu_disabled, but I may be missing some context here.
> 
> The bottom line is this patch doesn't handle the cases that KVM still
> could disable vPMU support even if enable_pmu is true.

Yes. There are still unwanted PMU MSR writes from QEMU. This just seems odd.

The concern with unwanted MSR writes was initially raised by Maksim Davydov:

https://lore.kernel.org/qemu-devel/[email protected]/

As shown below on the v6.0 KVM hypervisor (AMD), while there are no errors from
QEMU, numerous annoying warnings are generated. (If I recall correctly, this can
also be triggered from the VM itself.)

Note that the logs here come not only from vcpu0 but from every vCPU.

[  280.802976] kvm_set_msr_common: 1910 callbacks suppressed
[  280.802981] kvm [18411]: vcpu0, guest rIP: 0xffffffffa4c97844 disabled perfctr wrmsr: 0xc0010007 data 0xffff
[  295.345747] kvm [18411]: vcpu0, guest rIP: 0xfff0 disabled perfctr wrmsr: 0xc0010004 data 0x0
[  295.355379] kvm [18411]: vcpu0, guest rIP: 0xfff0 disabled perfctr wrmsr: 0xc0010005 data 0x0
[  295.364997] kvm [18411]: vcpu0, guest rIP: 0xfff0 disabled perfctr wrmsr: 0xc0010006 data 0x0
[  295.374618] kvm [18411]: vcpu0, guest rIP: 0xfff0 disabled perfctr wrmsr: 0xc0010007 data 0x0
[  295.385048] kvm [18411]: vcpu1, guest rIP: 0xfff0 disabled perfctr wrmsr: 0xc0010004 data 0x0
[  295.394694] kvm [18411]: vcpu1, guest rIP: 0xfff0 disabled perfctr wrmsr: 0xc0010005 data 0x0
[  295.404317] kvm [18411]: vcpu1, guest rIP: 0xfff0 disabled perfctr wrmsr: 0xc0010006 data 0x0
[  295.413928] kvm [18411]: vcpu1, guest rIP: 0xfff0 disabled perfctr wrmsr: 0xc0010007 data 0x0
[  295.424319] kvm [18411]: vcpu2, guest rIP: 0xfff0 disabled perfctr wrmsr: 0xc0010004 data 0x0
[  295.433963] kvm [18411]: vcpu2, guest rIP: 0xfff0 disabled perfctr wrmsr: 0xc0010005 data 0x0
[  301.966571] kvm_set_msr_common: 1910 callbacks suppressed
[  301.966577] kvm [18411]: vcpu0, guest rIP: 0xffffffff8ac97844 disabled perfctr wrmsr: 0xc0010007 data 0xffff

> 
> 
>> As you mentioned, another approach would be to use KVM_GET_MSRS to
>> specifically probe for AMD during QEMU initialization. In this case, we can
>> set 'kvm_pmu_disabled' to true if reading the AMD PMU MSR registers fails.
>>
>> To implement this, we may need to:
>>
>> 1. Turn this patch to be AMD specific by probing the AMD PMU registers during
>> initialization. We may need to create a new function in QEMU to use
>> KVM_GET_MSRS for probing only, or we may re-use
>> kvm_arch_get_supported_msr_feature() or kvm_get_one_msr(). I may change this
>> in the next version.
>>
>> 2. Limit the usage of 'kvm_pmu_disabled' to be AMD specific in PATCH 5/7.
> 
> I guess this might make things more complicated.
> 
>>>
>>> Additionally, in this scenario — where the user intends to enable a
>>> feature but the host cannot support it — normally no warning is emitted
>>> by QEMU.
>>
>> According to the usage of QEMU, may I assume QEMU already prints warning logs
>> for unsupported features? The below is an example.
>>
>> QEMU 10.2.50 monitor - type 'help' for more information
>> qemu-system-x86_64: warning: host doesn't support requested feature:
>> CPUID[eax=07h,ecx=00h].EBX.hle [bit 4]
>> qemu-system-x86_64: warning: host doesn't support requested feature:
>> CPUID[eax=07h,ecx=00h].EBX.rtm [bit 11]
>>
>>>
>>>> +            }
>>>> +        } else {
>>>> +            /*
>>>> +             * Since Linux v5.18, KVM provides a VM-level capability to easily
>>>> +             * disable PMUs; however, QEMU has been providing PMU property per
>>>> +             * CPU since v1.6. In order to accommodate both, have to configure
>>>> +             * the VM-level capability here.
>>>> +             *
>>>> +             * KVM_PMU_CAP_DISABLE doesn't change the PMU
>>>> +             * behavior on Intel platform because current "pmu" property works
>>>> +             * as expected.
>>>> +             */
>>>> +            if (pmu_cap & KVM_PMU_CAP_DISABLE) {
>>>> +                ret = kvm_vm_enable_cap(kvm_state, KVM_CAP_PMU_CAPABILITY, 0,
>>>> +                                        KVM_PMU_CAP_DISABLE);
>>>> +                if (ret < 0) {
>>>> +                    error_setg_errno(errp, -ret,
>>>> +                                     "Failed to set KVM_PMU_CAP_DISABLE");
>>>> +                    return ret;
>>>> +                }
>>>>              }
>>>>          }
>>>>      }
>>>> @@ -3302,6 +3313,7 @@ int kvm_arch_init(MachineState *ms, KVMState *s)
>>>>      int ret;
>>>>      struct utsname utsname;
>>>>      Error *local_err = NULL;
>>>> +    g_autofree char *kvm_enable_pmu;
>>>>  
>>>>      /*
>>>>       * Initialize confidential guest (SEV/TDX) context, if required
>>>> @@ -3437,6 +3449,21 @@ int kvm_arch_init(MachineState *ms, KVMState *s)
>>>>  
>>>>      pmu_cap = kvm_check_extension(s, KVM_CAP_PMU_CAPABILITY);
>>>>  
>>>> +    /*
>>>> +     * The enable_pmu parameter is introduced since Linux v5.17,
>>>> +     * give a chance to provide more information about vPMU
>>>> +     * enablement.
>>>> +     *
>>>> +     * The kvm.enable_pmu's permission is 0444. It does not change
>>>> +     * until a reload of the KVM module.
>>>> +     */
>>>> +    if (g_file_get_contents("/sys/module/kvm/parameters/enable_pmu",
>>>> +                            &kvm_enable_pmu, NULL, NULL)) {
>>>> +        if (*kvm_enable_pmu == 'N') {
>>>> +            kvm_pmu_disabled = true;
>>>
>>> It’s generally better not to rely on KVM’s internal implementation
>>> unless really necessary.
>>>
>>> For example, in the new mediated vPMU framework, even if the KVM module
>>> parameter enable_pmu is set, the per-guest kvm->arch.enable_pmu could
>>> still be cleared.
>>>
>>> In such a case, the logic here might not be correct.
>>
>> Would the Mediated vPMU set KVM_PMU_CAP_DISABLE to clear per-VM enable_pmu
>> even when the global KVM parameter enable_pmu=N is set?
>>
>> In this scenario, we plan to rely on KVM_PMU_CAP_DISABLE only when the value
>> of "/sys/module/kvm/parameters/enable_pmu" is not equal to N.
>>
>> Can I assume that this will work with Mediated vPMU?
>>
>>
>> Is there any possibility to follow the current approach before Mediated vPMU
>> is finalized for mainline, and later introduce an incremental change using
>> KVM_GET_MSRS probing? The current approach is straightforward and can work
>> with existing Linux kernel source code.
> 
> Apologies for the incorrect statement I made earlier regarding mediated
> vPMU.
> 
> According to the mediated vPMU v6, the only behavior specific to
> mediated vPMU is that kvm->arch.enable_pmu may be cleared when
> irqchip_in_kernel() is not true:
> https://lore.kernel.org/all/[email protected]/
>  
> 
> However, this does not imply that mediated vPMU requires any special
> handling here. In theory, KVM could clear kvm->arch.enable_pmu in the
> future for other reasons.
> 

While "KVM could clear kvm->arch.enable_pmu in the future," I don't think KVM
would ever set kvm->arch.enable_pmu when the global enable_pmu is set to 'N'.

Taking Intel VMX EPT as an example, once "/sys/module/kvm_intel/parameters/ept"
is globally disabled, there's no way within KVM software to enable it for any
guest VM **after** 'ept' is set to N.

Similarly, "/sys/module/kvm/parameters/enable_pmu=N" indicates that this KVM
host will not support PMU virtualization in any way. Therefore, there should be
no way to enable vPMU for any guest VM if the global parameter is set to 'N'.
Here we read from this parameter only during QEMU initialization.

That's why I believe it's reliable to trust the setting when
"/sys/module/kvm/parameters/enable_pmu=N".
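
To make the idea concrete, here is a minimal standalone sketch of the check
(not the code from the patch, which uses g_file_get_contents()). The enum and
the function name read_pmu_param() are hypothetical and exist only for this
illustration. It assumes the parameter file starts with 'Y' or 'N', and treats
a missing or empty file as "unknown" (for example, a kernel that predates the
parameter).

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical tri-state result; these are not QEMU identifiers. */
enum pmu_param_state {
    PMU_PARAM_UNKNOWN,   /* file missing, unreadable, or empty */
    PMU_PARAM_ENABLED,   /* first character is anything other than 'N' */
    PMU_PARAM_DISABLED,  /* first character is 'N' */
};

/*
 * Read the first character of a module parameter file, e.g.
 * "/sys/module/kvm/parameters/enable_pmu". The path is a parameter
 * so that the helper can be exercised against an ordinary file.
 */
static enum pmu_param_state read_pmu_param(const char *path)
{
    FILE *f;
    int c;

    assert(path != NULL);

    f = fopen(path, "r");
    if (!f) {
        return PMU_PARAM_UNKNOWN;
    }

    c = fgetc(f);
    fclose(f);

    if (c == EOF) {
        return PMU_PARAM_UNKNOWN;
    }
    return c == 'N' ? PMU_PARAM_DISABLED : PMU_PARAM_ENABLED;
}
```

Only PMU_PARAM_DISABLED would be treated as authoritative here. The other two
states say nothing about whether a given VM actually gets a vPMU, which matches
the asymmetry described above.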

In this way, we can avoid many unnecessary MSR writes, especially in cases where
a VM has 300+ vCPUs, even though these may be equivalent to NOPs with
optimizations in more recent KVM versions.

The objective isn't to improve performance. Minimizing the number of unwanted
MSR writes from QEMU reduces the chances of failure (e.g., due to any QEMU
software bug). We can simply avoid those unwanted MSR writes (or NOPs) by
reading a KVM parameter.

From a user's perspective, this just seems odd.
"/sys/module/kvm/parameters/enable_pmu=N" is a reliable setting. If there's a
configuration mismatch between QEMU and KVM, a warning could alert the user.
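
For reference, the decision made for the first vCPU in the quoted hunk can be
summarized as a pure function. This is only a sketch: the enum, the function
name and SKETCH_PMU_CAP_DISABLE are hypothetical stand-ins (the last one for
KVM_PMU_CAP_DISABLE, which I assume to be bit 0), and the real code calls
warn_report() and kvm_vm_enable_cap() instead of returning a value.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Stand-in for KVM_PMU_CAP_DISABLE; assumed to be bit 0. */
#define SKETCH_PMU_CAP_DISABLE (1u << 0)
static_assert(SKETCH_PMU_CAP_DISABLE == 1u, "stand-in must be bit 0");

/* Hypothetical action codes, for illustration only. */
enum pmu_action {
    PMU_ACT_NONE,             /* nothing to do */
    PMU_ACT_WARN_MISMATCH,    /* user asked for a PMU, KVM has it disabled */
    PMU_ACT_DISABLE_VIA_CAP,  /* user did not ask for a PMU; disable per VM */
};

static enum pmu_action pmu_first_vcpu_action(bool user_wants_pmu,
                                             bool kvm_pmu_disabled,
                                             uint64_t pmu_cap)
{
    if (user_wants_pmu) {
        /* QEMU "pmu=on" with kvm.enable_pmu=N: only a warning is possible. */
        return kvm_pmu_disabled ? PMU_ACT_WARN_MISMATCH : PMU_ACT_NONE;
    }

    /* QEMU "pmu=off": use the VM-level capability if KVM offers it. */
    if (pmu_cap & SKETCH_PMU_CAP_DISABLE) {
        return PMU_ACT_DISABLE_VIA_CAP;
    }
    return PMU_ACT_NONE;
}
```

The warning is the only branch that depends on 'kvm_pmu_disabled'; the
KVM_PMU_CAP_DISABLE branch is unchanged from the existing code.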

I can remove this patch, along with the 'kvm_pmu_disabled' variable.

Thank you very much!

Dongli Zhang
