Re: [RFCv2 13/13] KVM: unmap guest memory using poisoned pages
On 4/17/2021 1:30 AM, Sean Christopherson wrote:
> On Fri, Apr 16, 2021, Kirill A. Shutemov wrote:
> [...]
>> index fadaccb95a4c..cd2374802702 100644
>> --- a/include/linux/kvm_host.h
>> +++ b/include/linux/kvm_host.h
>> @@ -436,6 +436,8 @@ static inline int kvm_arch_vcpu_memslots_id(struct kvm_vcpu *vcpu)
>>  }
>>  #endif
>>
>> +#define KVM_NR_SHARED_RANGES 32
>> +
>>  /*
>>   * Note:
>>   * memslots are not sorted by id anymore, please use id_to_memslot()
>> @@ -513,6 +515,10 @@ struct kvm {
>>      pid_t userspace_pid;
>>      unsigned int max_halt_poll_ns;
>>      u32 dirty_ring_size;
>> +    bool mem_protected;
>> +    void *id;
>> +    int nr_shared_ranges;
>> +    struct range shared_ranges[KVM_NR_SHARED_RANGES];
>
> Hard no for me. IMO, anything that requires KVM to track shared/pinned
> pages in a separate tree/array is a non-starter. More specific to TDX
> #MCs, KVM should not be the canonical reference for the state of a page.

Do you mean something in struct page to track whether the page is shared or private?
Re: [PATCH v2 1/3] KVM: X86: Rename DR6_INIT to DR6_ACTIVE_LOW
On 2/3/2021 12:05 AM, Paolo Bonzini wrote:
> On 02/02/21 16:02, Xiaoyao Li wrote:
>> On 2/2/2021 10:49 PM, Paolo Bonzini wrote:
>>> On 02/02/21 10:04, Chenyi Qiang wrote:
>>>>  #define DR6_FIXED_1	0xfffe0ff0
>>>> -#define DR6_INIT	0xffff0ff0
>>>> +/*
>>>> + * DR6_ACTIVE_LOW is actually the result of DR6_FIXED_1 | ACTIVE_LOW_BITS.
>>>> + * We can regard all the current FIXED_1 bits as active-low bits even
>>>> + * though in no case will they be turned into 0. But if they are defined
>>>> + * in the future, it will require no code change.
>>>> + * At the same time, it can serve as the init/reset value for DR6.
>>>> + */
>>>> +#define DR6_ACTIVE_LOW	0xffff0ff0
>>>>  #define DR6_VOLATILE	0x0001e00f
>>> Committed with some changes in the wording of the comment.
>> Also, DR6_FIXED_1 is (DR6_ACTIVE_LOW & ~DR6_VOLATILE). Maybe we can add
>> a BUILD_BUG_ON() to make sure of the correctness? We can even
>> #define DR6_FIXED_1 (DR6_ACTIVE_LOW & ~DR6_VOLATILE) directly.
> I have pushed this patch to kvm/queue, but the other two will have to
> wait for Fenghua's bare metal support.

Hi Paolo,

Fenghua's bare metal support is in the tip tree now.

https://lore.kernel.org/lkml/20210322135325.682257-1-fenghua...@intel.com/

Will the rest of the KVM patches get into 5.13 together?
Re: [PATCH v1 1/1] x86/tdx: Handle MWAIT, MONITOR and WBINVD
On 3/27/2021 8:18 AM, Kuppuswamy Sathyanarayanan wrote:
> In non-root TDX guest mode, MWAIT, MONITOR and WBINVD instructions are
> not supported. So handle #VE due to these instructions as no-ops.
>
> Signed-off-by: Kuppuswamy Sathyanarayanan
> Reviewed-by: Andi Kleen
> ---
> Changes since previous series:
>  * Suppressed MWAIT feature as per Andi's comment.
>  * Added warning debug log for MWAIT #VE exception.
>
>  arch/x86/kernel/tdx.c | 23 +++
>  1 file changed, 23 insertions(+)
>
> diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
> index e936b2f88bf6..fb7d22b846fc 100644
> --- a/arch/x86/kernel/tdx.c
> +++ b/arch/x86/kernel/tdx.c
> @@ -308,6 +308,9 @@ void __init tdx_early_init(void)
>
>      setup_force_cpu_cap(X86_FEATURE_TDX_GUEST);
>
> +    /* MWAIT is not supported in TDX platform, so suppress it */
> +    setup_clear_cpu_cap(X86_FEATURE_MWAIT);

In fact, the MWAIT bit returned by the CPUID instruction is zero for a TD guest. This is enforced by the SEAM module. Do we still need to safeguard it with setup_clear_cpu_cap() here?

> +
>      tdg_get_info();
>
>      pv_ops.irq.safe_halt = tdg_safe_halt;
Re: [PATCH v2 1/3] KVM: X86: Rename DR6_INIT to DR6_ACTIVE_LOW
On 2/2/2021 10:49 PM, Paolo Bonzini wrote:
> On 02/02/21 10:04, Chenyi Qiang wrote:
>>  #define DR6_FIXED_1	0xfffe0ff0
>> -#define DR6_INIT	0xffff0ff0
>> +/*
>> + * DR6_ACTIVE_LOW is actually the result of DR6_FIXED_1 | ACTIVE_LOW_BITS.
>> + * We can regard all the current FIXED_1 bits as active-low bits even
>> + * though in no case will they be turned into 0. But if they are defined
>> + * in the future, it will require no code change.
>> + * At the same time, it can serve as the init/reset value for DR6.
>> + */
>> +#define DR6_ACTIVE_LOW	0xffff0ff0
>>  #define DR6_VOLATILE	0x0001e00f
> Committed with some changes in the wording of the comment.

Also, DR6_FIXED_1 is (DR6_ACTIVE_LOW & ~DR6_VOLATILE). Maybe we can add a BUILD_BUG_ON() to make sure of the correctness?

> Paolo
Re: [RESEND PATCH 1/2] KVM: X86: Add support for the emulation of DR6_BUS_LOCK bit
On 1/28/2021 3:25 PM, Paolo Bonzini wrote:
> On 28/01/21 08:17, Xiaoyao Li wrote:
>>> "Active low" means that the bit is usually 1 and goes to 0 when the
>>> condition (such as RTM or bus lock) happens. For almost all those DR6
>>> bits the value is in fact always 1, but if they are defined in the
>>> future it will require no code change.
>> Why not keep using DR6_INIT or DR6_RESET_VALUE, or any other better
>> name? It's just the default clear value of DR6 when no debug condition
>> is hit.
>
> I preferred "DR6_ACTIVE_LOW" because the value is used only once or
> twice to initialize dr6, and many times to invert those bits. For
> example:
>
>     vcpu->arch.dr6 &= ~DR_TRAP_BITS;
>     vcpu->arch.dr6 |= DR6_ACTIVE_LOW;
>
>     vcpu->arch.dr6 |= payload;
>     vcpu->arch.dr6 ^= payload & DR6_ACTIVE_LOW;
>
>     payload = vcpu->arch.dr6;
>     payload &= ~DR6_BT;
>     payload ^= DR6_ACTIVE_LOW;
>
> The name conveys that it's not just the initialization value; it's also
> the XOR mask between the #DB exit qualification (which we also use as
> the "payload") and DR6.

Fair enough.
Re: [RESEND PATCH 1/2] KVM: X86: Add support for the emulation of DR6_BUS_LOCK bit
On 1/27/2021 6:04 PM, Paolo Bonzini wrote:
> On 27/01/21 04:41, Xiaoyao Li wrote:
>> On 1/27/2021 12:31 AM, Paolo Bonzini wrote:
>>> On 08/01/21 07:49, Chenyi Qiang wrote:
>>>> To avoid breaking the CPUs without bus lock detection, activate the
>>>> DR6_BUS_LOCK bit (bit 11) conditionally in DR6_FIXED_1 bits. The
>>>> set/clear of DR6_BUS_LOCK is similar to the DR6_RTM in the DR6
>>>> register. The processor clears DR6_BUS_LOCK when the bus lock debug
>>>> exception is generated. (For all other #DB the processor sets this
>>>> bit to 1.) The software #DB handler should set this bit before
>>>> returning to the interrupted task. For a VM exit caused by a debug
>>>> exception, bit 11 of the exit qualification is set to indicate that
>>>> a bus lock debug exception condition was detected. The VMM should
>>>> emulate the exception by clearing bit 11 of the guest DR6.
>>> Please rename DR6_INIT to DR6_ACTIVE_LOW, and then a lot of changes
>>> become simpler:
>> Paolo,
>> What do you want to convey with the new name DR6_ACTIVE_LOW? To be
>> honest, the new name is confusing to me.
> "Active low" means that the bit is usually 1 and goes to 0 when the
> condition (such as RTM or bus lock) happens. For almost all those DR6
> bits the value is in fact always 1, but if they are defined in the
> future it will require no code change.

Why not keep using DR6_INIT or DR6_RESET_VALUE, or any other better name? It's just the default clear value of DR6 when no debug condition is hit.

> Paolo
>>> -	dr6 |= DR6_BD | DR6_RTM;
>>> +	dr6 |= DR6_BD | DR6_RTM | DR6_BUS_LOCK;
>>>
>>> 	dr6 |= DR6_BD | DR6_ACTIVE_LOW;
Re: [RESEND PATCH 1/2] KVM: X86: Add support for the emulation of DR6_BUS_LOCK bit
On 1/27/2021 12:31 AM, Paolo Bonzini wrote:
> On 08/01/21 07:49, Chenyi Qiang wrote:
>> To avoid breaking the CPUs without bus lock detection, activate the
>> DR6_BUS_LOCK bit (bit 11) conditionally in DR6_FIXED_1 bits. The
>> set/clear of DR6_BUS_LOCK is similar to the DR6_RTM in the DR6
>> register. The processor clears DR6_BUS_LOCK when the bus lock debug
>> exception is generated. (For all other #DB the processor sets this bit
>> to 1.) The software #DB handler should set this bit before returning
>> to the interrupted task. For a VM exit caused by a debug exception,
>> bit 11 of the exit qualification is set to indicate that a bus lock
>> debug exception condition was detected. The VMM should emulate the
>> exception by clearing bit 11 of the guest DR6.
> Please rename DR6_INIT to DR6_ACTIVE_LOW, and then a lot of changes
> become simpler:

Paolo,

What do you want to convey with the new name DR6_ACTIVE_LOW? To be honest, the new name is confusing to me.

> -	dr6 |= DR6_BD | DR6_RTM;
> +	dr6 |= DR6_BD | DR6_RTM | DR6_BUS_LOCK;
>
> 	dr6 |= DR6_BD | DR6_ACTIVE_LOW;
Re: [PATCH RFC v3 2/4] x86/bus_lock: Handle warn and fatal in #DB for bus lock
On 10/31/2020 8:27 AM, Fenghua Yu wrote:
> ...
> diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
> index 3c70fb34028b..1c3442000972 100644
> --- a/arch/x86/kernel/traps.c
> +++ b/arch/x86/kernel/traps.c
> @@ -953,6 +953,13 @@ static __always_inline void exc_debug_user(struct pt_regs *regs,
>          goto out_irq;
>      }
>
> +    /*
> +     * Handle bus lock. #DB for bus lock can only be triggered from
> +     * userspace.
> +     */
> +    if (!(dr6 & DR_BUS_LOCK))

It should be

	if (dr6 & DR_BUS_LOCK)

since you keep DR6 bit 11 reserved in this version: bit 11 of the value returned by debug_read_clear_dr6() being set to 1 means a bus lock was detected.

> +        handle_bus_lock(regs);
> +
>      /* Add the virtual_dr6 bits for signals. */
>      dr6 |= current->thread.virtual_dr6;
>      if (dr6 & (DR_STEP | DR_TRAP_BITS) || icebp)
Re: [PATCH] KVM: VMX: Enable Notify VM exit
On 11/3/2020 2:08 PM, Tao Xu wrote:
> On 11/3/20 12:43 AM, Andy Lutomirski wrote:
>> On Sun, Nov 1, 2020 at 10:14 PM Tao Xu wrote:
>> ...
>>> +static int handle_notify(struct kvm_vcpu *vcpu)
>>> +{
>>> +    unsigned long exit_qualification = vmcs_readl(EXIT_QUALIFICATION);
>>> +
>>> +    /*
>>> +     * Notify VM exit happened while executing iret from NMI,
>>> +     * "blocked by NMI" bit has to be set before next VM entry.
>>> +     */
>>> +    if (exit_qualification & NOTIFY_VM_CONTEXT_VALID) {
>>> +        if (enable_vnmi &&
>>> +            (exit_qualification & INTR_INFO_UNBLOCK_NMI))
>>> +            vmcs_set_bits(GUEST_INTERRUPTIBILITY_INFO,
>>> +                          GUEST_INTR_STATE_NMI);
>> This needs actual documentation in the SDM or at least ISE please.

Hi Andy,

Do you mean the SDM or ISE should call out that the VMM needs to restore "blocked by NMI" if bit 12 of the exit qualification is set and it decides to re-enter the guest? You can refer to Section 27.2.3 "Information About NMI Unblocking Due to IRET" in the latest SDM (325462-072US).

> Notify VM exit is defined in the ISE, chapter 9.2:
> https://software.intel.com/content/dam/develop/external/us/en/documents/architecture-instruction-set-extensions-programming-reference.pdf
> I will add this information to the commit message. Thank you for
> reminding me.
Re: [PATCH] KVM: VMX: Enable Notify VM exit
On 11/3/2020 2:25 AM, Paolo Bonzini wrote:
> On 02/11/20 19:01, Andy Lutomirski wrote:
>> What's the point? Surely the kernel should reliably mitigate the flaw,
>> and the kernel should decide how to do so.
> There is some slowdown in trapping #DB and #AC unconditionally. Though
> for these two cases nobody should care, so I agree with keeping the
> code simple and keeping the workaround.

OK.

>> Also, why would this trigger after more than a few hundred cycles,
>> something like the length of the longest microcode loop? HZ*10 seems
>> like a very generous estimate already.

As Sean said in another mail, 1/10 of a tick is just a placeholder. Glad to see all of you think it should be smaller. We'll come up with a more reasonable value once we can test on real silicon.
Re: [PATCH] KVM: VMX: Enable Notify VM exit
On 11/3/2020 2:12 PM, Tao Xu wrote:
> On 11/3/20 6:53 AM, Jim Mattson wrote:
>> On Sun, Nov 1, 2020 at 10:14 PM Tao Xu wrote:
>>> There are some cases where a malicious virtual machine can cause the
>>> CPU to get stuck (event windows don't open up), e.g., an infinite
>>> loop in microcode when a nested #AC is hit (CVE-2015-5307). No event
>>> window obviously means no events: NMIs, SMIs, and IRQs will all be
>>> blocked, which may make the affected hardware CPU unusable by the
>>> host or other VMs. To resolve those cases, a notify VM exit can be
>>> enabled: it triggers if no event window occurs in VMX non-root mode
>>> for a specified amount of time (the notify window). Expose a module
>>> param for setting the notify window; by default it is set to 1/10 of
>>> a periodic tick, and the user can set it to 0 to disable this feature.
>>>
>>> TODO:
>>> 1. The appropriate value of the notify window.
>>> 2. Another patch to disable interception of #DB and #AC when notify
>>>    VM exiting is enabled.
>>>
>>> Co-developed-by: Xiaoyao Li
>>> Signed-off-by: Tao Xu
>>> Signed-off-by: Xiaoyao Li
>>
>> Do you have test cases?

Yes, we have. The nested #AC (CVE-2015-5307) is a known test case, though we need to tweak KVM to disable interception of #AC for it.

> Not yet, because we are waiting for real silicon to do some tests. I
> should add RFC next time, before I test it on hardware.
Re: [PATCH] KVM: X86: Expose KVM_HINTS_REALTIME in KVM_GET_SUPPORTED_CPUID
On 10/22/2020 10:06 PM, Paolo Bonzini wrote:
> On 22/10/20 15:31, Xiaoyao Li wrote:
>>> It's common for userspace to copy all supported CPUID bits to
>>> KVM_SET_CPUID2; I don't think this is the right behavior for
>>> KVM_HINTS_REALTIME. It reminds me of X86_FEATURE_WAITPKG, which was
>>> added to supported CPUID recently as a fix but QEMU exposes it to the
>>> guest only with "-overcommit cpu-pm".
> WAITPKG is not included in KVM_GET_SUPPORTED_CPUID either. QEMU detects
> it through the MSR_IA32_UMWAIT register.

Doesn't 0abcc8f65cc2 ("KVM: VMX: enable X86_FEATURE_WAITPKG in KVM capabilities") add WAITPKG to KVM_GET_SUPPORTED_CPUID?

> Paolo
Re: [PATCH] KVM: X86: Expose KVM_HINTS_REALTIME in KVM_GET_SUPPORTED_CPUID
On 10/22/2020 9:02 PM, Paolo Bonzini wrote:
> On 22/10/20 03:34, Wanpeng Li wrote:
>> From: Wanpeng Li
>>
>> Per the KVM_GET_SUPPORTED_CPUID ioctl documentation:
>>
>>   This ioctl returns x86 cpuid features which are supported by both
>>   the hardware and kvm in its default configuration.
>>
>> A well-behaved userspace should not set the bit if it is not supported.
>>
>> Suggested-by: Jim Mattson
>> Signed-off-by: Wanpeng Li
> It's common for userspace to copy all supported CPUID bits to
> KVM_SET_CPUID2; I don't think this is the right behavior for
> KVM_HINTS_REALTIME. It reminds me of X86_FEATURE_WAITPKG, which was
> added to supported CPUID recently as a fix but QEMU exposes it to the
> guest only with "-overcommit cpu-pm".
>
> (But maybe this was discussed already; if so, please point me to the
> previous discussion.)
>
> Paolo
Re: [RFC v2 2/2] KVM: VMX: Enable bus lock VM exit
On 9/3/2020 6:44 AM, Sean Christopherson wrote:
> On Tue, Sep 01, 2020 at 10:43:12AM +0200, Vitaly Kuznetsov wrote:
>>> @@ -6809,6 +6824,19 @@ static fastpath_t vmx_vcpu_run(struct kvm_vcpu *vcpu)
>>>      if (unlikely(vmx->exit_reason.failed_vmentry))
>>>          return EXIT_FASTPATH_NONE;
>>>
>>> +    /*
>>> +     * check the exit_reason to see if there is a bus lock
>>> +     * happened in guest.
>>> +     */
>>> +    if (kvm_bus_lock_exit_enabled(vmx->vcpu.kvm)) {
>>> +        if (vmx->exit_reason.bus_lock_detected) {
>>> +            vcpu->stat.bus_locks++;
>
> Why bother with stats? Every bus lock exits to userspace, having quick
> stats doesn't seem all that interesting.

OK. We will remove it.

>>> +            vcpu->arch.bus_lock_detected = true;
>>> +        } else {
>>> +            vcpu->arch.bus_lock_detected = false;
>>
>> This is a fast path so I'm wondering if we can move bus_lock_detected
>> clearing somewhere else.
>
> Why even snapshot vmx->exit_reason.bus_lock_detected? I don't see any
> reason why vcpu_enter_guest() needs to handle the exit to userspace,
> e.g. it's just as easily handled in VMX code.

Because we want to handle the exit to userspace in only one place, i.e., after kvm_x86_ops.handle_exit(vcpu, exit_fastpath). Otherwise, we would have to check vmx->exit_reason.bus_lock_detected in every other handler, at least in those that can theoretically preempt the bus lock VM exit.

>>> +        }
>>> +    }
>>> +
>>>      vmx->loaded_vmcs->launched = 1;
>>>      vmx->idt_vectoring_info = vmcs_read32(IDT_VECTORING_INFO_FIELD);
>>>
>>> @@ -8060,6 +8088,9 @@ static __init int hardware_setup(void)
>>>          kvm_tsc_scaling_ratio_frac_bits = 48;
>>>      }
>>>
>>> +    if (cpu_has_vmx_bus_lock_detection())
>>> +        kvm_has_bus_lock_exit = true;
>>> +
>>>      set_bit(0, vmx_vpid_bitmap); /* 0 is reserved for host */
>>>
>>>      if (enable_ept)
>> ...
>>> @@ -4990,6 +4996,12 @@ int kvm_vm_ioctl_enable_cap(struct kvm *kvm,
>>>          kvm->arch.exception_payload_enabled = cap->args[0];
>>>          r = 0;
>>>          break;
>>> +    case KVM_CAP_X86_BUS_LOCK_EXIT:
>>> +        if (!kvm_has_bus_lock_exit)
>>> +            return -EINVAL;
>
> ... because userspace can check for -EINVAL when enabling the cap.
>
>> Or we can return e.g. -EOPNOTSUPP here. I don't have a strong opinion
>> on the matter...
>
>>> +        kvm->arch.bus_lock_exit = cap->args[0];
>
> Assuming we even want to make this per-VM, I think it'd make sense to
> make args[0] a bit mask, e.g. to provide "off" and "exit" (this
> behavior) while allowing for future modes, e.g. log-only.

Good idea, will do it in the next version.

>>> +        r = 0;
>>> +        break;
>>>      default:
>>>          r = -EINVAL;
>>>          break;
>>> @@ -7732,12 +7744,23 @@ static void post_kvm_run_save(struct kvm_vcpu *vcpu)
Re: [PATCH 1/5] KVM: nVMX: Fix VMX controls MSRs setup when nested VMX enabled
On 8/28/2020 4:56 PM, Chenyi Qiang wrote:
> KVM supports the nested VM_{EXIT, ENTRY}_LOAD_IA32_PERF_GLOBAL_CTRL and
> VM_{ENTRY_LOAD, EXIT_CLEAR}_BNDCFGS controls, but they are not exposed
> during the setup of the nested VMX controls MSRs.
>
> Signed-off-by: Chenyi Qiang

Reviewed-by: Xiaoyao Li

> ---
>  arch/x86/kvm/vmx/nested.c | 6 ++++--
>  1 file changed, 4 insertions(+), 2 deletions(-)
>
> diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
> index 23b58c28a1c9..6e0e71f4d45f 100644
> --- a/arch/x86/kvm/vmx/nested.c
> +++ b/arch/x86/kvm/vmx/nested.c
> @@ -6310,7 +6310,8 @@ void nested_vmx_setup_ctls_msrs(struct nested_vmx_msrs *msrs, u32 ept_caps)
>  #ifdef CONFIG_X86_64
>          VM_EXIT_HOST_ADDR_SPACE_SIZE |
>  #endif
> -        VM_EXIT_LOAD_IA32_PAT | VM_EXIT_SAVE_IA32_PAT;
> +        VM_EXIT_LOAD_IA32_PAT | VM_EXIT_SAVE_IA32_PAT |
> +        VM_EXIT_CLEAR_BNDCFGS | VM_EXIT_LOAD_IA32_PERF_GLOBAL_CTRL;
>      msrs->exit_ctls_high |=
>          VM_EXIT_ALWAYSON_WITHOUT_TRUE_MSR |
>          VM_EXIT_LOAD_IA32_EFER | VM_EXIT_SAVE_IA32_EFER |
> @@ -6329,7 +6330,8 @@ void nested_vmx_setup_ctls_msrs(struct nested_vmx_msrs *msrs, u32 ept_caps)
>  #ifdef CONFIG_X86_64
>          VM_ENTRY_IA32E_MODE |
>  #endif
> -        VM_ENTRY_LOAD_IA32_PAT;
> +        VM_ENTRY_LOAD_IA32_PAT | VM_ENTRY_LOAD_BNDCFGS |
> +        VM_ENTRY_LOAD_IA32_PERF_GLOBAL_CTRL;
>      msrs->entry_ctls_high |=
>          (VM_ENTRY_ALWAYSON_WITHOUT_TRUE_MSR | VM_ENTRY_LOAD_IA32_EFER);
Re: [PATCH 1/5] KVM: nVMX: Fix VMX controls MSRs setup when nested VMX enabled
On 8/29/2020 9:49 AM, Chenyi Qiang wrote:
> On 8/29/2020 1:43 AM, Jim Mattson wrote:
>> On Fri, Aug 28, 2020 at 1:54 AM Chenyi Qiang wrote:
>>> KVM supports the nested VM_{EXIT, ENTRY}_LOAD_IA32_PERF_GLOBAL_CTRL
>>> and VM_{ENTRY_LOAD, EXIT_CLEAR}_BNDCFGS controls, but they are not
>>> exposed during the setup of the nested VMX controls MSRs.
>> Aren't these features added conditionally in
>> nested_vmx_entry_exit_ctls_update() and
>> nested_vmx_pmu_entry_exit_ctls_update()?
> Yes, but I assume vmcs_config.nested should reflect the global
> capability of the VMX MSRs. KVM supports these two controls, so they
> should be exposed here.

No. I prefer to say they are removed conditionally in nested_vmx_entry_exit_ctls_update() and nested_vmx_pmu_entry_exit_ctls_update().

Userspace calls vmx_get_msr_feature() to query what KVM supports for these VMX MSRs. In vmx_get_msr_feature(), it returns the value of vmcs_config.nested. As KVM supports these two bits, we should advertise them in vmcs_config.nested and report them to userspace.
[PATCH v10 2/9] x86/split_lock: Remove bogus case in handle_guest_split_lock()
The bogus case can never happen: when sld_state == sld_off, the guest won't trigger a split lock #AC, so handle_guest_split_lock() will never be called. Besides, dropping the bogus case also makes it easier for a future patch to remove sld_state, once we agree that it must be sld_warn or sld_fatal when handle_guest_split_lock() is called.

Signed-off-by: Xiaoyao Li
---
The alternative would be to remove the "SLD enabled" check from KVM so that a truly unexpected/bogus #AC would generate a WARN. It's not clear whether or not calling handle_guest_split_lock() iff SLD is enabled was intended in the long term.
---
 arch/x86/kernel/cpu/intel.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kernel/cpu/intel.c b/arch/x86/kernel/cpu/intel.c
index c48b3267c141..5dab842ba7e1 100644
--- a/arch/x86/kernel/cpu/intel.c
+++ b/arch/x86/kernel/cpu/intel.c
@@ -1104,9 +1104,8 @@ bool handle_guest_split_lock(unsigned long ip)
         return true;
     }
 
-    pr_warn_once("#AC: %s/%d %s split_lock trap at address: 0x%lx\n",
-                 current->comm, current->pid,
-                 sld_state == sld_fatal ? "fatal" : "bogus", ip);
+    pr_warn_once("#AC: %s/%d fatal split_lock trap at address: 0x%lx\n",
+                 current->comm, current->pid, ip);
 
     current->thread.error_code = 0;
     current->thread.trap_nr = X86_TRAP_AC;
-- 
2.18.4
[PATCH v10 9/9] x86/split_lock: Enable split lock detection initialization when running as a guest on KVM
When running as a guest, enumerating the split lock detection feature through the CPU model is not easy, since the CPU model is configurable by the host VMM. If running on KVM, it can be enumerated through KVM_FEATURE_SPLIT_LOCK_DETECT, and if KVM_HINTS_SLD_FATAL is set, it needs to be set to sld_fatal mode.

Signed-off-by: Xiaoyao Li
---
 arch/x86/include/asm/cpu.h  |  2 ++
 arch/x86/kernel/cpu/intel.c | 12 ++--
 arch/x86/kernel/kvm.c       |  3 +++
 3 files changed, 15 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/cpu.h b/arch/x86/include/asm/cpu.h
index 2971a29d5094..5520cc1cbb68 100644
--- a/arch/x86/include/asm/cpu.h
+++ b/arch/x86/include/asm/cpu.h
@@ -42,12 +42,14 @@ unsigned int x86_model(unsigned int sig);
 unsigned int x86_stepping(unsigned int sig);
 #ifdef CONFIG_CPU_SUP_INTEL
 extern void __init cpu_set_core_cap_bits(struct cpuinfo_x86 *c);
+extern void __init split_lock_setup(bool fatal);
 extern void switch_to_sld(unsigned long tifn);
 extern bool handle_user_split_lock(struct pt_regs *regs, long error_code);
 extern bool handle_guest_split_lock(unsigned long ip);
 extern bool split_lock_virt_switch(bool on);
 #else
 static inline void __init cpu_set_core_cap_bits(struct cpuinfo_x86 *c) {}
+static inline void __init split_lock_setup(bool fatal) {}
 static inline void switch_to_sld(unsigned long tifn) {}
 static inline bool handle_user_split_lock(struct pt_regs *regs, long error_code)
 {
diff --git a/arch/x86/kernel/cpu/intel.c b/arch/x86/kernel/cpu/intel.c
index b16f3dd9b9c2..8cadd2fd8be6 100644
--- a/arch/x86/kernel/cpu/intel.c
+++ b/arch/x86/kernel/cpu/intel.c
@@ -1011,12 +1011,18 @@ static bool split_lock_verify_msr(bool on)
     return ctrl == tmp;
 }
 
-static void __init split_lock_setup(void)
+void __init split_lock_setup(bool fatal)
 {
     enum split_lock_detect_state state = sld_warn;
     char arg[20];
     int i, ret;
 
+    if (fatal) {
+        state = sld_fatal;
+        pr_info("forced on, sending SIGBUS on user-space split_locks\n");
+        goto set_cap;
+    }
+
     if (!split_lock_verify_msr(false)) {
         pr_info("MSR access failed: Disabled\n");
         return;
@@ -1052,6 +1058,7 @@ static void __init split_lock_setup(void)
         return;
     }
 
+set_cap:
     cpu_model_supports_sld = true;
     setup_force_cpu_cap(X86_FEATURE_SPLIT_LOCK_DETECT);
     if (state == sld_fatal)
@@ -1183,6 +1190,7 @@ void __init cpu_set_core_cap_bits(struct cpuinfo_x86 *c)
     const struct x86_cpu_id *m;
     u64 ia32_core_caps;
 
+    /* Note, paravirt support can enable SLD, e.g., see kvm_guest_init(). */
     if (boot_cpu_has(X86_FEATURE_HYPERVISOR))
         return;
@@ -1204,5 +1212,5 @@ void __init cpu_set_core_cap_bits(struct cpuinfo_x86 *c)
         return;
     }
 
-    split_lock_setup();
+    split_lock_setup(false);
 }
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index 08320b0b2b27..25cd5d7f1e51 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -687,6 +687,9 @@ static void __init kvm_guest_init(void)
      * overcommitted.
      */
     hardlockup_detector_disable();
+
+    if (kvm_para_has_feature(KVM_FEATURE_SPLIT_LOCK_DETECT))
+        split_lock_setup(kvm_para_has_hint(KVM_HINTS_SLD_FATAL));
 }
 
 static noinline uint32_t __kvm_cpuid_base(void)
-- 
2.18.4
[PATCH v10 7/9] KVM: VMX: virtualize split lock detection
TEST_CTRL MSR is per-core scope, i.e., sibling threads in the same physical core share the same MSR. This requires an additional constraint when exposing it to a guest.

1) When the host SLD state is sld_off (no X86_FEATURE_SPLIT_LOCK_DETECT), split lock detection is unsupported/disabled. It cannot be exposed to the guest.

2) When the host SLD state is sld_warn (X86_FEATURE_SPLIT_LOCK_DETECT but no X86_FEATURE_SLD_FATAL), split lock detection can be exposed to the guest only with nosmt, due to the per-core scope. In this case, the guest's setting can be propagated into the real hardware MSR. Further, to avoid the potential MSR_TEST_CTRL.SLD toggling overhead on every VM-enter and VM-exit, the guest's SLD setting is loaded and kept while in the vcpu run loop and guest_state_loaded, i.e., between vmx_prepare_switch_to_guest() and vmx_prepare_switch_to_host().

3) When the host SLD state is sld_fatal (X86_FEATURE_SLD_FATAL), split lock detection can be exposed to the guest regardless of SMT, but KVM_HINTS_SLD_FATAL needs to be set. In this case, the guest can still set and clear the MSR_TEST_CTRL.SLD bit, but the bit value is never propagated to the real MSR; KVM always keeps the SLD bit turned on for the guest vcpu. The reason for not forcing the guest MSR_TEST_CTRL.SLD bit to 1 is that the guest needs to set this bit to 1 itself to tell KVM it's SLD-aware.
Signed-off-by: Xiaoyao Li
---
 arch/x86/kvm/cpuid.c   |  6 ++++++
 arch/x86/kvm/vmx/vmx.c | 68 ++++++++++++++++++++++++++++++++++++------
 arch/x86/kvm/vmx/vmx.h |  3 +++
 arch/x86/kvm/x86.c     |  6 +++-
 arch/x86/kvm/x86.h     | 16 ++++++++++
 5 files changed, 89 insertions(+), 10 deletions(-)

diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index 3fd6eec202d7..58d902fb12c0 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -751,9 +751,15 @@ static inline int __do_cpuid_func(struct kvm_cpuid_array *array, u32 function)
         if (sched_info_on())
             entry->eax |= (1 << KVM_FEATURE_STEAL_TIME);
 
+        if (kvm_split_lock_detect_supported())
+            entry->eax |= (1 << KVM_FEATURE_SPLIT_LOCK_DETECT);
+
         entry->ebx = 0;
         entry->ecx = 0;
         entry->edx = 0;
+
+        if (boot_cpu_has(X86_FEATURE_SLD_FATAL))
+            entry->edx |= (1 << KVM_HINTS_SLD_FATAL);
         break;
     case 0x80000000:
         entry->eax = min(entry->eax, 0x8000001f);
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index e113c9171bcc..e806c2855c9e 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -1123,6 +1123,29 @@ void vmx_set_host_fs_gs(struct vmcs_host_state *host, u16 fs_sel, u16 gs_sel,
     }
 }
 
+static inline u64 vmx_msr_test_ctrl_valid_bits(struct vcpu_vmx *vmx)
+{
+    u64 valid_bits = 0;
+
+    if (vmx->guest_has_sld)
+        valid_bits |= MSR_TEST_CTRL_SPLIT_LOCK_DETECT;
+
+    return valid_bits;
+}
+
+static inline bool guest_sld_on(struct vcpu_vmx *vmx)
+{
+    return vmx->msr_test_ctrl & MSR_TEST_CTRL_SPLIT_LOCK_DETECT;
+}
+
+static inline void vmx_update_guest_sld(struct vcpu_vmx *vmx)
+{
+    preempt_disable();
+    if (vmx->guest_state_loaded)
+        split_lock_set_guest(guest_sld_on(vmx));
+    preempt_enable();
+}
+
 void vmx_prepare_switch_to_guest(struct kvm_vcpu *vcpu)
 {
     struct vcpu_vmx *vmx = to_vmx(vcpu);
@@ -1191,6 +1214,10 @@ void vmx_prepare_switch_to_guest(struct kvm_vcpu *vcpu)
 #endif
 
     vmx_set_host_fs_gs(host_state, fs_sel, gs_sel, fs_base, gs_base);
+
+    if (static_cpu_has(X86_FEATURE_SPLIT_LOCK_DETECT) && vmx->guest_has_sld)
+        vmx->host_sld_on = split_lock_set_guest(guest_sld_on(vmx));
+
     vmx->guest_state_loaded = true;
 }
 
@@ -1229,6 +1256,10 @@ static void vmx_prepare_switch_to_host(struct vcpu_vmx *vmx)
         wrmsrl(MSR_KERNEL_GS_BASE, vmx->msr_host_kernel_gs_base);
 #endif
     load_fixmap_gdt(raw_smp_processor_id());
+
+    if (static_cpu_has(X86_FEATURE_SPLIT_LOCK_DETECT) && vmx->guest_has_sld)
+        split_lock_restore_host(vmx->host_sld_on);
+
     vmx->guest_state_loaded = false;
     vmx->guest_msrs_ready = false;
 }
@@ -1833,7 +1864,7 @@ static int vmx_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 
     switch (msr_info->index) {
     case MSR_TEST_CTRL:
-        msr_info->data = 0;
+        msr_info->data = vmx->msr_test_ctrl;
         break;
 #ifdef CONFIG_X86_64
     case MSR_FS_BASE:
@@ -1999,9 +2030,12 @@ static int vmx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 
     switch (msr_index) {
     case MSR_TEST_CTRL:
-        if (data)
+        if (data & ~vmx_msr_test_ctrl_valid_bits(vmx))
             return 1;
[PATCH v10 6/9] KVM: VMX: Enable MSR TEST_CTRL for guest
Unconditionally allow the guest to read and zero-write MSR TEST_CTRL.

This matches the fact that most Intel CPUs support MSR TEST_CTRL, and it also alleviates the effort to handle wrmsr/rdmsr when split lock detection is exposed to the guest in a future patch.

Signed-off-by: Xiaoyao Li
---
 arch/x86/kvm/vmx/vmx.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 46ba2e03a892..e113c9171bcc 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -1832,6 +1832,9 @@ static int vmx_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
     u32 index;
 
     switch (msr_info->index) {
+    case MSR_TEST_CTRL:
+        msr_info->data = 0;
+        break;
 #ifdef CONFIG_X86_64
     case MSR_FS_BASE:
         msr_info->data = vmcs_readl(GUEST_FS_BASE);
@@ -1995,6 +1998,11 @@ static int vmx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
     u32 index;
 
     switch (msr_index) {
+    case MSR_TEST_CTRL:
+        if (data)
+            return 1;
+
+        break;
     case MSR_EFER:
         ret = kvm_set_msr_common(vcpu, msr_info);
         break;
-- 
2.18.4
[PATCH v10 8/9] x86/split_lock: Set cpu_model_supports_sld to true after the existence of split lock detection is verified on BSP
It should be more reasonable to set cpu_model_supports_sld to true after the BSP is verified to have the split lock detection feature. It also avoids externing cpu_model_supports_sld when enumerating split lock detection in a guest via a PV interface in a future patch.

Signed-off-by: Xiaoyao Li
---
 arch/x86/kernel/cpu/intel.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/kernel/cpu/intel.c b/arch/x86/kernel/cpu/intel.c
index 5f44e0de04b9..b16f3dd9b9c2 100644
--- a/arch/x86/kernel/cpu/intel.c
+++ b/arch/x86/kernel/cpu/intel.c
@@ -1052,6 +1052,7 @@ static void __init split_lock_setup(void)
         return;
     }
 
+    cpu_model_supports_sld = true;
     setup_force_cpu_cap(X86_FEATURE_SPLIT_LOCK_DETECT);
     if (state == sld_fatal)
         setup_force_cpu_cap(X86_FEATURE_SLD_FATAL);
@@ -1203,6 +1204,5 @@ void __init cpu_set_core_cap_bits(struct cpuinfo_x86 *c)
         return;
     }
 
-    cpu_model_supports_sld = true;
     split_lock_setup();
 }
-- 
2.18.4
[PATCH v10 5/9] x86/kvm: Introduce paravirt split lock detection enumeration
Introduce KVM_FEATURE_SPLIT_LOCK_DETECT, with which guests running on KVM can enumerate the availability of the split lock detection feature.

Introduce KVM_HINTS_SLD_FATAL, which tells whether the host is in sld_fatal mode, i.e., whether split lock detection is forced on for the guest vcpu.

Signed-off-by: Xiaoyao Li
---
 Documentation/virt/kvm/cpuid.rst     | 29 ++++++++++++++---------
 arch/x86/include/uapi/asm/kvm_para.h |  8 +++---
 2 files changed, 26 insertions(+), 11 deletions(-)

diff --git a/Documentation/virt/kvm/cpuid.rst b/Documentation/virt/kvm/cpuid.rst
index a7dff9186bed..c214f1c70703 100644
--- a/Documentation/virt/kvm/cpuid.rst
+++ b/Documentation/virt/kvm/cpuid.rst
@@ -92,6 +92,12 @@ KVM_FEATURE_ASYNC_PF_INT           14      guest checks this feature bit
                                            async pf acknowledgment msr
                                            0x4b564d07.
 
+KVM_FEATURE_SPLIT_LOCK_DETECT      15      guest checks this feature bit for
+                                           availability of split lock detection.
+
+                                           KVM doesn't support enumerating
+                                           split lock detection via CPU model
+
 KVM_FEATURE_CLOCKSOURCE_STABLE_BIT 24      host will warn if no guest-side
                                            per-cpu warps are expected in
                                            kvmclock
@@ -103,11 +109,18 @@
 Where ``flag`` here is defined as below:
 
-==================== =========== ==================================
-flag                 value       meaning
-==================== =========== ==================================
-KVM_HINTS_REALTIME   0           guest checks this feature bit to
-                                 determine that vCPUs are never
-                                 preempted for an unlimited time
-                                 allowing optimizations
-==================== =========== ==================================
+==================== =========== ==================================
+flag                 value       meaning
+==================== =========== ==================================
+KVM_HINTS_REALTIME   0           guest checks this feature bit to
+                                 determine that vCPUs are never
+                                 preempted for an unlimited time
+                                 allowing optimizations
+
+KVM_HINTS_SLD_FATAL  1           set if split lock detection is
+                                 forced on in the host, in which
+                                 case KVM will kill the guest if it
+                                 generates a split lock #AC with
+                                 SLD disabled from guest's
+                                 perspective
+==================== =========== ==================================
diff --git a/arch/x86/include/uapi/asm/kvm_para.h b/arch/x86/include/uapi/asm/kvm_para.h
index 812e9b4c1114..328ddfaacd7b 100644
--- a/arch/x86/include/uapi/asm/kvm_para.h
+++ b/arch/x86/include/uapi/asm/kvm_para.h
@@ -32,14 +32,16 @@
 #define KVM_FEATURE_POLL_CONTROL	12
 #define KVM_FEATURE_PV_SCHED_YIELD	13
 #define KVM_FEATURE_ASYNC_PF_INT	14
-
-#define KVM_HINTS_REALTIME	0
-
+#define KVM_FEATURE_SPLIT_LOCK_DETECT	15
 
 /* The last 8 bits are used to indicate how to interpret the flags field
  * in pvclock structure. If no bits are set, all flags are ignored.
  */
 #define KVM_FEATURE_CLOCKSOURCE_STABLE_BIT	24
 
+/* KVM feature hints in CPUID.0x40000001.EDX */
+#define KVM_HINTS_REALTIME	0
+#define KVM_HINTS_SLD_FATAL	1
+
 #define MSR_KVM_WALL_CLOCK  0x11
 #define MSR_KVM_SYSTEM_TIME 0x12
-- 
2.18.4
[PATCH v10 3/9] x86/split_lock: Introduce flag X86_FEATURE_SLD_FATAL and drop sld_state
Introduce a synthetic feature flag X86_FEATURE_SLD_FATAL, which, when set, means the kernel is in sld_fatal mode. sld_state is no longer needed, since the state of SLD can be inferred from X86_FEATURE_SPLIT_LOCK_DETECT and X86_FEATURE_SLD_FATAL. Suggested-by: Sean Christopherson Signed-off-by: Xiaoyao Li --- arch/x86/include/asm/cpufeatures.h | 1 + arch/x86/kernel/cpu/intel.c| 16 ++-- 2 files changed, 7 insertions(+), 10 deletions(-) diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h index 2901d5df4366..caf9f4e3e876 100644 --- a/arch/x86/include/asm/cpufeatures.h +++ b/arch/x86/include/asm/cpufeatures.h @@ -288,6 +288,7 @@ #define X86_FEATURE_FENCE_SWAPGS_USER (11*32+ 4) /* "" LFENCE in user entry SWAPGS path */ #define X86_FEATURE_FENCE_SWAPGS_KERNEL(11*32+ 5) /* "" LFENCE in kernel entry SWAPGS path */ #define X86_FEATURE_SPLIT_LOCK_DETECT (11*32+ 6) /* #AC for split lock */ +#define X86_FEATURE_SLD_FATAL (11*32+ 7) /* split lock detection in fatal mode */ /* Intel-defined CPU features, CPUID level 0x0007:1 (EAX), word 12 */ #define X86_FEATURE_AVX512_BF16(12*32+ 5) /* AVX512 BFLOAT16 instructions */ diff --git a/arch/x86/kernel/cpu/intel.c b/arch/x86/kernel/cpu/intel.c index 5dab842ba7e1..06de03974e66 100644 --- a/arch/x86/kernel/cpu/intel.c +++ b/arch/x86/kernel/cpu/intel.c @@ -42,12 +42,6 @@ enum split_lock_detect_state { sld_fatal, }; -/* - * Default to sld_off because most systems do not support split lock detection - * split_lock_setup() will switch this to sld_warn on systems that support - * split lock detect, unless there is a command line override.
- */ -static enum split_lock_detect_state sld_state __ro_after_init = sld_off; static u64 msr_test_ctrl_cache __ro_after_init; /* @@ -1058,8 +1052,9 @@ static void __init split_lock_setup(void) return; } - sld_state = state; setup_force_cpu_cap(X86_FEATURE_SPLIT_LOCK_DETECT); + if (state == sld_fatal) + setup_force_cpu_cap(X86_FEATURE_SLD_FATAL); } /* @@ -1080,7 +1075,7 @@ static void sld_update_msr(bool on) static void split_lock_init(void) { if (cpu_model_supports_sld) - split_lock_verify_msr(sld_state != sld_off); + split_lock_verify_msr(boot_cpu_has(X86_FEATURE_SPLIT_LOCK_DETECT)); } static void split_lock_warn(unsigned long ip) @@ -1099,7 +1094,7 @@ static void split_lock_warn(unsigned long ip) bool handle_guest_split_lock(unsigned long ip) { - if (sld_state == sld_warn) { + if (!boot_cpu_has(X86_FEATURE_SLD_FATAL)) { split_lock_warn(ip); return true; } @@ -1116,7 +,8 @@ EXPORT_SYMBOL_GPL(handle_guest_split_lock); bool handle_user_split_lock(struct pt_regs *regs, long error_code) { - if ((regs->flags & X86_EFLAGS_AC) || sld_state == sld_fatal) + if ((regs->flags & X86_EFLAGS_AC) || + boot_cpu_has(X86_FEATURE_SLD_FATAL)) return false; split_lock_warn(regs->ip); return true; -- 2.18.4
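The mapping the patch relies on — two boolean feature flags encoding the former tri-state `sld_state` — can be sketched in a few lines of plain C. This is a model for illustration only (`sld_state_from_flags` is a made-up name, and the real kernel queries `boot_cpu_has()` rather than taking booleans):

```c
#include <stdbool.h>

enum sld_state { sld_off, sld_warn, sld_fatal };

/* Infer the old tri-state from the two synthetic feature flags:
 * no SPLIT_LOCK_DETECT  -> off
 * SPLIT_LOCK_DETECT only -> warn
 * both flags set         -> fatal
 */
enum sld_state sld_state_from_flags(bool split_lock_detect, bool sld_fatal_flag)
{
	if (!split_lock_detect)
		return sld_off;
	return sld_fatal_flag ? sld_fatal : sld_warn;
}
```

Note the encoding is lossless, which is why the patch can drop the `sld_state` variable entirely.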
[PATCH v10 4/9] x86/split_lock: Introduce split_lock_virt_switch() and two wrappers
Introduce split_lock_virt_switch(), which is used to toggle the split lock detection setting as well as update the TIF_SLD_DISABLED flag, keeping them consistent. Note, it can only be used in sld warn mode, i.e., X86_FEATURE_SPLIT_LOCK_DETECT && !X86_FEATURE_SLD_FATAL. The FATAL check is handled by the wrappers, split_lock_set_guest() and split_lock_restore_host(), that will be used by KVM when virtualizing split lock detection for guests in the future. Signed-off-by: Xiaoyao Li --- arch/x86/include/asm/cpu.h | 34 ++ arch/x86/kernel/cpu/intel.c | 20 2 files changed, 54 insertions(+) diff --git a/arch/x86/include/asm/cpu.h b/arch/x86/include/asm/cpu.h index da78ccbd493b..2971a29d5094 100644 --- a/arch/x86/include/asm/cpu.h +++ b/arch/x86/include/asm/cpu.h @@ -45,6 +45,7 @@ extern void __init cpu_set_core_cap_bits(struct cpuinfo_x86 *c); extern void switch_to_sld(unsigned long tifn); extern bool handle_user_split_lock(struct pt_regs *regs, long error_code); extern bool handle_guest_split_lock(unsigned long ip); +extern bool split_lock_virt_switch(bool on); #else static inline void __init cpu_set_core_cap_bits(struct cpuinfo_x86 *c) {} static inline void switch_to_sld(unsigned long tifn) {} @@ -57,7 +58,40 @@ static inline bool handle_guest_split_lock(unsigned long ip) { return false; } + +static inline bool split_lock_virt_switch(bool on) { return false; } #endif + +/** + * split_lock_set_guest - Set SLD state for a guest + * @guest_sld_on: If SLD is on in the guest + * + * returns: %true if SLD was enabled in the task + * + * Must be called when X86_FEATURE_SPLIT_LOCK_DETECT is available. + */ +static inline bool split_lock_set_guest(bool guest_sld_on) +{ + if (static_cpu_has(X86_FEATURE_SLD_FATAL)) + return true; + + return split_lock_virt_switch(guest_sld_on); +} + +/** + * split_lock_restore_host - Restore host SLD state + * @host_sld_on: If SLD is on in the host + * + * Must be called when X86_FEATURE_SPLIT_LOCK_DETECT is available.
+ */ +static inline void split_lock_restore_host(bool host_sld_on) +{ + if (static_cpu_has(X86_FEATURE_SLD_FATAL)) + return; + + split_lock_virt_switch(host_sld_on); +} + #ifdef CONFIG_IA32_FEAT_CTL void init_ia32_feat_ctl(struct cpuinfo_x86 *c); #else diff --git a/arch/x86/kernel/cpu/intel.c b/arch/x86/kernel/cpu/intel.c index 06de03974e66..5f44e0de04b9 100644 --- a/arch/x86/kernel/cpu/intel.c +++ b/arch/x86/kernel/cpu/intel.c @@ -1078,6 +1078,26 @@ static void split_lock_init(void) split_lock_verify_msr(boot_cpu_has(X86_FEATURE_SPLIT_LOCK_DETECT)); } +/* + * This function should never be called directly; use split_lock_set_guest() + * and split_lock_restore_host() instead. + * + * The caller needs to be in a preemption-disabled context to keep the + * MSR state and the TIF_SLD_DISABLED state consistent. + */ +bool split_lock_virt_switch(bool on) +{ + bool was_on = !test_thread_flag(TIF_SLD_DISABLED); + + if (on != was_on) { + sld_update_msr(on); + update_thread_flag(TIF_SLD_DISABLED, !on); + } + + return was_on; +} +EXPORT_SYMBOL_GPL(split_lock_virt_switch); + static void split_lock_warn(unsigned long ip) { pr_warn_ratelimited("#AC: %s/%d took a split_lock trap at address: 0x%lx\n", -- 2.18.4
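The contract of split_lock_virt_switch() — toggle only on change, and return the *previous* state so the caller can restore it later — can be modeled outside the kernel. In this sketch the per-thread TIF flag and the MSR write are replaced with plain globals and a stub; all names are illustrative, not kernel APIs:

```c
#include <stdbool.h>

/* Stand-ins for the per-thread flag and the MSR-backed hardware state. */
bool tif_sld_disabled; /* false = SLD on for this thread */
bool msr_sld_on;

void sld_update_msr_stub(bool on) { msr_sld_on = on; }

/* Mirrors split_lock_virt_switch(): only touch the "MSR" when the
 * requested state differs, and hand back the state we replaced. */
bool split_lock_virt_switch_model(bool on)
{
	bool was_on = !tif_sld_disabled;

	if (on != was_on) {
		sld_update_msr_stub(on);
		tif_sld_disabled = !on;
	}
	return was_on;
}
```

The returned previous state is what lets the split_lock_set_guest()/split_lock_restore_host() pair save the host setting on entry and put it back on exit.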
[PATCH v10 0/9] KVM: Add virtualization support of split lock detection
Hi maintainers, Please help review this new version and give comments. There is only one minor change from the previous v9, i.e., adding patch 8. --- This series aims to add the virtualization of split lock detection in KVM. Because split lock detection is tightly coupled with the CPU model, and the CPU model is configurable by the host VMM, we elect to use a paravirt method to expose and enumerate it for the guest. Changes in v10 - rebase to v5.9-rc1; - Add one patch to move the initialization of cpu_model_supports_sld to split_lock_setup(); Changes in v9 https://lkml.kernel.org/r/20200509110542.8159-1-xiaoyao...@intel.com - rebase to v5.7-rc4 - Add one patch to rename TIF_SLD to TIF_SLD_DISABLED; - Add one patch to remove bogus case in handle_guest_split_lock; - Introduce flag X86_FEATURE_SPLIT_LOCK_DETECT_FATAL and thus drop sld_state; - Use X86_FEATURE_SPLIT_LOCK_DETECT and X86_FEATURE_SPLIT_LOCK_DETECT_FATAL to determine the SLD state of the host; - Introduce split_lock_virt_switch() and two wrappers for KVM instead of sld_update_to(); - Use paravirt to expose and enumerate split lock detection for the guest; - Split lock detection can be exposed to the guest when the host is sld_fatal, even when SMT is available on the host. Changes in v8: https://lkml.kernel.org/r/20200414063129.133630-1-xiaoyao...@intel.com - rebase to v5.7-rc1. - basic enabling of split lock detection already merged. - When the host is sld_warn and nosmt, load the guest's sld bit when in KVM context, i.e., between vmx_prepare_switch_to_guest() and vmx_prepare_switch_to_host(), KVM uses the guest's sld setting. Changes in v7: https://lkml.kernel.org/r/20200325030924.132881-1-xiaoyao...@intel.com - only pick patch 1 and patch 2, and hold all the rest. - Update SLD bit on each processor based on sld_state. Changes in v6: https://lkml.kernel.org/r/20200324151859.31068-1-xiaoyao...@intel.com - Drop the sld_not_exist flag and use X86_FEATURE_SPLIT_LOCK_DETECT to check whether split lock detection needs to be initialized.
[tglx] - Use tglx's method to verify the existence of split lock detection. - small optimization of sld_update_msr(): the default value of msr_test_ctrl_cache has the split_lock_detect bit cleared. - Drop patch 3 of v5, which introduced a kvm_only option. [tglx] - Rebase patch4-8 to kvm/queue. - use the new kvm-cpu-cap to expose X86_FEATURE_CORE_CAPABILITIES in Patch 6. Changes in v5: https://lkml.kernel.org/r/20200315050517.127446-1-xiaoyao...@intel.com - Use X86_FEATURE_SPLIT_LOCK_DETECT flag in kvm to ensure split lock detection is really supported. - Add and export sld-related helper functions in the kvm patches that use them. Xiaoyao Li (9): x86/split_lock: Rename TIF_SLD to TIF_SLD_DISABLED x86/split_lock: Remove bogus case in handle_guest_split_lock() x86/split_lock: Introduce flag X86_FEATURE_SLD_FATAL and drop sld_state x86/split_lock: Introduce split_lock_virt_switch() and two wrappers x86/kvm: Introduce paravirt split lock detection enumeration KVM: VMX: Enable MSR TEST_CTRL for guest KVM: VMX: virtualize split lock detection x86/split_lock: Set cpu_model_supports_sld to true after the existence of split lock detection is verified BSP x86/split_lock: Enable split lock detection initialization when running as a guest on KVM Documentation/virt/kvm/cpuid.rst | 29 +++ arch/x86/include/asm/cpu.h | 36 ++ arch/x86/include/asm/cpufeatures.h | 1 + arch/x86/include/asm/thread_info.h | 6 +-- arch/x86/include/uapi/asm/kvm_para.h | 8 ++-- arch/x86/kernel/cpu/intel.c | 61 +++ arch/x86/kernel/kvm.c| 3 ++ arch/x86/kernel/process.c| 2 +- arch/x86/kvm/cpuid.c | 6 +++ arch/x86/kvm/vmx/vmx.c | 72 +--- arch/x86/kvm/vmx/vmx.h | 3 ++ arch/x86/kvm/x86.c | 6 ++- arch/x86/kvm/x86.h | 16 +++ 13 files changed, 207 insertions(+), 42 deletions(-) -- 2.18.4
[PATCH v10 1/9] x86/split_lock: Rename TIF_SLD to TIF_SLD_DISABLED
TIF_SLD can only be set if a user space thread hits a split lock and sld_state == sld_warn. This flag is set to indicate that SLD (split lock detection) is turned off for the thread, so rename it to TIF_SLD_DISABLED, which is self-explanatory. Suggested-by: Sean Christopherson Suggested-by: Thomas Gleixner Signed-off-by: Xiaoyao Li --- arch/x86/include/asm/thread_info.h | 6 +++--- arch/x86/kernel/cpu/intel.c| 6 +++--- arch/x86/kernel/process.c | 2 +- 3 files changed, 7 insertions(+), 7 deletions(-) diff --git a/arch/x86/include/asm/thread_info.h b/arch/x86/include/asm/thread_info.h index 267701ae3d86..fdb458a74557 100644 --- a/arch/x86/include/asm/thread_info.h +++ b/arch/x86/include/asm/thread_info.h @@ -92,7 +92,7 @@ struct thread_info { #define TIF_NOCPUID15 /* CPUID is not accessible in userland */ #define TIF_NOTSC 16 /* TSC is not accessible in userland */ #define TIF_IA32 17 /* IA32 compatibility process */ -#define TIF_SLD18 /* Restore split lock detection on context switch */ +#define TIF_SLD_DISABLED 18 /* split lock detection is turned off */ #define TIF_MEMDIE 20 /* is terminating due to OOM killer */ #define TIF_POLLING_NRFLAG 21 /* idle is polling for TIF_NEED_RESCHED */ #define TIF_IO_BITMAP 22 /* uses I/O bitmap */ @@ -122,7 +122,7 @@ struct thread_info { #define _TIF_NOCPUID (1 << TIF_NOCPUID) #define _TIF_NOTSC (1 << TIF_NOTSC) #define _TIF_IA32 (1 << TIF_IA32) -#define _TIF_SLD (1 << TIF_SLD) +#define _TIF_SLD_DISABLED (1 << TIF_SLD_DISABLED) #define _TIF_POLLING_NRFLAG(1 << TIF_POLLING_NRFLAG) #define _TIF_IO_BITMAP (1 << TIF_IO_BITMAP) #define _TIF_FORCED_TF (1 << TIF_FORCED_TF) @@ -136,7 +136,7 @@ struct thread_info { /* flags to check in __switch_to() */ #define _TIF_WORK_CTXSW_BASE \ (_TIF_NOCPUID | _TIF_NOTSC | _TIF_BLOCKSTEP | \ -_TIF_SSBD | _TIF_SPEC_FORCE_UPDATE | _TIF_SLD) +_TIF_SSBD | _TIF_SPEC_FORCE_UPDATE | _TIF_SLD_DISABLED) /* * Avoid calls to __switch_to_xtra() on UP as STIBP is not evaluated.
diff --git a/arch/x86/kernel/cpu/intel.c b/arch/x86/kernel/cpu/intel.c index 59a1e3ce3f14..c48b3267c141 100644 --- a/arch/x86/kernel/cpu/intel.c +++ b/arch/x86/kernel/cpu/intel.c @@ -1090,11 +1090,11 @@ static void split_lock_warn(unsigned long ip) /* * Disable the split lock detection for this task so it can make -* progress and set TIF_SLD so the detection is re-enabled via +* progress and set TIF_SLD_DISABLED so the detection is re-enabled via * switch_to_sld() when the task is scheduled out. */ sld_update_msr(false); - set_tsk_thread_flag(current, TIF_SLD); + set_tsk_thread_flag(current, TIF_SLD_DISABLED); } bool handle_guest_split_lock(unsigned long ip) @@ -1132,7 +1132,7 @@ bool handle_user_split_lock(struct pt_regs *regs, long error_code) */ void switch_to_sld(unsigned long tifn) { - sld_update_msr(!(tifn & _TIF_SLD)); + sld_update_msr(!(tifn & _TIF_SLD_DISABLED)); } /* diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c index 994d8393f2f7..e9265c6b9256 100644 --- a/arch/x86/kernel/process.c +++ b/arch/x86/kernel/process.c @@ -641,7 +641,7 @@ void __switch_to_xtra(struct task_struct *prev_p, struct task_struct *next_p) __speculation_ctrl_update(~tifn, tifn); } - if ((tifp ^ tifn) & _TIF_SLD) + if ((tifp ^ tifn) & _TIF_SLD_DISABLED) switch_to_sld(tifn); } -- 2.18.4
Re: [RFC 2/2] KVM: VMX: Enable bus lock VM exit
On 7/23/2020 9:21 AM, Sean Christopherson wrote: On Wed, Jul 01, 2020 at 04:49:49PM +0200, Vitaly Kuznetsov wrote: Xiaoyao Li writes: So you want an exit to userspace for every bus lock and leave it all to userspace. Yes, it's doable. In some cases we may not even want to have a VM exit: think e.g. the real-time/partitioning case, where even in case of a bus lock we may not want to add additional latency just to count such events. Hmm, I suspect this isn't all that useful for real-time cases because they'd probably want to prevent the split lock in the first place, e.g. would prefer to use the #AC variant in fatal mode. Of course, the availability of split lock #AC is a whole other can of worms. But anyways, I 100% agree that this needs either an off-by-default module param or an opt-in per-VM capability. Maybe on-by-default or an opt-out per-VM capability? Turning it on introduces no overhead if no bus lock happens in the guest but gives KVM the capability to track every potential bus lock. If the user doesn't want the extra latency due to bus lock VM exits, it's better to try to fix the bus lock, which also incurs high latency. I'd suggest we make the new capability tri-state: - disabled (no vmexit, default) - stats only (what this patch does) - userspace exit But maybe this is overkill, I'd like to hear what others think. Userspace exit would also be interesting for debug. Another throttling option would be schedule() or cond_resched(), though that's probably getting into overkill territory. We're going to leverage the host's policy, i.e., calling handle_user_bus_lock(), for throttling, as proposed in https://lkml.kernel.org/r/1595021700-68460-1-git-send-email-fenghua...@intel.com
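The tri-state capability Vitaly proposes can be sketched as a simple dispatch on a bus-lock VM exit. This is a model of the proposal only — the enum values, function name, and return convention are all illustrative and not the eventual KVM uapi:

```c
/* Proposed tri-state policy for the bus lock VM exit capability. */
enum bus_lock_policy {
	BUS_LOCK_DISABLED,   /* no vmexit taken (default) */
	BUS_LOCK_STATS_ONLY, /* count the event, resume the guest */
	BUS_LOCK_USER_EXIT,  /* count it and bounce to userspace */
};

/* Return 0 to resume the guest, 1 to exit to userspace. */
int handle_bus_lock_exit(enum bus_lock_policy policy, unsigned long *stat_counter)
{
	switch (policy) {
	case BUS_LOCK_DISABLED:
		return 0;
	case BUS_LOCK_STATS_ONLY:
		(*stat_counter)++;
		return 0;
	case BUS_LOCK_USER_EXIT:
		(*stat_counter)++;
		return 1;
	}
	return 0;
}
```

The point of the middle state is exactly what the thread discusses: visibility into bus locks without forcing a round trip to userspace on every one.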
Re: kvm crash on 5.7-rc1 and later
On 7/12/2020 2:21 AM, Peter Zijlstra wrote: On Fri, Jul 03, 2020 at 11:15:31AM -0400, Woody Suwalski wrote: I am observing a 100% reproducible kvm crash on kernels starting with 5.7-rc1, always with the same opcode . It happens during wake up from the host suspended state. Worked OK on 5.6 and older. The host is based on Debian testing, Thinkpad T440, i5 cpu. [ 61.576664] kernel BUG at arch/x86/kvm/x86.c:387! [ 61.576672] invalid opcode: [#1] PREEMPT SMP NOPTI [ 61.576678] CPU: 0 PID: 3851 Comm: qemu-system-x86 Not tainted 5.7-pingu #0 [ 61.576680] Hardware name: LENOVO 20B6005JUS/20B6005JUS, BIOS GJETA4WW (2.54 ) 03/27/2020 [ 61.576700] RIP: 0010:kvm_spurious_fault+0xa/0x10 [kvm] Crash results in a dead kvm and occasionally a very unstable system. Bisecting the problem between v5.6 and v5.7-rc1 points to commit 6650cdd9a8ccf00555dbbe743d58541ad8feb6a7 Author: Peter Zijlstra (Intel) Date: Sun Jan 26 12:05:35 2020 -0800 x86/split_lock: Enable split lock detection by kernel Reverting that patch seems to actually "cure" the issue. The problem is present in all kernels past 5.7-rc1, however the patch does not revert cleanly in later source trees, so I can not retest the logic on recent kernels. Peter, would you have an idea how to debug that (or even better - would you happen to know the fix)? I have attached dmesg logs from a "good" 5.6.9 kernel, and then "bad" 5.7.0 and 5.8-rc3 I have no clue about kvm. Nor do I actually have hardware with SLD on. I've Cc'ed a bunch of folks who might have more ideas. I think this bug is the same as the one found by Sean, and is already fixed in 5.8-rc4. https://lore.kernel.org/kvm/20200605192605.7439-1-sean.j.christopher...@intel.com/
[PATCH v4 0/5] Refactor handling flow of KVM_SET_CPUID*
Four patches of v3 have been queued into the kvm/queue branch. This v4 contains the rest, refactoring the flow of KVM_SET_CPUID* as: 1. cpuid check: check if userspace provides legal CPUID settings; 2. cpuid update: Update userspace provided CPUID settings. It currently only contains kvm_update_cpuid_runtime, which updates special CPUID bits based on the vcpu state, e.g., OSXSAVE, OSPKE. In the future, we can re-introduce kvm_update_cpuid() if KVM needs to force on/off some bits. 3. update vcpu states: Update vcpu states/settings based on the final updated CPUID settings. v4: - remove 4 queued patches - rebased to kvm/queue: c16ced9cc67a "x86/kvm/vmx: Use native read/write_cr2()" - fix one bug in v3 to call kvfree(cpuid_entries) in kvm_vcpu_ioctl_set_cpuid() - rename "update_vcpu_model" to "vcpu_after_set_cpuid" [Paolo] - Add a new patch to extract kvm_update_cpuid_runtime() v3: https://lkml.kernel.org/r/20200708065054.19713-1-xiaoyao...@intel.com - Add a note in the KVM api doc to state that the previous CPUID configuration is not reliable if the current KVM_SET_CPUID* fails [Jim] - Adjust Patch 2 to reduce code churn [Sean] - Refine commit messages to add more justification [Sean] - Add a new patch 7 v2: https://lkml.kernel.org/r/20200623115816.24132-1-xiaoyao...@intel.com - rebase to kvm/queue: a037ff353ba6 ("Merge branch 'kvm-master' into HEAD") - change the name of kvm_update_state_based_on_cpuid() to kvm_update_vcpu_model() [Sean] - Add patch 5 to rename kvm_x86_ops.cpuid_update() to kvm_x86_ops.update_vcpu_model() v1: https://lkml.kernel.org/r/20200529085545.29242-1-xiaoyao...@intel.com Xiaoyao Li (5): KVM: x86: Introduce kvm_check_cpuid() KVM: x86: Extract kvm_update_cpuid_runtime() from kvm_update_cpuid() KVM: x86: Rename kvm_update_cpuid() to kvm_vcpu_after_set_cpuid() KVM: x86: Rename cpuid_update() callback to vcpu_after_set_cpuid() KVM: x86: Move kvm_x86_ops.vcpu_after_set_cpuid() into kvm_vcpu_after_set_cpuid() arch/x86/include/asm/kvm_host.h | 2 +- arch/x86/kvm/cpuid.c| 99 +
arch/x86/kvm/cpuid.h| 2 +- arch/x86/kvm/lapic.c| 2 +- arch/x86/kvm/svm/svm.c | 4 +- arch/x86/kvm/vmx/nested.c | 3 +- arch/x86/kvm/vmx/vmx.c | 4 +- arch/x86/kvm/x86.c | 10 ++-- 8 files changed, 76 insertions(+), 50 deletions(-) -- 2.18.4
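The three-step flow the cover letter describes — check first, then update runtime CPUID bits, then derive vcpu state — can be sketched with toy stand-ins. Everything here is illustrative (the `toy_` names and the trivial check are mine, and -22 simply mimics -EINVAL); the real KVM code operates on `struct kvm_vcpu` and `kvm_cpuid_entry2` arrays:

```c
/* Toy vcpu holding only the fields the flow touches. */
struct toy_vcpu {
	int cpuid_nent;
	int runtime_updated;
	int state_updated;
};

/* Stand-in for kvm_check_cpuid(): reject "illegal" input. */
int toy_check_cpuid(const struct toy_vcpu *v)
{
	return v->cpuid_nent > 0 ? 0 : -22; /* mimic -EINVAL */
}

/* Mirrors the refactored order: check, bail (resetting nent) on error,
 * then update runtime CPUID bits, then derive vcpu state from them. */
int toy_set_cpuid(struct toy_vcpu *v, int nent)
{
	int r;

	v->cpuid_nent = nent;
	r = toy_check_cpuid(v);
	if (r) {
		v->cpuid_nent = 0; /* no stale entries survive a failed ioctl */
		return r;
	}
	v->runtime_updated = 1; /* step 2: kvm_update_cpuid_runtime() */
	v->state_updated = 1;   /* step 3: kvm_vcpu_after_set_cpuid() */
	return 0;
}
```

The key property the refactor buys is visible in the error path: once the check fails, neither update step runs and the vcpu is left with zero CPUID entries rather than a half-applied configuration.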
[PATCH v4 1/5] KVM: x86: Introduce kvm_check_cpuid()
Use kvm_check_cpuid() to validate if userspace provides legal cpuid settings, and call it before KVM takes any action to update CPUID or update vcpu states based on the given CPUID settings. Signed-off-by: Xiaoyao Li --- arch/x86/kvm/cpuid.c | 55 arch/x86/kvm/cpuid.h | 2 +- 2 files changed, 36 insertions(+), 21 deletions(-) diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c index eebd66f86abe..1a053022a961 100644 --- a/arch/x86/kvm/cpuid.c +++ b/arch/x86/kvm/cpuid.c @@ -54,7 +54,26 @@ static u32 xstate_required_size(u64 xstate_bv, bool compacted) #define F feature_bit -int kvm_update_cpuid(struct kvm_vcpu *vcpu) +static int kvm_check_cpuid(struct kvm_vcpu *vcpu) +{ + struct kvm_cpuid_entry2 *best; + + /* +* The existing code assumes virtual address is 48-bit or 57-bit in the +* canonical address checks; exit if it is ever changed. +*/ + best = kvm_find_cpuid_entry(vcpu, 0x80000008, 0); + if (best) { + int vaddr_bits = (best->eax & 0xff00) >> 8; + + if (vaddr_bits != 48 && vaddr_bits != 57 && vaddr_bits != 0) + return -EINVAL; + } + + return 0; +} + +void kvm_update_cpuid(struct kvm_vcpu *vcpu) { struct kvm_cpuid_entry2 *best; struct kvm_lapic *apic = vcpu->arch.apic; @@ -98,18 +117,6 @@ int kvm_update_cpuid(struct kvm_vcpu *vcpu) cpuid_entry_has(best, X86_FEATURE_XSAVEC))) best->ebx = xstate_required_size(vcpu->arch.xcr0, true); - /* -* The existing code assumes virtual address is 48-bit or 57-bit in the -* canonical address checks; exit if it is ever changed.
-*/ - best = kvm_find_cpuid_entry(vcpu, 0x80000008, 0); - if (best) { - int vaddr_bits = (best->eax & 0xff00) >> 8; - - if (vaddr_bits != 48 && vaddr_bits != 57 && vaddr_bits != 0) - return -EINVAL; - } - best = kvm_find_cpuid_entry(vcpu, KVM_CPUID_FEATURES, 0); if (kvm_hlt_in_guest(vcpu->kvm) && best && (best->eax & (1 << KVM_FEATURE_PV_UNHALT))) @@ -131,7 +138,6 @@ int kvm_update_cpuid(struct kvm_vcpu *vcpu) kvm_pmu_refresh(vcpu); vcpu->arch.cr4_guest_rsvd_bits = __cr4_reserved_bits(guest_cpuid_has, vcpu); - return 0; } static int is_efer_nx(void) @@ -206,11 +212,16 @@ int kvm_vcpu_ioctl_set_cpuid(struct kvm_vcpu *vcpu, vcpu->arch.cpuid_entries[i].padding[2] = 0; } vcpu->arch.cpuid_nent = cpuid->nent; + r = kvm_check_cpuid(vcpu); + if (r) { + vcpu->arch.cpuid_nent = 0; + kvfree(cpuid_entries); + goto out; + } + cpuid_fix_nx_cap(vcpu); kvm_x86_ops.cpuid_update(vcpu); - r = kvm_update_cpuid(vcpu); - if (r) - vcpu->arch.cpuid_nent = 0; + kvm_update_cpuid(vcpu); kvfree(cpuid_entries); out: @@ -231,10 +242,14 @@ int kvm_vcpu_ioctl_set_cpuid2(struct kvm_vcpu *vcpu, cpuid->nent * sizeof(struct kvm_cpuid_entry2))) goto out; vcpu->arch.cpuid_nent = cpuid->nent; - kvm_x86_ops.cpuid_update(vcpu); - r = kvm_update_cpuid(vcpu); - if (r) + r = kvm_check_cpuid(vcpu); + if (r) { vcpu->arch.cpuid_nent = 0; + goto out; + } + + kvm_x86_ops.cpuid_update(vcpu); + kvm_update_cpuid(vcpu); out: return r; } diff --git a/arch/x86/kvm/cpuid.h b/arch/x86/kvm/cpuid.h index 05434cd9342f..f136de1debad 100644 --- a/arch/x86/kvm/cpuid.h +++ b/arch/x86/kvm/cpuid.h @@ -9,7 +9,7 @@ extern u32 kvm_cpu_caps[NCAPINTS] __read_mostly; void kvm_set_cpu_caps(void); -int kvm_update_cpuid(struct kvm_vcpu *vcpu); +void kvm_update_cpuid(struct kvm_vcpu *vcpu); struct kvm_cpuid_entry2 *kvm_find_cpuid_entry(struct kvm_vcpu *vcpu, u32 function, u32 index); int kvm_dev_ioctl_get_cpuid(struct kvm_cpuid2 *cpuid, -- 2.18.4
[PATCH v4 4/5] KVM: x86: Rename cpuid_update() callback to vcpu_after_set_cpuid()
The name of the callback cpuid_update() is misleading: it's not about updating the CPUID settings of the vcpu, but about updating the configuration of the vcpu based on the CPUIDs. So rename it to vcpu_after_set_cpuid(). Signed-off-by: Xiaoyao Li --- arch/x86/include/asm/kvm_host.h | 2 +- arch/x86/kvm/cpuid.c| 4 ++-- arch/x86/kvm/svm/svm.c | 4 ++-- arch/x86/kvm/vmx/nested.c | 3 ++- arch/x86/kvm/vmx/vmx.c | 4 ++-- 5 files changed, 9 insertions(+), 8 deletions(-) diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index 16461a674406..3d7d818a282c 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -1052,7 +1052,7 @@ struct kvm_x86_ops { void (*hardware_unsetup)(void); bool (*cpu_has_accelerated_tpr)(void); bool (*has_emulated_msr)(u32 index); - void (*cpuid_update)(struct kvm_vcpu *vcpu); + void (*vcpu_after_set_cpuid)(struct kvm_vcpu *vcpu); unsigned int vm_size; int (*vm_init)(struct kvm *kvm); diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c index b602c0c9078e..832a24c1334e 100644 --- a/arch/x86/kvm/cpuid.c +++ b/arch/x86/kvm/cpuid.c @@ -228,7 +228,7 @@ int kvm_vcpu_ioctl_set_cpuid(struct kvm_vcpu *vcpu, } cpuid_fix_nx_cap(vcpu); - kvm_x86_ops.cpuid_update(vcpu); + kvm_x86_ops.vcpu_after_set_cpuid(vcpu); kvm_update_cpuid_runtime(vcpu); kvm_vcpu_after_set_cpuid(vcpu); @@ -257,7 +257,7 @@ int kvm_vcpu_ioctl_set_cpuid2(struct kvm_vcpu *vcpu, goto out; } - kvm_x86_ops.cpuid_update(vcpu); + kvm_x86_ops.vcpu_after_set_cpuid(vcpu); kvm_update_cpuid_runtime(vcpu); kvm_vcpu_after_set_cpuid(vcpu); out: diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c index 6d49afa3063f..535ad311ad02 100644 --- a/arch/x86/kvm/svm/svm.c +++ b/arch/x86/kvm/svm/svm.c @@ -3595,7 +3595,7 @@ static u64 svm_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio) return 0; } -static void svm_cpuid_update(struct kvm_vcpu *vcpu) +static void svm_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu) { struct vcpu_svm *svm = to_svm(vcpu); @@
-4095,7 +4095,7 @@ static struct kvm_x86_ops svm_x86_ops __initdata = { .get_exit_info = svm_get_exit_info, - .cpuid_update = svm_cpuid_update, + .vcpu_after_set_cpuid = svm_vcpu_after_set_cpuid, .has_wbinvd_exit = svm_has_wbinvd_exit, diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c index 7693d41a2446..e4080ab2df21 100644 --- a/arch/x86/kvm/vmx/nested.c +++ b/arch/x86/kvm/vmx/nested.c @@ -6354,7 +6354,8 @@ void nested_vmx_setup_ctls_msrs(struct nested_vmx_msrs *msrs, u32 ept_caps) /* * secondary cpu-based controls. Do not include those that -* depend on CPUID bits, they are added later by vmx_cpuid_update. +* depend on CPUID bits, they are added later by +* vmx_vcpu_after_set_cpuid. */ if (msrs->procbased_ctls_high & CPU_BASED_ACTIVATE_SECONDARY_CONTROLS) rdmsr(MSR_IA32_VMX_PROCBASED_CTLS2, diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c index 0cdc78f2f2f4..2b41d987b101 100644 --- a/arch/x86/kvm/vmx/vmx.c +++ b/arch/x86/kvm/vmx/vmx.c @@ -7282,7 +7282,7 @@ static void update_intel_pt_cfg(struct kvm_vcpu *vcpu) vmx->pt_desc.ctl_bitmask &= ~(0xfULL << (32 + i * 4)); } -static void vmx_cpuid_update(struct kvm_vcpu *vcpu) +static void vmx_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu) { struct vcpu_vmx *vmx = to_vmx(vcpu); @@ -7940,7 +7940,7 @@ static struct kvm_x86_ops vmx_x86_ops __initdata = { .get_exit_info = vmx_get_exit_info, - .cpuid_update = vmx_cpuid_update, + .vcpu_after_set_cpuid = vmx_vcpu_after_set_cpuid, .has_wbinvd_exit = cpu_has_vmx_wbinvd_exit, -- 2.18.4
[PATCH v4 3/5] KVM: x86: Rename kvm_update_cpuid() to kvm_vcpu_after_set_cpuid()
Now that kvm_update_cpuid() no longer updates any CPUID bits, rename it to kvm_vcpu_after_set_cpuid(). Signed-off-by: Xiaoyao Li --- arch/x86/kvm/cpuid.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c index 0ed3b343c44e..b602c0c9078e 100644 --- a/arch/x86/kvm/cpuid.c +++ b/arch/x86/kvm/cpuid.c @@ -116,7 +116,7 @@ void kvm_update_cpuid_runtime(struct kvm_vcpu *vcpu) } } -static void kvm_update_cpuid(struct kvm_vcpu *vcpu) +static void kvm_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu) { struct kvm_lapic *apic = vcpu->arch.apic; struct kvm_cpuid_entry2 *best; @@ -230,7 +230,7 @@ int kvm_vcpu_ioctl_set_cpuid(struct kvm_vcpu *vcpu, cpuid_fix_nx_cap(vcpu); kvm_x86_ops.cpuid_update(vcpu); kvm_update_cpuid_runtime(vcpu); - kvm_update_cpuid(vcpu); + kvm_vcpu_after_set_cpuid(vcpu); kvfree(cpuid_entries); out: @@ -259,7 +259,7 @@ int kvm_vcpu_ioctl_set_cpuid2(struct kvm_vcpu *vcpu, kvm_x86_ops.cpuid_update(vcpu); kvm_update_cpuid_runtime(vcpu); - kvm_update_cpuid(vcpu); + kvm_vcpu_after_set_cpuid(vcpu); out: return r; } -- 2.18.4
[PATCH v4 5/5] KVM: x86: Move kvm_x86_ops.vcpu_after_set_cpuid() into kvm_vcpu_after_set_cpuid()
kvm_x86_ops.vcpu_after_set_cpuid() is used to update vmx/svm-specific vcpu settings based on the updated CPUID settings, so it's supposed to be called after the CPUIDs are updated, i.e., after kvm_update_cpuid_runtime(). Currently, kvm_update_cpuid_runtime() only updates the CPUID bits of OSXSAVE, APIC, OSPKE, MWAIT, KVM_FEATURE_PV_UNHALT, CPUID(0xD, 0).ebx and CPUID(0xD, 1).ebx. None of them is consumed by vmx/svm's vcpu_after_set_cpuid(), so there is no dependency between them, and moving kvm_x86_ops.vcpu_after_set_cpuid() into kvm_vcpu_after_set_cpuid() is more reasonable. Signed-off-by: Xiaoyao Li --- arch/x86/kvm/cpuid.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c index 832a24c1334e..edbed4f522f2 100644 --- a/arch/x86/kvm/cpuid.c +++ b/arch/x86/kvm/cpuid.c @@ -121,6 +121,8 @@ static void kvm_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu) struct kvm_lapic *apic = vcpu->arch.apic; struct kvm_cpuid_entry2 *best; + kvm_x86_ops.vcpu_after_set_cpuid(vcpu); + best = kvm_find_cpuid_entry(vcpu, 1, 0); if (best && apic) { if (cpuid_entry_has(best, X86_FEATURE_TSC_DEADLINE_TIMER)) apic->lapic_timer.timer_mode_mask = 3 << 17; else apic->lapic_timer.timer_mode_mask = 1 << 17; @@ -228,7 +230,6 @@ int kvm_vcpu_ioctl_set_cpuid(struct kvm_vcpu *vcpu, } cpuid_fix_nx_cap(vcpu); - kvm_x86_ops.vcpu_after_set_cpuid(vcpu); kvm_update_cpuid_runtime(vcpu); kvm_vcpu_after_set_cpuid(vcpu); @@ -257,7 +258,6 @@ int kvm_vcpu_ioctl_set_cpuid2(struct kvm_vcpu *vcpu, goto out; } - kvm_x86_ops.vcpu_after_set_cpuid(vcpu); kvm_update_cpuid_runtime(vcpu); kvm_vcpu_after_set_cpuid(vcpu); out: -- 2.18.4
[PATCH v4 2/5] KVM: x86: Extract kvm_update_cpuid_runtime() from kvm_update_cpuid()
Besides being called in kvm_vcpu_ioctl_set_cpuid*(), kvm_update_cpuid() is also called in 5 other places in x86.c and 1 other place in lapic.c. All those 6 places only need the part that updates guest CPUIDs (OSXSAVE, OSPKE, APIC, KVM_FEATURE_PV_UNHALT, ...) based on the runtime vcpu state, so extract that part into a separate kvm_update_cpuid_runtime(). Signed-off-by: Xiaoyao Li --- arch/x86/kvm/cpuid.c | 44 +++- arch/x86/kvm/cpuid.h | 2 +- arch/x86/kvm/lapic.c | 2 +- arch/x86/kvm/x86.c | 10 +- 4 files changed, 34 insertions(+), 24 deletions(-) diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c index 1a053022a961..0ed3b343c44e 100644 --- a/arch/x86/kvm/cpuid.c +++ b/arch/x86/kvm/cpuid.c @@ -73,10 +73,9 @@ static int kvm_check_cpuid(struct kvm_vcpu *vcpu) return 0; } -void kvm_update_cpuid(struct kvm_vcpu *vcpu) +void kvm_update_cpuid_runtime(struct kvm_vcpu *vcpu) { struct kvm_cpuid_entry2 *best; - struct kvm_lapic *apic = vcpu->arch.apic; best = kvm_find_cpuid_entry(vcpu, 1, 0); if (best) { @@ -89,28 +88,14 @@ void kvm_update_cpuid(struct kvm_vcpu *vcpu) vcpu->arch.apic_base & MSR_IA32_APICBASE_ENABLE); } - if (best && apic) { - if (cpuid_entry_has(best, X86_FEATURE_TSC_DEADLINE_TIMER)) - apic->lapic_timer.timer_mode_mask = 3 << 17; - else - apic->lapic_timer.timer_mode_mask = 1 << 17; - - kvm_apic_set_version(vcpu); - } - best = kvm_find_cpuid_entry(vcpu, 7, 0); if (best && boot_cpu_has(X86_FEATURE_PKU) && best->function == 0x7) cpuid_entry_change(best, X86_FEATURE_OSPKE, kvm_read_cr4_bits(vcpu, X86_CR4_PKE)); best = kvm_find_cpuid_entry(vcpu, 0xD, 0); - if (!best) { - vcpu->arch.guest_supported_xcr0 = 0; - } else { - vcpu->arch.guest_supported_xcr0 = - (best->eax | ((u64)best->edx << 32)) & supported_xcr0; + if (best) best->ebx = xstate_required_size(vcpu->arch.xcr0, false); - } best = kvm_find_cpuid_entry(vcpu, 0xD, 1); if (best && (cpuid_entry_has(best, X86_FEATURE_XSAVES) || @@ -129,6 +114,29 @@ void kvm_update_cpuid(struct kvm_vcpu *vcpu) vcpu->arch.ia32_misc_enable_msr
& MSR_IA32_MISC_ENABLE_MWAIT); } +} + +static void kvm_update_cpuid(struct kvm_vcpu *vcpu) +{ + struct kvm_lapic *apic = vcpu->arch.apic; + struct kvm_cpuid_entry2 *best; + + best = kvm_find_cpuid_entry(vcpu, 1, 0); + if (best && apic) { + if (cpuid_entry_has(best, X86_FEATURE_TSC_DEADLINE_TIMER)) + apic->lapic_timer.timer_mode_mask = 3 << 17; + else + apic->lapic_timer.timer_mode_mask = 1 << 17; + + kvm_apic_set_version(vcpu); + } + + best = kvm_find_cpuid_entry(vcpu, 0xD, 0); + if (!best) + vcpu->arch.guest_supported_xcr0 = 0; + else + vcpu->arch.guest_supported_xcr0 = + (best->eax | ((u64)best->edx << 32)) & supported_xcr0; /* Note, maxphyaddr must be updated before tdp_level. */ vcpu->arch.maxphyaddr = cpuid_query_maxphyaddr(vcpu); @@ -221,6 +229,7 @@ int kvm_vcpu_ioctl_set_cpuid(struct kvm_vcpu *vcpu, cpuid_fix_nx_cap(vcpu); kvm_x86_ops.cpuid_update(vcpu); + kvm_update_cpuid_runtime(vcpu); kvm_update_cpuid(vcpu); kvfree(cpuid_entries); @@ -249,6 +258,7 @@ int kvm_vcpu_ioctl_set_cpuid2(struct kvm_vcpu *vcpu, } kvm_x86_ops.cpuid_update(vcpu); + kvm_update_cpuid_runtime(vcpu); kvm_update_cpuid(vcpu); out: return r; diff --git a/arch/x86/kvm/cpuid.h b/arch/x86/kvm/cpuid.h index f136de1debad..3a923ae15f2f 100644 --- a/arch/x86/kvm/cpuid.h +++ b/arch/x86/kvm/cpuid.h @@ -9,7 +9,7 @@ extern u32 kvm_cpu_caps[NCAPINTS] __read_mostly; void kvm_set_cpu_caps(void); -void kvm_update_cpuid(struct kvm_vcpu *vcpu); +void kvm_update_cpuid_runtime(struct kvm_vcpu *vcpu); struct kvm_cpuid_entry2 *kvm_find_cpuid_entry(struct kvm_vcpu *vcpu, u32 function, u32 index); int kvm_dev_ioctl_get_cpuid(struct kvm_cpuid2 *cpuid, diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c index e5dbb7ebae78..47801a44cfa6 100644 --- a/arch/x86/kvm/lapic.c +++ b/arch/x86/kvm/lapic.c @@ -2230,7 +2230,7 @@ void kvm_lapic_set_base(struct kvm_vcpu *vcpu, u64 value) vcpu->arch.apic_base = value; if ((old_value ^ value) & MSR_IA32_APICBASE_ENABLE) - kvm_
Re: [PATCH v3 0/8] Refactor handling flow of KVM_SET_CPUID*
On 7/8/2020 8:10 PM, Paolo Bonzini wrote: On 08/07/20 08:50, Xiaoyao Li wrote: This serial is the extended version of https://lkml.kernel.org/r/20200528151927.14346-1-xiaoyao...@intel.com First two patches are bug fixing, and the others aim to refactor the flow of SET_CPUID* as: 1. cpuid check: check if userspace provides legal CPUID settings; 2. cpuid update: Update some special CPUID bits based on current vcpu state, e.g., OSXSAVE, OSPKE, ... 3. update vcpu model: Update vcpu model (settings) based on the final CPUID settings. v3: - Add a note in KVM api doc to state the previous CPUID configuration is not reliable if current KVM_SET_CPUID* fails [Jim] - Adjust Patch 2 to reduce code churn [Sean] - Commit message refine to add more justification [Sean] - Add a new patch (7) v2: https://lkml.kernel.org/r/20200623115816.24132-1-xiaoyao...@intel.com - rebase to kvm/queue: a037ff353ba6 ("Merge branch 'kvm-master' into HEAD") - change the name of kvm_update_state_based_on_cpuid() to kvm_update_vcpu_model() [Sean] - Add patch 5 to rename kvm_x86_ops.cpuid_date() to kvm_x86_ops.update_vcpu_model() v1: https://lkml.kernel.org/r/20200529085545.29242-1-xiaoyao...@intel.com Xiaoyao Li (8): KVM: X86: Reset vcpu->arch.cpuid_nent to 0 if SET_CPUID* fails KVM: X86: Go on updating other CPUID leaves when leaf 1 is absent KVM: X86: Introduce kvm_check_cpuid() KVM: X86: Split kvm_update_cpuid() KVM: X86: Rename cpuid_update() to update_vcpu_model() KVM: X86: Move kvm_x86_ops.update_vcpu_model() into kvm_update_vcpu_model() KVM: lapic: Use guest_cpuid_has() in kvm_apic_set_version() KVM: X86: Move kvm_apic_set_version() to kvm_update_vcpu_model() Documentation/virt/kvm/api.rst | 4 ++ arch/x86/include/asm/kvm_host.h | 2 +- arch/x86/kvm/cpuid.c| 107 arch/x86/kvm/cpuid.h| 3 +- arch/x86/kvm/lapic.c| 4 +- arch/x86/kvm/svm/svm.c | 4 +- arch/x86/kvm/vmx/nested.c | 2 +- arch/x86/kvm/vmx/vmx.c | 4 +- arch/x86/kvm/x86.c | 1 + 9 files changed, 81 insertions(+), 50 deletions(-) Queued patches 
1/2/3/7/8, thanks. Paolo, I notice that you queued patch 8 into kvm/queue branch as commit 84dd4897524e "KVM: X86: Move kvm_apic_set_version() to kvm_update_vcpu_model()" Can you change the subject of that commit to "KVM: X86: Move kvm_apic_set_version() to kvm_update_cpuid()" ? Paolo
Re: [PATCH v3 4/8] KVM: X86: Split kvm_update_cpuid()
On 7/8/2020 8:41 PM, Paolo Bonzini wrote: On 08/07/20 14:33, Xiaoyao Li wrote: On 7/8/2020 8:06 PM, Paolo Bonzini wrote: On 08/07/20 08:50, Xiaoyao Li wrote: Split the part of updating vcpu model out of kvm_update_cpuid(), and put it into a new kvm_update_vcpu_model(). So it's more clear that kvm_update_cpuid() is to update guest CPUID settings, while kvm_update_vcpu_model() is to update vcpu model (settings) based on the updated CPUID settings. Signed-off-by: Xiaoyao Li I would prefer to keep the kvm_update_cpuid name for what you called kvm_update_vcpu_model(), and rename the rest to kvm_update_cpuid_runtime(). But there is no CPUID being updated in kvm_update_cpuid(), after kvm_update_cpuid_runtime() is split out. This is confusing, IMO. Then what about kvm_vcpu_after_set_cpuid()? It's the "model" that is not clear. I'm OK with kvm_vcpu_after_set_cpuid(). BTW, there is one open question for me regarding enter_smm(): it currently calls kvm_update_cpuid(), and I'm not sure which part it really needs, updating CPUID, updating vcpu state based on CPUID, or both. So in this patch, after splitting kvm_update_cpuid(), I keep both functions there to ensure no functional change in enter_smm(). Using the name "kvm_vcpu_after_set_cpuid" therefore seems odd in that function.
Thanks, Paolo Paolo --- arch/x86/kvm/cpuid.c | 38 -- arch/x86/kvm/cpuid.h | 1 + arch/x86/kvm/x86.c | 1 + 3 files changed, 26 insertions(+), 14 deletions(-) diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c index a825878b7f84..001f5a94880e 100644 --- a/arch/x86/kvm/cpuid.c +++ b/arch/x86/kvm/cpuid.c @@ -76,7 +76,6 @@ static int kvm_check_cpuid(struct kvm_vcpu *vcpu) void kvm_update_cpuid(struct kvm_vcpu *vcpu) { struct kvm_cpuid_entry2 *best; - struct kvm_lapic *apic = vcpu->arch.apic; best = kvm_find_cpuid_entry(vcpu, 1, 0); if (best) { @@ -89,26 +88,14 @@ void kvm_update_cpuid(struct kvm_vcpu *vcpu) vcpu->arch.apic_base & MSR_IA32_APICBASE_ENABLE); } - if (best && apic) { - if (cpuid_entry_has(best, X86_FEATURE_TSC_DEADLINE_TIMER)) - apic->lapic_timer.timer_mode_mask = 3 << 17; - else - apic->lapic_timer.timer_mode_mask = 1 << 17; - } - best = kvm_find_cpuid_entry(vcpu, 7, 0); if (best && boot_cpu_has(X86_FEATURE_PKU) && best->function == 0x7) cpuid_entry_change(best, X86_FEATURE_OSPKE, kvm_read_cr4_bits(vcpu, X86_CR4_PKE)); best = kvm_find_cpuid_entry(vcpu, 0xD, 0); - if (!best) { - vcpu->arch.guest_supported_xcr0 = 0; - } else { - vcpu->arch.guest_supported_xcr0 = - (best->eax | ((u64)best->edx << 32)) & supported_xcr0; + if (best) best->ebx = xstate_required_size(vcpu->arch.xcr0, false); - } best = kvm_find_cpuid_entry(vcpu, 0xD, 1); if (best && (cpuid_entry_has(best, X86_FEATURE_XSAVES) || @@ -127,6 +114,27 @@ void kvm_update_cpuid(struct kvm_vcpu *vcpu) vcpu->arch.ia32_misc_enable_msr & MSR_IA32_MISC_ENABLE_MWAIT); } +} + +void kvm_update_vcpu_model(struct kvm_vcpu *vcpu) +{ + struct kvm_lapic *apic = vcpu->arch.apic; + struct kvm_cpuid_entry2 *best; + + best = kvm_find_cpuid_entry(vcpu, 1, 0); + if (best && apic) { + if (cpuid_entry_has(best, X86_FEATURE_TSC_DEADLINE_TIMER)) + apic->lapic_timer.timer_mode_mask = 3 << 17; + else + apic->lapic_timer.timer_mode_mask = 1 << 17; + } + + best = kvm_find_cpuid_entry(vcpu, 0xD, 0); + if (!best) + 
vcpu->arch.guest_supported_xcr0 = 0; + else + vcpu->arch.guest_supported_xcr0 = + (best->eax | ((u64)best->edx << 32)) & supported_xcr0; /* Note, maxphyaddr must be updated before tdp_level. */ vcpu->arch.maxphyaddr = cpuid_query_maxphyaddr(vcpu); @@ -218,6 +226,7 @@ int kvm_vcpu_ioctl_set_cpuid(struct kvm_vcpu *vcpu, kvm_apic_set_version(vcpu); kvm_x86_ops.cpuid_update(vcpu); kvm_update_cpuid(vcpu); + kvm_update_vcpu_model(vcpu); kvfree(cpuid_entries); out: @@ -247,6 +256,7 @@ int kvm_vcpu_ioctl_set_cpuid2(struct kvm_vcpu *vcpu, kvm_apic_set_version(vcpu); kvm_x86_ops.cpuid_update(vcpu); kvm_update_cpuid(vcpu); + kvm_update_vcpu_model(vcpu); out: return r; } diff --git a/arch/x86/kvm/cpuid.h b/arch/x86/kvm/cpuid.h index f136de1debad..45e3643e2fba 100644 --- a/arch/x86/kvm/cpuid.h +++ b/arch/x86/kvm/cpuid.h @@ -10,6 +10,7 @@ extern u32 kvm_cpu_caps[NCAPINTS] __read_mostly; void kvm_set_cpu_caps(void); void kvm_update_cpuid(struct kvm_vcpu *vcpu); +void kvm_update_vcpu_model(struct kvm_vcp
Re: [PATCH v3 4/8] KVM: X86: Split kvm_update_cpuid()
On 7/8/2020 8:06 PM, Paolo Bonzini wrote: On 08/07/20 08:50, Xiaoyao Li wrote: Split the part of updating vcpu model out of kvm_update_cpuid(), and put it into a new kvm_update_vcpu_model(). So it's more clear that kvm_update_cpuid() is to update guest CPUID settings, while kvm_update_vcpu_model() is to update vcpu model (settings) based on the updated CPUID settings. Signed-off-by: Xiaoyao Li I would prefer to keep the kvm_update_cpuid name for what you called kvm_update_vcpu_model(), and rename the rest to kvm_update_cpuid_runtime(). But there is no CPUID being updated in kvm_update_cpuid(), after kvm_update_cpuid_runtime() is split out. This is confusing, IMO. Paolo --- arch/x86/kvm/cpuid.c | 38 -- arch/x86/kvm/cpuid.h | 1 + arch/x86/kvm/x86.c | 1 + 3 files changed, 26 insertions(+), 14 deletions(-) diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c index a825878b7f84..001f5a94880e 100644 --- a/arch/x86/kvm/cpuid.c +++ b/arch/x86/kvm/cpuid.c @@ -76,7 +76,6 @@ static int kvm_check_cpuid(struct kvm_vcpu *vcpu) void kvm_update_cpuid(struct kvm_vcpu *vcpu) { struct kvm_cpuid_entry2 *best; - struct kvm_lapic *apic = vcpu->arch.apic; best = kvm_find_cpuid_entry(vcpu, 1, 0); if (best) { @@ -89,26 +88,14 @@ void kvm_update_cpuid(struct kvm_vcpu *vcpu) vcpu->arch.apic_base & MSR_IA32_APICBASE_ENABLE); } - if (best && apic) { - if (cpuid_entry_has(best, X86_FEATURE_TSC_DEADLINE_TIMER)) - apic->lapic_timer.timer_mode_mask = 3 << 17; - else - apic->lapic_timer.timer_mode_mask = 1 << 17; - } - best = kvm_find_cpuid_entry(vcpu, 7, 0); if (best && boot_cpu_has(X86_FEATURE_PKU) && best->function == 0x7) cpuid_entry_change(best, X86_FEATURE_OSPKE, kvm_read_cr4_bits(vcpu, X86_CR4_PKE)); best = kvm_find_cpuid_entry(vcpu, 0xD, 0); - if (!best) { - vcpu->arch.guest_supported_xcr0 = 0; - } else { - vcpu->arch.guest_supported_xcr0 = - (best->eax | ((u64)best->edx << 32)) & supported_xcr0; + if (best) best->ebx = xstate_required_size(vcpu->arch.xcr0, false); - } best = 
kvm_find_cpuid_entry(vcpu, 0xD, 1); if (best && (cpuid_entry_has(best, X86_FEATURE_XSAVES) || @@ -127,6 +114,27 @@ void kvm_update_cpuid(struct kvm_vcpu *vcpu) vcpu->arch.ia32_misc_enable_msr & MSR_IA32_MISC_ENABLE_MWAIT); } +} + +void kvm_update_vcpu_model(struct kvm_vcpu *vcpu) +{ + struct kvm_lapic *apic = vcpu->arch.apic; + struct kvm_cpuid_entry2 *best; + + best = kvm_find_cpuid_entry(vcpu, 1, 0); + if (best && apic) { + if (cpuid_entry_has(best, X86_FEATURE_TSC_DEADLINE_TIMER)) + apic->lapic_timer.timer_mode_mask = 3 << 17; + else + apic->lapic_timer.timer_mode_mask = 1 << 17; + } + + best = kvm_find_cpuid_entry(vcpu, 0xD, 0); + if (!best) + vcpu->arch.guest_supported_xcr0 = 0; + else + vcpu->arch.guest_supported_xcr0 = + (best->eax | ((u64)best->edx << 32)) & supported_xcr0; /* Note, maxphyaddr must be updated before tdp_level. */ vcpu->arch.maxphyaddr = cpuid_query_maxphyaddr(vcpu); @@ -218,6 +226,7 @@ int kvm_vcpu_ioctl_set_cpuid(struct kvm_vcpu *vcpu, kvm_apic_set_version(vcpu); kvm_x86_ops.cpuid_update(vcpu); kvm_update_cpuid(vcpu); + kvm_update_vcpu_model(vcpu); kvfree(cpuid_entries); out: @@ -247,6 +256,7 @@ int kvm_vcpu_ioctl_set_cpuid2(struct kvm_vcpu *vcpu, kvm_apic_set_version(vcpu); kvm_x86_ops.cpuid_update(vcpu); kvm_update_cpuid(vcpu); + kvm_update_vcpu_model(vcpu); out: return r; } diff --git a/arch/x86/kvm/cpuid.h b/arch/x86/kvm/cpuid.h index f136de1debad..45e3643e2fba 100644 --- a/arch/x86/kvm/cpuid.h +++ b/arch/x86/kvm/cpuid.h @@ -10,6 +10,7 @@ extern u32 kvm_cpu_caps[NCAPINTS] __read_mostly; void kvm_set_cpu_caps(void); void kvm_update_cpuid(struct kvm_vcpu *vcpu); +void kvm_update_vcpu_model(struct kvm_vcpu *vcpu); struct kvm_cpuid_entry2 *kvm_find_cpuid_entry(struct kvm_vcpu *vcpu, u32 function, u32 index); int kvm_dev_ioctl_get_cpuid(struct kvm_cpuid2 *cpuid, diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 09ee54f5e385..6f376392e6e6 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -8184,6 +8184,7 @@ static void 
enter_smm(struct kvm_vcpu *vcpu) #endif kvm_update_cpuid(vcpu); + kvm_update_vcpu_model(vcpu); kvm_mmu_reset_context(vcpu); }
Re: [PATCH v3 3/8] KVM: X86: Introduce kvm_check_cpuid()
On 7/8/2020 2:50 PM, Xiaoyao Li wrote: Use kvm_check_cpuid() to validate if userspace provides legal cpuid settings and call it before KVM updates CPUID. Signed-off-by: Xiaoyao Li [...] @@ -202,12 +208,16 @@ int kvm_vcpu_ioctl_set_cpuid(struct kvm_vcpu *vcpu, vcpu->arch.cpuid_entries[i].padding[2] = 0; } vcpu->arch.cpuid_nent = cpuid->nent; + r = kvm_check_cpuid(vcpu); + if (r) { + vcpu->arch.cpuid_nent = 0; Paolo, a kvfree(cpuid_entries) is missing here before the goto. Can you help fix it? Apologies for the oversight. + goto out; + } + cpuid_fix_nx_cap(vcpu); kvm_apic_set_version(vcpu); kvm_x86_ops.cpuid_update(vcpu); - r = kvm_update_cpuid(vcpu); - if (r) - vcpu->arch.cpuid_nent = 0; + kvm_update_cpuid(vcpu); kvfree(cpuid_entries); out:
Re: [PATCH v2] x86/speculation/l1tf: Add KConfig for setting the L1D cache flush mode
On 7/3/2020 5:38 AM, Abhishek Bhardwaj wrote: This change adds a new kernel configuration that sets the l1d cache flush setting at compile time rather than at run time. Signed-off-by: Abhishek Bhardwaj --- Changes in v2: - Fix typo in the help of the new KConfig. arch/x86/kernel/cpu/bugs.c | 8 arch/x86/kvm/Kconfig | 17 + 2 files changed, 25 insertions(+) diff --git a/arch/x86/kernel/cpu/bugs.c b/arch/x86/kernel/cpu/bugs.c index 0b71970d2d3d2..1dcc875cf5547 100644 --- a/arch/x86/kernel/cpu/bugs.c +++ b/arch/x86/kernel/cpu/bugs.c @@ -1406,7 +1406,15 @@ enum l1tf_mitigations l1tf_mitigation __ro_after_init = L1TF_MITIGATION_FLUSH; #if IS_ENABLED(CONFIG_KVM_INTEL) EXPORT_SYMBOL_GPL(l1tf_mitigation); #endif +#if (CONFIG_KVM_VMENTRY_L1D_FLUSH == 1) +enum vmx_l1d_flush_state l1tf_vmx_mitigation = VMENTER_L1D_FLUSH_NEVER; +#elif (CONFIG_KVM_VMENTRY_L1D_FLUSH == 2) +enum vmx_l1d_flush_state l1tf_vmx_mitigation = VMENTER_L1D_FLUSH_COND; +#elif (CONFIG_KVM_VMENTRY_L1D_FLUSH == 3) +enum vmx_l1d_flush_state l1tf_vmx_mitigation = VMENTER_L1D_FLUSH_ALWAYS; +#else enum vmx_l1d_flush_state l1tf_vmx_mitigation = VMENTER_L1D_FLUSH_AUTO; +#endif how about enum vmx_l1d_flush_state l1tf_vmx_mitigation = #if (CONFIG_KVM_VMENTRY_L1D_FLUSH == 1) VMENTER_L1D_FLUSH_NEVER; #elif (CONFIG_KVM_VMENTRY_L1D_FLUSH == 2) VMENTER_L1D_FLUSH_COND; #elif (CONFIG_KVM_VMENTRY_L1D_FLUSH == 3) VMENTER_L1D_FLUSH_ALWAYS; #else VMENTER_L1D_FLUSH_AUTO; #endif
[PATCH v3 4/8] KVM: X86: Split kvm_update_cpuid()
Split the part of updating vcpu model out of kvm_update_cpuid(), and put it into a new kvm_update_vcpu_model(). So it's more clear that kvm_update_cpuid() is to update guest CPUID settings, while kvm_update_vcpu_model() is to update vcpu model (settings) based on the updated CPUID settings. Signed-off-by: Xiaoyao Li --- arch/x86/kvm/cpuid.c | 38 -- arch/x86/kvm/cpuid.h | 1 + arch/x86/kvm/x86.c | 1 + 3 files changed, 26 insertions(+), 14 deletions(-) diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c index a825878b7f84..001f5a94880e 100644 --- a/arch/x86/kvm/cpuid.c +++ b/arch/x86/kvm/cpuid.c @@ -76,7 +76,6 @@ static int kvm_check_cpuid(struct kvm_vcpu *vcpu) void kvm_update_cpuid(struct kvm_vcpu *vcpu) { struct kvm_cpuid_entry2 *best; - struct kvm_lapic *apic = vcpu->arch.apic; best = kvm_find_cpuid_entry(vcpu, 1, 0); if (best) { @@ -89,26 +88,14 @@ void kvm_update_cpuid(struct kvm_vcpu *vcpu) vcpu->arch.apic_base & MSR_IA32_APICBASE_ENABLE); } - if (best && apic) { - if (cpuid_entry_has(best, X86_FEATURE_TSC_DEADLINE_TIMER)) - apic->lapic_timer.timer_mode_mask = 3 << 17; - else - apic->lapic_timer.timer_mode_mask = 1 << 17; - } - best = kvm_find_cpuid_entry(vcpu, 7, 0); if (best && boot_cpu_has(X86_FEATURE_PKU) && best->function == 0x7) cpuid_entry_change(best, X86_FEATURE_OSPKE, kvm_read_cr4_bits(vcpu, X86_CR4_PKE)); best = kvm_find_cpuid_entry(vcpu, 0xD, 0); - if (!best) { - vcpu->arch.guest_supported_xcr0 = 0; - } else { - vcpu->arch.guest_supported_xcr0 = - (best->eax | ((u64)best->edx << 32)) & supported_xcr0; + if (best) best->ebx = xstate_required_size(vcpu->arch.xcr0, false); - } best = kvm_find_cpuid_entry(vcpu, 0xD, 1); if (best && (cpuid_entry_has(best, X86_FEATURE_XSAVES) || @@ -127,6 +114,27 @@ void kvm_update_cpuid(struct kvm_vcpu *vcpu) vcpu->arch.ia32_misc_enable_msr & MSR_IA32_MISC_ENABLE_MWAIT); } +} + +void kvm_update_vcpu_model(struct kvm_vcpu *vcpu) +{ + struct kvm_lapic *apic = vcpu->arch.apic; + struct kvm_cpuid_entry2 *best; + + best 
= kvm_find_cpuid_entry(vcpu, 1, 0); + if (best && apic) { + if (cpuid_entry_has(best, X86_FEATURE_TSC_DEADLINE_TIMER)) + apic->lapic_timer.timer_mode_mask = 3 << 17; + else + apic->lapic_timer.timer_mode_mask = 1 << 17; + } + + best = kvm_find_cpuid_entry(vcpu, 0xD, 0); + if (!best) + vcpu->arch.guest_supported_xcr0 = 0; + else + vcpu->arch.guest_supported_xcr0 = + (best->eax | ((u64)best->edx << 32)) & supported_xcr0; /* Note, maxphyaddr must be updated before tdp_level. */ vcpu->arch.maxphyaddr = cpuid_query_maxphyaddr(vcpu); @@ -218,6 +226,7 @@ int kvm_vcpu_ioctl_set_cpuid(struct kvm_vcpu *vcpu, kvm_apic_set_version(vcpu); kvm_x86_ops.cpuid_update(vcpu); kvm_update_cpuid(vcpu); + kvm_update_vcpu_model(vcpu); kvfree(cpuid_entries); out: @@ -247,6 +256,7 @@ int kvm_vcpu_ioctl_set_cpuid2(struct kvm_vcpu *vcpu, kvm_apic_set_version(vcpu); kvm_x86_ops.cpuid_update(vcpu); kvm_update_cpuid(vcpu); + kvm_update_vcpu_model(vcpu); out: return r; } diff --git a/arch/x86/kvm/cpuid.h b/arch/x86/kvm/cpuid.h index f136de1debad..45e3643e2fba 100644 --- a/arch/x86/kvm/cpuid.h +++ b/arch/x86/kvm/cpuid.h @@ -10,6 +10,7 @@ extern u32 kvm_cpu_caps[NCAPINTS] __read_mostly; void kvm_set_cpu_caps(void); void kvm_update_cpuid(struct kvm_vcpu *vcpu); +void kvm_update_vcpu_model(struct kvm_vcpu *vcpu); struct kvm_cpuid_entry2 *kvm_find_cpuid_entry(struct kvm_vcpu *vcpu, u32 function, u32 index); int kvm_dev_ioctl_get_cpuid(struct kvm_cpuid2 *cpuid, diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 09ee54f5e385..6f376392e6e6 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -8184,6 +8184,7 @@ static void enter_smm(struct kvm_vcpu *vcpu) #endif kvm_update_cpuid(vcpu); + kvm_update_vcpu_model(vcpu); kvm_mmu_reset_context(vcpu); } -- 2.18.4
[PATCH v3 8/8] KVM: X86: Move kvm_apic_set_version() to kvm_update_vcpu_model()
There are no dependencies between kvm_apic_set_version() and kvm_update_cpuid(), because kvm_apic_set_version() queries the X2APIC CPUID bit, which is not touched or changed by kvm_update_cpuid(). kvm_apic_set_version() clearly belongs to the category of updating the vcpu model. Signed-off-by: Xiaoyao Li --- arch/x86/kvm/cpuid.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c index 89ffd9dccfc6..c183a11dbcff 100644 --- a/arch/x86/kvm/cpuid.c +++ b/arch/x86/kvm/cpuid.c @@ -129,6 +129,8 @@ void kvm_update_vcpu_model(struct kvm_vcpu *vcpu) apic->lapic_timer.timer_mode_mask = 3 << 17; else apic->lapic_timer.timer_mode_mask = 1 << 17; + + kvm_apic_set_version(vcpu); } best = kvm_find_cpuid_entry(vcpu, 0xD, 0); @@ -225,7 +227,6 @@ int kvm_vcpu_ioctl_set_cpuid(struct kvm_vcpu *vcpu, } cpuid_fix_nx_cap(vcpu); - kvm_apic_set_version(vcpu); kvm_update_cpuid(vcpu); kvm_update_vcpu_model(vcpu); @@ -254,7 +255,6 @@ int kvm_vcpu_ioctl_set_cpuid2(struct kvm_vcpu *vcpu, goto out; } - kvm_apic_set_version(vcpu); kvm_update_cpuid(vcpu); kvm_update_vcpu_model(vcpu); out: -- 2.18.4
[PATCH v3 6/8] KVM: X86: Move kvm_x86_ops.update_vcpu_model() into kvm_update_vcpu_model()
kvm_x86_ops.update_vcpu_model() is used to update vmx/svm vcpu settings based on updated CPUID settings, so it's supposed to be called after CPUIDs are fully updated, i.e., after kvm_update_cpuid(). Currently, kvm_update_cpuid() only updates the CPUID bits of OSXSAVE, APIC, OSPKE, MWAIT, KVM_FEATURE_PV_UNHALT, and the ebx of (leaf 0xD, subleaf 0x0) and of (leaf 0xD, subleaf 0x1). None of them is consumed by vmx/svm's update_vcpu_model(), so there is no dependency between kvm_x86_ops.update_vcpu_model() and kvm_update_cpuid(). Moving kvm_x86_ops.update_vcpu_model() into kvm_update_vcpu_model() makes the flow more reasonable. Signed-off-by: Xiaoyao Li --- arch/x86/kvm/cpuid.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c index d2f93823f9fd..89ffd9dccfc6 100644 --- a/arch/x86/kvm/cpuid.c +++ b/arch/x86/kvm/cpuid.c @@ -121,6 +121,8 @@ void kvm_update_vcpu_model(struct kvm_vcpu *vcpu) struct kvm_lapic *apic = vcpu->arch.apic; struct kvm_cpuid_entry2 *best; + kvm_x86_ops.update_vcpu_model(vcpu); + best = kvm_find_cpuid_entry(vcpu, 1, 0); if (best && apic) { if (cpuid_entry_has(best, X86_FEATURE_TSC_DEADLINE_TIMER)) @@ -224,7 +226,6 @@ int kvm_vcpu_ioctl_set_cpuid, cpuid_fix_nx_cap(vcpu); kvm_apic_set_version(vcpu); - kvm_x86_ops.update_vcpu_model(vcpu); kvm_update_cpuid(vcpu); kvm_update_vcpu_model(vcpu); @@ -254,7 +255,6 @@ int kvm_vcpu_ioctl_set_cpuid2(struct kvm_vcpu *vcpu, } kvm_apic_set_version(vcpu); - kvm_x86_ops.update_vcpu_model(vcpu); kvm_update_cpuid(vcpu); kvm_update_vcpu_model(vcpu); out: -- 2.18.4
[PATCH v3 7/8] KVM: lapic: Use guest_cpuid_has() in kvm_apic_set_version()
Only code cleanup and no functional change. Signed-off-by: Xiaoyao Li --- arch/x86/kvm/lapic.c | 4 +--- 1 file changed, 1 insertion(+), 3 deletions(-) diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c index 5bf72fc86a8e..e5dbb7ebae78 100644 --- a/arch/x86/kvm/lapic.c +++ b/arch/x86/kvm/lapic.c @@ -354,7 +354,6 @@ static inline int apic_lvt_nmi_mode(u32 lvt_val) void kvm_apic_set_version(struct kvm_vcpu *vcpu) { struct kvm_lapic *apic = vcpu->arch.apic; - struct kvm_cpuid_entry2 *feat; u32 v = APIC_VERSION; if (!lapic_in_kernel(vcpu)) @@ -367,8 +366,7 @@ void kvm_apic_set_version(struct kvm_vcpu *vcpu) * version first and level-triggered interrupts never get EOIed in * IOAPIC. */ - feat = kvm_find_cpuid_entry(apic->vcpu, 0x1, 0); - if (feat && (feat->ecx & (1 << (X86_FEATURE_X2APIC & 31))) && + if (guest_cpuid_has(vcpu, X86_FEATURE_X2APIC) && !ioapic_in_kernel(vcpu->kvm)) v |= APIC_LVR_DIRECTED_EOI; kvm_lapic_set_reg(apic, APIC_LVR, v); -- 2.18.4
[PATCH v3 3/8] KVM: X86: Introduce kvm_check_cpuid()
Use kvm_check_cpuid() to validate whether userspace provides legal CPUID settings, and call it before KVM updates the CPUID. Signed-off-by: Xiaoyao Li --- Is the check of the virtual address width really necessary? KVM doesn't check other bits at all. I guess the policy is that KVM allows illegal CPUID settings as long as they don't break the host kernel; userspace takes the consequences itself if it sets bogus CPUID settings that break its guest. But why is vaddr_bits special? It seems illegal vaddr_bits won't break the host kernel. --- arch/x86/kvm/cpuid.c | 54 arch/x86/kvm/cpuid.h | 2 +- 2 files changed, 35 insertions(+), 21 deletions(-) diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c index 0e3a23c2ea55..a825878b7f84 100644 --- a/arch/x86/kvm/cpuid.c +++ b/arch/x86/kvm/cpuid.c @@ -54,7 +54,26 @@ static u32 xstate_required_size(u64 xstate_bv, bool compacted) #define F feature_bit -int kvm_update_cpuid(struct kvm_vcpu *vcpu) +static int kvm_check_cpuid(struct kvm_vcpu *vcpu) +{ + struct kvm_cpuid_entry2 *best; + + /* +* The existing code assumes virtual address is 48-bit or 57-bit in the +* canonical address checks; exit if it is ever changed. +*/ + best = kvm_find_cpuid_entry(vcpu, 0x8008, 0); + if (best) { + int vaddr_bits = (best->eax & 0xff00) >> 8; + + if (vaddr_bits != 48 && vaddr_bits != 57 && vaddr_bits != 0) + return -EINVAL; + } + + return 0; +} + +void kvm_update_cpuid(struct kvm_vcpu *vcpu) { struct kvm_cpuid_entry2 *best; struct kvm_lapic *apic = vcpu->arch.apic; @@ -96,18 +115,6 @@ int kvm_update_cpuid(struct kvm_vcpu *vcpu) cpuid_entry_has(best, X86_FEATURE_XSAVEC))) best->ebx = xstate_required_size(vcpu->arch.xcr0, true); - /* -* The existing code assumes virtual address is 48-bit or 57-bit in the -* canonical address checks; exit if it is ever changed. 
-*/ - best = kvm_find_cpuid_entry(vcpu, 0x8008, 0); - if (best) { - int vaddr_bits = (best->eax & 0xff00) >> 8; - - if (vaddr_bits != 48 && vaddr_bits != 57 && vaddr_bits != 0) - return -EINVAL; - } - best = kvm_find_cpuid_entry(vcpu, KVM_CPUID_FEATURES, 0); if (kvm_hlt_in_guest(vcpu->kvm) && best && (best->eax & (1 << KVM_FEATURE_PV_UNHALT))) @@ -127,7 +134,6 @@ int kvm_update_cpuid(struct kvm_vcpu *vcpu) kvm_mmu_reset_context(vcpu); kvm_pmu_refresh(vcpu); - return 0; } static int is_efer_nx(void) @@ -202,12 +208,16 @@ int kvm_vcpu_ioctl_set_cpuid(struct kvm_vcpu *vcpu, vcpu->arch.cpuid_entries[i].padding[2] = 0; } vcpu->arch.cpuid_nent = cpuid->nent; + r = kvm_check_cpuid(vcpu); + if (r) { + vcpu->arch.cpuid_nent = 0; + goto out; + } + cpuid_fix_nx_cap(vcpu); kvm_apic_set_version(vcpu); kvm_x86_ops.cpuid_update(vcpu); - r = kvm_update_cpuid(vcpu); - if (r) - vcpu->arch.cpuid_nent = 0; + kvm_update_cpuid(vcpu); kvfree(cpuid_entries); out: @@ -228,11 +238,15 @@ int kvm_vcpu_ioctl_set_cpuid2(struct kvm_vcpu *vcpu, cpuid->nent * sizeof(struct kvm_cpuid_entry2))) goto out; vcpu->arch.cpuid_nent = cpuid->nent; + r = kvm_check_cpuid(vcpu); + if (r) { + vcpu->arch.cpuid_nent = 0; + goto out; + } + kvm_apic_set_version(vcpu); kvm_x86_ops.cpuid_update(vcpu); - r = kvm_update_cpuid(vcpu); - if (r) - vcpu->arch.cpuid_nent = 0; + kvm_update_cpuid(vcpu); out: return r; } diff --git a/arch/x86/kvm/cpuid.h b/arch/x86/kvm/cpuid.h index 05434cd9342f..f136de1debad 100644 --- a/arch/x86/kvm/cpuid.h +++ b/arch/x86/kvm/cpuid.h @@ -9,7 +9,7 @@ extern u32 kvm_cpu_caps[NCAPINTS] __read_mostly; void kvm_set_cpu_caps(void); -int kvm_update_cpuid(struct kvm_vcpu *vcpu); +void kvm_update_cpuid(struct kvm_vcpu *vcpu); struct kvm_cpuid_entry2 *kvm_find_cpuid_entry(struct kvm_vcpu *vcpu, u32 function, u32 index); int kvm_dev_ioctl_get_cpuid(struct kvm_cpuid2 *cpuid, -- 2.18.4
[PATCH v3 2/8] KVM: X86: Go on updating other CPUID leaves when leaf 1 is absent
As handling of bits outside leaf 1 has been added over time, kvm_update_cpuid() should not return directly when leaf 1 is absent, but should go on updating the other CPUID leaves. Keep the update of apic->lapic_timer.timer_mode_mask in a separate wrapper, to minimize code churn, since it will be moved out of this function in a future patch. Signed-off-by: Xiaoyao Li --- arch/x86/kvm/cpuid.c | 15 +++ 1 file changed, 7 insertions(+), 8 deletions(-) diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c index 1d13bad42bf9..0e3a23c2ea55 100644 --- a/arch/x86/kvm/cpuid.c +++ b/arch/x86/kvm/cpuid.c @@ -60,18 +60,17 @@ int kvm_update_cpuid(struct kvm_vcpu *vcpu) struct kvm_lapic *apic = vcpu->arch.apic; best = kvm_find_cpuid_entry(vcpu, 1, 0); - if (!best) - return 0; - - /* Update OSXSAVE bit */ - if (boot_cpu_has(X86_FEATURE_XSAVE) && best->function == 0x1) - cpuid_entry_change(best, X86_FEATURE_OSXSAVE, + if (best) { + /* Update OSXSAVE bit */ + if (boot_cpu_has(X86_FEATURE_XSAVE)) + cpuid_entry_change(best, X86_FEATURE_OSXSAVE, kvm_read_cr4_bits(vcpu, X86_CR4_OSXSAVE)); - cpuid_entry_change(best, X86_FEATURE_APIC, + cpuid_entry_change(best, X86_FEATURE_APIC, vcpu->arch.apic_base & MSR_IA32_APICBASE_ENABLE); + } - if (apic) { + if (best && apic) { if (cpuid_entry_has(best, X86_FEATURE_TSC_DEADLINE_TIMER)) apic->lapic_timer.timer_mode_mask = 3 << 17; else -- 2.18.4
[PATCH v3 5/8] KVM: X86: Rename cpuid_update() to update_vcpu_model()
The name of callback cpuid_update() is misleading that it's not about updating CPUID settings of vcpu but updating the configurations of vcpu based on the CPUIDs. So rename it to update_vcpu_model(). Signed-off-by: Xiaoyao Li --- arch/x86/include/asm/kvm_host.h | 2 +- arch/x86/kvm/cpuid.c| 4 ++-- arch/x86/kvm/svm/svm.c | 4 ++-- arch/x86/kvm/vmx/nested.c | 2 +- arch/x86/kvm/vmx/vmx.c | 4 ++-- 5 files changed, 8 insertions(+), 8 deletions(-) diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index 97cb005c7aa7..c35d14b257c9 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -1051,7 +1051,7 @@ struct kvm_x86_ops { void (*hardware_unsetup)(void); bool (*cpu_has_accelerated_tpr)(void); bool (*has_emulated_msr)(u32 index); - void (*cpuid_update)(struct kvm_vcpu *vcpu); + void (*update_vcpu_model)(struct kvm_vcpu *vcpu); unsigned int vm_size; int (*vm_init)(struct kvm *kvm); diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c index 001f5a94880e..d2f93823f9fd 100644 --- a/arch/x86/kvm/cpuid.c +++ b/arch/x86/kvm/cpuid.c @@ -224,7 +224,7 @@ int kvm_vcpu_ioctl_set_cpuid(struct kvm_vcpu *vcpu, cpuid_fix_nx_cap(vcpu); kvm_apic_set_version(vcpu); - kvm_x86_ops.cpuid_update(vcpu); + kvm_x86_ops.update_vcpu_model(vcpu); kvm_update_cpuid(vcpu); kvm_update_vcpu_model(vcpu); @@ -254,7 +254,7 @@ int kvm_vcpu_ioctl_set_cpuid2(struct kvm_vcpu *vcpu, } kvm_apic_set_version(vcpu); - kvm_x86_ops.cpuid_update(vcpu); + kvm_x86_ops.update_vcpu_model(vcpu); kvm_update_cpuid(vcpu); kvm_update_vcpu_model(vcpu); out: diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c index 74096aa72ad9..01f359e590d5 100644 --- a/arch/x86/kvm/svm/svm.c +++ b/arch/x86/kvm/svm/svm.c @@ -3550,7 +3550,7 @@ static u64 svm_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio) return 0; } -static void svm_cpuid_update(struct kvm_vcpu *vcpu) +static void svm_update_vcpu_model(struct kvm_vcpu *vcpu) { struct vcpu_svm *svm = to_svm(vcpu); @@ 
-4050,7 +4050,7 @@ static struct kvm_x86_ops svm_x86_ops __initdata = { .get_exit_info = svm_get_exit_info, - .cpuid_update = svm_cpuid_update, + .update_vcpu_model = svm_update_vcpu_model, .has_wbinvd_exit = svm_has_wbinvd_exit, diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c index b627c5f36b9e..85080a5b8d3c 100644 --- a/arch/x86/kvm/vmx/nested.c +++ b/arch/x86/kvm/vmx/nested.c @@ -6354,7 +6354,7 @@ void nested_vmx_setup_ctls_msrs(struct nested_vmx_msrs *msrs, u32 ept_caps) /* * secondary cpu-based controls. Do not include those that -* depend on CPUID bits, they are added later by vmx_cpuid_update. +* depend on CPUID bits, they are added later by vmx_update_vcpu_model. */ if (msrs->procbased_ctls_high & CPU_BASED_ACTIVATE_SECONDARY_CONTROLS) rdmsr(MSR_IA32_VMX_PROCBASED_CTLS2, diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c index 8187ca152ad2..4673c84b54ac 100644 --- a/arch/x86/kvm/vmx/vmx.c +++ b/arch/x86/kvm/vmx/vmx.c @@ -7257,7 +7257,7 @@ static void update_intel_pt_cfg(struct kvm_vcpu *vcpu) vmx->pt_desc.ctl_bitmask &= ~(0xfULL << (32 + i * 4)); } -static void vmx_cpuid_update(struct kvm_vcpu *vcpu) +static void vmx_update_vcpu_model(struct kvm_vcpu *vcpu) { struct vcpu_vmx *vmx = to_vmx(vcpu); @@ -7915,7 +7915,7 @@ static struct kvm_x86_ops vmx_x86_ops __initdata = { .get_exit_info = vmx_get_exit_info, - .cpuid_update = vmx_cpuid_update, + .update_vcpu_model = vmx_update_vcpu_model, .has_wbinvd_exit = cpu_has_vmx_wbinvd_exit, -- 2.18.4
[PATCH v3 1/8] KVM: X86: Reset vcpu->arch.cpuid_nent to 0 if SET_CPUID* fails
The current implementation keeps the userspace-provided CPUID configuration and cpuid->nent even if kvm_update_cpuid() fails. Reset vcpu->arch.cpuid_nent to 0 in the failure case as a simple fix. Besides, update the doc to explicitly state that if a SET_CPUID* ioctl fails, KVM gives no guarantee that the previous valid CPUID configuration is kept. Signed-off-by: Xiaoyao Li --- Documentation/virt/kvm/api.rst | 4 arch/x86/kvm/cpuid.c | 4 2 files changed, 8 insertions(+) diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst index 1cfe79b932d6..3ca809a1a44f 100644 --- a/Documentation/virt/kvm/api.rst +++ b/Documentation/virt/kvm/api.rst @@ -669,6 +669,10 @@ MSRs that have been set successfully. Defines the vcpu responses to the cpuid instruction. Applications should use the KVM_SET_CPUID2 ioctl if available. +Note, when this IOCTL fails, KVM gives no guarantees that previous valid CPUID +configuration (if there is) is not corrupted. Userspace can get a copy of valid +CPUID configuration through KVM_GET_CPUID2 in case. + :: struct kvm_cpuid_entry { diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c index 8a294f9747aa..1d13bad42bf9 100644 --- a/arch/x86/kvm/cpuid.c +++ b/arch/x86/kvm/cpuid.c @@ -207,6 +207,8 @@ int kvm_vcpu_ioctl_set_cpuid(struct kvm_vcpu *vcpu, kvm_apic_set_version(vcpu); kvm_x86_ops.cpuid_update(vcpu); r = kvm_update_cpuid(vcpu); + if (r) + vcpu->arch.cpuid_nent = 0; kvfree(cpuid_entries); out: @@ -230,6 +232,8 @@ int kvm_vcpu_ioctl_set_cpuid2(struct kvm_vcpu *vcpu, kvm_apic_set_version(vcpu); kvm_x86_ops.cpuid_update(vcpu); r = kvm_update_cpuid(vcpu); + if (r) + vcpu->arch.cpuid_nent = 0; out: return r; } -- 2.18.4
[PATCH v3 0/8] Refactor handling flow of KVM_SET_CPUID*
This series is the extended version of https://lkml.kernel.org/r/20200528151927.14346-1-xiaoyao...@intel.com The first two patches are bug fixes, and the others aim to refactor the flow of SET_CPUID* as: 1. cpuid check: check if userspace provides legal CPUID settings; 2. cpuid update: update some special CPUID bits based on current vcpu state, e.g., OSXSAVE, OSPKE, ...; 3. update vcpu model: update vcpu model (settings) based on the final CPUID settings. v3: - Add a note in KVM api doc to state the previous CPUID configuration is not reliable if current KVM_SET_CPUID* fails [Jim] - Adjust Patch 2 to reduce code churn [Sean] - Commit message refine to add more justification [Sean] - Add a new patch (7) v2: https://lkml.kernel.org/r/20200623115816.24132-1-xiaoyao...@intel.com - rebase to kvm/queue: a037ff353ba6 ("Merge branch 'kvm-master' into HEAD") - change the name of kvm_update_state_based_on_cpuid() to kvm_update_vcpu_model() [Sean] - Add patch 5 to rename kvm_x86_ops.cpuid_update() to kvm_x86_ops.update_vcpu_model() v1: https://lkml.kernel.org/r/20200529085545.29242-1-xiaoyao...@intel.com Xiaoyao Li (8): KVM: X86: Reset vcpu->arch.cpuid_nent to 0 if SET_CPUID* fails KVM: X86: Go on updating other CPUID leaves when leaf 1 is absent KVM: X86: Introduce kvm_check_cpuid() KVM: X86: Split kvm_update_cpuid() KVM: X86: Rename cpuid_update() to update_vcpu_model() KVM: X86: Move kvm_x86_ops.update_vcpu_model() into kvm_update_vcpu_model() KVM: lapic: Use guest_cpuid_has() in kvm_apic_set_version() KVM: X86: Move kvm_apic_set_version() to kvm_update_vcpu_model() Documentation/virt/kvm/api.rst | 4 ++ arch/x86/include/asm/kvm_host.h | 2 +- arch/x86/kvm/cpuid.c| 107 arch/x86/kvm/cpuid.h| 3 +- arch/x86/kvm/lapic.c| 4 +- arch/x86/kvm/svm/svm.c | 4 +- arch/x86/kvm/vmx/nested.c | 2 +- arch/x86/kvm/vmx/vmx.c | 4 +- arch/x86/kvm/x86.c | 1 + 9 files changed, 81 insertions(+), 50 deletions(-) -- 2.18.4
Re: [PATCH v12 07/11] KVM: vmx/pmu: Unmask LBR fields in the MSR_IA32_DEBUGCTLMSR emulation
On 7/8/2020 4:21 AM, Sean Christopherson wrote: On Sat, Jun 13, 2020 at 05:42:50PM +0800, Xu, Like wrote: On 2020/6/13 17:14, Xiaoyao Li wrote: On 6/13/2020 4:09 PM, Like Xu wrote: [...] @@ -237,6 +238,9 @@ static int intel_pmu_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info) return 1; msr_info->data = vcpu->arch.perf_capabilities; return 0; + case MSR_IA32_DEBUGCTLMSR: + msr_info->data = vmcs_read64(GUEST_IA32_DEBUGCTL); Can we put the emulation of MSR_IA32_DEBUGCTLMSR in vmx_{get,set}_msr()? AFAIK, MSR_IA32_DEBUGCTLMSR is not a pure PMU-related MSR: there is bit 2 to enable #DB for bus lock. We already have a "case MSR_IA32_DEBUGCTLMSR" handler in vmx_set_msr() and you may apply your bus lock changes in that handler. Hrm, but that'd be a weird dependency as vmx_set_msr() would need to check for #DB bus lock support but not actually write GUEST_IA32_DEBUGCTL, or we'd end up writing it twice when both bus lock and LBR are supported. Yeah. That's what I was concerned about as well. I don't see anything in the series that takes action on writes to MSR_IA32_DEBUGCTLMSR beyond updating the VMCS, i.e. AFAICT there isn't any reason to call into the PMU; VMX can simply query vmx_get_perf_capabilities() to check if it's legal to enable DEBUGCTLMSR_LBR_MASK. A question for both LBR and bus lock: would it make sense to cache the guest's value in vcpu_vmx so that querying the guest value doesn't require a VMREAD? I don't have a good feel for how frequently it would be accessed. Caching the guest's value is OK, even though the #DB bus lock bit wouldn't be toggled frequently in a normal OS.
Re: [PATCH v2 7/7] KVM: X86: Move kvm_apic_set_version() to kvm_update_vcpu_model()
On 7/3/2020 3:00 AM, Sean Christopherson wrote: On Tue, Jun 23, 2020 at 07:58:16PM +0800, Xiaoyao Li wrote: Obviously, kvm_apic_set_version() fits well in kvm_update_vcpu_model(). Same as the last patch, it would be nice to explicitly document that there are no dependencies between kvm_apic_set_version() and kvm_update_cpuid(). Sure, will do. Thanks! Signed-off-by: Xiaoyao Li --- arch/x86/kvm/cpuid.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c index 5decc2dd5448..3428f4d84b42 100644 --- a/arch/x86/kvm/cpuid.c +++ b/arch/x86/kvm/cpuid.c @@ -129,6 +129,8 @@ void kvm_update_vcpu_model(struct kvm_vcpu *vcpu) apic->lapic_timer.timer_mode_mask = 3 << 17; else apic->lapic_timer.timer_mode_mask = 1 << 17; + + kvm_apic_set_version(vcpu); } best = kvm_find_cpuid_entry(vcpu, 0xD, 0); @@ -226,7 +228,6 @@ int kvm_vcpu_ioctl_set_cpuid(struct kvm_vcpu *vcpu, } cpuid_fix_nx_cap(vcpu); - kvm_apic_set_version(vcpu); kvm_update_cpuid(vcpu); kvm_update_vcpu_model(vcpu); @@ -255,7 +256,6 @@ int kvm_vcpu_ioctl_set_cpuid2(struct kvm_vcpu *vcpu, goto out; } - kvm_apic_set_version(vcpu); kvm_update_cpuid(vcpu); kvm_update_vcpu_model(vcpu); out: -- 2.18.2
Re: [PATCH v2 6/7] KVM: X86: Move kvm_x86_ops.update_vcpu_model() into kvm_update_vcpu_model()
On 7/3/2020 2:59 AM, Sean Christopherson wrote: On Tue, Jun 23, 2020 at 07:58:15PM +0800, Xiaoyao Li wrote: kvm_x86_ops.update_vcpu_model() is used to update vmx/svm vcpu settings based on updated CPUID settings. So it's supposed to be called after CPUIDs are fully updated, i.e., kvm_update_cpuid(). Move it into kvm_update_vcpu_model(). The changelog needs to provide an in-depth analysis of VMX and SVM to prove that there are no existing dependencies in the ordering. I've done the analysis a few times over the past few years for a similar change I carried in my SGX code, but dropped that code a while back and haven't done the analysis since. Anyways, it should be documented. No problem. Will add the analysis in the next version. Signed-off-by: Xiaoyao Li --- --- arch/x86/kvm/cpuid.c | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c index d2f93823f9fd..5decc2dd5448 100644 --- a/arch/x86/kvm/cpuid.c +++ b/arch/x86/kvm/cpuid.c @@ -121,6 +121,8 @@ void kvm_update_vcpu_model(struct kvm_vcpu *vcpu) struct kvm_lapic *apic = vcpu->arch.apic; struct kvm_cpuid_entry2 *best; + kvm_x86_ops.update_vcpu_model(vcpu); + best = kvm_find_cpuid_entry(vcpu, 1, 0); if (best && apic) { if (cpuid_entry_has(best, X86_FEATURE_TSC_DEADLINE_TIMER)) apic->lapic_timer.timer_mode_mask = 3 << 17; else apic->lapic_timer.timer_mode_mask = 1 << 17; @@ -136,6 +138,7 @@ void kvm_update_vcpu_model(struct kvm_vcpu *vcpu) vcpu->arch.guest_supported_xcr0 = (best->eax | ((u64)best->edx << 32)) & supported_xcr0; + Spurious whitespace. /* Note, maxphyaddr must be updated before tdp_level. 
*/ vcpu->arch.maxphyaddr = cpuid_query_maxphyaddr(vcpu); vcpu->arch.tdp_level = kvm_x86_ops.get_tdp_level(vcpu); @@ -224,7 +227,6 @@ int kvm_vcpu_ioctl_set_cpuid(struct kvm_vcpu *vcpu, cpuid_fix_nx_cap(vcpu); kvm_apic_set_version(vcpu); - kvm_x86_ops.update_vcpu_model(vcpu); kvm_update_cpuid(vcpu); kvm_update_vcpu_model(vcpu); @@ -254,7 +256,6 @@ int kvm_vcpu_ioctl_set_cpuid2(struct kvm_vcpu *vcpu, } kvm_apic_set_version(vcpu); - kvm_x86_ops.update_vcpu_model(vcpu); kvm_update_cpuid(vcpu); kvm_update_vcpu_model(vcpu); out: -- 2.18.2
Re: [PATCH v2 2/7] KVM: X86: Go on updating other CPUID leaves when leaf 1 is absent
On 7/3/2020 3:02 AM, Sean Christopherson wrote: On Thu, Jul 02, 2020 at 11:54:03AM -0700, Sean Christopherson wrote: On Tue, Jun 23, 2020 at 07:58:11PM +0800, Xiaoyao Li wrote: As handling of bits other than leaf 1 was added over time, kvm_update_cpuid() should not return directly if leaf 1 is absent, but should go on updating other CPUID leaves. Signed-off-by: Xiaoyao Li This should probably be marked for stable. --- arch/x86/kvm/cpuid.c | 23 +++ 1 file changed, 11 insertions(+), 12 deletions(-) diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c index 1d13bad42bf9..0164dac95ef5 100644 --- a/arch/x86/kvm/cpuid.c +++ b/arch/x86/kvm/cpuid.c @@ -60,22 +60,21 @@ int kvm_update_cpuid(struct kvm_vcpu *vcpu) struct kvm_lapic *apic = vcpu->arch.apic; best = kvm_find_cpuid_entry(vcpu, 1, 0); - if (!best) - return 0; Rather than wrap the existing code, what about throwing it in a separate helper? That generates an easier-to-read diff and also has the nice property of getting 'apic' out of the common code. Hrm, that'd be overkill once the apic code is moved in a few patches. What if you keep the cpuid updates wrapped (as in this patch), but then do if (best && apic) { } for the apic path? That'll minimize churn for code that is disappearing, e.g. will make future git archaeologists happy :-). Sure. I'll do it in the next version. 
- - /* Update OSXSAVE bit */ - if (boot_cpu_has(X86_FEATURE_XSAVE) && best->function == 0x1) - cpuid_entry_change(best, X86_FEATURE_OSXSAVE, + if (best) { + /* Update OSXSAVE bit */ + if (boot_cpu_has(X86_FEATURE_XSAVE)) + cpuid_entry_change(best, X86_FEATURE_OSXSAVE, kvm_read_cr4_bits(vcpu, X86_CR4_OSXSAVE)); - cpuid_entry_change(best, X86_FEATURE_APIC, + cpuid_entry_change(best, X86_FEATURE_APIC, vcpu->arch.apic_base & MSR_IA32_APICBASE_ENABLE); - if (apic) { - if (cpuid_entry_has(best, X86_FEATURE_TSC_DEADLINE_TIMER)) - apic->lapic_timer.timer_mode_mask = 3 << 17; - else - apic->lapic_timer.timer_mode_mask = 1 << 17; + if (apic) { + if (cpuid_entry_has(best, X86_FEATURE_TSC_DEADLINE_TIMER)) + apic->lapic_timer.timer_mode_mask = 3 << 17; + else + apic->lapic_timer.timer_mode_mask = 1 << 17; + } } best = kvm_find_cpuid_entry(vcpu, 7, 0); -- 2.18.2
Re: [PATCH v13 03/11] KVM: VMX: Set guest CET MSRs per KVM and host configuration
On 7/1/2020 4:04 PM, Yang Weijiang wrote: CET MSRs are passed through to the guest directly to enhance performance. CET runtime control settings are stored in MSR_IA32_{U,S}_CET, Shadow Stack Pointers (SSP) are stored in MSR_IA32_PL{0,1,2,3}_SSP, and the SSP table base address is stored in MSR_IA32_INT_SSP_TAB; these MSRs are defined in the kernel and re-used here. MSR_IA32_U_CET and MSR_IA32_PL3_SSP are used for user-mode protection; their contents are switched between threads during scheduling, so it makes sense to pass them through so that the guest kernel can use xsaves/xrstors to operate on them efficiently. The other MSRs are used for non-user-mode protection. See the SDM for detailed info. The difference between the CET VMCS fields and the CET MSRs is that the former are used during VM-entry/VM-exit, whereas the latter are used for CET state storage between task/thread scheduling. Co-developed-by: Zhang Yi Z Signed-off-by: Zhang Yi Z Signed-off-by: Yang Weijiang --- arch/x86/kvm/vmx/vmx.c | 46 ++ arch/x86/kvm/x86.c | 3 +++ 2 files changed, 49 insertions(+) diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c index d52d470e36b1..97e766875a7e 100644 --- a/arch/x86/kvm/vmx/vmx.c +++ b/arch/x86/kvm/vmx/vmx.c @@ -3020,6 +3020,13 @@ void vmx_load_mmu_pgd(struct kvm_vcpu *vcpu, unsigned long cr3) vmcs_writel(GUEST_CR3, guest_cr3); } +static bool is_cet_state_supported(struct kvm_vcpu *vcpu, u32 xss_states) +{ + return ((supported_xss & xss_states) && + (guest_cpuid_has(vcpu, X86_FEATURE_SHSTK) || + guest_cpuid_has(vcpu, X86_FEATURE_IBT))); +} + int vmx_set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4) { struct vcpu_vmx *vmx = to_vmx(vcpu); @@ -7098,6 +7105,42 @@ static void update_intel_pt_cfg(struct kvm_vcpu *vcpu) vmx->pt_desc.ctl_bitmask &= ~(0xfULL << (32 + i * 4)); } +static void vmx_update_intercept_for_cet_msr(struct kvm_vcpu *vcpu) +{ + struct vcpu_vmx *vmx = to_vmx(vcpu); + unsigned long *msr_bitmap = vmx->vmcs01.msr_bitmap; + bool incpt; + + incpt = !is_cet_state_supported(vcpu, 
XFEATURE_MASK_CET_USER); + /* +* U_CET is required for USER CET, and U_CET, PL3_SSP are bound as +* one component and controlled by IA32_XSS[bit 11]. +*/ + vmx_set_intercept_for_msr(msr_bitmap, MSR_IA32_U_CET, MSR_TYPE_RW, + incpt); + vmx_set_intercept_for_msr(msr_bitmap, MSR_IA32_PL3_SSP, MSR_TYPE_RW, + incpt); + + incpt = !is_cet_state_supported(vcpu, XFEATURE_MASK_CET_KERNEL); + /* +* S_CET is required for KERNEL CET, and PL0_SSP ... PL2_SSP are +* bound as one component and controlled by IA32_XSS[bit 12]. +*/ + vmx_set_intercept_for_msr(msr_bitmap, MSR_IA32_S_CET, MSR_TYPE_RW, + incpt); + vmx_set_intercept_for_msr(msr_bitmap, MSR_IA32_PL0_SSP, MSR_TYPE_RW, + incpt); + vmx_set_intercept_for_msr(msr_bitmap, MSR_IA32_PL1_SSP, MSR_TYPE_RW, + incpt); + vmx_set_intercept_for_msr(msr_bitmap, MSR_IA32_PL2_SSP, MSR_TYPE_RW, + incpt); + + incpt |= !guest_cpuid_has(vcpu, X86_FEATURE_SHSTK); + /* SSP_TAB is only available for KERNEL SHSTK. */ + vmx_set_intercept_for_msr(msr_bitmap, MSR_IA32_INT_SSP_TAB, MSR_TYPE_RW, + incpt); +} + static void vmx_cpuid_update(struct kvm_vcpu *vcpu) { struct vcpu_vmx *vmx = to_vmx(vcpu); @@ -7136,6 +7179,9 @@ static void vmx_cpuid_update(struct kvm_vcpu *vcpu) vmx_set_guest_msr(vmx, msr, enabled ? 0 : TSX_CTRL_RTM_DISABLE); } } + + if (supported_xss & (XFEATURE_MASK_CET_KERNEL | XFEATURE_MASK_CET_USER)) + vmx_update_intercept_for_cet_msr(vcpu); } static __init void vmx_set_cpu_caps(void) diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index c5835f9cb9ad..6390b62c12ed 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -186,6 +186,9 @@ static struct kvm_shared_msrs __percpu *shared_msrs; | XFEATURE_MASK_BNDCSR | XFEATURE_MASK_AVX512 \ | XFEATURE_MASK_PKRU) +#define KVM_SUPPORTED_XSS (XFEATURE_MASK_CET_USER | \ +XFEATURE_MASK_CET_KERNEL) + This definition needs to be moved to Patch 5? u64 __read_mostly host_efer; EXPORT_SYMBOL_GPL(host_efer);
Re: [RFC 2/2] KVM: VMX: Enable bus lock VM exit
On 7/1/2020 10:49 PM, Vitaly Kuznetsov wrote: Xiaoyao Li writes: On 7/1/2020 8:44 PM, Vitaly Kuznetsov wrote: Xiaoyao Li writes: On 7/1/2020 5:04 PM, Vitaly Kuznetsov wrote: Chenyi Qiang writes: [...] static const int kvm_vmx_max_exit_handlers = @@ -6830,6 +6838,13 @@ static fastpath_t vmx_vcpu_run(struct kvm_vcpu *vcpu) if (unlikely(vmx->exit_reason.failed_vmentry)) return EXIT_FASTPATH_NONE; + /* +* check the exit_reason to see if a bus lock +* happened in the guest. +*/ + if (vmx->exit_reason.bus_lock_detected) + handle_bus_lock(vcpu); In case the ultimate goal is to have an exit to userspace on bus lock, I don't think we will need an exit to userspace on bus lock. See below. The two ways to reach handle_bus_lock() are very different: in case we're handling EXIT_REASON_BUS_LOCK we can easily drop to userspace by returning 0, but what are we going to do in case of exit_reason.bus_lock_detected? The 'higher priority VM exit' may require an exit to userspace too. So what's the plan? Maybe we can ignore the case when we're exiting to userspace for some other reason as this is slow already, and force the exit otherwise? And should we actually introduce KVM_EXIT_BUS_LOCK and a capability to enable it here? Introducing KVM_EXIT_BUS_LOCK may help nothing. Whether it is EXIT_REASON_BUS_LOCK or exit_reason.bus_lock_detected, the bus lock has already happened. Exiting to userspace cannot prevent the bus lock, so all userspace can do is record and count, as this patch does in vcpu->stat.bus_locks. Exiting to userspace would allow implementing custom 'throttling' policies to mitigate the 'noisy neighbour' problem. The simplest would be to just inject some sleep time. So you want an exit to userspace for every bus lock and leave it all to userspace. Yes, it's doable. In some cases we may not even want to have a VM exit: think e.g. the real-time/partitioning case where, even on a bus lock, we may not want to add additional latency just to count such events.
For the real-time case, bus locks need to be avoided altogether, since a bus lock takes many cycles and prevents others from accessing memory. If there is no bus lock, then there is no bus lock VM exit to worry about. If there is a bus lock, the latency requirement may not be met because of it. I'd suggest we make the new capability tri-state: - disabled (no vmexit, default) - stats only (what this patch does) - userspace exit But maybe this is overkill, I'd like to hear what others think. Yeah. Others' thoughts are very welcome. Besides, for how to throttle, KVM may have to take kernel policy into account, since in the spec there is another feature for bare metal to raise a #DB for a bus lock. The native kernel likely implements some policy to restrict the rate at which bus locks can happen, so qemu threads will have to follow that as well. As you said, the exit_reason.bus_lock_detected case is the tricky one. We cannot similarly extend vcpu->run->exit_reason; that breaks the ABI. Maybe we can extend vcpu->run->flags to indicate a detected bus lock for the other exit reasons? This is likely the easiest solution.
Re: [RFC 2/2] KVM: VMX: Enable bus lock VM exit
On 7/1/2020 8:44 PM, Vitaly Kuznetsov wrote: Xiaoyao Li writes: On 7/1/2020 5:04 PM, Vitaly Kuznetsov wrote: Chenyi Qiang writes: [...] static const int kvm_vmx_max_exit_handlers = @@ -6830,6 +6838,13 @@ static fastpath_t vmx_vcpu_run(struct kvm_vcpu *vcpu) if (unlikely(vmx->exit_reason.failed_vmentry)) return EXIT_FASTPATH_NONE; + /* +* check the exit_reason to see if a bus lock +* happened in the guest. +*/ + if (vmx->exit_reason.bus_lock_detected) + handle_bus_lock(vcpu); In case the ultimate goal is to have an exit to userspace on bus lock, I don't think we will need an exit to userspace on bus lock. See below. The two ways to reach handle_bus_lock() are very different: in case we're handling EXIT_REASON_BUS_LOCK we can easily drop to userspace by returning 0, but what are we going to do in case of exit_reason.bus_lock_detected? The 'higher priority VM exit' may require an exit to userspace too. So what's the plan? Maybe we can ignore the case when we're exiting to userspace for some other reason as this is slow already, and force the exit otherwise? And should we actually introduce KVM_EXIT_BUS_LOCK and a capability to enable it here? Introducing KVM_EXIT_BUS_LOCK may help nothing. Whether it is EXIT_REASON_BUS_LOCK or exit_reason.bus_lock_detected, the bus lock has already happened. Exiting to userspace cannot prevent the bus lock, so all userspace can do is record and count, as this patch does in vcpu->stat.bus_locks. Exiting to userspace would allow implementing custom 'throttling' policies to mitigate the 'noisy neighbour' problem. The simplest would be to just inject some sleep time. So you want an exit to userspace for every bus lock and leave it all to userspace. Yes, it's doable. As you said, the exit_reason.bus_lock_detected case is the tricky one. We cannot similarly extend vcpu->run->exit_reason; that breaks the ABI. Maybe we can extend vcpu->run->flags to indicate a detected bus lock for the other exit reasons?
Re: [RFC 2/2] KVM: VMX: Enable bus lock VM exit
On 7/1/2020 5:04 PM, Vitaly Kuznetsov wrote: Chenyi Qiang writes: [...] static const int kvm_vmx_max_exit_handlers = @@ -6830,6 +6838,13 @@ static fastpath_t vmx_vcpu_run(struct kvm_vcpu *vcpu) if (unlikely(vmx->exit_reason.failed_vmentry)) return EXIT_FASTPATH_NONE; + /* +* check the exit_reason to see if a bus lock +* happened in the guest. +*/ + if (vmx->exit_reason.bus_lock_detected) + handle_bus_lock(vcpu); In case the ultimate goal is to have an exit to userspace on bus lock, I don't think we will need an exit to userspace on bus lock. See below. The two ways to reach handle_bus_lock() are very different: in case we're handling EXIT_REASON_BUS_LOCK we can easily drop to userspace by returning 0, but what are we going to do in case of exit_reason.bus_lock_detected? The 'higher priority VM exit' may require an exit to userspace too. So what's the plan? Maybe we can ignore the case when we're exiting to userspace for some other reason as this is slow already, and force the exit otherwise? And should we actually introduce KVM_EXIT_BUS_LOCK and a capability to enable it here? Introducing KVM_EXIT_BUS_LOCK may help nothing. Whether it is EXIT_REASON_BUS_LOCK or exit_reason.bus_lock_detected, the bus lock has already happened. Exiting to userspace cannot prevent the bus lock, so all userspace can do is record and count, as this patch does in vcpu->stat.bus_locks.
Re: [PATCH v9 0/8] KVM: Add virtualization support of split lock detection
Ping for comments. On 5/9/2020 7:05 PM, Xiaoyao Li wrote: This series aims to add the virtualization of split lock detection in KVM. Due to the fact that split lock detection is tightly coupled with the CPU model, and the CPU model is configurable by the host VMM, we elect to use a paravirt method to expose and enumerate it for the guest. Changes in v9 - rebase to v5.7-rc4 - Add one patch to rename TIF_SLD to TIF_SLD_DISABLED; - Add one patch to remove the bogus case in handle_guest_split_lock(); - Introduce flag X86_FEATURE_SPLIT_LOCK_DETECT_FATAL and thus drop sld_state; - Use X86_FEATURE_SPLIT_LOCK_DETECT and X86_FEATURE_SPLIT_LOCK_DETECT_FATAL to determine the SLD state of the host; - Introduce split_lock_virt_switch() and two wrappers for KVM instead of sld_update_to(); - Use paravirt to expose and enumerate split lock detection for the guest; - Split lock detection can be exposed to the guest when the host is sld_fatal, even if SMT is available on the host. Changes in v8: https://lkml.kernel.org/r/20200414063129.133630-1-xiaoyao...@intel.com - rebase to v5.7-rc1. - basic enabling of split lock detection already merged. - When the host is sld_warn and nosmt, load the guest's SLD bit when in KVM context, i.e., between vmx_prepare_switch_to_guest() and vmx_prepare_switch_to_host(); there KVM uses the guest's SLD setting. Changes in v7: https://lkml.kernel.org/r/20200325030924.132881-1-xiaoyao...@intel.com - only pick patch 1 and patch 2, and hold all the rest. - Update the SLD bit on each processor based on sld_state. Changes in v6: https://lkml.kernel.org/r/20200324151859.31068-1-xiaoyao...@intel.com - Drop the sld_not_exist flag and use X86_FEATURE_SPLIT_LOCK_DETECT to check whether split lock detection needs to be initialized. [tglx] - Use tglx's method to verify the existence of split lock detection. - Small optimization of sld_update_msr() so that the default value of msr_test_ctrl_cache has the split_lock_detect bit cleared. - Drop patch 3 in v5 that introduced the kvm_only option. [tglx] - Rebase patch4-8 to kvm/queue. 
- use the new kvm-cpu-cap to expose X86_FEATURE_CORE_CAPABILITIES in Patch 6. Changes in v5: https://lkml.kernel.org/r/20200315050517.127446-1-xiaoyao...@intel.com - Use the X86_FEATURE_SPLIT_LOCK_DETECT flag in KVM to ensure split lock detection is really supported. - Add and export SLD-related helper functions in the KVM patches that use them. Xiaoyao Li (8): x86/split_lock: Rename TIF_SLD to TIF_SLD_DISABLED x86/split_lock: Remove bogus case in handle_guest_split_lock() x86/split_lock: Introduce flag X86_FEATURE_SLD_FATAL and drop sld_state x86/split_lock: Introduce split_lock_virt_switch() and two wrappers x86/kvm: Introduce paravirt split lock detection enumeration KVM: VMX: Enable MSR TEST_CTRL for guest KVM: VMX: virtualize split lock detection x86/split_lock: Enable split lock detection initialization when running as a guest on KVM Documentation/virt/kvm/cpuid.rst | 29 +++ arch/x86/include/asm/cpu.h | 35 ++ arch/x86/include/asm/cpufeatures.h | 1 + arch/x86/include/asm/thread_info.h | 6 +-- arch/x86/include/uapi/asm/kvm_para.h | 8 ++-- arch/x86/kernel/cpu/intel.c | 59 --- arch/x86/kernel/kvm.c | 3 ++ arch/x86/kernel/process.c | 2 +- arch/x86/kvm/cpuid.c | 6 +++ arch/x86/kvm/vmx/vmx.c | 72 +--- arch/x86/kvm/vmx/vmx.h | 3 ++ arch/x86/kvm/x86.c | 6 ++- arch/x86/kvm/x86.h | 7 +++ 13 files changed, 196 insertions(+), 41 deletions(-)
Re: [PATCH v2 1/7] KVM: X86: Reset vcpu->arch.cpuid_nent to 0 if SET_CPUID fails
On 6/24/2020 2:20 AM, Jim Mattson wrote: On Tue, Jun 23, 2020 at 4:58 AM Xiaoyao Li wrote: It needs to invalidate CPUID configruations if usersapce provides Nits: configurations, userspace oh, I'll fix it. illegal input. Signed-off-by: Xiaoyao Li --- arch/x86/kvm/cpuid.c | 4 1 file changed, 4 insertions(+) diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c index 8a294f9747aa..1d13bad42bf9 100644 --- a/arch/x86/kvm/cpuid.c +++ b/arch/x86/kvm/cpuid.c @@ -207,6 +207,8 @@ int kvm_vcpu_ioctl_set_cpuid(struct kvm_vcpu *vcpu, kvm_apic_set_version(vcpu); kvm_x86_ops.cpuid_update(vcpu); r = kvm_update_cpuid(vcpu); + if (r) + vcpu->arch.cpuid_nent = 0; kvfree(cpuid_entries); out: @@ -230,6 +232,8 @@ int kvm_vcpu_ioctl_set_cpuid2(struct kvm_vcpu *vcpu, kvm_apic_set_version(vcpu); kvm_x86_ops.cpuid_update(vcpu); r = kvm_update_cpuid(vcpu); + if (r) + vcpu->arch.cpuid_nent = 0; out: return r; } -- 2.18.2 What if vcpu->arch.cpuid_nent was greater than 0 before the ioctl in question? Nice catch! If considering it, then we have to restore the old CPUID configuration. So how about making it simpler to just add one line of comment in API doc: If KVM_SET_CPUID{2} fails, the old valid configuration is cleared as a side effect.
[PATCH v2 2/7] KVM: X86: Go on updating other CPUID leaves when leaf 1 is absent
As handling of bits other than leaf 1 was added over time, kvm_update_cpuid() should not return directly if leaf 1 is absent, but should go on updating other CPUID leaves. Signed-off-by: Xiaoyao Li --- arch/x86/kvm/cpuid.c | 23 +++ 1 file changed, 11 insertions(+), 12 deletions(-) diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c index 1d13bad42bf9..0164dac95ef5 100644 --- a/arch/x86/kvm/cpuid.c +++ b/arch/x86/kvm/cpuid.c @@ -60,22 +60,21 @@ int kvm_update_cpuid(struct kvm_vcpu *vcpu) struct kvm_lapic *apic = vcpu->arch.apic; best = kvm_find_cpuid_entry(vcpu, 1, 0); - if (!best) - return 0; - - /* Update OSXSAVE bit */ - if (boot_cpu_has(X86_FEATURE_XSAVE) && best->function == 0x1) - cpuid_entry_change(best, X86_FEATURE_OSXSAVE, + if (best) { + /* Update OSXSAVE bit */ + if (boot_cpu_has(X86_FEATURE_XSAVE)) + cpuid_entry_change(best, X86_FEATURE_OSXSAVE, kvm_read_cr4_bits(vcpu, X86_CR4_OSXSAVE)); - cpuid_entry_change(best, X86_FEATURE_APIC, + cpuid_entry_change(best, X86_FEATURE_APIC, vcpu->arch.apic_base & MSR_IA32_APICBASE_ENABLE); - if (apic) { - if (cpuid_entry_has(best, X86_FEATURE_TSC_DEADLINE_TIMER)) - apic->lapic_timer.timer_mode_mask = 3 << 17; - else - apic->lapic_timer.timer_mode_mask = 1 << 17; + if (apic) { + if (cpuid_entry_has(best, X86_FEATURE_TSC_DEADLINE_TIMER)) + apic->lapic_timer.timer_mode_mask = 3 << 17; + else + apic->lapic_timer.timer_mode_mask = 1 << 17; + } } best = kvm_find_cpuid_entry(vcpu, 7, 0); -- 2.18.2
[PATCH v2 4/7] KVM: X86: Split kvm_update_cpuid()
Split the part of updating vcpu model out of kvm_update_cpuid(), and put it into a new kvm_update_vcpu_model(). So it's more clear that kvm_update_cpuid() is to update guest CPUID settings, while kvm_update_vcpu_model() is to update vcpu model (settings) based on the updated CPUID settings. Signed-off-by: Xiaoyao Li --- arch/x86/kvm/cpuid.c | 38 -- arch/x86/kvm/cpuid.h | 1 + arch/x86/kvm/x86.c | 1 + 3 files changed, 26 insertions(+), 14 deletions(-) diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c index 67e5c68fdb45..001f5a94880e 100644 --- a/arch/x86/kvm/cpuid.c +++ b/arch/x86/kvm/cpuid.c @@ -76,7 +76,6 @@ static int kvm_check_cpuid(struct kvm_vcpu *vcpu) void kvm_update_cpuid(struct kvm_vcpu *vcpu) { struct kvm_cpuid_entry2 *best; - struct kvm_lapic *apic = vcpu->arch.apic; best = kvm_find_cpuid_entry(vcpu, 1, 0); if (best) { @@ -87,13 +86,6 @@ void kvm_update_cpuid(struct kvm_vcpu *vcpu) cpuid_entry_change(best, X86_FEATURE_APIC, vcpu->arch.apic_base & MSR_IA32_APICBASE_ENABLE); - - if (apic) { - if (cpuid_entry_has(best, X86_FEATURE_TSC_DEADLINE_TIMER)) - apic->lapic_timer.timer_mode_mask = 3 << 17; - else - apic->lapic_timer.timer_mode_mask = 1 << 17; - } } best = kvm_find_cpuid_entry(vcpu, 7, 0); @@ -102,13 +94,8 @@ void kvm_update_cpuid(struct kvm_vcpu *vcpu) kvm_read_cr4_bits(vcpu, X86_CR4_PKE)); best = kvm_find_cpuid_entry(vcpu, 0xD, 0); - if (!best) { - vcpu->arch.guest_supported_xcr0 = 0; - } else { - vcpu->arch.guest_supported_xcr0 = - (best->eax | ((u64)best->edx << 32)) & supported_xcr0; + if (best) best->ebx = xstate_required_size(vcpu->arch.xcr0, false); - } best = kvm_find_cpuid_entry(vcpu, 0xD, 1); if (best && (cpuid_entry_has(best, X86_FEATURE_XSAVES) || @@ -127,6 +114,27 @@ void kvm_update_cpuid(struct kvm_vcpu *vcpu) vcpu->arch.ia32_misc_enable_msr & MSR_IA32_MISC_ENABLE_MWAIT); } +} + +void kvm_update_vcpu_model(struct kvm_vcpu *vcpu) +{ + struct kvm_lapic *apic = vcpu->arch.apic; + struct kvm_cpuid_entry2 *best; + + best = 
kvm_find_cpuid_entry(vcpu, 1, 0); + if (best && apic) { + if (cpuid_entry_has(best, X86_FEATURE_TSC_DEADLINE_TIMER)) + apic->lapic_timer.timer_mode_mask = 3 << 17; + else + apic->lapic_timer.timer_mode_mask = 1 << 17; + } + + best = kvm_find_cpuid_entry(vcpu, 0xD, 0); + if (!best) + vcpu->arch.guest_supported_xcr0 = 0; + else + vcpu->arch.guest_supported_xcr0 = + (best->eax | ((u64)best->edx << 32)) & supported_xcr0; /* Note, maxphyaddr must be updated before tdp_level. */ vcpu->arch.maxphyaddr = cpuid_query_maxphyaddr(vcpu); @@ -218,6 +226,7 @@ int kvm_vcpu_ioctl_set_cpuid(struct kvm_vcpu *vcpu, kvm_apic_set_version(vcpu); kvm_x86_ops.cpuid_update(vcpu); kvm_update_cpuid(vcpu); + kvm_update_vcpu_model(vcpu); kvfree(cpuid_entries); out: @@ -247,6 +256,7 @@ int kvm_vcpu_ioctl_set_cpuid2(struct kvm_vcpu *vcpu, kvm_apic_set_version(vcpu); kvm_x86_ops.cpuid_update(vcpu); kvm_update_cpuid(vcpu); + kvm_update_vcpu_model(vcpu); out: return r; } diff --git a/arch/x86/kvm/cpuid.h b/arch/x86/kvm/cpuid.h index f136de1debad..45e3643e2fba 100644 --- a/arch/x86/kvm/cpuid.h +++ b/arch/x86/kvm/cpuid.h @@ -10,6 +10,7 @@ extern u32 kvm_cpu_caps[NCAPINTS] __read_mostly; void kvm_set_cpu_caps(void); void kvm_update_cpuid(struct kvm_vcpu *vcpu); +void kvm_update_vcpu_model(struct kvm_vcpu *vcpu); struct kvm_cpuid_entry2 *kvm_find_cpuid_entry(struct kvm_vcpu *vcpu, u32 function, u32 index); int kvm_dev_ioctl_get_cpuid(struct kvm_cpuid2 *cpuid, diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 00c88c2f34e4..4dee28bbc087 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -8149,6 +8149,7 @@ static void enter_smm(struct kvm_vcpu *vcpu) #endif kvm_update_cpuid(vcpu); + kvm_update_vcpu_model(vcpu); kvm_mmu_reset_context(vcpu); } -- 2.18.2
[PATCH v2 3/7] KVM: X86: Introduce kvm_check_cpuid()
Use kvm_check_cpuid() to validate whether userspace provides legal CPUID settings, and call it before KVM updates the CPUID. Signed-off-by: Xiaoyao Li --- Is the check of virtual address width really necessary? KVM doesn't check other bits at all. I guess the policy is that KVM allows illegal CPUID settings as long as they don't break the host kernel. Userspace takes the consequences itself if it sets bogus CPUID settings that break its guest. But why is vaddr_bits special? It seems illegal vaddr_bits won't break the host kernel. --- arch/x86/kvm/cpuid.c | 54 arch/x86/kvm/cpuid.h | 2 +- 2 files changed, 35 insertions(+), 21 deletions(-) diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c index 0164dac95ef5..67e5c68fdb45 100644 --- a/arch/x86/kvm/cpuid.c +++ b/arch/x86/kvm/cpuid.c @@ -54,7 +54,26 @@ static u32 xstate_required_size(u64 xstate_bv, bool compacted) #define F feature_bit -int kvm_update_cpuid(struct kvm_vcpu *vcpu) +static int kvm_check_cpuid(struct kvm_vcpu *vcpu) +{ + struct kvm_cpuid_entry2 *best; + + /* +* The existing code assumes virtual address is 48-bit or 57-bit in the +* canonical address checks; exit if it is ever changed. +*/ + best = kvm_find_cpuid_entry(vcpu, 0x80000008, 0); + if (best) { + int vaddr_bits = (best->eax & 0xff00) >> 8; + + if (vaddr_bits != 48 && vaddr_bits != 57 && vaddr_bits != 0) + return -EINVAL; + } + + return 0; +} + +void kvm_update_cpuid(struct kvm_vcpu *vcpu) { struct kvm_cpuid_entry2 *best; struct kvm_lapic *apic = vcpu->arch.apic; @@ -96,18 +115,6 @@ void kvm_update_cpuid(struct kvm_vcpu *vcpu) cpuid_entry_has(best, X86_FEATURE_XSAVEC))) best->ebx = xstate_required_size(vcpu->arch.xcr0, true); - /* -* The existing code assumes virtual address is 48-bit or 57-bit in the -* canonical address checks; exit if it is ever changed. 
-*/ - best = kvm_find_cpuid_entry(vcpu, 0x8008, 0); - if (best) { - int vaddr_bits = (best->eax & 0xff00) >> 8; - - if (vaddr_bits != 48 && vaddr_bits != 57 && vaddr_bits != 0) - return -EINVAL; - } - best = kvm_find_cpuid_entry(vcpu, KVM_CPUID_FEATURES, 0); if (kvm_hlt_in_guest(vcpu->kvm) && best && (best->eax & (1 << KVM_FEATURE_PV_UNHALT))) @@ -127,7 +134,6 @@ int kvm_update_cpuid(struct kvm_vcpu *vcpu) kvm_mmu_reset_context(vcpu); kvm_pmu_refresh(vcpu); - return 0; } static int is_efer_nx(void) @@ -202,12 +208,16 @@ int kvm_vcpu_ioctl_set_cpuid(struct kvm_vcpu *vcpu, vcpu->arch.cpuid_entries[i].padding[2] = 0; } vcpu->arch.cpuid_nent = cpuid->nent; + r = kvm_check_cpuid(vcpu); + if (r) { + vcpu->arch.cpuid_nent = 0; + goto out; + } + cpuid_fix_nx_cap(vcpu); kvm_apic_set_version(vcpu); kvm_x86_ops.cpuid_update(vcpu); - r = kvm_update_cpuid(vcpu); - if (r) - vcpu->arch.cpuid_nent = 0; + kvm_update_cpuid(vcpu); kvfree(cpuid_entries); out: @@ -228,11 +238,15 @@ int kvm_vcpu_ioctl_set_cpuid2(struct kvm_vcpu *vcpu, cpuid->nent * sizeof(struct kvm_cpuid_entry2))) goto out; vcpu->arch.cpuid_nent = cpuid->nent; + r = kvm_check_cpuid(vcpu); + if (r) { + vcpu->arch.cpuid_nent = 0; + goto out; + } + kvm_apic_set_version(vcpu); kvm_x86_ops.cpuid_update(vcpu); - r = kvm_update_cpuid(vcpu); - if (r) - vcpu->arch.cpuid_nent = 0; + kvm_update_cpuid(vcpu); out: return r; } diff --git a/arch/x86/kvm/cpuid.h b/arch/x86/kvm/cpuid.h index 05434cd9342f..f136de1debad 100644 --- a/arch/x86/kvm/cpuid.h +++ b/arch/x86/kvm/cpuid.h @@ -9,7 +9,7 @@ extern u32 kvm_cpu_caps[NCAPINTS] __read_mostly; void kvm_set_cpu_caps(void); -int kvm_update_cpuid(struct kvm_vcpu *vcpu); +void kvm_update_cpuid(struct kvm_vcpu *vcpu); struct kvm_cpuid_entry2 *kvm_find_cpuid_entry(struct kvm_vcpu *vcpu, u32 function, u32 index); int kvm_dev_ioctl_get_cpuid(struct kvm_cpuid2 *cpuid, -- 2.18.2
[PATCH v2 5/7] KVM: X86: Rename cpuid_update() to update_vcpu_model()
The name of the callback cpuid_update() is misleading: it is not about updating the CPUID settings of the vcpu, but about updating the configurations of the vcpu based on the CPUID settings. So rename it to update_vcpu_model(). Signed-off-by: Xiaoyao Li --- arch/x86/include/asm/kvm_host.h | 2 +- arch/x86/kvm/cpuid.c | 4 ++-- arch/x86/kvm/svm/svm.c | 4 ++-- arch/x86/kvm/vmx/nested.c | 2 +- arch/x86/kvm/vmx/vmx.c | 4 ++-- 5 files changed, 8 insertions(+), 8 deletions(-) diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index f852ee350beb..e8ae89eef199 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -1083,7 +1083,7 @@ struct kvm_x86_ops { void (*hardware_unsetup)(void); bool (*cpu_has_accelerated_tpr)(void); bool (*has_emulated_msr)(u32 index); - void (*cpuid_update)(struct kvm_vcpu *vcpu); + void (*update_vcpu_model)(struct kvm_vcpu *vcpu); unsigned int vm_size; int (*vm_init)(struct kvm *kvm); diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c index 001f5a94880e..d2f93823f9fd 100644 --- a/arch/x86/kvm/cpuid.c +++ b/arch/x86/kvm/cpuid.c @@ -224,7 +224,7 @@ int kvm_vcpu_ioctl_set_cpuid(struct kvm_vcpu *vcpu, cpuid_fix_nx_cap(vcpu); kvm_apic_set_version(vcpu); - kvm_x86_ops.cpuid_update(vcpu); + kvm_x86_ops.update_vcpu_model(vcpu); kvm_update_cpuid(vcpu); kvm_update_vcpu_model(vcpu); @@ -254,7 +254,7 @@ int kvm_vcpu_ioctl_set_cpuid2(struct kvm_vcpu *vcpu, } kvm_apic_set_version(vcpu); - kvm_x86_ops.cpuid_update(vcpu); + kvm_x86_ops.update_vcpu_model(vcpu); kvm_update_cpuid(vcpu); kvm_update_vcpu_model(vcpu); out: diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c index c0da4dd78ac5..480d0354910a 100644 --- a/arch/x86/kvm/svm/svm.c +++ b/arch/x86/kvm/svm/svm.c @@ -3551,7 +3551,7 @@ static u64 svm_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio) return 0; } -static void svm_cpuid_update(struct kvm_vcpu *vcpu) +static void svm_update_vcpu_model(struct kvm_vcpu *vcpu) { struct vcpu_svm *svm = to_svm(vcpu); @@ 
-4051,7 +4051,7 @@ static struct kvm_x86_ops svm_x86_ops __initdata = { .get_exit_info = svm_get_exit_info, - .cpuid_update = svm_cpuid_update, + .update_vcpu_model = svm_update_vcpu_model, .has_wbinvd_exit = svm_has_wbinvd_exit, diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c index d1af20b050a8..86ba7cd49c97 100644 --- a/arch/x86/kvm/vmx/nested.c +++ b/arch/x86/kvm/vmx/nested.c @@ -6322,7 +6322,7 @@ void nested_vmx_setup_ctls_msrs(struct nested_vmx_msrs *msrs, u32 ept_caps) /* * secondary cpu-based controls. Do not include those that -* depend on CPUID bits, they are added later by vmx_cpuid_update. +* depend on CPUID bits, they are added later by vmx_update_vcpu_model. */ if (msrs->procbased_ctls_high & CPU_BASED_ACTIVATE_SECONDARY_CONTROLS) rdmsr(MSR_IA32_VMX_PROCBASED_CTLS2, diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c index b1a23ad986ff..147c696d6445 100644 --- a/arch/x86/kvm/vmx/vmx.c +++ b/arch/x86/kvm/vmx/vmx.c @@ -7251,7 +7251,7 @@ static void update_intel_pt_cfg(struct kvm_vcpu *vcpu) vmx->pt_desc.ctl_bitmask &= ~(0xfULL << (32 + i * 4)); } -static void vmx_cpuid_update(struct kvm_vcpu *vcpu) +static void vmx_update_vcpu_model(struct kvm_vcpu *vcpu) { struct vcpu_vmx *vmx = to_vmx(vcpu); @@ -7945,7 +7945,7 @@ static struct kvm_x86_ops vmx_x86_ops __initdata = { .get_exit_info = vmx_get_exit_info, - .cpuid_update = vmx_cpuid_update, + .update_vcpu_model = vmx_update_vcpu_model, .has_wbinvd_exit = cpu_has_vmx_wbinvd_exit, -- 2.18.2
[PATCH v2 1/7] KVM: X86: Reset vcpu->arch.cpuid_nent to 0 if SET_CPUID fails
Invalidate the CPUID configuration if userspace provides illegal input. Signed-off-by: Xiaoyao Li --- arch/x86/kvm/cpuid.c | 4 ++++ 1 file changed, 4 insertions(+) diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c index 8a294f9747aa..1d13bad42bf9 100644 --- a/arch/x86/kvm/cpuid.c +++ b/arch/x86/kvm/cpuid.c @@ -207,6 +207,8 @@ int kvm_vcpu_ioctl_set_cpuid(struct kvm_vcpu *vcpu, kvm_apic_set_version(vcpu); kvm_x86_ops.cpuid_update(vcpu); r = kvm_update_cpuid(vcpu); + if (r) + vcpu->arch.cpuid_nent = 0; kvfree(cpuid_entries); out: @@ -230,6 +232,8 @@ int kvm_vcpu_ioctl_set_cpuid2(struct kvm_vcpu *vcpu, kvm_apic_set_version(vcpu); kvm_x86_ops.cpuid_update(vcpu); r = kvm_update_cpuid(vcpu); + if (r) + vcpu->arch.cpuid_nent = 0; out: return r; } -- 2.18.2
[PATCH v2 0/7] Refactor handling flow of SET_CPUID*
This series is the extended version of https://lkml.kernel.org/r/20200528151927.14346-1-xiaoyao...@intel.com The first two patches are bug fixes, and the others aim to refactor the flow of SET_CPUID* as: 1. cpuid check: check if userspace provides legal CPUID settings; 2. cpuid update: Update some special CPUID bits based on current vcpu state, e.g., OSXSAVE, OSPKE, ... 3. update vcpu model: Update vcpu model (settings) based on the final CPUID settings. v2: - rebase to kvm/queue: a037ff353ba6 ("Merge branch 'kvm-master' into HEAD") - change the name of kvm_update_state_based_on_cpuid() to kvm_update_vcpu_model() [Sean] - Add patch 5 to rename kvm_x86_ops.cpuid_update() to kvm_x86_ops.update_vcpu_model() v1: https://lkml.kernel.org/r/20200529085545.29242-1-xiaoyao...@intel.com Xiaoyao Li (7): KVM: X86: Reset vcpu->arch.cpuid_nent to 0 if SET_CPUID fails KVM: X86: Go on updating other CPUID leaves when leaf 1 is absent KVM: X86: Introduce kvm_check_cpuid() KVM: X86: Split kvm_update_cpuid() KVM: X86: Rename cpuid_update() to update_vcpu_model() KVM: X86: Move kvm_x86_ops.update_vcpu_model() into kvm_update_vcpu_model() KVM: X86: Move kvm_apic_set_version() to kvm_update_vcpu_model() arch/x86/include/asm/kvm_host.h | 2 +- arch/x86/kvm/cpuid.c| 108 arch/x86/kvm/cpuid.h| 3 +- arch/x86/kvm/svm/svm.c | 4 +- arch/x86/kvm/vmx/nested.c | 2 +- arch/x86/kvm/vmx/vmx.c | 4 +- arch/x86/kvm/x86.c | 1 + 7 files changed, 77 insertions(+), 47 deletions(-) -- 2.18.2
[PATCH v2 7/7] KVM: X86: Move kvm_apic_set_version() to kvm_update_vcpu_model()
Obviously, kvm_apic_set_version() fits well in kvm_update_vcpu_model(). Signed-off-by: Xiaoyao Li --- arch/x86/kvm/cpuid.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c index 5decc2dd5448..3428f4d84b42 100644 --- a/arch/x86/kvm/cpuid.c +++ b/arch/x86/kvm/cpuid.c @@ -129,6 +129,8 @@ void kvm_update_vcpu_model(struct kvm_vcpu *vcpu) apic->lapic_timer.timer_mode_mask = 3 << 17; else apic->lapic_timer.timer_mode_mask = 1 << 17; + + kvm_apic_set_version(vcpu); } best = kvm_find_cpuid_entry(vcpu, 0xD, 0); @@ -226,7 +228,6 @@ int kvm_vcpu_ioctl_set_cpuid(struct kvm_vcpu *vcpu, } cpuid_fix_nx_cap(vcpu); - kvm_apic_set_version(vcpu); kvm_update_cpuid(vcpu); kvm_update_vcpu_model(vcpu); @@ -255,7 +256,6 @@ int kvm_vcpu_ioctl_set_cpuid2(struct kvm_vcpu *vcpu, goto out; } - kvm_apic_set_version(vcpu); kvm_update_cpuid(vcpu); kvm_update_vcpu_model(vcpu); out: -- 2.18.2
[PATCH v2 6/7] KVM: X86: Move kvm_x86_ops.update_vcpu_model() into kvm_update_vcpu_model()
kvm_x86_ops.update_vcpu_model() is used to update vmx/svm vcpu settings based on updated CPUID settings. So it's supposed to be called after CPUIDs are fully updated, i.e., after kvm_update_cpuid(). Move it into kvm_update_vcpu_model(). Signed-off-by: Xiaoyao Li --- arch/x86/kvm/cpuid.c | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c index d2f93823f9fd..5decc2dd5448 100644 --- a/arch/x86/kvm/cpuid.c +++ b/arch/x86/kvm/cpuid.c @@ -121,6 +121,8 @@ void kvm_update_vcpu_model(struct kvm_vcpu *vcpu) struct kvm_lapic *apic = vcpu->arch.apic; struct kvm_cpuid_entry2 *best; + kvm_x86_ops.update_vcpu_model(vcpu); + best = kvm_find_cpuid_entry(vcpu, 1, 0); if (best && apic) { if (cpuid_entry_has(best, X86_FEATURE_TSC_DEADLINE_TIMER)) @@ -136,6 +138,7 @@ void kvm_update_vcpu_model(struct kvm_vcpu *vcpu) vcpu->arch.guest_supported_xcr0 = (best->eax | ((u64)best->edx << 32)) & supported_xcr0; + /* Note, maxphyaddr must be updated before tdp_level. */ vcpu->arch.maxphyaddr = cpuid_query_maxphyaddr(vcpu); vcpu->arch.tdp_level = kvm_x86_ops.get_tdp_level(vcpu); @@ -224,7 +227,6 @@ int kvm_vcpu_ioctl_set_cpuid(struct kvm_vcpu *vcpu, cpuid_fix_nx_cap(vcpu); kvm_apic_set_version(vcpu); - kvm_x86_ops.update_vcpu_model(vcpu); kvm_update_cpuid(vcpu); kvm_update_vcpu_model(vcpu); @@ -254,7 +256,6 @@ int kvm_vcpu_ioctl_set_cpuid2(struct kvm_vcpu *vcpu, } kvm_apic_set_version(vcpu); - kvm_x86_ops.update_vcpu_model(vcpu); kvm_update_cpuid(vcpu); kvm_update_vcpu_model(vcpu); out: -- 2.18.2
Re: [PATCH] KVM: VMX: Stop context switching MSR_IA32_UMWAIT_CONTROL
On 6/23/2020 8:51 AM, Sean Christopherson wrote: Remove support for context switching between the guest's and host's desired UMWAIT_CONTROL. Propagating the guest's value to hardware isn't required for correct functionality, e.g. KVM intercepts reads and writes to the MSR, and the latency effects of the settings controlled by the MSR are not architecturally visible. As a general rule, KVM should not allow the guest to control power management settings unless explicitly enabled by userspace, e.g. see KVM_CAP_X86_DISABLE_EXITS. E.g. Intel's SDM explicitly states that C0.2 can improve the performance of SMT siblings. A devious guest could disable C0.2 so as to improve the performance of their workloads at the detriment to workloads running in the host or on other VMs. Wholesale removal of UMWAIT_CONTROL context switching also fixes a race condition where updates from the host may cause KVM to enter the guest with the incorrect value. Because updates are propagated to all CPUs via IPI (SMP function callback), the value in hardware may be stale with respect to the cached value and KVM could enter the guest with the wrong value in hardware. As above, the guest can't observe the bad value, but it's a weird and confusing wart in the implementation. Removal also fixes the unnecessary usage of VMX's atomic load/store MSR lists. Using the lists is only necessary for MSRs that are required for correct functionality immediately upon VM-Enter/VM-Exit, e.g. EFER on old hardware, or for MSRs that need to-the-uop precision, e.g. perf related MSRs. For UMWAIT_CONTROL, the effects are only visible in the kernel via TPAUSE/delay(), and KVM doesn't do any form of delay in vcpu_vmx_run(). Using the atomic lists is undesirable as they are more expensive than direct RDMSR/WRMSR. Do you mean the extra handling of the atomic list facility in KVM? Or do you just mean that VM-exit/VM-entry MSR-load/save in VMX hardware is more expensive than direct RDMSR/WRMSR instructions?
[PATCH] KVM: X86: Fix MSR range of APIC registers in X2APIC mode
Only the MSR address range 0x800 through 0x8ff is architecturally reserved and dedicated to accessing APIC registers in x2APIC mode. Fixes: 0105d1a52640 ("KVM: x2apic interface to lapic") Signed-off-by: Xiaoyao Li --- arch/x86/kvm/x86.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 00c88c2f34e4..29d9b078ce69 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -2856,7 +2856,7 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info) return kvm_mtrr_set_msr(vcpu, msr, data); case MSR_IA32_APICBASE: return kvm_set_apic_base(vcpu, msr_info); - case APIC_BASE_MSR ... APIC_BASE_MSR + 0x3ff: + case APIC_BASE_MSR ... APIC_BASE_MSR + 0xff: return kvm_x2apic_msr_write(vcpu, msr, data); case MSR_IA32_TSCDEADLINE: kvm_set_lapic_tscdeadline_msr(vcpu, data); @@ -3196,7 +3196,7 @@ int kvm_get_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info) case MSR_IA32_APICBASE: msr_info->data = kvm_get_apic_base(vcpu); break; - case APIC_BASE_MSR ... APIC_BASE_MSR + 0x3ff: + case APIC_BASE_MSR ... APIC_BASE_MSR + 0xff: return kvm_x2apic_msr_read(vcpu, msr_info->index, &msr_info->data); case MSR_IA32_TSCDEADLINE: msr_info->data = kvm_get_lapic_tscdeadline_msr(vcpu); -- 2.18.2
Re: [PATCH v12 07/11] KVM: vmx/pmu: Unmask LBR fields in the MSR_IA32_DEBUGCTLMSR emualtion
On 6/13/2020 4:09 PM, Like Xu wrote: When the LBR feature is reported by the vmx_get_perf_capabilities(), the LBR fields in the [vmx|vcpu]_supported debugctl should be unmasked. The debugctl msr is handled separately in vmx/svm and they're not completely identical, hence remove the common msr handling code. Signed-off-by: Like Xu --- arch/x86/kvm/vmx/capabilities.h | 12 arch/x86/kvm/vmx/pmu_intel.c| 19 +++ arch/x86/kvm/x86.c | 13 - 3 files changed, 31 insertions(+), 13 deletions(-) diff --git a/arch/x86/kvm/vmx/capabilities.h b/arch/x86/kvm/vmx/capabilities.h index b633a90320ee..f6fcfabb1026 100644 --- a/arch/x86/kvm/vmx/capabilities.h +++ b/arch/x86/kvm/vmx/capabilities.h @@ -21,6 +21,8 @@ extern int __read_mostly pt_mode; #define PMU_CAP_FW_WRITES (1ULL << 13) #define PMU_CAP_LBR_FMT 0x3f +#define DEBUGCTLMSR_LBR_MASK (DEBUGCTLMSR_LBR | DEBUGCTLMSR_FREEZE_LBRS_ON_PMI) + struct nested_vmx_msrs { /* * We only store the "true" versions of the VMX capability MSRs. We @@ -387,4 +389,14 @@ static inline u64 vmx_get_perf_capabilities(void) return perf_cap; } +static inline u64 vmx_get_supported_debugctl(void) +{ + u64 val = 0; + + if (vmx_get_perf_capabilities() & PMU_CAP_LBR_FMT) + val |= DEBUGCTLMSR_LBR_MASK; + + return val; +} + #endif /* __KVM_X86_VMX_CAPS_H */ diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c index a953c7d633f6..d92e95b64c74 100644 --- a/arch/x86/kvm/vmx/pmu_intel.c +++ b/arch/x86/kvm/vmx/pmu_intel.c @@ -187,6 +187,7 @@ static bool intel_is_valid_msr(struct kvm_vcpu *vcpu, u32 msr) case MSR_CORE_PERF_GLOBAL_OVF_CTRL: ret = pmu->version > 1; break; + case MSR_IA32_DEBUGCTLMSR: case MSR_IA32_PERF_CAPABILITIES: ret = 1; break; @@ -237,6 +238,9 @@ static int intel_pmu_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info) return 1; msr_info->data = vcpu->arch.perf_capabilities; return 0; + case MSR_IA32_DEBUGCTLMSR: + msr_info->data = vmcs_read64(GUEST_IA32_DEBUGCTL); Can we put the emulation of MSR_IA32_DEBUGCTLMSR in 
vmx_{get,set}_msr(). AFAIK, MSR_IA32_DEBUGCTLMSR is not a pure PMU-related MSR; e.g., bit 2 enables #DB for bus lock. + return 0; default: if ((pmc = get_gp_pmc(pmu, msr, MSR_IA32_PERFCTR0)) || (pmc = get_gp_pmc(pmu, msr, MSR_IA32_PMC0))) { @@ -282,6 +286,16 @@ static inline bool lbr_is_compatible(struct kvm_vcpu *vcpu) return true; } +static inline u64 vcpu_get_supported_debugctl(struct kvm_vcpu *vcpu) +{ + u64 debugctlmsr = vmx_get_supported_debugctl(); + + if (!lbr_is_enabled(vcpu)) + debugctlmsr &= ~DEBUGCTLMSR_LBR_MASK; + + return debugctlmsr; +} + static int intel_pmu_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info) { struct kvm_pmu *pmu = vcpu_to_pmu(vcpu); @@ -336,6 +350,11 @@ static int intel_pmu_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info) } vcpu->arch.perf_capabilities = data; return 0; + case MSR_IA32_DEBUGCTLMSR: + if (data & ~vcpu_get_supported_debugctl(vcpu)) + return 1; + vmcs_write64(GUEST_IA32_DEBUGCTL, data); + return 0; default: if ((pmc = get_gp_pmc(pmu, msr, MSR_IA32_PERFCTR0)) || (pmc = get_gp_pmc(pmu, msr, MSR_IA32_PMC0))) { diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 00c88c2f34e4..56f275eb4554 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -2840,18 +2840,6 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info) return 1; } break; - case MSR_IA32_DEBUGCTLMSR: - if (!data) { - /* We support the non-activated case already */ - break; - } else if (data & ~(DEBUGCTLMSR_LBR | DEBUGCTLMSR_BTF)) { So after this patch, a guest trying to set the DEBUGCTLMSR_BTF bit will get a #GP instead of being ignored with a log printed in the kernel. This code was introduced ~12 years ago in commit b5e2fec0ebc3 ("KVM: Ignore DEBUGCTL MSRs with no effect"), just to make Netware happy. Maybe I'm overthinking something that old. 
- /* Values other than LBR and BTF are vendor-specific, - thus reserved and should throw a #GP */ - return 1; - } - vcpu_unimpl(vcpu, "%s: MSR_IA32_DEBUGCTLMSR 0x%llx, nop\n", - __func__, data); - break; case 0x200 ... 0x2ff: return kvm_mtrr_set_msr(
Re: [PATCH] x86/split_lock: Sanitize userspace and guest error output
On 6/6/2020 12:42 AM, Prarit Bhargava wrote: On 6/5/20 11:29 AM, Xiaoyao Li wrote: On 6/5/2020 7:44 PM, Prarit Bhargava wrote: There are two problems with kernel messages in fatal mode that were found during testing of guests and userspace programs. The first is that no kernel message is output when the split lock detector is triggered with a userspace program. As a result the userspace process dies from receiving SIGBUS with no indication to the user of what caused the process to die. The second problem is that only the first triggering guest causes a kernel message to be output because the message is output with pr_warn_once(). This also results in a loss of information to the user. While fixing these I noticed that the same message was being output three times so I'm cleaning that up too. Fix fatal mode output, and use consistent messages for fatal and warn modes for both userspace and guests. Signed-off-by: Prarit Bhargava Cc: Thomas Gleixner Cc: Ingo Molnar Cc: Borislav Petkov Cc: x...@kernel.org Cc: "H. Peter Anvin" Cc: Tony Luck Cc: "Peter Zijlstra (Intel)" Cc: Sean Christopherson Cc: Rahul Tanwar Cc: Xiaoyao Li Cc: Ricardo Neri Cc: Dave Hansen --- arch/x86/kernel/cpu/intel.c | 24 ++-- 1 file changed, 10 insertions(+), 14 deletions(-) diff --git a/arch/x86/kernel/cpu/intel.c b/arch/x86/kernel/cpu/intel.c index 166d7c355896..463022aa9b7a 100644 --- a/arch/x86/kernel/cpu/intel.c +++ b/arch/x86/kernel/cpu/intel.c @@ -1074,10 +1074,14 @@ static void split_lock_init(void) split_lock_verify_msr(sld_state != sld_off); } -static void split_lock_warn(unsigned long ip) +static bool split_lock_warn(unsigned long ip, int fatal) { - pr_warn_ratelimited("#AC: %s/%d took a split_lock trap at address: 0x%lx\n", - current->comm, current->pid, ip); + pr_warn_ratelimited("#AC: %s/%d %ssplit_lock trap at address: 0x%lx\n", + current->comm, current->pid, + sld_state == sld_fatal ? 
"fatal " : "", ip); + + if (sld_state == sld_fatal || fatal) + return false; /* * Disable the split lock detection for this task so it can make @@ -1086,18 +1090,13 @@ static void split_lock_warn(unsigned long ip) */ sld_update_msr(false); set_tsk_thread_flag(current, TIF_SLD); + return true; } bool handle_guest_split_lock(unsigned long ip) { - if (sld_state == sld_warn) { - split_lock_warn(ip); + if (split_lock_warn(ip, 0)) return true; - } - - pr_warn_once("#AC: %s/%d %s split_lock trap at address: 0x%lx\n", - current->comm, current->pid, - sld_state == sld_fatal ? "fatal" : "bogus", ip); current->thread.error_code = 0; current->thread.trap_nr = X86_TRAP_AC; @@ -1108,10 +1107,7 @@ EXPORT_SYMBOL_GPL(handle_guest_split_lock); bool handle_user_split_lock(struct pt_regs *regs, long error_code) { - if ((regs->flags & X86_EFLAGS_AC) || sld_state == sld_fatal) - return false; - split_lock_warn(regs->ip); - return true; + return split_lock_warn(regs->ip, regs->flags & X86_EFLAGS_AC); It's incorrect. You change the behavior so that it prints the split-lock warning even when CPL 3 alignment check is turned on. Do you want the message to be displayed in the fatal case of CPL 3 alignment check? No. It should never be displayed if the #AC happens in CPL 3 and X86_EFLAGS_AC is set. In this case, an unaligned access triggers #AC regardless of a #LOCK prefix. What's more, even if there is a #LOCK prefix, we still cannot tell the cause because we don't know the priority of the legacy alignment-check #AC versus the split-lock #AC. If you do want a message, we can only say "unaligned access at address xxx".
Re: [PATCH] x86/split_lock: Don't write MSR_TEST_CTRL on CPUs that aren't whitelisted
On 6/6/2020 3:26 AM, Sean Christopherson wrote: Choo! Choo! All aboard the Split Lock Express, with direct service to Wreckage! Skip split_lock_verify_msr() if the CPU isn't whitelisted as a possible SLD-enabled CPU model to avoid writing MSR_TEST_CTRL. MSR_TEST_CTRL exists, and is writable, on many generations of CPUs. Writing the MSR, even with '0', can result in bizarre, undocumented behavior. This fixes a crash on Haswell when resuming from suspend with a live KVM guest. Because APs use the standard SMP boot flow for resume, they will go through split_lock_init() and the subsequent RDMSR/WRMSR sequence, which runs even when sld_state==sld_off to ensure SLD is disabled. On Haswell (at least, my Haswell), writing MSR_TEST_CTRL with '0' will succeed and _may_ take the SMT _sibling_ out of VMX root mode. When KVM has an active guest, KVM performs VMXON as part of CPU onlining (see kvm_starting_cpu()). Because SMP boot is serialized, the resulting flow is effectively: on_each_ap_cpu() { WRMSR(MSR_TEST_CTRL, 0) VMXON } As a result, the WRMSR can disable VMX on a different CPU that has already done VMXON. This ultimately results in a #UD on VMPTRLD when KVM regains control and attempt run its vCPUs. The above voodoo was confirmed by reworking KVM's VMXON flow to write MSR_TEST_CTRL prior to VMXON, and to serialize the sequence as above. Further verification of the insanity was done by redoing VMXON on all APs after the initial WRMSR->VMXON sequence. The additional VMXON, which should VM-Fail, occasionally succeeded, and also eliminated the unexpected #UD on VMPTRLD. The damage done by writing MSR_TEST_CTRL doesn't appear to be limited to VMX, e.g. after suspend with an active KVM guest, subsequent reboots almost always hang (even when fudging VMXON), a #UD on a random Jcc was observed, suspend/resume stability is qualitatively poor, and so on and so forth. 
I'm wondering whether all those side effects of MSR_TEST_CTRL also exist on CPUs that have the SLD feature; have you ever tested on an SLD-capable CPU?
Re: [PATCH] x86/split_lock: Sanitize userspace and guest error output
On 6/5/2020 7:44 PM, Prarit Bhargava wrote: There are two problems with kernel messages in fatal mode that were found during testing of guests and userspace programs. The first is that no kernel message is output when the split lock detector is triggered with a userspace program. As a result the userspace process dies from receiving SIGBUS with no indication to the user of what caused the process to die. The second problem is that only the first triggering guest causes a kernel message to be output because the message is output with pr_warn_once(). This also results in a loss of information to the user. While fixing these I noticed that the same message was being output three times so I'm cleaning that up too. Fix fatal mode output, and use consistent messages for fatal and warn modes for both userspace and guests. Signed-off-by: Prarit Bhargava Cc: Thomas Gleixner Cc: Ingo Molnar Cc: Borislav Petkov Cc: x...@kernel.org Cc: "H. Peter Anvin" Cc: Tony Luck Cc: "Peter Zijlstra (Intel)" Cc: Sean Christopherson Cc: Rahul Tanwar Cc: Xiaoyao Li Cc: Ricardo Neri Cc: Dave Hansen --- arch/x86/kernel/cpu/intel.c | 24 ++-- 1 file changed, 10 insertions(+), 14 deletions(-) diff --git a/arch/x86/kernel/cpu/intel.c b/arch/x86/kernel/cpu/intel.c index 166d7c355896..463022aa9b7a 100644 --- a/arch/x86/kernel/cpu/intel.c +++ b/arch/x86/kernel/cpu/intel.c @@ -1074,10 +1074,14 @@ static void split_lock_init(void) split_lock_verify_msr(sld_state != sld_off); } -static void split_lock_warn(unsigned long ip) +static bool split_lock_warn(unsigned long ip, int fatal) { - pr_warn_ratelimited("#AC: %s/%d took a split_lock trap at address: 0x%lx\n", - current->comm, current->pid, ip); + pr_warn_ratelimited("#AC: %s/%d %ssplit_lock trap at address: 0x%lx\n", + current->comm, current->pid, + sld_state == sld_fatal ? 
"fatal " : "", ip); + + if (sld_state == sld_fatal || fatal) + return false; /* * Disable the split lock detection for this task so it can make @@ -1086,18 +1090,13 @@ static void split_lock_warn(unsigned long ip) */ sld_update_msr(false); set_tsk_thread_flag(current, TIF_SLD); + return true; } bool handle_guest_split_lock(unsigned long ip) { - if (sld_state == sld_warn) { - split_lock_warn(ip); + if (split_lock_warn(ip, 0)) return true; - } - - pr_warn_once("#AC: %s/%d %s split_lock trap at address: 0x%lx\n", -current->comm, current->pid, -sld_state == sld_fatal ? "fatal" : "bogus", ip); current->thread.error_code = 0; current->thread.trap_nr = X86_TRAP_AC; @@ -1108,10 +1107,7 @@ EXPORT_SYMBOL_GPL(handle_guest_split_lock); bool handle_user_split_lock(struct pt_regs *regs, long error_code) { - if ((regs->flags & X86_EFLAGS_AC) || sld_state == sld_fatal) - return false; - split_lock_warn(regs->ip); - return true; + return split_lock_warn(regs->ip, regs->flags & X86_EFLAGS_AC); It's incorrect. You change the behavior so that it prints the split-lock warning even when CPL 3 alignment check is turned on. } /*
Re: [PATCH][v6] KVM: X86: support APERF/MPERF registers
On 6/5/2020 9:44 AM, Li RongQing wrote: Guest kernel reports a fixed cpu frequency in /proc/cpuinfo, this is confused to user when turbo is enable, and aperf/mperf can be used to show current cpu frequency after 7d5905dc14a "(x86 / CPU: Always show current CPU frequency in /proc/cpuinfo)" so guest should support aperf/mperf capability This patch implements aperf/mperf by three mode: none, software emulation, and pass-through None: default mode, guest does not support aperf/mperf Software emulation: the period of aperf/mperf in guest mode are accumulated as emulated value Pass-though: it is only suitable for KVM_HINTS_REALTIME, Because that hint guarantees we have a 1:1 vCPU:CPU binding and guaranteed no over-commit. And a per-VM capability is added to configure aperfmperf mode Signed-off-by: Li RongQing Signed-off-by: Chai Wen Signed-off-by: Jia Lina --- diff v5: return error if guest is configured with mperf/aperf, but host cpu has not diff v4: fix maybe-uninitialized warning diff v3: fix interception of MSR_IA32_MPERF/APERF in svm diff v2: support aperfmperf pass though move common codes to kvm_get_msr_common diff v1: 1. support AMD, but not test 2. support per-vm capability to enable Documentation/virt/kvm/api.rst | 10 ++ arch/x86/include/asm/kvm_host.h | 11 +++ arch/x86/kvm/cpuid.c| 15 ++- arch/x86/kvm/svm/svm.c | 8 arch/x86/kvm/vmx/vmx.c | 6 ++ arch/x86/kvm/x86.c | 42 + arch/x86/kvm/x86.h | 15 +++ include/uapi/linux/kvm.h| 1 + 8 files changed, 107 insertions(+), 1 deletion(-) diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst index d871dacb984e..f854f4da6fd8 100644 --- a/Documentation/virt/kvm/api.rst +++ b/Documentation/virt/kvm/api.rst @@ -6126,3 +6126,13 @@ KVM can therefore start protected VMs. This capability governs the KVM_S390_PV_COMMAND ioctl and the KVM_MP_STATE_LOAD MP_STATE. KVM_SET_MP_STATE can fail for protected guests when the state change is invalid. 
+ +8.23 KVM_CAP_APERFMPERF + + +:Architectures: x86 +:Parameters: args[0] is aperfmperf mode; + 0 for not support, 1 for software emulation, 2 for pass-through +:Returns: 0 on success; -1 on error + +This capability indicates that KVM supports APERF and MPERF MSR registers diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index fd78bd44b2d6..14643f8af9c4 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -824,6 +824,9 @@ struct kvm_vcpu_arch { /* AMD MSRC001_0015 Hardware Configuration */ u64 msr_hwcr; + + u64 v_mperf; + u64 v_aperf; }; struct kvm_lpage_info { @@ -889,6 +892,12 @@ enum kvm_irqchip_mode { KVM_IRQCHIP_SPLIT,/* created with KVM_CAP_SPLIT_IRQCHIP */ }; +enum kvm_aperfmperf_mode { + KVM_APERFMPERF_NONE, + KVM_APERFMPERF_SOFT, /* software emulate aperfmperf */ + KVM_APERFMPERF_PT,/* pass-through aperfmperf to guest */ +}; + #define APICV_INHIBIT_REASON_DISABLE0 #define APICV_INHIBIT_REASON_HYPERV 1 #define APICV_INHIBIT_REASON_NESTED 2 @@ -986,6 +995,8 @@ struct kvm_arch { struct kvm_pmu_event_filter *pmu_event_filter; struct task_struct *nx_lpage_recovery_thread; + + enum kvm_aperfmperf_mode aperfmperf_mode; }; struct kvm_vm_stat { diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c index cd708b0b460a..80f18b29a845 100644 --- a/arch/x86/kvm/cpuid.c +++ b/arch/x86/kvm/cpuid.c @@ -122,6 +122,16 @@ int kvm_update_cpuid(struct kvm_vcpu *vcpu) MSR_IA32_MISC_ENABLE_MWAIT); } + best = kvm_find_cpuid_entry(vcpu, 6, 0); + if (best) { + if (guest_has_aperfmperf(vcpu->kvm)) { + if (!boot_cpu_has(X86_FEATURE_APERFMPERF)) + return -EINVAL; kvm_vm_ioctl_enable_cap() ensures that guest_has_aperfmperf() always aligns with boot_cpu_has(X86_FEATURE_APERFMPERF). So above is unnecessary. 
+ best->ecx |= 1; + } else { + best->ecx &= ~1; + } } you could do bool guest_cpuid_aperfmperf = false; if (best) guest_cpuid_aperfmperf = !!(best->ecx & BIT(0)); if (guest_cpuid_aperfmperf != guest_has_aperfmperf(vcpu->kvm)) return -EINVAL; In fact, I think we can do nothing here. Leave it as what userspace wants, just like how KVM treats other CPUID bits. Paolo, What's your opinion? /* Note, maxphyaddr must be updated before tdp_level. */ vcpu->arch.maxphyaddr = cpuid_query_maxphyaddr(vcpu); vcpu->arch.tdp_level = kvm_x86_ops.get_tdp_level(vcpu); [...] @@ -4930,6 +4939,11 @@ int kvm_vm_ioctl_enable_
[PATCH v2] KVM: x86: Assign correct value to array.maxnent
Delay the assignment of array.maxnent so that it uses the correct value when cpuid->nent > KVM_MAX_CPUID_ENTRIES. Fixes: e53c95e8d41e ("KVM: x86: Encapsulate CPUID entries and metadata in struct") Signed-off-by: Xiaoyao Li --- v2: - remove the "const" from maxnent to fix a build error. --- arch/x86/kvm/cpuid.c | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c index 253b8e875ccd..3d88ddf781d0 100644 --- a/arch/x86/kvm/cpuid.c +++ b/arch/x86/kvm/cpuid.c @@ -426,7 +426,7 @@ EXPORT_SYMBOL_GPL(kvm_set_cpu_caps); struct kvm_cpuid_array { struct kvm_cpuid_entry2 *entries; - const int maxnent; + int maxnent; int nent; }; @@ -870,7 +870,6 @@ int kvm_dev_ioctl_get_cpuid(struct kvm_cpuid2 *cpuid, struct kvm_cpuid_array array = { .nent = 0, - .maxnent = cpuid->nent, }; int r, i; @@ -887,6 +886,8 @@ int kvm_dev_ioctl_get_cpuid(struct kvm_cpuid2 *cpuid, if (!array.entries) return -ENOMEM; + array.maxnent = cpuid->nent; + for (i = 0; i < ARRAY_SIZE(funcs); i++) { r = get_cpuid_func(&array, funcs[i], type); if (r) -- 2.18.2
Re: [PATCH] KVM: x86: Assign correct value to array.maxnent
On 6/4/2020 10:43 AM, Xiaoyao Li wrote: Delay the assignment of array.maxnent to use correct value for the case cpuid->nent > KVM_MAX_CPUID_ENTRIES. Fixes: e53c95e8d41e ("KVM: x86: Encapsulate CPUID entries and metadata in struct") Signed-off-by: Xiaoyao Li --- arch/x86/kvm/cpuid.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c index 253b8e875ccd..befff01d100c 100644 --- a/arch/x86/kvm/cpuid.c +++ b/arch/x86/kvm/cpuid.c @@ -870,7 +870,6 @@ int kvm_dev_ioctl_get_cpuid(struct kvm_cpuid2 *cpuid, struct kvm_cpuid_array array = { .nent = 0, - .maxnent = cpuid->nent, }; int r, i; @@ -887,6 +886,8 @@ int kvm_dev_ioctl_get_cpuid(struct kvm_cpuid2 *cpuid, if (!array.entries) return -ENOMEM; + array.maxnent = cpuid->nent; I missed the fact that maxnent is const; v2 is coming. + for (i = 0; i < ARRAY_SIZE(funcs); i++) { r = get_cpuid_func(&array, funcs[i], type); if (r)
[PATCH] KVM: x86: Assign correct value to array.maxnent
Delay the assignment of array.maxnent so that it uses the correct value when cpuid->nent > KVM_MAX_CPUID_ENTRIES. Fixes: e53c95e8d41e ("KVM: x86: Encapsulate CPUID entries and metadata in struct") Signed-off-by: Xiaoyao Li --- arch/x86/kvm/cpuid.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c index 253b8e875ccd..befff01d100c 100644 --- a/arch/x86/kvm/cpuid.c +++ b/arch/x86/kvm/cpuid.c @@ -870,7 +870,6 @@ int kvm_dev_ioctl_get_cpuid(struct kvm_cpuid2 *cpuid, struct kvm_cpuid_array array = { .nent = 0, - .maxnent = cpuid->nent, }; int r, i; @@ -887,6 +886,8 @@ int kvm_dev_ioctl_get_cpuid(struct kvm_cpuid2 *cpuid, if (!array.entries) return -ENOMEM; + array.maxnent = cpuid->nent; + for (i = 0; i < ARRAY_SIZE(funcs); i++) { r = get_cpuid_func(&array, funcs[i], type); if (r) -- 2.18.2
Re: [PATCH 4/6] KVM: X86: Split kvm_update_cpuid()
On 6/3/2020 9:10 AM, Sean Christopherson wrote: On Fri, May 29, 2020 at 04:55:43PM +0800, Xiaoyao Li wrote: Split the part of updating KVM states from kvm_update_cpuid(), and put it into a new kvm_update_state_based_on_cpuid(). So it's clear that kvm_update_cpuid() is to update guest CPUID settings, while kvm_update_state_based_on_cpuid() is to update KVM states based on the updated CPUID settings. What about kvm_update_vcpu_model()? "state" isn't necessarily correct either. yeah, it's better.
Re: [PATCH][v5] KVM: X86: support APERF/MPERF registers
On 5/31/2020 10:08 AM, Li,Rongqing wrote: -----Original Message----- From: Xiaoyao Li [mailto:xiaoyao...@intel.com] Sent: May 30, 2020 18:40 To: Li,Rongqing ; linux-kernel@vger.kernel.org; k...@vger.kernel.org; x...@kernel.org; h...@zytor.com; b...@alien8.de; mi...@redhat.com; t...@linutronix.de; jmatt...@google.com; wanpen...@tencent.com; vkuzn...@redhat.com; sean.j.christopher...@intel.com; pbonz...@redhat.com; wei.hua...@amd.com Subject: Re: [PATCH][v5] KVM: X86: support APERF/MPERF registers On 5/30/2020 12:35 PM, Li RongQing wrote: Guest kernel reports a fixed cpu frequency in /proc/cpuinfo, this is confused to user when turbo is enable, and aperf/mperf can be used to show current cpu frequency after 7d5905dc14a "(x86 / CPU: Always show current CPU frequency in /proc/cpuinfo)" so guest should support aperf/mperf capability This patch implements aperf/mperf by three mode: none, software emulation, and pass-through None: default mode, guest does not support aperf/mperf Software emulation: the period of aperf/mperf in guest mode are accumulated as emulated value Pass-though: it is only suitable for KVM_HINTS_REALTIME, Because that hint guarantees we have a 1:1 vCPU:CPU binding and guaranteed no over-commit. And a per-VM capability is added to configure aperfmperf mode [...] diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c index cd708b0b460a..c960dda4251b 100644 --- a/arch/x86/kvm/cpuid.c +++ b/arch/x86/kvm/cpuid.c @@ -122,6 +122,14 @@ int kvm_update_cpuid(struct kvm_vcpu *vcpu) MSR_IA32_MISC_ENABLE_MWAIT); } + best = kvm_find_cpuid_entry(vcpu, 6, 0); + if (best) { + if (guest_has_aperfmperf(vcpu->kvm) && + boot_cpu_has(X86_FEATURE_APERFMPERF)) + best->ecx |= 1; + else + best->ecx &= ~1; + } In my understanding, KVM allows userspace to set a CPUID feature bit for the guest even if hardware doesn't support the feature. So what makes X86_FEATURE_APERFMPERF different here? Is there any concern I'm missing?
-Xiaoyao Whether APERF/MPERF is software-emulated or passed through depends on the host CPU's APERF/MPERF feature. Software emulation: the APERF/MPERF deltas accumulated in guest mode are reported as the emulated values. I know you want to ensure the correctness of the APERF/MPERF exposure. But there are many features other than APERF/MPERF whose supported settings KVM reports through KVM_GET_SUPPORTED_CPUID without checking or forcing the correctness of userspace input, i.e., KVM allows userspace to set bogus CPUID settings as long as they don't break KVM (the host kernel). Indeed, bogus CPUID settings more than likely break the guest, but that's not KVM's fault; KVM just does what userspace wants. IMO, if we really want to ensure the correctness of userspace-provided CPUID settings, we need to return an error to userspace instead of fixing them silently. - Xiaoyao
Re: [PATCH][v5] KVM: X86: support APERF/MPERF registers
On 5/30/2020 12:35 PM, Li RongQing wrote: The guest kernel reports a fixed CPU frequency in /proc/cpuinfo, which is confusing to users when turbo is enabled; APERF/MPERF can be used to show the current CPU frequency after commit 7d5905dc14a ("x86 / CPU: Always show current CPU frequency in /proc/cpuinfo"), so the guest should support the APERF/MPERF capability. This patch implements APERF/MPERF in three modes: none, software emulation, and pass-through. None: default mode, the guest does not support APERF/MPERF. Software emulation: the APERF/MPERF deltas accumulated in guest mode are reported as the emulated values. Pass-through: only suitable for KVM_HINTS_REALTIME, because that hint guarantees a 1:1 vCPU:pCPU binding with no over-commit. A per-VM capability is added to configure the aperfmperf mode. [...] diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c index cd708b0b460a..c960dda4251b 100644 --- a/arch/x86/kvm/cpuid.c +++ b/arch/x86/kvm/cpuid.c @@ -122,6 +122,14 @@ int kvm_update_cpuid(struct kvm_vcpu *vcpu) MSR_IA32_MISC_ENABLE_MWAIT); } + best = kvm_find_cpuid_entry(vcpu, 6, 0); + if (best) { + if (guest_has_aperfmperf(vcpu->kvm) && + boot_cpu_has(X86_FEATURE_APERFMPERF)) + best->ecx |= 1; + else + best->ecx &= ~1; + } In my understanding, KVM allows userspace to set a CPUID feature bit for the guest even if hardware doesn't support the feature. So what makes X86_FEATURE_APERFMPERF different here? Is there any concern I missed? -Xiaoyao
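The hunk above toggles CPUID.06H:ECX[0] based on both the per-VM capability and the host feature. A minimal userspace sketch of that bit-twiddling (a toy model, not kernel code; the struct and flag parameters are stand-ins for `struct kvm_cpuid_entry2` and the capability checks):

```c
#include <stdint.h>

/* Toy stand-in for struct kvm_cpuid_entry2 (illustration only). */
struct cpuid_entry {
    uint32_t eax, ebx, ecx, edx;
};

/*
 * CPUID.06H:ECX[0] enumerates APERF/MPERF. Mirroring the hunk above:
 * the bit is set only when both the per-VM capability and the host
 * feature are present, and cleared otherwise.
 */
void update_aperfmperf_bit(struct cpuid_entry *leaf6,
                           int guest_has_aperfmperf,
                           int host_has_aperfmperf)
{
    if (guest_has_aperfmperf && host_has_aperfmperf)
        leaf6->ecx |= 1u;   /* expose APERF/MPERF to the guest */
    else
        leaf6->ecx &= ~1u;  /* hide it */
}
```

This also illustrates the point of the discussion: the hunk silently fixes up the userspace-provided value instead of rejecting it.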
[PATCH 5/6] KVM: X86: Move kvm_x86_ops.cpuid_update() into kvm_update_state_based_on_cpuid()
kvm_x86_ops.cpuid_update() is used to update vmx/svm settings based on updated CPUID settings. So it's supposed to be called after CPUIDs are fully updated, i.e., kvm_update_cpuid(), not in the middle stage. Put it in kvm_update_state_based_on_cpuid() to make it clear that it's to update vmx/svm specific states based on CPUID. Signed-off-by: Xiaoyao Li --- Should we rename kvm_x86_ops.cpuid_update to something like kvm_x86_ops.update_state_based_on_cpuid? cpuid_update is really confusing especially when kvm_x86_ops.update_cpuid() is needed someday. --- arch/x86/kvm/cpuid.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c index a4a2072f5253..5d4da8970940 100644 --- a/arch/x86/kvm/cpuid.c +++ b/arch/x86/kvm/cpuid.c @@ -136,6 +136,8 @@ void kvm_update_state_based_on_cpuid(struct kvm_vcpu *vcpu) vcpu->arch.guest_supported_xcr0 = (best->eax | ((u64)best->edx << 32)) & supported_xcr0; + kvm_x86_ops.cpuid_update(vcpu); + /* Note, maxphyaddr must be updated before tdp_level. */ vcpu->arch.maxphyaddr = cpuid_query_maxphyaddr(vcpu); vcpu->arch.tdp_level = kvm_x86_ops.get_tdp_level(vcpu); @@ -227,7 +229,6 @@ int kvm_vcpu_ioctl_set_cpuid(struct kvm_vcpu *vcpu, cpuid_fix_nx_cap(vcpu); kvm_apic_set_version(vcpu); - kvm_x86_ops.cpuid_update(vcpu); kvm_update_cpuid(vcpu); kvm_update_state_based_on_cpuid(vcpu); @@ -257,7 +258,6 @@ int kvm_vcpu_ioctl_set_cpuid2(struct kvm_vcpu *vcpu, } kvm_apic_set_version(vcpu); - kvm_x86_ops.cpuid_update(vcpu); kvm_update_cpuid(vcpu); kvm_update_state_based_on_cpuid(vcpu); out: -- 2.18.2
[PATCH 2/6] KVM: X86: Go on updating other CPUID leaves when leaf 1 is absent
As handling of bits in leaves other than leaf 1 has been added over time, kvm_update_cpuid() should not return directly if leaf 1 is absent, but should go on updating the other CPUID leaves. Signed-off-by: Xiaoyao Li --- arch/x86/kvm/cpuid.c | 23 +++ 1 file changed, 11 insertions(+), 12 deletions(-) diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c index 2f1a9650b7f2..795bbaf37110 100644 --- a/arch/x86/kvm/cpuid.c +++ b/arch/x86/kvm/cpuid.c @@ -60,22 +60,21 @@ int kvm_update_cpuid(struct kvm_vcpu *vcpu) struct kvm_lapic *apic = vcpu->arch.apic; best = kvm_find_cpuid_entry(vcpu, 1, 0); - if (!best) - return 0; - - /* Update OSXSAVE bit */ - if (boot_cpu_has(X86_FEATURE_XSAVE) && best->function == 0x1) - cpuid_entry_change(best, X86_FEATURE_OSXSAVE, + if (best) { + /* Update OSXSAVE bit */ + if (boot_cpu_has(X86_FEATURE_XSAVE)) + cpuid_entry_change(best, X86_FEATURE_OSXSAVE, kvm_read_cr4_bits(vcpu, X86_CR4_OSXSAVE)); - cpuid_entry_change(best, X86_FEATURE_APIC, + cpuid_entry_change(best, X86_FEATURE_APIC, vcpu->arch.apic_base & MSR_IA32_APICBASE_ENABLE); - if (apic) { - if (cpuid_entry_has(best, X86_FEATURE_TSC_DEADLINE_TIMER)) - apic->lapic_timer.timer_mode_mask = 3 << 17; - else - apic->lapic_timer.timer_mode_mask = 1 << 17; + if (apic) { + if (cpuid_entry_has(best, X86_FEATURE_TSC_DEADLINE_TIMER)) + apic->lapic_timer.timer_mode_mask = 3 << 17; + else + apic->lapic_timer.timer_mode_mask = 1 << 17; + } } best = kvm_find_cpuid_entry(vcpu, 7, 0); -- 2.18.2
[PATCH 0/6] Refactor handling flow of SET_CPUID*
This series is the extended version of https://lkml.kernel.org/r/20200528151927.14346-1-xiaoyao...@intel.com The first two patches are bug fixes, and the others aim to refactor the flow of SET_CPUID* as: 1. cpuid check: check if userspace provides legal CPUID settings; 2. cpuid update: update some special CPUID bits based on current vcpu state, e.g., OSXSAVE, OSPKE, ... 3. update KVM state: update KVM states based on the final CPUID settings. Xiaoyao Li (6): KVM: X86: Reset vcpu->arch.cpuid_nent to 0 if SET_CPUID fails KVM: X86: Go on updating other CPUID leaves when leaf 1 is absent KVM: X86: Introduce kvm_check_cpuid() KVM: X86: Split kvm_update_cpuid() KVM: X86: Move kvm_x86_ops.cpuid_update() into kvm_update_state_based_on_cpuid() KVM: X86: Move kvm_apic_set_version() to kvm_update_state_based_on_cpuid() arch/x86/kvm/cpuid.c | 107 +++ arch/x86/kvm/cpuid.h | 3 +- arch/x86/kvm/x86.c | 1 + 3 files changed, 70 insertions(+), 41 deletions(-) -- 2.18.2
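The three-step flow in the cover letter can be sketched in plain C. This is a toy model, not kernel code: the struct, stub bodies, and error code are hypothetical stand-ins; only the check → update-CPUID → update-state ordering (and the invalidate-on-failure behavior from patch 1) mirrors the series.

```c
/* Hypothetical stand-in for the vcpu state touched by SET_CPUID*. */
struct toy_vcpu {
    int cpuid_nent;
    int vaddr_bits;     /* stand-in for the CPUID 0x80000008 data */
};

static int toy_check_cpuid(struct toy_vcpu *v)
{
    /* Step 1: reject illegal userspace input before touching state. */
    if (v->vaddr_bits != 0 && v->vaddr_bits != 48 && v->vaddr_bits != 57)
        return -22;     /* -EINVAL */
    return 0;
}

/* Step 2: fix up special CPUID bits (OSXSAVE, OSPKE, ...). */
static void toy_update_cpuid(struct toy_vcpu *v) { (void)v; }

/* Step 3: derive KVM state from the final CPUID settings. */
static void toy_update_state_based_on_cpuid(struct toy_vcpu *v) { (void)v; }

int toy_set_cpuid(struct toy_vcpu *v, int nent)
{
    int r;

    v->cpuid_nent = nent;
    r = toy_check_cpuid(v);
    if (r) {
        v->cpuid_nent = 0;  /* patch 1: invalidate CPUID on failure */
        return r;
    }
    toy_update_cpuid(v);
    toy_update_state_based_on_cpuid(v);
    return 0;
}
```

The key design point is that the check runs first and leaves no half-applied configuration behind on failure.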
[PATCH 4/6] KVM: X86: Split kvm_update_cpuid()
Split the part of updating KVM states from kvm_update_cpuid(), and put it into a new kvm_update_state_based_on_cpuid(). So it's clear that kvm_update_cpuid() is to update guest CPUID settings, while kvm_update_state_based_on_cpuid() is to update KVM states based on the updated CPUID settings. Signed-off-by: Xiaoyao Li --- arch/x86/kvm/cpuid.c | 38 -- arch/x86/kvm/cpuid.h | 1 + arch/x86/kvm/x86.c | 1 + 3 files changed, 26 insertions(+), 14 deletions(-) diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c index c8cb373056f1..a4a2072f5253 100644 --- a/arch/x86/kvm/cpuid.c +++ b/arch/x86/kvm/cpuid.c @@ -76,7 +76,6 @@ static int kvm_check_cpuid(struct kvm_vcpu *vcpu) void kvm_update_cpuid(struct kvm_vcpu *vcpu) { struct kvm_cpuid_entry2 *best; - struct kvm_lapic *apic = vcpu->arch.apic; best = kvm_find_cpuid_entry(vcpu, 1, 0); if (best) { @@ -87,13 +86,6 @@ void kvm_update_cpuid(struct kvm_vcpu *vcpu) cpuid_entry_change(best, X86_FEATURE_APIC, vcpu->arch.apic_base & MSR_IA32_APICBASE_ENABLE); - - if (apic) { - if (cpuid_entry_has(best, X86_FEATURE_TSC_DEADLINE_TIMER)) - apic->lapic_timer.timer_mode_mask = 3 << 17; - else - apic->lapic_timer.timer_mode_mask = 1 << 17; - } } best = kvm_find_cpuid_entry(vcpu, 7, 0); @@ -102,13 +94,8 @@ void kvm_update_cpuid(struct kvm_vcpu *vcpu) kvm_read_cr4_bits(vcpu, X86_CR4_PKE)); best = kvm_find_cpuid_entry(vcpu, 0xD, 0); - if (!best) { - vcpu->arch.guest_supported_xcr0 = 0; - } else { - vcpu->arch.guest_supported_xcr0 = - (best->eax | ((u64)best->edx << 32)) & supported_xcr0; + if (best) best->ebx = xstate_required_size(vcpu->arch.xcr0, false); - } best = kvm_find_cpuid_entry(vcpu, 0xD, 1); if (best && (cpuid_entry_has(best, X86_FEATURE_XSAVES) || @@ -127,6 +114,27 @@ void kvm_update_cpuid(struct kvm_vcpu *vcpu) vcpu->arch.ia32_misc_enable_msr & MSR_IA32_MISC_ENABLE_MWAIT); } +} + +void kvm_update_state_based_on_cpuid(struct kvm_vcpu *vcpu) +{ + struct kvm_lapic *apic = vcpu->arch.apic; + struct kvm_cpuid_entry2 *best; + + best = 
kvm_find_cpuid_entry(vcpu, 1, 0); + if (best && apic) { + if (cpuid_entry_has(best, X86_FEATURE_TSC_DEADLINE_TIMER)) + apic->lapic_timer.timer_mode_mask = 3 << 17; + else + apic->lapic_timer.timer_mode_mask = 1 << 17; + } + + best = kvm_find_cpuid_entry(vcpu, 0xD, 0); + if (!best) + vcpu->arch.guest_supported_xcr0 = 0; + else + vcpu->arch.guest_supported_xcr0 = + (best->eax | ((u64)best->edx << 32)) & supported_xcr0; /* Note, maxphyaddr must be updated before tdp_level. */ vcpu->arch.maxphyaddr = cpuid_query_maxphyaddr(vcpu); @@ -221,6 +229,7 @@ int kvm_vcpu_ioctl_set_cpuid(struct kvm_vcpu *vcpu, kvm_apic_set_version(vcpu); kvm_x86_ops.cpuid_update(vcpu); kvm_update_cpuid(vcpu); + kvm_update_state_based_on_cpuid(vcpu); out: vfree(cpuid_entries); @@ -250,6 +259,7 @@ int kvm_vcpu_ioctl_set_cpuid2(struct kvm_vcpu *vcpu, kvm_apic_set_version(vcpu); kvm_x86_ops.cpuid_update(vcpu); kvm_update_cpuid(vcpu); + kvm_update_state_based_on_cpuid(vcpu); out: return r; } diff --git a/arch/x86/kvm/cpuid.h b/arch/x86/kvm/cpuid.h index f136de1debad..0ccf9ec4bf55 100644 --- a/arch/x86/kvm/cpuid.h +++ b/arch/x86/kvm/cpuid.h @@ -10,6 +10,7 @@ extern u32 kvm_cpu_caps[NCAPINTS] __read_mostly; void kvm_set_cpu_caps(void); void kvm_update_cpuid(struct kvm_vcpu *vcpu); +void kvm_update_state_based_on_cpuid(struct kvm_vcpu *vcpu); struct kvm_cpuid_entry2 *kvm_find_cpuid_entry(struct kvm_vcpu *vcpu, u32 function, u32 index); int kvm_dev_ioctl_get_cpuid(struct kvm_cpuid2 *cpuid, diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index dc993b79ae2f..fcb85432d30b 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -8100,6 +8100,7 @@ static void enter_smm(struct kvm_vcpu *vcpu) #endif kvm_update_cpuid(vcpu); + kvm_update_state_based_on_cpuid(vcpu); kvm_mmu_reset_context(vcpu); } -- 2.18.2
[PATCH 1/6] KVM: X86: Reset vcpu->arch.cpuid_nent to 0 if SET_CPUID fails
The CPUID configuration needs to be invalidated if userspace provides illegal input. Signed-off-by: Xiaoyao Li --- arch/x86/kvm/cpuid.c | 4 ++++ 1 file changed, 4 insertions(+) diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c index cd708b0b460a..2f1a9650b7f2 100644 --- a/arch/x86/kvm/cpuid.c +++ b/arch/x86/kvm/cpuid.c @@ -210,6 +210,8 @@ int kvm_vcpu_ioctl_set_cpuid(struct kvm_vcpu *vcpu, kvm_apic_set_version(vcpu); kvm_x86_ops.cpuid_update(vcpu); r = kvm_update_cpuid(vcpu); + if (r) + vcpu->arch.cpuid_nent = 0; out: vfree(cpuid_entries); @@ -233,6 +235,8 @@ int kvm_vcpu_ioctl_set_cpuid2(struct kvm_vcpu *vcpu, kvm_apic_set_version(vcpu); kvm_x86_ops.cpuid_update(vcpu); r = kvm_update_cpuid(vcpu); + if (r) + vcpu->arch.cpuid_nent = 0; out: return r; } -- 2.18.2
[PATCH 6/6] KVM: X86: Move kvm_apic_set_version() to kvm_update_state_based_on_cpuid()
Obviously, kvm_apic_set_version() fits well in kvm_update_state_based_on_cpuid(). Signed-off-by: Xiaoyao Li --- arch/x86/kvm/cpuid.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c index 5d4da8970940..eb60098aca29 100644 --- a/arch/x86/kvm/cpuid.c +++ b/arch/x86/kvm/cpuid.c @@ -127,6 +127,8 @@ void kvm_update_state_based_on_cpuid(struct kvm_vcpu *vcpu) apic->lapic_timer.timer_mode_mask = 3 << 17; else apic->lapic_timer.timer_mode_mask = 1 << 17; + + kvm_apic_set_version(vcpu); } best = kvm_find_cpuid_entry(vcpu, 0xD, 0); @@ -228,7 +230,6 @@ int kvm_vcpu_ioctl_set_cpuid(struct kvm_vcpu *vcpu, } cpuid_fix_nx_cap(vcpu); - kvm_apic_set_version(vcpu); kvm_update_cpuid(vcpu); kvm_update_state_based_on_cpuid(vcpu); @@ -257,7 +258,6 @@ int kvm_vcpu_ioctl_set_cpuid2(struct kvm_vcpu *vcpu, goto out; } - kvm_apic_set_version(vcpu); kvm_update_cpuid(vcpu); kvm_update_state_based_on_cpuid(vcpu); out: -- 2.18.2
[PATCH 3/6] KVM: X86: Introduce kvm_check_cpuid()
Use kvm_check_cpuid() to validate if userspace provides legal cpuid settings and call it before KVM updates CPUID. Signed-off-by: Xiaoyao Li --- Is the check of virtual address width really necessary? KVM doesn't check other bits at all. I guess the policy is that KVM allows illegal CPUID settings as long as they don't break the host kernel. Userspace takes the consequences itself if it sets bogus CPUID settings that break its guest. But why is vaddr_bits special? It seems illegal vaddr_bits won't break the host kernel. --- arch/x86/kvm/cpuid.c | 54 arch/x86/kvm/cpuid.h | 2 +- 2 files changed, 35 insertions(+), 21 deletions(-) diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c index 795bbaf37110..c8cb373056f1 100644 --- a/arch/x86/kvm/cpuid.c +++ b/arch/x86/kvm/cpuid.c @@ -54,7 +54,26 @@ static u32 xstate_required_size(u64 xstate_bv, bool compacted) #define F feature_bit -int kvm_update_cpuid(struct kvm_vcpu *vcpu) +static int kvm_check_cpuid(struct kvm_vcpu *vcpu) +{ + struct kvm_cpuid_entry2 *best; + + /* +* The existing code assumes virtual address is 48-bit or 57-bit in the +* canonical address checks; exit if it is ever changed. +*/ + best = kvm_find_cpuid_entry(vcpu, 0x80000008, 0); + if (best) { + int vaddr_bits = (best->eax & 0xff00) >> 8; + + if (vaddr_bits != 48 && vaddr_bits != 57 && vaddr_bits != 0) + return -EINVAL; + } + + return 0; +} + +void kvm_update_cpuid(struct kvm_vcpu *vcpu) { struct kvm_cpuid_entry2 *best; struct kvm_lapic *apic = vcpu->arch.apic; @@ -96,18 +115,6 @@ void kvm_update_cpuid(struct kvm_vcpu *vcpu) cpuid_entry_has(best, X86_FEATURE_XSAVEC))) best->ebx = xstate_required_size(vcpu->arch.xcr0, true); - /* -* The existing code assumes virtual address is 48-bit or 57-bit in the -* canonical address checks; exit if it is ever changed. 
-*/ - best = kvm_find_cpuid_entry(vcpu, 0x80000008, 0); - if (best) { - int vaddr_bits = (best->eax & 0xff00) >> 8; - - if (vaddr_bits != 48 && vaddr_bits != 57 && vaddr_bits != 0) - return -EINVAL; - } - best = kvm_find_cpuid_entry(vcpu, KVM_CPUID_FEATURES, 0); if (kvm_hlt_in_guest(vcpu->kvm) && best && (best->eax & (1 << KVM_FEATURE_PV_UNHALT))) @@ -127,7 +134,6 @@ int kvm_update_cpuid(struct kvm_vcpu *vcpu) kvm_mmu_reset_context(vcpu); kvm_pmu_refresh(vcpu); - return 0; } static int is_efer_nx(void) @@ -205,12 +211,16 @@ int kvm_vcpu_ioctl_set_cpuid(struct kvm_vcpu *vcpu, vcpu->arch.cpuid_entries[i].padding[2] = 0; } vcpu->arch.cpuid_nent = cpuid->nent; + r = kvm_check_cpuid(vcpu); + if (r) { + vcpu->arch.cpuid_nent = 0; + goto out; + } + cpuid_fix_nx_cap(vcpu); kvm_apic_set_version(vcpu); kvm_x86_ops.cpuid_update(vcpu); - r = kvm_update_cpuid(vcpu); - if (r) - vcpu->arch.cpuid_nent = 0; + kvm_update_cpuid(vcpu); out: vfree(cpuid_entries); @@ -231,11 +241,15 @@ int kvm_vcpu_ioctl_set_cpuid2(struct kvm_vcpu *vcpu, cpuid->nent * sizeof(struct kvm_cpuid_entry2))) goto out; vcpu->arch.cpuid_nent = cpuid->nent; + r = kvm_check_cpuid(vcpu); + if (r) { + vcpu->arch.cpuid_nent = 0; + goto out; + } + kvm_apic_set_version(vcpu); kvm_x86_ops.cpuid_update(vcpu); - r = kvm_update_cpuid(vcpu); - if (r) - vcpu->arch.cpuid_nent = 0; + kvm_update_cpuid(vcpu); out: return r; } diff --git a/arch/x86/kvm/cpuid.h b/arch/x86/kvm/cpuid.h index 05434cd9342f..f136de1debad 100644 --- a/arch/x86/kvm/cpuid.h +++ b/arch/x86/kvm/cpuid.h @@ -9,7 +9,7 @@ extern u32 kvm_cpu_caps[NCAPINTS] __read_mostly; void kvm_set_cpu_caps(void); -int kvm_update_cpuid(struct kvm_vcpu *vcpu); +void kvm_update_cpuid(struct kvm_vcpu *vcpu); struct kvm_cpuid_entry2 *kvm_find_cpuid_entry(struct kvm_vcpu *vcpu, u32 function, u32 index); int kvm_dev_ioctl_get_cpuid(struct kvm_cpuid2 *cpuid, -- 2.18.2
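The vaddr_bits check that this patch moves into kvm_check_cpuid() can be exercised standalone. A userspace-style sketch (toy code, not the kernel function itself): CPUID.80000008H:EAX[15:8] reports the virtual address width, and the canonical-address code only copes with 0 (leaf absent or zeroed), 48, or 57 bits.

```c
/*
 * Sketch of the vaddr_bits sanity check: extract bits 15:8 of
 * CPUID.80000008H:EAX and accept only 0, 48, or 57. Returns 0 on
 * success or -22 (-EINVAL) on bogus input, like the kernel check.
 */
int check_vaddr_bits(unsigned int leaf_80000008_eax)
{
    int vaddr_bits = (leaf_80000008_eax & 0xff00) >> 8;

    if (vaddr_bits != 48 && vaddr_bits != 57 && vaddr_bits != 0)
        return -22;     /* -EINVAL */
    return 0;
}
```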
Re: [PATCH] KVM: X86: Call kvm_x86_ops.cpuid_update() after CPUIDs fully updated
On 5/29/2020 12:15 AM, Paolo Bonzini wrote: On 28/05/20 17:40, Xiaoyao Li wrote: kvm_x86_ops.cpuid_update() is used to update vmx/svm settings based on updated CPUID settings. So it's supposed to be called after CPUIDs are fully updated, not in the middle stage. Signed-off-by: Xiaoyao Li Are you seeing anything bad happening from this? Not yet. IMO changing the order is more reasonable and less confusing. Indeed, I just could not decide whether to include it in 5.7 or not. Maybe for 5.8. I have a new idea to refactor a bit more: I find that kvm_update_cpuid() does three things: - update CPUID; - update vcpu state, e.g., apic->lapic_timer.timer_mode_mask, guest_supported_xcr0, maxphyaddr, etc.; - CPUID checks, for vaddr_bits. I'm going to split it and make the order: 1. kvm_check_cpuid(), returning an error on invalid values; 2. kvm_update_cpuid(); 3. kvm_update_state_based_on_cpuid(), with kvm_x86_ops.cpuid_update() called inside it. If that sounds OK, I'll do it tomorrow. -Xiaoyao
Re: [PATCH] KVM: X86: Call kvm_x86_ops.cpuid_update() after CPUIDs fully updated
On 5/28/2020 11:22 PM, Paolo Bonzini wrote: On 28/05/20 17:19, Xiaoyao Li wrote: kvm_x86_ops.cpuid_update() is used to update vmx/svm settings based on updated CPUID settings. So it's supposed to be called after CPUIDs are fully updated, not in the middle stage. Signed-off-by: Xiaoyao Li Are you seeing anything bad happening from this? Not yet. IMO changing the order is more reasonable and less confusing. Paolo --- arch/x86/kvm/cpuid.c | 10 -- 1 file changed, 8 insertions(+), 2 deletions(-) diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c index cd708b0b460a..753739bc1bf0 100644 --- a/arch/x86/kvm/cpuid.c +++ b/arch/x86/kvm/cpuid.c @@ -208,8 +208,11 @@ int kvm_vcpu_ioctl_set_cpuid(struct kvm_vcpu *vcpu, vcpu->arch.cpuid_nent = cpuid->nent; cpuid_fix_nx_cap(vcpu); kvm_apic_set_version(vcpu); - kvm_x86_ops.cpuid_update(vcpu); r = kvm_update_cpuid(vcpu); + if (r) + goto out; + + kvm_x86_ops.cpuid_update(vcpu); out: vfree(cpuid_entries); @@ -231,8 +234,11 @@ int kvm_vcpu_ioctl_set_cpuid2(struct kvm_vcpu *vcpu, goto out; vcpu->arch.cpuid_nent = cpuid->nent; kvm_apic_set_version(vcpu); - kvm_x86_ops.cpuid_update(vcpu); r = kvm_update_cpuid(vcpu); + if (r) + goto out; + + kvm_x86_ops.cpuid_update(vcpu); out: return r; }
[PATCH] KVM: X86: Call kvm_x86_ops.cpuid_update() after CPUIDs fully updated
kvm_x86_ops.cpuid_update() is used to update vmx/svm settings based on updated CPUID settings. So it's supposed to be called after CPUIDs are fully updated, not in the middle stage. Signed-off-by: Xiaoyao Li --- arch/x86/kvm/cpuid.c | 10 -- 1 file changed, 8 insertions(+), 2 deletions(-) diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c index cd708b0b460a..753739bc1bf0 100644 --- a/arch/x86/kvm/cpuid.c +++ b/arch/x86/kvm/cpuid.c @@ -208,8 +208,11 @@ int kvm_vcpu_ioctl_set_cpuid(struct kvm_vcpu *vcpu, vcpu->arch.cpuid_nent = cpuid->nent; cpuid_fix_nx_cap(vcpu); kvm_apic_set_version(vcpu); - kvm_x86_ops.cpuid_update(vcpu); r = kvm_update_cpuid(vcpu); + if (r) + goto out; + + kvm_x86_ops.cpuid_update(vcpu); out: vfree(cpuid_entries); @@ -231,8 +234,11 @@ int kvm_vcpu_ioctl_set_cpuid2(struct kvm_vcpu *vcpu, goto out; vcpu->arch.cpuid_nent = cpuid->nent; kvm_apic_set_version(vcpu); - kvm_x86_ops.cpuid_update(vcpu); r = kvm_update_cpuid(vcpu); + if (r) + goto out; + + kvm_x86_ops.cpuid_update(vcpu); out: return r; } -- 2.18.2
Re: [PATCH v9 0/8] KVM: Add virtualization support of split lock detection
Hi Thomas, On 5/18/2020 9:27 AM, Xiaoyao Li wrote: On 5/9/2020 7:05 PM, Xiaoyao Li wrote: This series aims to add the virtualization of split lock detection in KVM. Due to the fact that split lock detection is tightly coupled with CPU model and CPU model is configurable by host VMM, we elect to use paravirt method to expose and enumerate it for guest. Thomas and Paolo, Do you have time to have a look at this version? Does this series have any chance to meet 5.8? If not, do you plan to take a look at it after merge window? Thanks, -Xiaoyao
Re: [PATCH 2/2] kvm/x86: don't expose MSR_IA32_UMWAIT_CONTROL unconditionally
On 5/21/2020 12:56 AM, Maxim Levitsky wrote: On Wed, 2020-05-20 at 18:33 +0200, Vitaly Kuznetsov wrote: Maxim Levitsky writes: This msr is only available when the host supports WAITPKG feature. This breaks a nested guest, if the L1 hypervisor is set to ignore unknown msrs, because the only other safety check that the kernel does is that it attempts to read the msr and rejects it if it gets an exception. Fixes: 6e3ba4abce KVM: vmx: Emulate MSR IA32_UMWAIT_CONTROL Signed-off-by: Maxim Levitsky --- arch/x86/kvm/x86.c | 4 1 file changed, 4 insertions(+) diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index fe3a24fd6b263..9c507b32b1b77 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -5314,6 +5314,10 @@ static void kvm_init_msr_list(void) if (msrs_to_save_all[i] - MSR_ARCH_PERFMON_EVENTSEL0 >= min(INTEL_PMC_MAX_GENERIC, x86_pmu.num_counters_gp)) continue; + break; + case MSR_IA32_UMWAIT_CONTROL: + if (!kvm_cpu_cap_has(X86_FEATURE_WAITPKG)) + continue; I'm probably missing something but (if I understand correctly) the only effect of dropping MSR_IA32_UMWAIT_CONTROL from msrs_to_save would be that KVM userspace won't see it in e.g. KVM_GET_MSR_INDEX_LIST. But why is this causing an issue? I see both vmx_get_msr()/vmx_set_msr() have 'host_initiated' check: case MSR_IA32_UMWAIT_CONTROL: if (!msr_info->host_initiated && !vmx_has_waitpkg(vmx)) return 1; Here it fails like that: 1. KVM_GET_MSR_INDEX_LIST returns this msrs, and qemu notes that it is supported in 'has_msr_umwait' global var In general, KVM_GET_MSR_INDEX_LIST won't return MSR_IA32_UMWAIT_CONTROL if KVM cannot read this MSR, see kvm_init_msr_list(). You hit issue because you used "ignore_msrs".
Re: [PATCH 2/2] kvm/x86: don't expose MSR_IA32_UMWAIT_CONTROL unconditionally
On 5/21/2020 1:28 PM, Tao Xu wrote: On 5/21/2020 12:33 PM, Xiaoyao Li wrote: On 5/21/2020 5:05 AM, Paolo Bonzini wrote: On 20/05/20 18:07, Maxim Levitsky wrote: This msr is only available when the host supports WAITPKG feature. This breaks a nested guest, if the L1 hypervisor is set to ignore unknown msrs, because the only other safety check that the kernel does is that it attempts to read the msr and rejects it if it gets an exception. Fixes: 6e3ba4abce KVM: vmx: Emulate MSR IA32_UMWAIT_CONTROL Signed-off-by: Maxim Levitsky --- arch/x86/kvm/x86.c | 4 1 file changed, 4 insertions(+) diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index fe3a24fd6b263..9c507b32b1b77 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -5314,6 +5314,10 @@ static void kvm_init_msr_list(void) if (msrs_to_save_all[i] - MSR_ARCH_PERFMON_EVENTSEL0 >= min(INTEL_PMC_MAX_GENERIC, x86_pmu.num_counters_gp)) continue; + break; + case MSR_IA32_UMWAIT_CONTROL: + if (!kvm_cpu_cap_has(X86_FEATURE_WAITPKG)) + continue; default: break; } The patch is correct, and matches what is done for the other entries of msrs_to_save_all. However, while looking at it I noticed that X86_FEATURE_WAITPKG is actually never added, and that is because it was also not added to the supported CPUID in commit e69e72faa3a0 ("KVM: x86: Add support for user wait instructions", 2019-09-24), which was before the kvm_cpu_cap mechanism was added. So while at it you should also fix that. The right way to do that is to add a if (vmx_waitpkg_supported()) kvm_cpu_cap_check_and_set(X86_FEATURE_WAITPKG); + Tao I remember there is certainly some reason why we don't expose WAITPKG to guest by default. Tao, please help clarify it. Thanks, -Xiaoyao Because in VM, umwait and tpause can put a (psysical) CPU into a power saving state. So from host view, this cpu will be 100% usage by VM. Although umwait and tpause just cause short wait(maybe 100 microseconds), we still want to unconditionally expose WAITPKG in VM. 
I guess you typed "unconditionally" by mistake and actually meant "conditionally"?
Re: [PATCH 2/2] kvm/x86: don't expose MSR_IA32_UMWAIT_CONTROL unconditionally
On 5/21/2020 5:05 AM, Paolo Bonzini wrote: On 20/05/20 18:07, Maxim Levitsky wrote: This msr is only available when the host supports WAITPKG feature. This breaks a nested guest, if the L1 hypervisor is set to ignore unknown msrs, because the only other safety check that the kernel does is that it attempts to read the msr and rejects it if it gets an exception. Fixes: 6e3ba4abce KVM: vmx: Emulate MSR IA32_UMWAIT_CONTROL Signed-off-by: Maxim Levitsky --- arch/x86/kvm/x86.c | 4 1 file changed, 4 insertions(+) diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index fe3a24fd6b263..9c507b32b1b77 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -5314,6 +5314,10 @@ static void kvm_init_msr_list(void) if (msrs_to_save_all[i] - MSR_ARCH_PERFMON_EVENTSEL0 >= min(INTEL_PMC_MAX_GENERIC, x86_pmu.num_counters_gp)) continue; + break; + case MSR_IA32_UMWAIT_CONTROL: + if (!kvm_cpu_cap_has(X86_FEATURE_WAITPKG)) + continue; default: break; } The patch is correct, and matches what is done for the other entries of msrs_to_save_all. However, while looking at it I noticed that X86_FEATURE_WAITPKG is actually never added, and that is because it was also not added to the supported CPUID in commit e69e72faa3a0 ("KVM: x86: Add support for user wait instructions", 2019-09-24), which was before the kvm_cpu_cap mechanism was added. So while at it you should also fix that. The right way to do that is to add a if (vmx_waitpkg_supported()) kvm_cpu_cap_check_and_set(X86_FEATURE_WAITPKG); + Tao I remember there is certainly some reason why we don't expose WAITPKG to guest by default. Tao, please help clarify it. Thanks, -Xiaoyao in vmx_set_cpu_caps. Thanks, Paolo
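The fix discussed in this thread filters MSR_IA32_UMWAIT_CONTROL out of the reported MSR list when the host lacks WAITPKG. A toy model of that filtering (not kernel code; the host_has_waitpkg flag stands in for kvm_cpu_cap_has(X86_FEATURE_WAITPKG)):

```c
#include <stdint.h>
#include <stddef.h>

#define MSR_IA32_UMWAIT_CONTROL 0xe1u   /* architectural MSR index */

/*
 * Toy model of kvm_init_msr_list(): an MSR guarded by a CPU feature is
 * dropped from the list when the feature is absent, so
 * KVM_GET_MSR_INDEX_LIST never advertises an MSR that save/restore
 * would fail on. Returns the number of MSR indices kept in 'out'.
 */
size_t filter_msr_list(const uint32_t *in, size_t n, uint32_t *out,
                       int host_has_waitpkg)
{
    size_t kept = 0;

    for (size_t i = 0; i < n; i++) {
        if (in[i] == MSR_IA32_UMWAIT_CONTROL && !host_has_waitpkg)
            continue;   /* host lacks WAITPKG: hide the MSR */
        out[kept++] = in[i];
    }
    return kept;
}
```

This is why the bug only bit with "ignore_msrs": the read-probe fallback stopped rejecting the MSR, while the list still advertised it.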
Re: [PATCH] kvm: x86: Use KVM CPU capabilities to determine CR4 reserved bits
On 5/18/2020 8:31 PM, Paolo Bonzini wrote: On 18/05/20 06:52, Xiaoyao Li wrote: On 5/6/2020 5:44 PM, Paolo Bonzini wrote: Using CPUID data can be useful for the processor compatibility check, but that's it. Using it to compute guest-reserved bits can have both false positives (such as LA57 and UMIP which we are already handling) and false negatives: in particular, with this patch we don't allow anymore a KVM guest to set CR4.PKE when CR4.PKE is clear on the host. A common question about whether a feature can be exposed to guest: Given a feature, there is a CPUID bit to enumerate it, and a CR4 bit to turn it on/off. Whether the feature can be exposed to guest only depends on host CR4 setting? I.e., if CPUID bit is not cleared in cpu_data in host but host kernel doesn't set the corresponding CR4 bit to turn it on, we cannot expose the feature to guest. right? It depends. The most obvious case is that the host kernel doesn't use CR4.PSE but we even use 4MB pages to emulate paging disabled mode when the processor doesn't support unrestricted guests. Basically, the question is whether we are able to save/restore any processor state attached to the CR4 bit on vmexit/vmentry. In this case there is no PKRU field in the VMCS and the RDPKRU/WRPKRU instructions require CR4.PKE=1; therefore, we cannot let the guest enable CR4.PKE unless it's also enabled on the host. aha! That's reason! Thanks for the clarification.
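Paolo's criterion — a CR4 bit is reserved for the guest unless KVM can save/restore the state behind it — can be sketched as a tiny reserved-bits helper. This is an illustrative simplification, not KVM's `__cr4_reserved_bits`: for PKE there is no PKRU field in the VMCS and RDPKRU/WRPKRU require CR4.PKE=1, so the host CR4 value, not raw CPUID data, decides; LA57 shows the ordinary CPUID-driven case for contrast.

```c
#include <stdint.h>

#define X86_CR4_LA57 (1ull << 12)
#define X86_CR4_PKE  (1ull << 22)

/*
 * Sketch: compute guest-reserved CR4 bits. PKE is reserved whenever
 * the HOST runs with CR4.PKE=0 (PKRU cannot be context-switched);
 * LA57 is reserved simply when the guest CPUID lacks the feature.
 */
uint64_t cr4_guest_reserved_bits(int host_cr4_pke, int guest_cpuid_la57)
{
    uint64_t reserved = 0;

    if (!host_cr4_pke)
        reserved |= X86_CR4_PKE;
    if (!guest_cpuid_la57)
        reserved |= X86_CR4_LA57;
    return reserved;
}
```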
Re: [PATCH] kvm: x86: Use KVM CPU capabilities to determine CR4 reserved bits
On 5/6/2020 5:44 PM, Paolo Bonzini wrote: Using CPUID data can be useful for the processor compatibility check, but that's it. Using it to compute guest-reserved bits can have both false positives (such as LA57 and UMIP which we are already handling) and false negatives: in particular, with this patch we don't allow anymore a KVM guest to set CR4.PKE when CR4.PKE is clear on the host. A common question about whether a feature can be exposed to guest: Given a feature, there is a CPUID bit to enumerate it, and a CR4 bit to turn it on/off. Whether the feature can be exposed to guest only depends on host CR4 setting? I.e., if CPUID bit is not cleared in cpu_data in host but host kernel doesn't set the corresponding CR4 bit to turn it on, we cannot expose the feature to guest. right?
Re: [PATCH v9 0/8] KVM: Add virtualization support of split lock detection
On 5/9/2020 7:05 PM, Xiaoyao Li wrote: This series aims to add the virtualization of split lock detection in KVM. Due to the fact that split lock detection is tightly coupled with CPU model and CPU model is configurable by host VMM, we elect to use paravirt method to expose and enumerate it for guest. Thomas and Paolo, Do you have time to have a look at this version?
Re: [PATCH v9 8/8] x86/split_lock: Enable split lock detection initialization when running as a guest on KVM
On 5/10/2020 1:15 PM, Andy Lutomirski wrote: On Fri, May 8, 2020 at 8:04 PM Xiaoyao Li wrote: When running as guest, enumerating feature split lock detection through CPU model is not easy since CPU model is configurable by host VMM. If running upon KVM, it can be enumerated through KVM_FEATURE_SPLIT_LOCK_DETECT, This needs crystal clear documentation. What, exactly, is the host telling the guest if it sets this flag? and if KVM_HINTS_SLD_FATAL is set, it needs to be set to sld_fatal mode. This needs much better docs. Do you mean: "If KVM_HINTS_SLD_FATAL is set, then the guest will get #AC if it does a split-lock regardless of what is written to MSR_TEST_CTRL?" Hi Andy, KVM_FEATURE_SPLIT_LOCK_DETECT, KVM_HINTS_SLD_FATAL and their docs are introduced in Patch 5. Do I still need to explain them in detail in this patch?
[PATCH v9 3/8] x86/split_lock: Introduce flag X86_FEATURE_SLD_FATAL and drop sld_state
Introduce a synthetic feature flag X86_FEATURE_SLD_FATAL, which means the kernel is in sld_fatal mode if it is set. sld_state is no longer needed, since the state of SLD can be inferred from X86_FEATURE_SPLIT_LOCK_DETECT and X86_FEATURE_SLD_FATAL. Suggested-by: Sean Christopherson Signed-off-by: Xiaoyao Li --- arch/x86/include/asm/cpufeatures.h | 1 + arch/x86/kernel/cpu/intel.c | 16 ++-- 2 files changed, 7 insertions(+), 10 deletions(-) diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h index db189945e9b0..260adfc6c61a 100644 --- a/arch/x86/include/asm/cpufeatures.h +++ b/arch/x86/include/asm/cpufeatures.h @@ -286,6 +286,7 @@ #define X86_FEATURE_FENCE_SWAPGS_USER (11*32+ 4) /* "" LFENCE in user entry SWAPGS path */ #define X86_FEATURE_FENCE_SWAPGS_KERNEL (11*32+ 5) /* "" LFENCE in kernel entry SWAPGS path */ #define X86_FEATURE_SPLIT_LOCK_DETECT (11*32+ 6) /* #AC for split lock */ +#define X86_FEATURE_SLD_FATAL (11*32+ 7) /* split lock detection in fatal mode */ /* Intel-defined CPU features, CPUID level 0x00000007:1 (EAX), word 12 */ #define X86_FEATURE_AVX512_BF16 (12*32+ 5) /* AVX512 BFLOAT16 instructions */ diff --git a/arch/x86/kernel/cpu/intel.c b/arch/x86/kernel/cpu/intel.c index 4602dac14dcb..93b8ccf2fa11 100644 --- a/arch/x86/kernel/cpu/intel.c +++ b/arch/x86/kernel/cpu/intel.c @@ -40,12 +40,6 @@ enum split_lock_detect_state { sld_fatal, }; -/* - * Default to sld_off because most systems do not support split lock detection - * split_lock_setup() will switch this to sld_warn on systems that support - * split lock detect, unless there is a command line override. 
- */ -static enum split_lock_detect_state sld_state __ro_after_init = sld_off; static u64 msr_test_ctrl_cache __ro_after_init; /* @@ -1043,8 +1037,9 @@ static void __init split_lock_setup(void) return; } - sld_state = state; setup_force_cpu_cap(X86_FEATURE_SPLIT_LOCK_DETECT); + if (state == sld_fatal) + setup_force_cpu_cap(X86_FEATURE_SLD_FATAL); } /* @@ -1064,7 +1059,7 @@ static void sld_update_msr(bool on) static void split_lock_init(void) { - split_lock_verify_msr(sld_state != sld_off); + split_lock_verify_msr(boot_cpu_has(X86_FEATURE_SPLIT_LOCK_DETECT)); } static void split_lock_warn(unsigned long ip) @@ -1083,7 +1078,7 @@ static void split_lock_warn(unsigned long ip) bool handle_guest_split_lock(unsigned long ip) { - if (sld_state == sld_warn) { + if (!boot_cpu_has(X86_FEATURE_SLD_FATAL)) { split_lock_warn(ip); return true; } @@ -1100,7 +1095,8 @@ EXPORT_SYMBOL_GPL(handle_guest_split_lock); bool handle_user_split_lock(struct pt_regs *regs, long error_code) { - if ((regs->flags & X86_EFLAGS_AC) || sld_state == sld_fatal) + if ((regs->flags & X86_EFLAGS_AC) || + boot_cpu_has(X86_FEATURE_SLD_FATAL)) return false; split_lock_warn(regs->ip); return true; -- 2.18.2
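The patch drops the sld_state variable in favor of two feature flags. A small sketch of how the state is recovered from the flags, matching the handle_guest_split_lock()/handle_user_split_lock() logic in the diff above (toy code; the int flags stand in for boot_cpu_has() checks):

```c
/*
 * Infer the split-lock-detect mode from the two synthetic feature
 * flags: no SPLIT_LOCK_DETECT -> off; SPLIT_LOCK_DETECT without
 * SLD_FATAL -> warn; both -> fatal.
 */
enum sld_state { sld_off, sld_warn, sld_fatal };

enum sld_state infer_sld_state(int has_split_lock_detect, int has_sld_fatal)
{
    if (!has_split_lock_detect)
        return sld_off;     /* feature absent or disabled */
    return has_sld_fatal ? sld_fatal : sld_warn;
}
```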
[PATCH v9 7/8] KVM: VMX: virtualize split lock detection
The TEST_CTRL MSR has per-core scope, i.e., sibling threads in the same physical core share the same MSR. This imposes additional constraints when exposing it to a guest. 1) When the host SLD state is sld_off (no X86_FEATURE_SPLIT_LOCK_DETECT), split lock detection is unsupported/disabled and cannot be exposed to the guest. 2) When the host SLD state is sld_warn (has X86_FEATURE_SPLIT_LOCK_DETECT but no X86_FEATURE_SLD_FATAL), split lock detection can be exposed to the guest only when SMT is disabled (nosmt), due to the per-core scope. In this case, the guest's setting can be propagated into the real hardware MSR. Further, to avoid the potential overhead of toggling MSR_TEST_CTRL.SLD on every VM-enter and VM-exit, KVM loads and keeps the guest's SLD setting while in the vcpu run loop with guest state loaded, i.e., between vmx_prepare_switch_to_guest() and vmx_prepare_switch_to_host(). 3) When the host SLD state is sld_fatal (has X86_FEATURE_SLD_FATAL), split lock detection can be exposed to the guest regardless of SMT, but KVM_HINTS_SLD_FATAL needs to be set. In this case, the guest can still set and clear the MSR_TEST_CTRL.SLD bit, but the bit value is never propagated to the real MSR; KVM always keeps the SLD bit turned on for the guest vcpu. The reason for not forcing the guest's MSR_TEST_CTRL.SLD bit to 1 is that the guest needs to set the bit itself to tell KVM it is SLD-aware. 
Signed-off-by: Xiaoyao Li --- arch/x86/kvm/cpuid.c | 6 arch/x86/kvm/vmx/vmx.c | 68 -- arch/x86/kvm/vmx/vmx.h | 3 ++ arch/x86/kvm/x86.c | 6 +++- arch/x86/kvm/x86.h | 7 + 5 files changed, 80 insertions(+), 10 deletions(-) diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c index 901cd1fdecd9..7d9f2daddaf3 100644 --- a/arch/x86/kvm/cpuid.c +++ b/arch/x86/kvm/cpuid.c @@ -717,9 +717,15 @@ static inline int __do_cpuid_func(struct kvm_cpuid_array *array, u32 function) if (sched_info_on()) entry->eax |= (1 << KVM_FEATURE_STEAL_TIME); + if (kvm_split_lock_detect_supported()) + entry->eax |= (1 << KVM_FEATURE_SPLIT_LOCK_DETECT); + entry->ebx = 0; entry->ecx = 0; entry->edx = 0; + + if (boot_cpu_has(X86_FEATURE_SLD_FATAL)) + entry->edx |= (1 << KVM_HINTS_SLD_FATAL); break; case 0x8000: entry->eax = min(entry->eax, 0x801f); diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c index dbec38ad5035..1cc386c5801d 100644 --- a/arch/x86/kvm/vmx/vmx.c +++ b/arch/x86/kvm/vmx/vmx.c @@ -1120,6 +1120,29 @@ void vmx_set_host_fs_gs(struct vmcs_host_state *host, u16 fs_sel, u16 gs_sel, } } +static inline u64 vmx_msr_test_ctrl_valid_bits(struct vcpu_vmx *vmx) +{ + u64 valid_bits = 0; + + if (vmx->guest_has_sld) + valid_bits |= MSR_TEST_CTRL_SPLIT_LOCK_DETECT; + + return valid_bits; +} + +static inline bool guest_sld_on(struct vcpu_vmx *vmx) +{ + return vmx->msr_test_ctrl & MSR_TEST_CTRL_SPLIT_LOCK_DETECT; +} + +static inline void vmx_update_guest_sld(struct vcpu_vmx *vmx) +{ + preempt_disable(); + if (vmx->guest_state_loaded) + split_lock_set_guest(guest_sld_on(vmx)); + preempt_enable(); +} + void vmx_prepare_switch_to_guest(struct kvm_vcpu *vcpu) { struct vcpu_vmx *vmx = to_vmx(vcpu); @@ -1188,6 +1211,10 @@ void vmx_prepare_switch_to_guest(struct kvm_vcpu *vcpu) #endif vmx_set_host_fs_gs(host_state, fs_sel, gs_sel, fs_base, gs_base); + + if (static_cpu_has(X86_FEATURE_SPLIT_LOCK_DETECT) && vmx->guest_has_sld) + vmx->host_sld_on = split_lock_set_guest(guest_sld_on(vmx)); + 
vmx->guest_state_loaded = true; } @@ -1226,6 +1253,10 @@ static void vmx_prepare_switch_to_host(struct vcpu_vmx *vmx) wrmsrl(MSR_KERNEL_GS_BASE, vmx->msr_host_kernel_gs_base); #endif load_fixmap_gdt(raw_smp_processor_id()); + + if (static_cpu_has(X86_FEATURE_SPLIT_LOCK_DETECT) && vmx->guest_has_sld) + split_lock_restore_host(vmx->host_sld_on); + vmx->guest_state_loaded = false; vmx->guest_msrs_ready = false; } @@ -1790,7 +1821,7 @@ static int vmx_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info) switch (msr_info->index) { case MSR_TEST_CTRL: - msr_info->data = 0; + msr_info->data = vmx->msr_test_ctrl; break; #ifdef CONFIG_X86_64 case MSR_FS_BASE: @@ -1946,9 +1977,12 @@ static int vmx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info) switch (msr_index) { case MSR_TEST_CTRL: - if (data) + if (data & ~vmx_msr_test_ctrl_valid_bits(vmx)) return 1;
[PATCH v9 8/8] x86/split_lock: Enable split lock detection initialization when running as a guest on KVM
When running as a guest, enumerating split lock detection through the CPU model is not reliable since the CPU model is configurable by the host VMM. When running on KVM, the feature can instead be enumerated through KVM_FEATURE_SPLIT_LOCK_DETECT, and if KVM_HINTS_SLD_FATAL is set, SLD must be set to sld_fatal mode. Signed-off-by: Xiaoyao Li --- arch/x86/include/asm/cpu.h | 2 ++ arch/x86/kernel/cpu/intel.c | 12 ++-- arch/x86/kernel/kvm.c | 3 +++ 3 files changed, 15 insertions(+), 2 deletions(-) diff --git a/arch/x86/include/asm/cpu.h b/arch/x86/include/asm/cpu.h index a57f00f1d5b5..5d5b488b4b45 100644 --- a/arch/x86/include/asm/cpu.h +++ b/arch/x86/include/asm/cpu.h @@ -42,12 +42,14 @@ unsigned int x86_model(unsigned int sig); unsigned int x86_stepping(unsigned int sig); #ifdef CONFIG_CPU_SUP_INTEL extern void __init cpu_set_core_cap_bits(struct cpuinfo_x86 *c); +extern void __init split_lock_setup(bool fatal); extern void switch_to_sld(unsigned long tifn); extern bool handle_user_split_lock(struct pt_regs *regs, long error_code); extern bool handle_guest_split_lock(unsigned long ip); extern bool split_lock_virt_switch(bool on); #else static inline void __init cpu_set_core_cap_bits(struct cpuinfo_x86 *c) {} +static inline void __init split_lock_setup(bool fatal) {} static inline void switch_to_sld(unsigned long tifn) {} static inline bool handle_user_split_lock(struct pt_regs *regs, long error_code) { diff --git a/arch/x86/kernel/cpu/intel.c b/arch/x86/kernel/cpu/intel.c index 1e2a74e8c592..02e24134b9b5 100644 --- a/arch/x86/kernel/cpu/intel.c +++ b/arch/x86/kernel/cpu/intel.c @@ -996,12 +996,18 @@ static bool split_lock_verify_msr(bool on) return ctrl == tmp; } -static void __init split_lock_setup(void) +void __init split_lock_setup(bool fatal) { enum split_lock_detect_state state = sld_warn; char arg[20]; int i, ret; + if (fatal) { + state = sld_fatal; + pr_info("forced on, sending SIGBUS on user-space split_locks\n"); + goto set_cap; + } + if (!split_lock_verify_msr(false)) { 
pr_info("MSR access failed: Disabled\n"); return; @@ -1037,6 +1043,7 @@ static void __init split_lock_setup(void) return; } +set_cap: setup_force_cpu_cap(X86_FEATURE_SPLIT_LOCK_DETECT); if (state == sld_fatal) setup_force_cpu_cap(X86_FEATURE_SLD_FATAL); @@ -1161,6 +1168,7 @@ void __init cpu_set_core_cap_bits(struct cpuinfo_x86 *c) const struct x86_cpu_id *m; u64 ia32_core_caps; + /* Note, paravirt support can enable SLD, e.g., see kvm_guest_init(). */ if (boot_cpu_has(X86_FEATURE_HYPERVISOR)) return; @@ -1182,5 +1190,5 @@ void __init cpu_set_core_cap_bits(struct cpuinfo_x86 *c) return; } - split_lock_setup(); + split_lock_setup(false); } diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c index 6efe0410fb72..489ea89e2e8e 100644 --- a/arch/x86/kernel/kvm.c +++ b/arch/x86/kernel/kvm.c @@ -670,6 +670,9 @@ static void __init kvm_guest_init(void) * overcommitted. */ hardlockup_detector_disable(); + + if (kvm_para_has_feature(KVM_FEATURE_SPLIT_LOCK_DETECT)) + split_lock_setup(kvm_para_has_hint(KVM_HINTS_SLD_FATAL)); } static noinline uint32_t __kvm_cpuid_base(void) -- 2.18.2