RE: [PATCH 2/4] KVM: Add SMAP support when setting CR4
-----Original Message----- From: Paolo Bonzini [mailto:paolo.bonz...@gmail.com] On Behalf Of Paolo Bonzini Sent: Friday, March 28, 2014 8:03 PM To: Wu, Feng; g...@redhat.com; h...@zytor.com; kvm@vger.kernel.org Subject: Re: [PATCH 2/4] KVM: Add SMAP support when setting CR4

On 28/03/2014 18:36, Feng Wu wrote:

+	smap = kvm_read_cr4_bits(vcpu, X86_CR4_SMAP);

You are overwriting this variable below, but that is not okay because the value of CR4 must be considered separately in each iteration. This also hides an uninitialized-variable bug for smap in the EPT case. To avoid that, rename this variable to cr4_smap; it's probably better to rename smep to cr4_smep too.

	for (byte = 0; byte < ARRAY_SIZE(mmu->permissions); ++byte) {
		pfec = byte << 1;
		map = 0;
		wf = pfec & PFERR_WRITE_MASK;
		uf = pfec & PFERR_USER_MASK;
		ff = pfec & PFERR_FETCH_MASK;
+		smapf = pfec & PFERR_RSVD_MASK;

The reader will expect PFERR_RSVD_MASK to be zero here. So please add a comment: /* PFERR_RSVD_MASK is set in pfec if ... */.

		for (bit = 0; bit < 8; ++bit) {
			x = bit & ACC_EXEC_MASK;
			w = bit & ACC_WRITE_MASK;
@@ -3627,11 +3629,27 @@ static void update_permission_bitmask(struct kvm_vcpu *vcpu,
				w |= !is_write_protection(vcpu) && !uf;
				/* Disallow supervisor fetches of user code if cr4.smep */
				x &= !(smep && u && !uf);
+
+				/*
+				 * SMAP: kernel-mode data accesses from user-mode
+				 * mappings should fault. A fault is considered
+				 * as a SMAP violation if all of the following
+				 * conditions are true:
+				 *   - X86_CR4_SMAP is set in CR4
+				 *   - A user page is accessed
+				 *   - Page fault in kernel mode
+				 *   - !(CPL < 3 && X86_EFLAGS_AC is set)
+				 *
+				 * Here, we cover the first three conditions;
+				 * we need to check CPL and X86_EFLAGS_AC in
+				 * permission_fault() dynamically.

These three lines however are not entirely correct. We do cover the last condition here, it is in smapf. So perhaps something like

				 * Here, we cover the first three conditions.
				 * The CPL and X86_EFLAGS_AC is in smapf, which
				 * permission_fault() computes dynamically.
+				 */
+				smap = smap && smapf && u && !uf;

SMAP does not affect instruction fetches. Do you need !ff here? Perhaps it's clearer to add it even if it is not strictly necessary. Please write just smap = cr4_smap && u && !uf && !ff here, and add back smapf below in the assignment to fault. This makes the code more homogeneous.

			} else
				/* Not really needed: no U/S accesses on ept */
				u = 1;

-			fault = (ff && !x) || (uf && !u) || (wf && !w);
+			fault = (ff && !x) || (uf && !u) || (wf && !w) || smap;

...

+
+	/*
+	 * If CPL < 3, SMAP protections are disabled if EFLAGS.AC = 1.
+	 *
+	 * If CPL = 3, SMAP applies to all supervisor-mode data accesses
+	 * (these are implicit supervisor accesses) regardless of the value
+	 * of EFLAGS.AC.
+	 *
+	 * So we need to check CPL and EFLAGS.AC to detect whether there is
+	 * a SMAP violation.
+	 */
+
+	smapf = ((mmu->permissions[(pfec | PFERR_RSVD_MASK) >> 1] >> pte_access)
+		 & 1) && !((cpl < 3) && ((rflags & X86_EFLAGS_AC) == 1));
+
+	return ((mmu->permissions[pfec >> 1] >> pte_access) & 1) || smapf;

You do not need two separate accesses. Just add PFERR_RSVD_MASK to pfec if the conditions for SMAP are satisfied. There are two possibilities:

1) setting PFERR_RSVD_MASK if SMAP is being enforced, that is if CPL = 3 || AC = 0. This is what you are doing now.

2) setting PFERR_RSVD_MASK if SMAP is being overridden, that is if CPL < 3 && AC = 1. You then have to invert the bit in update_permission_bitmask.

Please consider both choices, and pick the one that gives better code. Also, this must be written in a branchless way. Branchless tricks are common throughout the MMU code because the hit rate of most branches is pretty much 50%-50%. This is also true in this case, at least if SMAP is in use (if it is not in use, we'll have AC=0 most of the time). I don't want to spoil the fun, but I don't want to waste your time either so I rot13'ed my solution and placed it after the signature. ;)

As to nested virtualization, I reread the code and it should already work, because it sets
Re: [PATCH 2/4] KVM: Add SMAP support when setting CR4
On 31/03/2014 08:16, Wu, Feng wrote:

	/*
	 * If CPL < 3, SMAP protections are disabled if EFLAGS.AC = 1.
	 *
	 * If CPL = 3, SMAP applies to all supervisor-mode data accesses
	 * (these are implicit supervisor accesses) regardless of the value
	 * of EFLAGS.AC.
	 *
	 * This computes (cpl < 3) && (rflags & X86_EFLAGS_AC), leaving
	 * the result in X86_EFLAGS_AC. We then insert it in place of
	 * the PFERR_RSVD_MASK bit; this bit will always be zero in pfec,
	 * but it will be one in index if SMAP checks are being overridden.
	 * It is important to keep this branchless.
	 */
	smap = (cpl - 3) & (rflags & X86_EFLAGS_AC);
	index = (pfec >> 1) + (smap >> (X86_EFLAGS_AC_BIT - PFERR_RSVD_BIT + 1));
	return (mmu->permissions[index] >> pte_access) & 1;

The direction of PFERR_RSVD_MASK is the opposite compared to your code.

Correct.

I am a little confused about some points of the above code:

1. smap = (cpl - 3) & (rflags & X86_EFLAGS_AC);
smap equals 1 when it is overridden and it is 0 when being enforced.

Actually, smap equals X86_EFLAGS_AC when it is overridden. Perhaps this is the source of the confusion. Note that I'm using &, not &&.

So index will be (pfec >> 1) when SMAP is enforced, but in my understanding of this case, we should use the index with the PFERR_RSVD_MASK bit being 1 in mmu->permissions[] to check the fault.

2. smap >> (X86_EFLAGS_AC_BIT - PFERR_RSVD_BIT + 1)
I do not quite understand this line. BTW, I cannot find the definition of PFERR_RSVD_BIT. Do you mean PFERR_RSVD_BIT equals 3?

Yes. You can similarly add PFERR_PRESENT_BIT (equal to 0) etc.

I think the basic idea is using group 0 to check permission faults when !((cpl - 3) & (rflags & X86_EFLAGS_AC)), that is, SMAP is overridden, while using group 1 to check faults when (cpl - 3) & (rflags & X86_EFLAGS_AC), that is, SMAP is enforced.
Here is the code based on your proposal, in my understanding:

	smap = !((cpl - 3) & (rflags & X86_EFLAGS_AC));
	index = (pfec >> 1) + (smap << (PFERR_RSVD_BIT - 1)); /* assuming PFERR_RSVD_BIT == 3 */
	return (mmu->permissions[index] >> pte_access) & 1;

Could you please have a look at it? Appreciate your help! :)

It is faster if you avoid the ! and shift right from the AC bit into position PFERR_RSVD_BIT - 1. In update_permission_bitmask you can invert the direction of the bit when you extract it from pfec.

Paolo -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [PATCH 2/4] KVM: Add SMAP support when setting CR4
-----Original Message----- From: Paolo Bonzini [mailto:paolo.bonz...@gmail.com] On Behalf Of Paolo Bonzini Sent: Monday, March 31, 2014 3:29 PM To: Wu, Feng; g...@redhat.com; h...@zytor.com; kvm@vger.kernel.org Subject: Re: [PATCH 2/4] KVM: Add SMAP support when setting CR4

On 31/03/2014 08:16, Wu, Feng wrote:

	/*
	 * If CPL < 3, SMAP protections are disabled if EFLAGS.AC = 1.
	 *
	 * If CPL = 3, SMAP applies to all supervisor-mode data accesses
	 * (these are implicit supervisor accesses) regardless of the value
	 * of EFLAGS.AC.
	 *
	 * This computes (cpl < 3) && (rflags & X86_EFLAGS_AC), leaving
	 * the result in X86_EFLAGS_AC. We then insert it in place of
	 * the PFERR_RSVD_MASK bit; this bit will always be zero in pfec,
	 * but it will be one in index if SMAP checks are being overridden.
	 * It is important to keep this branchless.
	 */
	smap = (cpl - 3) & (rflags & X86_EFLAGS_AC);
	index = (pfec >> 1) + (smap >> (X86_EFLAGS_AC_BIT - PFERR_RSVD_BIT + 1));
	return (mmu->permissions[index] >> pte_access) & 1;

The direction of PFERR_RSVD_MASK is the opposite compared to your code.

Correct.

I am a little confused about some points of the above code:

1. smap = (cpl - 3) & (rflags & X86_EFLAGS_AC);
smap equals 1 when it is overridden and it is 0 when being enforced.

Actually, smap equals X86_EFLAGS_AC when it is overridden. Perhaps this is the source of the confusion. Note that I'm using &, not &&.

Thanks for your explanation, you are right. I didn't notice the single & just now, I was thinking it was a &&; sorry for making the confusion.

So index will be (pfec >> 1) when SMAP is enforced, but in my understanding of this case, we should use the index with the PFERR_RSVD_MASK bit being 1 in mmu->permissions[] to check the fault.

2. smap >> (X86_EFLAGS_AC_BIT - PFERR_RSVD_BIT + 1)
I do not quite understand this line. BTW, I cannot find the definition of PFERR_RSVD_BIT. Do you mean PFERR_RSVD_BIT equals 3?

Yes. You can similarly add PFERR_PRESENT_BIT (equal to 0) etc.
I think the basic idea is using group 0 to check permission faults when !((cpl - 3) & (rflags & X86_EFLAGS_AC)), that is, SMAP is overridden, while using group 1 to check faults when (cpl - 3) & (rflags & X86_EFLAGS_AC), that is, SMAP is enforced.

Here is the code based on your proposal, in my understanding:

	smap = !((cpl - 3) & (rflags & X86_EFLAGS_AC));
	index = (pfec >> 1) + (smap << (PFERR_RSVD_BIT - 1)); /* assuming PFERR_RSVD_BIT == 3 */
	return (mmu->permissions[index] >> pte_access) & 1;

Could you please have a look at it? Appreciate your help! :)

It is faster if you avoid the ! and shift right from the AC bit into position PFERR_RSVD_BIT - 1. In update_permission_bitmask you can invert the direction of the bit when you extract it from pfec.

So in that case, we should set smapf in update_permission_bitmask() this way, right?

	smapf = !(pfec & PFERR_RSVD_MASK);

Paolo
[GIT PULL] KVM changes for 3.15 merge window
Linus, The following changes since commit 6d0abeca3242a88cab8232e4acd7e2bf088f3bc2: Linux 3.14-rc3 (2014-02-16 13:30:25 -0800) are available in the git repository at: git://git.kernel.org/pub/scm/virt/kvm/kvm.git tags/kvm-3.15-1 for you to fetch changes up to 7227fc006b0df2c0d2966a7f4859b01bdf74: Merge branch 'kvm-ppchv-next' of git://git.kernel.org/pub/scm/linux/kernel/git/paulus/powerpc into kvm-next (2014-03-29 15:44:05 +0100) No other changes expected for this merge window. Paolo PPC and ARM do not have much going on this time. Most of the cool stuff, instead, is in s390 and (after a few releases) x86. ARM has some caching fixes and PPC has transactional memory support in guests. MIPS has some fixes, with more probably coming in 3.16 as QEMU will soon get support for MIPS KVM. For x86 there are optimizations for debug registers, which trigger on some Windows games, and other important fixes for Windows guests. We now expose to the guest Broadwell instruction set extensions and also Intel MPX. There's also a fix/workaround for OS X guests, nested virtualization features (preemption timer), and a couple kvmclock refinements. For s390, the main news is asynchronous page faults, together with improvements to IRQs (floating irqs and adapter irqs) that speed up virtio devices. Andrew Honig (1): kvm: x86: fix emulator buffer overflow (CVE-2014-0049) Andrew Jones (2): x86: kvm: rate-limit global clock updates x86: kvm: introduce periodic global clock updates Anton Blanchard (1): KVM: PPC: Book3S HV: Fix KVM hang with CONFIG_KVM_XICS=n Christian Borntraeger (4): KVM: s390: Provide access to program parameter KVM: s390: expose gbea register to userspace KVM: s390: Optimize ucontrol path KVM: s390: randomize sca address Christoffer Dall (1): KVM: Specify byte order for KVM_EXIT_MMIO Cornelia Huck (6): virtio-ccw: virtio-ccw adapter interrupt support. KVM: eventfd: Fix lock order inversion. KVM: Add per-vm capability enablement. 
KVM: s390: adapter interrupt sources KVM: s390: irq routing for adapter interrupts. KVM: Bump KVM_MAX_IRQ_ROUTES for s390 Dominik Dingel (7): KVM: s390: Add FAULT_FLAG_RETRY_NOWAIT for guest fault KVM: async_pf: Provide additional direct page notification KVM: async_pf: Allow to wait for outstanding work KVM: async_pf: Async page fault support on s390 KVM: async_pf: Exploit one reg interface for pfault KVM: async_pf: Add missing call for async page present KVM: s390: Removing untriggerable BUG_ONs Fernando Luis Vázquez Cao (1): kvm: remove redundant registration of BSP's hv_clock area Gabriel L. Somlo (1): kvm: x86: ignore ioapic polarity Greg Kurz (1): KVM: PPC: Book3S HV: Fix incorrect userspace exit on ioeventfd write Heinz Graalfs (2): virtio_ccw: fix vcdev pointer handling issues virtio_ccw: fix hang in set offline processing Igor Mammedov (2): KVM: x86 emulator: emulate MOVAPS KVM: x86 emulator: emulate MOVAPD James Hogan (4): MIPS: KVM: asm/kvm_host.h: Clean up whitespace MIPS: KVM: Pass reserved instruction exceptions to guest MIPS: KVM: Consult HWREna before emulating RDHWR MIPS: KVM: Remove dead code in CP0 emulation Jan Kiszka (4): KVM: nVMX: Rework interception of IRQs and NMIs KVM: nVMX: Fully emulate preemption timer KVM: nVMX: Do not inject NMI vmexits when L2 has a pending interrupt KVM: x86: Remove return code from enable_irq/nmi_window Jens Freimann (7): KVM: s390: add and extend interrupt information data structs KVM: s390: add floating irq controller KVM: s390: limit floating irqs KVM: s390: Simplify online vcpus counting for stsi KVM: s390: get rid of local_int array KVM: s390: fix calculation of idle_mask array size KVM: s390: clear local interrupts at cpu initial reset Laurent Dufour (1): KVM: PPC: Book3S: Introduce hypervisor call H_GET_TCE Liu, Jinsong (6): KVM: x86: expose new instruction RDSEED to guest KVM: x86: expose ADX feature to guest KVM: x86: Fix xsave cpuid exposing bug KVM: x86: Intel MPX vmx and msr handle KVM: x86: add 
MSR_IA32_BNDCFGS to msrs_to_save KVM: x86: Enable Intel MPX for guest Marc Zyngier (13): arm/arm64: KVM: detect CPU reset on CPU_PM_EXIT arm64: KVM: force cache clean on page fault when caches are off arm64: KVM: allows discrimination of AArch32 sysreg access arm64: KVM: trap VM system registers until MMU and caches are ON ARM: KVM: introduce kvm_p*d_addr_end arm64: KVM: flush VM pages before letting the guest enable caches ARM: KVM: force cache clean on page fault when caches are off ARM: KVM: fix handling of
Re: [PATCH 2/4] KVM: Add SMAP support when setting CR4
On 31/03/2014 10:06, Wu, Feng wrote:

It is faster if you avoid the ! and shift right from the AC bit into position PFERR_RSVD_BIT - 1. In update_permission_bitmask you can invert the direction of the bit when you extract it from pfec.

So in that case, we should set smapf in update_permission_bitmask() this way, right?

	smapf = !(pfec & PFERR_RSVD_MASK);

Yep.

Paolo
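The branchless index computation the thread converges on can be checked in isolation. The sketch below is illustrative only — a stand-alone user-space model with the relevant bit positions hard-coded, not the actual KVM code; `perm_index` stands in for the index calculation inside permission_fault():

```c
#include <assert.h>

/* Bit positions as discussed in the thread (PFERR_RSVD_BIT == 3). */
#define X86_EFLAGS_AC_BIT 18
#define X86_EFLAGS_AC     (1UL << X86_EFLAGS_AC_BIT)
#define PFERR_RSVD_BIT    3

/*
 * Branchless selection of the permission-bitmap index: the shifted
 * RSVD slot of the index is 1 exactly when SMAP checks are being
 * overridden (cpl < 3 and EFLAGS.AC set).
 */
static unsigned perm_index(unsigned pfec, int cpl, unsigned long rflags)
{
	/*
	 * cpl - 3 is negative (all high bits set after the unsigned
	 * conversion) iff cpl < 3, so the AND keeps the AC bit only
	 * in that case.
	 */
	unsigned long smap = (unsigned long)(long)(cpl - 3) & (rflags & X86_EFLAGS_AC);

	/* Move the AC bit down into the (pfec >> 1)-relative RSVD position. */
	return (pfec >> 1) + (unsigned)(smap >> (X86_EFLAGS_AC_BIT - PFERR_RSVD_BIT + 1));
}
```

Compiled stand-alone, the function yields the plain `pfec >> 1` index whenever SMAP is enforced (CPL 3, or AC clear), and adds the shifted RSVD bit only when SMAP is overridden.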
Re: [GIT PULL] KVM changes for 3.15 merge window
Il 31/03/2014 10:30, Paolo Bonzini ha scritto: Linus, The following changes since commit 6d0abeca3242a88cab8232e4acd7e2bf088f3bc2: Linux 3.14-rc3 (2014-02-16 13:30:25 -0800) are available in the git repository at: git://git.kernel.org/pub/scm/virt/kvm/kvm.git tags/kvm-3.15-1 Also at tags/for-linus, but request-pull chose this other tag for some reason. Paolo for you to fetch changes up to 7227fc006b0df2c0d2966a7f4859b01bdf74: Merge branch 'kvm-ppchv-next' of git://git.kernel.org/pub/scm/linux/kernel/git/paulus/powerpc into kvm-next (2014-03-29 15:44:05 +0100) No other changes expected for this merge window. Paolo PPC and ARM do not have much going on this time. Most of the cool stuff, instead, is in s390 and (after a few releases) x86. ARM has some caching fixes and PPC has transactional memory support in guests. MIPS has some fixes, with more probably coming in 3.16 as QEMU will soon get support for MIPS KVM. For x86 there are optimizations for debug registers, which trigger on some Windows games, and other important fixes for Windows guests. We now expose to the guest Broadwell instruction set extensions and also Intel MPX. There's also a fix/workaround for OS X guests, nested virtualization features (preemption timer), and a couple kvmclock refinements. For s390, the main news is asynchronous page faults, together with improvements to IRQs (floating irqs and adapter irqs) that speed up virtio devices. 
Andrew Honig (1): kvm: x86: fix emulator buffer overflow (CVE-2014-0049) Andrew Jones (2): x86: kvm: rate-limit global clock updates x86: kvm: introduce periodic global clock updates Anton Blanchard (1): KVM: PPC: Book3S HV: Fix KVM hang with CONFIG_KVM_XICS=n Christian Borntraeger (4): KVM: s390: Provide access to program parameter KVM: s390: expose gbea register to userspace KVM: s390: Optimize ucontrol path KVM: s390: randomize sca address Christoffer Dall (1): KVM: Specify byte order for KVM_EXIT_MMIO Cornelia Huck (6): virtio-ccw: virtio-ccw adapter interrupt support. KVM: eventfd: Fix lock order inversion. KVM: Add per-vm capability enablement. KVM: s390: adapter interrupt sources KVM: s390: irq routing for adapter interrupts. KVM: Bump KVM_MAX_IRQ_ROUTES for s390 Dominik Dingel (7): KVM: s390: Add FAULT_FLAG_RETRY_NOWAIT for guest fault KVM: async_pf: Provide additional direct page notification KVM: async_pf: Allow to wait for outstanding work KVM: async_pf: Async page fault support on s390 KVM: async_pf: Exploit one reg interface for pfault KVM: async_pf: Add missing call for async page present KVM: s390: Removing untriggerable BUG_ONs Fernando Luis Vázquez Cao (1): kvm: remove redundant registration of BSP's hv_clock area Gabriel L. 
Somlo (1): kvm: x86: ignore ioapic polarity Greg Kurz (1): KVM: PPC: Book3S HV: Fix incorrect userspace exit on ioeventfd write Heinz Graalfs (2): virtio_ccw: fix vcdev pointer handling issues virtio_ccw: fix hang in set offline processing Igor Mammedov (2): KVM: x86 emulator: emulate MOVAPS KVM: x86 emulator: emulate MOVAPD James Hogan (4): MIPS: KVM: asm/kvm_host.h: Clean up whitespace MIPS: KVM: Pass reserved instruction exceptions to guest MIPS: KVM: Consult HWREna before emulating RDHWR MIPS: KVM: Remove dead code in CP0 emulation Jan Kiszka (4): KVM: nVMX: Rework interception of IRQs and NMIs KVM: nVMX: Fully emulate preemption timer KVM: nVMX: Do not inject NMI vmexits when L2 has a pending interrupt KVM: x86: Remove return code from enable_irq/nmi_window Jens Freimann (7): KVM: s390: add and extend interrupt information data structs KVM: s390: add floating irq controller KVM: s390: limit floating irqs KVM: s390: Simplify online vcpus counting for stsi KVM: s390: get rid of local_int array KVM: s390: fix calculation of idle_mask array size KVM: s390: clear local interrupts at cpu initial reset Laurent Dufour (1): KVM: PPC: Book3S: Introduce hypervisor call H_GET_TCE Liu, Jinsong (6): KVM: x86: expose new instruction RDSEED to guest KVM: x86: expose ADX feature to guest KVM: x86: Fix xsave cpuid exposing bug KVM: x86: Intel MPX vmx and msr handle KVM: x86: add MSR_IA32_BNDCFGS to msrs_to_save KVM: x86: Enable Intel MPX for guest Marc Zyngier (13): arm/arm64: KVM: detect CPU reset on CPU_PM_EXIT arm64: KVM: force cache clean on page fault when caches are off arm64: KVM: allows discrimination of AArch32 sysreg access arm64: KVM: trap VM system registers until MMU and caches are ON ARM: KVM: introduce kvm_p*d_addr_end arm64: KVM: flush VM pages
[PATCH v3 0/4] KVM: enable Intel SMAP for KVM
Supervisor Mode Access Prevention (SMAP) is a new security feature disclosed by Intel; please refer to the following document: http://software.intel.com/sites/default/files/319433-014.pdf

Every access to a linear address is either a supervisor-mode access or a user-mode access. All accesses performed while the current privilege level (CPL) is less than 3 are supervisor-mode accesses. If CPL = 3, accesses are generally user-mode accesses. However, some operations implicitly access system data structures, and the resulting accesses to those data structures are supervisor-mode accesses regardless of CPL. Examples of such implicit supervisor accesses include the following: accesses to the global descriptor table (GDT) or local descriptor table (LDT) to load a segment descriptor; accesses to the interrupt descriptor table (IDT) when delivering an interrupt or exception; and accesses to the task-state segment (TSS) as part of a task switch or change of CPL.

If CR4.SMAP = 1, supervisor-mode data accesses are not allowed to linear addresses that are accessible in user mode. If CPL < 3, SMAP protections are disabled if EFLAGS.AC = 1. If CPL = 3, SMAP applies to all supervisor-mode data accesses (these are implicit supervisor accesses) regardless of the value of EFLAGS.AC.

This patchset passes the SMAP feature through to guests, letting them benefit from it.

Version 1:
* Remove SMAP bit from CR4_RESERVED_BITS.
* Add SMAP support when setting CR4
* Disable SMAP for guests in EPT realmode and EPT unpaging mode
* Expose SMAP feature to guest

Version 2:
* Change the logic of updating mmu permission bitmap for SMAP violation
* Expose SMAP feature to guest in the last patch of this series.

Version 3:
* Changes in update_permission_bitmask().
* Use a branchless way suggested by Paolo Bonzini to detect SMAP violation in permission_fault().

Feng Wu (4):
  KVM: Remove SMAP bit from CR4_RESERVED_BITS.
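The SMAP rules above can be summarized as a small predicate. This is an illustrative user-space model only, covering explicit data accesses (implicit supervisor accesses, which enforce SMAP regardless of EFLAGS.AC, are not modeled); it is not part of the patchset:

```c
#include <stdbool.h>

/*
 * Model of when an explicit supervisor-mode data access to a linear
 * address is a SMAP violation, per the rules quoted above.
 */
static bool smap_violation(bool cr4_smap, bool user_accessible_page,
			   int cpl, bool eflags_ac)
{
	if (!cr4_smap || !user_accessible_page)
		return false;	/* SMAP only guards user-accessible pages */
	if (cpl == 3)
		return false;	/* CPL 3 data accesses are user-mode accesses */
	/* CPL < 3: EFLAGS.AC = 1 overrides SMAP protections. */
	return !eflags_ac;
}
```

For example, a kernel (CPL 0) read of a user page faults with AC clear but not with AC set, and nothing faults when CR4.SMAP is off.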
  KVM: Add SMAP support when setting CR4
  KVM: Disable SMAP for guests in EPT realmode and EPT unpaging mode
  KVM: expose SMAP feature to guest

 arch/x86/include/asm/kvm_host.h | 2 +-
 arch/x86/kvm/cpuid.c | 2 +-
 arch/x86/kvm/cpuid.h | 8
 arch/x86/kvm/mmu.c | 35 +---
 arch/x86/kvm/mmu.h | 44 +
 arch/x86/kvm/paging_tmpl.h | 2 +-
 arch/x86/kvm/vmx.c | 11 ++-
 arch/x86/kvm/x86.c | 9 -
 8 files changed, 93 insertions(+), 20 deletions(-)

-- 
1.8.3.1
[PATCH v3 1/4] KVM: Remove SMAP bit from CR4_RESERVED_BITS.
This patch removes SMAP bit from CR4_RESERVED_BITS.

Signed-off-by: Feng Wu feng...@intel.com
---
 arch/x86/include/asm/kvm_host.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index fdf83af..4eeb049 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -60,7 +60,7 @@
 	| X86_CR4_PSE | X86_CR4_PAE | X86_CR4_MCE \
 	| X86_CR4_PGE | X86_CR4_PCE | X86_CR4_OSFXSR | X86_CR4_PCIDE \
 	| X86_CR4_OSXSAVE | X86_CR4_SMEP | X86_CR4_FSGSBASE \
-	| X86_CR4_OSXMMEXCPT | X86_CR4_VMXE))
+	| X86_CR4_OSXMMEXCPT | X86_CR4_VMXE | X86_CR4_SMAP))

 #define CR8_RESERVED_BITS (~(unsigned long)X86_CR8_TPR)
-- 
1.8.3.1
[PATCH v3 3/4] KVM: Disable SMAP for guests in EPT realmode and EPT unpaging mode
SMAP is disabled if the CPU is in non-paging mode in hardware. However, KVM always uses paging mode to emulate guest non-paging mode with TDP. To emulate this behavior, SMAP needs to be manually disabled when the guest switches to non-paging mode.

Signed-off-by: Feng Wu feng...@intel.com
---
 arch/x86/kvm/vmx.c | 11 ++-
 1 file changed, 6 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 3927528..e58cb5f 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -3452,13 +3452,14 @@ static int vmx_set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4)
 			hw_cr4 &= ~X86_CR4_PAE;
 			hw_cr4 |= X86_CR4_PSE;
 			/*
-			 * SMEP is disabled if CPU is in non-paging mode in
-			 * hardware. However KVM always uses paging mode to
+			 * SMEP/SMAP is disabled if CPU is in non-paging mode
+			 * in hardware. However KVM always uses paging mode to
 			 * emulate guest non-paging mode with TDP.
-			 * To emulate this behavior, SMEP needs to be manually
-			 * disabled when guest switches to non-paging mode.
+			 * To emulate this behavior, SMEP/SMAP needs to be
+			 * manually disabled when guest switches to non-paging
+			 * mode.
 			 */
-			hw_cr4 &= ~X86_CR4_SMEP;
+			hw_cr4 &= ~(X86_CR4_SMEP | X86_CR4_SMAP);
 		} else if (!(cr4 & X86_CR4_PAE)) {
 			hw_cr4 &= ~X86_CR4_PAE;
 		}
-- 
1.8.3.1
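The guest-CR4-to-hardware-CR4 adjustment this patch changes can be sketched as a stand-alone function. This is a user-space model, not the actual vmx_set_cr4(); the `guest_paging` parameter stands in for KVM's check of whether the guest has paging enabled, and the bit positions follow the SDM:

```c
#include <stdbool.h>

/* CR4 bit positions per the Intel SDM. */
#define X86_CR4_PSE  (1UL << 4)
#define X86_CR4_PAE  (1UL << 5)
#define X86_CR4_SMEP (1UL << 20)
#define X86_CR4_SMAP (1UL << 21)

/*
 * Model of the hardware-CR4 fixup for the EPT-without-guest-paging
 * case described in the patch.
 */
static unsigned long adjust_hw_cr4(unsigned long cr4, bool guest_paging)
{
	unsigned long hw_cr4 = cr4;

	if (!guest_paging) {
		hw_cr4 &= ~X86_CR4_PAE;
		hw_cr4 |= X86_CR4_PSE;
		/*
		 * SMEP/SMAP have no effect in non-paging mode; clear them
		 * so the always-paging TDP emulation matches that behavior.
		 */
		hw_cr4 &= ~(X86_CR4_SMEP | X86_CR4_SMAP);
	}
	return hw_cr4;
}
```

With paging enabled the guest value passes through untouched; without it, SMEP and SMAP are masked out of what is loaded into hardware.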
[PATCH v3 4/4] KVM: expose SMAP feature to guest
This patch exposes SMAP feature to guest.

Signed-off-by: Feng Wu feng...@intel.com
---
 arch/x86/kvm/cpuid.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index c697625..deb5f9b 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -303,7 +303,7 @@ static inline int __do_cpuid_ent(struct kvm_cpuid_entry2 *entry, u32 function,
 	/* cpuid 7.0.ebx */
 	const u32 kvm_supported_word9_x86_features =
 		F(FSGSBASE) | F(BMI1) | F(HLE) | F(AVX2) | F(SMEP) |
-		F(BMI2) | F(ERMS) | f_invpcid | F(RTM);
+		F(BMI2) | F(ERMS) | f_invpcid | F(RTM) | F(SMAP);

 	/* all calls to cpuid_count() should be made on the same cpu */
 	get_cpu();
-- 
1.8.3.1
[PATCH v3 2/4] KVM: Add SMAP support when setting CR4
This patch adds SMAP handling logic when setting CR4 for guests. Thanks a lot to Paolo Bonzini for his suggestion to use the branchless way to detect SMAP violation.

Signed-off-by: Feng Wu feng...@intel.com
---
 arch/x86/kvm/cpuid.h | 8
 arch/x86/kvm/mmu.c | 35 ---
 arch/x86/kvm/mmu.h | 44
 arch/x86/kvm/paging_tmpl.h | 2 +-
 arch/x86/kvm/x86.c | 9 -
 5 files changed, 85 insertions(+), 13 deletions(-)

diff --git a/arch/x86/kvm/cpuid.h b/arch/x86/kvm/cpuid.h
index a2a1bb7..eeecbed 100644
--- a/arch/x86/kvm/cpuid.h
+++ b/arch/x86/kvm/cpuid.h
@@ -48,6 +48,14 @@ static inline bool guest_cpuid_has_smep(struct kvm_vcpu *vcpu)
 	return best && (best->ebx & bit(X86_FEATURE_SMEP));
 }

+static inline bool guest_cpuid_has_smap(struct kvm_vcpu *vcpu)
+{
+	struct kvm_cpuid_entry2 *best;
+
+	best = kvm_find_cpuid_entry(vcpu, 7, 0);
+	return best && (best->ebx & bit(X86_FEATURE_SMAP));
+}
+
 static inline bool guest_cpuid_has_fsgsbase(struct kvm_vcpu *vcpu)
 {
 	struct kvm_cpuid_entry2 *best;

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 9b53135..5a1ed38 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -3601,20 +3601,28 @@ static void reset_rsvds_bits_mask_ept(struct kvm_vcpu *vcpu,
 	}
 }

-static void update_permission_bitmask(struct kvm_vcpu *vcpu,
+void update_permission_bitmask(struct kvm_vcpu *vcpu,
 		struct kvm_mmu *mmu, bool ept)
 {
 	unsigned bit, byte, pfec;
 	u8 map;
-	bool fault, x, w, u, wf, uf, ff, smep;
+	bool fault, x, w, u, wf, uf, ff, smapf, cr4_smap, smep, smap = 0;

 	smep = kvm_read_cr4_bits(vcpu, X86_CR4_SMEP);
+	cr4_smap = kvm_read_cr4_bits(vcpu, X86_CR4_SMAP);
 	for (byte = 0; byte < ARRAY_SIZE(mmu->permissions); ++byte) {
 		pfec = byte << 1;
 		map = 0;
 		wf = pfec & PFERR_WRITE_MASK;
 		uf = pfec & PFERR_USER_MASK;
 		ff = pfec & PFERR_FETCH_MASK;
+		/*
+		 * PFERR_RSVD_MASK bit is used to detect SMAP violation.
+		 * We will check it in permission_fault(); this bit is
+		 * set in pfec for a normal fault, while it is cleared for
+		 * SMAP violations.
+		 */
+		smapf = !(pfec & PFERR_RSVD_MASK);
 		for (bit = 0; bit < 8; ++bit) {
 			x = bit & ACC_EXEC_MASK;
 			w = bit & ACC_WRITE_MASK;
@@ -3627,11 +3635,32 @@ static void update_permission_bitmask(struct kvm_vcpu *vcpu,
 				w |= !is_write_protection(vcpu) && !uf;
 				/* Disallow supervisor fetches of user code if cr4.smep */
 				x &= !(smep && u && !uf);
+
+				/*
+				 * SMAP: kernel-mode data accesses from user-mode
+				 * mappings should fault. A fault is considered
+				 * as a SMAP violation if all of the following
+				 * conditions are true:
+				 *   - X86_CR4_SMAP is set in CR4
+				 *   - A user page is accessed
+				 *   - Page fault in kernel mode
+				 *   - !(CPL < 3 && X86_EFLAGS_AC is set)
+				 *
+				 * Here, we cover the first three conditions.
+				 * The CPL and X86_EFLAGS_AC is in smapf, which
+				 * permission_fault() computes dynamically.
+				 *
+				 * Also, SMAP does not affect instruction
+				 * fetches; add the !ff check here to make it
+				 * clearer.
+				 */
+				smap = cr4_smap && u && !uf && !ff;
 			} else
 				/* Not really needed: no U/S accesses on ept */
 				u = 1;

-			fault = (ff && !x) || (uf && !u) || (wf && !w);
+			fault = (ff && !x) || (uf && !u) || (wf && !w) ||
+				(smapf && smap);
 			map |= fault << bit;
 		}
 		mmu->permissions[byte] = map;

diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index 2926152..822190f 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -44,11 +44,17 @@
 #define PT_DIRECTORY_LEVEL 2
 #define PT_PAGE_TABLE_LEVEL 1

-#define PFERR_PRESENT_MASK (1U << 0)
-#define PFERR_WRITE_MASK (1U << 1)
-#define PFERR_USER_MASK (1U << 2)
-#define PFERR_RSVD_MASK (1U << 3)
-#define PFERR_FETCH_MASK (1U << 4)
+#define PFERR_PRESENT_BIT 0
+#define PFERR_WRITE_BIT 1
+#define PFERR_USER_BIT 2
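The final permission-bitmap scheme can be modeled in isolation. The sketch below keeps only the U/S and SMAP logic of the inner loop above (the write-protection and SMEP terms are omitted for brevity, and this is a user-space model, not the kernel function): the two halves of the table, selected by the PFERR_RSVD_MASK bit of pfec, differ only in whether SMAP faults are folded in.

```c
#include <stdbool.h>

/* Access-bit and PFEC encodings as used in the patch. */
#define ACC_EXEC_MASK    1
#define ACC_WRITE_MASK   2
#define ACC_USER_MASK    4

#define PFERR_WRITE_MASK (1U << 1)
#define PFERR_USER_MASK  (1U << 2)
#define PFERR_RSVD_MASK  (1U << 3)
#define PFERR_FETCH_MASK (1U << 4)

/* One table entry of the simplified permission bitmap for a given pfec. */
static unsigned char permission_byte(unsigned pfec, bool cr4_smap)
{
	unsigned char map = 0;
	bool wf = pfec & PFERR_WRITE_MASK;
	bool uf = pfec & PFERR_USER_MASK;
	bool ff = pfec & PFERR_FETCH_MASK;
	/* RSVD clear in pfec selects the SMAP-enforcing half of the table. */
	bool smapf = !(pfec & PFERR_RSVD_MASK);

	for (int bit = 0; bit < 8; ++bit) {
		bool x = bit & ACC_EXEC_MASK;
		bool w = bit & ACC_WRITE_MASK;
		bool u = bit & ACC_USER_MASK;
		/* Supervisor data access to a user page under CR4.SMAP. */
		bool smap = cr4_smap && u && !uf && !ff;
		bool fault = (ff && !x) || (uf && !u) || (wf && !w) ||
			     (smapf && smap);
		map |= (unsigned char)fault << bit;
	}
	return map;
}
```

A kernel read (pfec with RSVD clear) faults exactly on user pages when CR4.SMAP is on; the same lookup with RSVD set (SMAP overridden) never reports a SMAP fault.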
KVM call agenda for 2014-04-01
Hi, Please send any topic that you are interested in covering. Thanks, Juan. Call details: 10:00 AM to 11:00 AM EDT, every two weeks. If you need phone number details, contact me privately.
Re: [Qemu-devel] KVM call agenda for 2014-04-01
Hi, On 31.03.2014 12:40, Juan Quintela wrote: Please, send any topic that you are interested in covering. I would like to discuss the state of the QEMU release process, please: * -rc1 has not been tagged. * Who besides Anthony could upload a tarball if we tag and create it? * make-release fix for SeaBIOS on the list. Ping, and are more affected? Thanks, Andreas -- SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg, Germany GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer; HRB 16746 AG Nürnberg
Re: [Qemu-devel] KVM call agenda for 2014-04-01
On Mon, Mar 31, 2014 at 12:51:31PM +0200, Andreas Färber wrote: On 31.03.2014 12:40, Juan Quintela wrote: Please, send any topic that you are interested in covering. I would like to discuss the state of the QEMU release process, please: * -rc1 has not been tagged. * Who besides Anthony could upload a tarball if we tag and create it? * make-release fix for SeaBIOS on the list. Ping, and are more affected? +1 Stefan
Re: [Qemu-devel] KVM call agenda for 2014-04-01
On 31 March 2014 14:21, Christian Borntraeger borntrae...@de.ibm.com wrote: Another thing might be the release process in general. Currently it seems that everybody tries to push everything just before the hard freeze. I had to debug some problems introduced _after_ soft freeze. Is there some interest in having a Linux-like process (merge window + stabilization)? This would require shorter release cycles of course. merge window has been suggested before. I think it would be a terrible idea for QEMU, personally. We're not the kernel in many ways, notably dev community size and a greater tendency to changes that have effects across the whole tree. Soft + hard freeze is our stabilization period currently. thanks -- PMM
Re: [Qemu-devel] KVM call agenda for 2014-04-01
On 31/03/14 12:51, Andreas Färber wrote: Hi, On 31.03.2014 12:40, Juan Quintela wrote: Please, send any topic that you are interested in covering. I would like to discuss the state of the QEMU release process, please: * -rc1 has not been tagged. * Who besides Anthony could upload a tarball if we tag and create it? * make-release fix for SeaBIOS on the list. Ping, and are more affected? +1 for this one. Another thing might be the release process in general. Currently it seems that everybody tries to push everything just before the hard freeze. I had to debug some problems introduced _after_ soft freeze. Is there some interest in having a Linux-like process (merge window + stabilization)? This would require shorter release cycles of course. Christian
Re: [PATCH v3 2/4] KVM: Add SMAP support when setting CR4
Just a few comments... -static void update_permission_bitmask(struct kvm_vcpu *vcpu, +void update_permission_bitmask(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu, bool ept) { unsigned bit, byte, pfec; u8 map; - bool fault, x, w, u, wf, uf, ff, smep; + bool fault, x, w, u, wf, uf, ff, smapf, cr4_smap, smep, smap = 0; smep = kvm_read_cr4_bits(vcpu, X86_CR4_SMEP); Can you make an additional patch to rename this to cr4_smep? + cr4_smap = kvm_read_cr4_bits(vcpu, X86_CR4_SMAP); for (byte = 0; byte < ARRAY_SIZE(mmu->permissions); ++byte) { pfec = byte << 1; map = 0; wf = pfec & PFERR_WRITE_MASK; uf = pfec & PFERR_USER_MASK; ff = pfec & PFERR_FETCH_MASK; + /* +* The PFERR_RSVD_MASK bit is used to detect SMAP violations. +* We check it in permission_fault(); the bit is +* set in pfec for a normal fault, while it is cleared for +* SMAP violations. +*/ This bit is set in PFEC if the access is _not_ subject to SMAP restrictions, and cleared otherwise. The bit is only meaningful if the SMAP bit is set in CR4. + smapf = !(pfec & PFERR_RSVD_MASK); for (bit = 0; bit < 8; ++bit) { x = bit & ACC_EXEC_MASK; w = bit & ACC_WRITE_MASK; @@ -3627,11 +3635,32 @@ static void update_permission_bitmask(struct kvm_vcpu *vcpu, w |= !is_write_protection(vcpu) && !uf; /* Disallow supervisor fetches of user code if cr4.smep */ x &= !(smep && u && !uf); + + /* +* SMAP: kernel-mode data accesses from user-mode +* mappings should fault. A fault is considered +* a SMAP violation if all of the following +* conditions are true: +* - X86_CR4_SMAP is set in CR4 +* - A user page is accessed +* - Page fault in kernel mode +* - !(CPL<3 && X86_EFLAGS_AC is set), i.e. CPL = 3 or EFLAGS.AC is clear +* Here, we cover the first three conditions. +* The CPL and X86_EFLAGS_AC are in smapf, which +* permission_fault() computes dynamically. The fourth is computed dynamically in permission_fault() and is in SMAPF. +* Also, SMAP does not affect instruction +* fetches; add the !ff check here to make it +* clearer.
+*/ + smap = cr4_smap && u && !uf && !ff; } else /* Not really needed: no U/S accesses on ept */ u = 1; - fault = (ff && !x) || (uf && !u) || (wf && !w); + fault = (ff && !x) || (uf && !u) || (wf && !w) || + (smapf && smap); map |= fault << bit; } mmu->permissions[byte] = map; diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h index 2926152..822190f 100644 --- a/arch/x86/kvm/mmu.h +++ b/arch/x86/kvm/mmu.h @@ -44,11 +44,17 @@ #define PT_DIRECTORY_LEVEL 2 #define PT_PAGE_TABLE_LEVEL 1 -#define PFERR_PRESENT_MASK (1U << 0) -#define PFERR_WRITE_MASK (1U << 1) -#define PFERR_USER_MASK (1U << 2) -#define PFERR_RSVD_MASK (1U << 3) -#define PFERR_FETCH_MASK (1U << 4) +#define PFERR_PRESENT_BIT 0 +#define PFERR_WRITE_BIT 1 +#define PFERR_USER_BIT 2 +#define PFERR_RSVD_BIT 3 +#define PFERR_FETCH_BIT 4 + +#define PFERR_PRESENT_MASK (1U << PFERR_PRESENT_BIT) +#define PFERR_WRITE_MASK (1U << PFERR_WRITE_BIT) +#define PFERR_USER_MASK (1U << PFERR_USER_BIT) +#define PFERR_RSVD_MASK (1U << PFERR_RSVD_BIT) +#define PFERR_FETCH_MASK (1U << PFERR_FETCH_BIT) int kvm_mmu_get_spte_hierarchy(struct kvm_vcpu *vcpu, u64 addr, u64 sptes[4]); void kvm_mmu_set_mmio_spte_mask(u64 mmio_mask); @@ -73,6 +79,8 @@ int handle_mmio_page_fault_common(struct kvm_vcpu *vcpu, u64 addr, bool direct); void kvm_init_shadow_mmu(struct kvm_vcpu *vcpu, struct kvm_mmu *context); void kvm_init_shadow_ept_mmu(struct kvm_vcpu *vcpu, struct kvm_mmu *context, bool execonly); +void update_permission_bitmask(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu, + bool ept); static inline unsigned int kvm_mmu_available_pages(struct kvm *kvm) { @@ -110,10 +118,30 @@ static inline bool is_write_protection(struct kvm_vcpu *vcpu) * Will a fault with a given page-fault error code (pfec) cause a permission * fault with the given access (in
Re: [PATCH 3/4] KVM: PPC: Alow kvmppc_get_last_inst() to fail
On 03/26/2014 09:52 PM, Scott Wood wrote: On Thu, 2014-02-20 at 18:30 +0200, Mihai Caraman wrote: diff --git a/arch/powerpc/kvm/book3s_paired_singles.c b/arch/powerpc/kvm/book3s_paired_singles.c index a59a25a..80c533e 100644 --- a/arch/powerpc/kvm/book3s_paired_singles.c +++ b/arch/powerpc/kvm/book3s_paired_singles.c @@ -640,19 +640,24 @@ static int kvmppc_ps_one_in(struct kvm_vcpu *vcpu, bool rc, int kvmppc_emulate_paired_single(struct kvm_run *run, struct kvm_vcpu *vcpu) { - u32 inst = kvmppc_get_last_inst(vcpu); + u32 inst; enum emulation_result emulated = EMULATE_DONE; - - int ax_rd = inst_get_field(inst, 6, 10); - int ax_ra = inst_get_field(inst, 11, 15); - int ax_rb = inst_get_field(inst, 16, 20); - int ax_rc = inst_get_field(inst, 21, 25); - short full_d = inst_get_field(inst, 16, 31); - - u64 *fpr_d = &vcpu->arch.fpr[ax_rd]; - u64 *fpr_a = &vcpu->arch.fpr[ax_ra]; - u64 *fpr_b = &vcpu->arch.fpr[ax_rb]; - u64 *fpr_c = &vcpu->arch.fpr[ax_rc]; + int ax_rd, ax_ra, ax_rb, ax_rc; + short full_d; + u64 *fpr_d, *fpr_a, *fpr_b, *fpr_c; + + kvmppc_get_last_inst(vcpu, &inst); Should probably check for failure here and elsewhere -- even though it can't currently fail on book3s, the interface now allows it. diff --git a/arch/powerpc/kvm/book3s_pr.c b/arch/powerpc/kvm/book3s_pr.c index 5b9e906..b0d884d 100644 --- a/arch/powerpc/kvm/book3s_pr.c +++ b/arch/powerpc/kvm/book3s_pr.c @@ -624,9 +624,10 @@ void kvmppc_giveup_ext(struct kvm_vcpu *vcpu, ulong msr) static int kvmppc_read_inst(struct kvm_vcpu *vcpu) { ulong srr0 = kvmppc_get_pc(vcpu); - u32 last_inst = kvmppc_get_last_inst(vcpu); + u32 last_inst; int ret; + kvmppc_get_last_inst(vcpu, &last_inst); ret = kvmppc_ld(vcpu, &srr0, sizeof(u32), &last_inst, false); This isn't new, but this function looks odd to me -- calling kvmppc_get_last_inst() but ignoring last_inst, then calling kvmppc_ld() and ignoring anything but failure. last_inst itself is never read. And no comments to explain the weirdness.
:-) I get that kvmppc_get_last_inst() is probably being used for the side effect of filling in vcpu->arch.last_inst, but why store the return value without using it? Why pass the address of it to kvmppc_ld(), which seems to be used only as an indirect way of determining whether kvmppc_get_last_inst() failed? And that whole mechanism becomes stranger once it becomes possible for kvmppc_get_last_inst() to directly return failure. If you're interested in the history of this, here's the patch :) https://github.com/mirrors/linux-2.6/commit/c7f38f46f2a98d232147e47284cb4e7363296a3e The idea is that we need 2 things to be good after this function: 1) vcpu->arch.last_inst is valid 2) if the last instruction is not readable, return failure Hence this weird construct. I don't think it's really necessary though - just remove the kvmppc_ld() call and only fail read_inst() when the caching didn't work and we can't translate the address. Alex
Re: [PATCH 4/4] KVM: PPC: Bookehv: Get vcpu's last instruction for emulation
On 03/26/2014 10:17 PM, Scott Wood wrote: On Thu, 2014-02-20 at 18:30 +0200, Mihai Caraman wrote: Load external pid (lwepx) instruction faults (when called from KVM with guest context) need to be handled by KVM. This implies additional code in the DO_KVM macro to identify the source of the exception (which originates from the KVM host rather than the guest). The hook requires checking the Exception Syndrome Register ESR[EPID] and External PID Load Context Register EPLC[EGS] for some exceptions (DTLB_MISS, DSI and LRAT). Doing this on the Data TLB miss exception is obviously intrusive for the host. Get rid of lwepx and acquire the last instruction in kvmppc_get_last_inst() by searching for the physical address and kmapping it. This fixes an infinite loop caused by lwepx's data TLB miss handled in the host and the TODO for TLB eviction and execute-but-not-read entries. Signed-off-by: Mihai Caraman mihai.cara...@freescale.com --- arch/powerpc/kvm/bookehv_interrupts.S | 37 +++-- arch/powerpc/kvm/e500_mmu_host.c | 93 + 2 files changed, 102 insertions(+), 28 deletions(-) diff --git a/arch/powerpc/kvm/bookehv_interrupts.S b/arch/powerpc/kvm/bookehv_interrupts.S index 20c7a54..c50490c 100644 --- a/arch/powerpc/kvm/bookehv_interrupts.S +++ b/arch/powerpc/kvm/bookehv_interrupts.S @@ -119,38 +119,14 @@ 1: .if \flags & NEED_EMU - /* -* This assumes you have external PID support. -* To support a bookehv CPU without external PID, you'll -* need to look up the TLB entry and create a temporary mapping. -* -* FIXME: we don't currently handle if the lwepx faults. PR-mode -* booke doesn't handle it either. Since Linux doesn't use -* broadcast tlbivax anymore, the only way this should happen is -* if the guest maps its memory execute-but-not-read, or if we -* somehow take a TLB miss in the middle of this entry code and -* evict the relevant entry. On e500mc, all kernel lowmem is -* bolted into TLB1 large page mappings, and we don't use -* broadcast invalidates, so we should not take a TLB miss here.
-* -* Later we'll need to deal with faults here. Disallowing guest -* mappings that are execute-but-not-read could be an option on -* e500mc, but not on chips with an LRAT if it is used. -*/ - - mfspr r3, SPRN_EPLC /* will already have correct ELPID and EGS */ PPC_STL r15, VCPU_GPR(R15)(r4) PPC_STL r16, VCPU_GPR(R16)(r4) PPC_STL r17, VCPU_GPR(R17)(r4) PPC_STL r18, VCPU_GPR(R18)(r4) PPC_STL r19, VCPU_GPR(R19)(r4) - mr r8, r3 PPC_STL r20, VCPU_GPR(R20)(r4) - rlwimi r8, r6, EPC_EAS_SHIFT - MSR_IR_LG, EPC_EAS PPC_STL r21, VCPU_GPR(R21)(r4) - rlwimi r8, r6, EPC_EPR_SHIFT - MSR_PR_LG, EPC_EPR PPC_STL r22, VCPU_GPR(R22)(r4) - rlwimi r8, r10, EPC_EPID_SHIFT, EPC_EPID PPC_STL r23, VCPU_GPR(R23)(r4) PPC_STL r24, VCPU_GPR(R24)(r4) PPC_STL r25, VCPU_GPR(R25)(r4) @@ -160,10 +136,15 @@ PPC_STL r29, VCPU_GPR(R29)(r4) PPC_STL r30, VCPU_GPR(R30)(r4) PPC_STL r31, VCPU_GPR(R31)(r4) - mtspr SPRN_EPLC, r8 - isync - lwepx r9, 0, r5 - mtspr SPRN_EPLC, r3 + + /* +* We don't use external PID support. lwepx faults would need to be +* handled by KVM and this implies additional code in DO_KVM (for +* DTLB_MISS, DSI and LRAT) to check ESR[EPID] and EPLC[EGS], which +* is too intrusive for the host. Get the last instruction in +* kvmppc_get_last_inst(). +*/ + li r9, KVM_INST_FETCH_FAILED stw r9, VCPU_LAST_INST(r4) .endif diff --git a/arch/powerpc/kvm/e500_mmu_host.c b/arch/powerpc/kvm/e500_mmu_host.c index 6025cb7..1b4cb41 100644 --- a/arch/powerpc/kvm/e500_mmu_host.c +++ b/arch/powerpc/kvm/e500_mmu_host.c @@ -598,9 +598,102 @@ void kvmppc_mmu_map(struct kvm_vcpu *vcpu, u64 eaddr, gpa_t gpaddr, } } +#ifdef CONFIG_KVM_BOOKE_HV +int kvmppc_ld_inst(struct kvm_vcpu *vcpu, u32 *instr) It'd be interesting to see what the performance impact of doing this on non-HV would be -- it would eliminate divergent code, eliminate the MSR_DS hack, and make exec-only mappings work.
We hit the instruction emulation path a lot more often on non-HV, so even a slight performance impact that might not be a major bummer for HV would become critical for PR. But I agree - it'd be interesting to see numbers. +{ + gva_t geaddr; + hpa_t addr; + hfn_t pfn; + hva_t eaddr; + u32 mas0, mas1, mas2, mas3; + u64 mas7_mas3; + struct page *page; + unsigned int addr_space, psize_shift; + bool pr; + unsigned long flags; + + /* Search TLB for guest pc to get the real address */ + geaddr = kvmppc_get_pc(vcpu); + addr_space = (vcpu-arch.shared-msr
Re: [Qemu-devel] KVM call agenda for 2014-04-01
On Mon, Mar 31, 2014 at 6:25 AM, Peter Maydell peter.mayd...@linaro.org wrote: On 31 March 2014 14:21, Christian Borntraeger borntrae...@de.ibm.com wrote: Another thing might be the release process in general. Currently it seems that everybody tries to push everything just before the hard freeze. I had to debug some problems introduced _after_ soft freeze. Is there some interest in having a Linux-like process (merge window + stabilization)? This would require shorter release cycles of course. merge window has been suggested before. I think it would be a terrible idea for QEMU, personally. We're not the kernel in many ways, notably dev community size and a greater tendency to changes that have effects across the whole tree. Soft + hard freeze is our stabilization period currently. Peter, are you willing to do the tagging and announcement for the 2.0 rcs? I sent instructions privately and between stefanha and I we can get your permissions sorted out. Regards, Anthony Liguori thanks -- PMM
Re: [Qemu-devel] KVM call agenda for 2014-04-01
On 31/03/2014 16:01, Anthony Liguori wrote: merge window has been suggested before. I think it would be a terrible idea for QEMU, personally. We're not the kernel in many ways, notably dev community size and a greater tendency to changes that have effects across the whole tree. Soft + hard freeze is our stabilization period currently. Peter, are you willing to do the tagging and announcement for the 2.0 rcs? I sent instructions privately and between stefanha and I we can get your permissions sorted out. I think it would be a good idea to separate the committer and release manager roles. Peter is providing the community with a wonderful service, just like you were; putting too much work on his shoulders risks getting us in the same situation if anything were to affect his ability to provide it. Paolo
Re: [Qemu-devel] KVM call agenda for 2014-04-01
On 31 March 2014 15:28, Paolo Bonzini pbonz...@redhat.com wrote: I think it would be a good idea to separate the committer and release manager roles. Peter is providing the community with a wonderful service, just like you were; putting too much work on his shoulders risks getting us in the same situation if anything were to affect his ability to provide it. Yes, I strongly agree with this. I think we'll do much better if we can manage to share out responsibilities among a wider group of people. thanks -- PMM
Re: [Qemu-devel] KVM call agenda for 2014-04-01
On 31.03.2014 16:32, Peter Maydell wrote: On 31 March 2014 15:28, Paolo Bonzini pbonz...@redhat.com wrote: I think it would be a good idea to separate the committer and release manager roles. Peter is providing the community with a wonderful service, just like you were; putting too much work on his shoulders risks getting us in the same situation if anything were to affect his ability to provide it. Yes, I strongly agree with this. I think we'll do much better if we can manage to share out responsibilities among a wider group of people. May I propose Michael Roth, who is already experienced from the N-1 stable releases? If we can enable him to upload the tarballs created from his tags that would also streamline the stable workflow while at it. Regards, Andreas -- SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg, Germany GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer; HRB 16746 AG Nürnberg
Re: [Qemu-devel] KVM call agenda for 2014-04-01
On Mon, Mar 31, 2014 at 7:46 AM, Andreas Färber afaer...@suse.de wrote: On 31.03.2014 16:32, Peter Maydell wrote: On 31 March 2014 15:28, Paolo Bonzini pbonz...@redhat.com wrote: I think it would be a good idea to separate the committer and release manager roles. Peter is providing the community with a wonderful service, just like you were; putting too much work on his shoulders risks getting us in the same situation if anything were to affect his ability to provide it. Yes, I strongly agree with this. I think we'll do much better if we can manage to share out responsibilities among a wider group of people. May I propose Michael Roth, who is already experienced from the N-1 stable releases? If we can enable him to upload the tarballs created from his tags that would also streamline the stable workflow while at it. If mdroth is willing to take this on, I am very supportive. Regards, Anthony Liguori Regards, Andreas -- SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg, Germany GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer; HRB 16746 AG Nürnberg
Re: [Qemu-devel] KVM call agenda for 2014-04-01
Anthony Liguori anth...@codemonkey.ws writes: On Mon, Mar 31, 2014 at 7:46 AM, Andreas Färber afaer...@suse.de wrote: On 31.03.2014 16:32, Peter Maydell wrote: On 31 March 2014 15:28, Paolo Bonzini pbonz...@redhat.com wrote: I think it would be a good idea to separate the committer and release manager roles. Peter is providing the community with a wonderful service, just like you were; putting too much work on his shoulders risks getting us in the same situation if anything were to affect his ability to provide it. Yes, I strongly agree with this. I think we'll do much better if we can manage to share out responsibilities among a wider group of people. May I propose Michael Roth, who is already experienced from the N-1 stable releases? If we can enable him to upload the tarballs created from his tags that would also streamline the stable workflow while at it. If mdroth is willing to take this on, I am very supportive. Another vote of confidence from me.
Re: [RFC v2] ARM VM System Specification
On Sun, Mar 30, 2014 at 03:10:50PM -0700, Olof Johansson wrote: On Fri, Mar 28, 2014 at 11:45 AM, Christoffer Dall christoffer.d...@linaro.org wrote: ARM VM System Specification === [not quoting the whole spec here] This looks very sane to me, and aligns very well with non-virtualized images. For what it's worth, even though it's not a patch: Acked-by: Olof Johansson o...@lixom.net Thanks, appreciate it. What's the plan on how to lock this in? Where's this document going to live? In the kernel sources under Documentation/, or external? There were talks during LCA14 about publishing it as a Linaro white paper, but I would like it to be hosted by the open-source projects that relate to it, for example the Linux kernel under Documentation/ to represent both the guest and KVM side, and the Xen sources too. However, that may be a pain to update, so the preferred method could be to host it in either its current form (clear text) or publish some versioned PDF somewhere and simply point to it from the compliant software pieces. Suggestions or input welcome. -Christoffer
Re: VDSO pvclock may increase host cpu consumption, is this a problem?
On 03/29/2014 01:47 AM, Zhanghailiang wrote: Hi, I found that when the Guest is idle, VDSO pvclock may increase host consumption. We can calculate it as follows; correct me if I am wrong. (Host) 250 * update_pvclock_gtod = 1500 * gettimeofday (Guest) In the Host, VDSO pvclock introduces a notifier chain, pvclock_gtod_chain in timekeeping.c. It consumes nearly 900 cycles per call. So at 250 Hz, it may consume 225,000 cycles per second, even if no VM is created. In the Guest, gettimeofday consumes 220 cycles per call with VDSO pvclock. If no-kvmclock-vsyscall is configured, gettimeofday consumes 370 cycles per call. The feature decreases consumption by 150 cycles per call. When gettimeofday is called 1500 times, it saves 225,000 cycles, equal to the host consumption. Both Host and Guest are linux-3.13.6. So, is the host cpu consumption a problem? Does pvclock serve any real purpose on systems with fully-functional TSCs? The x86 guest implementation is awful, so it's about 2x slower than TSC. It could be improved a lot, but I'm not sure I understand why it exists in the first place. I certainly understand the goal of keeping the guest CLOCK_REALTIME in sync with the host, but pvclock seems like overkill for that. --Andy
Re: Demand paging for VM on KVM
Hi Grigory, On Thu, Mar 20, 2014 at 10:50:07AM -0700, Grigory Makarevich wrote: Andrea, Paolo, Thanks a lot for the comments. I like the idea of userfaultfd a lot. For my prototype I had to solve the problem of accessing an on-demand page from paths where exiting is not safe (the emulator is one example). I solved it by sending a message over a netlink socket, blocking the calling thread, and waking it up once the page is delivered. userfaultfd might be a cleaner way to achieve the same goal. I'm glad you like the idea of userfaultfd; the way the syscall works tends to mirror the eventfd(2) syscall, but the protocol talked on the fd is clearly different. So you also made the vcpu talk from the kernel to the on-demand thread through some file descriptor and protocol. The difference is that it shouldn't be limited to the emulator but should work for all syscalls that the IO thread may invoke and that could hit on missing guest physical memory. My concerns regarding a general mm solution are: - Will it work with any memory mapping scheme or only with anonymous memory? Currently I have the hooks only into anonymous memory, but there's no reason why this shouldn't work for other kinds of page faults. The main issue is not the technicality of waiting in the page fault but the semantics of a page not being mapped. If there's a hole in a filebacked mapping, things are different than if there's a hole in anonymous memory, as the VM can unmap and free a filebacked page at any time. But if you're ok with notifying the migration thread even in such a case, it could work the same. It would be more tricky if we had to differentiate an initial fault after the vma is created (in order to notify the migration thread only for initial faults) from faults triggered after the VM unmapped and freed the page as a result of VM pressure.
That would require putting a placeholder in the pagetable instead of keeping the VM code identical to now (which zeroes a pagetable entry when a page is unmapped from a filebacked mapping). There is also some similarity between the userfault mechanism and volatile ranges, but volatile ranges are handled in a fine-grained way in the pagetables, and it looked like there was no code to share in the end. Volatile ranges provide a very different functionality from userfault; for example, they don't need to provide transparent behavior if the memory is given as a parameter to syscalls, as far as I know. They are used only to store data in memory accessed by userland through the pagetables (not syscalls). The objective of userfault is also fundamentally different from the objective of volatile ranges. We cannot ever lose any data, while their whole point is to lose data if there's VM pressure. However, if you want to extend the userfaultfd functionality to trap the first access in the pagetables for filebacked pages, we would also need to mangle the pagetables with some placeholders to differentiate the first fault, and we may want to revisit whether there's some code or placeholder to share across the two features. Ideally, support for filebacked ranges could be added at a second stage. - Before blocking the calling thread while serving the page-fault in the host kernel, one would need to carefully release the mmu semaphore (otherwise, the user-space might be in trouble serving the page-fault), which may be not that trivial. Yes, all locks must be dropped before waiting for the migration thread, and that includes the mmap_sem. I don't think we'll need to add much complexity though, because we can rely on the behavior of page faults that can be repeated endlessly until they succeed. It's certainly more trivial for real faults than gup (because gup can work on more than 1 page at a time, so it may require some rolling back or special lock retaking during the gup loop).
With real faults, the only trick, as an optimization, will be to repeat the fault without returning to userland. Regarding the qemu part of this: - Yes, indeed, user-space would need to be careful accessing on-demand pages. However, that should not be a problem, considering that qemu would need to know in advance all on-demand regions. Though, I would expect some refactoring of the qemu internal API might be required. I don't think the migration thread would risk messing with missing memory, but refactoring of some API and wire protocol will still be needed to make the protocol bidirectional and to handle the postcopy mechanism. David is working on it. Paolo also pointed out one case that won't be entirely transparent. If gdb were debugging the migration thread with a breakpoint in it, and the gdb user then tried to touch missing memory with ptrace after stopping the migration thread, gdb would soft lockup. It would require a sigkill to unblock. There's no real way to fix it in my view... and overall it looks quite reasonable to end up in a soft lockup in such a scenario; I mean, gdb has other ways to interfere in bad ways with the app by corrupting its
RE: mechanism to allow a driver to bind to any device
-Original Message- From: Alex Williamson [mailto:alex.william...@redhat.com] Sent: Wednesday, March 26, 2014 5:09 PM To: Alexander Graf Cc: kvm@vger.kernel.org; jan.kis...@siemens.com; will.dea...@arm.com; Yoder Stuart-B08248; a.r...@virtualopensystems.com; Michal Hocko; Wood Scott-B07421; Sethi Varun-B16395; kvm...@lists.cs.columbia.edu; Rafael J. Wysocki; Guenter Roeck; Dmitry Kasatkin; Tejun Heo; Bjorn Helgaas; Antonios Motakis; t...@virtualopensystems.com; Toshi Kani; Greg KH; linux-ker...@vger.kernel.org; io...@lists.linux-foundation.org; Joe Perches; christoffer.d...@linaro.org Subject: Re: mechanism to allow a driver to bind to any device On Wed, 2014-03-26 at 10:21 -0600, Alex Williamson wrote: On Wed, 2014-03-26 at 23:06 +0800, Alexander Graf wrote: On 26.03.2014 at 22:40, Konrad Rzeszutek Wilk konrad.w...@oracle.com wrote: On Wed, Mar 26, 2014 at 01:40:32AM +0000, Stuart Yoder wrote: Hi Greg, We (Linaro, Freescale, Virtual Open Systems) are trying to get an issue closed that has been percolating for a while around creating a mechanism that will allow kernel drivers like vfio to bind to devices of any type. This thread with you: http://www.spinics.net/lists/kvm-arm/msg08370.html ...seems to have died out, so am trying to get your response and will summarize again. Vfio drivers in the kernel (regardless of bus type) need to bind to devices of any type. The driver's function is to simply export hardware resources of any type to user space. There are several approaches that have been proposed: You seem to have missed the one I proposed. 1. new_id -- (current approach) the user explicitly registers each new device type with the vfio driver using the new_id mechanism. Problem: multiple drivers will be resident that handle the same device type...and there is nothing the user space hotplug infrastructure can do to help. 2.
any id -- the vfio driver could specify a wildcard match of some kind in its ID match table which would allow it to match and bind to any possible device id. However, we don't want the vfio driver grabbing _all_ devices...just the ones we explicitly want to pass to user space. The proposed patch to support this was to create a new flag sysfs_bind_only in struct device_driver. When this flag is set, the driver can only bind to devices via the sysfs bind file. This would allow the wildcard match to work. Patch is here: https://lkml.org/lkml/2013/12/3/253 3. Driver initiated explicit bind -- with this approach the vfio driver would create a private 'bind' sysfs object and the user would echo the requested device into it: echo 0001:03:00.0 > /sys/bus/pci/drivers/vfio-pci/vfio_bind In order to make that work, the driver would need to call driver_probe_device() and thus we need this patch: https://lkml.org/lkml/2014/2/8/175 4). Use the 'unbind' (from the original device) and 'bind' to the vfio driver. This is approach 2, no? Which I think is what is currently being done. What would 'bind to vfio driver' look like? The only thing I see in the URL is That works, but it is ugly. There is some mention of a race, but I don't see how - if you do the 'unbind' on the original driver and then bind the BDF to VFIO, how would you get a race? Typically on PCI, you do a - add wildcard (pci id) match to vfio driver - unbind driver - reprobe - device attaches to vfio driver because it is the least recent match - remove wildcard match from vfio driver If in between you hotplug add a card of the same type, it gets attached to vfio - even though the logical default driver would be the device specific driver. I've mentioned drivers_autoprobe in the past, but I'm not sure we're really factoring it into the discussion.
drivers_autoprobe allows us to toggle two points: a) When a new device is added, whether we automatically give drivers a try at binding to it b) When a new driver is added, whether it gets to try to bind to anything in the system So we do have a mechanism to avoid the race, but the problem is that it becomes the responsibility of userspace to: 1) turn off drivers_autoprobe 2) unbind/new_id/bind/remove_id 3) turn on drivers_autoprobe 4) call drivers_probe for anything added between 1) and 3) Is the question about the ugliness of the current solution whether it's unreasonable to ask userspace to do this? What we seem to be asking for above is more like an autoprobe flag per driver, where there's some way for this special driver to opt out of auto probing. Option
[PATCH 1/2] kvm: support any-length wildcard ioeventfd
It is sometimes beneficial to ignore IO size and only match on address. In hindsight this would have been a better default than matching length when KVM_IOEVENTFD_FLAG_DATAMATCH is not set. In particular, this kind of access can be optimized on VMX: there is no need to do page lookups. This can currently be done with many ioeventfds, but in a suboptimal way. However we can't change the kernel/userspace ABI without risk of breaking some applications. Use len = 0 to mean ignore length for matching in a more optimal way. Signed-off-by: Michael S. Tsirkin m...@redhat.com --- include/uapi/linux/kvm.h | 3 ++- arch/x86/kvm/x86.c | 1 + virt/kvm/eventfd.c | 27 ++- 3 files changed, 25 insertions(+), 6 deletions(-) diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h index 932d7f2..d73007e 100644 --- a/include/uapi/linux/kvm.h +++ b/include/uapi/linux/kvm.h @@ -464,7 +464,7 @@ enum { struct kvm_ioeventfd { __u64 datamatch; __u64 addr;/* legal pio/mmio address */ - __u32 len; /* 1, 2, 4, or 8 bytes*/ + __u32 len; /* 1, 2, 4, or 8 bytes; or 0 to ignore length */ __s32 fd; __u32 flags; __u8 pad[36]; @@ -675,6 +675,7 @@ struct kvm_ppc_smmu_info { #define KVM_CAP_SPAPR_MULTITCE 94 #define KVM_CAP_EXT_EMUL_CPUID 95 #define KVM_CAP_HYPERV_TIME 96 +#define KVM_CAP_IOEVENTFD_NO_LENGTH 97 #ifdef KVM_CAP_IRQ_ROUTING diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 0c6218b..8ae1ff5 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -2602,6 +2602,7 @@ int kvm_dev_ioctl_check_extension(long ext) case KVM_CAP_IRQ_INJECT_STATUS: case KVM_CAP_IRQFD: case KVM_CAP_IOEVENTFD: + case KVM_CAP_IOEVENTFD_NO_LENGTH: case KVM_CAP_PIT2: case KVM_CAP_PIT_STATE2: case KVM_CAP_SET_IDENTITY_MAP_ADDR: diff --git a/virt/kvm/eventfd.c b/virt/kvm/eventfd.c index abe4d60..9eadb59 100644 --- a/virt/kvm/eventfd.c +++ b/virt/kvm/eventfd.c @@ -600,7 +600,15 @@ ioeventfd_in_range(struct _ioeventfd *p, gpa_t addr, int len, const void *val) { u64 _val; - if (!(addr == p->addr && len == p->length)) +
if (addr != p->addr) + /* address must be precise for a hit */ + return false; + + if (!p->length) + /* length = 0 means only look at the address, so always a hit */ + return true; + + if (len != p->length) /* address-range must be precise for a hit */ return false; @@ -671,9 +679,11 @@ ioeventfd_check_collision(struct kvm *kvm, struct _ioeventfd *p) list_for_each_entry(_p, &kvm->ioeventfds, list) if (_p->bus_idx == p->bus_idx && - _p->addr == p->addr && _p->length == p->length && - (_p->wildcard || p->wildcard || - _p->datamatch == p->datamatch)) + _p->addr == p->addr && + (!_p->length || !p->length || + (_p->length == p->length && + (_p->wildcard || p->wildcard || + _p->datamatch == p->datamatch)))) return true; return false; @@ -697,8 +707,9 @@ kvm_assign_ioeventfd(struct kvm *kvm, struct kvm_ioeventfd *args) int ret; bus_idx = ioeventfd_bus_from_flags(args->flags); - /* must be natural-word sized */ + /* must be natural-word sized, or 0 to ignore length */ switch (args->len) { + case 0: case 1: case 2: case 4: @@ -716,6 +727,12 @@ kvm_assign_ioeventfd(struct kvm *kvm, struct kvm_ioeventfd *args) if (args->flags & ~KVM_IOEVENTFD_VALID_FLAG_MASK) return -EINVAL; + /* ioeventfd with no length can't be combined with DATAMATCH */ + if (!args->len && + args->flags & (KVM_IOEVENTFD_FLAG_PIO | + KVM_IOEVENTFD_FLAG_DATAMATCH)) + return -EINVAL; + eventfd = eventfd_ctx_fdget(args->fd); if (IS_ERR(eventfd)) return PTR_ERR(eventfd); -- MST -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 2/2] vmx: speed up wildcard MMIO EVENTFD
With KVM, MMIO is much slower than PIO, due to the need to do a page walk and emulation. But with EPT, it does not have to be: we know the address from the VMCS, so if the address is unique, we can look up the eventfd directly, bypassing emulation. Unfortunately, this only works if userspace does not need to match on access length and data. The implementation adds a separate FAST_MMIO bus internally. This serves two purposes: - minimize overhead for old userspace that does not use eventfd with length = 0 - minimize disruption in other code (since we don't know the length, devices on the MMIO bus only get a valid address in write; this way we don't need to touch all devices to teach them to handle an invalid length) At the moment, this optimization only has effect for EPT on x86. It will be possible to speed up MMIO for NPT and MMU using the same idea in the future. With this patch applied, on VMX MMIO EVENTFD is essentially as fast as PIO. I was unable to detect any measurable slowdown to non-eventfd MMIO. Making MMIO faster is important for the upcoming virtio 1.0, which includes an MMIO signalling capability. The idea was suggested by Peter Anvin. Lots of thanks to Gleb for pre-review and suggestions. Signed-off-by: Michael S.
Tsirkin m...@redhat.com --- include/linux/kvm_host.h | 1 + include/uapi/linux/kvm.h | 1 + arch/x86/kvm/vmx.c | 4 virt/kvm/eventfd.c | 16 virt/kvm/kvm_main.c | 1 + 5 files changed, 23 insertions(+) diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h index b8e9a43..c2eaa9f 100644 --- a/include/linux/kvm_host.h +++ b/include/linux/kvm_host.h @@ -163,6 +163,7 @@ enum kvm_bus { KVM_MMIO_BUS, KVM_PIO_BUS, KVM_VIRTIO_CCW_NOTIFY_BUS, + KVM_FAST_MMIO_BUS, KVM_NR_BUSES }; diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h index d73007e..fc75810 100644 --- a/include/uapi/linux/kvm.h +++ b/include/uapi/linux/kvm.h @@ -450,6 +450,7 @@ enum { kvm_ioeventfd_flag_nr_pio, kvm_ioeventfd_flag_nr_deassign, kvm_ioeventfd_flag_nr_virtio_ccw_notify, + kvm_ioeventfd_flag_nr_fast_mmio, kvm_ioeventfd_flag_nr_max, }; diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index 3927528..78a1932 100644 --- a/arch/x86/kvm/vmx.c +++ b/arch/x86/kvm/vmx.c @@ -5493,6 +5493,10 @@ static int handle_ept_misconfig(struct kvm_vcpu *vcpu) gpa_t gpa; gpa = vmcs_read64(GUEST_PHYSICAL_ADDRESS); + if (!kvm_io_bus_write(vcpu->kvm, KVM_FAST_MMIO_BUS, gpa, 0, NULL)) { + skip_emulated_instruction(vcpu); + return 1; + } ret = handle_mmio_page_fault_common(vcpu, gpa, true); if (likely(ret == RET_MMIO_PF_EMULATE)) diff --git a/virt/kvm/eventfd.c b/virt/kvm/eventfd.c index 9eadb59..c61f2eb 100644 --- a/virt/kvm/eventfd.c +++ b/virt/kvm/eventfd.c @@ -770,6 +770,16 @@ kvm_assign_ioeventfd(struct kvm *kvm, struct kvm_ioeventfd *args) if (ret < 0) goto unlock_fail; + /* When length is ignored, MMIO is also put on a separate bus, for +* faster lookups.
+*/ + if (!args->len && !(args->flags & KVM_IOEVENTFD_FLAG_PIO)) { + ret = kvm_io_bus_register_dev(kvm, KVM_FAST_MMIO_BUS, + p->addr, 0, &p->dev); + if (ret < 0) + goto register_fail; + } + kvm->buses[bus_idx]->ioeventfd_count++; list_add_tail(&p->list, &kvm->ioeventfds); @@ -777,6 +787,8 @@ kvm_assign_ioeventfd(struct kvm *kvm, struct kvm_ioeventfd *args) return 0; +register_fail: + kvm_io_bus_unregister_dev(kvm, bus_idx, &p->dev); unlock_fail: mutex_unlock(&kvm->slots_lock); @@ -816,6 +828,10 @@ kvm_deassign_ioeventfd(struct kvm *kvm, struct kvm_ioeventfd *args) continue; kvm_io_bus_unregister_dev(kvm, bus_idx, &p->dev); + if (!p->length) { + kvm_io_bus_unregister_dev(kvm, KVM_FAST_MMIO_BUS, + &p->dev); + } kvm->buses[bus_idx]->ioeventfd_count--; ioeventfd_release(p); ret = 0; diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c index 03a0381..22a4e1b 100644 --- a/virt/kvm/kvm_main.c +++ b/virt/kvm/kvm_main.c @@ -2920,6 +2920,7 @@ static int __kvm_io_bus_read(struct kvm_io_bus *bus, struct kvm_io_range *range, return -EOPNOTSUPP; } +EXPORT_SYMBOL_GPL(kvm_io_bus_write); /* kvm_io_bus_read - called under kvm->slots_lock */ int kvm_io_bus_read(struct kvm *kvm, enum kvm_bus bus_idx, gpa_t addr, -- MST
Re: [Qemu-devel] KVM call agenda for 2014-04-01
Quoting Andreas Färber (2014-03-31 09:46:45) On 31.03.2014 16:32, Peter Maydell wrote: On 31 March 2014 15:28, Paolo Bonzini pbonz...@redhat.com wrote: I think it would be a good idea to separate the committer and release manager roles. Peter is providing the community with a wonderful service, just like you were; putting too much work on his shoulders risks getting us into the same situation if anything were to affect his ability to provide it. Yes, I strongly agree with this. I think we'll do much better if we can manage to share out responsibilities among a wider group of people. May I propose Michael Roth, who is already experienced from the N-1 stable releases? Sure, I would be willing. If we can enable him to upload the tarballs created from his tags, that would also streamline the stable workflow while at it. Agreed, though I feel a little weird about creating releases for tags that aren't in the official repo. Would that be acceptable from a community standpoint? I'm honestly not sure. Otherwise I think Anthony/Peter would probably still need to process a pull for the stable-y.x branch in advance before we do the tarball/release. It would still help simplify things a bit, though, by keeping tasks compartmentalized. Anthony, Peter: in the past, prior to release, I just sent an email with a pointer to my github branch with the stable release tagged. Would a proper pull request (with a for-stable-x.y tag or some such) be preferable? If we opt to align the stable repo updates with the actual release, what kind of lead time would we need prior to the actual release? Regards, Andreas -- SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg, Germany GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer; HRB 16746 AG Nürnberg
[PATCH v2 0/3] nVMX: Fixes to run Xen as L1
Minor changes to enable Xen as an L1 hypervisor. Tested with a Haswell host, Xen-4.3 L1 and debian6 L2. v2: * Remove advertising single context invalidation for emulated invept. The patch KVM: nVMX: check for null vmcs12 when L1 does invept from v1 is now obsolete and is removed. * Reorder patches KVM: nVMX: Advertise support for interrupt acknowledgement and nVMX: Ack and write vector info to intr_info if L1 asks us to. * Add a commit description to 2/3 and change the comment for nested_exit_intr_ack_set. Jan, I will send a separate unit-test patch. Bandan Das (3): KVM: nVMX: Don't advertise single context invalidation for invept KVM: nVMX: Ack and write vector info to intr_info if L1 asks us to KVM: nVMX: Advertise support for interrupt acknowledgement arch/x86/kvm/irq.c | 1 + arch/x86/kvm/vmx.c | 35 --- 2 files changed, 25 insertions(+), 11 deletions(-) -- 1.8.3.1
[PATCH v2 1/3] KVM: nVMX: Don't advertise single context invalidation for invept
For single context invalidation, we fall through to global invalidation in handle_invept() except for one case - when the operand supplied by L1 is different from what we have in vmcs12. However, typically hypervisors will only call invept for the currently loaded eptp, so the condition will never be true. Signed-off-by: Bandan Das b...@redhat.com --- arch/x86/kvm/vmx.c | 15 +-- 1 file changed, 5 insertions(+), 10 deletions(-) diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index 3927528..3e7f60c 100644 --- a/arch/x86/kvm/vmx.c +++ b/arch/x86/kvm/vmx.c @@ -2331,12 +2331,11 @@ static __init void nested_vmx_setup_ctls_msrs(void) VMX_EPT_INVEPT_BIT; nested_vmx_ept_caps &= vmx_capability.ept; /* -* Since invept is completely emulated we support both global -* and context invalidation independent of what host cpu -* supports +* For nested guests, we don't do anything specific +* for single context invalidation. Hence, only advertise +* support for global context invalidation. */ - nested_vmx_ept_caps |= VMX_EPT_EXTENT_GLOBAL_BIT | - VMX_EPT_EXTENT_CONTEXT_BIT; + nested_vmx_ept_caps |= VMX_EPT_EXTENT_GLOBAL_BIT; } else nested_vmx_ept_caps = 0; @@ -6383,7 +6382,6 @@ static int handle_invept(struct kvm_vcpu *vcpu) struct { u64 eptp, gpa; } operand; - u64 eptp_mask = ((1ull << 51) - 1) & PAGE_MASK; if (!(nested_vmx_secondary_ctls_high & SECONDARY_EXEC_ENABLE_EPT) || !(nested_vmx_ept_caps & VMX_EPT_INVEPT_BIT)) { @@ -6423,16 +6421,13 @@ static int handle_invept(struct kvm_vcpu *vcpu) } switch (type) { - case VMX_EPT_EXTENT_CONTEXT: - if ((operand.eptp & eptp_mask) != - (nested_ept_get_cr3(vcpu) & eptp_mask)) - break; case VMX_EPT_EXTENT_GLOBAL: kvm_mmu_sync_roots(vcpu); kvm_mmu_flush_tlb(vcpu); nested_vmx_succeed(vcpu); break; default: + /* Trap single context invalidation invept calls */ BUG_ON(1); break; } -- 1.8.3.1
[PATCH v2 2/3] KVM: nVMX: Ack and write vector info to intr_info if L1 asks us to
This feature emulates the Acknowledge interrupt on exit behavior. We can safely emulate it for L1 to run L2 even if L0 itself has it disabled (to run L1). Signed-off-by: Bandan Das b...@redhat.com --- arch/x86/kvm/irq.c | 1 + arch/x86/kvm/vmx.c | 19 +++ 2 files changed, 20 insertions(+) diff --git a/arch/x86/kvm/irq.c b/arch/x86/kvm/irq.c index 484bc87..bd0da43 100644 --- a/arch/x86/kvm/irq.c +++ b/arch/x86/kvm/irq.c @@ -113,6 +113,7 @@ int kvm_cpu_get_interrupt(struct kvm_vcpu *v) return kvm_get_apic_interrupt(v); /* APIC */ } +EXPORT_SYMBOL_GPL(kvm_cpu_get_interrupt); void kvm_inject_pending_timer_irqs(struct kvm_vcpu *vcpu) { diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index 3e7f60c..bdc8f2d 100644 --- a/arch/x86/kvm/vmx.c +++ b/arch/x86/kvm/vmx.c @@ -4489,6 +4489,18 @@ static bool nested_exit_on_intr(struct kvm_vcpu *vcpu) PIN_BASED_EXT_INTR_MASK; } +/* + * In nested virtualization, check if L1 has enabled + * interrupt acknowledgement that writes the interrupt vector + * info on vmexit + * + */ +static bool nested_exit_intr_ack_set(struct kvm_vcpu *vcpu) +{ + return get_vmcs12(vcpu)->vm_exit_controls & + VM_EXIT_ACK_INTR_ON_EXIT; +} + static bool nested_exit_on_nmi(struct kvm_vcpu *vcpu) { return get_vmcs12(vcpu)->pin_based_vm_exec_control & @@ -8442,6 +8454,13 @@ static void nested_vmx_vmexit(struct kvm_vcpu *vcpu, u32 exit_reason, prepare_vmcs12(vcpu, vmcs12, exit_reason, exit_intr_info, exit_qualification); + if ((exit_reason == EXIT_REASON_EXTERNAL_INTERRUPT) && + nested_exit_intr_ack_set(vcpu)) { + int irq = kvm_cpu_get_interrupt(vcpu); + vmcs12->vm_exit_intr_info = irq | + INTR_INFO_VALID_MASK | INTR_TYPE_EXT_INTR; + } + trace_kvm_nested_vmexit_inject(vmcs12->vm_exit_reason, vmcs12->exit_qualification, vmcs12->idt_vectoring_info_field, -- 1.8.3.1
[PATCH v2 3/3] KVM: nVMX: Advertise support for interrupt acknowledgement
Some Type 1 hypervisors such as XEN won't enable VMX without it present. Signed-off-by: Bandan Das b...@redhat.com --- arch/x86/kvm/vmx.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index e864b7a..a2a03c5 100644 --- a/arch/x86/kvm/vmx.c +++ b/arch/x86/kvm/vmx.c @@ -2273,7 +2273,8 @@ static __init void nested_vmx_setup_ctls_msrs(void) nested_vmx_pinbased_ctls_high &= ~PIN_BASED_VMX_PREEMPTION_TIMER; } nested_vmx_exit_ctls_high |= (VM_EXIT_ALWAYSON_WITHOUT_TRUE_MSR | - VM_EXIT_LOAD_IA32_EFER | VM_EXIT_SAVE_IA32_EFER); + VM_EXIT_LOAD_IA32_EFER | VM_EXIT_SAVE_IA32_EFER | + VM_EXIT_ACK_INTR_ON_EXIT); /* entry controls */ rdmsr(MSR_IA32_VMX_ENTRY_CTLS, -- 1.8.3.1
Re: mechanism to allow a driver to bind to any device
On Mon, 31 Mar 2014 20:23:36 + Stuart Yoder stuart.yo...@freescale.com wrote: From: Greg KH [mailto:gre...@linuxfoundation.org] Sent: Monday, March 31, 2014 2:47 PM On Mon, Mar 31, 2014 at 06:47:51PM +, Stuart Yoder wrote: I also was at the point where I thought we should perhaps just go with current mechanisms and implement new_id for the platform bus... but Greg's recent response is 'platform devices suck' and it sounds like he would reject a new_id patch for the platform bus. So it kind of feels like we are stuck. ids mean nothing in the platform device model, so having a new_id file for them makes no sense. They don't have IDs like PCI, but platform drivers have to match on something. Platform device match tables are based on compatible strings. Example from the Freescale DMA driver: static const struct of_device_id fsldma_of_ids[] = { { .compatible = "fsl,elo3-dma", }, { .compatible = "fsl,eloplus-dma", }, { .compatible = "fsl,elo-dma", }, {} }; The process of unbinding, setting a new_id, and binding to vfio would work just like PCI: echo ffe101300.dma > /sys/bus/platform/devices/ffe101300.dma/driver/unbind echo fsl,eloplus-dma > /sys/bus/platform/drivers/vfio-platform/new_id In platform device land, we don't want to pursue the new_id/match-by-compatible methodology: we know exactly which specific device (not device types) we want bound to which driver, so we just want to be able to simply do: echo fff51000.ethernet | sudo tee -a /sys/bus/platform/devices/fff51000.ethernet/driver/unbind echo fff51000.ethernet | sudo tee -a /sys/bus/platform/drivers/vfio-platform/bind and not get involved with how PCI doesn't simply do that, independent of autoprobe/hotplug. Kim
Re: mechanism to allow a driver to bind to any device
On Fri, 28 Mar 2014 11:10:23 -0600 Alex Williamson alex.william...@redhat.com wrote: On Fri, 2014-03-28 at 12:58 -0400, Konrad Rzeszutek Wilk wrote: On Wed, Mar 26, 2014 at 04:09:21PM -0600, Alex Williamson wrote: On Wed, 2014-03-26 at 10:21 -0600, Alex Williamson wrote: On Wed, 2014-03-26 at 23:06 +0800, Alexander Graf wrote: Am 26.03.2014 um 22:40 schrieb Konrad Rzeszutek Wilk konrad.w...@oracle.com: On Wed, Mar 26, 2014 at 01:40:32AM +, Stuart Yoder wrote: Hi Greg, We (Linaro, Freescale, Virtual Open Systems) are trying get an issue closed that has been perculating for a while around creating a mechanism that will allow kernel drivers like vfio can bind to devices of any type. This thread with you: http://www.spinics.net/lists/kvm-arm/msg08370.html ...seems to have died out, so am trying to get your response and will summarize again. Vfio drivers in the kernel (regardless of bus type) need to bind to devices of any type. The driver's function is to simply export hardware resources of any type to user space. There are several approaches that have been proposed: You seem to have missed the one I proposed. 1. new_id -- (current approach) the user explicitly registers each new device type with the vfio driver using the new_id mechanism. Problem: multiple drivers will be resident that handle the same device type...and there is nothing user space hotplug infrastructure can do to help. 2. any id -- the vfio driver could specify a wildcard match of some kind in its ID match table which would allow it to match and bind to any possible device id. However, we don't want the vfio driver grabbing _all_ devices...just the ones we explicitly want to pass to user space. The proposed patch to support this was to create a new flag sysfs_bind_only in struct device_driver. When this flag is set, the driver can only bind to devices via the sysfs bind file. This would allow the wildcard match to work. Patch is here: https://lkml.org/lkml/2013/12/3/253 3. 
Driver initiated explicit bind -- with this approach the vfio driver would create a private 'bind' sysfs object and the user would echo the requested device into it: echo 0001:03:00.0 > /sys/bus/pci/drivers/vfio-pci/vfio_bind In order to make that work, the driver would need to call driver_probe_device() and thus we need this patch: https://lkml.org/lkml/2014/2/8/175 4). Use the 'unbind' (from the original device) and 'bind' to the vfio driver. This is approach 2, no? Which I think is what is currently being done. Why is that not sufficient? How would 'bind to vfio driver' look like? The only thing I see in the URL is That works, but it is ugly. There is some mention of a race but I don't see how - if you do the 'unbind' on the original driver and then bind the BDF to VFIO, how would you get a race? Typically on PCI, you do: - add wildcard (pci id) match to vfio driver - unbind driver - reprobe - device attaches to vfio driver because it is the least recent match - remove wildcard match from vfio driver If in between you hotplug-add a card of the same type, it gets attached to vfio - even though the logical default driver would be the device-specific driver. I've mentioned drivers_autoprobe in the past, but I'm not sure we're really factoring it into the discussion. drivers_autoprobe allows us to toggle two points: a) When a new device is added, whether we automatically give drivers a try at binding to it b) When a new driver is added, whether it gets to try to bind to anything in the system So we do have a mechanism to avoid the race, but the problem is that it becomes the responsibility of userspace to: 1) turn off drivers_autoprobe 2) unbind/new_id/bind/remove_id 3) turn on drivers_autoprobe 4) call drivers_probe for anything added between 1) and 3) Is the question about the ugliness of the current solution whether it's unreasonable to ask userspace to do this?
What we seem to be asking for above is more like an autoprobe flag per driver where there's some way for this special driver to opt out of auto probing. Option 2. in Stuart's list does this by short-cutting ID matching so that a match is only found when using the sysfs bind path, option 3. enables a way
Re: [PATCH 4/4] KVM: PPC: Bookehv: Get vcpu's last instruction for emulation
On Mon, 2014-03-31 at 15:41 +0200, Alexander Graf wrote: On 03/26/2014 10:17 PM, Scott Wood wrote: On Thu, 2014-02-20 at 18:30 +0200, Mihai Caraman wrote: + /* + * Another thread may rewrite the TLB entry in parallel, don't + * execute from the address if the execute permission is not set + */ What happens when another thread rewrites the TLB entry in parallel? Does tlbsx succeed? Does it fail? Do we see failure indicated somehow? Are the contents of the MAS registers consistent at this point or inconsistent? If another thread rewrites the TLB entry, then we use the new TLB entry, just as if it had raced in hardware. This check ensures that we don't execute from the new TLB entry if it doesn't have execute permissions (just as hardware would). What happens if the new TLB entry is valid and executable, and the instruction pointed to is valid, but doesn't trap (and thus we don't have emulation for it)? There has to be a good way to detect such a race and deal with it, no? How would you detect it? We don't get any information from the trap about what physical address the instruction was fetched from, or what the instruction was. -Scott
Re: mechanism to allow a driver to bind to any device
On Mon, 2014-03-31 at 17:36 -0500, Kim Phillips wrote: On Fri, 28 Mar 2014 11:10:23 -0600 Alex Williamson alex.william...@redhat.com wrote: On Fri, 2014-03-28 at 12:58 -0400, Konrad Rzeszutek Wilk wrote: On Wed, Mar 26, 2014 at 04:09:21PM -0600, Alex Williamson wrote: On Wed, 2014-03-26 at 10:21 -0600, Alex Williamson wrote: On Wed, 2014-03-26 at 23:06 +0800, Alexander Graf wrote: Am 26.03.2014 um 22:40 schrieb Konrad Rzeszutek Wilk konrad.w...@oracle.com: On Wed, Mar 26, 2014 at 01:40:32AM +, Stuart Yoder wrote: Hi Greg, We (Linaro, Freescale, Virtual Open Systems) are trying get an issue closed that has been perculating for a while around creating a mechanism that will allow kernel drivers like vfio can bind to devices of any type. This thread with you: http://www.spinics.net/lists/kvm-arm/msg08370.html ...seems to have died out, so am trying to get your response and will summarize again. Vfio drivers in the kernel (regardless of bus type) need to bind to devices of any type. The driver's function is to simply export hardware resources of any type to user space. There are several approaches that have been proposed: You seem to have missed the one I proposed. 1. new_id -- (current approach) the user explicitly registers each new device type with the vfio driver using the new_id mechanism. Problem: multiple drivers will be resident that handle the same device type...and there is nothing user space hotplug infrastructure can do to help. 2. any id -- the vfio driver could specify a wildcard match of some kind in its ID match table which would allow it to match and bind to any possible device id. However, we don't want the vfio driver grabbing _all_ devices...just the ones we explicitly want to pass to user space. The proposed patch to support this was to create a new flag sysfs_bind_only in struct device_driver. When this flag is set, the driver can only bind to devices via the sysfs bind file. This would allow the wildcard match to work. 
Patch is here: https://lkml.org/lkml/2013/12/3/253 3. Driver initiated explicit bind -- with this approach the vfio driver would create a private 'bind' sysfs object and the user would echo the requested device into it: echo 0001:03:00.0 /sys/bus/pci/drivers/vfio-pci/vfio_bind In order to make that work, the driver would need to call driver_probe_device() and thus we need this patch: https://lkml.org/lkml/2014/2/8/175 4). Use the 'unbind' (from the original device) and 'bind' to vfio driver. This is approach 2, no? Which I think is what is currently being done. Why is that not sufficient? How would 'bind to vfio driver' look like? The only thing I see in the URL is That works, but it is ugly. There is some mention of race but I don't see how - if you do the 'unbind' on the original driver and then bind the BDF to the VFIO how would you get a race? Typically on PCI, you do a - add wildcard (pci id) match to vfio driver - unbind driver - reprobe - device attaches to vfio driver because it is the least recent match - remove wildcard match from vfio driver If in between you hotplug add a card of the same type, it gets attached to vfio - even though the logical default driver would be the device specific driver. I've mentioned drivers_autoprobe in the past, but I'm not sure we're really factoring it into the discussion. drivers_autoprobe allows us to toggle two points: a) When a new device is added whether we automatically give drivers a try at binding to it b) When a new driver is added whether it gets to try to bind to anything in the system So we do have a mechanism to avoid the race, but the problem is that it becomes the responsibility of userspace to: 1) turn off drivers_autoprobe 2) unbind/new_id/bind/remove_id 3) turn on drivers_autoprobe 4) call drivers_probe for anything added between 1) 3) Is the question about the ugliness of the current solution whether it's unreasonable to ask userspace to do this? 
What we seem to be asking for above is more like an autoprobe flag per driver where there's some way for this special
Re: [Qemu-devel] Massive read only kvm guests when backing file was missing
Thanks Stefan, and thanks Michael also. That situation regarding IRC was very special, since I didn't want to tell Michael hey, everyone on the mailing list got it and I'm here chatting with you and you didn't, so I assumed IRC was 9 times more pro than the mailing list, so I decided to keep my head down and assume the communication error was on my side. Still, IMHO, I really believe that if you are a user willing to give KVM a chance enough to make a query on IRC, you might feel you are not geek enough to be there, and I don't mean be there on IRC, but trying to use the community to support you while you try KVM. In my case, while it was very important to understand what my chances were regarding this issue, I knew I would find my answer no matter what because I was determined to find it; I could get mad with 10.5K guests running on my back. Yes, my experience was more from the virsh stop; virsh start side, but still I felt I needed you guys to try to find this out. Again, thanks to everyone. Best. Alejandro Comisario On Fri, Mar 28, 2014 at 5:47 AM, Stefan Hajnoczi stefa...@gmail.com wrote: On Fri, Mar 28, 2014 at 11:01:00AM +0400, Michael Tokarev wrote: 27.03.2014 20:14, Alejandro Comisario wrote: Seems like virtio (kvm 1.0) doesn't expose a timeout on the guest side (ubuntu 12.04 on host and guest). So, how can I adjust the timeout on the guest? After a bit more talk on IRC yesterday, it turned out that the situation is _much_ more interesting than originally described. The OP claims to have 10500 guests running off an NFS server, and that after NFS server downtime, the backing files were disappeared (whatever that means), so they had to restore those files. More, the OP didn't even bother to look at the guest's dmesg, being busy rebooting all 10500 guests. This solution is the most logical one, but I cannot apply it! Thanks for all the responses!
I suggested the OP actually describe the _real_ situation, instead of giving random half-pictures, and actually take a look at the actual problem as reported in various places (most importantly the guest kernel log), and report _those_ hints to the list. I also mentioned that, at least for some NFS servers, if a client has a file open on the server, and this file is deleted, the server will report an error to the client when the client tries to access that file, and this has nothing at all to do with timeouts of any kind. Thanks for the update and for taking time to help on IRC. I feel you're being harsh on Alejandro though. Improving the quality of bug reports is important, but it shouldn't be at the expense of quality of communication. We can't assume that everyone is an expert in troubleshooting KVM or Linux. Therefore we can't blame them, which will only drive people away and detract from the community. TL;DR post logs and error messages +1, berate him -1 Stefan
RE: [PATCH v3 2/4] KVM: Add SMAP support when setting CR4
-Original Message- From: kvm-ow...@vger.kernel.org [mailto:kvm-ow...@vger.kernel.org] On Behalf Of Paolo Bonzini Sent: Monday, March 31, 2014 9:31 PM To: Wu, Feng; g...@redhat.com; h...@zytor.com; kvm@vger.kernel.org Subject: Re: [PATCH v3 2/4] KVM: Add SMAP support when setting CR4 Just a few comments... -static void update_permission_bitmask(struct kvm_vcpu *vcpu, +void update_permission_bitmask(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu, bool ept) { unsigned bit, byte, pfec; u8 map; - bool fault, x, w, u, wf, uf, ff, smep; + bool fault, x, w, u, wf, uf, ff, smapf, cr4_smap, smep, smap = 0; smep = kvm_read_cr4_bits(vcpu, X86_CR4_SMEP); Can you make an additional patch to rename this to cr4_smep? Sure! I noticed your comments about this issue in the previous email; I was preparing to make a patch for it and will send it out today! + cr4_smap = kvm_read_cr4_bits(vcpu, X86_CR4_SMAP); for (byte = 0; byte < ARRAY_SIZE(mmu->permissions); ++byte) { pfec = byte << 1; map = 0; wf = pfec & PFERR_WRITE_MASK; uf = pfec & PFERR_USER_MASK; ff = pfec & PFERR_FETCH_MASK; + /* +* PFERR_RSVD_MASK bit is used to detect SMAP violation. +* We will check it in permission_fault(); this bit is +* set in pfec for a normal fault, while it is cleared for +* SMAP violations. +*/ This bit is set in PFEC if the access is _not_ subject to SMAP restrictions, and cleared otherwise. The bit is only meaningful if the SMAP bit is set in CR4. + smapf = !(pfec & PFERR_RSVD_MASK); for (bit = 0; bit < 8; ++bit) { x = bit & ACC_EXEC_MASK; w = bit & ACC_WRITE_MASK; @@ -3627,11 +3635,32 @@ static void update_permission_bitmask(struct kvm_vcpu *vcpu, w |= !is_write_protection(vcpu) && !uf; /* Disallow supervisor fetches of user code if cr4.smep */ x &= !(smep && u && !uf); + + /* +* SMAP: kernel-mode data accesses from user-mode +* mappings should fault.
A fault is considered +* a SMAP violation if all of the following +* conditions are true: +* - X86_CR4_SMAP is set in CR4 +* - A user page is accessed +* - Page fault in kernel mode +* - !(CPL < 3 && X86_EFLAGS_AC is set) - if CPL < 3, EFLAGS.AC is clear Should it be if CPL == 3 or EFLAGS.AC is clear? +* Here, we cover the first three conditions. +* The CPL and X86_EFLAGS_AC are in smapf, which +* permission_fault() computes dynamically. The fourth is computed dynamically in permission_fault() and is in SMAPF. +* Also, SMAP does not affect instruction +* fetches; add the !ff check here to make it +* clearer. +*/ + smap = cr4_smap && u && !uf && !ff; } else /* Not really needed: no U/S accesses on ept */ u = 1; - fault = (ff && !x) || (uf && !u) || (wf && !w); + fault = (ff && !x) || (uf && !u) || (wf && !w) || + (smapf && smap); map |= fault << bit; } mmu->permissions[byte] = map; diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h index 2926152..822190f 100644 --- a/arch/x86/kvm/mmu.h +++ b/arch/x86/kvm/mmu.h @@ -44,11 +44,17 @@ #define PT_DIRECTORY_LEVEL 2 #define PT_PAGE_TABLE_LEVEL 1 -#define PFERR_PRESENT_MASK (1U << 0) -#define PFERR_WRITE_MASK (1U << 1) -#define PFERR_USER_MASK (1U << 2) -#define PFERR_RSVD_MASK (1U << 3) -#define PFERR_FETCH_MASK (1U << 4) +#define PFERR_PRESENT_BIT 0 +#define PFERR_WRITE_BIT 1 +#define PFERR_USER_BIT 2 +#define PFERR_RSVD_BIT 3 +#define PFERR_FETCH_BIT 4 + +#define PFERR_PRESENT_MASK (1U << PFERR_PRESENT_BIT) +#define PFERR_WRITE_MASK (1U << PFERR_WRITE_BIT) +#define PFERR_USER_MASK (1U << PFERR_USER_BIT) +#define PFERR_RSVD_MASK (1U << PFERR_RSVD_BIT) +#define PFERR_FETCH_MASK (1U << PFERR_FETCH_BIT) int kvm_mmu_get_spte_hierarchy(struct kvm_vcpu *vcpu, u64 addr, u64 sptes[4]); void kvm_mmu_set_mmio_spte_mask(u64 mmio_mask); @@ -73,6 +79,8 @@ int handle_mmio_page_fault_common(struct kvm_vcpu *vcpu, u64 addr, bool direct); void kvm_init_shadow_mmu(struct kvm_vcpu *vcpu, struct kvm_mmu *context); void
[PATCH v4 0/4] KVM: enable Intel SMAP for KVM
Supervisor Mode Access Prevention (SMAP) is a new security feature disclosed by Intel; please refer to the following document: http://software.intel.com/sites/default/files/319433-014.pdf Every access to a linear address is either a supervisor-mode access or a user-mode access. All accesses performed while the current privilege level (CPL) is less than 3 are supervisor-mode accesses. If CPL = 3, accesses are generally user-mode accesses. However, some operations implicitly access system data structures, and the resulting accesses to those data structures are supervisor-mode accesses regardless of CPL. Examples of such implicit supervisor accesses include the following: accesses to the global descriptor table (GDT) or local descriptor table (LDT) to load a segment descriptor; accesses to the interrupt descriptor table (IDT) when delivering an interrupt or exception; and accesses to the task-state segment (TSS) as part of a task switch or change of CPL. If CR4.SMAP = 1, supervisor-mode data accesses are not allowed to linear addresses that are accessible in user mode. If CPL < 3, SMAP protections are disabled if EFLAGS.AC = 1. If CPL = 3, SMAP applies to all supervisor-mode data accesses (these are implicit supervisor accesses) regardless of the value of EFLAGS.AC. This patchset passes the SMAP feature through to guests and lets guests benefit from it. Version 1: * Remove SMAP bit from CR4_RESERVED_BITS. * Add SMAP support when setting CR4 * Disable SMAP for guests in EPT realmode and EPT unpaging mode * Expose SMAP feature to guest Version 2: * Change the logic of updating mmu permission bitmap for SMAP violation * Expose SMAP feature to guest in the last patch of this series. Version 3: * Changes in update_permission_bitmask(). * Use a branchless way suggested by Paolo Bonzini to detect SMAP violation in permission_fault(). Version 4: * Changes to some comments and code style. Feng Wu (4): KVM: Remove SMAP bit from CR4_RESERVED_BITS.
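The CPL/EFLAGS.AC rules above reduce to a small predicate. A minimal C sketch of that rule, for illustration only (the function name and parameters are invented here, not KVM or SDM identifiers):

```c
#include <stdbool.h>

/*
 * Sketch of the SMAP enforcement rule described in the cover letter:
 * a supervisor-mode data access to a user-accessible page is blocked
 * unless it is an explicit access made at CPL < 3 with EFLAGS.AC = 1.
 * Implicit supervisor accesses (GDT/LDT/IDT/TSS) ignore EFLAGS.AC.
 */
bool smap_blocks_access(bool cr4_smap, int cpl, bool eflags_ac,
                        bool implicit_supervisor)
{
    if (!cr4_smap)
        return false;          /* SMAP not enabled: never blocks */
    if (implicit_supervisor)
        return true;           /* AC never overrides implicit accesses */
    if (cpl < 3 && eflags_ac)
        return false;          /* explicit access with AC set: allowed */
    return true;               /* otherwise the access faults */
}
```

At CPL = 3 every supervisor-mode data access is implicit, which is why the predicate blocks regardless of AC there, matching the SDM text quoted above.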
KVM: Add SMAP support when setting CR4 KVM: Disable SMAP for guests in EPT realmode and EPT unpaging mode KVM: expose SMAP feature to guest arch/x86/include/asm/kvm_host.h | 2 +- arch/x86/kvm/cpuid.c| 2 +- arch/x86/kvm/cpuid.h| 8 arch/x86/kvm/mmu.c | 34 --- arch/x86/kvm/mmu.h | 44 + arch/x86/kvm/paging_tmpl.h | 2 +- arch/x86/kvm/vmx.c | 11 ++- arch/x86/kvm/x86.c | 9 - 8 files changed, 92 insertions(+), 20 deletions(-) -- 1.8.3.1 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v4 4/4] KVM: expose SMAP feature to guest
This patch exposes SMAP feature to guest Signed-off-by: Feng Wu feng...@intel.com --- arch/x86/kvm/cpuid.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c index c697625..deb5f9b 100644 --- a/arch/x86/kvm/cpuid.c +++ b/arch/x86/kvm/cpuid.c @@ -303,7 +303,7 @@ static inline int __do_cpuid_ent(struct kvm_cpuid_entry2 *entry, u32 function, /* cpuid 7.0.ebx */ const u32 kvm_supported_word9_x86_features = F(FSGSBASE) | F(BMI1) | F(HLE) | F(AVX2) | F(SMEP) | - F(BMI2) | F(ERMS) | f_invpcid | F(RTM); + F(BMI2) | F(ERMS) | f_invpcid | F(RTM) | F(SMAP); -- 1.8.3.1
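Once this patch is applied, a guest can discover the feature through CPUID leaf 7, subleaf 0: per the Intel SDM, SMEP is EBX bit 7 and SMAP is EBX bit 20. A small illustrative check (the helper name is made up; only the bit positions are architectural):

```c
#include <stdbool.h>
#include <stdint.h>

/* Architectural feature-bit positions in CPUID.(EAX=7,ECX=0):EBX. */
#define CPUID7_EBX_SMEP (1u << 7)
#define CPUID7_EBX_SMAP (1u << 20)

/* Test a feature bit in the EBX value returned by CPUID leaf 7/0. */
bool cpuid7_has(uint32_t ebx, uint32_t feature_mask)
{
    return (ebx & feature_mask) != 0;
}
```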
[PATCH v4 3/4] KVM: Disable SMAP for guests in EPT realmode and EPT unpaging mode
SMAP is disabled if the CPU is in non-paging mode in hardware. However, KVM always uses paging mode to emulate guest non-paging mode with TDP. To emulate this behavior, SMAP needs to be manually disabled when the guest switches to non-paging mode. Signed-off-by: Feng Wu feng...@intel.com --- arch/x86/kvm/vmx.c | 11 ++- 1 file changed, 6 insertions(+), 5 deletions(-) diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index 3927528..e58cb5f 100644 --- a/arch/x86/kvm/vmx.c +++ b/arch/x86/kvm/vmx.c @@ -3452,13 +3452,14 @@ static int vmx_set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4) hw_cr4 &= ~X86_CR4_PAE; hw_cr4 |= X86_CR4_PSE; /* -* SMEP is disabled if CPU is in non-paging mode in -* hardware. However KVM always uses paging mode to +* SMEP/SMAP is disabled if CPU is in non-paging mode +* in hardware. However KVM always uses paging mode to * emulate guest non-paging mode with TDP. -* To emulate this behavior, SMEP needs to be manually -* disabled when guest switches to non-paging mode. +* To emulate this behavior, SMEP/SMAP needs to be +* manually disabled when guest switches to non-paging +* mode. */ - hw_cr4 &= ~X86_CR4_SMEP; + hw_cr4 &= ~(X86_CR4_SMEP | X86_CR4_SMAP); } else if (!(cr4 & X86_CR4_PAE)) { hw_cr4 &= ~X86_CR4_PAE; } -- 1.8.3.1
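The effect of the hunk is plain bit-masking on the CR4 value handed to hardware. A standalone sketch of the idea, assuming the architectural CR4 bit positions (SMEP is bit 20, SMAP is bit 21) and an invented helper name:

```c
#include <stdbool.h>

/* Architectural CR4 bit positions (per the Intel SDM). */
#define X86_CR4_SMEP (1ul << 20)
#define X86_CR4_SMAP (1ul << 21)

/*
 * When the guest runs unpaged but hardware stays in paging mode (EPT
 * realmode / unpaging mode), hide SMEP/SMAP from the hardware CR4 so
 * the guest does not see faults it could never see on bare metal.
 */
unsigned long make_hw_cr4(unsigned long guest_cr4, bool guest_paging)
{
    unsigned long hw_cr4 = guest_cr4;

    if (!guest_paging)
        hw_cr4 &= ~(X86_CR4_SMEP | X86_CR4_SMAP);
    return hw_cr4;
}
```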
[PATCH v4 1/4] KVM: Remove SMAP bit from CR4_RESERVED_BITS.
This patch removes SMAP bit from CR4_RESERVED_BITS. Signed-off-by: Feng Wu feng...@intel.com --- arch/x86/include/asm/kvm_host.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index fdf83af..4eeb049 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -60,7 +60,7 @@ | X86_CR4_PSE | X86_CR4_PAE | X86_CR4_MCE \ | X86_CR4_PGE | X86_CR4_PCE | X86_CR4_OSFXSR | X86_CR4_PCIDE \ | X86_CR4_OSXSAVE | X86_CR4_SMEP | X86_CR4_FSGSBASE \ - | X86_CR4_OSXMMEXCPT | X86_CR4_VMXE)) + | X86_CR4_OSXMMEXCPT | X86_CR4_VMXE | X86_CR4_SMAP)) #define CR8_RESERVED_BITS (~(unsigned long)X86_CR8_TPR) -- 1.8.3.1
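The reserved-bit mask matters because every guest CR4 write is checked against it, and a write touching a reserved bit is rejected (the guest takes #GP). A toy version of that check, with invented names, shows why this patch must come first in the series:

```c
#include <stdbool.h>

#define X86_CR4_SMAP (1ul << 21)   /* architectural CR4.SMAP position */

/* A guest CR4 write is only accepted if it sets no reserved bits. */
bool cr4_write_allowed(unsigned long new_cr4, unsigned long reserved_bits)
{
    return (new_cr4 & reserved_bits) == 0;
}
```

Before this patch X86_CR4_SMAP was part of the reserved mask, so a SMAP-aware guest setting CR4.SMAP would be refused; removing it from the mask is what makes the rest of the series reachable.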
[PATCH v4 2/4] KVM: Add SMAP support when setting CR4
This patch adds SMAP handling logic when setting CR4 for guests. Thanks a lot to Paolo Bonzini for his suggestion to use the branchless way to detect SMAP violations. Signed-off-by: Feng Wu feng...@intel.com --- arch/x86/kvm/cpuid.h | 8 arch/x86/kvm/mmu.c | 34 +++--- arch/x86/kvm/mmu.h | 44 arch/x86/kvm/paging_tmpl.h | 2 +- arch/x86/kvm/x86.c | 9 - 5 files changed, 84 insertions(+), 13 deletions(-) diff --git a/arch/x86/kvm/cpuid.h b/arch/x86/kvm/cpuid.h index a2a1bb7..eeecbed 100644 --- a/arch/x86/kvm/cpuid.h +++ b/arch/x86/kvm/cpuid.h @@ -48,6 +48,14 @@ static inline bool guest_cpuid_has_smep(struct kvm_vcpu *vcpu) return best && (best->ebx & bit(X86_FEATURE_SMEP)); } +static inline bool guest_cpuid_has_smap(struct kvm_vcpu *vcpu) +{ + struct kvm_cpuid_entry2 *best; + + best = kvm_find_cpuid_entry(vcpu, 7, 0); + return best && (best->ebx & bit(X86_FEATURE_SMAP)); +} + static inline bool guest_cpuid_has_fsgsbase(struct kvm_vcpu *vcpu) { struct kvm_cpuid_entry2 *best; diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c index 9b53135..a183783 100644 --- a/arch/x86/kvm/mmu.c +++ b/arch/x86/kvm/mmu.c @@ -3601,20 +3601,27 @@ static void reset_rsvds_bits_mask_ept(struct kvm_vcpu *vcpu, } } -static void update_permission_bitmask(struct kvm_vcpu *vcpu, +void update_permission_bitmask(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu, bool ept) { unsigned bit, byte, pfec; u8 map; - bool fault, x, w, u, wf, uf, ff, smep; + bool fault, x, w, u, wf, uf, ff, smapf, cr4_smap, smep, smap = 0; smep = kvm_read_cr4_bits(vcpu, X86_CR4_SMEP); + cr4_smap = kvm_read_cr4_bits(vcpu, X86_CR4_SMAP); for (byte = 0; byte < ARRAY_SIZE(mmu->permissions); ++byte) { pfec = byte << 1; map = 0; wf = pfec & PFERR_WRITE_MASK; uf = pfec & PFERR_USER_MASK; ff = pfec & PFERR_FETCH_MASK; + /* +* PFERR_RSVD_MASK bit is set in PFEC if the access is not +* subject to SMAP restrictions, and cleared otherwise. The +* bit is only meaningful if the SMAP bit is set in CR4.
+*/ + smapf = !(pfec & PFERR_RSVD_MASK); for (bit = 0; bit < 8; ++bit) { x = bit & ACC_EXEC_MASK; w = bit & ACC_WRITE_MASK; @@ -3627,11 +3634,32 @@ static void update_permission_bitmask(struct kvm_vcpu *vcpu, w |= !is_write_protection(vcpu) && !uf; /* Disallow supervisor fetches of user code if cr4.smep */ x &= !(smep && u && !uf); + + /* +* SMAP: kernel-mode data accesses from user-mode +* mappings should fault. A fault is considered +* as a SMAP violation if all of the following +* conditions are true: +* - X86_CR4_SMAP is set in CR4 +* - A user page is accessed +* - Page fault in kernel mode +* - if CPL = 3 or X86_EFLAGS_AC is clear +* +* Here, we cover the first three conditions. +* The fourth is computed dynamically in +* permission_fault() and is in smapf. +* +* Also, SMAP does not affect instruction +* fetches, add the !ff check here to make it +* clearer. +*/ + smap = cr4_smap && u && !uf && !ff; } else /* Not really needed: no U/S accesses on ept */ u = 1; - fault = (ff && !x) || (uf && !u) || (wf && !w); + fault = (ff && !x) || (uf && !u) || (wf && !w) || + (smapf && smap); map |= fault << bit; } mmu->permissions[byte] = map; diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h index 2926152..3842e70 100644 --- a/arch/x86/kvm/mmu.h +++ b/arch/x86/kvm/mmu.h @@ -44,11 +44,17 @@ #define PT_DIRECTORY_LEVEL 2 #define PT_PAGE_TABLE_LEVEL 1 -#define PFERR_PRESENT_MASK (1U << 0) -#define PFERR_WRITE_MASK (1U << 1) -#define PFERR_USER_MASK (1U << 2) -#define PFERR_RSVD_MASK (1U << 3) -#define PFERR_FETCH_MASK (1U << 4) +#define PFERR_PRESENT_BIT 0 +#define PFERR_WRITE_BIT 1 +#define PFERR_USER_BIT 2 +#define PFERR_RSVD_BIT 3 +#define
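The fault expression at the heart of update_permission_bitmask() can be exercised in isolation. This sketch mirrors that expression with standalone boolean parameters instead of the kernel's precomputed bitmap, purely to illustrate the SMAP term (names here are invented, not kernel identifiers):

```c
#include <stdbool.h>

/*
 * Mirror of: fault = (ff && !x) || (uf && !u) || (wf && !w) ||
 *                    (smapf && smap);
 * where smap = "CR4.SMAP set, user page, supervisor access, not a fetch"
 * and smapf means the CPL/EFLAGS.AC check did not exempt the access.
 */
bool access_faults(bool cr4_smap, bool user_page, bool exec_ok,
                   bool write_ok, bool wf, bool uf, bool ff, bool smapf)
{
    bool smap = cr4_smap && user_page && !uf && !ff;

    return (ff && !exec_ok) || (uf && !user_page) ||
           (wf && !write_ok) || (smapf && smap);
}
```

Note how the !ff term makes instruction fetches immune to SMAP, exactly as the comment in the patch says, and how clearing cr4_smap removes the new term without touching the pre-existing checks.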
[PATCH] Rename variable smep to cr4_smep
This patch is based on the smap patchset Feng Wu (1): Rename variable smep to cr4_smep arch/x86/kvm/mmu.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) -- 1.8.3.1
[PATCH] Rename variable smep to cr4_smep
Rename variable smep to cr4_smep, which can better reflect the meaning of the variable. Signed-off-by: Feng Wu feng...@intel.com --- arch/x86/kvm/mmu.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c index a183783..6000557 100644 --- a/arch/x86/kvm/mmu.c +++ b/arch/x86/kvm/mmu.c @@ -3606,9 +3606,9 @@ void update_permission_bitmask(struct kvm_vcpu *vcpu, { unsigned bit, byte, pfec; u8 map; - bool fault, x, w, u, wf, uf, ff, smapf, cr4_smap, smep, smap = 0; + bool fault, x, w, u, wf, uf, ff, smapf, cr4_smap, cr4_smep, smap = 0; - smep = kvm_read_cr4_bits(vcpu, X86_CR4_SMEP); + cr4_smep = kvm_read_cr4_bits(vcpu, X86_CR4_SMEP); cr4_smap = kvm_read_cr4_bits(vcpu, X86_CR4_SMAP); for (byte = 0; byte < ARRAY_SIZE(mmu->permissions); ++byte) { pfec = byte << 1; @@ -3633,7 +3633,7 @@ void update_permission_bitmask(struct kvm_vcpu *vcpu, /* Allow supervisor writes if !cr0.wp */ w |= !is_write_protection(vcpu) && !uf; /* Disallow supervisor fetches of user code if cr4.smep */ - x &= !(smep && u && !uf); + x &= !(cr4_smep && u && !uf); /* * SMAP: kernel-mode data accesses from user-mode -- 1.8.3.1
Re: VDSO pvclock may increase host cpu consumption, is this a problem?
On Mon, Mar 31, 2014 at 10:52:25AM -0700, Andy Lutomirski wrote: On 03/29/2014 01:47 AM, Zhanghailiang wrote: Hi, I found that when the Guest is idle, VDSO pvclock may increase host consumption. We can calculate it as follows; correct me if I am wrong. (Host) 250 * update_pvclock_gtod = 1500 * gettimeofday (Guest) In the Host, VDSO pvclock introduces a notifier chain, pvclock_gtod_chain in timekeeping.c. It consumes nearly 900 cycles per call. So at 250 Hz, it may consume 225,000 cycles per second, even if no VM is created. In the Guest, gettimeofday consumes 220 cycles per call with VDSO pvclock. If no-kvmclock-vsyscall is configured, gettimeofday consumes 370 cycles per call. The feature saves 150 cycles per call. When gettimeofday is called 1500 times, it saves 225,000 cycles, equal to the host consumption. Both Host and Guest are linux-3.13.6. So, is the host cpu consumption a problem? Does pvclock serve any real purpose on systems with fully-functional TSCs? The x86 guest implementation is awful, so it's about 2x slower than TSC. It could be improved a lot, but I'm not sure I understand why it exists in the first place. VM migration. Can you explain why you consider it so bad? How do you think it could be improved? I certainly understand the goal of keeping the guest CLOCK_REALTIME in sync with the host, but pvclock seems like overkill for that. VM migration.
Re: VDSO pvclock may increase host cpu consumption, is this a problem?
On Mar 31, 2014 8:45 PM, Marcelo Tosatti mtosa...@redhat.com wrote: On Mon, Mar 31, 2014 at 10:52:25AM -0700, Andy Lutomirski wrote: On 03/29/2014 01:47 AM, Zhanghailiang wrote: Hi, I found that when the Guest is idle, VDSO pvclock may increase host consumption. We can calculate it as follows; correct me if I am wrong. (Host) 250 * update_pvclock_gtod = 1500 * gettimeofday (Guest) In the Host, VDSO pvclock introduces a notifier chain, pvclock_gtod_chain in timekeeping.c. It consumes nearly 900 cycles per call. So at 250 Hz, it may consume 225,000 cycles per second, even if no VM is created. In the Guest, gettimeofday consumes 220 cycles per call with VDSO pvclock. If no-kvmclock-vsyscall is configured, gettimeofday consumes 370 cycles per call. The feature saves 150 cycles per call. When gettimeofday is called 1500 times, it saves 225,000 cycles, equal to the host consumption. Both Host and Guest are linux-3.13.6. So, is the host cpu consumption a problem? Does pvclock serve any real purpose on systems with fully-functional TSCs? The x86 guest implementation is awful, so it's about 2x slower than TSC. It could be improved a lot, but I'm not sure I understand why it exists in the first place. VM migration. Why does that need percpu stuff? Wouldn't it be sufficient to interrupt all CPUs (or at least all cpus running in userspace) on migration and update the normal timing data structures? Even better: have the VM offer to invalidate the physical page containing the kernel's clock data on migration and interrupt one CPU. If another CPU races, it'll fault and wait for the guest kernel to update its timing. Does the current kvmclock stuff track CLOCK_MONOTONIC and CLOCK_REALTIME separately? Can you explain why you consider it so bad? How do you think it could be improved? The second rdtsc_barrier looks unnecessary. Even better, if rdtscp is available, then rdtscp can replace rdtsc_barrier, rdtsc, and the getcpu call.
It would also be nice to avoid having two sets of rescalings of the timing data. I certainly understand the goal of keeping the guest CLOCK_REALTIME in sync with the host, but pvclock seems like overkill for that. VM migration.
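The retry protocol being debated in this thread is essentially a seqlock: the guest rereads the host-published clock parameters until it observes a stable, even version number. A simplified single-file sketch of that pattern (not the actual pvclock structures, which also carry TSC scale/shift and per-cpu state; sequentially-consistent atomics stand in for the kernel's explicit barriers):

```c
#include <stdint.h>
#include <stdatomic.h>

struct clock_data {
    atomic_uint version;   /* odd while the writer is mid-update */
    uint64_t    base_ns;   /* published time parameter */
};

/* Writer (host) side: bump to odd, update the data, bump back to even. */
void clock_update(struct clock_data *c, uint64_t ns)
{
    atomic_fetch_add(&c->version, 1);   /* version becomes odd */
    c->base_ns = ns;
    atomic_fetch_add(&c->version, 1);   /* version becomes even again */
}

/* Reader (guest) side: retry until the version is even and unchanged. */
uint64_t clock_read(const struct clock_data *c)
{
    unsigned v;
    uint64_t ns;

    do {
        v = atomic_load(&c->version);
        ns = c->base_ns;
    } while ((v & 1) || v != atomic_load(&c->version));
    return ns;
}
```

The rdtsc_barrier discussion above is about the reader side of exactly this loop: the barriers keep the TSC read from being reordered around the version checks.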
Re: [PATCH 4/4] KVM: PPC: Bookehv: Get vcpu's last instruction for emulation
On 01.04.2014 at 01:03, Scott Wood scottw...@freescale.com wrote: On Mon, 2014-03-31 at 15:41 +0200, Alexander Graf wrote: On 03/26/2014 10:17 PM, Scott Wood wrote: On Thu, 2014-02-20 at 18:30 +0200, Mihai Caraman wrote: +/* + * Another thread may rewrite the TLB entry in parallel, don't + * execute from the address if the execute permission is not set + */ What happens when another thread rewrites the TLB entry in parallel? Does tlbsx succeed? Does it fail? Do we see failure indicated somehow? Are the contents of the MAS registers consistent at this point or inconsistent? If another thread rewrites the TLB entry, then we use the new TLB entry, just as if it had raced in hardware. This check ensures that we don't execute from the new TLB entry if it doesn't have execute permissions (just as hardware would). What happens if the new TLB entry is valid and executable, and the instruction pointed to is valid, but doesn't trap (and thus we don't have emulation for it)? There has to be a good way to detect such a race and deal with it, no? How would you detect it? We don't get any information from the trap about what physical address the instruction was fetched from, or what the instruction was. Ah, this is a guest race where the guest modifies exactly the TLB in question. I see. Why would this ever happen in practice (in a case where we're not executing malicious code)? Alex -Scott
Re: [PATCH 3/4] KVM: PPC: Alow kvmppc_get_last_inst() to fail
On 03/26/2014 09:52 PM, Scott Wood wrote: On Thu, 2014-02-20 at 18:30 +0200, Mihai Caraman wrote: diff --git a/arch/powerpc/kvm/book3s_paired_singles.c b/arch/powerpc/kvm/book3s_paired_singles.c index a59a25a..80c533e 100644 --- a/arch/powerpc/kvm/book3s_paired_singles.c +++ b/arch/powerpc/kvm/book3s_paired_singles.c @@ -640,19 +640,24 @@ static int kvmppc_ps_one_in(struct kvm_vcpu *vcpu, bool rc, int kvmppc_emulate_paired_single(struct kvm_run *run, struct kvm_vcpu *vcpu) { - u32 inst = kvmppc_get_last_inst(vcpu); + u32 inst; enum emulation_result emulated = EMULATE_DONE; - - int ax_rd = inst_get_field(inst, 6, 10); - int ax_ra = inst_get_field(inst, 11, 15); - int ax_rb = inst_get_field(inst, 16, 20); - int ax_rc = inst_get_field(inst, 21, 25); - short full_d = inst_get_field(inst, 16, 31); - - u64 *fpr_d = vcpu->arch.fpr[ax_rd]; - u64 *fpr_a = vcpu->arch.fpr[ax_ra]; - u64 *fpr_b = vcpu->arch.fpr[ax_rb]; - u64 *fpr_c = vcpu->arch.fpr[ax_rc]; + int ax_rd, ax_ra, ax_rb, ax_rc; + short full_d; + u64 *fpr_d, *fpr_a, *fpr_b, *fpr_c; + + kvmppc_get_last_inst(vcpu, &inst); Should probably check for failure here and elsewhere -- even though it can't currently fail on book3s, the interface now allows it. diff --git a/arch/powerpc/kvm/book3s_pr.c b/arch/powerpc/kvm/book3s_pr.c index 5b9e906..b0d884d 100644 --- a/arch/powerpc/kvm/book3s_pr.c +++ b/arch/powerpc/kvm/book3s_pr.c @@ -624,9 +624,10 @@ void kvmppc_giveup_ext(struct kvm_vcpu *vcpu, ulong msr) static int kvmppc_read_inst(struct kvm_vcpu *vcpu) { ulong srr0 = kvmppc_get_pc(vcpu); - u32 last_inst = kvmppc_get_last_inst(vcpu); + u32 last_inst; int ret; + kvmppc_get_last_inst(vcpu, &last_inst); ret = kvmppc_ld(vcpu, &srr0, sizeof(u32), &last_inst, false); This isn't new, but this function looks odd to me -- calling kvmppc_get_last_inst() but ignoring last_inst, then calling kvmppc_ld() and ignoring anything but failure. last_inst itself is never read. And no comments to explain the weirdness.
:-) I get that kvmppc_get_last_inst() is probably being used for the side effect of filling in vcpu->arch.last_inst, but why store the return value without using it? Why pass the address of it to kvmppc_ld(), which seems to be used only as an indirect way of determining whether kvmppc_get_last_inst() failed? And that whole mechanism becomes stranger once it becomes possible for kvmppc_get_last_inst() to directly return failure. If you're interested in the history of this, here's the patch :) https://github.com/mirrors/linux-2.6/commit/c7f38f46f2a98d232147e47284cb4e7363296a3e The idea is that we need 2 things to be good after this function: 1) vcpu->arch.last_inst is valid 2) if the last instruction is not readable, return failure Hence this weird construct. I don't think it's really necessary though - just remove the kvmppc_ld() call and only fail read_inst() when the caching didn't work and we can't translate the address. Alex -- To unsubscribe from this list: send the line unsubscribe kvm-ppc in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 4/4] KVM: PPC: Bookehv: Get vcpu's last instruction for emulation
On 03/26/2014 10:17 PM, Scott Wood wrote: On Thu, 2014-02-20 at 18:30 +0200, Mihai Caraman wrote: Load external pid (lwepx) instruction faults (when called from KVM with guest context) need to be handled by KVM. This implies additional code in the DO_KVM macro to identify the source of the exception (which originates from the KVM host rather than the guest). The hook requires checking the Exception Syndrome Register ESR[EPID] and External PID Load Context Register EPLC[EGS] for some exceptions (DTLB_MISS, DSI and LRAT). Doing this on the Data TLB miss exception is obviously intrusive for the host. Get rid of lwepx and acquire the last instruction in kvmppc_get_last_inst() by searching for the physical address and kmapping it. This fixes an infinite loop caused by lwepx's data TLB miss handled in the host and the TODO for TLB eviction and execute-but-not-read entries. Signed-off-by: Mihai Caraman mihai.cara...@freescale.com --- arch/powerpc/kvm/bookehv_interrupts.S | 37 +++-- arch/powerpc/kvm/e500_mmu_host.c | 93 + 2 files changed, 102 insertions(+), 28 deletions(-) diff --git a/arch/powerpc/kvm/bookehv_interrupts.S b/arch/powerpc/kvm/bookehv_interrupts.S index 20c7a54..c50490c 100644 --- a/arch/powerpc/kvm/bookehv_interrupts.S +++ b/arch/powerpc/kvm/bookehv_interrupts.S @@ -119,38 +119,14 @@ 1: .if \flags & NEED_EMU - /* -* This assumes you have external PID support. -* To support a bookehv CPU without external PID, you'll -* need to look up the TLB entry and create a temporary mapping. -* -* FIXME: we don't currently handle if the lwepx faults. PR-mode -* booke doesn't handle it either. Since Linux doesn't use -* broadcast tlbivax anymore, the only way this should happen is -* if the guest maps its memory execute-but-not-read, or if we -* somehow take a TLB miss in the middle of this entry code and -* evict the relevant entry. On e500mc, all kernel lowmem is -* bolted into TLB1 large page mappings, and we don't use -* broadcast invalidates, so we should not take a TLB miss here.
-* -* Later we'll need to deal with faults here. Disallowing guest -* mappings that are execute-but-not-read could be an option on -* e500mc, but not on chips with an LRAT if it is used. -*/ - - mfspr r3, SPRN_EPLC /* will already have correct ELPID and EGS */ PPC_STL r15, VCPU_GPR(R15)(r4) PPC_STL r16, VCPU_GPR(R16)(r4) PPC_STL r17, VCPU_GPR(R17)(r4) PPC_STL r18, VCPU_GPR(R18)(r4) PPC_STL r19, VCPU_GPR(R19)(r4) - mr r8, r3 PPC_STL r20, VCPU_GPR(R20)(r4) - rlwimi r8, r6, EPC_EAS_SHIFT - MSR_IR_LG, EPC_EAS PPC_STL r21, VCPU_GPR(R21)(r4) - rlwimi r8, r6, EPC_EPR_SHIFT - MSR_PR_LG, EPC_EPR PPC_STL r22, VCPU_GPR(R22)(r4) - rlwimi r8, r10, EPC_EPID_SHIFT, EPC_EPID PPC_STL r23, VCPU_GPR(R23)(r4) PPC_STL r24, VCPU_GPR(R24)(r4) PPC_STL r25, VCPU_GPR(R25)(r4) @@ -160,10 +136,15 @@ PPC_STL r29, VCPU_GPR(R29)(r4) PPC_STL r30, VCPU_GPR(R30)(r4) PPC_STL r31, VCPU_GPR(R31)(r4) - mtspr SPRN_EPLC, r8 - isync - lwepx r9, 0, r5 - mtspr SPRN_EPLC, r3 + + /* +* We don't use external PID support. lwepx faults would need to be +* handled by KVM and this implies additional code in DO_KVM (for +* DTLB_MISS, DSI and LRAT) to check ESR[EPID] and EPLC[EGS], which +* is too intrusive for the host. Get the last instruction in +* kvmppc_get_last_inst(). +*/ + li r9, KVM_INST_FETCH_FAILED stw r9, VCPU_LAST_INST(r4) .endif diff --git a/arch/powerpc/kvm/e500_mmu_host.c b/arch/powerpc/kvm/e500_mmu_host.c index 6025cb7..1b4cb41 100644 --- a/arch/powerpc/kvm/e500_mmu_host.c +++ b/arch/powerpc/kvm/e500_mmu_host.c @@ -598,9 +598,102 @@ void kvmppc_mmu_map(struct kvm_vcpu *vcpu, u64 eaddr, gpa_t gpaddr, } } +#ifdef CONFIG_KVM_BOOKE_HV +int kvmppc_ld_inst(struct kvm_vcpu *vcpu, u32 *instr) It'd be interesting to see what the performance impact of doing this on non-HV would be -- it would eliminate divergent code, eliminate the MSR_DS hack, and make exec-only mappings work.
We hit the instruction emulation path a lot more often on non-HV, so even a slight performance impact that might not be a major bummer for HV would become critical for PR. But I agree - it'd be interesting to see numbers. +{ + gva_t geaddr; + hpa_t addr; + hfn_t pfn; + hva_t eaddr; + u32 mas0, mas1, mas2, mas3; + u64 mas7_mas3; + struct page *page; + unsigned int addr_space, psize_shift; + bool pr; + unsigned long flags; + + /* Search TLB for guest pc to get the real address */ + geaddr = kvmppc_get_pc(vcpu); + addr_space = (vcpu->arch.shared->msr
Re: [PATCH 4/4] KVM: PPC: Bookehv: Get vcpu's last instruction for emulation
On Mon, 2014-03-31 at 15:41 +0200, Alexander Graf wrote: On 03/26/2014 10:17 PM, Scott Wood wrote: On Thu, 2014-02-20 at 18:30 +0200, Mihai Caraman wrote: + /* + * Another thread may rewrite the TLB entry in parallel, don't + * execute from the address if the execute permission is not set + */ What happens when another thread rewrites the TLB entry in parallel? Does tlbsx succeed? Does it fail? Do we see failure indicated somehow? Are the contents of the MAS registers consistent at this point or inconsistent? If another thread rewrites the TLB entry, then we use the new TLB entry, just as if it had raced in hardware. This check ensures that we don't execute from the new TLB entry if it doesn't have execute permissions (just as hardware would). What happens if the new TLB entry is valid and executable, and the instruction pointed to is valid, but doesn't trap (and thus we don't have emulation for it)? There has to be a good way to detect such a race and deal with it, no? How would you detect it? We don't get any information from the trap about what physical address the instruction was fetched from, or what the instruction was. -Scott