Re: [PATCH 00/13] KVM: MMU: fast page fault
On 04/13/2012 05:25 PM, Takuya Yoshikawa wrote:
> I forgot to say one important thing -- I might have given you the wrong impression. I am perfectly fine with your lock-less work. It is really nice!
> The reason I say so much about O(1) is that O(1) and rmap-based GET_DIRTY_LOG have fundamentally different characteristics. I am thinking really seriously about how to make dirty page tracking work well with QEMU in the future. For example, I am thinking about multi-threaded and fine-grained GET_DIRTY_LOG. If we use rmap-based GET_DIRTY_LOG, we can restrict write protection to only a selected area of one guest memory slot. So we may be able to make each thread process dirty pages independently of other threads by calling GET_DIRTY_LOG for its own area. But I know that O(1) has its own good points. So please wait a bit; I will write up what I am thinking or send patches.
> Anyway, I am looking forward to your lock-less work! It will improve the current GET_DIRTY_LOG performance.

Just to throw another idea into the mix - we can have write-protect-less dirty logging, too. Instead of write protection, drop the dirty bit, and check it again when reading the dirty log. It might look like we're accessing the spte twice here, but it's actually just once - when we check it to report for GET_DIRTY_LOG call N, we also prepare it for call N+1.

This doesn't work for EPT, which lacks a dirty bit. But we can emulate it: take a free bit and call it spte.NOTDIRTY; when it is set, we also clear spte.WRITE, and teach the mmu that if it sees spte.NOTDIRTY it can just set spte.WRITE and clear spte.NOTDIRTY. Now that looks exactly like Xiao's lockless write enabling.

Another note: O(1) write protection is not mutually exclusive with rmap-based write protection. In GET_DIRTY_LOG, you write protect everything, and proceed to write enable on faults. When you reach the page table level, you perform the rmap check to see if you should write protect or not.
With role.direct=1 the check is very cheap (and sometimes you can drop the entire page table and replace it with a large spte). -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
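The write-protect-less scheme above can be sketched in a few lines of C. This is an illustrative model only - the spte array, the bit position, and the function names are invented for the example, not KVM's real spte format: for call N we report the dirty status accumulated since call N-1, and clearing the bit simultaneously re-arms the spte for call N+1.

```c
#include <stdint.h>
#include <stddef.h>

#define SPTE_DIRTY (1ull << 6)   /* illustrative dirty-bit position */

/* Report whether the page was dirtied since the last call, and
 * re-arm the spte for the next GET_DIRTY_LOG call. */
static int test_and_clear_spte_dirty(uint64_t *spte)
{
    int was_dirty = (*spte & SPTE_DIRTY) != 0;

    *spte &= ~SPTE_DIRTY;
    return was_dirty;
}

/* Walk a slot's sptes and fill a dirty bitmap, one bit per page -
 * no write protection, and no write faults on the next iteration. */
static void get_dirty_log(uint64_t *sptes, size_t npages, uint8_t *bitmap)
{
    for (size_t i = 0; i < npages; i++)
        if (test_and_clear_spte_dirty(&sptes[i]))
            bitmap[i / 8] |= (uint8_t)(1u << (i % 8));
}
```

The EPT variant would substitute the software NOTDIRTY/WRITE bit pair for the hardware dirty bit, but the report-and-rearm structure stays the same.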
Re: DOS VM problem with QEMU-KVM and newer kernels
On 04/12/2012 09:32 PM, Gerhard Wiesinger wrote:
> Hello, I'm having problems with recent kernels and qemu-kvm with a DOS VM: TD286
> System: Bad selector: 0007
> System: Bad selector: 0D87
> System: Bad selector: 001F
> System: Bad selector: 0007
> GP at 0020 21D4 EC 0DC4
> Error 269 loading D:\BP\BIN\TD286.EXE into extended memory
> Another 286 DOS extender application also raises a general protection fault: GP at 0020 18A1 CODE 357C
> It doesn't depend on the DOS memory manager used and is always reproducible. It depends only on the kernel version, not on qemu-kvm or seabios (tried to bisect it without success):
> # NOK: Linux 3.3.1-3.fc16.x86_64 #1 SMP Wed Apr 4 18:08:51 UTC 2012 x86_64 x86_64 x86_64 GNU/Linux
> # NOK: Linux 3.2.10-3.fc16.x86_64 #1 SMP Thu Mar 15 19:39:46 UTC 2012 x86_64 x86_64 x86_64 GNU/Linux
> # OK: Linux 3.1.9-1.fc16.x86_64 #1 SMP Fri Jan 13 16:37:42 UTC 2012 x86_64 x86_64 x86_64 GNU/Linux
> # OK: Linux 2.6.41.9-1.fc15.x86_64 #1 SMP Fri Jan 13 16:46:51 UTC 2012 x86_64 x86_64 x86_64 GNU/Linux
> The CPU is an AMD one. Any ideas how to fix it again? Any switches which might help?

The trigger is probably commit f1c1da2bde712812a3e0f9a7a7ebe7a916a4b5f4:

    Author: Jan Kiszka jan.kis...@siemens.com
    Date: Tue Oct 18 18:23:11 2011 +0200

    KVM: SVM: Keep intercepting task switching with NPT enabled

    AMD processors apparently have a bug in the hardware task switching
    support when NPT is enabled. If the task switch triggers a NPF, we can
    get wrong EXITINTINFO along with that fault. On resume, spurious
    exceptions may then be injected into the guest. We were able to
    reproduce this bug when our guest triggered #SS and the handler were
    supposed to run over a separate task with not yet touched stack pages.

    Work around the issue by continuing to emulate task switches even in
    NPT mode.

    Signed-off-by: Jan Kiszka jan.kis...@siemens.com
    Signed-off-by: Marcelo Tosatti mtosa...@redhat.com

Although it's not the patch's direct fault - it simply exposed an existing bug in kvm.
Things to try:
- revert the patch with a newer kernel
- try 3.4-rc2, which has some task switch fixes from Kevin; if you want a Fedora kernel, use rawhide's [2]
- post traces [1]

Jan, Joerg, was an AMD erratum published for the bug?

[1] http://www.linux-kvm.org/page/Tracing
[2] http://mirrors.kernel.org/fedora/development/rawhide/x86_64/os/Packages/k/kernel-3.4.0-0.rc2.git2.1.fc18.x86_64.rpm
qemu-kvm fails on pax kernel
Hi list, I use the PAX patch in kernel 3.2.14. My qemu-kvm guest runs on a Gentoo hardened host. When I try to start my kvm guest I get:

PAX: size overflow detected in function wrmsr_interception arch/x86/kvm/svm.c:3115
Pid: 3565, comm: kvm_webserver Not tainted 3.2.14-rsbac-2.57-sec #5
Call Trace:
[8115a407] ? 0x8115a407
[810344f4] ? 0x810344f4
[81032fd0] ? 0x81032fd0
[81016f3c] ? 0x81016f3c
[8102f215] ? 0x8102f215
[810019c2] ? 0x810019c2
[8116965b] ? 0x8116965b
[81169f3b] ? 0x81169f3b
[8116a60b] ? 0x8116a60b
[81542226] ? 0x81542226
[8154224d] ? 0x8154224d

-- With kind regards, Jens Kasten http://www.kasten-edv.de
Re: Linux Crash Caused By KVM?
On 04/11/2012 09:59 PM, Eric Northup wrote:
> On Wed, Apr 11, 2012 at 7:45 AM, Avi Kivity a...@redhat.com wrote:
>> On 04/11/2012 05:11 AM, Peijie Yu wrote:
>>> For this problem, I found that the panic is caused by BUG_ON(in_nmi()), which means an NMI happened during another NMI context; but I checked the Intel technical manual and found "While an NMI interrupt handler is executing, the processor disables additional calls to the NMI handler until the next IRET instruction is executed." So, how did this happen?
>> The NMI path for kvm is different; the processor exits from the guest with NMIs blocked, then executes kvm code until it issues int $2 in vmx_complete_interrupts(). If an IRET is executed in this path, then NMIs will be unblocked and nested NMIs may occur. One way this can happen is if we access the vmap area and incur a fault, between the VMEXIT and invoking the NMI handler. Or perhaps the NMI handler itself generates a fault. Or we have a debug exception in that path. Is this reproducible?
> As an FYI, there have been BIOSes whose SMI handlers ran IRETs. So the NMI blocking can go away surprisingly. See "29.8 NMI handling while in SMM" in the Intel SDM vol 3.

Interesting, thanks. From 29.8 it looks like you don't even need to issue IRET within SMM, since SMM doesn't save/restore the NMI blocking flag. However, this being a server, and the crash being in kvm code, I don't think we can rule out that this is a kvm bug.
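The unblocking window described above can be modeled with a toy state machine. Everything here is illustrative - there are no real KVM structures, just three flags tracking the sequence: VMEXIT with NMIs blocked, a fault whose IRET drops the blocking, a second hardware NMI sneaking in, and kvm's `int $2` finally nesting inside it.

```c
#include <stdbool.h>

struct cpu {
    bool nmi_blocked;     /* hardware NMI-blocked flag */
    bool in_nmi_handler;  /* software: currently inside the NMI handler */
    int nested_nmis;      /* nonzero would trip BUG_ON(in_nmi()) */
};

/* Enter the handler; count a nested entry if we were already in it. */
static void deliver_nmi(struct cpu *c)
{
    if (c->in_nmi_handler)
        c->nested_nmis++;
    c->in_nmi_handler = true;
}

/* Guest NMI causes a VMEXIT; the CPU leaves guest mode with NMIs blocked. */
static void vmexit_on_nmi(struct cpu *c) { c->nmi_blocked = true; }

/* A fault (e.g. on the vmap area) and its IRET drop the blocking. */
static void fault_and_iret(struct cpu *c) { c->nmi_blocked = false; }

/* A real NMI from hardware respects the blocked flag. */
static void hardware_nmi(struct cpu *c)
{
    if (!c->nmi_blocked)
        deliver_nmi(c);
}

/* kvm's int $2 in vmx_complete_interrupts(): a software gate,
 * not subject to hardware NMI blocking. */
static void int2(struct cpu *c) { deliver_nmi(c); }
```

With no intervening IRET, `int $2` is the first entry into the handler and nothing nests; with a fault in between, a hardware NMI can land first and `int $2` then nests inside it.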
Re: qemu-kvm fails on pax kernel
On 04/15/2012 12:42 PM, Jens Kasten wrote:
> Hi list, I use the PAX patch in kernel 3.2.14. My qemu-kvm guest runs on a Gentoo hardened host. When I try to start my kvm guest I get:
> PAX: size overflow detected in function wrmsr_interception arch/x86/kvm/svm.c:3115
> Pid: 3565, comm: kvm_webserver Not tainted 3.2.14-rsbac-2.57-sec #5
> Call Trace: [8115a407] ? 0x8115a407 [...] [8154224d] ? 0x8154224d

Out of tree patches are not supported. But if you provided a decoded trace (with symbols instead of numbers), maybe we can learn something.
Re: [PATCH v2 04/16] KVM: MMU: return bool in __rmap_write_protect
On 04/14/2012 05:00 AM, Takuya Yoshikawa wrote:
> On Fri, 13 Apr 2012 18:11:13 +0800 Xiao Guangrong xiaoguangr...@linux.vnet.ibm.com wrote:
>> The return value of __rmap_write_protect is either 1 or 0, use true/false instead of these
>> ...
>> @@ -1689,7 +1690,7 @@ static void mmu_sync_children(struct kvm_vcpu *vcpu,
>>  	kvm_mmu_pages_init(parent, &parents, &pages);
>>  	while (mmu_unsync_walk(parent, &pages)) {
>> -		int protected = 0;
>> +		bool protected = false;
>>
>>  		for_each_sp(pages, sp, parents, i)
>>  			protected |= rmap_write_protect(vcpu->kvm, sp->gfn);
> Isn't this the reason we prefer int to bool? Not sure people like to use |= with a boolean.

Why not?
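For what it's worth, `|=` composes naturally with C's `bool`: the stored value is normalized to 0 or 1, so accumulating "was anything protected?" across iterations behaves exactly like the `int` version. A minimal standalone demonstration (the stub predicate is invented for the example, not KVM code):

```c
#include <stdbool.h>

/* Stand-in for rmap_write_protect(): pretend odd gfns needed
 * write protection. */
static bool rmap_write_protect_stub(int gfn)
{
    return gfn % 2 != 0;
}

/* Accumulate a boolean across iterations with |=, like
 * mmu_sync_children() does. */
static bool any_protected(const int *gfns, int n)
{
    bool protected = false;

    for (int i = 0; i < n; i++)
        protected |= rmap_write_protect_stub(gfns[i]);
    return protected;
}
```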
Re: [PATCH v2 10/16] KVM: MMU: fast check whether page is writable
On 04/13/2012 01:14 PM, Xiao Guangrong wrote:
> Using bit 1 (PTE_LIST_WP_BIT) in rmap to store the write-protect status, to avoid unnecessary shadow page walking
>
> Signed-off-by: Xiao Guangrong xiaoguangr...@linux.vnet.ibm.com
> ---
>  arch/x86/kvm/mmu.c | 40 ++--
>  1 files changed, 34 insertions(+), 6 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> index 0c6e92d..8b71908 100644
> --- a/arch/x86/kvm/mmu.c
> +++ b/arch/x86/kvm/mmu.c
> @@ -796,7 +796,9 @@ static int mapping_level(struct kvm_vcpu *vcpu, gfn_t large_gfn)
>  	return level - 1;
>  }
>
> -#define PTE_LIST_DESC		(0x1ull)
> +#define PTE_LIST_DESC_BIT	0
> +#define PTE_LIST_WP_BIT	1
> +#define PTE_LIST_DESC		(1 << PTE_LIST_DESC_BIT)
>  #define PTE_LIST_FLAG_MASK	(0x3ull)
>
>  static void
> @@ -1067,6 +1069,12 @@ static bool rmap_can_add(struct kvm_vcpu *vcpu)
>  	return mmu_memory_cache_free_objects(cache);
>  }
>
> +static void host_page_write_protect(u64 *spte, unsigned long *rmapp)
> +{
> +	if (!(*spte & SPTE_HOST_WRITEABLE))
> +		__test_and_set_bit(PTE_LIST_WP_BIT, rmapp);
> +}

Why is this needed, in addition to spte.SPTE_WRITE_PROTECT?
Re: [PATCH v3] kvm tools: Make raw block device work
On Fri, 13 Apr 2012, Asias He wrote:
> From: Asias He asias.he...@gmail.com
> Previously, we used the mmapped host root partition as the guest's root filesystem. Now that a virtio-9p based root filesystem is supported, the mmapped host root partition approach is not used anymore. It is useful to use a raw block device as the guest's disk backend for some users, e.g. to bypass the host's fs layer. This patch makes raw block devices work as disk images; users can do read/write on a raw block device by using DISK_IMAGE_REGULAR instead of DISK_IMAGE_MMAP for block devices.
> Changes in v3: - Add fclose.
> Changes in v2: - Check whether the block device is mounted before using it.
> Signed-off-by: Asias He asias.he...@gmail.com

Applied, thanks!
Re: [PATCH 1/2] kvm tools: Fix sdl hang
On Fri, 13 Apr 2012, Asias He wrote:
> From: Asias He asias.he...@gmail.com
> Commit b4a932d175c6aa975c456e9b05339aa069c961cb sets sdl's .start ops to sdl__stop, which makes sdl never start. Fix it up.
> Signed-off-by: Asias He asias.he...@gmail.com

Both patches applied! Thnx!
[PATCHv1 don't apply] RFC: kvm eoi PV using shared memory
I got lots of useful feedback from v0, so I thought sending out a brain dump again would be a good idea. This is mainly to show how I'm trying to address the comments I got from the previous round. Flames/feedback are welcome!

Changes from v0:
- Tweaked setup MSRs a bit
- Keep ISR bit set. Before reading ISR, test EOI in guest memory and clear
- Check priority for nested interrupts; we can enable the optimization if the new one is high priority
- Disable the optimization for any interrupt handled by the ioapic (this is because the ioapic handles notifiers and the pic and it generally gets messy; it's possible that we can optimize some ioapic-handled edge interrupts - but is it worth it?)
- A less intrusive change for the guest apic (0 overhead without kvm)

---

I took a stab at implementing PV EOI using shared memory. This should reduce the number of exits an interrupt causes by as much as half. A partially complete draft for both host and guest parts is below.

The idea is simple: there's a bit, per APIC, in guest memory, that tells the guest that it does not need EOI. We set it before injecting an interrupt and clear it before injecting a nested one. The guest tests it using a test-and-clear operation - this is necessary so that the host can detect interrupt nesting - and if set, it can skip the EOI MSR. There's a new MSR to set the address of said register in guest memory. Otherwise not much changed:
- Guest EOI is not required
- Register is tested & ISR is automatically cleared before injection

qemu support is incomplete - mostly for feature negotiation. Need to add some trace points to enable profiling. No testing was done beyond compiling the kernel.

Signed-off-by: Michael S.
Tsirkin m...@redhat.com

diff --git a/arch/x86/include/asm/bitops.h b/arch/x86/include/asm/bitops.h
index b97596e..c9c70ea 100644
--- a/arch/x86/include/asm/bitops.h
+++ b/arch/x86/include/asm/bitops.h
@@ -26,11 +26,13 @@
 #if __GNUC__ < 4 || (__GNUC__ == 4 && __GNUC_MINOR__ < 1)
 /* Technically wrong, but this avoids compilation errors on some gcc versions. */
-#define BITOP_ADDR(x) "=m" (*(volatile long *) (x))
+#define BITOP_ADDR_CONSTRAINT "=m"
 #else
-#define BITOP_ADDR(x) "+m" (*(volatile long *) (x))
+#define BITOP_ADDR_CONSTRAINT "+m"
 #endif
 
+#define BITOP_ADDR(x) BITOP_ADDR_CONSTRAINT (*(volatile long *) (x))
+
 #define ADDR	BITOP_ADDR(addr)
 
 /*
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index e216ba0..3d09ef1 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -481,6 +481,12 @@ struct kvm_vcpu_arch {
 		u64 length;
 		u64 status;
 	} osvw;
+
+	struct {
+		u64 msr_val;
+		struct gfn_to_hva_cache data;
+		bool pending;
+	} eoi;
 };
 
 struct kvm_lpage_info {
diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
index 734c376..164376a 100644
--- a/arch/x86/include/asm/kvm_para.h
+++ b/arch/x86/include/asm/kvm_para.h
@@ -22,6 +22,7 @@
 #define KVM_FEATURE_CLOCKSOURCE2	3
 #define KVM_FEATURE_ASYNC_PF		4
 #define KVM_FEATURE_STEAL_TIME		5
+#define KVM_FEATURE_EOI			6
 
 /* The last 8 bits are used to indicate how to interpret the flags field
  * in pvclock structure. If no bits are set, all flags are ignored.
@@ -37,6 +38,8 @@
 #define MSR_KVM_SYSTEM_TIME_NEW 0x4b564d01
 #define MSR_KVM_ASYNC_PF_EN	0x4b564d02
 #define MSR_KVM_STEAL_TIME	0x4b564d03
+#define MSR_KVM_EOI_EN		0x4b564d04
+#define MSR_KVM_EOI_DISABLED	0x0L
 
 struct kvm_steal_time {
 	__u64 steal;
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index b8ba6e4..450aae4 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -39,6 +39,7 @@
 #include <asm/desc.h>
 #include <asm/tlbflush.h>
 #include <asm/idle.h>
+#include <asm/apic.h>
 
 static int kvmapf = 1;
 
@@ -290,6 +291,33 @@ static void kvm_register_steal_time(void)
 		cpu, __pa(st));
 }
 
+/* TODO: needs to be early? aligned? */
+static DEFINE_EARLY_PER_CPU(u8, apic_eoi, 0);
+
+/* Our own copy of __test_and_clear_bit to make sure
+ * it is done with a single instruction */
+static inline int kvm_test_and_clear_bit(int nr, volatile u8 *addr)
+{
+	int oldbit;
+
+	asm volatile("btr %2,%1\n\t"
+		     "sbb %0,%0"
+		     : "=r" (oldbit),
+		       BITOP_ADDR_CONSTRAINT (*(volatile u8 *) (addr))
+		     : "Ir" (nr));
+	return oldbit;
+}
+
+static void (*kvm_guest_native_apic_write)(u32 reg, u32 val);
+static void kvm_guest_apic_write(u32 reg, u32 val)
+{
+	if (reg == APIC_EOI &&
+	    kvm_test_and_clear_bit(0, &__get_cpu_var(apic_eoi)))
+		return;
+
+	kvm_guest_native_apic_write(reg, val);
+}
+
 void __cpuinit kvm_guest_cpu_init(void)
 {
 	if (!kvm_para_available())
@@ -307,11 +335,18 @@ void __cpuinit kvm_guest_cpu_init(void)
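The fast path the patch proposes can be modeled in plain C. This is a simplified sketch, not the patch's code: the shared per-CPU byte, the counter, and the helper names are invented for illustration, and a portable test-and-clear stands in for the single-instruction `btr` asm.

```c
#include <stdbool.h>
#include <stdint.h>

static uint8_t apic_eoi;      /* shared guest/host byte (per-CPU in the patch) */
static int native_eoi_writes; /* counts real APIC_EOI writes, i.e. exits */

/* Test-and-clear bit 0 of the shared byte in one logical step. */
static bool test_and_clear_bit0(volatile uint8_t *addr)
{
    bool old = (*addr & 1) != 0;

    *addr &= (uint8_t)~1u;
    return old;
}

/* Host side: arm the "no EOI needed" bit before injecting a
 * (non-nested) interrupt. */
static void host_inject_irq(void)
{
    apic_eoi |= 1;
}

/* Guest side: EOI becomes a no-op when the bit was armed; the
 * clear tells the host the interrupt was handled (and lets it
 * detect nesting). */
static void guest_apic_eoi(void)
{
    if (test_and_clear_bit0(&apic_eoi))
        return;               /* skip the exiting APIC write */
    native_eoi_writes++;      /* fall back to the real EOI write */
}
```

Each armed interrupt thus costs one exit (injection) instead of two (injection plus EOI), which is where the "as much as half" estimate comes from.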
[PATCH 08/20] target-ppc: Add hooks for handling tcg and kvm limitations
From: David Gibson da...@gibson.dropbear.id.au

On target-ppc, our table of CPU types and features encodes the features as found on the hardware, regardless of whether these features are actually usable under TCG or KVM. We already have cases where the information from the cpu table must be fixed up to account for limitations in the emulation method we're using, e.g. TCG does not support the DFP and VSX instructions, and KVM needs different numbering of the CPUs in order to tell it the correct thread to core mappings. This patch cleans up these hacks to handle emulation limitations by consolidating them into a pair of functions specifically for the purpose.

Signed-off-by: David Gibson da...@gibson.dropbear.id.au
[AF: Style and typo fixes, rename new functions and drop ppc_def_t arg]
Signed-off-by: Andreas Färber afaer...@suse.de
---
 target-ppc/helper.c         |  9 ---
 target-ppc/kvm.c            | 14 +++
 target-ppc/kvm_ppc.h        |  5
 target-ppc/translate_init.c | 51 +-
 4 files changed, 54 insertions(+), 25 deletions(-)

diff --git a/target-ppc/helper.c b/target-ppc/helper.c
index f61b8b2..b34dcbe 100644
--- a/target-ppc/helper.c
+++ b/target-ppc/helper.c
@@ -3198,15 +3198,6 @@ CPUPPCState *cpu_ppc_init (const char *cpu_model)
     if (tcg_enabled()) {
         ppc_translate_init();
     }
-    /* Adjust cpu index for SMT */
-#if !defined(CONFIG_USER_ONLY)
-    if (kvm_enabled()) {
-        int smt = kvmppc_smt_threads();
-
-        env->cpu_index = (env->cpu_index / smp_threads) * smt
-            + (env->cpu_index % smp_threads);
-    }
-#endif /* !CONFIG_USER_ONLY */
     env->cpu_model_str = cpu_model;
     cpu_ppc_register_internal(env, def);
diff --git a/target-ppc/kvm.c b/target-ppc/kvm.c
index d929213..c09cc39 100644
--- a/target-ppc/kvm.c
+++ b/target-ppc/kvm.c
@@ -27,6 +27,7 @@
 #include "kvm.h"
 #include "kvm_ppc.h"
 #include "cpu.h"
+#include "cpus.h"
 #include "device_tree.h"
 #include "hw/sysbus.h"
 #include "hw/spapr.h"
@@ -938,6 +939,19 @@ const ppc_def_t *kvmppc_host_cpu_def(void)
     return spec;
 }
 
+int kvmppc_fixup_cpu(CPUPPCState *env)
+{
+    int smt;
+
+    /* Adjust cpu index for SMT */
+    smt = kvmppc_smt_threads();
+    env->cpu_index = (env->cpu_index / smp_threads) * smt
+        + (env->cpu_index % smp_threads);
+
+    return 0;
+}
+
+
 bool kvm_arch_stop_on_emulation_error(CPUPPCState *env)
 {
     return true;
diff --git a/target-ppc/kvm_ppc.h b/target-ppc/kvm_ppc.h
index 8f1267c..34ecad3 100644
--- a/target-ppc/kvm_ppc.h
+++ b/target-ppc/kvm_ppc.h
@@ -29,6 +29,7 @@ void *kvmppc_create_spapr_tce(uint32_t liobn, uint32_t window_size, int *pfd);
 int kvmppc_remove_spapr_tce(void *table, int pfd, uint32_t window_size);
 #endif /* !CONFIG_USER_ONLY */
 const ppc_def_t *kvmppc_host_cpu_def(void);
+int kvmppc_fixup_cpu(CPUPPCState *env);
 
 #else
 
@@ -95,6 +96,10 @@ static inline const ppc_def_t *kvmppc_host_cpu_def(void)
     return NULL;
 }
 
+static inline int kvmppc_fixup_cpu(CPUPPCState *env)
+{
+    return -1;
+}
 #endif
 
 #ifndef CONFIG_KVM
diff --git a/target-ppc/translate_init.c b/target-ppc/translate_init.c
index b1f8785..067e07e 100644
--- a/target-ppc/translate_init.c
+++ b/target-ppc/translate_init.c
@@ -9889,6 +9889,28 @@ static int gdb_set_spe_reg(CPUPPCState *env, uint8_t *mem_buf, int n)
     return 0;
 }
 
+static int ppc_fixup_cpu(CPUPPCState *env)
+{
+    /* TCG doesn't (yet) emulate some groups of instructions that
+     * are implemented on some otherwise supported CPUs (e.g. VSX
+     * and decimal floating point instructions on POWER7). We
+     * remove unsupported instruction groups from the cpu state's
+     * instruction masks and hope the guest can cope. For at
+     * least the pseries machine, the unavailability of these
+     * instructions can be advertised to the guest via the device
+     * tree. */
+    if ((env->insns_flags & ~PPC_TCG_INSNS)
+        || (env->insns_flags2 & ~PPC_TCG_INSNS2)) {
+        fprintf(stderr, "Warning: Disabling some instructions which are not "
+                "emulated by TCG (0x%" PRIx64 ", 0x%" PRIx64 ")\n",
+                env->insns_flags & ~PPC_TCG_INSNS,
+                env->insns_flags2 & ~PPC_TCG_INSNS2);
+    }
+    env->insns_flags &= PPC_TCG_INSNS;
+    env->insns_flags2 &= PPC_TCG_INSNS2;
+    return 0;
+}
+
 int cpu_ppc_register_internal (CPUPPCState *env, const ppc_def_t *def)
 {
     env->msr_mask = def->msr_mask;
@@ -9897,25 +9919,22 @@ int cpu_ppc_register_internal (CPUPPCState *env, const ppc_def_t *def)
     env->bus_model = def->bus_model;
     env->insns_flags = def->insns_flags;
     env->insns_flags2 = def->insns_flags2;
-    if (!kvm_enabled()) {
-        /* TCG doesn't (yet) emulate some groups of instructions that
-         * are implemented on some otherwise supported CPUs (e.g. VSX
-         * and decimal floating point instructions on POWER7). We
-         * remove unsupported instruction groups from the cpu state's
-         * instruction masks
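The SMT renumbering that `kvmppc_fixup_cpu()` performs is just integer arithmetic, and is easy to check in isolation: QEMU's contiguous vcpu indices are spread out so that each core's threads start at a multiple of the host SMT width. A standalone sketch (the function name is ours, not QEMU's):

```c
/* Map QEMU's contiguous cpu_index into KVM's SMT-aware numbering:
 * core number scales by the host SMT width, the thread offset
 * within the core is preserved. */
static int kvm_cpu_index(int cpu_index, int smp_threads, int smt)
{
    return (cpu_index / smp_threads) * smt + (cpu_index % smp_threads);
}
```

For example, with guest `smp_threads = 2` on a host with `smt = 4`, guest CPUs 0-3 map to 0, 1, 4, 5: the second core's threads jump to the next multiple of 4.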
Re: DOS VM problem with QEMU-KVM and newer kernels
On 15.04.2012 11:44, Avi Kivity wrote:
> On 04/12/2012 09:32 PM, Gerhard Wiesinger wrote:
>> [full DOS VM report and kernel version list quoted above - snipped]
> The trigger is probably commit f1c1da2bde712812a3e0f9a7a7ebe7a916a4b5f4 ("KVM: SVM: Keep intercepting task switching with NPT enabled"). Although it's not the patch's direct fault - it simply exposed an existing bug in kvm.
> Things to try:
> - revert the patch with a newer kernel
> - try 3.4-rc2 which has some task switch fixes from Kevin; if you want a Fedora kernel, use rawhide's [2]
> - post traces [1]
> Jan, Joerg, was an AMD erratum published for the bug?
> [1] http://www.linux-kvm.org/page/Tracing
> [2] http://mirrors.kernel.org/fedora/development/rawhide/x86_64/os/Packages/k/kernel-3.4.0-0.rc2.git2.1.fc18.x86_64.rpm

Hello Avi,

I tried a newer kernel, since that version is no longer available:
http://mirrors.kernel.org/fedora/development/rawhide/x86_64/os/Packages/k/kernel-3.4.0-0.rc2.git3.1.fc18.x86_64.rpm
But I wasn't successful. Still the same GP fault (but with 18A2 instead of 18A1):
GP at 0020 18A2 CODE 357C

yum install asciidoc udis86 udis86-devel
git clone git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/trace-cmd.git
cd trace-cmd
make
./trace-cmd record -b 2 -e kvm
./trace-cmd report

Very long output; what should I grep/trigger for? Thnx so far.

BTW: Where can I find old kernels like these (removed on upgrade :-( ):
kernel-2.6.41.9-1.fc15.x86_64.rpm
kernel-3.1.9-1.fc16.x86_64.rpm
kernel-3.2.10-3.fc16.x86_64.rpm
kernel-debug-2.6.41.9-1.fc15.x86_64

Ciao, Gerhard
Re: Performance of 40-way guest running 2.6.32-220 (RHEL6.2) vs. 3.3.1 OS
Rik van Riel riel at redhat.com writes:
> On 04/11/2012 01:21 PM, Chegu Vinod wrote:
>> Hello, while running AIM7 (workfile.high_systime) in a single 40-way (or a single 60-way) KVM guest I noticed pretty bad performance when the guest was booted with the 3.3.1 kernel compared to the same guest booted with the 2.6.32-220 (RHEL6.2) kernel. For the 40-way guest, Run A (2.6.32-220 kernel) performed nearly 9x better than Run B (3.3.1 kernel). In the case of the 60-way guest run, the older guest kernel was nearly 12x better! I turned on function tracing and found that there appears to be more time being spent around the lock code in the 3.3.1 guest when compared to the 2.6.32-220 guest.
> Looks like you may be running into the ticket spinlock code. During the early RHEL 6 days, Gleb came up with a patch to automatically disable ticket spinlocks when running inside a KVM guest.

Thanks for the pointer. Perhaps that is the issue. I did look up that old discussion thread.

> IIRC that patch got rejected upstream at the time, with upstream developers preferring to wait for a better solution. If such a better solution is not on its way upstream now (two years later), maybe we should just merge Gleb's patch upstream for the time being?

Also noticed a recent discussion thread (that originated in the Xen context): http://article.gmane.org/gmane.linux.kernel.virtualization/15078 Not yet sure if this recent discussion is also in some way related to the older one initiated by Gleb.

Thanks, Vinod
Re: [PATCH v2 10/16] KVM: MMU: fast check whether page is writable
On 04/15/2012 11:16 PM, Avi Kivity wrote:
> On 04/13/2012 01:14 PM, Xiao Guangrong wrote:
>> Using bit 1 (PTE_LIST_WP_BIT) in rmap to store the write-protect status, to avoid unnecessary shadow page walking
>> [patch hunks adding PTE_LIST_WP_BIT and host_page_write_protect() quoted in full above - snipped]
> Why is this needed, in addition to spte.SPTE_WRITE_PROTECT?

It is used to avoid unnecessary overhead on the fast page fault path when KSM is enabled. On the fast check path, it can see that the gfn is write-protected by the host, and then the fast page fault path is not taken.
Re: [PATCH v2 03/16] KVM: MMU: properly assert spte on rmap walking path
On 04/14/2012 10:15 AM, Takuya Yoshikawa wrote:
> On Fri, 13 Apr 2012 18:10:45 +0800 Xiao Guangrong xiaoguangr...@linux.vnet.ibm.com wrote:
>>  static u64 *rmap_get_next(struct rmap_iterator *iter)
>>  {
>> +	u64 *sptep = NULL;
>> +
>>  	if (iter->desc) {
>>  		if (iter->pos < PTE_LIST_EXT - 1) {
>> -			u64 *sptep;
>> -
>>  			++iter->pos;
>>  			sptep = iter->desc->sptes[iter->pos];
>>  			if (sptep)
>> -				return sptep;
>> +				goto exit;
>>  		}
>>
>>  		iter->desc = iter->desc->more;
>> @@ -1028,11 +1036,14 @@ static u64 *rmap_get_next(struct rmap_iterator *iter)
>>  		if (iter->desc) {
>>  			iter->pos = 0;
>>  			/* desc->sptes[0] cannot be NULL */
>> -			return iter->desc->sptes[iter->pos];
>> +			sptep = iter->desc->sptes[iter->pos];
>> +			goto exit;
>>  		}
>>  	}
>>
>> -	return NULL;
>> +exit:
>> +	WARN_ON(sptep && !is_shadow_present_pte(*sptep));
>> +	return sptep;
>>  }
> This will, probably, again force a rmap_get_next function call even with EPT/NPT: the CPU cannot skip it by branch prediction.

No, EPT/NPT also needs it.
Re: [PATCH v2 05/16] KVM: MMU: abstract spte write-protect
On 04/14/2012 10:26 AM, Takuya Yoshikawa wrote:
> On Fri, 13 Apr 2012 18:11:45 +0800 Xiao Guangrong xiaoguangr...@linux.vnet.ibm.com wrote:
>> +/* Return true if the spte is dropped. */
> The return value does not correspond with the function name, so it is confusing.

That is why I put the comment there. People may think that true means write protection has been done. Do you have a better name?
Re: [PATCH v2 10/16] KVM: MMU: fast check whether page is writable
On 04/14/2012 11:01 AM, Takuya Yoshikawa wrote:
> On Fri, 13 Apr 2012 18:14:26 +0800 Xiao Guangrong xiaoguangr...@linux.vnet.ibm.com wrote:
>> Using bit 1 (PTE_LIST_WP_BIT) in rmap to store the write-protect status, to avoid unnecessary shadow page walking
>> ...
>> -#define PTE_LIST_DESC		(0x1ull)
>> +#define PTE_LIST_DESC_BIT	0
>> +#define PTE_LIST_WP_BIT	1
> _BIT ?

What is the problem?

>> @@ -2291,9 +2310,15 @@ static int mmu_need_write_protect(struct kvm_vcpu *vcpu, gfn_t gfn,
>>  {
>>  	struct kvm_mmu_page *s;
>>  	struct hlist_node *node;
>> +	unsigned long *rmap;
>>  	bool need_unsync = false;
>>
>> +	rmap = gfn_to_rmap(vcpu->kvm, gfn, PT_PAGE_TABLE_LEVEL);
> Please use consistent variable names. In other parts of this patch, you are using rmapp for this.

Okay.
Re: [PATCH v2 07/16] KVM: MMU: introduce for_each_pte_list_spte
On 04/14/2012 10:44 AM, Takuya Yoshikawa wrote:
> On Fri, 13 Apr 2012 18:12:41 +0800 Xiao Guangrong xiaoguangr...@linux.vnet.ibm.com wrote:
>> It is used to walk all the sptes of the specified pte_list; after this, the code of pte_list_walk can be removed. And it can restart the walking automatically if the spte is zapped.
> Well, I want to ask two questions:
> - why do you prefer pte_list_* naming to rmap_*? (not a big issue but just curious)

pte_list is a common infrastructure for both the parent-list and rmap.

> - Are you sure the whole indirection by this patch will not introduce any regression? (not restricted to get_dirty)

I tested it with kernbench; no regression was found. It is not a problem since the iter and spte should be in the cache.
Re: [PATCH v2] KVM: Avoid zapping unrelated shadows in __kvm_set_memory_region()
On 04/14/2012 09:12 AM, Takuya Yoshikawa wrote:
> Hi,
>
> On Wed, 11 Apr 2012 11:11:07 +0800
> Xiao Guangrong xiaoguangr...@linux.vnet.ibm.com wrote:
>
>>>  restart:
>>> -	list_for_each_entry_safe(sp, node, &kvm->arch.active_mmu_pages, link)
>>> -		if (kvm_mmu_prepare_zap_page(kvm, sp, &invalid_list))
>>> -			goto restart;
>>> +	zapped = 0;
>>> +	list_for_each_entry_safe(sp, node, &kvm->arch.active_mmu_pages, link) {
>>> +		if ((slot >= 0) && !test_bit(slot, sp->slot_bitmap))
>>> +			continue;
>>> +
>>> +		zapped |= kvm_mmu_prepare_zap_page(kvm, sp, &invalid_list);
>>
>> You should goto restart here like the original code; also, the safe
>> version of list_for_each is not needed.
>
> Thank you for looking into this part. I understand that we can
> eliminate _safe in the original implementation.
>
> Can you tell me the reason why we should goto restart immediately
> here? Is it a performance issue, or a correctness issue?

kvm_mmu_prepare_zap_page may remove many sps from the
kvm->arch.active_mmu_pages list; that means the next node cached by
list_for_each_entry_safe may become invalid.
Re: [PATCH v2 00/16] KVM: MMU: fast page fault
On 04/14/2012 11:37 AM, Takuya Yoshikawa wrote:
> On Fri, 13 Apr 2012 18:05:29 +0800
> Xiao Guangrong xiaoguangr...@linux.vnet.ibm.com wrote:
>
>> Thanks to Avi's and Marcelo's review, I have simplified the whole
>> thing in this version:
>> - it only fixes the page fault with PFEC.P = 1 && PFEC.W = 0, which
>>   means the unlocked set_spte path can be dropped.
>> - it only fixes the page fault caused by dirty-log.
>>
>> In this version, all the information we need is in the spte, in the
>> SPTE_ALLOW_WRITE bit and the SPTE_WRITE_PROTECT bit:
>> - SPTE_ALLOW_WRITE is set if the gpte is writable and the pfn pointed
>>   to by the spte is writable on the host.
>> - SPTE_WRITE_PROTECT is set if the spte is write-protected by shadow
>>   page table protection.
>>
>> All these bits can be protected by cmpxchg; now everything is much
>> simpler than before. :)
>
> Well, could you remove the cleanup patches not needed for lock-less
> from this patch series? I want to see them separately. Or was
> everything needed for lock-less?

The cleanup patches do the preparatory work for fast page fault, so the
later patches are easy to implement; for example, the
for_each_spte_rmap patches let the "store more bits in rmap" patch make
only a small change, since spte_list_walk is removed.

>> Performance test:
>>
>> autotest migration:
>> (Host: Intel(R) Xeon(R) CPU X5690 @ 3.47GHz * 12 + 32G)
>
> Please explain what this test result means, not just numbers. There
> are many aspects:
> - how fast migration can converge/complete
> - how fast programs inside the guest can run during migration:
>   -- throughput
>   -- latency
> - ...

The result is rather straightforward; I think an explanation is not
needed.

> I think lock-less will reduce latency a lot, but I am not sure about
> convergence: why did it become fast? Is it hard to understand?

It is faster since it can run in parallel.
>> - For ept:
>>
>> Before:
>>          smp2.Fedora.16.64.migrate
>> Times   .unix   .with_autotest.dbench.unix   total
>>  1       104     214                          323
>>  2        68     238                          310
>>  3        68     242                          314
>>
>> After:
>>          smp2.Fedora.16.64.migrate
>> Times   .unix   .with_autotest.dbench.unix   total
>>  1       101     190                          295
>>  2        67     188                          259
>>  3        66     217                          289
>
> As discussed on the v1 threads, the main goal of this lock-less work
> should be the elimination of mmu_lock contention, so what we should
> measure is latency.

I think the migration-time test is fair enough to show the effect.