Hello, [ I already sent it once as [EMAIL PROTECTED] but it didn't go through for whatever reason, trying again from private email, hope there won't be dups ]
My worst longstanding problem with KVM is that as the uptime of my host system increased, my opensuse guest images started to destabilize and lockup at boot. The weird thing was that fresh after boot everything was always perfectly ok, so I thought it was rmmod/insmod or some other sticky effect on the CPU after restarting the guest a few times that triggered the crash. Furthermore if I loaded the cpu a lot (like with a while :; do true;done), the crash would magically disappear. Decreasing cpu frequency and timings didn't help. Debugging wasn't trivial because it required a certain uptime and it didn't always crash. So I once debugged this more aggressively I figured out KVM was ok, it was the guest that crashed in the tsc clocksource because tsc wasn't monotone. guest was looping in an infinite loop with irq disabled. So I tried to pass "notsc" and that fixed the crash just fine. Initially I thought it was the tsc_offset logic being wrong but then I figured out that the vcpu_put/load wasn't always executed, this bugcheck triggers with current git and so I recommend to apply this to kvm.git to avoid similar nasty hard-to-detect bugs in the future (Avi says vmx would crash hard in such a condition, svm is much simpler and it somewhat survives the lack of sched_in and only crashes the guest due to not monotone tsc): Signed-off-by: Andrea Arcangeli <[EMAIL PROTECTED]> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index ac876ec..26372fa 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -742,6 +742,7 @@ void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu) void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu) { + WARN_ON(vcpu->cpu != smp_processor_id()); kvm_x86_ops->vcpu_put(vcpu); kvm_put_guest_fpu(vcpu); } So trying to understand why the ->cpu was wrong, I looked into the preempt notifiers emulation, and it looked quite fragile without a real sched_in hook. I figured out I could provide a real sched_in hook by loading the proper values in the tsk->thread.debugreg[0/7]. Initially I got the hooking points out of objdump -d vmlinux, but Avi preferred no dependency on the vmlinux and he suggested to try to find the sched_in hook in the stack. So that's what I implemented now and this should provide real robustness to the out of tree module compiled against binary kernel images with CONFIG_KVM=n. I tried to be compatible with all kernels down to 2.6.5 but only 2.6.2x host is tested and only on 64bit and only on SVM (no vmx system around here at all). This fixes my longstanding KVM instability and "-smp 2" now works flawlessy with svm too! -smp 2 -snapshot crashes in qemu userland but that's not kernel related, must be some thread mutex lock recursion or lock inversion in the qcow cow code. Removing -snapshot make -smp 2 stable. Multiple guests UP and SMP seems stable too. To reproduce my crash easily without waiting ages for the two tsc to deviate with an error larger than the number of cycles it takes for a CPU migration, run write_tsc(0,0) in kernel mode (like in the svm.c init function and then insmod kvm-amd; rmmod kvm-amd and then remove write_tsc and recompile kvm-amd). #include <stdio.h> main() { unsigned long x1, x2; unsigned int a, d; asm volatile("rdtsc" : "=a" (a), "=d" (d)); x1 = ((unsigned long)d << 32) | a; for (;;) { asm volatile("rdtsc" : "=a" (a), "=d" (d)); x2 = ((unsigned long)d << 32) | a; if (x2 < x1) printf("error %Ld\n", x1-x2); else printf("good %Ld\n", x1-x2); x1 = x2; } } (the "good.." printf can be commented out if you run this on the host, but it better stay if you run this in the guest because it helps rescheduling X with SDL that increases the frequency of the CPU-switches of the kvm task) So in short with the below fix applied, after a write_tsc(0,0), the UP guest never return any error anymore. Previously it would return frequent errors because sched_in wasn't properly invoked by svm.c and it would crash at boot every single time after a write_tsc(0,0). The SMP guest of course still returns TSC errors but that's ok, the smp host also return TSC errors, that's ok, it's only the UP guest that is forbidden to have a not monotone TSC or the guest would crash like it happened to me. I'm unsure if special_reload_db7 is needed at all, but it certainly can't hurt so it's the only hack I left. Finally I can enjoy KVM stability too ;). If you always compiled your host kernel with CONFIG_KVM=y on a recent kernels including the preempt-notifiers, you could never run into this. If you compile your host kernel with CONFIG_KVM=n please try to test this. Signed-off-by: Andrea Arcangeli <[EMAIL PROTECTED]> diff --git a/kernel/hack-module.awk b/kernel/hack-module.awk index 7993aa2..5187c96 100644 --- a/kernel/hack-module.awk +++ b/kernel/hack-module.awk @@ -24,32 +24,6 @@ printf("MODULE_INFO(version, \"%s\");\n", version) } -/^static unsigned long vmcs_readl/ { - in_vmcs_read = 1 -} - -/ASM_VMX_VMREAD_RDX_RAX/ && in_vmcs_read { - printf("\tstart_special_insn();\n") -} - -/return/ && in_vmcs_read { - printf("\tend_special_insn();\n"); - in_vmcs_read = 0 -} - -/^static void vmcs_writel/ { - in_vmcs_write = 1 -} - -/ASM_VMX_VMWRITE_RAX_RDX/ && in_vmcs_write { - printf("\tstart_special_insn();\n") -} - -/if/ && in_vmcs_write { - printf("\tend_special_insn();\n"); - in_vmcs_write = 0 -} - /^static void vmx_load_host_state/ { vmx_load_host_state = 1 } @@ -74,15 +48,6 @@ print "\tspecial_reload_dr7();" } -/static void vcpu_put|static int __vcpu_run|static struct kvm_vcpu \*vmx_create_vcpu/ { - in_tricky_func = 1 -} - -/preempt_disable|get_cpu/ && in_tricky_func { - printf("\tin_special_section();\n"); - in_tricky_func = 0 -} - /unsigned long flags;/ && vmx_load_host_state { print "\tunsigned long gsbase;" } @@ -90,4 +55,3 @@ /local_irq_save/ && vmx_load_host_state { print "\t\tgsbase = vmcs_readl(HOST_GS_BASE);" } - diff --git a/kernel/preempt.c b/kernel/preempt.c index 8bb0405..6e57277 100644 --- a/kernel/preempt.c +++ b/kernel/preempt.c @@ -6,8 +6,6 @@ static DEFINE_SPINLOCK(pn_lock); static LIST_HEAD(pn_list); -static DEFINE_PER_CPU(int, notifier_enabled); -static DEFINE_PER_CPU(struct task_struct *, last_tsk); #define dprintk(fmt) do { \ if (0) \ @@ -15,59 +13,88 @@ static DEFINE_PER_CPU(struct task_struct *, last_tsk); current->pid, raw_smp_processor_id()); \ } while (0) -static void preempt_enable_notifiers(void) +static void preempt_enable_sched_out_notifiers(void) { - int cpu = raw_smp_processor_id(); - - if (per_cpu(notifier_enabled, cpu)) - return; - - dprintk("\n"); - per_cpu(notifier_enabled, cpu) = 1; asm volatile ("mov %0, %%db0" : : "r"(schedule)); - asm volatile ("mov %0, %%db7" : : "r"(0x702ul)); + asm volatile ("mov %0, %%db7" : : "r"(0x701ul)); +#ifdef CONFIG_X86_64 + current->thread.debugreg7 = 0ul; +#else + current->thread.debugreg[7] = 0ul; +#endif +#ifdef TIF_DEBUG + clear_tsk_thread_flag(current, TIF_DEBUG); +#endif +} + +static void preempt_enable_sched_in_notifiers(void * addr) +{ + asm volatile ("mov %0, %%db0" : : "r"(addr)); + asm volatile ("mov %0, %%db7" : : "r"(0x701ul)); +#ifdef CONFIG_X86_64 + current->thread.debugreg0 = (unsigned long) addr; + current->thread.debugreg7 = 0x701ul; +#else + current->thread.debugreg[0] = (unsigned long) addr; + current->thread.debugreg[7] = 0x701ul; +#endif +#ifdef TIF_DEBUG + set_tsk_thread_flag(current, TIF_DEBUG); +#endif } void special_reload_dr7(void) { - asm volatile ("mov %0, %%db7" : : "r"(0x702ul)); + asm volatile ("mov %0, %%db7" : : "r"(0x701ul)); } EXPORT_SYMBOL_GPL(special_reload_dr7); -static void preempt_disable_notifiers(void) +static void __preempt_disable_notifiers(void) { - int cpu = raw_smp_processor_id(); - - if (!per_cpu(notifier_enabled, cpu)) - return; + asm volatile ("mov %0, %%db7" : : "r"(0ul)); +} - dprintk("\n"); - per_cpu(notifier_enabled, cpu) = 0; - asm volatile ("mov %0, %%db7" : : "r"(0x400ul)); +static void preempt_disable_notifiers(void) +{ + __preempt_disable_notifiers(); +#ifdef CONFIG_X86_64 + current->thread.debugreg7 = 0ul; +#else + current->thread.debugreg[7] = 0ul; +#endif +#ifdef TIF_DEBUG + clear_tsk_thread_flag(current, TIF_DEBUG); +#endif } -static void __attribute__((used)) preempt_notifier_trigger(void) +static void fastcall __attribute__((used)) preempt_notifier_trigger(void *** ip) { struct preempt_notifier *pn; int cpu = raw_smp_processor_id(); int found = 0; - unsigned long flags; dprintk(" - in\n"); //dump_stack(); - spin_lock_irqsave(&pn_lock, flags); + spin_lock(&pn_lock); list_for_each_entry(pn, &pn_list, link) if (pn->tsk == current) { found = 1; break; } - spin_unlock_irqrestore(&pn_lock, flags); - preempt_disable_notifiers(); + spin_unlock(&pn_lock); + if (found) { - dprintk("sched_out\n"); - pn->ops->sched_out(pn, NULL); - per_cpu(last_tsk, cpu) = NULL; - } + if ((void *) *ip != schedule) { + dprintk("sched_in\n"); + preempt_enable_sched_out_notifiers(); + pn->ops->sched_in(pn, cpu); + } else { + dprintk("sched_out\n"); + preempt_enable_sched_in_notifiers(**(ip+3)); + pn->ops->sched_out(pn, NULL); + } + } else + __preempt_disable_notifiers(); dprintk(" - out\n"); } @@ -104,6 +131,11 @@ asm ("pn_int1_handler: \n\t" "pop " TMP " \n\t" "jz .Lnotme \n\t" SAVE_REGS "\n\t" +#ifdef CONFIG_X86_64 + "leaq 120(%rsp),%rdi\n\t" +#else + "leal 32(%esp),%eax\n\t" +#endif "call preempt_notifier_trigger \n\t" RESTORE_REGS "\n\t" #ifdef CONFIG_X86_64 @@ -121,75 +153,28 @@ asm ("pn_int1_handler: \n\t" #endif ); -void in_special_section(void) -{ - struct preempt_notifier *pn; - int cpu = raw_smp_processor_id(); - int found = 0; - unsigned long flags; - - if (per_cpu(last_tsk, cpu) == current) - return; - - dprintk(" - in\n"); - spin_lock_irqsave(&pn_lock, flags); - list_for_each_entry(pn, &pn_list, link) - if (pn->tsk == current) { - found = 1; - break; - } - spin_unlock_irqrestore(&pn_lock, flags); - if (found) { - dprintk("\n"); - per_cpu(last_tsk, cpu) = current; - pn->ops->sched_in(pn, cpu); - preempt_enable_notifiers(); - } - dprintk(" - out\n"); -} -EXPORT_SYMBOL_GPL(in_special_section); - -void start_special_insn(void) -{ - preempt_disable(); - in_special_section(); -} -EXPORT_SYMBOL_GPL(start_special_insn); - -void end_special_insn(void) -{ - preempt_enable(); -} -EXPORT_SYMBOL_GPL(end_special_insn); - void preempt_notifier_register(struct preempt_notifier *notifier) { - int cpu = get_cpu(); unsigned long flags; dprintk(" - in\n"); spin_lock_irqsave(&pn_lock, flags); - preempt_enable_notifiers(); + preempt_enable_sched_out_notifiers(); notifier->tsk = current; list_add(¬ifier->link, &pn_list); spin_unlock_irqrestore(&pn_lock, flags); - per_cpu(last_tsk, cpu) = current; - put_cpu(); dprintk(" - out\n"); } void preempt_notifier_unregister(struct preempt_notifier *notifier) { - int cpu = get_cpu(); unsigned long flags; dprintk(" - in\n"); spin_lock_irqsave(&pn_lock, flags); list_del(¬ifier->link); spin_unlock_irqrestore(&pn_lock, flags); - per_cpu(last_tsk, cpu) = NULL; preempt_disable_notifiers(); - put_cpu(); dprintk(" - out\n"); } @@ -238,7 +223,16 @@ void preempt_notifier_sys_init(void) static void do_disable(void *blah) { - preempt_disable_notifiers(); +#ifdef TIF_DEBUG + if (!test_tsk_thread_flag(current, TIF_DEBUG)) +#else +#ifdef CONFIG_X86_64 + if (!current->thread.debugreg7) +#else + if (!current->thread.debugreg[7]) +#endif +#endif + __preempt_disable_notifiers(); } void preempt_notifier_sys_exit(void) ------------------------------------------------------------------------- This SF.net email is sponsored by: Microsoft Defy all challenges. Microsoft(R) Visual Studio 2005. http://clk.atdmt.com/MRT/go/vse0120000070mrt/direct/01/ _______________________________________________ kvm-devel mailing list kvm-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/kvm-devel