Hello,

[ I already sent this once from [EMAIL PROTECTED] but it didn't go
  through for whatever reason; trying again from my private email,
  hope there won't be dups ]

My worst longstanding problem with KVM has been that, as the uptime of
my host system increased, my OpenSUSE guest images started to
destabilize and lock up at boot. The weird thing was that fresh after
a host boot everything was always perfectly ok, so I thought it was
rmmod/insmod or some other sticky effect on the CPU after restarting
the guest a few times that triggered the crash. Furthermore, if I
loaded the CPU a lot (like with a while :; do true; done), the crash
would magically disappear. Decreasing the CPU frequency and timings
didn't help. Debugging wasn't trivial because reproducing it required
a certain uptime, and it didn't always crash.

Once I debugged this more aggressively, I figured out that KVM itself
was ok: it was the guest that crashed in the tsc clocksource, because
the TSC wasn't monotonic. The guest was stuck in an infinite loop with
irqs disabled. Passing "notsc" to the guest fixed the crash just fine.
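
To see why a backwards-jumping TSC is fatal there, here is an
illustrative sketch of the failure mode (my own minimal example, not
the actual guest code path): a TSC-based delay loop of this shape
never terminates once the TSC jumps back below its target, and with
irqs disabled nothing can rescue the CPU.

#include <stdint.h>

/* read the full 64bit TSC (x86) */
static inline uint64_t rdtsc_ll(void)
{
        unsigned int lo, hi;
        asm volatile("rdtsc" : "=a" (lo), "=d" (hi));
        return ((uint64_t)hi << 32) | lo;
}

/* spin until "cycles" TSC ticks have elapsed; if the TSC moves
   backwards below "end" while we spin, this loops practically
   forever, which is the kind of lockup described above */
static void tsc_delay(uint64_t cycles)
{
        uint64_t end = rdtsc_ll() + cycles;

        while (rdtsc_ll() < end)
                ;
}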

Initially I thought the tsc_offset logic was wrong, but then I figured
out that vcpu_put/vcpu_load wasn't always being executed. The bugcheck
below triggers with current git, so I recommend applying it to kvm.git
to avoid similar nasty hard-to-detect bugs in the future (Avi says vmx
would crash hard in such a condition; svm is much simpler and somewhat
survives the lack of sched_in, only crashing the guest due to the
non-monotonic TSC):

Signed-off-by: Andrea Arcangeli <[EMAIL PROTECTED]>

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index ac876ec..26372fa 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -742,6 +742,7 @@ void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
 
 void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu)
 {
+       WARN_ON(vcpu->cpu != smp_processor_id());
        kvm_x86_ops->vcpu_put(vcpu);
        kvm_put_guest_fpu(vcpu);
 }



Trying to understand why ->cpu was wrong, I looked into the preempt
notifier emulation, and it looked quite fragile without a real
sched_in hook. I figured out I could provide a real sched_in hook by
loading the proper values into tsk->thread.debugreg[0/7]. Initially I
got the hooking points out of objdump -d vmlinux, but Avi preferred no
dependency on the vmlinux, and he suggested finding the sched_in hook
address on the stack instead. That's what I've implemented now, and it
should provide real robustness to the out-of-tree module compiled
against binary kernel images with CONFIG_KVM=n. I tried to stay
compatible with all kernels down to 2.6.5, but only a 2.6.2x host is
tested, and only on 64bit and only on SVM (no vmx system around here
at all).
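
Reduced to a sketch, the trick is this (it mirrors what the patch
below does; the helper name is mine): point db0 at the address you
want to trap on and enable breakpoint 0 in db7, so the task takes an
int1 the next time it executes that address.

/* arm hardware breakpoint 0 as an execution breakpoint at "addr";
   0x701 is the same dr7 value the patch loads: the local-enable bit
   for breakpoint 0 plus the always-set control bits, with len/type 0
   selecting an instruction breakpoint */
static inline void arm_exec_breakpoint(void *addr)
{
        asm volatile ("mov %0, %%db0" : : "r" (addr));
        asm volatile ("mov %0, %%db7" : : "r" (0x701ul));
}

Armed on schedule(), it fires on the way out of the task (sched_out);
re-armed on schedule()'s return address, dug out of the stack in the
int1 handler, it fires again when the task is scheduled back in
(sched_in).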

This fixes my longstanding KVM instability, and "-smp 2" now works
flawlessly with svm too! -smp 2 -snapshot still crashes in qemu
userland, but that's not kernel related; it must be some thread mutex
recursion or lock inversion in the qcow copy-on-write code. Removing
-snapshot makes -smp 2 stable. Multiple guests, UP and SMP, seem
stable too.

To reproduce my crash easily, without waiting ages for the two TSCs to
deviate by more than the number of cycles a CPU migration takes, run
write_tsc(0,0) in kernel mode (for example in the svm.c init function:
insmod kvm-amd, rmmod kvm-amd, then remove the write_tsc and recompile
kvm-amd).
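
If you'd rather not edit svm.c, the same hack works as a throwaway
module; this is my own sketch (tsc_skew.c is an invented name;
write_tsc() is the kernel's wrmsr helper for the TSC MSR on kernels of
this era):

#include <linux/module.h>
#include <asm/msr.h>

/* zero the TSC of whichever CPU runs the module init, so the
   counters immediately diverge by the whole machine uptime */
static int __init tsc_skew_init(void)
{
        write_tsc(0, 0);
        return 0;
}

static void __exit tsc_skew_exit(void)
{
}

module_init(tsc_skew_init);
module_exit(tsc_skew_exit);
MODULE_LICENSE("GPL");

Either way, watch for monotonicity errors with this program: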

#include <stdio.h>

int main(void)
{
        unsigned long x1, x2;
        unsigned int a, d;

        /* rdtsc returns the low 32 bits in eax, the high 32 in edx */
        asm volatile("rdtsc" : "=a" (a), "=d" (d));
        x1 = ((unsigned long)d << 32) | a;
        for (;;) {
                asm volatile("rdtsc" : "=a" (a), "=d" (d));
                x2 = ((unsigned long)d << 32) | a;
                if (x2 < x1)
                        printf("error %lu\n", x1 - x2);
                else
                        printf("good  %lu\n", x2 - x1);
                x1 = x2;
        }
        return 0;
}

(the "good.." printf can be commented out if you run this on the host,
but it better stay if you run this in the guest because it helps
rescheduling X with SDL that increases the frequency of the
CPU-switches of the kvm task)

In short, with the fix below applied, after a write_tsc(0,0) the UP
guest never returns any error anymore. Previously it would return
frequent errors, because sched_in wasn't properly invoked by svm.c,
and it would crash at boot every single time after a write_tsc(0,0).

The SMP guest of course still returns TSC errors, but that's ok: the
SMP host returns TSC errors too. It's only the UP guest that must
never see a non-monotonic TSC, or it crashes the way mine did.

I'm unsure if special_reload_dr7 is needed at all, but it certainly
can't hurt, so it's the only hack I left in.

Finally I can enjoy KVM stability too ;). If you always compiled your
host kernel with CONFIG_KVM=y on recent kernels that include the
preempt notifiers, you could never run into this. If you compile your
host kernel with CONFIG_KVM=n, please test this.

Signed-off-by: Andrea Arcangeli <[EMAIL PROTECTED]>

diff --git a/kernel/hack-module.awk b/kernel/hack-module.awk
index 7993aa2..5187c96 100644
--- a/kernel/hack-module.awk
+++ b/kernel/hack-module.awk
@@ -24,32 +24,6 @@
     printf("MODULE_INFO(version, \"%s\");\n", version)
 }
 
-/^static unsigned long vmcs_readl/ {
-    in_vmcs_read = 1
-}
-
-/ASM_VMX_VMREAD_RDX_RAX/ && in_vmcs_read {
-    printf("\tstart_special_insn();\n")
-}
-
-/return/ && in_vmcs_read {
-    printf("\tend_special_insn();\n");
-    in_vmcs_read = 0
-}
-
-/^static void vmcs_writel/ {
-    in_vmcs_write = 1
-}
-
-/ASM_VMX_VMWRITE_RAX_RDX/ && in_vmcs_write {
-    printf("\tstart_special_insn();\n")
-}
-
-/if/ && in_vmcs_write {
-    printf("\tend_special_insn();\n");
-    in_vmcs_write = 0
-}
-
 /^static void vmx_load_host_state/ {
     vmx_load_host_state = 1
 }
@@ -74,15 +48,6 @@
     print "\tspecial_reload_dr7();"
 }
 
-/static void vcpu_put|static int __vcpu_run|static struct kvm_vcpu \*vmx_create_vcpu/ {
-    in_tricky_func = 1
-}
-
-/preempt_disable|get_cpu/ && in_tricky_func {
-    printf("\tin_special_section();\n");
-    in_tricky_func = 0
-}
-
 /unsigned long flags;/ &&  vmx_load_host_state {
     print "\tunsigned long gsbase;"
 }
@@ -90,4 +55,3 @@
 /local_irq_save/ &&  vmx_load_host_state {
     print "\t\tgsbase = vmcs_readl(HOST_GS_BASE);"
 }
-
diff --git a/kernel/preempt.c b/kernel/preempt.c
index 8bb0405..6e57277 100644
--- a/kernel/preempt.c
+++ b/kernel/preempt.c
@@ -6,8 +6,6 @@
 
 static DEFINE_SPINLOCK(pn_lock);
 static LIST_HEAD(pn_list);
-static DEFINE_PER_CPU(int, notifier_enabled);
-static DEFINE_PER_CPU(struct task_struct *, last_tsk);
 
 #define dprintk(fmt) do {                                              \
                if (0)                                                  \
@@ -15,59 +13,88 @@ static DEFINE_PER_CPU(struct task_struct *, last_tsk);
                               current->pid, raw_smp_processor_id());   \
        } while (0)
 
-static void preempt_enable_notifiers(void)
+static void preempt_enable_sched_out_notifiers(void)
 {
-       int cpu = raw_smp_processor_id();
-
-       if (per_cpu(notifier_enabled, cpu))
-               return;
-
-       dprintk("\n");
-       per_cpu(notifier_enabled, cpu) = 1;
        asm volatile ("mov %0, %%db0" : : "r"(schedule));
-       asm volatile ("mov %0, %%db7" : : "r"(0x702ul));
+       asm volatile ("mov %0, %%db7" : : "r"(0x701ul));
+#ifdef CONFIG_X86_64
+       current->thread.debugreg7 = 0ul;
+#else
+       current->thread.debugreg[7] = 0ul;
+#endif
+#ifdef TIF_DEBUG
+       clear_tsk_thread_flag(current, TIF_DEBUG);
+#endif
+}
+
+static void preempt_enable_sched_in_notifiers(void * addr)
+{
+       asm volatile ("mov %0, %%db0" : : "r"(addr));
+       asm volatile ("mov %0, %%db7" : : "r"(0x701ul));
+#ifdef CONFIG_X86_64
+       current->thread.debugreg0 = (unsigned long) addr;
+       current->thread.debugreg7 = 0x701ul;
+#else
+       current->thread.debugreg[0] = (unsigned long) addr;
+       current->thread.debugreg[7] = 0x701ul;
+#endif
+#ifdef TIF_DEBUG
+       set_tsk_thread_flag(current, TIF_DEBUG);
+#endif
 }
 
 void special_reload_dr7(void)
 {
-       asm volatile ("mov %0, %%db7" : : "r"(0x702ul));
+       asm volatile ("mov %0, %%db7" : : "r"(0x701ul));
 }
 EXPORT_SYMBOL_GPL(special_reload_dr7);
 
-static void preempt_disable_notifiers(void)
+static void __preempt_disable_notifiers(void)
 {
-       int cpu = raw_smp_processor_id();
-
-       if (!per_cpu(notifier_enabled, cpu))
-               return;
+       asm volatile ("mov %0, %%db7" : : "r"(0ul));
+}
 
-       dprintk("\n");
-       per_cpu(notifier_enabled, cpu) = 0;
-       asm volatile ("mov %0, %%db7" : : "r"(0x400ul));
+static void preempt_disable_notifiers(void)
+{
+       __preempt_disable_notifiers();
+#ifdef CONFIG_X86_64
+       current->thread.debugreg7 = 0ul;
+#else
+       current->thread.debugreg[7] = 0ul;
+#endif
+#ifdef TIF_DEBUG
+       clear_tsk_thread_flag(current, TIF_DEBUG);
+#endif
 }
 
-static void  __attribute__((used)) preempt_notifier_trigger(void)
+static void fastcall  __attribute__((used)) preempt_notifier_trigger(void ***ip)
 {
        struct preempt_notifier *pn;
        int cpu = raw_smp_processor_id();
        int found = 0;
-       unsigned long flags;
 
        dprintk(" - in\n");
        //dump_stack();
-       spin_lock_irqsave(&pn_lock, flags);
+       spin_lock(&pn_lock);
        list_for_each_entry(pn, &pn_list, link)
                if (pn->tsk == current) {
                        found = 1;
                        break;
                }
-       spin_unlock_irqrestore(&pn_lock, flags);
-       preempt_disable_notifiers();
+       spin_unlock(&pn_lock);
+
        if (found) {
-               dprintk("sched_out\n");
-               pn->ops->sched_out(pn, NULL);
-               per_cpu(last_tsk, cpu) = NULL;
-       }
+               if ((void *) *ip != schedule) {
+                       dprintk("sched_in\n");
+                       preempt_enable_sched_out_notifiers();
+                       pn->ops->sched_in(pn, cpu);
+               } else {
+                       dprintk("sched_out\n");
+                       preempt_enable_sched_in_notifiers(**(ip+3));
+                       pn->ops->sched_out(pn, NULL);
+               }
+       } else
+               __preempt_disable_notifiers();
        dprintk(" - out\n");
 }
 
@@ -104,6 +131,11 @@ asm ("pn_int1_handler:  \n\t"
      "pop "  TMP " \n\t"
      "jz .Lnotme \n\t"
      SAVE_REGS "\n\t"
+#ifdef CONFIG_X86_64
+     "leaq 120(%rsp),%rdi\n\t"
+#else
+     "leal 32(%esp),%eax\n\t"
+#endif
      "call preempt_notifier_trigger \n\t"
      RESTORE_REGS "\n\t"
 #ifdef CONFIG_X86_64
@@ -121,75 +153,28 @@ asm ("pn_int1_handler:  \n\t"
 #endif
        );
 
-void in_special_section(void)
-{
-       struct preempt_notifier *pn;
-       int cpu = raw_smp_processor_id();
-       int found = 0;
-       unsigned long flags;
-
-       if (per_cpu(last_tsk, cpu) == current)
-               return;
-
-       dprintk(" - in\n");
-       spin_lock_irqsave(&pn_lock, flags);
-       list_for_each_entry(pn, &pn_list, link)
-               if (pn->tsk == current) {
-                       found = 1;
-                       break;
-               }
-       spin_unlock_irqrestore(&pn_lock, flags);
-       if (found) {
-               dprintk("\n");
-               per_cpu(last_tsk, cpu) = current;
-               pn->ops->sched_in(pn, cpu);
-               preempt_enable_notifiers();
-       }
-       dprintk(" - out\n");
-}
-EXPORT_SYMBOL_GPL(in_special_section);
-
-void start_special_insn(void)
-{
-       preempt_disable();
-       in_special_section();
-}
-EXPORT_SYMBOL_GPL(start_special_insn);
-
-void end_special_insn(void)
-{
-       preempt_enable();
-}
-EXPORT_SYMBOL_GPL(end_special_insn);
-
 void preempt_notifier_register(struct preempt_notifier *notifier)
 {
-       int cpu = get_cpu();
        unsigned long flags;
 
        dprintk(" - in\n");
        spin_lock_irqsave(&pn_lock, flags);
-       preempt_enable_notifiers();
+       preempt_enable_sched_out_notifiers();
        notifier->tsk = current;
        list_add(&notifier->link, &pn_list);
        spin_unlock_irqrestore(&pn_lock, flags);
-       per_cpu(last_tsk, cpu) = current;
-       put_cpu();
        dprintk(" - out\n");
 }
 
 void preempt_notifier_unregister(struct preempt_notifier *notifier)
 {
-       int cpu = get_cpu();
        unsigned long flags;
 
        dprintk(" - in\n");
        spin_lock_irqsave(&pn_lock, flags);
        list_del(&notifier->link);
        spin_unlock_irqrestore(&pn_lock, flags);
-       per_cpu(last_tsk, cpu) = NULL;
        preempt_disable_notifiers();
-       put_cpu();
        dprintk(" - out\n");
 }
 
@@ -238,7 +223,16 @@ void preempt_notifier_sys_init(void)
 
 static void do_disable(void *blah)
 {
-       preempt_disable_notifiers();
+#ifdef TIF_DEBUG
+       if (!test_tsk_thread_flag(current, TIF_DEBUG))
+#else
+#ifdef CONFIG_X86_64
+       if (!current->thread.debugreg7)
+#else
+       if (!current->thread.debugreg[7])
+#endif
+#endif
+               __preempt_disable_notifiers();
 }
 
 void preempt_notifier_sys_exit(void)
