Re: [PATCH] smp/call: Detect stuck CSD locks

2015-05-20 Thread Chris J Arges
On Mon, May 11, 2015 at 04:00:03PM +0200, Ingo Molnar wrote: > > So potentially, CPU0 generated an interrupt that caused > > vcpu_enter_guest to be called on CPU1. However, when > > vmx_handle_external_intr was called, it didn't progress any further. > > So the IPI does look to be lost in the KVM

Re: [PATCH] smp/call: Detect stuck CSD locks

2015-05-11 Thread Ingo Molnar
* Chris J Arges wrote: > Later in the trace we see the same call followed by > vmx_handle_external_intr() ignoring the call: > > [ 603.248016] 2452.083823 | 0) |ptep_clear_flush() { > [ 603.248016] 2452.083824 | 0) | flush_tlb_page() { > [

Re: [PATCH] smp/call: Detect stuck CSD locks

2015-04-29 Thread Chris J Arges
On Thu, Apr 16, 2015 at 06:31:40PM +0200, Ingo Molnar wrote: > Now what would be nice is to observe whether the CPU that is not > doing the CSD wait is truly locked up. > > It might be executing random KVM-ish workloads and the various > backtraces we've seen so far are just a random

Re: [PATCH] smp/call: Detect stuck CSD locks

2015-04-16 Thread Ingo Molnar
* Chris J Arges wrote: > A previous backtrace of a 3.19 series kernel is here and showing interrupts > enabled on both CPUs on L1: > https://lkml.org/lkml/2015/2/23/234 > http://people.canonical.com/~inaddy/lp1413540/BACKTRACES.txt > > [...] > > Yes, I think at this point I'll go through the

Re: [PATCH] smp/call: Detect stuck CSD locks

2015-04-16 Thread Chris J Arges
On Thu, Apr 16, 2015 at 01:04:23PM +0200, Ingo Molnar wrote: > > * Chris J Arges wrote: > > > Ingo, > > > > Below are the patches and data I've gathered from the reproducer. My > > methodology was as described previously; however I used gdb on the > > qemu process in order to breakpoint L1

Re: [PATCH] smp/call: Detect stuck CSD locks

2015-04-16 Thread Ingo Molnar
* Chris J Arges wrote: > Ingo, > > Below are the patches and data I've gathered from the reproducer. My > methodology was as described previously; however I used gdb on the > qemu process in order to breakpoint L1 once we've detected the hang. > This made dumping the kvm_lapic structures on

Re: [PATCH] smp/call: Detect stuck CSD locks

2015-04-15 Thread Chris J Arges
On Mon, Apr 13, 2015 at 08:14:51AM +0200, Ingo Molnar wrote: > > FYI, I'm working on getting better data at the moment and here is my > > approach: > > * For the L0 kernel: > > - In arch/x86/kvm/lapic.c, I enabled 'apic_debug' to get more output (and > > print > > the addresses of various

Re: [PATCH] smp/call: Detect stuck CSD locks

2015-04-13 Thread Ingo Molnar
* Chris J Arges wrote: > /sys/module/kvm_intel/parameters/enable_apicv on the affected > hardware is not enabled, and unfortunately my hardware doesn't have > the necessary features to enable it. So we are dealing with KVM's > lapic implementation only. That's actually pretty fortunate, as

Re: [PATCH] smp/call: Detect stuck CSD locks

2015-04-12 Thread Chris J Arges
> So it would be really important to see the stack dump of CPU#0 at this > point, and also do an APIC state dump of it. > > Because from previous dumps it appeared that the 'stuck' CPU was just > in its idle loop - which is 'impossible' as the idle loop should still > allow APIC irqs

Re: [PATCH] smp/call: Detect stuck CSD locks

2015-04-08 Thread Ingo Molnar
* Chris J Arges wrote: > Ingo, > > Looks like sched_clock() works in this case. > > Adding the dump_stack() line caused various issues such as the VM > oopsing on boot or the softlockup never being detected properly (and > thus not crashing). So the below is running with your patch and >

Re: [PATCH] smp/call: Detect stuck CSD locks

2015-04-07 Thread Linus Torvalds
On Tue, Apr 7, 2015 at 1:59 PM, Chris J Arges wrote: > > Here is the log leading up to the soft lockup (I adjusted CSD_LOCK_TIMEOUT to > 5s): > [ 22.669630] kvm [1523]: vcpu0 disabled perfctr wrmsr: 0xc1 data 0x > [ 38.712710] csd: Detected non-responsive CSD lock (#1) on CPU#00, waiting

Re: [PATCH] smp/call: Detect stuck CSD locks

2015-04-07 Thread Chris J Arges
On Tue, Apr 07, 2015 at 11:21:21AM +0200, Ingo Molnar wrote: > > * Linus Torvalds wrote: > > > On Mon, Apr 6, 2015 at 9:58 AM, Chris J Arges > > wrote: > > > > > > I noticed that with this patch it never reached 'csd: Detected > > > non-responsive CSD lock...' because it seems that ts_delta

Re: [PATCH] smp/call: Detect stuck CSD locks

2015-04-07 Thread Ingo Molnar
* Linus Torvalds wrote: > On Mon, Apr 6, 2015 at 9:58 AM, Chris J Arges > wrote: > > > > I noticed that with this patch it never reached 'csd: Detected > > non-responsive CSD lock...' because it seems that ts_delta never reached > > CSD_LOCK_TIMEOUT. I tried adjusting the TIMEOUT value and

Re: [PATCH] smp/call: Detect stuck CSD locks

2015-04-06 Thread Linus Torvalds
On Mon, Apr 6, 2015 at 9:58 AM, Chris J Arges wrote: > > I noticed that with this patch it never reached 'csd: Detected > non-responsive CSD lock...' because it seems that ts_delta never reached > CSD_LOCK_TIMEOUT. I tried adjusting the TIMEOUT value and still got a > hang without reaching this

Re: [PATCH] smp/call: Detect stuck CSD locks

2015-04-06 Thread Chris J Arges
On 04/03/2015 12:43 AM, Ingo Molnar wrote: > > * Chris J Arges wrote: > >> Ingo, >> >> I think tracking IPI calls from 'generic_exec_single' would make a lot >> of sense. When you say poll for completion do you mean a loop after >> 'arch_send_call_function_single_ipi' in kernel/smp.c? My main

Re: [PATCH] smp/call: Detect stuck CSD locks

2015-04-02 Thread Ingo Molnar
* Ingo Molnar wrote: > +static void csd_lock_wait(struct call_single_data *csd, int cpu) > { > - while (csd->flags & CSD_FLAG_LOCK) > + int bug_id = 0; > + u64 ts0, ts1, ts_delta; > + > + ts0 = jiffies_to_msecs(jiffies); Note that while 'jiffies' is a global variable, it's

[PATCH] smp/call: Detect stuck CSD locks

2015-04-02 Thread Ingo Molnar
* Chris J Arges wrote: > Ingo, > > I think tracking IPI calls from 'generic_exec_single' would make a lot > of sense. When you say poll for completion do you mean a loop after > 'arch_send_call_function_single_ipi' in kernel/smp.c? My main concern > would be to not alter the timings too much
