On Mon, May 11, 2015 at 04:00:03PM +0200, Ingo Molnar wrote:
> > So potentially, CPU0 generated an interrupt that caused
> > vcpu_enter_guest to be called on CPU1. However, when
> > vmx_handle_external_intr was called, it didn't progress any further.
>
> So the IPI does look to be lost in the KVM
* Chris J Arges <chris.j.ar...@canonical.com> wrote:

> Later in the trace we see the same call followed by
> vmx_handle_external_intr() ignoring the call:
>
> [  603.248016]  2452.083823 |  0)  |  ptep_clear_flush() {
> [  603.248016]  2452.083824 |  0)  |    flush_tlb_page() {
On Thu, Apr 16, 2015 at 06:31:40PM +0200, Ingo Molnar wrote:
> Now what would be nice is to observe whether the CPU that is not
> doing the CSD wait is truly locked up.
>
> It might be executing random KVM-ish workloads and the various
> backtraces we've seen so far are just a random
* Chris J Arges wrote:
> A previous backtrace of a 3.19 series kernel is here and showing interrupts
> enabled on both CPUs on L1:
> https://lkml.org/lkml/2015/2/23/234
> http://people.canonical.com/~inaddy/lp1413540/BACKTRACES.txt
>
> [...]
>
> Yes, I think at this point I'll go through the
On Thu, Apr 16, 2015 at 01:04:23PM +0200, Ingo Molnar wrote:
>
> * Chris J Arges <chris.j.ar...@canonical.com> wrote:
>
> > Ingo,
> >
> > Below are the patches and data I've gathered from the reproducer. My
> > methodology was as described previously; however I used gdb on the
> > qemu process in order to breakpoint L1 once we've detected the hang.
> > This made dumping the kvm_lapic structures on
On Mon, Apr 13, 2015 at 08:14:51AM +0200, Ingo Molnar wrote:
> > FYI, I'm working on getting better data at the moment and here is my
> > approach:
> > * For the L0 kernel:
> >   - In arch/x86/kvm/lapic.c, I enabled 'apic_debug' to get more output (and
> >     print the addresses of various useful
* Chris J Arges <chris.j.ar...@canonical.com> wrote:

> /sys/module/kvm_intel/parameters/enable_apicv on the affected
> hardware is not enabled, and unfortunately my hardware doesn't have
> the necessary features to enable it. So we are dealing with KVM's
> lapic implementation only.

That's actually pretty fortunate, as
> So it would be really important to see the stack dump of CPU#0 at this
> point, and also do an APIC state dump of it.
>
> Because from previous dumps it appeared that the 'stuck' CPU was just
> in its idle loop - which is 'impossible' as the idle loop should still
> allow APIC irqs arriving.
* Chris J Arges <chris.j.ar...@canonical.com> wrote:

> Ingo,
>
> Looks like sched_clock() works in this case.
>
> Adding the dump_stack() line caused various issues such as the VM
> oopsing on boot or the softlockup never being detected properly (and
> thus not crashing). So the below is running with your patch and
On Tue, Apr 7, 2015 at 1:59 PM, Chris J Arges
<chris.j.ar...@canonical.com> wrote:
>
> Here is the log leading up to the soft lockup (I adjusted CSD_LOCK_TIMEOUT to
> 5s):
>
> [   22.669630] kvm [1523]: vcpu0 disabled perfctr wrmsr: 0xc1 data 0x
> [   38.712710] csd: Detected non-responsive CSD lock (#1) on CPU#00, waiting
On Tue, Apr 07, 2015 at 11:21:21AM +0200, Ingo Molnar wrote:
>
> * Linus Torvalds wrote:
>
> > On Mon, Apr 6, 2015 at 9:58 AM, Chris J Arges
> > wrote:
> > >
> > > I noticed that with this patch it never reached 'csd: Detected
> > > non-responsive CSD lock...' because it seems that ts_delta
* Linus Torvalds <torva...@linux-foundation.org> wrote:

> On Mon, Apr 6, 2015 at 9:58 AM, Chris J Arges
> <chris.j.ar...@canonical.com> wrote:
> >
> > I noticed that with this patch it never reached 'csd: Detected
> > non-responsive CSD lock...' because it seems that ts_delta never reached
> > CSD_LOCK_TIMEOUT. I tried adjusting the TIMEOUT value and
On Mon, Apr 6, 2015 at 9:58 AM, Chris J Arges
<chris.j.ar...@canonical.com> wrote:
>
> I noticed that with this patch it never reached 'csd: Detected
> non-responsive CSD lock...' because it seems that ts_delta never reached
> CSD_LOCK_TIMEOUT. I tried adjusting the TIMEOUT value and still got a
> hang without reaching this
On 04/03/2015 12:43 AM, Ingo Molnar wrote:
>
> * Chris J Arges wrote:
>
>> Ingo,
>>
>> I think tracking IPI calls from 'generic_exec_single' would make a lot
>> of sense. When you say poll for completion do you mean a loop after
>> 'arch_send_call_function_single_ipi' in kernel/smp.c? My main
* Ingo Molnar wrote:
> +static void csd_lock_wait(struct call_single_data *csd, int cpu)
> {
> - while (csd->flags & CSD_FLAG_LOCK)
> + int bug_id = 0;
> + u64 ts0, ts1, ts_delta;
> +
> + ts0 = jiffies_to_msecs(jiffies);
Note that while 'jiffies' is a global variable, it's
* Chris J Arges wrote:
> Ingo,
>
> I think tracking IPI calls from 'generic_exec_single' would make a lot
> of sense. When you say poll for completion do you mean a loop after
> 'arch_send_call_function_single_ipi' in kernel/smp.c? My main concern
> would be to not alter the timings too much