Re: smp_call_function_single lockups

2015-04-06 Thread Chris J Arges
On Thu, Apr 02, 2015 at 10:31:50AM -0700, Linus Torvalds wrote: > On Wed, Apr 1, 2015 at 2:59 PM, Chris J Arges > wrote: > > > > Is it worthwhile to do a 'bisect' to see where on average it takes > > longer to reproduce? Perhaps it will point to a relevant change, or it > > may be completely usele

Re: smp_call_function_single lockups

2015-04-02 Thread Ingo Molnar
* Chris J Arges wrote: > > > On 04/02/2015 02:07 PM, Ingo Molnar wrote: > > > > * Chris J Arges wrote: > > > >> Whenever we look through the crashdump we see csd_lock_wait waiting > >> for CSD_FLAG_LOCK bit to be cleared. Usually the signature leading > >> up to that looks like the follo

Re: smp_call_function_single lockups

2015-04-02 Thread Chris J Arges
On 04/02/2015 02:07 PM, Ingo Molnar wrote: > > * Chris J Arges wrote: > >> Whenever we look through the crashdump we see csd_lock_wait waiting >> for CSD_FLAG_LOCK bit to be cleared. Usually the signature leading >> up to that looks like the following (in the openstack tempest on >> openst

Re: smp_call_function_single lockups

2015-04-02 Thread Linus Torvalds
On Thu, Apr 2, 2015 at 12:07 PM, Ingo Molnar wrote: > > So one possibility would be that an 'IPI was sent but lost'. Yes, the "sent but lost" thing would certainly explain the lockups. At the same time, that sounds like a huge hardware bug, and that's somewhat surprising/unlikely. That said. >

Re: smp_call_function_single lockups

2015-04-02 Thread Ingo Molnar
* Chris J Arges wrote: > Whenever we look through the crashdump we see csd_lock_wait waiting > for CSD_FLAG_LOCK bit to be cleared. Usually the signature leading > up to that looks like the following (in the openstack tempest on > openstack and nested VM stress case) > > (qemu-system-x86 ta

Re: smp_call_function_single lockups

2015-04-02 Thread Chris J Arges
On 04/02/2015 01:26 PM, Ingo Molnar wrote: > > * Linus Torvalds wrote: > >> So unless we find a real clear signature of the bug (I was hoping >> that the ISR bit would be that sign), I don't think trying to bisect >> it based on how quickly you can reproduce things is worthwhile. > > So I'm w

Re: smp_call_function_single lockups

2015-04-02 Thread Ingo Molnar
* Linus Torvalds wrote: > So unless we find a real clear signature of the bug (I was hoping > that the ISR bit would be that sign), I don't think trying to bisect > it based on how quickly you can reproduce things is worthwhile. So I'm wondering (and I might have missed some earlier report th

Re: smp_call_function_single lockups

2015-04-02 Thread Linus Torvalds
On Thu, Apr 2, 2015 at 2:55 AM, Ingo Molnar wrote: > > So another possibility would be that it's the third change causing > this change in behavior: Oh, yes, that looks much more likely. I overlooked that small change entirely. > ... since with this we won't send IPIs in a semi-nested fashion wi

Re: smp_call_function_single lockups

2015-04-02 Thread Linus Torvalds
On Wed, Apr 1, 2015 at 2:59 PM, Chris J Arges wrote: > > Is it worthwhile to do a 'bisect' to see where on average it takes > longer to reproduce? Perhaps it will point to a relevant change, or it > may be completely useless. It's likely to be an exercise in futility. "git bisect" is really bad a

Re: smp_call_function_single lockups

2015-04-02 Thread Ingo Molnar
* Linus Torvalds wrote: > On Wed, Apr 1, 2015 at 7:32 AM, Chris J Arges > wrote: > > > > I included the full patch in reply to Ingo's email, and when > > running with that I no longer get the ack_APIC_irq WARNs. > > Ok. That means that the printk's themselves just change timing > enough, or

Re: smp_call_function_single lockups

2015-04-01 Thread Chris J Arges
On 04/01/2015 11:14 AM, Linus Torvalds wrote: > On Wed, Apr 1, 2015 at 9:10 AM, Chris J Arges > wrote: >> >> Even with irqbalance removed from the L0/L1 machines the hang still occurs. >> >> This results in no 'apic: vector*' or 'ack_APIC*' messages being displayed. > > Ok. So the ack_APIC debug

Re: smp_call_function_single lockups

2015-04-01 Thread Linus Torvalds
On Wed, Apr 1, 2015 at 9:10 AM, Chris J Arges wrote: > > Even with irqbalance removed from the L0/L1 machines the hang still occurs. > > This results in no 'apic: vector*' or 'ack_APIC*' messages being displayed. Ok. So the ack_APIC debug patch found *something*, but it seems to be unrelated to th

Re: smp_call_function_single lockups

2015-04-01 Thread Chris J Arges
On Wed, Apr 01, 2015 at 02:43:36PM +0200, Ingo Molnar wrote: > Have you already tested whether the hang goes away if you remove > irq-affinity fiddling daemons from the system? Do you have irqbalance > installed or similar mechanisms? > > Thanks, > > Ingo > Even with irqbalance remove

Re: smp_call_function_single lockups

2015-04-01 Thread Linus Torvalds
On Wed, Apr 1, 2015 at 7:32 AM, Chris J Arges wrote: > > I included the full patch in reply to Ingo's email, and when running with that > I no longer get the ack_APIC_irq WARNs. Ok. That means that the printk's themselves just change timing enough, or change the compiler instruction scheduling so

Re: [debug PATCHes] Re: smp_call_function_single lockups

2015-04-01 Thread Ingo Molnar
* Chris J Arges wrote: > > diff --git a/arch/x86/kernel/apic/vector.c b/arch/x86/kernel/apic/vector.c > > index 6cedd7914581..79d6de6fdf0a 100644 > > --- a/arch/x86/kernel/apic/vector.c > > +++ b/arch/x86/kernel/apic/vector.c > > @@ -144,6 +144,8 @@ __assign_irq_vector(int irq, struct irq_cfg *c

Re: smp_call_function_single lockups

2015-04-01 Thread Chris J Arges
On Tue, Mar 31, 2015 at 04:07:32PM -0700, Linus Torvalds wrote: > On Tue, Mar 31, 2015 at 3:23 PM, Chris J Arges > wrote: > > > > I had a few runs with your patch plus modifications, and got the following > > results (modified patch inlined below): > > Ok, thanks. > > > [ 14.423916] ack_APIC_i

Re: smp_call_function_single lockups

2015-04-01 Thread Frederic Weisbecker
On Wed, Feb 11, 2015 at 12:42:10PM -0800, Linus Torvalds wrote: > [ Added Frederic to the cc, since he's touched this file/area most ] > > On Wed, Feb 11, 2015 at 11:59 AM, Linus Torvalds > wrote: > > > > So the caller has a really hard time guaranteeing that CSD_LOCK isn't > > set. And if the ca

Re: [debug PATCHes] Re: smp_call_function_single lockups

2015-04-01 Thread Chris J Arges
On Wed, Apr 01, 2015 at 02:39:13PM +0200, Ingo Molnar wrote: > > * Chris J Arges wrote: > > > This was only tested on the L1, so I can put this on the L0 host and > > run > > this as well. The results: > > > > [ 124.897002] apic: vector c1, new-domain move in progress >

Re: smp_call_function_single lockups

2015-04-01 Thread Ingo Molnar
* Chris J Arges wrote: > Linus, > > I had a few runs with your patch plus modifications, and got the following > results (modified patch inlined below): > > [ 14.423916] ack_APIC_irq: vector = d1, irq = > [ 176.060005] NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! > [qemu-

Re: [debug PATCHes] Re: smp_call_function_single lockups

2015-04-01 Thread Ingo Molnar
* Chris J Arges wrote: > This was only tested on the L1, so I can put this on the L0 host and run > this as well. The results: > > [ 124.897002] apic: vector c1, new-domain move in progress > > [ 124.954827] apic: vector d1, sent cleanup vector, move completed

Re: [debug PATCHes] Re: smp_call_function_single lockups

2015-03-31 Thread Daniel J Blueman
On Wednesday, April 1, 2015 at 6:40:06 AM UTC+8, Chris J Arges wrote: > On Tue, Mar 31, 2015 at 12:56:56PM +0200, Ingo Molnar wrote: > > > > * Linus Torvalds wrote: > > > > > Ok, interesting. So the whole "we try to do an APIC ACK with the ISR > > > bit clear" seems to be a real issue. > > > > It'

Re: smp_call_function_single lockups

2015-03-31 Thread Linus Torvalds
On Tue, Mar 31, 2015 at 3:23 PM, Chris J Arges wrote: > > I had a few runs with your patch plus modifications, and got the following > results (modified patch inlined below): Ok, thanks. > [ 14.423916] ack_APIC_irq: vector = d1, irq = > [ 176.060005] NMI watchdog: BUG: soft lockup -

Re: [debug PATCHes] Re: smp_call_function_single lockups

2015-03-31 Thread Chris J Arges
On Tue, Mar 31, 2015 at 12:56:56PM +0200, Ingo Molnar wrote: > > * Linus Torvalds wrote: > > > Ok, interesting. So the whole "we try to do an APIC ACK with the ISR > > bit clear" seems to be a real issue. > > It's interesting in particular when it happens with an edge-triggered > interrupt so

Re: smp_call_function_single lockups

2015-03-31 Thread Chris J Arges
On Tue, Mar 31, 2015 at 08:08:40AM -0700, Linus Torvalds wrote: > On Mon, Mar 30, 2015 at 8:15 PM, Chris J Arges > wrote: > > > > I modified the posted patch with the following: > > Actually, in addition to Ingo's patches (and the irq printout), which > you should try first, if none of that reall

Re: smp_call_function_single lockups

2015-03-31 Thread Linus Torvalds
On Mon, Mar 30, 2015 at 8:15 PM, Chris J Arges wrote: > > I modified the posted patch with the following: Actually, in addition to Ingo's patches (and the irq printout), which you should try first, if none of that really gives any different behavior, can modify that ack_APIC_irq() debugging code

[debug PATCHes] Re: smp_call_function_single lockups

2015-03-31 Thread Ingo Molnar
* Linus Torvalds wrote: > Ok, interesting. So the whole "we try to do an APIC ACK with the ISR > bit clear" seems to be a real issue. It's interesting in particular when it happens with an edge-triggered interrupt source: it's much harder to miss level triggered IRQs, which stay around until

Re: smp_call_function_single lockups

2015-03-30 Thread Linus Torvalds
On Mon, Mar 30, 2015 at 8:15 PM, Chris J Arges wrote: > [ 13.613531] WARNING: CPU: 0 PID: 0 at ./arch/x86/include/asm/apic.h:444 > apic_ack_edge+0x84/0x90() > [ 13.613531] [] apic_ack_edge+0x84/0x90 > [ 13.613531] [] handle_edge_irq+0x57/0x120 > [ 13.613531] [] handle_irq+0x22/0x40 > [

Re: smp_call_function_single lockups

2015-03-30 Thread Linus Torvalds
On Mon, Mar 30, 2015 at 8:15 PM, Chris J Arges wrote: > > I've been able to repro with your patch and observed the WARN_ON when booting > a > VM on affected hardware and non affected hardware: Ok, interesting. So the whole "we try to do an APIC ACK with the ISR bit clear" seems to be a real issu

Re: smp_call_function_single lockups

2015-03-30 Thread Chris J Arges
On Thu, Feb 19, 2015 at 02:45:54PM -0800, Linus Torvalds wrote: > On Thu, Feb 19, 2015 at 1:59 PM, Linus Torvalds > wrote: > > > > Is this worth looking at? Or is it something spurious? I might have > > gotten the vectors wrong, and maybe the warning is not because the ISR > > bit isn't set, but b

Re: smp_call_function_single lockups

2015-03-20 Thread Mike Galbraith
On Fri, 2015-03-20 at 09:26 -0700, Linus Torvalds wrote: > On Fri, Mar 20, 2015 at 3:15 AM, Peter Zijlstra wrote: > > > > Linus, any plans for this patch? I think it does solve a fair few issues > > the current code has. > > So I didn't really have any plans. I think it's a good patch, and it > m

Re: smp_call_function_single lockups

2015-03-20 Thread Linus Torvalds
On Fri, Mar 20, 2015 at 3:15 AM, Peter Zijlstra wrote: > > Linus, any plans for this patch? I think it does solve a fair few issues > the current code has. So I didn't really have any plans. I think it's a good patch, and it might even fix a bug and improve code generation, but it didn't fix the

Re: smp_call_function_single lockups

2015-03-20 Thread Peter Zijlstra
On Wed, Feb 11, 2015 at 12:42:10PM -0800, Linus Torvalds wrote: > Ok, this is a more involved patch than I'd like, but making the > *caller* do all the CSD maintenance actually cleans things up. > > And this is still completely untested, and may be entirely buggy. What > do you guys think? > Lin

Re: smp_call_function_single lockups

2015-02-23 Thread Rafael David Tinoco
> > [11396.096002] Hardware name: OpenStack Foundation OpenStack Nova, BIOS Bochs > 01/01/2011 > > But it's a virtual machine, right? It's not running bare metal, it's running > a !virt kernel on a virt machine, so maybe some of the virt muck is > borked? > > A very subtly broken APIC emulation would

Re: smp_call_function_single lockups

2015-02-23 Thread Peter Zijlstra
On Mon, Feb 23, 2015 at 11:32:50AM -0800, Linus Torvalds wrote: > On Mon, Feb 23, 2015 at 6:01 AM, Rafael David Tinoco > wrote: > > > > This is v3.19 + your patch (smp acquire/release) > > - (nested kvm with 2 vcpus on top of proliant with x2apic cluster mode > > and acpi_idle) > > Hmm. There is

Re: smp_call_function_single lockups

2015-02-23 Thread Linus Torvalds
On Mon, Feb 23, 2015 at 6:01 AM, Rafael David Tinoco wrote: > > This is v3.19 + your patch (smp acquire/release) > - (nested kvm with 2 vcpus on top of proliant with x2apic cluster mode > and acpi_idle) Hmm. There is absolutely nothing else going on on that machine, except for the single call to

Re: smp_call_function_single lockups

2015-02-23 Thread Rafael David Tinoco
On Thu, Feb 19, 2015 at 2:14 PM, Linus Torvalds wrote: > > Hmm. Still just the stack trace for the CPU that is blocked (CPU0), if > you can get the core-file to work and figure out where the other CPU > is, that would be good. > This is v3.19 + your patch (smp acquire/release) - (nested kvm with

Re: smp_call_function_single lockups

2015-02-22 Thread Ingo Molnar
* Daniel J Blueman wrote: > The Intel SDM [1] and AMD F15h BKDG [2] state that IPIs > are queued, so the wait_icr_idle() polling is only > necessary on PPro and older, and maybe then to avoid > delivery retry. This unnecessarily ties up the IPI > caller, so we bypass the polling in the Numac

Re: smp_call_function_single lockups

2015-02-22 Thread Daniel J Blueman
On Saturday, February 21, 2015 at 3:50:05 AM UTC+8, Ingo Molnar wrote: > * Linus Torvalds wrote: > > > On Fri, Feb 20, 2015 at 1:30 AM, Ingo Molnar wrote: > > > > > > So if my memory serves me right, I think it was for > > > local APICs, and even there mostly it was a performance > > > issue: if

Re: smp_call_function_single lockups

2015-02-20 Thread Ingo Molnar
* Linus Torvalds wrote: > On Fri, Feb 20, 2015 at 11:41 AM, Ingo Molnar wrote: > > > > I'm not so sure about that aspect: I think disabling > > IRQs might be necessary with some APICs (if lower > > levels don't disable IRQs), to make sure the 'local > > APIC busy' bit isn't set: > > Right.

Re: smp_call_function_single lockups

2015-02-20 Thread Linus Torvalds
On Fri, Feb 20, 2015 at 11:41 AM, Ingo Molnar wrote: > > I'm not so sure about that aspect: I think disabling IRQs > might be necessary with some APICs (if lower levels don't > disable IRQs), to make sure the 'local APIC busy' bit isn't > set: Right. But afaik not for the x2apic case, which this

Re: smp_call_function_single lockups

2015-02-20 Thread Ingo Molnar
* Linus Torvalds wrote: > On Fri, Feb 20, 2015 at 1:30 AM, Ingo Molnar wrote: > > > > So if my memory serves me right, I think it was for > > local APICs, and even there mostly it was a performance > > issue: if an IO-APIC sent more than 2 IRQs per 'level' > > to a local APIC then the IO-API

Re: smp_call_function_single lockups

2015-02-20 Thread Linus Torvalds
On Fri, Feb 20, 2015 at 1:30 AM, Ingo Molnar wrote: > > So if my memory serves me right, I think it was for local > APICs, and even there mostly it was a performance issue: if > an IO-APIC sent more than 2 IRQs per 'level' to a local > APIC then the IO-APIC might be forced to resend those IRQs, >

Re: smp_call_function_single lockups

2015-02-20 Thread Ingo Molnar
* Linus Torvalds wrote: > On Thu, Feb 19, 2015 at 9:39 AM, Linus Torvalds > wrote: > > On Thu, Feb 19, 2015 at 8:59 AM, Linus Torvalds > > wrote: > >> > >> Are there known errata for the x2apic? > > > > .. and in particular, do we still have to worry about > > the traditional local apic "if t

Re: smp_call_function_single lockups

2015-02-19 Thread Linus Torvalds
On Thu, Feb 19, 2015 at 1:59 PM, Linus Torvalds wrote: > > Is this worth looking at? Or is it something spurious? I might have > gotten the vectors wrong, and maybe the warning is not because the ISR > bit isn't set, but because I test the wrong bit. I edited the patch to do ratelimiting (one per

Re: smp_call_function_single lockups

2015-02-19 Thread Linus Torvalds
On Thu, Feb 19, 2015 at 12:29 PM, Linus Torvalds wrote: > > Now, what happens if we send an EOI for an ExtINT interrupt? It > basically ends up being a spurious IPI. And I *think* that what > normally happens is absolutely nothing at all. But if in addition to > the ExtINT, there was a pending IPI

Re: smp_call_function_single lockups

2015-02-19 Thread Linus Torvalds
On Thu, Feb 19, 2015 at 9:39 AM, Linus Torvalds wrote: > On Thu, Feb 19, 2015 at 8:59 AM, Linus Torvalds > wrote: >> >> Are there known errata for the x2apic? > > .. and in particular, do we still have to worry about the traditional > local apic "if there are more than two pending interrupts per

Re: smp_call_function_single lockups

2015-02-19 Thread Linus Torvalds
On Thu, Feb 19, 2015 at 8:59 AM, Linus Torvalds wrote: > > Are there known errata for the x2apic? .. and in particular, do we still have to worry about the traditional local apic "if there are more than two pending interrupts per priority level, things get lost" problem? I forget the exact detai

Re: smp_call_function_single lockups

2015-02-19 Thread Rafael David Tinoco
I could only find an advisory (regarding sr-iov and irq remaps) from HP to RHEL6.2 users stating that Gen8 firmware does not enable it by default. http://h20564.www2.hp.com/hpsc/doc/public/display?docId=emr_na-c03645796 """ The interrupt remapping capability depends on x2apic enabled in the BIOS

Re: smp_call_function_single lockups

2015-02-19 Thread Linus Torvalds
On Thu, Feb 19, 2015 at 8:32 AM, Rafael David Tinoco wrote: > Feb 19 08:21:28 derain kernel: [3.637682] Switched APIC routing to > cluster x2apic. Ok. That "cluster x2apic" mode is just about the nastiest mode when it comes to sending a single ipi. We do that insane dance where we - turn si

Re: smp_call_function_single lockups

2015-02-19 Thread Rafael David Tinoco
For the host, we are using "intremap=no_x2apic_optout intel_idle.max_cstate=0" on the kernel command line. It looks like the DL360/DL380 Gen8 firmware still asks to opt out of x2apic, but the HP engineering team said that using x2apic on Gen8 would be OK (intel_idle causes these servers to generate NMIs when idling

Re: smp_call_function_single lockups

2015-02-19 Thread Linus Torvalds
On Thu, Feb 19, 2015 at 7:42 AM, Rafael David Tinoco wrote: > > Same environment as before: Nested KVM (2 vcpus) on top of Proliant > DL380G8 with acpi_idle and no x2apic optout. Btw, which apic model does that end up using? Does "no x2apic optout" mean you're using the x2apic? What does "dmesg

Re: smp_call_function_single lockups

2015-02-19 Thread Linus Torvalds
On Thu, Feb 19, 2015 at 7:42 AM, Rafael David Tinoco wrote: > > Just some quick feedback: we were able to reproduce the lockup with this > proposed patch (3.19 + patch). Unfortunately we had problems with the > core file and I have only the stack trace for now but I think we are > able to reproduce i

Re: smp_call_function_single lockups

2015-02-19 Thread Peter Zijlstra
On Thu, Feb 19, 2015 at 01:42:39PM -0200, Rafael David Tinoco wrote: > Linus, Peter, Thomas > > Just some quick feedback: we were able to reproduce the lockup with this > proposed patch (3.19 + patch). Unfortunately we had problems with the > core file and I have only the stack trace for now but I th

Re: smp_call_function_single lockups

2015-02-19 Thread Rafael David Tinoco
Linus, Peter, Thomas Just some quick feedback: we were able to reproduce the lockup with this proposed patch (3.19 + patch). Unfortunately we had problems with the core file and I have only the stack trace for now but I think we are able to reproduce it again and provide more details (sorry for the d

Re: smp_call_function_single lockups

2015-02-18 Thread Peter Zijlstra
On Wed, Feb 11, 2015 at 12:42:10PM -0800, Linus Torvalds wrote: > Ok, this is a more involved patch than I'd like, but making the > *caller* do all the CSD maintenance actually cleans things up. > > And this is still completely untested, and may be entirely buggy. What > do you guys think? I thin

Re: smp_call_function_single lockups

2015-02-12 Thread Rafael David Tinoco
Meanwhile we'll take the opportunity to run the same tests with the "smp_load_acquire/smp_store_release + outside sync/async" approach from your latest patch on top of 3.19. If anything comes up I'll provide full backtraces (2 vcpus). Here I can only reproduce this inside nested kvm on top of Pro

Re: smp_call_function_single lockups

2015-02-11 Thread Linus Torvalds
[ Added Frederic to the cc, since he's touched this file/area most ] On Wed, Feb 11, 2015 at 11:59 AM, Linus Torvalds wrote: > > So the caller has a really hard time guaranteeing that CSD_LOCK isn't > set. And if the call is done in interrupt context, for all we know it > is interrupting the code

Re: smp_call_function_single lockups

2015-02-11 Thread Linus Torvalds
On Wed, Feb 11, 2015 at 10:18 AM, Linus Torvalds wrote: > > I'll think about this all, but we couldn't figure anything out last > time we looked at it, so without more clues, don't hold your breath. So having looked at it once more, one thing struck me: Look at smp_call_function_single_async().

Re: smp_call_function_single lockups

2015-02-11 Thread Linus Torvalds
On Wed, Feb 11, 2015 at 5:19 AM, Rafael David Tinoco wrote: > > - After applying patch provided by Thomas we were able to cause the > lockup only after 6 days (also locked inside > smp_call_function_single). Test performance (even for a nested kvm) > was reduced substantially with 3.19 + this patc

smp_call_function_single lockups

2015-02-11 Thread Rafael David Tinoco
Linus, Thomas, Jens.. During the 3.18 - 3.19 "frequent lockups discussion", at some point you observed possible synchronization problems in csd_lock() and csd_unlock(). I think we have managed to reproduce that issue on a consistent basis with 3.13 (Ubuntu) and 3.19 (latest vanilla). - When runni