Re: smp_call_function_single lockups

2015-04-06 Thread Chris J Arges
On Thu, Apr 02, 2015 at 10:31:50AM -0700, Linus Torvalds wrote:
> On Wed, Apr 1, 2015 at 2:59 PM, Chris J Arges
>  wrote:
> >
> > Is it worthwhile to do a 'bisect' to see where on average it takes
> > longer to reproduce? Perhaps it will point to a relevant change, or it
> > may be completely useless.
> 
> It's likely to be an exercise in futility. "git bisect" is really bad
> at "gray area" things, and when it's a question of "it takes hours or
> days to reproduce", it's almost certainly not worth it. Not unless
> there is some really clear cut-off that we can believably say "this
> causes it to get much slower". And in this case, I don't think it's
> that clear-cut. Judging by DaveJ's attempts at bisecting things, the
> timing just changes. And the differences might be due to entirely
> unrelated changes like cacheline alignment etc.
> 
> So unless we find a real clear signature of the bug (I was hoping that
> the ISR bit would be that sign), I don't think trying to bisect it
> based on how quickly you can reproduce things is worthwhile.
> 
> Linus
>

Linus, Ingo,

I did some testing and found that at the following patch level, the issue was
much, much more likely to reproduce, typically within < 15 minutes.

commit b6b8a1451fc40412c57d10c94b62e22acab28f94
Author: Jan Kiszka 
Date:   Fri Mar 7 20:03:12 2014 +0100

KVM: nVMX: Rework interception of IRQs and NMIs

Move the check for leaving L2 on pending and intercepted IRQs or NMIs
from the *_allowed handler into a dedicated callback. Invoke this
callback at the relevant points before KVM checks if IRQs/NMIs can be
injected. The callback has the task to switch from L2 to L1 if needed
and inject the proper vmexit events.

The rework fixes L2 wakeups from HLT and provides the foundation for
preemption timer emulation.

However, once the following patch was applied, the average time to reproduction
went up greatly (the stress reproducer ran for hours without issue):

commit 9242b5b60df8b13b469bc6b7be08ff6ebb551ad3
Author: Bandan Das 
Date:   Tue Jul 8 00:30:23 2014 -0400

KVM: x86: Check for nested events if there is an injectable interrupt

With commit b6b8a1451fc40412c57d1 that introduced
vmx_check_nested_events, checks for injectable interrupts happen
at different points in time for L1 and L2 that could potentially
cause a race. The regression occurs because KVM_REQ_EVENT is always
set when nested_run_pending is set even if there's no pending interrupt.
Consequently, there could be a small window when check_nested_events
returns without exiting to L1, but an interrupt comes through soon
after and it incorrectly gets injected to L2 by inject_pending_event.
Fix this by adding a call to check for nested events too when a check
for an injectable interrupt returns true.

However, we reproduced with v3.19 (which contains both of these patches), and it
did eventually hit the soft lockup with a similar backtrace.

So far, this agrees with the current understanding that we may not be ACK'ing
certain interrupts (IPIs from the L1 guest), causing csd_lock_wait to spin and
the soft lockup to trigger.

Hopefully this helps shed more light on this issue.

Thanks,
--chris j arges 


Re: smp_call_function_single lockups

2015-04-02 Thread Ingo Molnar

* Chris J Arges  wrote:

> 
> 
> On 04/02/2015 02:07 PM, Ingo Molnar wrote:
> > 
> > * Chris J Arges  wrote:
> > 
> >> Whenever we look through the crashdump we see csd_lock_wait waiting
> >> for the CSD_FLAG_LOCK bit to be cleared.  Usually the signature leading
> >> up to that looks like the following (in the OpenStack Tempest-on-OpenStack
> >> and nested-VM stress cases):
> >>
> >> (qemu-system-x86 task)
> >> kvm_sched_in
> >>  -> kvm_arch_vcpu_load
> >>   -> vmx_vcpu_load
> >>    -> loaded_vmcs_clear
> >>     -> smp_call_function_single
> >>
> >> (ksmd task)
> >> pmdp_clear_flush
> >>  -> flush_tlb_mm_range
> >>   -> native_flush_tlb_others
> >>    -> smp_call_function_many
> > 
> > So is this two separate smp_call_function instances, crossing each 
> > other, and none makes any progress, indefinitely - as if the two IPIs 
> > got lost?
> > 
> 
> This is two different crash signatures. Sorry for the confusion.

So just in case, both crash signatures ought to be detected by the 
patch I just sent.

Thanks,

Ingo


Re: smp_call_function_single lockups

2015-04-02 Thread Chris J Arges


On 04/02/2015 02:07 PM, Ingo Molnar wrote:
> 
> * Chris J Arges  wrote:
> 
>> Whenever we look through the crashdump we see csd_lock_wait waiting
>> for the CSD_FLAG_LOCK bit to be cleared.  Usually the signature leading
>> up to that looks like the following (in the OpenStack Tempest-on-OpenStack
>> and nested-VM stress cases):
>>
>> (qemu-system-x86 task)
>> kvm_sched_in
>>  -> kvm_arch_vcpu_load
>>   -> vmx_vcpu_load
>>    -> loaded_vmcs_clear
>>     -> smp_call_function_single
>>
>> (ksmd task)
>> pmdp_clear_flush
>>  -> flush_tlb_mm_range
>>   -> native_flush_tlb_others
>>    -> smp_call_function_many
> 
> So is this two separate smp_call_function instances, crossing each 
> other, and none makes any progress, indefinitely - as if the two IPIs 
> got lost?
> 

This is two different crash signatures. Sorry for the confusion.

> The traces Rafael linked to show a simpler scenario with two CPUs 
> apparently locked up, doing this:
> 
> CPU0:
> 
>  #5 [81c03e88] native_safe_halt at 81059386
>  #6 [81c03e90] default_idle at 8101eaee
>  #7 [81c03eb0] arch_cpu_idle at 8101f46f
>  #8 [81c03ec0] cpu_startup_entry at 810b6563
>  #9 [81c03f30] rest_init at 817a6067
> #10 [81c03f40] start_kernel at 81d4cfce
> #11 [81c03f80] x86_64_start_reservations at 81d4c4d7
> #12 [81c03f90] x86_64_start_kernel at 81d4c61c
> 
> This CPU is idle.
> 
> CPU1:
> 
> #10 [88081993fa70] smp_call_function_single at 810f4d69
> #11 [88081993fb10] native_flush_tlb_others at 810671ae
> #12 [88081993fb40] flush_tlb_mm_range at 810672d4
> #13 [88081993fb80] pmdp_splitting_flush at 81065e0d
> #14 [88081993fba0] split_huge_page_to_list at 811ddd39
> #15 [88081993fc30] __split_huge_page_pmd at 811dec65
> #16 [88081993fcc0] unmap_single_vma at 811a4f03
> #17 [88081993fdc0] zap_page_range at 811a5d08
> #18 [88081993fe80] sys_madvise at 811b9775
> #19 [88081993ff80] system_call_fastpath at 817b8bad
> 
> This CPU is busy-waiting for the TLB flush IPI to finish.
> 
> There's no unexpected pattern here (other than it not finishing)
> AFAICS; the smp_call_function_single() is just the usual way we invoke
> the TLB flushing methods.
> 
> So one possibility would be that an 'IPI was sent but lost'.
> 
> We could try the following trick: poll for completion for a couple of 
> seconds (since an IPI is not held up by anything but irqs-off 
> sections, it should arrive within microseconds typically - seconds of 
> polling should be more than enough), and if the IPI does not arrive, 
> print a warning message and re-send the IPI.
>
> If the IPI was lost due to some race and there's no other failure mode 
> that we don't understand, then this would work around the bug and 
> would make the tests pass indefinitely - with occasional hiccups and a 
> handful of messages produced along the way whenever it would have 
> locked up with a previous kernel.
> 
> If testing indeed confirms that kind of behavior we could drill down 
> more closely to figure out why the IPI did not get to its destination.
> 
> Or if the behavior is different, we'd have some new behavior to look 
> at. (for example the IPI sending mechanism might be wedged 
> indefinitely for some reason, so that even a resend won't work.)
> 
> Agreed?
> 
> Thanks,
> 
>   Ingo
> 

Ingo,

I think tracking IPI calls from 'generic_exec_single' would make a lot
of sense. When you say 'poll for completion', do you mean a loop after
'arch_send_call_function_single_ipi' in kernel/smp.c? My main concern
would be to not alter the timings too much so we can still reproduce the
original problem.

Another approach:
If we want to check for non-ACKed IPIs a possibility would be to add a
timestamp field to 'struct call_single_data' and just record jiffies
when the IPI gets called. Then have a per-cpu kthread periodically check the
'call_single_queue' percpu list for entries where (jiffies - timestamp) >
THRESHOLD. When we reach that condition, print the stale entry in
call_single_queue, dump a backtrace, then re-send the IPI.
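
A rough sketch of that second idea (against ~v3.19 kernel/smp.c; the
'enqueued' field, the threshold value, and the helper name are made up
for illustration, and the field would need to be set in
generic_exec_single() when the IPI is queued):

#define CSD_STALE_THRESHOLD	(5 * HZ)	/* arbitrary for illustration */

/* Walk one CPU's pending call_single_queue without popping entries off,
 * and warn about anything older than the threshold before re-sending
 * the IPI: */
static void csd_check_stale(int cpu)
{
	struct llist_head *head = &per_cpu(call_single_queue, cpu);
	struct call_single_data *csd;

	llist_for_each_entry(csd, READ_ONCE(head->first), llist) {
		if (time_after(jiffies, csd->enqueued + CSD_STALE_THRESHOLD)) {
			pr_warn("csd: stale CSD for CPU#%d (func %pf)\n",
				cpu, csd->func);
			dump_stack();
			arch_send_call_function_single_ipi(cpu);
		}
	}
}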

Let me know what makes the most sense to hack on.

Thanks,
--chris


Re: smp_call_function_single lockups

2015-04-02 Thread Linus Torvalds
On Thu, Apr 2, 2015 at 12:07 PM, Ingo Molnar  wrote:
>
> So one possibility would be that an 'IPI was sent but lost'.

Yes, the "sent but lost" thing would certainly explain the lockups.

At the same time, that sounds like a huge hardware bug, and that's
somewhat surprising/unlikely.

That said.

> We could try the following trick: poll for completion for a couple of
> seconds (since an IPI is not held up by anything but irqs-off
> sections, it should arrive within microseconds typically - seconds of
> polling should be more than enough), and if the IPI does not arrive,
> print a warning message and re-send the IPI.

Sounds like a reasonable approach. At worst it doesn't fix anything,
and we never see any messages, and that tells us something too.

Linus


Re: smp_call_function_single lockups

2015-04-02 Thread Ingo Molnar

* Chris J Arges  wrote:

> Whenever we look through the crashdump we see csd_lock_wait waiting
> for the CSD_FLAG_LOCK bit to be cleared.  Usually the signature leading
> up to that looks like the following (in the OpenStack Tempest-on-OpenStack
> and nested-VM stress cases):
> 
> (qemu-system-x86 task)
> kvm_sched_in
>  -> kvm_arch_vcpu_load
>   -> vmx_vcpu_load
>    -> loaded_vmcs_clear
>     -> smp_call_function_single
> 
> (ksmd task)
> pmdp_clear_flush
>  -> flush_tlb_mm_range
>   -> native_flush_tlb_others
>    -> smp_call_function_many

So is this two separate smp_call_function instances, crossing each 
other, and none makes any progress, indefinitely - as if the two IPIs 
got lost?

The traces Rafael linked to show a simpler scenario with two CPUs 
apparently locked up, doing this:

CPU0:

 #5 [81c03e88] native_safe_halt at 81059386
 #6 [81c03e90] default_idle at 8101eaee
 #7 [81c03eb0] arch_cpu_idle at 8101f46f
 #8 [81c03ec0] cpu_startup_entry at 810b6563
 #9 [81c03f30] rest_init at 817a6067
#10 [81c03f40] start_kernel at 81d4cfce
#11 [81c03f80] x86_64_start_reservations at 81d4c4d7
#12 [81c03f90] x86_64_start_kernel at 81d4c61c

This CPU is idle.

CPU1:

#10 [88081993fa70] smp_call_function_single at 810f4d69
#11 [88081993fb10] native_flush_tlb_others at 810671ae
#12 [88081993fb40] flush_tlb_mm_range at 810672d4
#13 [88081993fb80] pmdp_splitting_flush at 81065e0d
#14 [88081993fba0] split_huge_page_to_list at 811ddd39
#15 [88081993fc30] __split_huge_page_pmd at 811dec65
#16 [88081993fcc0] unmap_single_vma at 811a4f03
#17 [88081993fdc0] zap_page_range at 811a5d08
#18 [88081993fe80] sys_madvise at 811b9775
#19 [88081993ff80] system_call_fastpath at 817b8bad

This CPU is busy-waiting for the TLB flush IPI to finish.

There's no unexpected pattern here (other than it not finishing)
AFAICS; the smp_call_function_single() is just the usual way we invoke
the TLB flushing methods.

So one possibility would be that an 'IPI was sent but lost'.

We could try the following trick: poll for completion for a couple of 
seconds (since an IPI is not held up by anything but irqs-off 
sections, it should arrive within microseconds typically - seconds of 
polling should be more than enough), and if the IPI does not arrive, 
print a warning message and re-send the IPI.
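
Something like this minimal sketch of the csd_lock_wait() side is what
I have in mind (against ~v4.0 kernel/smp.c; the timeout value is
arbitrary and the 'cpu' parameter would have to be threaded through
from the caller, so treat it as pseudocode of the idea, not a tested
patch):

#define CSD_LOCK_TIMEOUT_NS	(5ULL * NSEC_PER_SEC)	/* illustrative */

static void csd_lock_wait_timeout(struct call_single_data *csd, int cpu)
{
	u64 ts = sched_clock();

	while (csd->flags & CSD_FLAG_LOCK) {
		if (sched_clock() - ts > CSD_LOCK_TIMEOUT_NS) {
			/* IPI presumably lost: warn and re-send it. */
			pr_warn("csd: CSD lock timeout, re-sending IPI to CPU#%d\n",
				cpu);
			arch_send_call_function_single_ipi(cpu);
			ts = sched_clock();	/* re-arm the timeout */
		}
		cpu_relax();
	}
}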

If the IPI was lost due to some race and there's no other failure mode 
that we don't understand, then this would work around the bug and 
would make the tests pass indefinitely - with occasional hiccups and a 
handful of messages produced along the way whenever it would have 
locked up with a previous kernel.

If testing indeed confirms that kind of behavior we could drill down 
more closely to figure out why the IPI did not get to its destination.

Or if the behavior is different, we'd have some new behavior to look 
at. (for example the IPI sending mechanism might be wedged 
indefinitely for some reason, so that even a resend won't work.)

Agreed?

Thanks,

Ingo


Re: smp_call_function_single lockups

2015-04-02 Thread Chris J Arges
On 04/02/2015 01:26 PM, Ingo Molnar wrote:
> 
> * Linus Torvalds  wrote:
> 
>> So unless we find a real clear signature of the bug (I was hoping 
>> that the ISR bit would be that sign), I don't think trying to bisect 
>> it based on how quickly you can reproduce things is worthwhile.
> 
> So I'm wondering (and I might have missed some earlier report that 
> outlines just that), now that the possible location of the bug is 
> again sadly up to 15+ million lines of code, I have no better idea 
> than to debug by symptoms again: what kind of effort was made to 
> examine the locked up state itself?
>

Ingo,

Rafael did some analysis when I was out earlier here:
https://lkml.org/lkml/2015/2/23/234

My reproducer setup is as follows:
L0 - 8-way CPU, 48 GB memory
L1 - 2-way vCPU, 4 GB memory
L2 - 1-way vCPU, 1 GB memory

Stress is only run in the L2 VM, and running top on L0/L1 doesn't show
excessive load.

> Softlockups always have some direct cause: which task exactly causes
> scheduling to stop altogether, and why does it lock up - or is it not a
> clear lockup, just a very slow system?
> 
> Thanks,
> 
>   Ingo
> 

Whenever we look through the crashdump we see csd_lock_wait waiting for
the CSD_FLAG_LOCK bit to be cleared.  Usually the signature leading up to
that looks like the following (in the OpenStack Tempest-on-OpenStack and
nested-VM stress cases):

(qemu-system-x86 task)
kvm_sched_in
 -> kvm_arch_vcpu_load
  -> vmx_vcpu_load
   -> loaded_vmcs_clear
    -> smp_call_function_single

(ksmd task)
pmdp_clear_flush
 -> flush_tlb_mm_range
  -> native_flush_tlb_others
   -> smp_call_function_many
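
For reference, this is the loop those tasks are parked in (kernel/smp.c
as of ~v3.19): CSD_FLAG_LOCK is set by the sender and only cleared once
the target CPU has run the queued function, so a lost or never-ACKed IPI
leaves this spinning until the soft-lockup watchdog fires:

static void csd_lock_wait(struct call_single_data *csd)
{
	/* Cleared by csd_unlock() on the target CPU: */
	while (csd->flags & CSD_FLAG_LOCK)
		cpu_relax();
}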

--chris


Re: smp_call_function_single lockups

2015-04-02 Thread Ingo Molnar

* Linus Torvalds  wrote:

> So unless we find a real clear signature of the bug (I was hoping 
> that the ISR bit would be that sign), I don't think trying to bisect 
> it based on how quickly you can reproduce things is worthwhile.

So I'm wondering (and I might have missed some earlier report that 
outlines just that), now that the possible location of the bug is 
again sadly up to 15+ million lines of code, I have no better idea 
than to debug by symptoms again: what kind of effort was made to 
examine the locked up state itself?

Softlockups always have some direct cause: which task exactly causes
scheduling to stop altogether, and why does it lock up - or is it not a
clear lockup, just a very slow system?

Thanks,

Ingo


Re: smp_call_function_single lockups

2015-04-02 Thread Linus Torvalds
On Thu, Apr 2, 2015 at 2:55 AM, Ingo Molnar  wrote:
>
> So another possibility would be that it's the third change causing
> this change in behavior:

Oh, yes, that looks much more likely. I overlooked that small change entirely.

> ... since with this we won't send IPIs in a semi-nested fashion with
> an unacked APIC, which is a good idea to do in general. It's also a
> weird enough hardware pattern that virtualization's APIC emulation
> might get it slightly wrong or slightly different.

Yup, that's more likely than the subtle timing/reordering differences
of just the added debug printouts. I agree.

 Linus


Re: smp_call_function_single lockups

2015-04-02 Thread Linus Torvalds
On Wed, Apr 1, 2015 at 2:59 PM, Chris J Arges
 wrote:
>
> Is it worthwhile to do a 'bisect' to see where on average it takes
> longer to reproduce? Perhaps it will point to a relevant change, or it
> may be completely useless.

It's likely to be an exercise in futility. "git bisect" is really bad
at "gray area" things, and when it's a question of "it takes hours or
days to reproduce", it's almost certainly not worth it. Not unless
there is some really clear cut-off that we can believably say "this
causes it to get much slower". And in this case, I don't think it's
that clear-cut. Judging by DaveJ's attempts at bisecting things, the
timing just changes. And the differences might be due to entirely
unrelated changes like cacheline alignment etc.

So unless we find a real clear signature of the bug (I was hoping that
the ISR bit would be that sign), I don't think trying to bisect it
based on how quickly you can reproduce things is worthwhile.

Linus


Re: smp_call_function_single lockups

2015-04-02 Thread Ingo Molnar

* Linus Torvalds  wrote:

> On Wed, Apr 1, 2015 at 7:32 AM, Chris J Arges
>  wrote:
> >
> > I included the full patch in reply to Ingo's email, and when 
> > running with that I no longer get the ack_APIC_irq WARNs.
> 
> Ok. That means that the printk's themselves just change timing 
> enough, or change the compiler instruction scheduling so that it 
> hides the apic problem.

So another possibility would be that it's the third change causing 
this change in behavior:

diff --git a/arch/x86/kernel/apic/vector.c b/arch/x86/kernel/apic/vector.c
index 6cedd7914581..833a981c5420 100644
--- a/arch/x86/kernel/apic/vector.c
+++ b/arch/x86/kernel/apic/vector.c
@@ -335,9 +340,11 @@ int apic_retrigger_irq(struct irq_data *data)
 
 void apic_ack_edge(struct irq_data *data)
 {
+   ack_APIC_irq();
+
+   /* Might generate IPIs, so do this after having ACKed the APIC: */
irq_complete_move(irqd_cfg(data));
irq_move_irq(data);
-   ack_APIC_irq();
 }
 
 /*

... since with this we won't send IPIs in a semi-nested fashion with 
an unacked APIC, which is a good idea to do in general. It's also a 
weird enough hardware pattern that virtualization's APIC emulation 
might get it slightly wrong or slightly different.

> Which very much indicates that these things are interconnected.
> 
> For example, Ingo's printk patch does
> 
> cfg->move_in_progress =
>         cpumask_intersects(cfg->old_domain, cpu_online_mask);
> +       if (cfg->move_in_progress)
> +               pr_info("apic: vector %02x, same-domain move in progress\n",
> +                       cfg->vector);
>         cpumask_and(cfg->domain, cfg->domain, tmp_mask);
> 
> and that means that now the setting of move_in_progress is 
> serialized with the cpumask_and() in a way that it wasn't before.

Yeah, that's a possibility too. It all looks very fragile.

Thanks,

Ingo


Re: smp_call_function_single lockups

2015-04-01 Thread Chris J Arges


On 04/01/2015 11:14 AM, Linus Torvalds wrote:
> On Wed, Apr 1, 2015 at 9:10 AM, Chris J Arges
>  wrote:
>>
>> Even with irqbalance removed from the L0/L1 machines the hang still occurs.
>>
>> This results in no 'apic: vector*' or 'ack_APIC*' messages being displayed.
> 
> Ok. So the ack_APIC debug patch found *something*, but it seems to be
> unrelated to the hang.
> 
> Dang. Oh well.  Back to square one.
> 
>Linus
> 

With my L0 testing I've normally used a 3.13-series kernel, since it
tends to reproduce the hang very quickly with the testcase. Note that we
have reproduced an identical hang with newer kernels (3.19+patches) using
the OpenStack Tempest-on-OpenStack reproducer, but the timing can vary
between hours and days.

Installing a v4.0-rc6+patch kernel on L0 makes the problem very slow to
reproduce, so I am running these tests now which may take day(s).

Is it worthwhile to do a 'bisect' to see where on average it takes
longer to reproduce? Perhaps it will point to a relevant change, or it
may be completely useless.

--chris



Re: smp_call_function_single lockups

2015-04-01 Thread Linus Torvalds
On Wed, Apr 1, 2015 at 9:10 AM, Chris J Arges
 wrote:
>
> Even with irqbalance removed from the L0/L1 machines the hang still occurs.
>
> This results in no 'apic: vector*' or 'ack_APIC*' messages being displayed.

Ok. So the ack_APIC debug patch found *something*, but it seems to be
unrelated to the hang.

Dang. Oh well.  Back to square one.

   Linus


Re: smp_call_function_single lockups

2015-04-01 Thread Chris J Arges
On Wed, Apr 01, 2015 at 02:43:36PM +0200, Ingo Molnar wrote:

 
> Have you already tested whether the hang goes away if you remove 
> irq-affinity fiddling daemons from the system? Do you have irqbalance 
> installed or similar mechanisms?
> 
> Thanks,
> 
>   Ingo
>

Even with irqbalance removed from the L0/L1 machines the hang still occurs.
Using the debug patch as described here:
https://lkml.org/lkml/2015/4/1/338

This results in no 'apic: vector*' or 'ack_APIC*' messages being displayed.

--chris


Re: smp_call_function_single lockups

2015-04-01 Thread Linus Torvalds
On Wed, Apr 1, 2015 at 7:32 AM, Chris J Arges
 wrote:
>
> I included the full patch in reply to Ingo's email, and when running with that
> I no longer get the ack_APIC_irq WARNs.

Ok. That means that the printk's themselves just change timing enough,
or change the compiler instruction scheduling so that it hides the
apic problem.

Which very much indicates that these things are interconnected.

For example, Ingo's printk patch does

        cfg->move_in_progress =
                cpumask_intersects(cfg->old_domain, cpu_online_mask);
+       if (cfg->move_in_progress)
+               pr_info("apic: vector %02x, same-domain move in progress\n",
+                       cfg->vector);
        cpumask_and(cfg->domain, cfg->domain, tmp_mask);

and that means that now the setting of move_in_progress is serialized
with the cpumask_and() in a way that it wasn't before.

And while the code takes the "vector_lock" and disables interrupts,
the interrupts themselves can happily continue on other cpu's, and
they don't take the vector_lock. Neither does send_cleanup_vector(),
which clears that bit, afaik.

I don't know. The locking there is odd.

> My next homework assignments are:
> - Testing with irqbalance disabled

Definitely.

> - Testing w/ the appropriate dump_stack() in Ingo's patch
> - L0 testing

Thanks,

 Linus


Re: [debug PATCHes] Re: smp_call_function_single lockups

2015-04-01 Thread Ingo Molnar

* Chris J Arges  wrote:

> > diff --git a/arch/x86/kernel/apic/vector.c b/arch/x86/kernel/apic/vector.c
> > index 6cedd7914581..79d6de6fdf0a 100644
> > --- a/arch/x86/kernel/apic/vector.c
> > +++ b/arch/x86/kernel/apic/vector.c
> > @@ -144,6 +144,8 @@ __assign_irq_vector(int irq, struct irq_cfg *cfg, const struct cpumask *mask)
> >         cfg->move_in_progress =
> >                 cpumask_intersects(cfg->old_domain, cpu_online_mask);
> >         cpumask_and(cfg->domain, cfg->domain, tmp_mask);
> > +       if (cfg->move_in_progress)
> > +               dump_stack();
> >         break;
> >         }
> >  
> > 
> 
> Should I add the dump_stack() to the 'new-domain move' area line 
> ~184 since that's the path I seem to be triggering?

Yeah, please do, that's what I wanted to mark.

My guess is that it will show a stack dump to within set_affinity().
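
I.e. roughly this, in the 'new-domain move' branch of
__assign_irq_vector() (sketched from memory, so the surrounding lines
may differ slightly from the debug patch you are running):

		cfg->move_in_progress =
			cpumask_intersects(cfg->old_domain, cpu_online_mask);
		if (cfg->move_in_progress) {
			pr_info("apic: vector %02x, new-domain move in progress\n",
				cfg->vector);
			dump_stack();
		}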

Thanks,

Ingo


Re: smp_call_function_single lockups

2015-04-01 Thread Chris J Arges
On Tue, Mar 31, 2015 at 04:07:32PM -0700, Linus Torvalds wrote:
> On Tue, Mar 31, 2015 at 3:23 PM, Chris J Arges
>  wrote:
> >
> > I had a few runs with your patch plus modifications, and got the following
> > results (modified patch inlined below):
> 
> Ok, thanks.
> 
> > [   14.423916] ack_APIC_irq: vector = d1, irq = 
> > [  176.060005] NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! 
> > [qemu-system-x86:1630]
> >
> > [   17.995298] ack_APIC_irq: vector = d1, irq = 
> > [  182.993828] ack_APIC_irq: vector = e1, irq = 
> > [  202.919691] ack_APIC_irq: vector = 22, irq = 
> > [  484.132006] NMI watchdog: BUG: soft lockup - CPU#1 stuck for 22s! 
> > [qemu-system-x86:1586]
> >
> > [   15.592032] ack_APIC_irq: vector = d1, irq = 
> > [  304.993490] ack_APIC_irq: vector = e1, irq = 
> > [  315.174755] ack_APIC_irq: vector = 22, irq = 
> > [  360.108007] NMI watchdog: BUG: soft lockup - CPU#1 stuck for 22s! 
> > [ksmd:26]
> .. snip snip ..
> 
> So yeah, that's VECTOR_UNDEFINED, and while it could happen as part of
> irq setup, I'm not seeing that being something that your load should
> trigger.
> 
> It could also obviously just be the vector being somehow corrupted,
> either due to crazy hardware or software.
> 
> But quite frankly, the most likely reason is that whole irq vector movement.
> 
> Especially since it sounds from your other email that when you apply
> Ingo's patches, the ack_APIC_irq warnings go away. Is that correct? Or
> did you just grep for "move" in the messages?
> 
> If you do get both movement messages (from Ingo's patch) _and_ the
> ack_APIC_irq warnings (from mine), it would be interesting to see if
> the vectors line up somehow..
> 
>  Linus
> 

Linus,

I included the full patch in reply to Ingo's email, and when running with that
I no longer get the ack_APIC_irq WARNs.

My next homework assignments are:
- Testing with irqbalance disabled
- Testing w/ the appropriate dump_stack() in Ingo's patch
- L0 testing

Thanks,
--chris




Re: smp_call_function_single lockups

2015-04-01 Thread Frederic Weisbecker
On Wed, Feb 11, 2015 at 12:42:10PM -0800, Linus Torvalds wrote:
> [ Added Frederic to the cc, since he's touched this file/area most ]
> 
> On Wed, Feb 11, 2015 at 11:59 AM, Linus Torvalds
>  wrote:
> >
> > So the caller has a really hard time guaranteeing that CSD_LOCK isn't
> > set. And if the call is done in interrupt context, for all we know it
> > is interrupting the code that is going to clear CSD_LOCK, so CSD_LOCK
> > will never be cleared at all, and csd_lock() will wait forever.
> >
> > So I actually think that for the async case, we really *should* unlock
> > before doing the callback (which is what Thomas' old patch did).
> >
> > And we migth well be better off doing something like
> >
> > WARN_ON_ONCE(csd->flags & CSD_LOCK);
> >
> > in smp_call_function_single_async(), because that really is a hard 
> > requirement.
> >
> > And it strikes me that hrtick_csd is one of these cases that do this
> > with interrupts disabled, and use the callback for serialization. So I
> > really wonder if this is part of the problem..
> >
> > Thomas? Am I missing something?
> 
> Ok, this is a more involved patch than I'd like, but making the
> *caller* do all the CSD maintenance actually cleans things up.
> 
> And this is still completely untested, and may be entirely buggy. What
> do you guys think?
> 
>  Linus

>  kernel/smp.c | 78 
> 
>  1 file changed, 47 insertions(+), 31 deletions(-)
> 
> diff --git a/kernel/smp.c b/kernel/smp.c
> index f38a1e692259..2aaac2c47683 100644
> --- a/kernel/smp.c
> +++ b/kernel/smp.c
> @@ -19,7 +19,7 @@
>  
>  enum {
>   CSD_FLAG_LOCK   = 0x01,
> - CSD_FLAG_WAIT   = 0x02,
> + CSD_FLAG_SYNCHRONOUS = 0x02,
>  };
>  
>  struct call_function_data {
> @@ -107,7 +107,7 @@ void __init call_function_init(void)
>   */
>  static void csd_lock_wait(struct call_single_data *csd)
>  {
> - while (csd->flags & CSD_FLAG_LOCK)
> + while (smp_load_acquire(&csd->flags) & CSD_FLAG_LOCK)
>   cpu_relax();
>  }
>  
> @@ -121,19 +121,17 @@ static void csd_lock(struct call_single_data *csd)
>* to ->flags with any subsequent assignments to other
>* fields of the specified call_single_data structure:
>*/
> - smp_mb();
> + smp_wmb();
>  }
>  
>  static void csd_unlock(struct call_single_data *csd)
>  {
> - WARN_ON((csd->flags & CSD_FLAG_WAIT) && !(csd->flags & CSD_FLAG_LOCK));
> + WARN_ON(!(csd->flags & CSD_FLAG_LOCK));
>  
>   /*
>* ensure we're all done before releasing data:
>*/
> - smp_mb();
> -
> - csd->flags &= ~CSD_FLAG_LOCK;
> + smp_store_release(&csd->flags, 0);
>  }
>  
>  static DEFINE_PER_CPU_SHARED_ALIGNED(struct call_single_data, csd_data);
> @@ -144,13 +142,16 @@ static DEFINE_PER_CPU_SHARED_ALIGNED(struct call_single_data, csd_data);
>   * ->func, ->info, and ->flags set.
>   */
>  static int generic_exec_single(int cpu, struct call_single_data *csd,
> -smp_call_func_t func, void *info, int wait)
> +smp_call_func_t func, void *info)
>  {
> - struct call_single_data csd_stack = { .flags = 0 };
> - unsigned long flags;
> -
> -
>   if (cpu == smp_processor_id()) {
> + unsigned long flags;
> +
> + /*
> +  * We can unlock early even for the synchronous on-stack case,
> +  * since we're doing this from the same CPU..
> +  */
> + csd_unlock(csd);
>   local_irq_save(flags);
>   func(info);
>   local_irq_restore(flags);
> @@ -161,21 +162,9 @@ static int generic_exec_single(int cpu, struct call_single_data *csd,
>   if ((unsigned)cpu >= nr_cpu_ids || !cpu_online(cpu))
>   return -ENXIO;
>  
> -
> - if (!csd) {
> - csd = &csd_stack;
> - if (!wait)
> - csd = this_cpu_ptr(&csd_data);
> - }
> -
> - csd_lock(csd);
> -
>   csd->func = func;
>   csd->info = info;
>  
> - if (wait)
> - csd->flags |= CSD_FLAG_WAIT;
> -
>   /*
>* The list addition should be visible before sending the IPI
>* handler locks the list to pull the entry off it because of
> @@ -190,9 +179,6 @@ static int generic_exec_single(int cpu, struct call_single_data *csd,
>   if (llist_add(&csd->llist, &per_cpu(call_single_queue, cpu)))
>   arch_send_call_function_single_ipi(cpu);
>  
> - if (wait)
> - csd_lock_wait(csd);
> -
>   return 0;
>  }
>  
> @@ -250,8 +236,17 @@ static void flush_smp_call_function_queue(bool warn_cpu_offline)
>   }
>  
>   llist_for_each_entry_safe(csd, csd_next, entry, llist) {
> - csd->func(csd->info);
> - csd_unlock(csd);
> + smp_call_func_t func = csd->func;
> + void *info = csd->info;
> +
> + /* Do we wait until *after* 

Re: [debug PATCHes] Re: smp_call_function_single lockups

2015-04-01 Thread Chris J Arges
On Wed, Apr 01, 2015 at 02:39:13PM +0200, Ingo Molnar wrote:
> 
> * Chris J Arges  wrote:
> 
> > This was tested only on the L1, so I can put this on the L0 host and run
> > this as well. The results:
> > 
> > [  124.897002] apic: vector c1, new-domain move in progress 
> >
> > [  124.954827] apic: vector d1, sent cleanup vector, move completed 
> >
> > [  163.477270] apic: vector d1, new-domain move in progress 
> >
> > [  164.041938] apic: vector e1, sent cleanup vector, move completed 
> >
> > [  213.466971] apic: vector e1, new-domain move in progress 
> >
> > [  213.775639] apic: vector 22, sent cleanup vector, move completed 
> >
> > [  365.996747] apic: vector 22, new-domain move in progress 
> >
> > [  366.011136] apic: vector 42, sent cleanup vector, move completed 
> >
> > [  393.836032] apic: vector 42, new-domain move in progress 
> >
> > [  393.837727] apic: vector 52, sent cleanup vector, move completed 
> >
> > [  454.977514] apic: vector 52, new-domain move in progress 
> >
> > [  454.978880] apic: vector 62, sent cleanup vector, move completed 
> >
> > [  467.055730] apic: vector 62, new-domain move in progress 
> >
> > [  467.058129] apic: vector 72, sent cleanup vector, move completed 
> >
> > [  545.280125] apic: vector 72, new-domain move in progress 
> >
> > [  545.282801] apic: vector 82, sent cleanup vector, move completed 
> >
> > [  567.631652] apic: vector 82, new-domain move in progress 
> >
> > [  567.632207] apic: vector 92, sent cleanup vector, move completed 
> >
> > [  628.940638] apic: vector 92, new-domain move in progress 
> >
> > [  628.965274] apic: vector a2, sent cleanup vector, move completed 
> >
> > [  635.187433] apic: vector a2, new-domain move in progress 
> >
> > [  635.191643] apic: vector b2, sent cleanup vector, move completed 
> >
> > [  673.548020] apic: vector b2, new-domain move in progress 
> >
> > [  673.553843] apic: vector c2, sent cleanup vector, move completed 
> >
> > [  688.221906] apic: vector c2, new-domain move in progress 
> >
> > [  688.229487] apic: vector d2, sent cleanup vector, move completed 
> >
> > [  723.818916] apic: vector d2, new-domain move in progress 
> >
> > [  723.828970] apic: vector e2, sent cleanup vector, move completed 
> >
> > [  733.485435] apic: vector e2, new-domain move in progress 
> >
> > [  733.615007] apic: vector 23, sent cleanup vector, move completed 
> >
> > [  824.092036] NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! 
> > [ksmd:26] 
> 
> Are these all the messages? Looks like Linus's warnings went away, or 
> did you filter them out?
>
Ingo,

The patch I ran with included both patches, but I had to modify your patch
to accommodate Linus' vector tracking; I've inlined the complete patch below.

The above kernel log was all kernel messages (no grepping) leading up to the
soft lockup on L1.
 
> But ... the affinity setting message does not appear to trigger, and 
> that's the only real race I can see in the code. Also, the frequency 
> of these messages appears to be low, while the race window is narrow. 
> So I'm not sure the problem is related to the irq-move mechanism.
> 
> One thing that appears to be weird: why is there irq-movement activity 
> to begin with? Is something changing irq-affinities?
>

irqbalance is running on all VMs/Systems (L0,L1,L2) by default. I'll disable
this and re-test.
 
> Could you put a dump_stack() into the call? Something like the patch 
> below, in addition to all patches so far. (if it conflicts with the 
> previous debugging patches then just add the code manually to after 
> the debug printout.)
> 
> Thanks,
> 
>   Ingo
> 
> diff --git a/arch/x86/kernel/apic/vector.c b/arch/x86/kernel/apic/vector.c
> index 6cedd7914581..79d6de6fdf0a 100644
> --- a/arch/x86/kernel/apic/vector.c
> +++ b/arch/x86/kernel/apic/vector.c
> @@ -144,6 +144,8 @@ __assign_irq_vector(int irq, struct irq_cfg *cfg, const 
> struct cpumask *mask)
>   cfg->move_in_progress =
>  cpumask_intersects(cfg->old_domain, cpu_online_mask);
>   cpumask_and(cfg->domain, cfg->domain, tmp_mask);
> + if (cfg->move_in_progress)
> + dump_stack();
>   break;
>   }
>  
> 

Should I add the dump_stack() to the 'new-domain move' area line ~184 since
that's the path I seem to be triggering?

Thanks,
--chris

--

For completeness, here is the patch that I ran with previously:

diff --git a/arch/x86/include/asm/apic.h b/arch/x86/include/asm/apic.h
index efc3b22..88b3933 100644
--- a/arch/x86/include/asm/apic.h
+++ b/arch/x86/include/asm/apic.h

Re: smp_call_function_single lockups

2015-04-01 Thread Ingo Molnar

* Chris J Arges  wrote:

> Linus,
> 
> I had a few runs with your patch plus modifications, and got the following
> results (modified patch inlined below):
> 
> [   14.423916] ack_APIC_irq: vector = d1, irq = 
> [  176.060005] NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [qemu-system-x86:1630]
> 
> [   17.995298] ack_APIC_irq: vector = d1, irq = 
> [  182.993828] ack_APIC_irq: vector = e1, irq = 
> [  202.919691] ack_APIC_irq: vector = 22, irq = 
> [  484.132006] NMI watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [qemu-system-x86:1586]
> 
> [   15.592032] ack_APIC_irq: vector = d1, irq = 
> [  304.993490] ack_APIC_irq: vector = e1, irq = 
> [  315.174755] ack_APIC_irq: vector = 22, irq = 
> [  360.108007] NMI watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [ksmd:26]
> 
> [   15.026077] ack_APIC_irq: vector = b1, irq = 
> [  374.828531] ack_APIC_irq: vector = c1, irq = 
> [  402.965942] ack_APIC_irq: vector = d1, irq = 
> [  434.540814] ack_APIC_irq: vector = e1, irq = 
> [  461.820768] ack_APIC_irq: vector = 22, irq = 
> [  536.120027] NMI watchdog: BUG: soft lockup - CPU#0 stuck for 23s! [qemu-system-x86:4243]
> 
> [   17.889334] ack_APIC_irq: vector = d1, irq = 
> [  291.888784] ack_APIC_irq: vector = e1, irq = 
> [  297.824627] ack_APIC_irq: vector = 22, irq = 
> [  336.960594] ack_APIC_irq: vector = 42, irq = 
> [  367.012706] ack_APIC_irq: vector = 52, irq = 
> [  377.025090] ack_APIC_irq: vector = 62, irq = 
> [  417.088773] ack_APIC_irq: vector = 72, irq = 
> [  447.136788] ack_APIC_irq: vector = 82, irq = 
> -- stopped it since it wasn't reproducing / I was impatient --
> 
> So I'm seeing irq == VECTOR_UNDEFINED in all of these cases. Making
> (vector >= 0) didn't seem to expose any additional vectors.

So, these vectors do seem to be lining up with the pattern of how new 
irq vectors get assigned and how we slowly rotate through all 
available ones.

The VECTOR_UNDEFINED might correspond to the fact that we already 
'freed' that vector, as part of the irq-move mechanism - but it was 
likely in use shortly before.

So the irq-move code is not off the hook, to the contrary.

Have you already tested whether the hang goes away if you remove 
irq-affinity fiddling daemons from the system? Do you have irqbalance 
installed or similar mechanisms?

Thanks,

Ingo
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
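
[Editor's note: illustration, not from the thread. A simplified sketch of the
vector-to-irq lookup behind the empty "irq =" values in the logs above, using
3.19-era names (vector_irq[], VECTOR_UNDEFINED); the real table is per-CPU and
the exact declarations differ slightly across kernel versions.]

#include <linux/kernel.h>
#include <asm/irq_vectors.h>

#define VECTOR_UNDEFINED	(-1)

/* Per-CPU in the real code; flattened here to keep the sketch small: */
static int dbg_vector_irq[NR_VECTORS];

static int dbg_vector_to_irq(unsigned int vector)
{
	int irq = dbg_vector_irq[vector];

	/*
	 * The irq-move cleanup IPI resets a slot to VECTOR_UNDEFINED, so
	 * acking a vector that was just moved reads back -1 -- matching
	 * Ingo's point that the vector was likely in use shortly before.
	 */
	if (irq == VECTOR_UNDEFINED)
		pr_warn("vector %02x already freed by irq-move cleanup\n", vector);

	return irq;
}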


Re: [debug PATCHes] Re: smp_call_function_single lockups

2015-04-01 Thread Ingo Molnar

* Chris J Arges  wrote:

> This was only tested only on the L1, so I can put this on the L0 host and run
> this as well. The results:
> 
> [  124.897002] apic: vector c1, new-domain move in progress
> [  124.954827] apic: vector d1, sent cleanup vector, move completed
> [  163.477270] apic: vector d1, new-domain move in progress
> [  164.041938] apic: vector e1, sent cleanup vector, move completed
> [  213.466971] apic: vector e1, new-domain move in progress
> [  213.775639] apic: vector 22, sent cleanup vector, move completed
> [  365.996747] apic: vector 22, new-domain move in progress
> [  366.011136] apic: vector 42, sent cleanup vector, move completed
> [  393.836032] apic: vector 42, new-domain move in progress
> [  393.837727] apic: vector 52, sent cleanup vector, move completed
> [  454.977514] apic: vector 52, new-domain move in progress
> [  454.978880] apic: vector 62, sent cleanup vector, move completed
> [  467.055730] apic: vector 62, new-domain move in progress
> [  467.058129] apic: vector 72, sent cleanup vector, move completed
> [  545.280125] apic: vector 72, new-domain move in progress
> [  545.282801] apic: vector 82, sent cleanup vector, move completed
> [  567.631652] apic: vector 82, new-domain move in progress
> [  567.632207] apic: vector 92, sent cleanup vector, move completed
> [  628.940638] apic: vector 92, new-domain move in progress
> [  628.965274] apic: vector a2, sent cleanup vector, move completed
> [  635.187433] apic: vector a2, new-domain move in progress
> [  635.191643] apic: vector b2, sent cleanup vector, move completed
> [  673.548020] apic: vector b2, new-domain move in progress
> [  673.553843] apic: vector c2, sent cleanup vector, move completed
> [  688.221906] apic: vector c2, new-domain move in progress
> [  688.229487] apic: vector d2, sent cleanup vector, move completed
> [  723.818916] apic: vector d2, new-domain move in progress
> [  723.828970] apic: vector e2, sent cleanup vector, move completed
> [  733.485435] apic: vector e2, new-domain move in progress
> [  733.615007] apic: vector 23, sent cleanup vector, move completed
> [  824.092036] NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [ksmd:26]

Are these all the messages? Looks like Linus's warnings went away, or 
did you filter them out?

But ... the affinity setting message does not appear to trigger, and 
that's the only real race I can see in the code. Also, the frequency 
of these messages appears to be low, while the race window is narrow. 
So I'm not sure the problem is related to the irq-move mechanism.

One thing that appears to be weird: why is there irq-movement activity 
to begin with? Is something changing irq-affinities?

Could you put a dump_stack() into the call? Something like the patch 
below, in addition to all patches so far. (if it conflicts with the 
previous debugging patches then just add the code manually to after 
the debug printout.)

Thanks,

Ingo

diff --git a/arch/x86/kernel/apic/vector.c b/arch/x86/kernel/apic/vector.c
index 6cedd7914581..79d6de6fdf0a 100644
--- a/arch/x86/kernel/apic/vector.c
+++ b/arch/x86/kernel/apic/vector.c
@@ -144,6 +144,8 @@ __assign_irq_vector(int irq, struct irq_cfg *cfg, const 
struct cpumask *mask)
cfg->move_in_progress =
   cpumask_intersects(cfg->old_domain, cpu_online_mask);
cpumask_and(cfg->domain, cfg->domain, tmp_mask);
+   if (cfg->move_in_progress)
+   dump_stack();
break;
}
 
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: smp_call_function_single lockups

2015-04-01 Thread Frederic Weisbecker
On Wed, Feb 11, 2015 at 12:42:10PM -0800, Linus Torvalds wrote:
> [ Added Frederic to the cc, since he's touched this file/area most ]
> 
> On Wed, Feb 11, 2015 at 11:59 AM, Linus Torvalds
> torva...@linux-foundation.org wrote:
> 
> > So the caller has a really hard time guaranteeing that CSD_LOCK isn't
> > set. And if the call is done in interrupt context, for all we know it
> > is interrupting the code that is going to clear CSD_LOCK, so CSD_LOCK
> > will never be cleared at all, and csd_lock() will wait forever.
> >
> > So I actually think that for the async case, we really *should* unlock
> > before doing the callback (which is what Thomas' old patch did).
> >
> > And we might well be better off doing something like
> >
> > 	WARN_ON_ONCE(csd->flags & CSD_LOCK);
> >
> > in smp_call_function_single_async(), because that really is a hard
> > requirement.
> >
> > And it strikes me that hrtick_csd is one of these cases that do this
> > with interrupts disabled, and use the callback for serialization. So I
> > really wonder if this is part of the problem..
> >
> > Thomas? Am I missing something?
> 
> Ok, this is a more involved patch than I'd like, but making the
> *caller* do all the CSD maintenance actually cleans things up.
> 
> And this is still completely untested, and may be entirely buggy. What
> do you guys think?
> 
>  Linus
> 
>  kernel/smp.c | 78 
>  1 file changed, 47 insertions(+), 31 deletions(-)
> 
> diff --git a/kernel/smp.c b/kernel/smp.c
> index f38a1e692259..2aaac2c47683 100644
> --- a/kernel/smp.c
> +++ b/kernel/smp.c
> @@ -19,7 +19,7 @@
>  
>  enum {
>  	CSD_FLAG_LOCK		= 0x01,
> -	CSD_FLAG_WAIT		= 0x02,
> +	CSD_FLAG_SYNCHRONOUS	= 0x02,
>  };
>  
>  struct call_function_data {
> @@ -107,7 +107,7 @@ void __init call_function_init(void)
>   */
>  static void csd_lock_wait(struct call_single_data *csd)
>  {
> -	while (csd->flags & CSD_FLAG_LOCK)
> +	while (smp_load_acquire(&csd->flags) & CSD_FLAG_LOCK)
>  		cpu_relax();
>  }
>  
> @@ -121,19 +121,17 @@ static void csd_lock(struct call_single_data *csd)
>  	 * to ->flags with any subsequent assignments to other
>  	 * fields of the specified call_single_data structure:
>  	 */
> -	smp_mb();
> +	smp_wmb();
>  }
>  
>  static void csd_unlock(struct call_single_data *csd)
>  {
> -	WARN_ON((csd->flags & CSD_FLAG_WAIT) && !(csd->flags & CSD_FLAG_LOCK));
> +	WARN_ON(!(csd->flags & CSD_FLAG_LOCK));
>  
>  	/*
>  	 * ensure we're all done before releasing data:
>  	 */
> -	smp_mb();
> -
> -	csd->flags &= ~CSD_FLAG_LOCK;
> +	smp_store_release(&csd->flags, 0);
>  }
>  
>  static DEFINE_PER_CPU_SHARED_ALIGNED(struct call_single_data, csd_data);
> @@ -144,13 +142,16 @@ static DEFINE_PER_CPU_SHARED_ALIGNED(struct call_single_data, csd_data);
>   * ->func, ->info, and ->flags set.
>   */
>  static int generic_exec_single(int cpu, struct call_single_data *csd,
> -			       smp_call_func_t func, void *info, int wait)
> +			       smp_call_func_t func, void *info)
>  {
> -	struct call_single_data csd_stack = { .flags = 0 };
> -	unsigned long flags;
> -
> -
>  	if (cpu == smp_processor_id()) {
> +		unsigned long flags;
> +
> +		/*
> +		 * We can unlock early even for the synchronous on-stack case,
> +		 * since we're doing this from the same CPU..
> +		 */
> +		csd_unlock(csd);
>  		local_irq_save(flags);
>  		func(info);
>  		local_irq_restore(flags);
> @@ -161,21 +162,9 @@ static int generic_exec_single(int cpu, struct call_single_data *csd,
>  	if ((unsigned)cpu >= nr_cpu_ids || !cpu_online(cpu))
>  		return -ENXIO;
>  
> -
> -	if (!csd) {
> -		csd = &csd_stack;
> -		if (!wait)
> -			csd = this_cpu_ptr(&csd_data);
> -	}
> -
> -	csd_lock(csd);
> -
>  	csd->func = func;
>  	csd->info = info;
>  
> -	if (wait)
> -		csd->flags |= CSD_FLAG_WAIT;
> -
>  	/*
>  	 * The list addition should be visible before sending the IPI
>  	 * handler locks the list to pull the entry off it because of
> @@ -190,9 +179,6 @@ static int generic_exec_single(int cpu, struct call_single_data *csd,
>  	if (llist_add(&csd->llist, &per_cpu(call_single_queue, cpu)))
>  		arch_send_call_function_single_ipi(cpu);
>  
> -	if (wait)
> -		csd_lock_wait(csd);
> -
>  	return 0;
>  }
>  
> @@ -250,8 +236,17 @@ static void flush_smp_call_function_queue(bool warn_cpu_offline)
>  	}
>  
>  	llist_for_each_entry_safe(csd, csd_next, entry, llist) {
> -		csd->func(csd->info);
> -		csd_unlock(csd);
> +		smp_call_func_t func = csd->func;
> +		void *info = csd->info;
> +
> +		/* Do we wait until *after* callback? */
> +		if (csd->flags & CSD_FLAG_SYNCHRONOUS) {
> +			func(info);
> +			csd_unlock(csd);
> +
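
[Editor's note: illustration, not part of the thread. A sketch of the
caller-side rule the patch above establishes for the async API, written as if
it lived in kernel/smp.c next to the flags above; the example_ function name
is hypothetical, and in the real patch the equivalent logic sits in
smp_call_function_single_async()/generic_exec_single().]

/*
 * After the rework, a csd stays CSD_FLAG_LOCK'ed from submission until
 * the IPI handler unlocks it: *before* invoking the callback for async
 * calls, *after* the callback for synchronous ones. An async caller
 * therefore must never resubmit a csd that is still in flight.
 */
static int example_call_single_async(int cpu, struct call_single_data *csd)
{
	/* Hard requirement per the discussion above: */
	WARN_ON_ONCE(csd->flags & CSD_FLAG_LOCK);

	csd->flags = CSD_FLAG_LOCK;	/* no CSD_FLAG_SYNCHRONOUS: async */
	smp_wmb();			/* publish ->flags before the IPI */

	return generic_exec_single(cpu, csd, csd->func, csd->info);
}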

Re: smp_call_function_single lockups

2015-04-01 Thread Chris J Arges
On Tue, Mar 31, 2015 at 04:07:32PM -0700, Linus Torvalds wrote:
> On Tue, Mar 31, 2015 at 3:23 PM, Chris J Arges
> chris.j.ar...@canonical.com wrote:
> >
> > I had a few runs with your patch plus modifications, and got the following
> > results (modified patch inlined below):
> 
> Ok, thanks.
> 
> > [   14.423916] ack_APIC_irq: vector = d1, irq = 
> > [  176.060005] NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [qemu-system-x86:1630]
> >
> > [   17.995298] ack_APIC_irq: vector = d1, irq = 
> > [  182.993828] ack_APIC_irq: vector = e1, irq = 
> > [  202.919691] ack_APIC_irq: vector = 22, irq = 
> > [  484.132006] NMI watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [qemu-system-x86:1586]
> >
> > [   15.592032] ack_APIC_irq: vector = d1, irq = 
> > [  304.993490] ack_APIC_irq: vector = e1, irq = 
> > [  315.174755] ack_APIC_irq: vector = 22, irq = 
> > [  360.108007] NMI watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [ksmd:26]
> .. snip snip ..
> 
> So yeah, that's VECTOR_UNDEFINED, and while it could happen as part of
> irq setup, I'm not seeing that being something that your load should
> trigger.
> 
> It could also obviously just be the vector being somehow corrupted,
> either due to crazy hardware or software.
> 
> But quite frankly, the most likely reason is that whole irq vector movement.
> 
> Especially since it sounds from your other email that when you apply
> Ingo's patches, the ack_APIC_irq warnings go away. Is that correct? Or
> did you just grep for "move" in the messages?
> 
> If you do get both movement messages (from Ingo's patch) _and_ the
> ack_APIC_irq warnings (from mine), it would be interesting to see if
> the vectors line up somehow..
> 
>  Linus
> 

Linus,

I included the full patch in reply to Ingo's email, and when running with that
I no longer get the ack_APIC_irq WARNs.

My next homework assignments are:
- Testing with irqbalance disabled
- Testing w/ the appropriate dump_stack() in Ingo's patch
- L0 testing

Thanks,
--chris


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: smp_call_function_single lockups

2015-04-01 Thread Linus Torvalds
On Wed, Apr 1, 2015 at 7:32 AM, Chris J Arges
chris.j.ar...@canonical.com wrote:
>
> I included the full patch in reply to Ingo's email, and when running with that
> I no longer get the ack_APIC_irq WARNs.

Ok. That means that the printk's themselves just change timing enough,
or change the compiler instruction scheduling so that it hides the
apic problem.

Which very much indicates that these things are interconnected.

For example, Ingo's printk patch does

        cfg->move_in_progress =
           cpumask_intersects(cfg->old_domain, cpu_online_mask);
+       if (cfg->move_in_progress)
+               pr_info("apic: vector %02x, same-domain move in progress\n", cfg->vector);
        cpumask_and(cfg->domain, cfg->domain, tmp_mask);

and that means that now the setting of move_in_progress is serialized
with the cpumask_and() in a way that it wasn't before.

And while the code takes the "vector_lock" and disables interrupts,
the interrupts themselves can happily continue on other cpu's, and
they don't take the vector_lock. Neither does send_cleanup_vector(),
which clears that bit, afaik.

I don't know. The locking there is odd.

> My next homework assignments are:
> - Testing with irqbalance disabled

Definitely.

> - Testing w/ the appropriate dump_stack() in Ingo's patch
> - L0 testing

Thanks,

 Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: smp_call_function_single lockups

2015-04-01 Thread Chris J Arges
On Wed, Apr 01, 2015 at 02:43:36PM +0200, Ingo Molnar wrote:
> 
<snip>
> Have you already tested whether the hang goes away if you remove 
> irq-affinity fiddling daemons from the system? Do you have irqbalance 
> installed or similar mechanisms?
> 
> Thanks,
> 
> 	Ingo


Even with irqbalance removed from the L0/L1 machines the hang still occurs.
Using the debug patch as described here:
https://lkml.org/lkml/2015/4/1/338

This results in no 'apic: vector*' or 'ack_APIC*' messages being displayed.

--chris
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [debug PATCHes] Re: smp_call_function_single lockups

2015-04-01 Thread Ingo Molnar

* Chris J Arges chris.j.ar...@canonical.com wrote:

> > diff --git a/arch/x86/kernel/apic/vector.c b/arch/x86/kernel/apic/vector.c
> > index 6cedd7914581..79d6de6fdf0a 100644
> > --- a/arch/x86/kernel/apic/vector.c
> > +++ b/arch/x86/kernel/apic/vector.c
> > @@ -144,6 +144,8 @@ __assign_irq_vector(int irq, struct irq_cfg *cfg, const struct cpumask *mask)
> > 		cfg->move_in_progress =
> > 			cpumask_intersects(cfg->old_domain, cpu_online_mask);
> > 		cpumask_and(cfg->domain, cfg->domain, tmp_mask);
> > +		if (cfg->move_in_progress)
> > +			dump_stack();
> > 		break;
> > 	}
> 
> Should I add the dump_stack() to the 'new-domain move' area line 
> ~184 since that's the path I seem to be triggering?

Yeah, please do, that's what I wanted to mark.

My guess is that it will show a stack dump to within set_affinity().

Thanks,

	Ingo
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
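
[Editor's note: a hypothetical sketch of what Chris proposes above -- the same
dump_stack(), moved into the 'new-domain move' branch of
__assign_irq_vector(). The hunk location and context lines are approximate,
mirroring Ingo's debug patch and the "new-domain move in progress" messages in
the logs.]

--- a/arch/x86/kernel/apic/vector.c
+++ b/arch/x86/kernel/apic/vector.c
@@ -184,6 +184,8 @@ __assign_irq_vector(int irq, struct irq_cfg *cfg, const struct cpumask *mask)
 			cfg->move_in_progress =
 			   cpumask_intersects(cfg->old_domain, cpu_online_mask);
+			if (cfg->move_in_progress)
+				dump_stack();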


Re: smp_call_function_single lockups

2015-04-01 Thread Linus Torvalds
On Wed, Apr 1, 2015 at 9:10 AM, Chris J Arges
chris.j.ar...@canonical.com wrote:
>
> Even with irqbalance removed from the L0/L1 machines the hang still occurs.
>
> This results in no 'apic: vector*' or 'ack_APIC*' messages being displayed.

Ok. So the ack_APIC debug patch found *something*, but it seems to be
unrelated to the hang.

Dang. Oh well.  Back to square one.

   Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: smp_call_function_single lockups

2015-04-01 Thread Chris J Arges


On 04/01/2015 11:14 AM, Linus Torvalds wrote:
> On Wed, Apr 1, 2015 at 9:10 AM, Chris J Arges
> chris.j.ar...@canonical.com wrote:
>
>> Even with irqbalance removed from the L0/L1 machines the hang still occurs.
>>
>> This results in no 'apic: vector*' or 'ack_APIC*' messages being displayed.
> 
> Ok. So the ack_APIC debug patch found *something*, but it seems to be
> unrelated to the hang.
> 
> Dang. Oh well.  Back to square one.
> 
>    Linus
> 

With my L0 testing I've normally used a 3.13 series kernel since it
tends to reproduce the hang very quickly with the testcase. Note we have
reproduced an identical hang with newer kernels (3.19+patches) using the
openstack tempest on openstack reproducer, but the timing can vary
between hours and days.

Installing a v4.0-rc6+patch kernel on L0 makes the problem very slow to
reproduce, so I am running these tests now which may take day(s).

It is worthwhile to do a 'bisect' to see where on average it takes
longer to reproduce? Perhaps it will point to a relevant change, or it
may be completely useless.

--chris

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [debug PATCHes] Re: smp_call_function_single lockups

2015-03-31 Thread Daniel J Blueman
On Wednesday, April 1, 2015 at 6:40:06 AM UTC+8, Chris J Arges wrote:
> On Tue, Mar 31, 2015 at 12:56:56PM +0200, Ingo Molnar wrote:
> >
> > * Linus Torvalds  wrote:
> >
> > > Ok, interesting. So the whole "we try to do an APIC ACK with the ISR
> > > bit clear" seems to be a real issue.
> >
> > It's interesting in particular when it happens with an edge-triggered
> > interrupt source: it's much harder to miss level triggered IRQs, which
> > stay around until actively handled. Edge triggered irqs are more
> > fragile to loss of event processing.
> >
> > > > Anyway, maybe this sheds some more light on this issue. I can
> > > > reproduce this at will, so let me know of other experiments to do.
> >
> > Btw., could you please describe (again) what your current best method
> > for reproduction is? It's been a long discussion ...
> >
>
> Ingo,
>
> To set this up, I've done the following on a Xeon E5620 / Xeon E312xx machine
> ( Although I've heard of others that have reproduced on other machines. )
>
> 1) Create an L1 KVM VM with 2 vCPUs (single vCPU case doesn't reproduce)
> 2) Create an L2 KVM VM inside the L1 VM with 1 vCPU
> 3) Add the following to the L1 cmdline:
> nmi_watchdog=panic hung_task_panic=1 softlockup_panic=1 unknown_nmi_panic
> 3) Run something like 'stress -c 1 -m 1 -d 1 -t 1200' inside the L2 VM
>
> Sometimes this is sufficient to reproduce the issue, I've observed that 
> running
> KSM in the L1 VM can agitate this issue (it calls native_flush_tlb_others).
> If this doesn't reproduce then you can do the following:
> 4) Migrate the L2 vCPU randomly (via virsh vcpupin --live  OR tasksel) between
> L1 vCPUs until the hang occurs.
>
> I attempted to write a module that used smp_call_function_single calls to
> trigger IPIs but have been unable to create a more simple reproducer.

A non-intrusive way of generating a lot of IPIs, is calling
stop_machine() via something like:

while :; do
echo "base=0x200 size=0x800 type=write-back" >/proc/mtrr
echo "disable=4" >| /proc/mtrr
done

Of course, ensure base is above DRAM and any 64-bit MMIO for no
side-effects and ensure it'll be entry 4. Onlining and offlining cores
in parallel will generate IPIs also.

Dan
-- 
Daniel J Blueman
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
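
[Editor's note: an illustrative sketch, not code posted in the thread. A
minimal kernel module that hammers stop_machine() to generate cross-CPU IPIs,
as an in-kernel alternative to Daniel's MTRR trick above; module and function
names are made up, and the stop_machine() signature is the 3.x-era one.]

#include <linux/module.h>
#include <linux/kthread.h>
#include <linux/stop_machine.h>

static struct task_struct *ipi_thread;

static int ipi_noop(void *unused)
{
	return 0;	/* all CPUs rendezvous; the work itself is nothing */
}

static int ipi_storm(void *unused)
{
	while (!kthread_should_stop()) {
		/* Each call IPIs every online CPU into the stopper: */
		stop_machine(ipi_noop, NULL, NULL);
		cond_resched();
	}
	return 0;
}

static int __init ipi_storm_init(void)
{
	ipi_thread = kthread_run(ipi_storm, NULL, "ipi-storm");
	return PTR_ERR_OR_ZERO(ipi_thread);
}

static void __exit ipi_storm_exit(void)
{
	kthread_stop(ipi_thread);
}

module_init(ipi_storm_init);
module_exit(ipi_storm_exit);
MODULE_LICENSE("GPL");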


Re: smp_call_function_single lockups

2015-03-31 Thread Linus Torvalds
On Tue, Mar 31, 2015 at 3:23 PM, Chris J Arges
 wrote:
>
> I had a few runs with your patch plus modifications, and got the following
> results (modified patch inlined below):

Ok, thanks.

> [   14.423916] ack_APIC_irq: vector = d1, irq = 
> [  176.060005] NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! 
> [qemu-system-x86:1630]
>
> [   17.995298] ack_APIC_irq: vector = d1, irq = 
> [  182.993828] ack_APIC_irq: vector = e1, irq = 
> [  202.919691] ack_APIC_irq: vector = 22, irq = 
> [  484.132006] NMI watchdog: BUG: soft lockup - CPU#1 stuck for 22s! 
> [qemu-system-x86:1586]
>
> [   15.592032] ack_APIC_irq: vector = d1, irq = 
> [  304.993490] ack_APIC_irq: vector = e1, irq = 
> [  315.174755] ack_APIC_irq: vector = 22, irq = 
> [  360.108007] NMI watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [ksmd:26]
.. snip snip ..

So yeah, that's VECTOR_UNDEFINED, and while it could happen as part of
irq setup, I'm not seeing that being something that your load should
trigger.

It could also obviously just be the vector being somehow corrupted,
either due to crazy hardware or software.

But quite frankly, the most likely reason is that whole irq vector movement.

Especially since it sounds from your other email that when you apply
Ingo's patches, the ack_APIC_irq warnings go away. Is that correct? Or
did you just grep for "move" in the messages?

> If you do get both movement messages (from Ingo's patch) _and_ the
ack_APIC_irq warnings (from mine), it would be interesting to see if
the vectors line up somehow..

 Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [debug PATCHes] Re: smp_call_function_single lockups

2015-03-31 Thread Chris J Arges
On Tue, Mar 31, 2015 at 12:56:56PM +0200, Ingo Molnar wrote:
> 
> * Linus Torvalds  wrote:
> 
> > Ok, interesting. So the whole "we try to do an APIC ACK with the ISR 
> > bit clear" seems to be a real issue.
> 
> It's interesting in particular when it happens with an edge-triggered 
> interrupt source: it's much harder to miss level triggered IRQs, which 
> stay around until actively handled. Edge triggered irqs are more 
> fragile to loss of event processing.
> 
> > > Anyway, maybe this sheds some more light on this issue. I can 
> > > reproduce this at will, so let me know of other experiments to do.
> 
> Btw., could you please describe (again) what your current best method 
> for reproduction is? It's been a long discussion ...
> 

Ingo,

To set this up, I've done the following on a Xeon E5620 / Xeon E312xx machine
( Although I've heard of others that have reproduced on other machines. )

1) Create an L1 KVM VM with 2 vCPUs (single vCPU case doesn't reproduce)
2) Create an L2 KVM VM inside the L1 VM with 1 vCPU
3) Add the following to the L1 cmdline:
nmi_watchdog=panic hung_task_panic=1 softlockup_panic=1 unknown_nmi_panic
3) Run something like 'stress -c 1 -m 1 -d 1 -t 1200' inside the L2 VM

Sometimes this is sufficient to reproduce the issue, I've observed that running
KSM in the L1 VM can agitate this issue (it calls native_flush_tlb_others).
If this doesn't reproduce then you can do the following:
4) Migrate the L2 vCPU randomly (via virsh vcpupin --live  OR tasksel) between
L1 vCPUs until the hang occurs.

I attempted to write a module that used smp_call_function_single calls to 
trigger IPIs but have been unable to create a more simple reproducer.

> > Somebody else who knows the apic needs to also take a look, but I'd 
> > love to hear what the actual hardware interrupt is (from that 
> > "vector_irq[vector]" thing above.
> > 
> > I'm not recognizing 0xe1 as any of the hardcoded SMP vectors (they 
> > are 0xf0-0xff), so it sounds like an external one. But that then 
> > requires the whole mapping table thing.
> > 
> > Ingo/Peter/Jiang - is there anything else useful we could print out? 
> > I worry about the irq movement code. Can we add printk's to when an 
> > irq is chasing from one CPU to another and doing that 
> > "move_in_progress" thing? I've always been scared of that code.
> 
> 1)
> 
> So the irq_cfg::move_in_progress field originates from:
> 
>   610142927b5b ("[PATCH] x86_64 irq: Safely cleanup an irq after moving it.")
> 
> and I agree that it looks fragile, so it would be nice to see how much 
> (if at all?) it gets used, by sticking a few pr_info()s in it.
> 
> Debug patch for the vector movement state machine attached below. 
> Untested.
> 
> 2)
> 
> But ... having taken a quick look at the vector movement handling, I 
> am scared more than ever, and I cannot convince myself that it's race 
> free. It's possibly harmless, but still, famous last words ...
> 
> For example, most ->move_in_progress transitions happen with the 
> vector_lock held - with the exception of send_cleanup_vector(): there 
> we are only holding the desc->lock, but that's not enough! For 
> example, this codepath:
> 
> static void __irq_complete_move(struct irq_cfg *cfg, unsigned vector)
> {
> unsigned me;
> 
> if (likely(!cfg->move_in_progress))
> return;
> 
> me = smp_processor_id();
> 
> if (vector == cfg->vector && cpumask_test_cpu(me, cfg->domain))
> send_cleanup_vector(cfg);
> 
> ...
> 
> void send_cleanup_vector(struct irq_cfg *cfg)
> {
> ...
> 
> cfg->move_in_progress = 0;
> }
> 
> is blatantly racy, unless I missed something obvious?
> 
> 2b)
> 
> Initially I thought this only affects slowpaths like driver unregister 
> or CPU hotplug, but I think this would be relevant for the fastpath of 
> edge triggered irqs as well:
> 
> void apic_ack_edge(struct irq_data *data)
> {
> irq_complete_move(irqd_cfg(data));
> irq_move_irq(data);
> ack_APIC_irq();
> }
> 
> and irq_complete_move() checks and clears cfg->move_in_progress as 
> listed above.
> 
> So in most scenarios this is probably harmless, because it can in the 
> worst case result in the missing of a ->move_in_progress flag setting.
> 
> But it does not look harmless in the apic_set_affinity() code path: 
> there we call into assign_irq_vector() without the desc->lock held, 
> due to this gem in the IRQ core:
> 
> int irq_set_affinity_hint(unsigned int irq, const struct cpumask *m)
> {
> unsigned long flags;
> struct irq_desc *desc = irq_get_desc_lock(irq, &flags, IRQ_GET_DESC_CHECK_GLOBAL);
> 
> if (!desc)
> return -EINVAL;
> desc->affinity_hint = m;
> irq_put_desc_unlock(desc, flags);
> /* set the initial affinity to prevent every interrupt being on CPU0 */
> if (m)
> __irq_set_affinity(irq, m, false);
> return 0;
> }
> 
> argh!
> 
> Now if this ever 
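
[Editor's note: an illustrative sketch of the kind of module Chris describes
above -- spraying smp_call_function_single() IPIs at the online CPUs in a
tight loop. Names are hypothetical; this is not the reproducer actually used,
and CPU-hotplug locking is elided for brevity.]

#include <linux/module.h>
#include <linux/kthread.h>
#include <linux/smp.h>

static struct task_struct *csd_thread;

static void csd_noop(void *info)
{
	/* Runs in the target CPU's IPI handler; does nothing. */
}

static int csd_hammer(void *unused)
{
	while (!kthread_should_stop()) {
		int cpu;

		/* wait=1 spins in csd_lock_wait() until the target acks: */
		for_each_online_cpu(cpu)
			smp_call_function_single(cpu, csd_noop, NULL, 1);
		cond_resched();
	}
	return 0;
}

static int __init csd_hammer_init(void)
{
	csd_thread = kthread_run(csd_hammer, NULL, "csd-hammer");
	return PTR_ERR_OR_ZERO(csd_thread);
}

static void __exit csd_hammer_exit(void)
{
	kthread_stop(csd_thread);
}

module_init(csd_hammer_init);
module_exit(csd_hammer_exit);
MODULE_LICENSE("GPL");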

Re: smp_call_function_single lockups

2015-03-31 Thread Chris J Arges
On Tue, Mar 31, 2015 at 08:08:40AM -0700, Linus Torvalds wrote:
> On Mon, Mar 30, 2015 at 8:15 PM, Chris J Arges
>  wrote:
> >
> > I modified the posted patch with the following:
> 
> Actually, in addition to Ingo's patches (and the irq printout), which
> you should try first, if none of that really gives any different
> behavior, can modify that ack_APIC_irq() debugging code a bit more:
> 
> > diff --git a/arch/x86/include/asm/apic.h b/arch/x86/include/asm/apic.h
> > index bf32309..dc3e192 100644
> > --- a/arch/x86/include/asm/apic.h
> > +++ b/arch/x86/include/asm/apic.h
> > @@ -441,7 +441,7 @@ static inline void ack_APIC_irq(int vector)
> > if (vector >= 16) {
> > unsigned v = apic_read(APIC_ISR + ((vector & ~0x1f) >> 1));
> > v >>= vector & 0x1f;
> > -   WARN_ON_ONCE(!(v & 1));
> > +   WARN(!(v & 1), "ack_APIC_irq: vector = %0x\n", vector);
> > }
> > /*
> >  * ack_APIC_irq() actually gets compiled as a single instruction
> 
> So what I'd suggest doing is:
> 
>  - change the test of "vector >= 16" to just "vector >= 0".
> 
>We still have "-1" as the "unknown vector" thing, but I think only
> the ack_bad_irq() thing calls it, and that should print out its own
> message if it ever triggers, so it isn't an issue.
> 
>The reason for the ">= 16" was kind of bogus - the first 16 vectors
> are system vectors, but we definitely shouldn't ack the apic for such
> vectors anyway, so giving a warning for them is very much appropriate.
> In particular, vector 2 is NMI, and maybe we do ACk it incorrectly.
> 
>  - add a "return" if the warning triggers, and simply don't do the
> actual ACK cycle at all if the ISR bit is clear.
> 
>IOW, make it do "if (WARN(..)) return;"
> 
> Now, we might get the vector number wrong for some reason, and in that
> case not ACK'ing at all might cause problems too, but it would be
> interesting to see if it changes behavior wrt the lockup.
> 
> I don't have any other ideas at the moment, but hopefully the
> suggested changes by me and Ingo will give some more data to go on and
> clarify what might be going on.
> 
>   Linus
> 

Linus,

I had a few runs with your patch plus modifications, and got the following
results (modified patch inlined below):

[   14.423916] ack_APIC_irq: vector = d1, irq = 
[  176.060005] NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [qemu-system-x86:1630]

[   17.995298] ack_APIC_irq: vector = d1, irq = 
[  182.993828] ack_APIC_irq: vector = e1, irq = 
[  202.919691] ack_APIC_irq: vector = 22, irq = 
[  484.132006] NMI watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [qemu-system-x86:1586]

[   15.592032] ack_APIC_irq: vector = d1, irq = 
[  304.993490] ack_APIC_irq: vector = e1, irq = 
[  315.174755] ack_APIC_irq: vector = 22, irq = 
[  360.108007] NMI watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [ksmd:26]

[   15.026077] ack_APIC_irq: vector = b1, irq = 
[  374.828531] ack_APIC_irq: vector = c1, irq = 
[  402.965942] ack_APIC_irq: vector = d1, irq = 
[  434.540814] ack_APIC_irq: vector = e1, irq = 
[  461.820768] ack_APIC_irq: vector = 22, irq = 
[  536.120027] NMI watchdog: BUG: soft lockup - CPU#0 stuck for 23s! [qemu-system-x86:4243]

[   17.889334] ack_APIC_irq: vector = d1, irq = 
[  291.888784] ack_APIC_irq: vector = e1, irq = 
[  297.824627] ack_APIC_irq: vector = 22, irq = 
[  336.960594] ack_APIC_irq: vector = 42, irq = 
[  367.012706] ack_APIC_irq: vector = 52, irq = 
[  377.025090] ack_APIC_irq: vector = 62, irq = 
[  417.088773] ack_APIC_irq: vector = 72, irq = 
[  447.136788] ack_APIC_irq: vector = 82, irq = 
-- stopped it since it wasn't reproducing / I was impatient --

So I'm seeing irq == VECTOR_UNDEFINED in all of these cases. Making
(vector >= 0) didn't seem to expose any additional vectors.

This was only instrumented at the L1 level, I'm also planning on instrumenting
the L0 kernel and seeing if we get these as well.

--chris j arges

--

diff --git a/arch/x86/include/asm/apic.h b/arch/x86/include/asm/apic.h
index efc3b22..88b3933 100644
--- a/arch/x86/include/asm/apic.h
+++ b/arch/x86/include/asm/apic.h
@@ -395,7 +395,7 @@ static inline void apic_write(u32 reg, u32 val)
apic->write(reg, val);
 }
 
-static inline void apic_eoi(void)
+static inline void apic_eoi(int vector)
 {
apic->eoi_write(APIC_EOI, APIC_EOI_ACK);
 }
@@ -426,7 +426,7 @@ extern void __init apic_set_eoi_write(void (*eoi_write)(u32 reg, u32 v));
 
 static inline u32 apic_read(u32 reg) { return 0; }
 static inline void apic_write(u32 reg, u32 val) { }
-static inline void apic_eoi(void) { }
+static inline void apic_eoi(int vector) { }
 static inline u64 apic_icr_read(void) { return 0; }
 static inline void apic_icr_write(u32 low, u32 high) { }
 static inline void 

Re: smp_call_function_single lockups

2015-03-31 Thread Linus Torvalds
On Mon, Mar 30, 2015 at 8:15 PM, Chris J Arges
 wrote:
>
> I modified the posted patch with the following:

Actually, in addition to Ingo's patches (and the irq printout), which
you should try first, if none of that really gives any different
behavior, can modify that ack_APIC_irq() debugging code a bit more:

> diff --git a/arch/x86/include/asm/apic.h b/arch/x86/include/asm/apic.h
> index bf32309..dc3e192 100644
> --- a/arch/x86/include/asm/apic.h
> +++ b/arch/x86/include/asm/apic.h
> @@ -441,7 +441,7 @@ static inline void ack_APIC_irq(int vector)
> if (vector >= 16) {
> unsigned v = apic_read(APIC_ISR + ((vector & ~0x1f) >> 1));
> v >>= vector & 0x1f;
> -   WARN_ON_ONCE(!(v & 1));
> +   WARN(!(v & 1), "ack_APIC_irq: vector = %0x\n", vector);
> }
> /*
>  * ack_APIC_irq() actually gets compiled as a single instruction

So what I'd suggest doing is:

 - change the test of "vector >= 16" to just "vector >= 0".

   We still have "-1" as the "unknown vector" thing, but I think only
the ack_bad_irq() thing calls it, and that should print out its own
message if it ever triggers, so it isn't an issue.

   The reason for the ">= 16" was kind of bogus - the first 16 vectors
are system vectors, but we definitely shouldn't ack the apic for such
vectors anyway, so giving a warning for them is very much appropriate.
In particular, vector 2 is NMI, and maybe we do ACk it incorrectly.

 - add a "return" if the warning triggers, and simply don't do the
actual ACK cycle at all if the ISR bit is clear.

   IOW, make it do "if (WARN(..)) return;"

Now, we might get the vector number wrong for some reason, and in that
case not ACK'ing at all might cause problems too, but it would be
interesting to see if it changes behavior wrt the lockup.

I don't have any other ideas at the moment, but hopefully the
suggested changes by me and Ingo will give some more data to go on and
clarify what might be going on.

  Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
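
[Editor's note: a sketch combining Linus's two suggestions above -- widen the
check to "vector >= 0" and skip the EOI when the ISR bit is clear -- on top of
the instrumented ack_APIC_irq(int vector)/apic_eoi(int vector) from Chris's
patch in this thread; not a mainline interface.]

static inline void ack_APIC_irq(int vector)
{
	if (vector >= 0) {
		/* Each APIC_ISR register covers 32 vectors, 0x10 apart: */
		unsigned v = apic_read(APIC_ISR + ((vector & ~0x1f) >> 1));

		v >>= vector & 0x1f;
		/* If the ISR bit was never set, warn and skip the EOI: */
		if (WARN(!(v & 1), "ack_APIC_irq: vector = %0x\n", vector))
			return;
	}
	/*
	 * ack_APIC_irq() actually gets compiled as a single instruction
	 * on most hardware:
	 */
	apic_eoi(vector);
}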


[debug PATCHes] Re: smp_call_function_single lockups

2015-03-31 Thread Ingo Molnar

* Linus Torvalds  wrote:

> Ok, interesting. So the whole "we try to do an APIC ACK with the ISR 
> bit clear" seems to be a real issue.

It's interesting in particular when it happens with an edge-triggered 
interrupt source: it's much harder to miss level triggered IRQs, which 
stay around until actively handled. Edge triggered irqs are more 
fragile to loss of event processing.

> > Anyway, maybe this sheds some more light on this issue. I can 
> > reproduce this at will, so let me know of other experiments to do.

Btw., could you please describe (again) what your current best method 
for reproduction is? It's been a long discussion ...

> Somebody else who knows the apic needs to also take a look, but I'd 
> love to hear what the actual hardware interrupt is (from that 
> "vector_irq[vector]" thing above.
> 
> I'm not recognizing 0xe1 as any of the hardcoded SMP vectors (they 
> are 0xf0-0xff), so it sounds like an external one. But that then 
> requires the whole mapping table thing.
> 
> Ingo/Peter/Jiang - is there anything else useful we could print out? 
> I worry about the irq movement code. Can we add printk's to when an 
> irq is chasing from one CPU to another and doing that 
> "move_in_progress" thing? I've always been scared of that code.

1)

So the irq_cfg::move_in_progress field originates from:

  610142927b5b ("[PATCH] x86_64 irq: Safely cleanup an irq after moving it.")

and I agree that it looks fragile, so it would be nice to see how much 
(if at all?) it gets used, by sticking a few pr_info()s in it.

Debug patch for the vector movement state machine attached below. 
Untested.

2)

But ... having taken a quick look at the vector movement handling, I 
am scared more than ever, and I cannot convince myself that it's race 
free. It's possibly harmless, but still, famous last words ...

For example, most ->move_in_progress transitions happen with the 
vector_lock held - with the exception of send_cleanup_vector(): there 
we are only holding the desc->lock, but that's not enough! For 
example, this codepath:

static void __irq_complete_move(struct irq_cfg *cfg, unsigned vector)
{
unsigned me;

if (likely(!cfg->move_in_progress))
return;

me = smp_processor_id();

if (vector == cfg->vector && cpumask_test_cpu(me, cfg->domain))
send_cleanup_vector(cfg);

...

void send_cleanup_vector(struct irq_cfg *cfg)
{
...

cfg->move_in_progress = 0;
}

is blatantly racy, unless I missed something obvious?

2b)

Initially I thought this only affects slowpaths like driver unregister 
or CPU hotplug, but I think this would be relevant for the fastpath of 
edge triggered irqs as well:

void apic_ack_edge(struct irq_data *data)
{
irq_complete_move(irqd_cfg(data));
irq_move_irq(data);
ack_APIC_irq();
}

and irq_complete_move() checks and clears cfg->move_in_progress as 
listed above.

So in most scenarios this is probably harmless, because it can in the 
worst case result in the missing of a ->move_in_progress flag setting.

But it does not look harmless in the apic_set_affinity() code path: 
there we call into assign_irq_vector() without the desc->lock held, 
due to this gem in the IRQ core:

int irq_set_affinity_hint(unsigned int irq, const struct cpumask *m)
{
unsigned long flags;
struct irq_desc *desc = irq_get_desc_lock(irq, &flags, IRQ_GET_DESC_CHECK_GLOBAL);

if (!desc)
return -EINVAL;
desc->affinity_hint = m;
irq_put_desc_unlock(desc, flags);
/* set the initial affinity to prevent every interrupt being on CPU0 */
if (m)
__irq_set_affinity(irq, m, false);
return 0;
}

argh!

Now if this ever crosses path with an assign_irq_vector() call on 
another CPU, then I really cannot see what would save us from deep 
trouble: we use cfg->vector while that is possibly being modified, 
right?

So I'd suggest to put a printk into irq_set_affinity_hint() as well, 
to make sure it's not used during this test. In particular:

drivers/virtio/virtio_pci_common.c: 
irq_set_affinity_hint(irq, NULL);
drivers/virtio/virtio_pci_common.c: 
irq_set_affinity_hint(irq, mask);

might be triggering it in the virtualization code...

The second patch below adds the relevant pr_info(). (untested)

Now this too might be unrelated, because the affinity hint was added 
ages ago, in:

e7a297b0d7d6 ("genirq: Add CPU mask affinity hint")

and the potentially racy nature of calling into set_affinity without 
the desc lock held was probably not realized back then.

VirtIO's use of the affinity-hint was added a while ago as well, four 
years ago, in:

  75a0a52be3c2 ("virtio: introduce an API to set affinity for a virtqueue")

2c)

The other thing that worries me here is that we apparently send an 
IRQ-move IPI while having an un-acked local APIC (!). I have vague 
memories that this was problematic and fragile before.
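
[Editor's note: an illustrative sketch, not a posted patch. One way to close
the send_cleanup_vector() race Ingo describes would be to clear
->move_in_progress under the same vector_lock that __assign_irq_vector()
holds; vector_lock is the existing raw spinlock in
arch/x86/kernel/apic/vector.c, redeclared here only to keep the fragment
self-contained. Whether that is sufficient, and deadlock-free in the
cleanup-IPI path, was exactly the open question in the thread.]

static DEFINE_RAW_SPINLOCK(vector_lock);

static void send_cleanup_vector_locked(struct irq_cfg *cfg)
{
	unsigned long flags;

	/* ... send IRQ_MOVE_CLEANUP_VECTOR to the CPUs in cfg->old_domain ... */

	raw_spin_lock_irqsave(&vector_lock, flags);
	cfg->move_in_progress = 0;
	raw_spin_unlock_irqrestore(&vector_lock, flags);
}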

[debug PATCHes] Re: smp_call_function_single lockups

2015-03-31 Thread Ingo Molnar

* Linus Torvalds torva...@linux-foundation.org wrote:

 Ok, interesting. So the whole we try to do an APIC ACK with the ISR 
 bit clear seems to be a real issue.

It's interesting in particular when it happens with an edge-triggered 
interrupt source: it's much harder to miss level triggered IRQs, which 
stay around until actively handled. Edge triggered irqs are more 
fragile to loss of event processing.

  Anyway, maybe this sheds some more light on this issue. I can 
  reproduce this at will, so let me know of other experiments to do.

Btw., could you please describe (again) what your current best method 
for reproduction is? It's been a long discussion ...

 Somebody else who knows the apic needs to also take a look, but I'd 
 love to hear what the actual hardware interrupt is (from that 
 vector_irq[vector] thing above.
 
 I'm not recognizing 0xe1 as any of the hardcoded SMP vectors (they 
 are 0xf0-0xff), so it sounds like an external one. But that then 
 requires the whole mapping table thing.
 
 Ingo/Peter/Jiang - is there anything else useful we could print out? 
 I worry about the irq movement code. Can we add printk's to when an 
 irq is chasing from one CPU to another and doing that 
 move_in_progress thing? I've always been scared of that code.

1)

So the irq_cfg::move_in_progress field originates from:

  610142927b5b ([PATCH] x86_64 irq: Safely cleanup an irq after moving it.)

and I agree that it looks fragile, so it would be nice to see how much 
(if at all?) it gets used, by sticking a few pr_info()s in it.

Debug patch for the vector movement state machine attached below. 
Untested.

2)

But ... having taken a quick look at the vector movement handling, I 
am scared more than ever,and I cannot convince myself that it's race 
free. It's possibly harmless, but still, famous last words ...

For example, most -move_in_progress transitions happen with the 
vector_lock held - with the exception of send_cleanup_vector(): there 
we are only holding the desc-lock, but that's not enough! For 
example, this codepath:

static void __irq_complete_move(struct irq_cfg *cfg, unsigned vector)
{
unsigned me;

if (likely(!cfg-move_in_progress))
return;

me = smp_processor_id();

if (vector == cfg-vector  cpumask_test_cpu(me, cfg-domain))
send_cleanup_vector(cfg);

...

void send_cleanup_vector(struct irq_cfg *cfg)
{
...

cfg-move_in_progress = 0;
}

is blatantly racy, unless I missed something obvious?

2b)

Initially I thought this only affects slowpaths like driver unregister 
or CPU hotplug, but I think this would be relevant for the fastpath of 
edge triggered irqs as well:

void apic_ack_edge(struct irq_data *data)
{
irq_complete_move(irqd_cfg(data));
irq_move_irq(data);
ack_APIC_irq();
}

and irq_complete_move() checks and clears cfg-move_in_progress as 
listed above.

So in most scenarios this is probably harmless, because it can in the 
worst case result in the missing of a -move_in_progress flag setting.

But it does not look harmless in the apic_set_affinity() code path: 
there we call into assign_irq_vector() without the desc-lock held, 
due to this gem in the IRQ core:

int irq_set_affinity_hint(unsigned int irq, const struct cpumask *m)
{
unsigned long flags;
struct irq_desc *desc = irq_get_desc_lock(irq, flags, 
IRQ_GET_DESC_CHECK_GLOBAL);

if (!desc)
return -EINVAL;
desc-affinity_hint = m;
irq_put_desc_unlock(desc, flags);
/* set the initial affinity to prevent every interrupt being on CPU0 */
if (m)
__irq_set_affinity(irq, m, false);
return 0;
}

argh!

Now if this ever crosses path with an assign_irq_vector() call on 
another CPU, then I really cannot see what would save us from deep 
trouble: we use cfg-vector while that is possibly being modified, 
right?

So I'd suggest to put a printk into irq_set_affinity_hint() as well, 
to make sure it's not used during this test. In particular:

drivers/virtio/virtio_pci_common.c: 
irq_set_affinity_hint(irq, NULL);
drivers/virtio/virtio_pci_common.c: 
irq_set_affinity_hint(irq, mask);

might be triggering it in the virtualization code...

The second patch below adds the relevant pr_info(). (untested)

Now this too might be unrelated, because the affinity hint was added 
ages ago, in:

e7a297b0d7d6 (genirq: Add CPU mask affinity hint)

and the potentially racy nature of calling into set_affinity without 
the desc lock held was probably not realized back then.

VirtIO's use of the affinity-hint was added a while ago as well, four 
years ago, in:

  75a0a52be3c2 (virtio: introduce an API to set affinity for a virtqueue)

2c)

The other thing that worries me here is that we apparently send an 
IRQ-move IPI while having an un-acked local APIC (!). I have vague 
memories that this was problematic and fragile before.


Re: [debug PATCHes] Re: smp_call_function_single lockups

2015-03-31 Thread Daniel J Blueman
On Wednesday, April 1, 2015 at 6:40:06 AM UTC+8, Chris J Arges wrote:
 On Tue, Mar 31, 2015 at 12:56:56PM +0200, Ingo Molnar wrote:
 
  * Linus Torvalds torva...@linux-foundation.org wrote:
 
   Ok, interesting. So the whole "we try to do an APIC ACK with the ISR
   bit clear" seems to be a real issue.
 
  It's interesting in particular when it happens with an edge-triggered
  interrupt source: it's much harder to miss level triggered IRQs, which
  stay around until actively handled. Edge triggered irqs are more
  fragile to loss of event processing.
 
Anyway, maybe this sheds some more light on this issue. I can
reproduce this at will, so let me know of other experiments to do.
 
  Btw., could you please describe (again) what your current best method
  for reproduction is? It's been a long discussion ...
 

 Ingo,

 To set this up, I've done the following on a Xeon E5620 / Xeon E312xx machine
 ( Although I've heard of others that have reproduced on other machines. )

 1) Create an L1 KVM VM with 2 vCPUs (single vCPU case doesn't reproduce)
 2) Create an L2 KVM VM inside the L1 VM with 1 vCPU
 3) Add the following to the L1 cmdline:
 nmi_watchdog=panic hung_task_panic=1 softlockup_panic=1 unknown_nmi_panic
 4) Run something like 'stress -c 1 -m 1 -d 1 -t 1200' inside the L2 VM

 Sometimes this is sufficient to reproduce the issue; I've observed that running
 KSM in the L1 VM can agitate this issue (it calls native_flush_tlb_others).
 If this doesn't reproduce then you can do the following:
 5) Migrate the L2 vCPU randomly (via virsh vcpupin --live OR taskset) between
 L1 vCPUs until the hang occurs.

 I attempted to write a module that used smp_call_function_single calls to
 trigger IPIs but have been unable to create a simpler reproducer.

A non-intrusive way of generating a lot of IPIs, is calling
stop_machine() via something like:

while :; do
echo base=0x200 size=0x800 type=write-back >/proc/mtrr
echo disable=4 >| /proc/mtrr
done

Of course, ensure base is above DRAM and any 64-bit MMIO for no
side-effects and ensure it'll be entry 4. Onlining and offlining cores
in parallel will generate IPIs also.

Dan
-- 
Daniel J Blueman


Re: smp_call_function_single lockups

2015-03-31 Thread Linus Torvalds
On Mon, Mar 30, 2015 at 8:15 PM, Chris J Arges
chris.j.ar...@canonical.com wrote:

 I modified the posted patch with the following:

Actually, in addition to Ingo's patches (and the irq printout), which
you should try first, if none of that really gives any different
behavior, you can modify that ack_APIC_irq() debugging code a bit more:

 diff --git a/arch/x86/include/asm/apic.h b/arch/x86/include/asm/apic.h
 index bf32309..dc3e192 100644
 --- a/arch/x86/include/asm/apic.h
 +++ b/arch/x86/include/asm/apic.h
 @@ -441,7 +441,7 @@ static inline void ack_APIC_irq(int vector)
 if (vector >= 16) {
 unsigned v = apic_read(APIC_ISR + ((vector & ~0x1f) >> 1));
 v >>= vector & 0x1f;
 -   WARN_ON_ONCE(!(v & 1));
 +   WARN(!(v & 1), "ack_APIC_irq: vector = %0x\n", vector);
 }
 /*
  * ack_APIC_irq() actually gets compiled as a single instruction

So what I'd suggest doing is:

 - change the test of "vector >= 16" to just "vector >= 0".

   We still have -1 as the unknown vector thing, but I think only
the ack_bad_irq() thing calls it, and that should print out its own
message if it ever triggers, so it isn't an issue.

   The reason for the ">= 16" was kind of bogus - the first 16 vectors
are system vectors, but we definitely shouldn't ack the apic for such
vectors anyway, so giving a warning for them is very much appropriate.
In particular, vector 2 is NMI, and maybe we do ACK it incorrectly.

 - add a return if the warning triggers, and simply don't do the
actual ACK cycle at all if the ISR bit is clear.

   IOW, make it do "if (WARN(..)) return;"

Now, we might get the vector number wrong for some reason, and in that
case not ACK'ing at all might cause problems too, but it would be
interesting to see if it changes behavior wrt the lockup.
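Putting the two suggestions together, the debug helper would end up
roughly like this (a sketch, untested):

static inline void ack_APIC_irq(int vector)
{
	if (vector >= 0) {
		unsigned v = apic_read(APIC_ISR + ((vector & ~0x1f) >> 1));

		v >>= vector & 0x1f;
		/* Skip the EOI write entirely if the ISR bit is clear: */
		if (WARN(!(v & 1), "ack_APIC_irq: vector = %0x\n", vector))
			return;
	}
	apic_eoi();
}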

I don't have any other ideas at the moment, but hopefully the
suggested changes by me and Ingo will give some more data to go on and
clarify what might be going on.

  Linus


Re: smp_call_function_single lockups

2015-03-31 Thread Chris J Arges
On Tue, Mar 31, 2015 at 08:08:40AM -0700, Linus Torvalds wrote:
 On Mon, Mar 30, 2015 at 8:15 PM, Chris J Arges
 chris.j.ar...@canonical.com wrote:
 
  I modified the posted patch with the following:
 
 Actually, in addition to Ingo's patches (and the irq printout), which
 you should try first, if none of that really gives any different
  behavior, you can modify that ack_APIC_irq() debugging code a bit more:
 
  diff --git a/arch/x86/include/asm/apic.h b/arch/x86/include/asm/apic.h
  index bf32309..dc3e192 100644
  --- a/arch/x86/include/asm/apic.h
  +++ b/arch/x86/include/asm/apic.h
  @@ -441,7 +441,7 @@ static inline void ack_APIC_irq(int vector)
  if (vector >= 16) {
  unsigned v = apic_read(APIC_ISR + ((vector & ~0x1f) >> 1));
  v >>= vector & 0x1f;
  -   WARN_ON_ONCE(!(v & 1));
  +   WARN(!(v & 1), "ack_APIC_irq: vector = %0x\n", vector);
  }
  /*
   * ack_APIC_irq() actually gets compiled as a single instruction
 
 So what I'd suggest doing is:
 
  - change the test of "vector >= 16" to just "vector >= 0".
 
We still have -1 as the unknown vector thing, but I think only
 the ack_bad_irq() thing calls it, and that should print out its own
 message if it ever triggers, so it isn't an issue.
 
    The reason for the ">= 16" was kind of bogus - the first 16 vectors
 are system vectors, but we definitely shouldn't ack the apic for such
 vectors anyway, so giving a warning for them is very much appropriate.
 In particular, vector 2 is NMI, and maybe we do ACK it incorrectly.
 
  - add a return if the warning triggers, and simply don't do the
 actual ACK cycle at all if the ISR bit is clear.
 
    IOW, make it do "if (WARN(..)) return;"
 
 Now, we might get the vector number wrong for some reason, and in that
 case not ACK'ing at all might cause problems too, but it would be
 interesting to see if it changes behavior wrt the lockup.
 
 I don't have any other ideas at the moment, but hopefully the
 suggested changes by me and Ingo will give some more data to go on and
 clarify what might be going on.
 
   Linus
 

Linus,

I had a few runs with your patch plus modifications, and got the following
results (modified patch inlined below):

[   14.423916] ack_APIC_irq: vector = d1, irq = 
[  176.060005] NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! 
[qemu-system-x86:1630]

[   17.995298] ack_APIC_irq: vector = d1, irq = 
[  182.993828] ack_APIC_irq: vector = e1, irq = 
[  202.919691] ack_APIC_irq: vector = 22, irq = 
[  484.132006] NMI watchdog: BUG: soft lockup - CPU#1 stuck for 22s! 
[qemu-system-x86:1586]

[   15.592032] ack_APIC_irq: vector = d1, irq = 
[  304.993490] ack_APIC_irq: vector = e1, irq = 
[  315.174755] ack_APIC_irq: vector = 22, irq = 
[  360.108007] NMI watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [ksmd:26]

[   15.026077] ack_APIC_irq: vector = b1, irq = 
[  374.828531] ack_APIC_irq: vector = c1, irq = 
[  402.965942] ack_APIC_irq: vector = d1, irq = 
[  434.540814] ack_APIC_irq: vector = e1, irq = 
[  461.820768] ack_APIC_irq: vector = 22, irq = 
[  536.120027] NMI watchdog: BUG: soft lockup - CPU#0 stuck for 23s! 
[qemu-system-x86:4243]

[   17.889334] ack_APIC_irq: vector = d1, irq = 
[  291.888784] ack_APIC_irq: vector = e1, irq = 
[  297.824627] ack_APIC_irq: vector = 22, irq = 
[  336.960594] ack_APIC_irq: vector = 42, irq = 
[  367.012706] ack_APIC_irq: vector = 52, irq = 
[  377.025090] ack_APIC_irq: vector = 62, irq = 
[  417.088773] ack_APIC_irq: vector = 72, irq = 
[  447.136788] ack_APIC_irq: vector = 82, irq = 
-- stopped it since it wasn't reproducing / I was impatient --

So I'm seeing irq == VECTOR_UNDEFINED in all of these cases. Making
(vector >= 0) didn't seem to expose any additional vectors.

This was only instrumented at the L1 level; I'm also planning on instrumenting
the L0 kernel and seeing if we get these as well.

--chris j arges

--

diff --git a/arch/x86/include/asm/apic.h b/arch/x86/include/asm/apic.h
index efc3b22..88b3933 100644
--- a/arch/x86/include/asm/apic.h
+++ b/arch/x86/include/asm/apic.h
@@ -395,7 +395,7 @@ static inline void apic_write(u32 reg, u32 val)
 	apic->write(reg, val);
 }
 
-static inline void apic_eoi(void)
+static inline void apic_eoi(int vector)
 {
 	apic->eoi_write(APIC_EOI, APIC_EOI_ACK);
 }
@@ -426,7 +426,7 @@ extern void __init apic_set_eoi_write(void (*eoi_write)(u32 reg, u32 v));
 
 static inline u32 apic_read(u32 reg) { return 0; }
 static inline void apic_write(u32 reg, u32 val) { }
-static inline void apic_eoi(void) { }
+static inline void apic_eoi(int vector) { }
 static inline u64 apic_icr_read(void) { return 0; }
 static inline void apic_icr_write(u32 low, u32 high) { }
 static inline void apic_wait_icr_idle(void) { }
@@ -435,13 +435,26 @@ static inline void 

Re: [debug PATCHes] Re: smp_call_function_single lockups

2015-03-31 Thread Chris J Arges
On Tue, Mar 31, 2015 at 12:56:56PM +0200, Ingo Molnar wrote:
 
 * Linus Torvalds torva...@linux-foundation.org wrote:
 
  Ok, interesting. So the whole "we try to do an APIC ACK with the ISR 
  bit clear" seems to be a real issue.
 
 It's interesting in particular when it happens with an edge-triggered 
 interrupt source: it's much harder to miss level triggered IRQs, which 
 stay around until actively handled. Edge triggered irqs are more 
 fragile to loss of event processing.
 
   Anyway, maybe this sheds some more light on this issue. I can 
   reproduce this at will, so let me know of other experiments to do.
 
 Btw., could you please describe (again) what your current best method 
 for reproduction is? It's been a long discussion ...
 

Ingo,

To set this up, I've done the following on a Xeon E5620 / Xeon E312xx machine
( Although I've heard of others that have reproduced on other machines. )

1) Create an L1 KVM VM with 2 vCPUs (single vCPU case doesn't reproduce)
2) Create an L2 KVM VM inside the L1 VM with 1 vCPU
3) Add the following to the L1 cmdline:
nmi_watchdog=panic hung_task_panic=1 softlockup_panic=1 unknown_nmi_panic
4) Run something like 'stress -c 1 -m 1 -d 1 -t 1200' inside the L2 VM

Sometimes this is sufficient to reproduce the issue; I've observed that running
KSM in the L1 VM can agitate this issue (it calls native_flush_tlb_others).
If this doesn't reproduce then you can do the following:
5) Migrate the L2 vCPU randomly (via virsh vcpupin --live OR taskset) between
L1 vCPUs until the hang occurs.

I attempted to write a module that used smp_call_function_single calls to 
trigger IPIs but have been unable to create a simpler reproducer.
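(For reference, a minimal sketch of such a module - hypothetical and
untested, all names invented - might look like this; it just hammers the
online CPUs with synchronous cross-CPU calls from a kernel thread:)

#include <linux/module.h>
#include <linux/kthread.h>
#include <linux/sched.h>
#include <linux/smp.h>
#include <linux/err.h>

static struct task_struct *ipi_task;

static void ipi_noop(void *info)
{
	/* Runs on the target CPU in interrupt context; do nothing. */
}

static int ipi_stress_fn(void *data)
{
	int cpu = 0;

	while (!kthread_should_stop()) {
		/* Round-robin synchronous IPIs over the online CPUs. */
		cpu = cpumask_next(cpu, cpu_online_mask);
		if (cpu >= nr_cpu_ids)
			cpu = cpumask_first(cpu_online_mask);
		smp_call_function_single(cpu, ipi_noop, NULL, 1);
		cond_resched();
	}
	return 0;
}

static int __init ipi_stress_init(void)
{
	ipi_task = kthread_run(ipi_stress_fn, NULL, "ipi_stress");
	return PTR_ERR_OR_ZERO(ipi_task);
}

static void __exit ipi_stress_exit(void)
{
	kthread_stop(ipi_task);
}

module_init(ipi_stress_init);
module_exit(ipi_stress_exit);
MODULE_LICENSE("GPL");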

  Somebody else who knows the apic needs to also take a look, but I'd 
  love to hear what the actual hardware interrupt is (from that 
  vector_irq[vector] thing above.
  
  I'm not recognizing 0xe1 as any of the hardcoded SMP vectors (they 
  are 0xf0-0xff), so it sounds like an external one. But that then 
  requires the whole mapping table thing.
  
  Ingo/Peter/Jiang - is there anything else useful we could print out? 
  I worry about the irq movement code. Can we add printk's to when an 
  irq is chasing from one CPU to another and doing that 
  move_in_progress thing? I've always been scared of that code.
 
 1)
 
 So the irq_cfg::move_in_progress field originates from:
 
   610142927b5b ([PATCH] x86_64 irq: Safely cleanup an irq after moving it.)
 
 and I agree that it looks fragile, so it would be nice to see how much 
 (if at all?) it gets used, by sticking a few pr_info()s in it.
 
 Debug patch for the vector movement state machine attached below. 
 Untested.
 
 2)
 
 But ... having taken a quick look at the vector movement handling, I 
 am scared more than ever, and I cannot convince myself that it's race 
 free. It's possibly harmless, but still, famous last words ...
 
 For example, most ->move_in_progress transitions happen with the 
 vector_lock held - with the exception of send_cleanup_vector(): there 
 we are only holding the desc->lock, but that's not enough! For 
 example, this codepath:
 
 static void __irq_complete_move(struct irq_cfg *cfg, unsigned vector)
 {
 	unsigned me;
 
 	if (likely(!cfg->move_in_progress))
 		return;
 
 	me = smp_processor_id();
 
 	if (vector == cfg->vector && cpumask_test_cpu(me, cfg->domain))
 		send_cleanup_vector(cfg);
 
 ...
 
 void send_cleanup_vector(struct irq_cfg *cfg)
 {
 	...
 
 	cfg->move_in_progress = 0;
 }
 
 is blatantly racy, unless I missed something obvious?
 
 2b)
 
 Initially I thought this only affects slowpaths like driver unregister 
 or CPU hotplug, but I think this would be relevant for the fastpath of 
 edge triggered irqs as well:
 
 void apic_ack_edge(struct irq_data *data)
 {
 irq_complete_move(irqd_cfg(data));
 irq_move_irq(data);
 ack_APIC_irq();
 }
 
 and irq_complete_move() checks and clears cfg->move_in_progress as 
 listed above.
 
 So in most scenarios this is probably harmless, because it can in the 
 worst case result in a missed ->move_in_progress flag setting.
 
 But it does not look harmless in the apic_set_affinity() code path: 
 there we call into assign_irq_vector() without the desc->lock held, 
 due to this gem in the IRQ core:
 
 int irq_set_affinity_hint(unsigned int irq, const struct cpumask *m)
 {
 	unsigned long flags;
 	struct irq_desc *desc = irq_get_desc_lock(irq, &flags,
 						  IRQ_GET_DESC_CHECK_GLOBAL);
 
 	if (!desc)
 		return -EINVAL;
 	desc->affinity_hint = m;
 	irq_put_desc_unlock(desc, flags);
 	/* set the initial affinity to prevent every interrupt being on CPU0 */
 	if (m)
 		__irq_set_affinity(irq, m, false);
 	return 0;
 }
 
 argh!
 
 Now if this ever crosses paths with an assign_irq_vector() call on 
 another CPU, then I really cannot see what would save us from deep 
 trouble: 

Re: smp_call_function_single lockups

2015-03-31 Thread Linus Torvalds
On Tue, Mar 31, 2015 at 3:23 PM, Chris J Arges
chris.j.ar...@canonical.com wrote:

 I had a few runs with your patch plus modifications, and got the following
 results (modified patch inlined below):

Ok, thanks.

 [   14.423916] ack_APIC_irq: vector = d1, irq = 
 [  176.060005] NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! 
 [qemu-system-x86:1630]

 [   17.995298] ack_APIC_irq: vector = d1, irq = 
 [  182.993828] ack_APIC_irq: vector = e1, irq = 
 [  202.919691] ack_APIC_irq: vector = 22, irq = 
 [  484.132006] NMI watchdog: BUG: soft lockup - CPU#1 stuck for 22s! 
 [qemu-system-x86:1586]

 [   15.592032] ack_APIC_irq: vector = d1, irq = 
 [  304.993490] ack_APIC_irq: vector = e1, irq = 
 [  315.174755] ack_APIC_irq: vector = 22, irq = 
 [  360.108007] NMI watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [ksmd:26]
.. snip snip ..

So yeah, that's VECTOR_UNDEFINED, and while it could happen as part of
irq setup, I'm not seeing that being something that your load should
trigger.

It could also obviously just be the vector being somehow corrupted,
either due to crazy hardware or software.

But quite frankly, the most likely reason is that whole irq vector movement.

Especially since it sounds from your other email that when you apply
Ingo's patches, the ack_APIC_irq warnings go away. Is that correct? Or
did you just grep for "move" in the messages?

If you do get both movement messages (from Ingo's patch) _and_ the
ack_APIC_irq warnings (from mine), it would be interesting to see if
the vectors line up somehow..

 Linus


Re: smp_call_function_single lockups

2015-03-30 Thread Linus Torvalds
On Mon, Mar 30, 2015 at 8:15 PM, Chris J Arges
 wrote:
> [   13.613531] WARNING: CPU: 0 PID: 0 at ./arch/x86/include/asm/apic.h:444 
> apic_ack_edge+0x84/0x90()
> [   13.613531]  [8104d3f4] apic_ack_edge+0x84/0x90
> [   13.613531]  [810cf8e7] handle_edge_irq+0x57/0x120
> [   13.613531]  [81016aa2] handle_irq+0x22/0x40
> [   13.613531]  [817a3b9f] do_IRQ+0x4f/0x140
> [   13.613531]  [817a196d] common_interrupt+0x6d/0x6d
> [   13.613531]  EOI  [810def08] ? hrtimer_start+0x18/0x20
> [   13.613531]  [8105a356] ? native_safe_halt+0x6/0x10
> [   13.613531]  [810d5623] ? rcu_eqs_enter+0xa3/0xb0
> [   13.613531]  [8101ecde] default_idle+0x1e/0xc0

Hmm. I didn't notice that "hrtimer_start" was always there as a stale
entry on the stack when this happened.

That may well be immaterial - the CPU being idle means that the last
thing it did before going to sleep was likely that "start timer"
thing, but it's interesting even so.

Some issue with reprogramming the hrtimer as it is triggering, kind of
similar to the bootup case I saw where the keyboard init sequence
raises an interrupt that was already cleared by the time the interrupt
happened.

So maybe something like this happens:

 - local timer is about to go off and raises the interrupt line

 - in the meantime, we're reprogramming the timer into the future

 - the CPU takes the interrupt, but now the timer has been
reprogrammed, so the irq line is no longer active, and ISR is zero even
though we took the interrupt (which is why the new warning triggers)

  - we're running the local timer interrupt (which happened due to the
*old* programmed value), but we do something wrong because when we
read the timer state, we see the *new* programmed value and so we
think that it's the new timer that triggered.

I dunno. I don't see why we'd lock up, but DaveJ's old lockup had
several signs that it seemed to be timer-related.

It would be interesting to see the actual irq number. Maybe this has
nothing what-so-ever to do with the hrtimer.

  Linus


Re: smp_call_function_single lockups

2015-03-30 Thread Linus Torvalds
On Mon, Mar 30, 2015 at 8:15 PM, Chris J Arges
 wrote:
>
> I've been able to repro with your patch and observed the WARN_ON when booting a
> VM on affected hardware and non affected hardware:

Ok, interesting. So the whole "we try to do an APIC ACK with the ISR
bit clear" seems to be a real issue.

> I modified the posted patch with the following:
> -   WARN_ON_ONCE(!(v & 1));
> +   WARN(!(v & 1), "ack_APIC_irq: vector = %0x\n", vector);

Yes, makes sense, although I'm not sure what the vector translations
end up being. See below.

> And it showed vector = 1b when booting. However, when I run the reproducer on
> an affected machine I get the following WARNs before the hang:

Ok, so the boot-time thing I think happens because a device irq
happens but goes away immediately because the CPU that triggers it
also clears it immediately in the device initialization code, so it's
a level-triggered interrupt that goes away "on its own".

But vector 0x1b seems odd. I thought we mapped external interrupts to
0x20+ (FIRST_EXTERNAL_VECTOR). Ingo/Peter? Is there any sane interface
to look up the percpu apic vector data?

Chris, since this is repeatable for you, can you do

int irq;
irq = __this_cpu_read(vector_irq[vector]);

and print that out too? That *should* show the actual hardware irq,
although there are a few magic cases too (-1/-2 mean special things)
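Concretely, folding it into the warning would look roughly like this (a
sketch, untested):

	int irq = __this_cpu_read(vector_irq[vector]);

	WARN(!(v & 1), "ack_APIC_irq: vector = %0x, irq = %i\n", vector, irq);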

But the fact that you get the warning before the hang is much more interesting.

> [   36.301299] WARNING: CPU: 0 PID: 0 at ./arch/x86/include/asm/apic.h:444 
> apic_ack_edge+0x93/0xa0()
> [   36.301301] ack_APIC_irq: vector = e1

Is this repeatable? Does it happen before *every* hang, or at least
often enough to be a good pattern?

> [   40.430533] ack_APIC_irq: vector = 22
>
> So vector = e1 then 22 before the hang.

Is it always the same ones? I assume that on different machines the
vector allocations would be different, but is it consistent on any
particular machine? That's assuming the whole warning is consistent at
all before the hang, of course.

> Anyway, maybe this sheds some more light on this issue. I can reproduce this at
> will, so let me know of other experiments to do.

Somebody else who knows the apic needs to also take a look, but I'd
love to hear what the actual hardware interrupt is (from that
"vector_irq[vector]" thing above.

I'm not recognizing 0xe1 as any of the hardcoded SMP vectors (they are
0xf0-0xff), so it sounds like an external one. But that then requires
the whole mapping table thing.

Ingo/Peter/Jiang - is there anything else useful we could print out? I
worry about the irq movement code. Can we add printk's to when an irq
is chasing from one CPU to another and doing that "move_in_progress"
thing? I've always been scared of that code.

  Linus


Re: smp_call_function_single lockups

2015-03-30 Thread Chris J Arges
On Thu, Feb 19, 2015 at 02:45:54PM -0800, Linus Torvalds wrote:
> On Thu, Feb 19, 2015 at 1:59 PM, Linus Torvalds
>  wrote:
> >
> > Is this worth looking at? Or is it something spurious? I might have
> > gotten the vectors wrong, and maybe the warning is not because the ISR
> > bit isn't set, but because I test the wrong bit.
> 
> I edited the patch to do ratelimiting (one per 10s max) rather than
> "once". And tested it some more. It seems to work correctly. The irq
> case during 8042 probing is not repeatable, and I suspect it happens
> because the interrupt source goes away (some probe-time thing that
> first triggers an interrupt, but then clears it itself), so it doesn't
> happen every boot, and I've gotten it with slightly different
> backtraces.
> 
> But it's the only warning that happens for me, so I think my code is
> right (at least for the cases that trigger on this machine). It's
> definitely not a "every interrupt causes the warning because the code
> was buggy, and the WARN_ONCE() just printed the first one".
> 
> It would be interesting to hear if others see spurious APIC EOI cases
> too. In particular, the people seeing the IPI lockup. Because a lot of
> the lockups we've seen have *looked* like the IPI interrupt just never
> happened, and so we're waiting forever for the target CPU to react to
> it. And just maybe the spurious EOI could cause the wrong bit to be
> cleared in the ISR, and then the interrupt never shows up. Something
> like that would certainly explain why it only happens on some machines
> and under certain timing circumstances.
> 
> Linus
> 

Linus,

I'm able to reproduce this IPI lockup easily now when using specific hardware
and nested KVM VMs. However, it seems to only occur when using certain host
hardware. For example:
- Xeon E5620 (Westmere-EP) (fam: 06, model: 2c)
- Xeon E312xx (Sandy Bridge) (fam: 06, model: 2a)
Now I'm not sure if this indicates hardware LAPIC issues or if the timing in
these processors makes it more likely to hit this issue. So far I haven't seen
the issue on other CPUs.

To set this up, I've done the following (L0 being the Xeon box):
1) Create an L1 KVM VM with 2 vCPUs (single vCPU case doesn't reproduce)
2) Create an L2 KVM VM inside the L1 VM with 1 vCPU
3) Run something like 'stress -c 1 -m 1 -d 1 -t 1200' inside the L2 VM

Sometimes this is sufficient to reproduce the issue; I've observed that running
KSM in the L1 VM can agitate this issue (it calls native_flush_tlb_others).
If this doesn't reproduce then you can do the following:
4) Migrate the L2 vCPU randomly (via virsh vcpupin --live OR taskset) between
L1 vCPUs until the hang occurs.

Pinning L1 vCPUs to L0 initially (i.e. 0->0 1->1) causes the hangs to not occur
in my testing. (Which makes sense because we're unlikely to do real IPIs.)

I've been able to repro with your patch and observed the WARN_ON when booting a
VM on affected hardware and non affected hardware:

[   13.613531] [ cut here ]
[   13.613531] WARNING: CPU: 0 PID: 0 at ./arch/x86/include/asm/apic.h:444 
apic_ack_edge+0x84/0x90()
[   13.613531] Modules linked in: ipt_MASQUERADE(E) nf_nat_masquerade_ipv4(E) 
iptable_nat(E) nf_nat_ipv4(E) nf_nat(E) nf_conntrack_ipv4(E) nf_defrag_ipv4(E) 
xt_conntrack(E) nf_conntrack(E) ipt_REJECT(E) nf_reject_ipv4(E) xt_CHECKSUM(E) 
iptable_mangle(E) xt_tcpudp(E) bridge(E) stp(E) llc(E) ip6table_filter(E) 
ip6_tables(E) iptable_filter(E) ip_tables(E) ebtable_nat(E) ebtables(E) 
x_tables(E) dm_crypt(E) ppdev(E) kvm_intel(E) kvm(E) serio_raw(E) i2c_piix4(E) 
parport_pc(E) pvpanic(E) parport(E) mac_hid(E) nls_utf8(E) isofs(E) psmouse(E) 
floppy(E) cirrus(E) syscopyarea(E) sysfillrect(E) sysimgblt(E) ttm(E) 
drm_kms_helper(E) drm(E) pata_acpi(E)
[   13.613531] CPU: 0 PID: 0 Comm: swapper/0 Tainted: GE   
4.0.0-rc6-460f8calinus1 #4
[   13.613531] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
Bochs 01/01/2011
[   13.613531]  81a911a6 88013fc03eb8 817999a7 
0007
[   13.613531]   88013fc03ef8 810734ca 
0092
[   13.613531]  0011 8800b8bbac00 001f 
00b1
[   13.613531] Call Trace:
[   13.613531]  IRQ  [817999a7] dump_stack+0x45/0x57
[   13.613531]  [810734ca] warn_slowpath_common+0x8a/0xc0
[   13.613531]  [810735ba] warn_slowpath_null+0x1a/0x20
[   13.613531]  [8104d3f4] apic_ack_edge+0x84/0x90
[   13.613531]  [810cf8e7] handle_edge_irq+0x57/0x120
[   13.613531]  [81016aa2] handle_irq+0x22/0x40
[   13.613531]  [817a3b9f] do_IRQ+0x4f/0x140
[   13.613531]  [817a196d] common_interrupt+0x6d/0x6d
[   13.613531]  EOI  [810def08] ? hrtimer_start+0x18/0x20
[   13.613531]  [8105a356] ? native_safe_halt+0x6/0x10
[   13.613531]  [810d5623] ? rcu_eqs_enter+0xa3/0xb0
[   13.613531]  [8101ecde] default_idle+0x1e/0xc0
[   13.613531]  [] arch_cpu_idle+0xf/0x20
[   13.613531]  [] cpu_startup_entry+0x2ff/0x420
[   13.613531]  [] rest_init+0x77/0x80
[   13.613531]  [] start_kernel+0x43c/0x449
[   13.613531]  [] ? 


Re: smp_call_function_single lockups

2015-03-20 Thread Mike Galbraith
On Fri, 2015-03-20 at 09:26 -0700, Linus Torvalds wrote:
> On Fri, Mar 20, 2015 at 3:15 AM, Peter Zijlstra  wrote:
> >
> > Linus, any plans for this patch? I think it does solve a fair few issues
> > the current code has.
> 
> So I didn't really have any plans. I think it's a good patch, and it
> might even fix a bug and improve code generation, but it didn't fix
> the lockup problem, so..
> 
> > do you want me to route it through tip (and write a changelog while I'm
> > at it) or will you be taking it yourself?
> 
> If you think it's worth committing, then go ahead.  Add my signed-off,
> although I have to note that I didn't do a lot of testing of that
> patch (I think it was compile-tested when I sent it out, and then I
> ended up running it for a while, and obviously it got *some* testing
> by the person who saw the lockup that just showed that it didn't fix
> the problem ;(

I carried it for a while on boxen big and small, so it got some burn-in.

-Mike



Re: smp_call_function_single lockups

2015-03-20 Thread Linus Torvalds
On Fri, Mar 20, 2015 at 3:15 AM, Peter Zijlstra  wrote:
>
> Linus, any plans for this patch? I think it does solve a fair few issues
> the current code has.

So I didn't really have any plans. I think it's a good patch, and it
might even fix a bug and improve code generation, but it didn't fix
the lockup problem, so..

> do you want me to route it through tip (and write a changelog while I'm
> at it) or will you be taking it yourself?

If you think it's worth committing, then go ahead.  Add my signed-off,
although I have to note that I didn't do a lot of testing of that
patch (I think it was compile-tested when I sent it out, and then I
ended up running it for a while, and obviously it got *some* testing
by the person who saw the lockup that just showed that it didn't fix
the problem ;(

  Linus


Re: smp_call_function_single lockups

2015-03-20 Thread Peter Zijlstra
On Wed, Feb 11, 2015 at 12:42:10PM -0800, Linus Torvalds wrote:
> Ok, this is a more involved patch than I'd like, but making the
> *caller* do all the CSD maintenance actually cleans things up.
> 
> And this is still completely untested, and may be entirely buggy. What
> do you guys think?
> 

Linus, any plans for this patch? I think it does solve a fair few issues
the current code has.

do you want me to route it through tip (and write a changelog while I'm
at it) or will you be taking it yourself?


Re: smp_call_function_single lockups

2015-02-23 Thread Rafael David Tinoco
>
> [11396.096002] Hardware name: OpenStack Foundation OpenStack Nova, BIOS Bochs 
> 01/01/2011
>
> But it's a virtual machine, right? It's not running bare metal, it's running
> a !virt kernel on a virt machine, so maybe some of the virt muck is
> borked?
>
> A very subtly broken APIC emulation would be heaps of 'fun'.

Indeed, I'll dig into the hypervisor changes... Anyway, this initial stack
trace was first seen during the "frequent lockups in 3.18rc4"
discussion:

https://lkml.org/lkml/2014/11/14/656

RIP: 0010:[9c11e98a]  [9c11e98a]
generic_exec_single+0xea/0x1d0

At that time seen by Dave Jones, and running bare metal (if I remember
correctly).

-- 
Rafael Tinoco


Re: smp_call_function_single lockups

2015-02-23 Thread Peter Zijlstra
On Mon, Feb 23, 2015 at 11:32:50AM -0800, Linus Torvalds wrote:
> On Mon, Feb 23, 2015 at 6:01 AM, Rafael David Tinoco  
> wrote:
> >
> > This is v3.19 + your patch (smp acquire/release)
> > - (nested kvm with 2 vcpus on top of proliant with x2apic cluster mode
> > and acpi_idle)
> 
> Hmm. There is absolutely nothing else going on on that machine, except
> for the single call to smp_call_function_single() that is waiting for
> the CSD to be released.
> 
> > * It looks like we got locked because of reentrant flush_tlb_* through
> > smp_call_*
> > but I'll leave it to you.
> 
> No, that is all a perfectly regular callchain:
> 
>  .. native_flush_tlb_others -> smp_call_function_many ->
> smp_call_function_single
> 
> but the stack contains some stale addresses (one is probably just from
> smp_call_function_single() calling into "generic_exec_single()", and
> thus the stack contains the return address inside
> smp_call_function_single() in _addition_ to the actual place where the
> watchdog timer then interrupted it).
> 
> It all really looks very regular and sane, and looks like
> smp_call_function_single() is happily just waiting for the IPI to
> finish in the (inlined) csd_lock_wait().
> 
> I see nothing wrong at all.

[11396.096002] Hardware name: OpenStack Foundation OpenStack Nova, BIOS Bochs 
01/01/2011

But it's a virtual machine, right? It's not running bare metal, it's running
a !virt kernel on a virt machine, so maybe some of the virt muck is
borked?

A very subtly broken APIC emulation would be heaps of 'fun'.


Re: smp_call_function_single lockups

2015-02-23 Thread Linus Torvalds
On Mon, Feb 23, 2015 at 6:01 AM, Rafael David Tinoco  wrote:
>
> This is v3.19 + your patch (smp acquire/release)
> - (nested kvm with 2 vcpus on top of proliant with x2apic cluster mode
> and acpi_idle)

Hmm. There is absolutely nothing else going on on that machine, except
for the single call to smp_call_function_single() that is waiting for
the CSD to be released.

> * It looks like we got locked because of reentrant flush_tlb_* through
> smp_call_*
> but I'll leave it to you.

No, that is all a perfectly regular callchain:

 .. native_flush_tlb_others -> smp_call_function_many ->
smp_call_function_single

but the stack contains some stale addresses (one is probably just from
smp_call_function_single() calling into "generic_exec_single()", and
thus the stack contains the return address inside
smp_call_function_single() in _addition_ to the actual place where the
watchdog timer then interrupted it).

It all really looks very regular and sane, and looks like
smp_call_function_single() is happily just waiting for the IPI to
finish in the (inlined) csd_lock_wait().

I see nothing wrong at all.
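(For reference, the wait in question is this spin in kernel/smp.c,
roughly as it stood in v3.19:)

static void csd_lock_wait(struct call_single_data *csd)
{
	/* Spin until the IPI handler on the other CPU unlocks the csd. */
	while (csd->flags & CSD_FLAG_LOCK)
		cpu_relax();
}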

However, here's a patch to the actual x86
smp_call_function_single_interrupt() handler, which probably doesn't
make any difference at all, but does:

 - gets rid of the incredibly crazy and ugly smp_entering_irq() that
inexplicably pairs with irq_exit()

 - makes the SMP IPI functions do the APIC ACK cycle at the *end*
rather than at the beginning.

The exact placement of the ACK should make absolutely no difference,
since interrupts should be disabled for the whole period anyway. But
I'm desperate and am flailing around worrying about x2apic bugs and
whatnot in addition to whatever bugs we might have in this area in sw.

So maybe there is some timing thing. This makes us consistently ack
the IPI _after_ we have cleared the list of smp call actions, and
executed the list, rather than ack it before, and then go on to muck
with the list.

Plus that old code really annoyed me anyway.

So I don't see why this should matter, but since I don't see how the
bug could happen in the first place, might as well try it..

  Linus
 arch/x86/kernel/smp.c | 35 +++
 1 file changed, 23 insertions(+), 12 deletions(-)

diff --git a/arch/x86/kernel/smp.c b/arch/x86/kernel/smp.c
index be8e1bde07aa..ed43259111d8 100644
--- a/arch/x86/kernel/smp.c
+++ b/arch/x86/kernel/smp.c
@@ -265,12 +265,23 @@ __visible void smp_reschedule_interrupt(struct pt_regs *regs)
 */
 }
 
-static inline void smp_entering_irq(void)
+/*
+ * This is the same as "entering_irq()", except it doesn't
+ * do the 'exit_idle()'.
+ *
+ * It pairs with smp_irq_exit().
+ */
+static inline void smp_irq_enter(void)
 {
-   ack_APIC_irq();
irq_enter();
 }
 
+static inline void smp_irq_exit(void)
+{
+   irq_exit();
+   ack_APIC_irq();
+}
+
 __visible void smp_trace_reschedule_interrupt(struct pt_regs *regs)
 {
/*
@@ -279,11 +290,11 @@ __visible void smp_trace_reschedule_interrupt(struct pt_regs *regs)
 * scheduler_ipi(). This is OK, since those functions are allowed
 * to nest.
 */
-   smp_entering_irq();
+   smp_irq_enter();
trace_reschedule_entry(RESCHEDULE_VECTOR);
__smp_reschedule_interrupt();
trace_reschedule_exit(RESCHEDULE_VECTOR);
-   exiting_irq();
+   smp_irq_exit();
/*
 * KVM uses this interrupt to force a cpu out of guest mode
 */
@@ -297,18 +308,18 @@ static inline void __smp_call_function_interrupt(void)
 
 __visible void smp_call_function_interrupt(struct pt_regs *regs)
 {
-   smp_entering_irq();
+   smp_irq_enter();
__smp_call_function_interrupt();
-   exiting_irq();
+   smp_irq_exit();
 }
 
 __visible void smp_trace_call_function_interrupt(struct pt_regs *regs)
 {
-   smp_entering_irq();
+   smp_irq_enter();
trace_call_function_entry(CALL_FUNCTION_VECTOR);
__smp_call_function_interrupt();
trace_call_function_exit(CALL_FUNCTION_VECTOR);
-   exiting_irq();
+   smp_irq_exit();
 }
 
 static inline void __smp_call_function_single_interrupt(void)
@@ -319,18 +330,18 @@ static inline void __smp_call_function_single_interrupt(void)
 
 __visible void smp_call_function_single_interrupt(struct pt_regs *regs)
 {
-   smp_entering_irq();
+   smp_irq_enter();
__smp_call_function_single_interrupt();
-   exiting_irq();
+   smp_irq_exit();
 }
 
 __visible void smp_trace_call_function_single_interrupt(struct pt_regs *regs)
 {
-   smp_entering_irq();
+   smp_irq_enter();
trace_call_function_single_entry(CALL_FUNCTION_SINGLE_VECTOR);
__smp_call_function_single_interrupt();
trace_call_function_single_exit(CALL_FUNCTION_SINGLE_VECTOR);
-   exiting_irq();
+   smp_irq_exit();
 }
 
 static int __init nonmi_ipi_setup(char *str)


Re: smp_call_function_single lockups

2015-02-23 Thread Rafael David Tinoco
On Thu, Feb 19, 2015 at 2:14 PM, Linus Torvalds
 wrote:

>
> Hmm. Still just the stack trace for the CPU that is blocked (CPU0), if
> you can get the core-file to work and figure out where the other CPU
> is, that would be good.
>

This is v3.19 + your patch (smp acquire/release)
- (nested kvm with 2 vcpus on top of proliant with x2apic cluster mode
and acpi_idle)

* http://people.canonical.com/~inaddy/lp1413540/STACK_TRACE.txt
http://people.canonical.com/~inaddy/lp1413540/BACKTRACES.txt
http://people.canonical.com/~inaddy/lp1413540/FOREACH_BT.txt

* It looks like we got locked because of reentrant flush_tlb_* through
smp_call_*
but I'll leave it to you. Let me know if you need any more info.

-
Rafael Tinoco


Re: smp_call_function_single lockups

2015-02-23 Thread Rafael David Tinoco
On Thu, Feb 19, 2015 at 2:14 PM, Linus Torvalds
torva...@linux-foundation.org wrote:


 Hmm. Still just the stack trace for the CPU that is blocked (CPU0), if
 you can get the core-file to work and figure out where the other CPU
 is, that would be good.


This is v3.19 + your patch (smp acquire/release)
- (nested kvm with 2 vcpus on top of proliant with x2apic cluster mode
and acpi_idle)

* http://people.canonical.com/~inaddy/lp1413540/STACK_TRACE.txt
http://people.canonical.com/~inaddy/lp1413540/BACKTRACES.txt
http://people.canonical.com/~inaddy/lp1413540/FOREACH_BT.txt

* It looks like we got locked because of reentrant flush_tlb_* through
smp_call_*
but I'll leave it to you. let me know if you need any more info..

-
Rafael Tinoco
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: smp_call_function_single lockups

2015-02-23 Thread Peter Zijlstra
On Mon, Feb 23, 2015 at 11:32:50AM -0800, Linus Torvalds wrote:
 On Mon, Feb 23, 2015 at 6:01 AM, Rafael David Tinoco ina...@ubuntu.com 
 wrote:
 
  This is v3.19 + your patch (smp acquire/release)
  - (nested kvm with 2 vcpus on top of proliant with x2apic cluster mode
  and acpi_idle)
 
 Hmm. There is absolutely nothing else going on on that machine, except
 for the single call to smp_call_function_single() that is waiting for
 the CSD to be released.
 
  * It looks like we got locked because of reentrant flush_tlb_* through
  smp_call_*
  but I'll leave it to you.
 
 No, that is all a perfectly regular callchain:
 
  .. native_flush_tlb_others - smp_call_function_many -
 smp_call_function_single
 
 but the stack contains some stale addresses (one is probably just from
 smp_call_function_single() calling into generic_exec_single(), and
 thus the stack contains the return address inside
 smp_call_function_single() in _addition_ to the actual place where the
 watchdog timer then interrupted it).
 
 It all really looks very regular and sane, and looks like
 smp_call_function_single() is happily just waiting for the IPI to
 finish in the (inlined) csd_lock_wait().
 
 I see nothing wrong at all.

[11396.096002] Hardware name: OpenStack Foundation OpenStack Nova, BIOS Bochs 
01/01/2011

But its a virtual machine right? Its not running bare metal, its running
a !virt kernel on a virt machine, so maybe some of the virt muck is
borked?

A very subtly broken APIC emulation would be heaps of 'fun'.
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: smp_call_function_single lockups

2015-02-23 Thread Rafael David Tinoco

 [11396.096002] Hardware name: OpenStack Foundation OpenStack Nova, BIOS Bochs 
 01/01/2011

 But its a virtual machine right? Its not running bare metal, its running
 a !virt kernel on a virt machine, so maybe some of the virt muck is
 borked?

 A very subtly broken APIC emulation would be heaps of 'fun'.

Indeed, I'll dig hypervisor changes... Anyway, this initial stack
trace was first seen during the frequent lockups in 3.18rc4
discussion:

https://lkml.org/lkml/2014/11/14/656

RIP: 0010:[9c11e98a]  [9c11e98a]
generic_exec_single+0xea/0x1d0

At that time seen by Dave Jones, and running bare metal (if i remember
correctly).

-- 
Rafael Tinoco
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: smp_call_function_single lockups

2015-02-23 Thread Linus Torvalds
On Mon, Feb 23, 2015 at 6:01 AM, Rafael David Tinoco ina...@ubuntu.com wrote:

 This is v3.19 + your patch (smp acquire/release)
 - (nested kvm with 2 vcpus on top of proliant with x2apic cluster mode
 and acpi_idle)

Hmm. There is absolutely nothing else going on on that machine, except
for the single call to smp_call_function_single() that is waiting for
the CSD to be released.

 * It looks like we got locked because of reentrant flush_tlb_* through
 smp_call_*
 but I'll leave it to you.

No, that is all a perfectly regular callchain:

 .. native_flush_tlb_others - smp_call_function_many -
smp_call_function_single

but the stack contains some stale addresses (one is probably just from
smp_call_function_single() calling into generic_exec_single(), and
thus the stack contains the return address inside
smp_call_function_single() in _addition_ to the actual place where the
watchdog timer then interrupted it).

It all really looks very regular and sane, and looks like
smp_call_function_single() is happily just waiting for the IPI to
finish in the (inlined) csd_lock_wait().

I see nothing wrong at all.
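
For reference, the wait loop in question - a sketch of kernel/smp.c as
of v3.19, where the smp_load_acquire() form is an assumption about the
shape of the acquire/release test patch:

  static void csd_lock_wait(struct call_single_data *csd)
  {
        /* Spin until the IPI handler clears CSD_FLAG_LOCK. */
        while (smp_load_acquire(&csd->flags) & CSD_FLAG_LOCK)
                cpu_relax();
  }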

However, here's a patch to the actual x86
smp_call_function_single_interrupt() handler, which probably doesn't
make any difference at all, but does:

 - gets rid of the incredibly crazy and ugly smp_entering_irq() that
inexplicably pairs with irq_exit()

 - makes the SMP IPI functions do the APIC ACK cycle at the *end*
rather than at the beginning.

The exact placement of the ACK should make absolutely no difference,
since interrupts should be disabled for the whole period anyway. But
I'm desperate and am flailing around worrying about x2apic bugs and
whatnot in addition to whatever bugs we might have in this area in sw.

So maybe there is some timing thing. This makes us consistently ack
the IPI _after_ we have cleared the list of smp call actions, and
executed the list, rather than ack it before, and then go on to muck
with the list.

Plus that old code really annoyed me anyway.

So I don't see why this should matter, but since I don't see how the
bug could happen in the first place, might as well try it..

  Linus
 arch/x86/kernel/smp.c | 35 +++
 1 file changed, 23 insertions(+), 12 deletions(-)

diff --git a/arch/x86/kernel/smp.c b/arch/x86/kernel/smp.c
index be8e1bde07aa..ed43259111d8 100644
--- a/arch/x86/kernel/smp.c
+++ b/arch/x86/kernel/smp.c
@@ -265,12 +265,23 @@ __visible void smp_reschedule_interrupt(struct pt_regs 
*regs)
 */
 }
 
-static inline void smp_entering_irq(void)
+/*
+ * This is the same as entering_irq(), except it doesn't
+ * do the 'exit_idle()'.
+ *
+ * It pairs with smp_irq_exit().
+ */
+static inline void smp_irq_enter(void)
 {
-   ack_APIC_irq();
irq_enter();
 }
 
+static inline void smp_irq_exit(void)
+{
+   irq_exit();
+   ack_APIC_irq();
+}
+
 __visible void smp_trace_reschedule_interrupt(struct pt_regs *regs)
 {
/*
@@ -279,11 +290,11 @@ __visible void smp_trace_reschedule_interrupt(struct 
pt_regs *regs)
 * scheduler_ipi(). This is OK, since those functions are allowed
 * to nest.
 */
-   smp_entering_irq();
+   smp_irq_enter();
trace_reschedule_entry(RESCHEDULE_VECTOR);
__smp_reschedule_interrupt();
trace_reschedule_exit(RESCHEDULE_VECTOR);
-   exiting_irq();
+   smp_irq_exit();
/*
 * KVM uses this interrupt to force a cpu out of guest mode
 */
@@ -297,18 +308,18 @@ static inline void __smp_call_function_interrupt(void)
 
 __visible void smp_call_function_interrupt(struct pt_regs *regs)
 {
-   smp_entering_irq();
+   smp_irq_enter();
__smp_call_function_interrupt();
-   exiting_irq();
+   smp_irq_exit();
 }
 
 __visible void smp_trace_call_function_interrupt(struct pt_regs *regs)
 {
-   smp_entering_irq();
+   smp_irq_enter();
trace_call_function_entry(CALL_FUNCTION_VECTOR);
__smp_call_function_interrupt();
trace_call_function_exit(CALL_FUNCTION_VECTOR);
-   exiting_irq();
+   smp_irq_exit();
 }
 
 static inline void __smp_call_function_single_interrupt(void)
@@ -319,18 +330,18 @@ static inline void 
__smp_call_function_single_interrupt(void)
 
 __visible void smp_call_function_single_interrupt(struct pt_regs *regs)
 {
-   smp_entering_irq();
+   smp_irq_enter();
__smp_call_function_single_interrupt();
-   exiting_irq();
+   smp_irq_exit();
 }
 
 __visible void smp_trace_call_function_single_interrupt(struct pt_regs *regs)
 {
-   smp_entering_irq();
+   smp_irq_enter();
trace_call_function_single_entry(CALL_FUNCTION_SINGLE_VECTOR);
__smp_call_function_single_interrupt();
trace_call_function_single_exit(CALL_FUNCTION_SINGLE_VECTOR);
-   exiting_irq();
+   smp_irq_exit();
 }
 
 static int __init nonmi_ipi_setup(char *str)


Re: smp_call_function_single lockups

2015-02-22 Thread Ingo Molnar

* Daniel J Blueman  wrote:

> The Intel SDM [1] and AMD F15h BKDG [2] state that IPIs 
> are queued, so the wait_icr_idle() polling is only 
> necessary on PPro and older, and maybe then to avoid 
> delivery retry. This unnecessarily ties up the IPI 
> caller, so we bypass the polling in the Numachip APIC 
> driver IPI-to-self path.

It would be nice to propagate this back to the generic x86 
code.
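
One hypothetical shape for that, relying only on the x2APIC property 
quoted above (in x2APIC mode the ICR delivery-status bit is reserved, 
so the poll can be skipped entirely):

  static inline void apic_wait_icr_idle_if_needed(void)
  {
        /* x2APIC has no ICR busy bit; writes are queued by hw. */
        if (!x2apic_enabled())
                apic_wait_icr_idle();
  }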

> On Linus's earlier point, with the large core counts on 
> Numascale systems, I previously implemented a shortcut to 
> allow single IPIs to bypass all the cpumask generation 
> and walking; it's way down on my list, but I'll see if I 
> can generalise and present a patch series at some point 
> if interested?

I am definitely interested!

Thanks,

Ingo
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: smp_call_function_single lockups

2015-02-22 Thread Daniel J Blueman

On Saturday, February 21, 2015 at 3:50:05 AM UTC+8, Ingo Molnar wrote:
> * Linus Torvalds  wrote:
>
> > On Fri, Feb 20, 2015 at 1:30 AM, Ingo Molnar  wrote:
> > >
> > > So if my memory serves me right, I think it was for
> > > local APICs, and even there mostly it was a performance
> > > issue: if an IO-APIC sent more than 2 IRQs per 'level'
> > > to a local APIC then the IO-APIC might be forced to
> > > resend those IRQs, leading to excessive message traffic
> > > on the relevant hardware bus.
> >
> > Hmm. I have a distinct memory of interrupts actually
> > being lost, but I really can't find anything to support
> > that memory, so it's probably some drug-induced confusion
> > of mine. I don't find *anything* about interrupt "levels"
> > any more in modern Intel documentation on the APIC, but
> > maybe I missed something. But it might all have been an
> > IO-APIC thing.
>
> So I just found an older discussion of it:
>
> 
http://www.gossamer-threads.com/lists/linux/kernel/1554815?do=post_view_threaded#1554815

>
> while it's not a comprehensive description, it matches what
> I remember from it: with 3 vectors within a level of 16
> vectors we'd get excessive "retries" sent by the IO-APIC
> through the (then rather slow) APIC bus.
>
> ( It was possible for the same phenomenon to occur with
>   IPIs as well, when a CPU sent an APIC message to another
>   CPU, if the affected vectors were equal modulo 16 - but
>   this was rare IIRC because most systems were dual CPU so
>   only two IPIs could have occurred. )
>
> > Well, the attached patch for that seems pretty trivial.
> > And seems to work for me (my machine also defaults to
> > x2apic clustered mode), and allows the APIC code to start
> > doing a "send to specific cpu" thing one by one, since it
> > falls back to the send_IPI_mask() function if no
> > individual CPU IPI function exists.
> >
> > NOTE! There's a few cases in
> > arch/x86/kernel/apic/vector.c that also do that
> > "apic->send_IPI_mask(cpumask_of(i), .." thing, but they
> > aren't that important, so I didn't bother with them.
> >
> > NOTE2! I've tested this, and it seems to work, but maybe
> > there is something seriously wrong. I skipped the
> > "disable interrupts" part when doing the "send_IPI", for
> > example, because I think it's entirely unnecessary for
> > that case. But this has certainly *not* gotten any real
> > stress-testing.

> I'm not so sure about that aspect: I think disabling IRQs
> might be necessary with some APICs (if lower levels don't
> disable IRQs), to make sure the 'local APIC busy' bit isn't
> set:
>
> we typically do a wait_icr_idle() call before sending an
> IPI - and if IRQs are not off then the idleness of the APIC
> might be gone. (Because a hardirq that arrives after a
> wait_icr_idle() but before the actual IPI sending sent out
> an IPI and the queue is full.)

The Intel SDM [1] and AMD F15h BKDG [2] state that IPIs are queued, so 
the wait_icr_idle() polling is only necessary on PPro and older, and 
maybe then to avoid delivery retry. This unnecessarily ties up the IPI 
caller, so we bypass the polling in the Numachip APIC driver IPI-to-self 
path.
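
A minimal sketch of such an IPI-to-self path, using the generic x2APIC 
SELF_IPI register for illustration (the actual Numachip driver code 
may differ):

  static void send_IPI_self_no_poll(int vector)
  {
        /* No wait_icr_idle(): the self-IPI register takes only the
         * vector and never reports a busy state in x2APIC mode. */
        apic_write(APIC_SELF_IPI, vector);
  }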


On Linus's earlier point, with the large core counts on Numascale 
systems, I previously implemented a shortcut to allow single IPIs to 
bypass all the cpumask generation and walking; it's way down on my list, 
but I'll see if I can generalise and present a patch series at some 
point if interested?


Dan

-- [1] Intel SDM 3, p10-30 
http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-software-developer-system-programming-manual-325384.pdf


If more than one interrupt is generated with the same vector number, the 
local APIC can set the bit for the vector both in the IRR and the ISR. 
This means that for the Pentium 4 and Intel Xeon processors, the IRR and 
ISR can queue two interrupts for each interrupt vector: one in the IRR 
and one in the ISR. Any additional interrupts issued for the same 
interrupt vector are collapsed into the single bit in the IRR. For the 
P6 family and Pentium processors, the IRR and ISR registers can queue no 
more than two interrupts per interrupt vector and will reject other 
interrupts that are received within the same vector.


-- [2] AMD Fam15h BKDG p470 
http://support.amd.com/TechDocs/42301_15h_Mod_00h-0Fh_BKDG.pdf


DS: interrupt delivery status. Read-only. Reset: 0. In xAPIC mode this 
bit is set to indicate that the interrupt has not yet been accepted by 
the destination core(s). 0=Idle. 1=Send pending. Reserved in x2APIC 
mode. Software may repeatedly write ICRL without polling the DS bit; all 
requested IPIs will be delivered.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: smp_call_function_single lockups

2015-02-20 Thread Ingo Molnar

* Linus Torvalds  wrote:

> On Fri, Feb 20, 2015 at 11:41 AM, Ingo Molnar  wrote:
> >
> > I'm not so sure about that aspect: I think disabling 
> > IRQs might be necessary with some APICs (if lower 
> > levels don't disable IRQs), to make sure the 'local 
> > APIC busy' bit isn't set:
> 
> Right. But afaik not for the x2apic case, which this is.  
> The x2apic doesn't even have a busy bit, and sending the 
> ipi is a single write,

Ah, ok! Then the patch looks good to me.

( Originally we didn't wait for the ICR bit either, but 
  then it was added due to later errata and was eventually 
  made an architectural requirement. )

> I agree that when doing other apic implementations, we 
> may need to guarantee atomicity for things like "wait for 
> apic idle, then send the ipi".

Yeah.

Thanks,

Ingo
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: smp_call_function_single lockups

2015-02-20 Thread Linus Torvalds
On Fri, Feb 20, 2015 at 11:41 AM, Ingo Molnar  wrote:
>
> I'm not so sure about that aspect: I think disabling IRQs
> might be necessary with some APICs (if lower levels don't
> disable IRQs), to make sure the 'local APIC busy' bit isn't
> set:

Right. But afaik not for the x2apic case, which this is.  The x2apic
doesn't even have a busy bit, and sending the ipi is a single write,

I agree that when doing other apic implementations, we may need to
guarantee atomicity for things like "wait for apic idle, then send the
ipi".

Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: smp_call_function_single lockups

2015-02-20 Thread Ingo Molnar

* Linus Torvalds  wrote:

> On Fri, Feb 20, 2015 at 1:30 AM, Ingo Molnar  wrote:
> >
> > So if my memory serves me right, I think it was for 
> > local APICs, and even there mostly it was a performance 
> > issue: if an IO-APIC sent more than 2 IRQs per 'level' 
> > to a local APIC then the IO-APIC might be forced to 
> > resend those IRQs, leading to excessive message traffic 
> > on the relevant hardware bus.
> 
> Hmm. I have a distinct memory of interrupts actually 
> being lost, but I really can't find anything to support 
> that memory, so it's probably some drug-induced confusion 
> of mine. I don't find *anything* about interrupt "levels" 
> any more in modern Intel documentation on the APIC, but 
> maybe I missed something. But it might all have been an 
> IO-APIC thing.

So I just found an older discussion of it:

  
http://www.gossamer-threads.com/lists/linux/kernel/1554815?do=post_view_threaded#1554815

while it's not a comprehensive description, it matches what 
I remember from it: with 3 vectors within a level of 16 
vectors we'd get excessive "retries" sent by the IO-APIC 
through the (then rather slow) APIC bus.

( It was possible for the same phenomenon to occur with 
  IPIs as well, when a CPU sent an APIC message to another
  CPU, if the affected vectors were equal modulo 16 - but
  this was rare IIRC because most systems were dual CPU so
  only two IPIs could have occurred. )

> Well, the attached patch for that seems pretty trivial. 
> And seems to work for me (my machine also defaults to 
> x2apic clustered mode), and allows the APIC code to start 
> doing a "send to specific cpu" thing one by one, since it 
> falls back to the send_IPI_mask() function if no 
> individual CPU IPI function exists.
> 
> NOTE! There's a few cases in 
> arch/x86/kernel/apic/vector.c that also do that 
> "apic->send_IPI_mask(cpumask_of(i), .." thing, but they 
> aren't that important, so I didn't bother with them.
> 
> NOTE2! I've tested this, and it seems to work, but maybe 
> there is something seriously wrong. I skipped the 
> "disable interrupts" part when doing the "send_IPI", for 
> example, because I think it's entirely unnecessary for 
> that case. But this has certainly *not* gotten any real 
> stress-testing.

I'm not so sure about that aspect: I think disabling IRQs 
might be necessary with some APICs (if lower levels don't 
disable IRQs), to make sure the 'local APIC busy' bit isn't 
set:

we typically do a wait_icr_idle() call before sending an 
IPI - and if IRQs are not off then the idleness of the APIC 
might be gone. (Because a hardirq that arrives after a 
wait_icr_idle() but before the actual IPI sending sent out 
an IPI and the queue is full.)

So the IPI sending should be atomic in that sense.
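
A sketch of the atomicity being argued for, modeled loosely on the 
classic __default_send_IPI_dest_field() path (an illustration, not 
the kernel's exact code):

  static void send_ipi_atomic(u32 apicid, int vector)
  {
        unsigned long flags;

        /* Keep the poll-then-write pair atomic: a hardirq between
         * the two steps could send its own IPI and leave the ICR
         * busy again behind our back. */
        local_irq_save(flags);
        __xapic_wait_icr_idle();
        native_apic_mem_write(APIC_ICR2, SET_APIC_DEST_FIELD(apicid));
        native_apic_mem_write(APIC_ICR, APIC_DM_FIXED | vector);
        local_irq_restore(flags);
  }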

Thanks,

Ingo
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: smp_call_function_single lockups

2015-02-20 Thread Linus Torvalds
On Fri, Feb 20, 2015 at 1:30 AM, Ingo Molnar  wrote:
>
> So if my memory serves me right, I think it was for local
> APICs, and even there mostly it was a performance issue: if
> an IO-APIC sent more than 2 IRQs per 'level' to a local
> APIC then the IO-APIC might be forced to resend those IRQs,
> leading to excessive message traffic on the relevant
> hardware bus.

Hmm. I have a distinct memory of interrupts actually being lost, but I
really can't find anything to support that memory, so it's probably
some drug-induced confusion of mine. I don't find *anything* about
interrupt "levels" any more in modern Intel documentation on the APIC,
but maybe I missed something. But it might all have been an IO-APIC
thing.

> I think you got it right.
>
> So the principle of EOI acknowledgement from the OS to the
> local APIC is specific to the IRQ that raised the interrupt
> and caused the vector to be executed, so it's not possible
> to ack the 'wrong' IRQ.

I agree - but only for cases where we actually should ACK in the first
place. The LAPIC knows what its own priorities were, so it always just
clears the highest-priority ISR bit.

> So I _think_ it's not possible to accidentally acknowledge
> a pending IRQ that has not been issued to the CPU yet
> (unless we have hardirqs enabled), just by writing stray
> EOIs to the local APIC. So in that sense the ExtInt irq0
> case should be mostly harmless.

It must clearly be mostly harmless even if I'm right, since we've
always done the ACK for it, as far as I can tell. So you're probably
right, and it doesn't matter.

In particular, it would depend on exactly when the ISR bit actually
gets set in the APIC. If it gets set only after the CPU has actually
*accepted* the interrupt, then it should not be possible for a
spurious EOI write to screw things up, because it's still happening in
the context of the previously executing interrupt, so a new ISR bit
hasn't been set yet, and any old ISR bits must be lower-priority than
the currently executing interrupt routine (that were interrupted by a
higher-priority one).

That sounds like the likely scenario. It just worries me. And my test
patch did actually trigger, even if it only seemed to trigger during
device probing (when the probing itself probably caused and then
cleared an interrupt).

> I also fully share your frustration about the level of
> obfuscation the various APIC drivers display today.
>
> The lack of a simple single-IPI implementation is annoying
> as well - when that injury was first inflicted with
> clustered APICs I tried to resist, but AFAICR there were
> some good hardware arguments why it cannot be kept and I
> gave up.
>
> If you agree then I can declare a feature stop for new
> hardware support (that isn't a stop-ship issue for users)
> until it's all cleaned up for real, and Thomas started some
> of that work already.

Well, the attached patch for that seems pretty trivial. And seems to
work for me (my machine also defaults to x2apic clustered mode), and
allows the APIC code to start doing a "send to specific cpu" thing one
by one, since it falls back to the send_IPI_mask() function if no
individual CPU IPI function exists.

NOTE! There's a few cases in arch/x86/kernel/apic/vector.c that also
do that "apic->send_IPI_mask(cpumask_of(i), .." thing, but they aren't
that important, so I didn't bother with them.

NOTE2! I've tested this, and it seems to work, but maybe there is
something seriously wrong. I skipped the "disable interrupts" part
when doing the "send_IPI", for example, because I think it's entirely
unnecessary for that case. But this has certainly *not* gotten any
real stress-testing.

But /proc/interrupts shows thousands of rescheduling IPI's, so this
should all have triggered.

Linus
 arch/x86/include/asm/apic.h   |  1 +
 arch/x86/kernel/apic/x2apic_cluster.c |  9 +
 arch/x86/kernel/smp.c | 16 ++--
 3 files changed, 24 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/apic.h b/arch/x86/include/asm/apic.h
index 92003f3c8a42..5d47ffcfa65d 100644
--- a/arch/x86/include/asm/apic.h
+++ b/arch/x86/include/asm/apic.h
@@ -296,6 +296,7 @@ struct apic {
  unsigned int *apicid);
 
/* ipi */
+   void (*send_IPI)(int cpu, int vector);
void (*send_IPI_mask)(const struct cpumask *mask, int vector);
void (*send_IPI_mask_allbutself)(const struct cpumask *mask,
 int vector);
diff --git a/arch/x86/kernel/apic/x2apic_cluster.c 
b/arch/x86/kernel/apic/x2apic_cluster.c
index e658f21681c8..177c4fb027a1 100644
--- a/arch/x86/kernel/apic/x2apic_cluster.c
+++ b/arch/x86/kernel/apic/x2apic_cluster.c
@@ -23,6 +23,14 @@ static inline u32 x2apic_cluster(int cpu)
return per_cpu(x86_cpu_to_logical_apicid, cpu) >> 16;
 }
 
+static void x2apic_send_IPI(int cpu, int vector)
+{
+   u32 dest = per_cpu(x86_cpu_to_logical_apicid, cpu);
+
+   x2apic_wrmsr_fence();
+   __x2apic_send_IPI_dest(dest, vector, APIC_DEST_LOGICAL);
+}

Re: smp_call_function_single lockups

2015-02-20 Thread Ingo Molnar

* Linus Torvalds  wrote:

> On Thu, Feb 19, 2015 at 9:39 AM, Linus Torvalds torva...@linux-foundation.org wrote:
> > On Thu, Feb 19, 2015 at 8:59 AM, Linus Torvalds torva...@linux-foundation.org wrote:
> >>
> >> Are there known errata for the x2apic?
> >
> > .. and in particular, do we still have to worry about 
> > the traditional local apic "if there are more than two 
> > pending interrupts per priority level, things get lost" 
> > problem?
> >
> > I forget the exact details. Hopefully somebody 
> > remembers.
> 
> I can't find it in the docs. I find the "two-entries per 
> vector", but not anything that is per priority level 
> (group of 16 vectors). Maybe that was the IO-APIC, in 
> which case it's immaterial for IPI's.

So if my memory serves me right, I think it was for local 
APICs, and even there mostly it was a performance issue: if 
an IO-APIC sent more than 2 IRQs per 'level' to a local 
APIC then the IO-APIC might be forced to resend those IRQs, 
leading to excessive message traffic on the relevant 
hardware bus.

( I think the 'resend' was automatic in this case, i.e. a 
  hardware fallback for a CPU side resource shortage, and 
  it could not result in actually lost IRQs. I never saw 
  this documented properly, so people inside Intel or AMD 
  would be in a better position to comment on this ... I 
  might be mis-remembering this or confusing different 
  bugs. )

> However, having now mostly re-acquainted myself with the 
> APIC details, it strikes me that we do have some oddities 
> here.
> 
> In particular, a few interrupt types are very special: 
> NMI, SMI, INIT, ExtINT, or SIPI are handled early in the 
> interrupt acceptance logic, and are sent directly to the 
> CPU core, without going through the usual intermediate 
> IRR/ISR dance.
> 
> And why might this matter? It's important because it 
> means that those kinds of interrupts must *not* do the 
> apic EOI that ack_APIC_irq() does.
> 
> And we correctly don't do ack_APIC_irq() for NMI etc, but 
> it strikes me that ExtINT is odd and special.
> 
> I think we still use ExtINT for some odd cases. We used 
> to have some magic with the legacy timer interrupt, for 
> example. And I think they all go through the normal 
> "do_IRQ()" logic regardless of whether they are ExtINT or 
> not.
> 
> Now, what happens if we send an EOI for an ExtINT 
> interrupt? It basically ends up being a spurious IPI. And 
> I *think* that what normally happens is absolutely 
> nothing at all. But if in addition to the ExtINT, there 
> was a pending IPI (or other pending ISR bit set), maybe 
> we lose interrupts..

1)

I think you got it right.

So the principle of EOI acknowledgement from the OS to the 
local APIC is specific to the IRQ that raised the interrupt 
and caused the vector to be executed, so it's not possible 
to ack the 'wrong' IRQ.

But technically the EOI is state-less, i.e. (as you know) 
we write a constant value to a local APIC register without 
indicating which vector or external IRQ we meant. The OS 
wants to ack 'the IRQ that we are executing currently', but 
this leaves the situation a bit confused in cases where for 
example an IRQ handler enables IRQs, another IRQ comes in 
and stays unacked.
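
Concretely, the whole ack boils down to a single constant write - 
roughly what the generic apic_eoi() helper does; the APIC itself picks 
which ISR bit to clear (the highest-priority one that is set):

  static inline void apic_eoi(void)
  {
        /* A constant ack value; no vector is passed along. */
        apic_write(APIC_EOI, APIC_EOI_ACK);
  }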

So I _think_ it's not possible to accidentally acknowledge 
a pending IRQ that has not been issued to the CPU yet 
(unless we have hardirqs enabled), just by writing stray 
EOIs to the local APIC. So in that sense the ExtInt irq0 
case should be mostly harmless.

But I could be wrong :-/

2)

So my suggestion for this bug would be:

The 'does a stray EOI matter' question could also be tested 
by deliberately writing two EOIs instead of just one - does 
this trigger the bug faster?

Then perhaps try to make sure that no hardirqs get ever 
enabled in an irq handler, and figure out whether any of 
the IRQs in question are edge triggered - but AFAICS it 
could be 'any' IRQ handler or flow causing the problem, 
right?

3)

I also fully share your frustration about the level of 
obfuscation the various APIC drivers display today.

The lack of a simple single-IPI implementation is annoying 
as well - when that injury was first inflicted with 
clustered APICs I tried to resist, but AFAICR there were 
some good hardware arguments why it cannot be kept and I 
gave up.

If you agree then I can declare a feature stop for new 
hardware support (that isn't a stop-ship issue for users) 
until it's all cleaned up for real, and Thomas started some 
of that work already.

> .. and it's entirely possible that I'm just completely 
> full of shit. Who is the poor bastard who has worked most 
> with things like ExtINT, and can educate me? I'm adding 
> Ingo, hpa and Jiang Liu as primary contacts..

So the buck stops at my desk, but any help is welcome!

Thanks,

Ingo
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: smp_call_function_single lockups

2015-02-19 Thread Linus Torvalds
On Thu, Feb 19, 2015 at 1:59 PM, Linus Torvalds torva...@linux-foundation.org wrote:
>
> Is this worth looking at? Or is it something spurious? I might have
> gotten the vectors wrong, and maybe the warning is not because the ISR
> bit isn't set, but because I test the wrong bit.

I edited the patch to do ratelimiting (one per 10s max) rather than
"once". And tested it some more. It seems to work correctly. The irq
case during 8042 probing is not repeatable, and I suspect it happens
because the interrupt source goes away (some probe-time thing that
first triggers an interrupt, but then clears it itself), so it doesn't
happen every boot, and I've gotten it with slightly different
backtraces.
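
A sketch of that kind of ratelimiting, using the stock helpers from 
linux/ratelimit.h (illustrative, not the exact edit):

  static void warn_missing_isr(int vector)
  {
        static DEFINE_RATELIMIT_STATE(rs, 10 * HZ, 1);

        /* At most one warning per 10s, instead of WARN_ONCE(). */
        if (__ratelimit(&rs))
                WARN(1, "EOI with ISR bit clear, vector %d\n", vector);
  }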

But it's the only warning that happens for me, so I think my code is
right (at least for the cases that trigger on this machine). It's
definitely not a "every interrupt causes the warning because the code
was buggy, and the WARN_ONCE() just printed the first one".

It would be interesting to hear if others see spurious APIC EOI cases
too. In particular, the people seeing the IPI lockup. Because a lot of
the lockups we've seen have *looked* like the IPI interrupt just never
happened, and so we're waiting forever for the target CPU to react to
it. And just maybe the spurious EOI could cause the wrong bit to be
cleared in the ISR, and then the interrupt never shows up. Something
like that would certainly explain why it only happens on some machines
and under certain timing circumstances.

Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: smp_call_function_single lockups

2015-02-19 Thread Linus Torvalds
On Thu, Feb 19, 2015 at 12:29 PM, Linus Torvalds torva...@linux-foundation.org wrote:
>
> Now, what happens if we send an EOI for an ExtINT interrupt? It
> basically ends up being a spurious IPI. And I *think* that what
> normally happens is absolutely nothing at all. But if in addition to
> the ExtINT, there was a pending IPI (or other pending ISR bit set),
> maybe we lose interrupts..
>
> .. and it's entirely possible that I'm just completely full of shit.
> Who is the poor bastard who has worked most with things like ExtINT,
> and can educate me? I'm adding Ingo, hpa and Jiang Liu as primary
> contacts..

So quite frankly, trying to follow all the logic from do_IRQ() through
handle_irq() to the actual low-level handler, I just couldn't do it.

So instead, I wrote a patch to verify that the ISR bit is actually set
when we do ack_APIC_irq().

This was complicated by the fact that we don't actually pass in the
vector number at all to the acking, so 99% of the patch is just doing
that. A couple of places we don't really have a good vector number, so
I said "screw it, a negative value means that we won't check the ISR).

The attached patch is quite possibly garbage, but it gives an
interesting warning for me during i8042 probing, so who knows. Maybe
it actually shows a real problem - or maybe I just screwed up the
patch.

.. and maybe even if the patch is fine, it's actually never really a
problem to have spurious APIC ACK cycles. Maybe it cannot make
interrupts be ignored.

Anyway, the back-trace for the warning I get is during boot:

...
PNP: No PS/2 controller found. Probing ports directly.
[ cut here ]
WARNING: CPU: 0 PID: 1 at ./arch/x86/include/asm/apic.h:436
ir_ack_apic_edge+0x74/0x80()
Modules linked in:
CPU: 0 PID: 1 Comm: swapper/0 Not tainted
3.19.0-08857-g89d3fa45b4ad-dirty #2
Call Trace:
 <IRQ>
   dump_stack+0x45/0x57
   warn_slowpath_common+0x80/0xc0
   warn_slowpath_null+0x15/0x20
   ir_ack_apic_edge+0x74/0x80
   handle_edge_irq+0x51/0x110
   handle_irq+0x74/0x140
   do_IRQ+0x4a/0x140
   common_interrupt+0x6a/0x6a
 <EOI>
   ? _raw_spin_unlock_irqrestore+0x9/0x10
   __setup_irq+0x239/0x5a0
   request_threaded_irq+0xc2/0x180
   i8042_probe+0x5b8/0x680
   platform_drv_probe+0x2f/0xa0
   driver_probe_device+0x8b/0x3e0
   __driver_attach+0x93/0xa0
   bus_for_each_dev+0x63/0xa0
   driver_attach+0x19/0x20
   bus_add_driver+0x178/0x250
   driver_register+0x5f/0xf0
   __platform_driver_register+0x45/0x50
   __platform_driver_probe+0x26/0xa0
   __platform_create_bundle+0xad/0xe0
   i8042_init+0x3d0/0x3f6
   do_one_initcall+0xb8/0x1d0
   kernel_init_freeable+0x16d/0x1fa
   kernel_init+0x9/0xf0
   ret_from_fork+0x7c/0xb0
---[ end trace 1de82c4457c6a0f0 ]---
serio: i8042 KBD port at 0x60,0x64 irq 1
serio: i8042 AUX port at 0x60,0x64 irq 12
...

and it looks not entirely insane.

Is this worth looking at? Or is it something spurious? I might have
gotten the vectors wrong, and maybe the warning is not because the ISR
bit isn't set, but because I test the wrong bit.

 Linus
 arch/x86/include/asm/apic.h  | 22 ++
 arch/x86/kernel/apic/apic.c  | 10 +-
 arch/x86/kernel/apic/io_apic.c   |  5 +++--
 arch/x86/kernel/apic/vector.c|  8 +---
 arch/x86/kernel/cpu/mcheck/therm_throt.c |  4 ++--
 arch/x86/kernel/cpu/mcheck/threshold.c   |  4 ++--
 arch/x86/kernel/irq.c| 11 ++-
 arch/x86/kernel/irq_work.c   |  8 
 arch/x86/kernel/smp.c| 18 +-
 drivers/iommu/irq_remapping.c|  6 --
 10 files changed, 54 insertions(+), 42 deletions(-)

diff --git a/arch/x86/include/asm/apic.h b/arch/x86/include/asm/apic.h
index 92003f3c8a42..912a969fd747 100644
--- a/arch/x86/include/asm/apic.h
+++ b/arch/x86/include/asm/apic.h
@@ -387,7 +387,7 @@ static inline void apic_write(u32 reg, u32 val)
apic->write(reg, val);
 }
 
-static inline void apic_eoi(void)
+static inline void apic_eoi(int vector)
 {
apic->eoi_write(APIC_EOI, APIC_EOI_ACK);
 }
@@ -418,7 +418,7 @@ extern void __init apic_set_eoi_write(void (*eoi_write)(u32 
reg, u32 v));
 
 static inline u32 apic_read(u32 reg) { return 0; }
 static inline void apic_write(u32 reg, u32 val) { }
-static inline void apic_eoi(void) { }
+static inline void apic_eoi(int vector) { }
 static inline u64 apic_icr_read(void) { return 0; }
 static inline void apic_icr_write(u32 low, u32 high) { }
 static inline void apic_wait_icr_idle(void) { }
@@ -427,13 +427,19 @@ static inline void apic_set_eoi_write(void 
(*eoi_write)(u32 reg, u32 v)) {}
 
 #endif /* CONFIG_X86_LOCAL_APIC */
 
-static inline void ack_APIC_irq(void)
+static inline void ack_APIC_irq(int vector)
 {
+   /* Is the ISR bit actually set for this vector? */
+   if (vector >= 16) {
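
The diff is cut off here in the archive; what follows is a hypothetical 
reconstruction of the check being described, not Linus's actual patch. 
The ISR is a bank of eight 32-bit registers spaced 0x10 apart, one bit 
per vector:

  static inline void ack_APIC_irq(int vector)
  {
        /* Is the ISR bit actually set for this vector? */
        if (vector >= 16) {
                unsigned v = apic_read(APIC_ISR + ((vector & ~0x1f) >> 1));

                WARN_ON_ONCE(!(v & (1 << (vector & 0x1f))));
        }
        apic_eoi(vector);
  }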

Re: smp_call_function_single lockups

2015-02-19 Thread Linus Torvalds
On Thu, Feb 19, 2015 at 9:39 AM, Linus Torvalds torva...@linux-foundation.org wrote:
> On Thu, Feb 19, 2015 at 8:59 AM, Linus Torvalds torva...@linux-foundation.org wrote:
>>
>> Are there known errata for the x2apic?
>
> .. and in particular, do we still have to worry about the traditional
> local apic "if there are more than two pending interrupts per priority
> level, things get lost" problem?
>
> I forget the exact details. Hopefully somebody remembers.

I can't find it in the docs. I find the "two-entries per vector", but
not anything that is per priority level (group of 16 vectors). Maybe
that was the IO-APIC, in which case it's immaterial for IPI's.

However, having now mostly re-acquainted myself with the APIC details,
it strikes me that we do have some oddities here.

In particular, a few interrupt types are very special: NMI, SMI, INIT,
ExtINT, or SIPI are handled early in the interrupt acceptance logic,
and are sent directly to the CPU core, without going through the usual
intermediate IRR/ISR dance.

And why might this matter? It's important because it means that those
kinds of interrupts must *not* do the apic EOI that ack_APIC_irq()
does.

And we correctly don't do ack_APIC_irq() for NMI etc, but it strikes
me that ExtINT is odd and special.

I think we still use ExtINT for some odd cases. We used to have some
magic with the legacy timer interrupt, for example. And I think they
all go through the normal "do_IRQ()" logic regardless of whether they
are ExtINT or not.

Now, what happens if we send an EOI for an ExtINT interrupt? It
basically ends up being a spurious IPI. And I *think* that what
normally happens is absolutely nothing at all. But if in addition to
the ExtINT, there was a pending IPI (or other pending ISR bit set),
maybe we lose interrupts..

.. and it's entirely possible that I'm just completely full of shit.
Who is the poor bastard who has worked most with things like ExtINT,
and can educate me? I'm adding Ingo, hpa and Jiang Liu as primary
contacts..

  Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: smp_call_function_single lockups

2015-02-19 Thread Linus Torvalds
On Thu, Feb 19, 2015 at 8:59 AM, Linus Torvalds torva...@linux-foundation.org wrote:
>
> Are there known errata for the x2apic?

.. and in particular, do we still have to worry about the traditional
local apic "if there are more than two pending interrupts per priority
level, things get lost" problem?

I forget the exact details. Hopefully somebody remembers.

  Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: smp_call_function_single lockups

2015-02-19 Thread Rafael David Tinoco
I could only find an advisory (regarding sr-iov and irq remaps) from
HP to RHEL6.2 users stating that Gen8 firmware does not enable it by
default.

http://h20564.www2.hp.com/hpsc/doc/public/display?docId=emr_na-c03645796

"""
The interrupt remapping capability depends on x2apic enabled in the
BIOS and HP ProLiant Gen8 systems do not enable x2apic; therefore, the
following workaround is required for device assignment: Edit the
/etc/grub.conf and add intremap=no_x2apic_optout option to the kernel
command line options.
"""

Probably for backwards compatibility... not sure if there is an option
to enable/disable it in firmware (like DL390 seems to have).
I don't think so... but I was told by HP team that I should use x2apic
for >= Gen8.


On Thu, Feb 19, 2015 at 2:59 PM, Linus Torvalds torva...@linux-foundation.org wrote:
> On Thu, Feb 19, 2015 at 8:32 AM, Rafael David Tinoco ina...@ubuntu.com wrote:
>> Feb 19 08:21:28 derain kernel: [3.637682] Switched APIC routing to
>> cluster x2apic.
>
> Ok. That "cluster x2apic" mode is just about the nastiest mode when it
> comes to sending a single ipi. We do that insane dance where we
>
>  - turn single cpu number into cpumask
>  - copy the cpumask to a percpu temporary storage
>  - walk each cpu in the cpumask
>  - for each cpu, look up the cluster siblings
>  - for each cluster sibling that is also in the cpumask, look up the
> logical apic mask and add it to the actual ipi destination mask
>  - send an ipi to that final mask.
>
> which is just insane. It's complicated, it's fragile, and it's unnecessary.
>
> If we had a simple "send_IPI()" function, we could do this all with
> something much saner, and it would look something like
>
>   static void x2apic_send_IPI(int cpu, int vector)
>   {
> u32 dest = per_cpu(x86_cpu_to_logical_apicid, cpu);
> x2apic_wrmsr_fence();
> __x2apic_send_IPI_dest(dest, vector, APIC_DEST_LOGICAL);
>   }
>
> and then 'void native_send_call_func_single_ipi()' would just look like
>
>   void native_send_call_func_single_ipi(int cpu)
>   {
> apic->send_IPI(cpu, CALL_FUNCTION_SINGLE_VECTOR);
>   }
>
> but I might have missed something (and we might want to have a wrapper
> that says "if the apic doesn't have a 'send_IPI' function, use
> send_IPI_mask(cpumask_of(cpu), vector) instead").
>
> The fact that you need that no_x2apic_optout (which in turn means that
> your ACPI tables seem to say "don't use x2apic") also makes me worry.
>
> Are there known errata for the x2apic?
>
>  Linus


Re: smp_call_function_single lockups

2015-02-19 Thread Linus Torvalds
On Thu, Feb 19, 2015 at 8:32 AM, Rafael David Tinoco  wrote:
> Feb 19 08:21:28 derain kernel: [3.637682] Switched APIC routing to
> cluster x2apic.

Ok. That "cluster x2apic" mode is just about the nastiest mode when it
comes to sending a single ipi. We do that insane dance where we

 - turn single cpu number into cpumask
 - copy the cpumask to a percpu temporary storage
 - walk each cpu in the cpumask
 - for each cpu, look up the cluster siblings
 - for each cluster sibling that is also in the cpumask, look up the
logical apic mask and add it to the actual ipi destination mask
 - send an ipi to that final mask.

which is just insane. It's complicated, it's fragile, and it's unnecessary.

If we had a simple "send_IPI()" function, we could do this all with
something much saner, and it would look something like

  static void x2apic_send_IPI(int cpu, int vector)
  {
u32 dest = per_cpu(x86_cpu_to_logical_apicid, cpu);
x2apic_wrmsr_fence();
__x2apic_send_IPI_dest(dest, vector, APIC_DEST_LOGICAL);
  }

and then 'void native_send_call_func_single_ipi()' would just look like

  void native_send_call_func_single_ipi(int cpu)
  {
apic->send_IPI(cpu, CALL_FUNCTION_SINGLE_VECTOR);
  }

but I might have missed something (and we might want to have a wrapper
that says "if the apic doesn't have a 'send_IPI' function, use
send_IPI_mask(cpumask_of(cpu), vector) instead").

The fact that you need that no_x2apic_optout (which in turn means that
your ACPI tables seem to say "don't use x2apic") also makes me worry.

Are there known errata for the x2apic?

 Linus


Re: smp_call_function_single lockups

2015-02-19 Thread Rafael David Tinoco
For the host, we are using "intremap=no_x2apic_optout
intel_idle.max_cstate=0" on the cmdline. It looks like DL360/DL380
Gen8 firmware still asks to opt out of x2apic, but the HP engineering
team said that using x2apic on Gen8 would be OK (intel_idle causes these
servers to generate NMIs when idling, probably related to package
C-states and this server's dependency on ACPI tables for C-state
information).

Feb 19 08:21:28 derain kernel: [3.504676] Enabled IRQ remapping in
x2apic mode
Feb 19 08:21:28 derain kernel: [3.565451] Enabling x2apic
Feb 19 08:21:28 derain kernel: [3.602134] Enabled x2apic
Feb 19 08:21:28 derain kernel: [3.637682] Switched APIC routing to
cluster x2apic.

On Thu, Feb 19, 2015 at 2:26 PM, Linus Torvalds
 wrote:
> On Thu, Feb 19, 2015 at 7:42 AM, Rafael David Tinoco  
> wrote:
>>
>> Same environment as before: Nested KVM (2 vcpus) on top of Proliant
>> DL380G8 with acpi_idle and no x2apic optout.
>
> Btw, which apic model does that end up using? Does "no x2apic optout"
> mean you're using the x2apic?
>
> What does "dmesg | grep apic" report? Something like
>
> Switched APIC routing to cluster x2apic.
>
> or what?
>
> Side note to the apic guys: I think the "single CPU" case ends up
> being one of the most important ones, but the stupid APIC model
> doesn't allow that, so sending an IPI to a single CPU ends up being
> "send a mask with a single bit set", and then we have that horrible
> "for_each_cpu(cpu, mask)" crap.
>
> Would it make sense to perhaps add a "send_IPI_single()" function
> call, and then for the APIC models that always are based on masks, use
> a wrapper that just does that "cpumask_of(cpu)" thing..
>
> Hmm?
>
>Linus


Re: smp_call_function_single lockups

2015-02-19 Thread Linus Torvalds
On Thu, Feb 19, 2015 at 7:42 AM, Rafael David Tinoco  wrote:
>
> Same environment as before: Nested KVM (2 vcpus) on top of Proliant
> DL380G8 with acpi_idle and no x2apic optout.

Btw, which apic model does that end up using? Does "no x2apic optout"
mean you're using the x2apic?

What does "dmesg | grep apic" report? Something like

Switched APIC routing to cluster x2apic.

or what?

Side note to the apic guys: I think the "single CPU" case ends up
being one of the most important ones, but the stupid APIC model
doesn't allow that, so sending an IPI to a single CPU ends up being
"send a mask with a single bit set", and then we have that horrible
"for_each_cpu(cpu, mask)" crap.

Would it make sense to perhaps add a "send_IPI_single()" function
call, and then for the APIC models that always are based on masks, use
a wrapper that just does that "cpumask_of(cpu)" thing..

Hmm?

   Linus


Re: smp_call_function_single lockups

2015-02-19 Thread Linus Torvalds
On Thu, Feb 19, 2015 at 7:42 AM, Rafael David Tinoco  wrote:
>
> Just some quick feedback: we were able to reproduce the lockup with this
> proposed patch (3.19 + patch). Unfortunately we had problems with the
> core file and I have only the stack trace for now, but I think we are
> able to reproduce it again and provide more details (sorry for the
> delay... after a reboot it took some days for us to reproduce this
> again).
>
> It looks like RIP is still smp_call_function_single.

Hmm. Still just the stack trace for the CPU that is blocked (CPU0). If
you can get the core file to work and figure out where the other CPU
is, that would be good.

Linus


Re: smp_call_function_single lockups

2015-02-19 Thread Peter Zijlstra
On Thu, Feb 19, 2015 at 01:42:39PM -0200, Rafael David Tinoco wrote:
> Linus, Peter, Thomas
> 
> Just some quick feedback: we were able to reproduce the lockup with this
> proposed patch (3.19 + patch). Unfortunately we had problems with the
> core file and I have only the stack trace for now, but I think we are
> able to reproduce it again and provide more details (sorry for the
> delay... after a reboot it took some days for us to reproduce this
> again).
> 
> It looks like RIP is still smp_call_function_single.

So Linus' patch mostly fixes smp_call_function_single_async(), which is
not what you're using.

It would be very good to see traces of other CPUs; if for some reason
the target CPU doesn't get around to running your callback, then we'll
forever wait on it.

loaded_vmcs_clear() uses smp_call_function_single(.wait = 1), that
should work as before.
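
For reference, a simplified sketch of that wait path (modeled on
kernel/smp.c of this era; fields and flags trimmed for illustration):

  #define CSD_FLAG_LOCK   0x01

  struct call_single_data {
          void (*func)(void *info);
          void *info;
          unsigned int flags;
  };

  /* The caller of smp_call_function_single(.wait = 1) spins here until
   * the target CPU runs func and clears CSD_FLAG_LOCK. If the IPI is
   * lost, this loop never exits: the csd_lock_wait softlockup above. */
  static void csd_lock_wait(struct call_single_data *csd)
  {
          while (csd->flags & CSD_FLAG_LOCK)
                  cpu_relax();
  }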


Re: smp_call_function_single lockups

2015-02-19 Thread Rafael David Tinoco
Linus, Peter, Thomas

Just some quick feedback: we were able to reproduce the lockup with this
proposed patch (3.19 + patch). Unfortunately we had problems with the
core file and I have only the stack trace for now, but I think we are
able to reproduce it again and provide more details (sorry for the
delay... after a reboot it took some days for us to reproduce this
again).

It looks like RIP is still smp_call_function_single.

Same environment as before: Nested KVM (2 vcpus) on top of Proliant
DL380G8 with acpi_idle and no x2apic optout.

[47708.068013] CPU: 0 PID: 29869 Comm: qemu-system-x86 Tainted: G
  E   3.19.0-c7671cf-lp1413540v2 #31
[47708.068013] Hardware name: OpenStack Foundation OpenStack Nova,
BIOS Bochs 01/01/2011
[47708.068013] task: 88081b9beca0 ti: 88081a7a task.ti:
88081a7a
[47708.068013] RIP: 0010:[]  []
smp_call_function_single+0xca/0x120
[47708.068013] RSP: 0018:88081a7a3b38  EFLAGS: 0202
[47708.068013] RAX:  RBX: 0001 RCX: 0002
[47708.068013] RDX: 0001 RSI:  RDI: 0296
[47708.068013] RBP: 88081a7a3b78 R08: 81815168 R09: 880818192000
[47708.068013] R10: bdf6 R11: 0001bf90 R12: 0008810b66f8
[47708.068013] R13: 00fb R14: 0296 R15: 
[47708.068013] FS:  7fa143fff700() GS:88083fc0()
knlGS:
[47708.068013] CS:  0010 DS:  ES:  CR0: 80050033
[47708.068013] CR2: 7f5d76f5d050 CR3: 0008190cc000 CR4: 000426f0
[47708.068013] Stack:
[47708.068013]  88083fd151b8 0001 
c0589320
[47708.068013]  88081a547a80 0003 88081a543f80

[47708.068013]  88081a7a3b88 c0586097 88081a7a3bc8
c058aefe
[47708.068013] Call Trace:
[47708.068013]  [] ?
copy_shadow_to_vmcs12+0x110/0x110 [kvm_intel]
[47708.068013]  [] loaded_vmcs_clear+0x27/0x30 [kvm_intel]
[47708.068013]  [] vmx_vcpu_load+0x17e/0x1a0 [kvm_intel]
[47708.068013]  [] ? set_next_entity+0x9d/0xb0
[47708.068013]  [] kvm_arch_vcpu_load+0x33/0x1f0 [kvm]
[47708.068013]  [] kvm_sched_in+0x39/0x40 [kvm]
[47708.068013]  [] finish_task_switch+0x98/0x1a0
[47708.068013]  [] __schedule+0x33b/0x900
[47708.068013]  [] schedule+0x37/0x90
[47708.068013]  [] kvm_vcpu_block+0x6d/0xb0 [kvm]
[47708.068013]  [] ? prepare_to_wait_event+0x110/0x110
[47708.068013]  [] kvm_arch_vcpu_ioctl_run+0x10c/0x1290 [kvm]
[47708.068013]  [] kvm_vcpu_ioctl+0x2ce/0x670 [kvm]
[47708.068013]  [] ? new_sync_write+0x81/0xb0
[47708.068013]  [] do_vfs_ioctl+0x2f8/0x510
[47708.068013]  [] ? __sb_end_write+0x35/0x70
[47708.068013]  [] ? kvm_on_user_return+0x74/0x80 [kvm]
[47708.068013]  [] SyS_ioctl+0x81/0xa0
[47708.068013]  [] system_call_fastpath+0x16/0x1b
[47708.068013] Code: 30 5b 41 5c 5d c3 0f 1f 00 48 8d 75 d0 48 89 d1
89 df 4c 89 e2 e8 57 fe ff ff 0f b7 55 e8 83 e2 01 74 da 66 0f 1f 44
00 00 f3 90 <0f> b7 55 e8 83 e2 01 75 f5 eb c7 0f 1f 00 8b 05 ca e6 dd
00 85
[47708.068013] Kernel panic - not syncing: softlockup: hung tasks
[47708.068013] CPU: 0 PID: 29869 Comm: qemu-system-x86 Tainted: G
  EL  3.19.0-c7671cf-lp1413540v2 #31
[47708.068013] Hardware name: OpenStack Foundation OpenStack Nova,
BIOS Bochs 01/01/2011
[47708.068013]  88081b9beca0 88083fc03de8 817a6bf6

[47708.068013]  81ab30d4 88083fc03e68 817a1aec
0e92
[47708.068013]  0008 88083fc03e78 88083fc03e18
88083fc03e68
[47708.068013] Call Trace:
[47708.068013][] dump_stack+0x45/0x57
[47708.068013]  [] panic+0xc1/0x1f5
[47708.068013]  [] watchdog_timer_fn+0x1db/0x1f0
[47708.068013]  [] __run_hrtimer+0x77/0x1d0
[47708.068013]  [] ? watchdog+0x30/0x30
[47708.068013]  [] hrtimer_interrupt+0xf3/0x220
[47708.068013]  [] ?
copy_shadow_to_vmcs12+0x110/0x110 [kvm_intel]
[47708.068013]  [] local_apic_timer_interrupt+0x39/0x60
[47708.068013]  [] smp_apic_timer_interrupt+0x45/0x60
[47708.068013]  [] apic_timer_interrupt+0x6d/0x80
[47708.068013][] ?
smp_call_function_single+0xca/0x120
[47708.068013]  [] ? smp_call_function_single+0xb9/0x120
[47708.068013]  [] ?
copy_shadow_to_vmcs12+0x110/0x110 [kvm_intel]
[47708.068013]  [] loaded_vmcs_clear+0x27/0x30 [kvm_intel]
[47708.068013]  [] vmx_vcpu_load+0x17e/0x1a0 [kvm_intel]
[47708.068013]  [] ? set_next_entity+0x9d/0xb0
[47708.068013]  [] kvm_arch_vcpu_load+0x33/0x1f0 [kvm]
[47708.068013]  [] kvm_sched_in+0x39/0x40 [kvm]
[47708.068013]  [] finish_task_switch+0x98/0x1a0
[47708.068013]  [] __schedule+0x33b/0x900
[47708.068013]  [] schedule+0x37/0x90
[47708.068013]  [] kvm_vcpu_block+0x6d/0xb0 [kvm]
[47708.068013]  [] ? prepare_to_wait_event+0x110/0x110
[47708.068013]  [] kvm_arch_vcpu_ioctl_run+0x10c/0x1290 [kvm]
[47708.068013]  [] kvm_vcpu_ioctl+0x2ce/0x670 [kvm]
[47708.068013]  [] ? new_sync_write+0x81/0xb0
[47708.068013]  [] do_vfs_ioctl+0x2f8/0x510


Re: smp_call_function_single lockups

2015-02-19 Thread Linus Torvalds
On Thu, Feb 19, 2015 at 12:29 PM, Linus Torvalds
torva...@linux-foundation.org wrote:

> Now, what happens if we send an EOI for an ExtINT interrupt? It
> basically ends up being a spurious IPI. And I *think* that what
> normally happens is absolutely nothing at all. But if in addition to
> the ExtINT, there was a pending IPI (or other pending ISR bit set),
> maybe we lose interrupts..
>
> .. and it's entirely possible that I'm just completely full of shit.
> Who is the poor bastard who has worked most with things like ExtINT,
> and can educate me? I'm adding Ingo, hpa and Jiang Liu as primary
> contacts..

So quite frankly, trying to follow all the logic from do_IRQ() through
handle_irq() to the actual low-level handler, I just couldn't do it.

So instead, I wrote a patch to verify that the ISR bit is actually set
when we do ack_APIC_irq().

This was complicated by the fact that we don't actually pass in the
vector number at all to the acking, so 99% of the patch is just doing
that. In a couple of places we don't really have a good vector number,
so I said screw it: a negative value means that we won't check the ISR.

The attached patch is quite possibly garbage, but it gives an
interesting warning for me during i8042 probing, so who knows. Maybe
it actually shows a real problem - or maybe I just screwed up the
patch.

.. and maybe even if the patch is fine, it's actually never really a
problem to have spurious APIC ACK cycles. Maybe it cannot make
interrupts be ignored.

Anyway, the back-trace for the warning I get is during boot:

...
PNP: No PS/2 controller found. Probing ports directly.
[ cut here ]
WARNING: CPU: 0 PID: 1 at ./arch/x86/include/asm/apic.h:436
ir_ack_apic_edge+0x74/0x80()
Modules linked in:
CPU: 0 PID: 1 Comm: swapper/0 Not tainted
3.19.0-08857-g89d3fa45b4ad-dirty #2
Call Trace:
 <IRQ>
   dump_stack+0x45/0x57
   warn_slowpath_common+0x80/0xc0
   warn_slowpath_null+0x15/0x20
   ir_ack_apic_edge+0x74/0x80
   handle_edge_irq+0x51/0x110
   handle_irq+0x74/0x140
   do_IRQ+0x4a/0x140
   common_interrupt+0x6a/0x6a
 <EOI>
   ? _raw_spin_unlock_irqrestore+0x9/0x10
   __setup_irq+0x239/0x5a0
   request_threaded_irq+0xc2/0x180
   i8042_probe+0x5b8/0x680
   platform_drv_probe+0x2f/0xa0
   driver_probe_device+0x8b/0x3e0
   __driver_attach+0x93/0xa0
   bus_for_each_dev+0x63/0xa0
   driver_attach+0x19/0x20
   bus_add_driver+0x178/0x250
   driver_register+0x5f/0xf0
   __platform_driver_register+0x45/0x50
   __platform_driver_probe+0x26/0xa0
   __platform_create_bundle+0xad/0xe0
   i8042_init+0x3d0/0x3f6
   do_one_initcall+0xb8/0x1d0
   kernel_init_freeable+0x16d/0x1fa
   kernel_init+0x9/0xf0
   ret_from_fork+0x7c/0xb0
---[ end trace 1de82c4457c6a0f0 ]---
serio: i8042 KBD port at 0x60,0x64 irq 1
serio: i8042 AUX port at 0x60,0x64 irq 12
...

and it looks not entirely insane.

Is this worth looking at? Or is it something spurious? I might have
gotten the vectors wrong, and maybe the warning is not because the ISR
bit isn't set, but because I test the wrong bit.

 Linus
 arch/x86/include/asm/apic.h  | 22 ++
 arch/x86/kernel/apic/apic.c  | 10 +-
 arch/x86/kernel/apic/io_apic.c   |  5 +++--
 arch/x86/kernel/apic/vector.c|  8 +---
 arch/x86/kernel/cpu/mcheck/therm_throt.c |  4 ++--
 arch/x86/kernel/cpu/mcheck/threshold.c   |  4 ++--
 arch/x86/kernel/irq.c| 11 ++-
 arch/x86/kernel/irq_work.c   |  8 
 arch/x86/kernel/smp.c| 18 +-
 drivers/iommu/irq_remapping.c|  6 --
 10 files changed, 54 insertions(+), 42 deletions(-)

diff --git a/arch/x86/include/asm/apic.h b/arch/x86/include/asm/apic.h
index 92003f3c8a42..912a969fd747 100644
--- a/arch/x86/include/asm/apic.h
+++ b/arch/x86/include/asm/apic.h
@@ -387,7 +387,7 @@ static inline void apic_write(u32 reg, u32 val)
	apic->write(reg, val);
 }
 
-static inline void apic_eoi(void)
+static inline void apic_eoi(int vector)
 {
	apic->eoi_write(APIC_EOI, APIC_EOI_ACK);
 }
@@ -418,7 +418,7 @@ extern void __init apic_set_eoi_write(void (*eoi_write)(u32 reg, u32 v));
 
 static inline u32 apic_read(u32 reg) { return 0; }
 static inline void apic_write(u32 reg, u32 val) { }
-static inline void apic_eoi(void) { }
+static inline void apic_eoi(int vector) { }
 static inline u64 apic_icr_read(void) { return 0; }
 static inline void apic_icr_write(u32 low, u32 high) { }
 static inline void apic_wait_icr_idle(void) { }
@@ -427,13 +427,19 @@ static inline void apic_set_eoi_write(void (*eoi_write)(u32 reg, u32 v)) {}
 
 #endif /* CONFIG_X86_LOCAL_APIC */
 
-static inline void ack_APIC_irq(void)
+static inline void ack_APIC_irq(int vector)
 {
+   /* Is the ISR bit actually set for this vector? */
+   

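The patch is truncated here in the archive. A sketch of what the
missing ISR check could look like (hypothetical reconstruction, not
the actual patch), assuming the standard layout of eight 32-bit ISR
registers at 0x10 intervals, one bit per vector:

  static inline void ack_APIC_irq(int vector)
  {
          /* Is the ISR bit actually set for this vector? (a negative
           * vector means the caller has no vector number: skip it) */
          if (vector >= 0) {
                  unsigned v = apic_read(APIC_ISR + ((vector & ~0x1f) >> 1));
                  WARN_ON_ONCE(!(v & (1u << (vector & 0x1f))));
          }
          apic_eoi(vector);
  }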