Re: [Xen-devel] [stable-4.11] Heads-up: c719519 (x86/SMP: don't try to stop already stopped CPUs) causes 100% kexec/kdump failure
On 29.10.2019 12:29, Sergey Dyasli wrote:
> On 28/10/2019 17:30, Stonehouse, Robert wrote:
>> This is a heads-up as I have observed that the following commit (backported
>> onto an Amazon 4.11 tree) causes kexec (and hence kdump) to fail.
>>
>> commit c719519a4183d0630121f6abeba420f49dbc3229
>> Author: Jan Beulich
>> AuthorDate: Fri Jul 5 10:32:41 2019 +0200
>> Commit: Jan Beulich
>> CommitDate: Fri Jul 5 10:32:41 2019 +0200
>>
>> x86/SMP: don't try to stop already stopped CPUs
>>
>> In particular with an enabled IOMMU (but not really limited to this
>> case), trying to invoke fixup_irqs() after having already done
>> disable_IO_APIC() -> clear_IO_APIC() is a rather bad idea:
>>
>
> This was already fixed in staging by "x86/crash: fix kexec transition
> breakage":
>
> https://xenbits.xen.org/gitweb/?p=xen.git;a=commitdiff;h=f56813f3470c5b4987963c3c41e4fe16b95c5a3f
>
> Looks like it needs inclusion into 4.11 branch.

Hmm, in principle I did fish out this one and a few more for backporting.
But it looks like I've applied them to the 4.12 branch only. Thanks for
noticing!

Jan

___
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel
Re: [Xen-devel] [stable-4.11] Heads-up: c719519 (x86/SMP: don't try to stop already stopped CPUs) causes 100% kexec/kdump failure
On 28/10/2019 17:30, Stonehouse, Robert wrote:
> This is a heads-up as I have observed that the following commit (backported
> onto an Amazon 4.11 tree) causes kexec (and hence kdump) to fail.
>
> commit c719519a4183d0630121f6abeba420f49dbc3229
> Author: Jan Beulich
> AuthorDate: Fri Jul 5 10:32:41 2019 +0200
> Commit: Jan Beulich
> CommitDate: Fri Jul 5 10:32:41 2019 +0200
>
> x86/SMP: don't try to stop already stopped CPUs
>
> In particular with an enabled IOMMU (but not really limited to this
> case), trying to invoke fixup_irqs() after having already done
> disable_IO_APIC() -> clear_IO_APIC() is a rather bad idea:
>

This was already fixed in staging by "x86/crash: fix kexec transition
breakage":

https://xenbits.xen.org/gitweb/?p=xen.git;a=commitdiff;h=f56813f3470c5b4987963c3c41e4fe16b95c5a3f

Looks like it needs inclusion into 4.11 branch.

--
Thanks,
Sergey
Re: [Xen-devel] [stable-4.11] Heads-up: c719519 (x86/SMP: don't try to stop already stopped CPUs) causes 100% kexec/kdump failure
Hi,

On Monday, 28 October 2019, 18:30:12 CET, Stonehouse, Robert wrote:
> This is a heads-up as I have observed that the following commit (backported
> onto an Amazon 4.11 tree) causes kexec (and hence kdump) to fail.
>
> commit c719519a4183d0630121f6abeba420f49dbc3229
> Author: Jan Beulich
> AuthorDate: Fri Jul 5 10:32:41 2019 +0200
> Commit: Jan Beulich
> CommitDate: Fri Jul 5 10:32:41 2019 +0200
>
> x86/SMP: don't try to stop already stopped CPUs
>
> In particular with an enabled IOMMU (but not really limited to this
> case), trying to invoke fixup_irqs() after having already done
> disable_IO_APIC() -> clear_IO_APIC() is a rather bad idea:
>
>
> The test was performing "echo c > /proc/sysrq-trigger" in dom0 and the loaded
> crash kernel fails to show any signs of starting. This is the end of the Xen
> console ...
>
> (XEN) Hardware Dom0 crashed: rebooting machine in 5 seconds.
> (XEN) Resetting with ACPI MEMORY or I/O RESET_REG.
>
>
> Expected behaviour is that the kdump kernel immediately loads and then
> performs the crash dump

I can confirm this behavior, but with the Xen version (4.11.0_08-1) from
SuSE SLES12 SP4, which doesn't contain the said commit
c719519a4183d0630121f6abeba420f49dbc3229. But I can see this only on systems
with newer Intel CPUs like "Intel(R) Xeon(R) Gold 6242 CPU".

>
> I'm sorry that I have not yet had time to check if this affects vanilla
> stable-4.11 or master. I just wanted to be certain that you don't have the
> same issue.
>
>
> Reverting one hunk via the following commit fixes things for me (this is an
> experiment and not at all a proposed fix)
>
> --- a/xen/arch/x86/smp.c
> +++ b/xen/arch/x86/smp.c
> @@ -303,15 +303,15 @@ static void stop_this_cpu(void *dummy)
>  void smp_send_stop(void)
>  {
>      unsigned int cpu = smp_processor_id();
> +
> +    local_irq_disable();
> +    fixup_irqs(cpumask_of(cpu), 0);
> +    local_irq_enable();
>
>      if ( num_online_cpus() > 1 )
>      {
>          int timeout = 10;
>
> -        local_irq_disable();
> -        fixup_irqs(cpumask_of(cpu), 0);
> -        local_irq_enable();
> -
>          smp_call_function(stop_this_cpu, NULL, 0);
>
>          /* Wait 10ms for all other CPUs to go offline. */
>
>
> Regards
> Rob
Re: [Xen-devel] [stable-4.11] Heads-up: c719519 (x86/SMP: don't try to stop already stopped CPUs) causes 100% kexec/kdump failure
On 28.10.2019 18:30, Stonehouse, Robert wrote:
> Reverting one hunk via the following commit fixes things for me (this is an
> experiment and not at all a proposed fix)
>
> --- a/xen/arch/x86/smp.c
> +++ b/xen/arch/x86/smp.c
> @@ -303,15 +303,15 @@ static void stop_this_cpu(void *dummy)
>  void smp_send_stop(void)
>  {
>      unsigned int cpu = smp_processor_id();
> +
> +    local_irq_disable();
> +    fixup_irqs(cpumask_of(cpu), 0);
> +    local_irq_enable();
>
>      if ( num_online_cpus() > 1 )
>      {
>          int timeout = 10;
>
> -        local_irq_disable();
> -        fixup_irqs(cpumask_of(cpu), 0);
> -        local_irq_enable();

Are you saying we get here the first time only when num_online_cpus()
already returns 1 (but there are actually multiple CPUs, i.e. affinity
changes are actually needed)? If so - why?

Jan
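
[Editorial illustration, not part of the original messages.] The question above turns on where the fixup_irqs() call sits relative to the num_online_cpus() check. The following is a minimal standalone model of the two orderings in the quoted hunk, not Xen code: struct stop_model and both model_* functions are hypothetical names, and the assumption that all other CPUs are offline once the stop request has been sent is a simplification. It only counts how often the modelled fixup step would run for a given reported CPU count.

```c
#include <assert.h>

/* Hypothetical model state; none of these names exist in Xen. */
struct stop_model {
    unsigned int online_cpus;  /* value num_online_cpus() would report */
    unsigned int fixup_calls;  /* times the modelled fixup_irqs() step ran */
    unsigned int stop_calls;   /* times the stop request was sent */
};

/*
 * Ordering shown by the '-' lines of the quoted hunk: the fixup step
 * sits inside the "more than one CPU online" check.
 */
static void model_send_stop_guarded(struct stop_model *m)
{
    assert(m->online_cpus >= 1);
    if (m->online_cpus > 1) {
        m->fixup_calls++;
        m->stop_calls++;
        m->online_cpus = 1;  /* assumption: all other CPUs go offline */
    }
}

/*
 * Ordering shown by the '+' lines (the experimental revert): the fixup
 * step runs regardless of the reported CPU count.
 */
static void model_send_stop_unconditional(struct stop_model *m)
{
    assert(m->online_cpus >= 1);
    m->fixup_calls++;
    if (m->online_cpus > 1) {
        m->stop_calls++;
        m->online_cpus = 1;  /* assumption: all other CPUs go offline */
    }
}
```

In this model the two orderings differ only when the function is entered with a reported count of 1: the guarded variant then skips the fixup step entirely, while the unconditional variant still performs it. Whether the kexec path actually reaches smp_send_stop() in that state is exactly what the question asks, and the model does not answer it.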