Re: [Xen-devel] [stable-4.11] Heads-up: c719519 (x86/SMP: don't try to stop already stopped CPUs) causes 100% kexec/kdump failure
On 29.10.2019 12:29, Sergey Dyasli wrote:
> On 28/10/2019 17:30, Stonehouse, Robert wrote:
>> This is a heads-up as I have observed that the following commit (backported
>> onto an Amazon 4.11 tree) causes kexec (and hence kdump) to fail.
>>
>> commit c719519a4183d0630121f6abeba420f49dbc3229
>> Author: Jan Beulich
>> AuthorDate: Fri Jul 5 10:32:41 2019 +0200
>> Commit: Jan Beulich
>> CommitDate: Fri Jul 5 10:32:41 2019 +0200
>>
>> x86/SMP: don't try to stop already stopped CPUs
>>
>> In particular with an enabled IOMMU (but not really limited to this
>> case), trying to invoke fixup_irqs() after having already done
>> disable_IO_APIC() -> clear_IO_APIC() is a rather bad idea:
>>
>
> This was already fixed in staging by "x86/crash: fix kexec transition
> breakage":
>
> https://xenbits.xen.org/gitweb/?p=xen.git;a=commitdiff;h=f56813f3470c5b4987963c3c41e4fe16b95c5a3f
>
> Looks like it needs inclusion into 4.11 branch.

Hmm, in principle I did fish out this one and a few more for
backporting. But it looks like I've applied them to the 4.12
branch only. Thanks for noticing!

Jan

___
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel
Re: [Xen-devel] [stable-4.11] Heads-up: c719519 (x86/SMP: don't try to stop already stopped CPUs) causes 100% kexec/kdump failure
On 28/10/2019 17:30, Stonehouse, Robert wrote:
> This is a heads-up as I have observed that the following commit (backported
> onto an Amazon 4.11 tree) causes kexec (and hence kdump) to fail.
>
> commit c719519a4183d0630121f6abeba420f49dbc3229
> Author: Jan Beulich
> AuthorDate: Fri Jul 5 10:32:41 2019 +0200
> Commit: Jan Beulich
> CommitDate: Fri Jul 5 10:32:41 2019 +0200
>
> x86/SMP: don't try to stop already stopped CPUs
>
> In particular with an enabled IOMMU (but not really limited to this
> case), trying to invoke fixup_irqs() after having already done
> disable_IO_APIC() -> clear_IO_APIC() is a rather bad idea:
>

This was already fixed in staging by "x86/crash: fix kexec transition
breakage":

https://xenbits.xen.org/gitweb/?p=xen.git;a=commitdiff;h=f56813f3470c5b4987963c3c41e4fe16b95c5a3f

Looks like it needs inclusion into 4.11 branch.

--
Thanks,
Sergey
Re: [Xen-devel] [stable-4.11] Heads-up: c719519 (x86/SMP: don't try to stop already stopped CPUs) causes 100% kexec/kdump failure
Hi,

On Monday, 28 October 2019, 18:30:12 CET, Stonehouse, Robert wrote:
> This is a heads-up as I have observed that the following commit (backported
> onto an Amazon 4.11 tree) causes kexec (and hence kdump) to fail.
>
> commit c719519a4183d0630121f6abeba420f49dbc3229
> Author: Jan Beulich
> AuthorDate: Fri Jul 5 10:32:41 2019 +0200
> Commit: Jan Beulich
> CommitDate: Fri Jul 5 10:32:41 2019 +0200
>
> x86/SMP: don't try to stop already stopped CPUs
>
> In particular with an enabled IOMMU (but not really limited to this
> case), trying to invoke fixup_irqs() after having already done
> disable_IO_APIC() -> clear_IO_APIC() is a rather bad idea:
>
>
> The test was performing "echo c > /proc/sysrq-trigger" in dom0 and the loaded
> crash kernel fails to show any signs of starting. This is the end of the Xen
> console ...
>
> (XEN) Hardware Dom0 crashed: rebooting machine in 5 seconds.
> (XEN) Resetting with ACPI MEMORY or I/O RESET_REG.
>
>
> Expected behaviour is that the kdump kernel immediately loads and then
> performs the crash dump

I can confirm this behavior, but with Xen version 4.11.0_08-1 from SuSE
SLES12 SP4, which doesn't contain the said commit
c719519a4183d0630121f6abeba420f49dbc3229. However, I can see this only on
systems with newer Intel CPUs like "Intel(R) Xeon(R) Gold 6242 CPU".

>
> I'm sorry that I have not yet had time to check if this affects vanilla
> stable-4.11 or master. I just wanted to be certain that you don't have the
> same issue.
>
>
> Reverting one hunk via the following commit fixes things for me (this is an
> experiment and not at all a proposed fix)
>
> --- a/xen/arch/x86/smp.c
> +++ b/xen/arch/x86/smp.c
> @@ -303,15 +303,15 @@ static void stop_this_cpu(void *dummy)
> void smp_send_stop(void)
> {
> unsigned int cpu = smp_processor_id();
> +
> +local_irq_disable();
> +fixup_irqs(cpumask_of(cpu), 0);
> +local_irq_enable();
>
> if ( num_online_cpus() > 1 )
> {
> int timeout = 10;
>
> -local_irq_disable();
> -fixup_irqs(cpumask_of(cpu), 0);
> -local_irq_enable();
> -
> smp_call_function(stop_this_cpu, NULL, 0);
>
> /* Wait 10ms for all other CPUs to go offline. */
>
>
> Regards
> Rob
Re: [Xen-devel] [stable-4.11] Heads-up: c719519 (x86/SMP: don't try to stop already stopped CPUs) causes 100% kexec/kdump failure
On 28.10.2019 18:30, Stonehouse, Robert wrote:
> Reverting one hunk via the following commit fixes things for me (this is an
> experiment and not at all a proposed fix)
>
> --- a/xen/arch/x86/smp.c
> +++ b/xen/arch/x86/smp.c
> @@ -303,15 +303,15 @@ static void stop_this_cpu(void *dummy)
> void smp_send_stop(void)
> {
> unsigned int cpu = smp_processor_id();
> +
> +local_irq_disable();
> +fixup_irqs(cpumask_of(cpu), 0);
> +local_irq_enable();
>
> if ( num_online_cpus() > 1 )
> {
> int timeout = 10;
>
> -local_irq_disable();
> -fixup_irqs(cpumask_of(cpu), 0);
> -local_irq_enable();

Are you saying we get here the first time only when
num_online_cpus() already returns 1 (but there are actually
multiple CPUs, i.e. affinity changes are actually needed)?
If so - why?

Jan
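For readers following the question above, the difference between the two placements of the fixup_irqs() call can be sketched as a toy model. This is not Xen code: the struct and every toy_* name are hypothetical stand-ins, and the model only shows which calls run for a given online-CPU count. It does not establish whether the kexec path really reaches smp_send_stop() with only one CPU reported online.

```c
#include <assert.h>

/* Toy state standing in for the hypervisor; all names here are hypothetical. */
struct toy_state {
    unsigned int online_cpus;  /* what num_online_cpus() would report */
    unsigned int fixup_calls;  /* how often the fixup_irqs() stand-in ran */
    unsigned int stop_calls;   /* how often the smp_call_function() stand-in ran */
};

/* Stand-in for the local_irq_disable(); fixup_irqs(); local_irq_enable() triple. */
static void toy_fixup_irqs(struct toy_state *s)
{
    s->fixup_calls++;
}

/* Stand-in for stopping all other CPUs; afterwards only the caller is online. */
static void toy_stop_others(struct toy_state *s)
{
    s->stop_calls++;
    s->online_cpus = 1;
}

/* Placement shown by the '-' lines of the quoted hunk: fixup inside the check. */
void toy_send_stop_guarded(struct toy_state *s)
{
    assert(s->online_cpus >= 1);
    if (s->online_cpus > 1) {
        toy_fixup_irqs(s);
        toy_stop_others(s);
    }
}

/* Placement shown by the '+' lines of the quoted hunk: fixup runs unconditionally. */
void toy_send_stop_unguarded(struct toy_state *s)
{
    assert(s->online_cpus >= 1);
    toy_fixup_irqs(s);
    if (s->online_cpus > 1)
        toy_stop_others(s);
}
```

In this model the two variants differ only when the function is entered with online_cpus equal to 1: the guarded variant then skips the fixup step entirely, while the unguarded one still performs it once.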
[Xen-devel] [stable-4.11] Heads-up: c719519 (x86/SMP: don't try to stop already stopped CPUs) causes 100% kexec/kdump failure
This is a heads-up as I have observed that the following commit (backported
onto an Amazon 4.11 tree) causes kexec (and hence kdump) to fail.

commit c719519a4183d0630121f6abeba420f49dbc3229
Author: Jan Beulich
AuthorDate: Fri Jul 5 10:32:41 2019 +0200
Commit: Jan Beulich
CommitDate: Fri Jul 5 10:32:41 2019 +0200

x86/SMP: don't try to stop already stopped CPUs

In particular with an enabled IOMMU (but not really limited to this
case), trying to invoke fixup_irqs() after having already done
disable_IO_APIC() -> clear_IO_APIC() is a rather bad idea:


The test was performing "echo c > /proc/sysrq-trigger" in dom0 and the loaded
crash kernel fails to show any signs of starting. This is the end of the Xen
console ...

(XEN) Hardware Dom0 crashed: rebooting machine in 5 seconds.
(XEN) Resetting with ACPI MEMORY or I/O RESET_REG.


Expected behaviour is that the kdump kernel immediately loads and then
performs the crash dump

I'm sorry that I have not yet had time to check if this affects vanilla
stable-4.11 or master. I just wanted to be certain that you don't have the
same issue.


Reverting one hunk via the following commit fixes things for me (this is an
experiment and not at all a proposed fix)

--- a/xen/arch/x86/smp.c
+++ b/xen/arch/x86/smp.c
@@ -303,15 +303,15 @@ static void stop_this_cpu(void *dummy)
 void smp_send_stop(void)
 {
 unsigned int cpu = smp_processor_id();
+
+local_irq_disable();
+fixup_irqs(cpumask_of(cpu), 0);
+local_irq_enable();

 if ( num_online_cpus() > 1 )
 {
 int timeout = 10;

-local_irq_disable();
-fixup_irqs(cpumask_of(cpu), 0);
-local_irq_enable();
-
 smp_call_function(stop_this_cpu, NULL, 0);

 /* Wait 10ms for all other CPUs to go offline. */


Regards
Rob