Re: [PATCHv3 1/2] cpu/hotplug: Keep cpu hotplug disabled until the rebooting cpu is stable

2022-05-09 Thread Thomas Gleixner
On Mon, May 09 2022 at 12:13, Pingfan Liu wrote:
> The following code chunk repeats in both
> migrate_to_reboot_cpu() and smp_shutdown_nonboot_cpus():
>
>   if (!cpu_online(primary_cpu))
>   primary_cpu = cpumask_first(cpu_online_mask);
>
> This is due to a breakage like the following:

I don't see what's broken here.

> kernel_kexec()
>migrate_to_reboot_cpu();
>cpu_hotplug_enable();
> ---> comes a cpu_down(this_cpu) on other cpu
>machine_shutdown();
>  smp_shutdown_nonboot_cpus(); // re-check "if (!cpu_online(primary_cpu))" 
> to protect against the former breakin
>
> Although the kexec-reboot task can get through a cpu_down() on its cpu,
> this code looks a little confusing.

Confusing != broken.

> +/* primary_cpu keeps unchanged after migrate_to_reboot_cpu() */

This comment makes no sense.

>  void smp_shutdown_nonboot_cpus(unsigned int primary_cpu)
>  {
>   unsigned int cpu;
>   int error;
>  
> + /*
> +  * Block other cpu hotplug event, so primary_cpu is always online if
> +  * it is not touched by us
> +  */
>   cpu_maps_update_begin();
> -
>   /*
> -  * Make certain the cpu I'm about to reboot on is online.
> -  *
> -  * This is inline to what migrate_to_reboot_cpu() already do.
> +  * migrate_to_reboot_cpu() disables CPU hotplug assuming that
> +  * no further code needs to use CPU hotplug (which is true in
> +  * the reboot case). However, the kexec path depends on using
> +  * CPU hotplug again; so re-enable it here.

You want to reduce confusion, but in reality this is even more confusing
than before.

>*/
> - if (!cpu_online(primary_cpu))
> - primary_cpu = cpumask_first(cpu_online_mask);
> + __cpu_hotplug_enable();

How is this decrement solving anything? At the end of this function, the
counter is incremented again. So what's the point of this exercise?

>   for_each_online_cpu(cpu) {
>   if (cpu == primary_cpu)
> diff --git a/kernel/kexec_core.c b/kernel/kexec_core.c
> index 68480f731192..db4fa6b174e3 100644
> --- a/kernel/kexec_core.c
> +++ b/kernel/kexec_core.c
> @@ -1168,14 +1168,12 @@ int kernel_kexec(void)
>   kexec_in_progress = true;
>   kernel_restart_prepare("kexec reboot");
>   migrate_to_reboot_cpu();
> -
>   /*
> -  * migrate_to_reboot_cpu() disables CPU hotplug assuming that
> -  * no further code needs to use CPU hotplug (which is true in
> -  * the reboot case). However, the kexec path depends on using
> -  * CPU hotplug again; so re-enable it here.
> +  * migrate_to_reboot_cpu() disables CPU hotplug. If an arch
> +  * relies on the cpu teardown to achieve reboot, it needs to
> +  * re-enable CPU hotplug there.

What does that for arch/powerpc/kernel/kexec_machine64.c now?

Nothing, as far as I can tell. Which means you basically reverted
011e4b02f1da ("powerpc, kexec: Fix "Processor X is stuck" issue during
kexec from ST mode") unless I'm completely confused.

>*/
> - cpu_hotplug_enable();

This is tinkering at best. Can we please sit down and rethink this whole
machinery instead of applying random duct tape to it?

Thanks,

tglx

___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: [PATCHv3 1/2] cpu/hotplug: Keep cpu hotplug disabled until the rebooting cpu is stable

2022-05-09 Thread Thomas Gleixner
On Mon, May 09 2022 at 12:55, Thomas Gleixner wrote:
> On Mon, May 09 2022 at 12:13, Pingfan Liu wrote:
>> -cpu_hotplug_enable();
>
> This is tinkering at best. Can we please sit down and rethink this whole
> machinery instead of applying random duct tape to it?

That said, I still have not figured out which real world problem you are
actually trying to solve.

Thanks,

tglx

___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: [PATCH 16/30] drivers/hv/vmbus, video/hyperv_fb: Untangle and refactor Hyper-V panic notifiers

2022-05-09 Thread Guilherme G. Piccoli
On 03/05/2022 15:13, Michael Kelley (LINUX) wrote:
> [...]
>> (a) We could forget about this change, and always do the clean-up here,
>> not relying in machine_crash_shutdown().
>> Pro: really simple, behaves the same as it is doing currently.
>> Con: less elegant/concise, doesn't allow arm64 customization.
>>
>> (b) Add a way to allow ARM64 customization of shutdown crash handler.
>> Pro: matches x86, more customizable, improves arm64 arch code.
>> Con: A tad more complex.
>>
>> Also, a question that came-up: if ARM64 has no way of calling special
>> crash shutdown handler, how can you execute hv_stimer_cleanup() and
>> hv_synic_disable_regs() there? Or are they not required in ARM64?
>>
> 
> My suggestion is to do (a) for now.  I suspect (b) could be a more
> extended discussion and I wouldn't want your patch set to get held
> up on that discussion.  I don't know what the sense of the ARM64
> maintainers would be toward (b).  They have tried to avoid picking
> up code warts like have accumulated on the x86/x64 side over the
> years, and I agree with that effort.  But as more and varied
> hypervisors become available for ARM64, it seems like a framework
> for supporting a custom shutdown handler may become necessary.
> But that could take a little time.
> 
> You are right about hv_stimer_cleanup() and hv_synic_disable_regs().
> We are not running these when a panic occurs on ARM64, and we
> should be, though the risk is small.   We will pursue (b) and add
> these additional cleanups as part of that.  But again, I would suggest
> doing (a) for now, and we will switch back to your solution once
> (b) is in place.
> 

Thanks again Michael, I'll stick with (a) for now. I'll check with ARM64
community about that, and I might even try to implement something in
parallel (if you are not already working on that - lemme know please),
so we don't get stuck here. As you said, I feel that this is more and
more relevant as the number of panic/crash/kexec scenarios tend to
increase in ARM64.


>> [...]
>> Some ideas of what we can do here:
>>
>> I) we could change the framebuffer notifier to rely on trylocks, instead
>> of risking a lockup scenario, and with that, we can execute it before
>> the vmbus disconnect in the hypervisor list;
> 
> I think we have to do this approach for now.
> 
>>
>> II) we ignore the hypervisor notifier in case of kdump _by default_, and
>> if the users don't want that, they can always set the panic notifier
>> level to 4 and run all notifiers prior to kdump; would that be terrible
>> you think? Kdump users might don't care about the framebuffer...
>>
>> III) we go with approach (b) above and refactor arm64 code to allow the
>> custom crash handler on kdump time, then [with point (I) above] the
>> logic proposed in this series is still valid - seems more and more the
>> most correct/complete solution.
> 
> But even when/if we get approach (b) implemented, having the
> framebuffer notifier on the pre_reboot list is still too late with the
> default of panic_notifier_level = 2.  The kdump path will reset the
> VMbus connection and then the framebuffer notifier won't work.
> 

OK, perfect! I'll work something along these lines in V2, allowing the
FB notifier to always run in the hypervisor list before the vmbus unload
mechanism.


>> [...]
 +static int hv_panic_vmbus_unload(struct notifier_block *nb, unsigned long 
 val,
  void *args)
 +{
 +  if (!kexec_crash_loaded())
>>>
>>> I'm not clear on the purpose of this condition.  I think it means
>>> we will skip the vmbus_initiate_unload() if a panic occurs in the
>>> kdump kernel.  Is there a reason a panic in the kdump kernel
>>> should be treated differently?  Or am I misunderstanding?
>>
>> This is really related with the point discussed in the top of this
>> response - I assumed both ARM64/x86_64 would behave the same and
>> disconnect the vmbus through the custom crash handler when kdump is set,
>> so worth skipping it here in the notifier. But that's not true for ARM64
>> as you pointed, so this guard against kexec is really part of the
>> decision/discussion on what to do with ARM64 heh
> 
> But note that vmbus_initiate_unload() already has a guard built-in.
> If the intent of this test is just as a guard against running twice,
> then it isn't needed.

Since we're going to avoid relying in the custom crash_shutdown(), due
to the lack of ARM64 support for now, this check will be removed in V2.

Its purpose was to skip the notifier *proactively* in case kexec is set,
given that...once kexec happens, the custom crash_shutdown() would run
the same function (wrong assumption for ARM64, my bad).

Postponing that slightly would maybe gain us some time while the
hypervisor finish its work, so we'd delay less in the vmbus unload path
- that was the rationale behind this check.


Cheers!

___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/m

Re: [PATCH 10/30] alpha: Clean-up the panic notifier code

2022-05-09 Thread Guilherme G. Piccoli
On 27/04/2022 19:49, Guilherme G. Piccoli wrote:
> The alpha panic notifier has some code issues, not following
> the conventions of other notifiers. Also, it might halt the
> machine but still it is set to run as early as possible, which
> doesn't seem to be a good idea.
> 
> This patch cleans the code, and set the notifier to run as the
> latest, following the same approach other architectures are doing.
> Also, we remove the unnecessary include of a header already
> included indirectly.
> 
> Cc: Ivan Kokshaysky 
> Cc: Matt Turner 
> Cc: Richard Henderson 
> Signed-off-by: Guilherme G. Piccoli 
> ---
>  arch/alpha/kernel/setup.c | 36 +++-
>  1 file changed, 15 insertions(+), 21 deletions(-)
> 
> diff --git a/arch/alpha/kernel/setup.c b/arch/alpha/kernel/setup.c
> index b4fbbba30aa2..d88bdf852753 100644
> --- a/arch/alpha/kernel/setup.c
> +++ b/arch/alpha/kernel/setup.c
> @@ -41,19 +41,11 @@
>  #include 
>  #include 
>  #endif
> -#include 
>  #include 
>  #include 
>  #include 
>  #include 
>  
> -static int alpha_panic_event(struct notifier_block *, unsigned long, void *);
> -static struct notifier_block alpha_panic_block = {
> - alpha_panic_event,
> -NULL,
> -INT_MAX /* try to do it first */
> -};
> -
>  #include 
>  #include 
>  #include 
> @@ -435,6 +427,21 @@ static const struct sysrq_key_op srm_sysrq_reboot_op = {
>  };
>  #endif
>  
> +static int alpha_panic_event(struct notifier_block *this,
> +  unsigned long event, void *ptr)
> +{
> + /* If we are using SRM and serial console, just hard halt here. */
> + if (alpha_using_srm && srmcons_output)
> + __halt();
> +
> + return NOTIFY_DONE;
> +}
> +
> +static struct notifier_block alpha_panic_block = {
> + .notifier_call = alpha_panic_event,
> + .priority = INT_MIN, /* may not return, do it last */
> +};
> +
>  void __init
>  setup_arch(char **cmdline_p)
>  {
> @@ -1427,19 +1434,6 @@ const struct seq_operations cpuinfo_op = {
>   .show   = show_cpuinfo,
>  };
>  
> -
> -static int
> -alpha_panic_event(struct notifier_block *this, unsigned long event, void 
> *ptr)
> -{
> -#if 1
> - /* FIXME FIXME FIXME */
> - /* If we are using SRM and serial console, just hard halt here. */
> - if (alpha_using_srm && srmcons_output)
> - __halt();
> -#endif
> -return NOTIFY_DONE;
> -}
> -
>  static __init int add_pcspkr(void)
>  {
>   struct platform_device *pd;


Hi folks, I'm updating Richard's email and re-sending the V1, any
reviews are greatly appreciated!

Cheers,


Guilherme

___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: [PATCH 04/30] firmware: google: Convert regular spinlock into trylock on panic path

2022-05-09 Thread Guilherme G. Piccoli
On 03/05/2022 15:03, Evan Green wrote:
> [...]
> gsmi_shutdown_reason() is a common function called in other scenarios
> as well, like reboot and thermal trip, where it may still make sense
> to wait to acquire a spinlock. Maybe we should add a parameter to
> gsmi_shutdown_reason() so that you can get your change on panic, but
> we don't convert other callbacks into try-fail scenarios causing us to
> miss logs.
> 

Hi Evan, thanks for your feedback, much appreciated!
What I've done in other cases like this was to have a helper checking
the spinlock in the panic notifier - if we can acquire that, go ahead
but if not, bail out. For a proper example of an implementation, check
patch 13 of the series:
https://lore.kernel.org/lkml/20220427224924.592546-14-gpicc...@igalia.com/ .

Do you agree with that, or prefer really a parameter in
gsmi_shutdown_reason() ? I'll follow your choice =)


> Though thinking more about it, is this really a Good Change (TM)? The
> spinlock itself already disables interrupts, meaning the only case
> where this change makes a difference is if the panic happens from
> within the function that grabbed the spinlock (in which case the
> callback is also likely to panic), or in an NMI that panics within
> that window. The downside of this change is that if one core was
> politely working through an event with the lock held, and another core
> panics, we now might lose the panic log, even though it probably would
> have gone through fine assuming the other core has a chance to
> continue.

My feeling is that this is a good change, indeed - a lot of places are
getting changed like this, in this series.

Reasoning: the problem with your example is that, by default, secondary
CPUs are disabled in the panic path, through an IPI mechanism. IPIs take
precedence and interrupt the work in these CPUs, effectively
interrupting the "polite work" with the lock held heh

Then, such CPU is put to sleep and we finally reach the panic notifier
hereby discussed, in the main CPU. If the other CPU was shut-off *with
the lock held*, it's never finishing such work, so the lock is never to
be released. Conclusion: the spinlock can't be acquired, hence we broke
the machine (which is already broken, given it's panic) in the path of
this notifier.
This should be really rare, but..possible. So I think we should protect
against this scenario.

We can grab others' feedback if you prefer, and of course you have the
rights to refuse this change in the gsmi code, but from my
point-of-view, I don't see any advantage in just assume the risk,
specially since the change is very very simple.

Cheers,


Guilherme

___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: [PATCH 04/30] firmware: google: Convert regular spinlock into trylock on panic path

2022-05-09 Thread Evan Green
On Wed, Apr 27, 2022 at 3:51 PM Guilherme G. Piccoli
 wrote:
>
> Currently the gsmi driver registers a panic notifier as well as
> reboot and die notifiers. The callbacks registered are called in
> atomic and very limited context - for instance, panic disables
> preemption, local IRQs and all other CPUs that aren't running the
> current panic function.
>
> With that said, taking a spinlock in this scenario is a
> dangerous invitation for a deadlock scenario. So, we fix
> that in this commit by changing the regular spinlock with
> a trylock, which is a safer approach.
>
> Fixes: 74c5b31c6618 ("driver: Google EFI SMI")
> Cc: Ard Biesheuvel 
> Cc: David Gow 
> Cc: Evan Green 
> Cc: Julius Werner 
> Signed-off-by: Guilherme G. Piccoli 
> ---
>  drivers/firmware/google/gsmi.c | 6 +-
>  1 file changed, 5 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/firmware/google/gsmi.c b/drivers/firmware/google/gsmi.c
> index adaa492c3d2d..b01ed02e4a87 100644
> --- a/drivers/firmware/google/gsmi.c
> +++ b/drivers/firmware/google/gsmi.c
> @@ -629,7 +629,10 @@ static int gsmi_shutdown_reason(int reason)
> if (saved_reason & (1 << reason))
> return 0;
>
> -   spin_lock_irqsave(&gsmi_dev.lock, flags);
> +   if (!spin_trylock_irqsave(&gsmi_dev.lock, flags)) {
> +   rc = -EBUSY;
> +   goto out;
> +   }

gsmi_shutdown_reason() is a common function called in other scenarios
as well, like reboot and thermal trip, where it may still make sense
to wait to acquire a spinlock. Maybe we should add a parameter to
gsmi_shutdown_reason() so that you can get your change on panic, but
we don't convert other callbacks into try-fail scenarios causing us to
miss logs.

Though thinking more about it, is this really a Good Change (TM)? The
spinlock itself already disables interrupts, meaning the only case
where this change makes a difference is if the panic happens from
within the function that grabbed the spinlock (in which case the
callback is also likely to panic), or in an NMI that panics within
that window. The downside of this change is that if one core was
politely working through an event with the lock held, and another core
panics, we now might lose the panic log, even though it probably would
have gone through fine assuming the other core has a chance to
continue.

-Evan

___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: [PATCH 04/30] firmware: google: Convert regular spinlock into trylock on panic path

2022-05-09 Thread Guilherme G. Piccoli
On 03/05/2022 18:56, Evan Green wrote:
> Hi Guilherme,
> [...] 
>> Do you agree with that, or prefer really a parameter in
>> gsmi_shutdown_reason() ? I'll follow your choice =)
> 
> I'm fine with either, thanks for the link. Mostly I want to make sure
> other paths to gsmi_shutdown_reason() aren't also converted to a try.

Hi Evan, thanks for the prompt response! So, I'll proceed like I did in
s390, for consistency.

> [...]
>> Reasoning: the problem with your example is that, by default, secondary
>> CPUs are disabled in the panic path, through an IPI mechanism. IPIs take
>> precedence and interrupt the work in these CPUs, effectively
>> interrupting the "polite work" with the lock held heh
> 
> The IPI can only interrupt a CPU with irqs disabled if the IPI is an
> NMI. I haven't looked before to see if we use NMI IPIs to corral the
> other CPUs on panic. On x86, I grepped my way down to
> native_stop_other_cpus(), which looks like it does a normal IPI, waits
> 1 second, then does an NMI IPI. So, if a secondary CPU has the lock
> held, on x86 it has roughly 1s to finish what it's doing and re-enable
> interrupts before smp_send_stop() brings the NMI hammer down. I think
> this should be more than enough time for the secondary CPU to get out
> and release the lock.
> 
> So then it makes sense to me that you're fixing cases where we
> panicked with the lock held, or hung with the lock held. Given the 1
> second grace period x86 gives us, I'm on board, as that helps mitigate
> the risk that we bailed out early with the try and should have spun a
> bit longer instead. Thanks.
> 
> -Evan

Well, in the old path without "crash_kexec_post_notifiers", we indeed
end-up relying on native_stop_other_cpus() for x86 as you said, and the
"1s rule" makes sense. But after this series (or even before, if the
kernel parameter "crash_kexec_post_notifiers" was used) the function
used to stop CPUs in the panic path is crash_smp_send_stop(), and the
call chain is like:

Main CPU:
crash_smp_send_stop()
--kdump_nmi_shootdown_cpus()
nmi_shootdown_cpus()

Then, in each CPU (except the main one, running panic() path),
we execute kdump_nmi_callback() in NMI context.

So, we seem to indeed interrupt any context (even with IRQs disabled),
increasing the likelihood of the potential lockups due to stopped CPUs
holding the locks heheh

Thanks again for the good discussion, let me know if anything I'm saying
doesn't make sense - this crash path is a bit convoluted, specially in
x86, I might have understood something wrongly =)
Cheers,


Guilherme

___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: [PATCH 24/30] panic: Refactor the panic path

2022-05-09 Thread Guilherme G. Piccoli
On 03/05/2022 14:31, Michael Kelley (LINUX) wrote:
> [...]
> 
> To me, it's a weak correlation between having a kmsg dumper, and
> wanting or not wanting the info level output to come before kdump.
> Hyper-V is one of only a few places that register a kmsg dumper, so most
> Linux instances outside of Hyper-V guest (and PowerPC systems?) will have
> the info level output after kdump.  It seems like anyone who cared strongly
> about the info level output would set the panic_notifier_level to 1 or to 3
> so that the result is more deterministic.  But that's just my opinion, and
> it's probably an opinion that is not as well informed on the topic as some
> others in the discussion. So keeping things as in your patch set is not a
> show-stopper for me.
> 
> However, I would request a clarification in the documentation.   The
> panic_notifier_level affects not only the hypervisor, informational,
> and pre_reboot lists, but it also affects panic_print_sys_info() and
> kmsg_dump().  Specifically, at level 1, panic_print_sys_info() and
> kmsg_dump() will not be run before kdump.  At level 3, they will
> always be run before kdump.  Your documentation above mentions
> "informational lists" (plural), which I take to vaguely include
> kmsg_dump() and panic_print_sys_info(), but being explicit about
> the effect would be better.
> 
> Michael

Thanks again Michael, to express your points and concerns - great idea
of documentation improvement here, I'll do that for V2, for sure.

The idea of "defaulting" to skip the info list on kdump (if no
kmsg_dump() is set) is again a mechanism that aims at accommodating all
users and concerns of antagonistic goals, kdump vs notifier lists.

Before this patch set, by default no notifier executed before kdump. So,
the "pendulum"  was strongly on kdump side, and clearly this was a
sub-optimal decision - proof of that is that both Hyper-V / PowerPC code
forcibly set the "crash_kexec_post_notifiers". The goal here is to have
a more lightweight list that by default runs before kdump, a secondary
list that only runs before kdump if there's usage for that (either user
sets that or kmsg_dumper set is considered a valid user), and the
remaining notifiers run by default only after kdump, all of that very
customizable through the levels idea.

Now, one thing we could do to improve consistency for the hyper-v case:
having a kmsg_dump_once() helper, and *for Hyper-V only*, call it on the
hypervisor list, within the info notifier (that would be moved to
hypervisor list, ofc).
Let's wait for more feedback on that, just throwing some ideas in order
we can have everyone happy with the end-result!

Cheers,


Guilherme

___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: [PATCH 07/30] mips: ip22: Reword PANICED to PANICKED and remove useless header

2022-05-09 Thread Thomas Bogendoerfer
On Wed, Apr 27, 2022 at 07:49:01PM -0300, Guilherme G. Piccoli wrote:
> Many other place in the kernel prefer the latter, so let's keep
> it consistent in MIPS code as well. Also, removes a useless header.
> 
> Cc: Thomas Bogendoerfer 
> Signed-off-by: Guilherme G. Piccoli 
> ---
>  arch/mips/sgi-ip22/ip22-reset.c | 11 +--
>  1 file changed, 5 insertions(+), 6 deletions(-)

applied to mips-next.

Thomas.

-- 
Crap can work. Given enough thrust pigs will fly, but it's not necessarily a
good idea.[ RFC1925, 2.3 ]

___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: [PATCH 04/30] firmware: google: Convert regular spinlock into trylock on panic path

2022-05-09 Thread Evan Green
Hi Guilherme,

On Tue, May 3, 2022 at 12:12 PM Guilherme G. Piccoli
 wrote:
>
> On 03/05/2022 15:03, Evan Green wrote:
> > [...]
> > gsmi_shutdown_reason() is a common function called in other scenarios
> > as well, like reboot and thermal trip, where it may still make sense
> > to wait to acquire a spinlock. Maybe we should add a parameter to
> > gsmi_shutdown_reason() so that you can get your change on panic, but
> > we don't convert other callbacks into try-fail scenarios causing us to
> > miss logs.
> >
>
> Hi Evan, thanks for your feedback, much appreciated!
> What I've done in other cases like this was to have a helper checking
> the spinlock in the panic notifier - if we can acquire that, go ahead
> but if not, bail out. For a proper example of an implementation, check
> patch 13 of the series:
> https://lore.kernel.org/lkml/20220427224924.592546-14-gpicc...@igalia.com/ .
>
> Do you agree with that, or prefer really a parameter in
> gsmi_shutdown_reason() ? I'll follow your choice =)

I'm fine with either, thanks for the link. Mostly I want to make sure
other paths to gsmi_shutdown_reason() aren't also converted to a try.

>
>
> > Though thinking more about it, is this really a Good Change (TM)? The
> > spinlock itself already disables interrupts, meaning the only case
> > where this change makes a difference is if the panic happens from
> > within the function that grabbed the spinlock (in which case the
> > callback is also likely to panic), or in an NMI that panics within
> > that window. The downside of this change is that if one core was
> > politely working through an event with the lock held, and another core
> > panics, we now might lose the panic log, even though it probably would
> > have gone through fine assuming the other core has a chance to
> > continue.
>
> My feeling is that this is a good change, indeed - a lot of places are
> getting changed like this, in this series.
>
> Reasoning: the problem with your example is that, by default, secondary
> CPUs are disabled in the panic path, through an IPI mechanism. IPIs take
> precedence and interrupt the work in these CPUs, effectively
> interrupting the "polite work" with the lock held heh

The IPI can only interrupt a CPU with irqs disabled if the IPI is an
NMI. I haven't looked before to see if we use NMI IPIs to corral the
other CPUs on panic. On x86, I grepped my way down to
native_stop_other_cpus(), which looks like it does a normal IPI, waits
1 second, then does an NMI IPI. So, if a secondary CPU has the lock
held, on x86 it has roughly 1s to finish what it's doing and re-enable
interrupts before smp_send_stop() brings the NMI hammer down. I think
this should be more than enough time for the secondary CPU to get out
and release the lock.

So then it makes sense to me that you're fixing cases where we
panicked with the lock held, or hung with the lock held. Given the 1
second grace period x86 gives us, I'm on board, as that helps mitigate
the risk that we bailed out early with the try and should have spun a
bit longer instead. Thanks.

-Evan

>
> Then, such CPU is put to sleep and we finally reach the panic notifier
> hereby discussed, in the main CPU. If the other CPU was shut-off *with
> the lock held*, it's never finishing such work, so the lock is never to
> be released. Conclusion: the spinlock can't be acquired, hence we broke
> the machine (which is already broken, given it's panic) in the path of
> this notifier.
> This should be really rare, but..possible. So I think we should protect
> against this scenario.
>
> We can grab others' feedback if you prefer, and of course you have the
> rights to refuse this change in the gsmi code, but from my
> point-of-view, I don't see any advantage in just assume the risk,
> specially since the change is very very simple.
>
> Cheers,
>
>
> Guilherme

___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: [PATCH 08/30] powerpc/setup: Refactor/untangle panic notifiers

2022-05-09 Thread Guilherme G. Piccoli
On 05/05/2022 15:55, Hari Bathini wrote:
> [...] 
> 
> The change looks good. I have tested it on an LPAR (ppc64).
> 
> Reviewed-by: Hari Bathini 

Thanks a bunch Hari, much appreciated!

___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: [PATCH 19/30] panic: Add the panic hypervisor notifier list

2022-05-09 Thread Guilherme G. Piccoli
On 03/05/2022 14:44, Michael Kelley (LINUX) wrote:
> [...]
>>
>> Hi Michael, thanks for your feedback! I agree that your idea could work,
>> but...there is one downside: imagine the kmsg_dump() approach is not set
>> in some Hyper-V guest, then we would rely in the regular notification
>> mechanism [hv_die_panic_notify_crash()], right?
>> But...you want then to run this notifier in the informational list,
>> which...won't execute *by default* before kdump if no kmsg_dump() is
>> set. So, this logic is convoluted when you mix it with the default level
>> concept + kdump.
> 
> Yes, you are right.  But to me that speaks as much to the linkage
> between the informational list and kmsg_dump() being the core
> problem.  But as I described in my reply to Patch 24, I can live with
> the linkage as-is.

Thanks for the feedback Michael!

> [...] 
>> I feel the panic notification mechanism does really fit with a
>> hypervisor list, it's a good match with the nature of the list, which
>> aims at informing the panic notification to the hypervisor/FW.
>> Of course we can modify it if you prefer...but please take into account
>> the kdump case and how it complicates the logic.
> 
> I agree that the runtime effect of one list vs. the other is nil.  The
> code works and can stay as you written it.
> 
> I was trying to align from a conceptual standpoint.  It was a bit
> unexpected that one path would be on the hypervisor list, and the
> other path effectively on the informational list.  When I see
> conceptual mismatches like that, I tend to want to understand why,
> and if there is something more fundamental that is out-of-whack.
> 

Totally agree with you here, I am like that as well - try to really
understand the details, this is very important specially in this patch
set, since it's a refactor and affects every user of the notifiers
infrastructure.

Again, just to double-say it: feel free to suggest any change for the
Hyper-V portion (might as well for any patch in the series, indeed) -
you and the other Hyper-V maintainers own this code and I'd be glad to
align with your needs, you are honor citizens in the panic notifiers
area, being one the most heavy users for that =)

Cheers,


Guilherme

___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: [PATCH 07/30] mips: ip22: Reword PANICED to PANICKED and remove useless header

2022-05-09 Thread Guilherme G. Piccoli
On 04/05/2022 17:32, Thomas Bogendoerfer wrote:
> [...]
> 
> applied to mips-next.
> 
> Thomas.
> 

Thanks a bunch Thomas =)

___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: [PATCH 01/30] x86/crash,reboot: Avoid re-disabling VMX in all CPUs on crash/restart

2022-05-09 Thread Guilherme G. Piccoli
On 27/04/2022 19:48, Guilherme G. Piccoli wrote:
> In the panic path we have a list of functions to be called, the panic
> notifiers - such callbacks perform various actions in the machine's
> last breath, and sometimes users want them to run before kdump. We
> have the parameter "crash_kexec_post_notifiers" for that. When such
> parameter is used, the function "crash_smp_send_stop()" is executed
> to poweroff all secondary CPUs through the NMI-shootdown mechanism;
> part of this process involves disabling virtualization features in
> all CPUs (except the main one).
> 
> Now, in the emergency restart procedure we have also a way of
> disabling VMX in all CPUs, using the same NMI-shootdown mechanism;
> what happens though is that in case we already NMI-disabled all CPUs,
> the emergency restart fails due to a second addition of the same items
> in the NMI list, as per the following log output:
> 
> sysrq: Trigger a crash
> Kernel panic - not syncing: sysrq triggered crash
> [...]
> Rebooting in 2 seconds..
> list_add double add: new=, prev=, next=.
> [ cut here ]
> kernel BUG at lib/list_debug.c:29!
> invalid opcode:  [#1] PREEMPT SMP PTI
> 
> In order to reproduce the problem, users just need to set the kernel
> parameter "crash_kexec_post_notifiers" *without* kdump set in any
> system with the VMX feature present.
> 
> Since there is no benefit in re-disabling VMX in all CPUs in case
> it was already done, this patch prevents that by guarding the restart
> routine against doubly issuing NMIs unnecessarily. Notice we still
> need to disable VMX locally in the emergency restart.
> 
> Fixes: ed72736183c4 ("x86/reboot: Force all cpus to exit VMX root if VMX is 
> supported)
> Fixes: 0ee59413c967 ("x86/panic: replace smp_send_stop() with kdump friendly 
> version in panic path")
> Cc: David P. Reed 
> Cc: Hidehiro Kawai 
> Cc: Paolo Bonzini 
> Cc: Sean Christopherson 
> Signed-off-by: Guilherme G. Piccoli 
> ---
>  arch/x86/include/asm/cpu.h |  1 +
>  arch/x86/kernel/crash.c|  8 
>  arch/x86/kernel/reboot.c   | 14 --
>  3 files changed, 17 insertions(+), 6 deletions(-)
> 

Hi Paolo / Sean / Vitaly, sorry for the ping.
But do you think this fix is OK from the VMX point-of-view?

I'd like to send a V2 of this set soon, so any review here is highly
appreciated!

Cheers,


Guilherme


___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: [PATCH 08/30] powerpc/setup: Refactor/untangle panic notifiers

2022-05-09 Thread Hari Bathini



On 28/04/22 4:19 am, Guilherme G. Piccoli wrote:

The panic notifiers infrastructure is a bit limited in the scope of
the callbacks - basically every kind of functionality is dropped
in a list that runs in the same point during the kernel panic path.
This is not really on par with the complexities and particularities
of architecture / hypervisors' needs, and a refactor is ongoing.

As part of this refactor, it was observed that powerpc has 2 notifiers,
with mixed goals: one is just a KASLR offset dumper, whereas the other
aims to hard-disable IRQs (necessary on panic path), warn firmware of
the panic event (fadump) and run low-level platform-specific machinery
that might stop kernel execution and never come back.

Clearly, the 2nd notifier has opposed goals: disable IRQs / fadump
should run earlier while low-level platform actions should
run late since it might not even return. Hence, this patch decouples
the notifiers splitting them in three:

- First one is responsible for hard-disable IRQs and fadump,
should run early;

- The kernel KASLR offset dumper is really an informative notifier,
harmless and may run at any moment in the panic path;

- The last notifier should run last, since it aims to perform
low-level actions for specific platforms, and might never return.
It is also only registered for 2 platforms, pseries and ps3.

The patch better documents the notifiers and clears the code too,
also removing a useless header.

Currently no functionality change should be observed, but after
the planned panic refactor we should expect more panic reliability
with this patch.

Cc: Benjamin Herrenschmidt 
Cc: Hari Bathini 
Cc: Michael Ellerman 
Cc: Nicholas Piggin 
Cc: Paul Mackerras 
Signed-off-by: Guilherme G. Piccoli 


The change looks good. I have tested it on an LPAR (ppc64).

Reviewed-by: Hari Bathini 


---

We'd like to thanks specially the MiniCloud infrastructure [0] maintainers,
that allow us to test PowerPC code in a very complete, functional and FREE
environment (there's no need even for adding a credit card, like many "free"
clouds require ¬¬ ).

[0] https://openpower.ic.unicamp.br/minicloud

  arch/powerpc/kernel/setup-common.c | 74 ++
  1 file changed, 54 insertions(+), 20 deletions(-)

diff --git a/arch/powerpc/kernel/setup-common.c 
b/arch/powerpc/kernel/setup-common.c
index 518ae5aa9410..52f96b209a96 100644
--- a/arch/powerpc/kernel/setup-common.c
+++ b/arch/powerpc/kernel/setup-common.c
@@ -23,7 +23,6 @@
  #include 
  #include 
  #include 
-#include 
  #include 
  #include 
  #include 
@@ -680,8 +679,25 @@ int check_legacy_ioport(unsigned long base_port)
  }
  EXPORT_SYMBOL(check_legacy_ioport);
  
-static int ppc_panic_event(struct notifier_block *this,

- unsigned long event, void *ptr)
+/*
+ * Panic notifiers setup
+ *
+ * We have 3 notifiers for powerpc, each one from a different "nature":
+ *
+ * - ppc_panic_fadump_handler() is a hypervisor notifier, which hard-disables
+ *   IRQs and deal with the Firmware-Assisted dump, when it is configured;
+ *   should run early in the panic path.
+ *
+ * - dump_kernel_offset() is an informative notifier, just showing the KASLR
+ *   offset if we have RANDOMIZE_BASE set.
+ *
+ * - ppc_panic_platform_handler() is a low-level handler that's registered
+ *   only if the platform wishes to perform final actions in the panic path,
+ *   hence it should run late and might not even return. Currently, only
+ *   pseries and ps3 platforms register callbacks.
+ */
+static int ppc_panic_fadump_handler(struct notifier_block *this,
+   unsigned long event, void *ptr)
  {
/*
 * panic does a local_irq_disable, but we really
@@ -691,45 +707,63 @@ static int ppc_panic_event(struct notifier_block *this,
  
  	/*

 * If firmware-assisted dump has been registered then trigger
-* firmware-assisted dump and let firmware handle everything else.
+* its callback and let the firmware handles everything else.
 */
crash_fadump(NULL, ptr);
-   if (ppc_md.panic)
-   ppc_md.panic(ptr);  /* May not return */
+
return NOTIFY_DONE;
  }
  
-static struct notifier_block ppc_panic_block = {

-   .notifier_call = ppc_panic_event,
-   .priority = INT_MIN /* may not return; must be done last */
-};
-
-/*
- * Dump out kernel offset information on panic.
- */
  static int dump_kernel_offset(struct notifier_block *self, unsigned long v,
  void *p)
  {
pr_emerg("Kernel Offset: 0x%lx from 0x%lx\n",
 kaslr_offset(), KERNELBASE);
  
-	return 0;

+   return NOTIFY_DONE;
  }
  
+static int ppc_panic_platform_handler(struct notifier_block *this,

+ unsigned long event, void *ptr)
+{
+   /*
+* This handler is only registered if we have a panic callback
+* on ppc_md, hence NULL check is not needed.
+* Als

Re: [PATCH 09/30] coresight: cpu-debug: Replace mutex with mutex_trylock on panic notifier

2022-05-09 Thread Guilherme G. Piccoli
On 28/04/2022 05:11, Suzuki K Poulose wrote:
> Hi Guilherme,
> 
> On 27/04/2022 23:49, Guilherme G. Piccoli wrote:
>> The panic notifier infrastructure executes registered callbacks when
>> a panic event happens - such callbacks are executed in atomic context,
>> with interrupts and preemption disabled in the running CPU and all other
>> CPUs disabled. That said, mutexes in such context are not a good idea.
>>
>> This patch replaces a regular mutex with a mutex_trylock safer approach;
>> given the nature of the mutex used in the driver, it should be pretty
>> uncommon being unable to acquire such mutex in the panic path, hence
>> no functional change should be observed (and if it is, that would be
>> likely a deadlock with the regular mutex).
>>
>> Fixes: 2227b7c74634 ("coresight: add support for CPU debug module")
>> Cc: Leo Yan 
>> Cc: Mathieu Poirier 
>> Cc: Mike Leach 
>> Cc: Suzuki K Poulose 
>> Signed-off-by: Guilherme G. Piccoli 
> 
> How would you like to proceed with queuing this ? I am happy
> either way. In case you plan to push this as part of this
> series (I don't see any potential conflicts) :
> 
> Reviewed-by: Suzuki K Poulose 

Hi Suzuki, some other maintainers are taking the patches to their next
branches for example. I'm working on V2, and I guess in the end would be
nice to reduce the size of the series a bit.

So, do you think you could pick this one for your coresight/next branch
(or even for rc cycle, your call - this is really a fix)?
This way, I won't re-submit this one in V2, since it's gonna be merged
already in your branch.

Thanks in advance,


Guilherme

___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: [PATCH 22/30] panic: Introduce the panic post-reboot notifier list

2022-05-09 Thread Guilherme G. Piccoli
On 27/04/2022 19:49, Guilherme G. Piccoli wrote:
> Currently we have 3 notifier lists in the panic path, which will
> be wired in a way to allow the notifier callbacks to run in
> different moments at panic time, in a subsequent patch.
> 
> But there is also an odd set of architecture calls hardcoded in
> the end of panic path, after the restart machinery. They're
> responsible for late time tunings / events, like enabling a stop
> button (Sparc) or effectively stopping the machine (s390).
> 
> This patch introduces yet another notifier list to offer the
> architectures a way to add callbacks in such late moment on
> panic path without the need of ifdefs / hardcoded approaches.
> 
> Cc: Alexander Gordeev 
> Cc: Christian Borntraeger 
> Cc: "David S. Miller" 
> Cc: Heiko Carstens 
> Cc: Sven Schnelle 
> Cc: Vasily Gorbik 
> Signed-off-by: Guilherme G. Piccoli 

Hey S390/SPARC folks, sorry for the ping!

Any reviews on this V1 would be greatly appreciated, I'm working on V2
and seeking feedback in the non-reviewed patches.

Thanks in advance,


Guilherme

___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: [PATCH 08/30] powerpc/setup: Refactor/untangle panic notifiers

2022-05-09 Thread Guilherme G. Piccoli
On 05/05/2022 15:55, Hari Bathini wrote:
> [...] 
> The change looks good. I have tested it on an LPAR (ppc64).
> 
> Reviewed-by: Hari Bathini 
> 

Hi Michael. do you think it's possible to add this one to powerpc/next
(or something like that), or do you prefer a V2 with his tag?
Thanks,


Guilherme

___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: [PATCH 24/30] panic: Refactor the panic path

2022-05-09 Thread Guilherme G. Piccoli
On 29/04/2022 13:04, Guilherme G. Piccoli wrote:
> On 27/04/2022 21:28, Randy Dunlap wrote:
>>
>>
>> On 4/27/22 15:49, Guilherme G. Piccoli wrote:
>>> +   crash_kexec_post_notifiers
>>> +   This was DEPRECATED - users should always prefer the
>>
>>  This is DEPRECATED - users should always prefer the
>>
>>> +   parameter "panic_notifiers_level" - check its entry
>>> +   in this documentation for details on how it works.
>>> +   Setting this parameter is exactly the same as setting
>>> +   "panic_notifiers_level=4".
>>
> 
> Thanks Randy, for your suggestion - but I confess I couldn't understand
> it properly. It's related to spaces/tabs, right? What you suggest me to
> change in this formatting? Just by looking the email I can't parse.
> 
> Cheers,
> 
> 
> Guilherme

Complete lack of attention from me, apologies!
The suggestions was s/was/is - already fixed for V2, thanks Randy.

___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: [PATCH v8 1/7] x86/crash: fix minor typo/bug in debug message

2022-05-09 Thread Eric DeVolder




On 5/9/22 00:26, Baoquan He wrote:

On 05/09/22 at 10:46am, Sourabh Jain wrote:


On 06/05/22 00:15, Eric DeVolder wrote:

The pr_debug() intends to display the memsz member, but the
parameter is actually the bufsz member (which is already
displayed). Correct this to display memsz value.

Signed-off-by: Eric DeVolder
Reviewed-by: David Hildenbrand
Acked-by: Baoquan He
---
   arch/x86/kernel/crash.c | 2 +-
   1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/kernel/crash.c b/arch/x86/kernel/crash.c
index e8326a8d1c5d..9730c88530fc 100644
--- a/arch/x86/kernel/crash.c
+++ b/arch/x86/kernel/crash.c
@@ -407,7 +407,7 @@ int crash_load_segments(struct kimage *image)
}
image->elf_load_addr = kbuf.mem;
pr_debug("Loaded ELF headers at 0x%lx bufsz=0x%lx memsz=0x%lx\n",
-image->elf_load_addr, kbuf.bufsz, kbuf.bufsz);
+image->elf_load_addr, kbuf.bufsz, kbuf.memsz);


I think we can push this patch separately.


Boris has taken this into tip/x86/kdump.


Excellent, I'll remove this patch going forward.
Eric

___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: [PATCH v8 3/7] crash: add generic infrastructure for crash hotplug support

2022-05-09 Thread Eric DeVolder




On 5/6/22 02:12, Baoquan He wrote:

On 05/05/22 at 02:45pm, Eric DeVolder wrote:
..

diff --git a/kernel/crash_core.c b/kernel/crash_core.c
index 256cf6db573c..f197af50def6 100644
--- a/kernel/crash_core.c
+++ b/kernel/crash_core.c
@@ -9,12 +9,17 @@
  #include 
  #include 
  #include 
+#include 


Wondering where highmem.h is needed. Just curious.


Ahh, I missed that. At one point in time we moved map_crash_pages() into this file, which brought 
highmem.h along with it. But we have since moved map_crash_pages() into x86/crash.c. And I missed 
eliminating highmem.h at that time.


I have removed this for v9.

eric





+#include 
+#include 
  
  #include 

  #include 
  
  #include 
  
+#include "kexec_internal.h"

+
  /* vmcoreinfo stuff */
  unsigned char *vmcoreinfo_data;
  size_t vmcoreinfo_size;
@@ -491,3 +496,90 @@ static int __init crash_save_vmcoreinfo_init(void)
  }
  
  subsys_initcall(crash_save_vmcoreinfo_init);

+
+#if defined(CONFIG_HOTPLUG_CPU) || defined(CONFIG_MEMORY_HOTPLUG)
+void __weak arch_crash_handle_hotplug_event(struct kimage *image,
+   unsigned int hp_action, 
unsigned int cpu)
+{
+   WARN(1, "crash hotplug handler not implemented");
+}
+
+static void handle_hotplug_event(unsigned int hp_action, unsigned int cpu)
+{
+   /* Obtain lock while changing crash information */
+   if (!mutex_trylock(&kexec_mutex))
+   return;
+
+   /* Check kdump is loaded */
+   if (kexec_crash_image) {
+   pr_debug("crash hp: hp_action %u, cpu %u", hp_action, cpu);
+
+   /* Needed in order for the segments to be updated */
+   arch_kexec_unprotect_crashkres();
+
+   /* Flag to differentiate between normal load and hotplug */
+   kexec_crash_image->hotplug_event = true;
+
+   /* Now invoke arch-specific update handler */
+   arch_crash_handle_hotplug_event(kexec_crash_image, hp_action, 
cpu);
+
+   /* No longer handling a hotplug event */
+   kexec_crash_image->hotplug_event = false;
+
+   /* Change back to read-only */
+   arch_kexec_protect_crashkres();
+   }
+
+   /* Release lock now that update complete */
+   mutex_unlock(&kexec_mutex);
+}
+
+static int crash_memhp_notifier(struct notifier_block *nb, unsigned long val, 
void *v)
+{
+   switch (val) {
+   case MEM_ONLINE:
+   handle_hotplug_event(KEXEC_CRASH_HP_ADD_MEMORY, 0);
+   break;
+
+   case MEM_OFFLINE:
+   handle_hotplug_event(KEXEC_CRASH_HP_REMOVE_MEMORY, 0);
+   break;
+   }
+   return NOTIFY_OK;
+}
+
+static struct notifier_block crash_memhp_nb = {
+   .notifier_call = crash_memhp_notifier,
+   .priority = 0
+};
+
+static int crash_cpuhp_online(unsigned int cpu)
+{
+   handle_hotplug_event(KEXEC_CRASH_HP_ADD_CPU, cpu);
+   return 0;
+}
+
+static int crash_cpuhp_offline(unsigned int cpu)
+{
+   handle_hotplug_event(KEXEC_CRASH_HP_REMOVE_CPU, cpu);
+   return 0;
+}
+
+static int __init crash_hotplug_init(void)
+{
+   int result = 0;
+
+   if (IS_ENABLED(CONFIG_MEMORY_HOTPLUG))
+   register_memory_notifier(&crash_memhp_nb);
+
+   if (IS_ENABLED(CONFIG_HOTPLUG_CPU))
+   result = cpuhp_setup_state_nocalls(CPUHP_AP_ONLINE_DYN,
+   
"crash/cpuhp",
+   
crash_cpuhp_online,
+   
crash_cpuhp_offline);
+
+   return result;
+}
+
+subsys_initcall(crash_hotplug_init);
+#endif
--
2.27.0





___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: [PATCHv3 1/2] cpu/hotplug: Keep cpu hotplug disabled until the rebooting cpu is stable

2022-05-09 Thread Pingfan Liu
On Mon, May 09, 2022 at 12:55:21PM +0200, Thomas Gleixner wrote:
> On Mon, May 09 2022 at 12:13, Pingfan Liu wrote:
> > The following code chunk repeats in both
> > migrate_to_reboot_cpu() and smp_shutdown_nonboot_cpus():
> >
> > if (!cpu_online(primary_cpu))
> > primary_cpu = cpumask_first(cpu_online_mask);
> >
> > This is due to a breakage like the following:
> 
> I don't see what's broken here.
> 

No, no broken. Could it be better to replace 'breakage' with 'breakin'?

> > kernel_kexec()
> >migrate_to_reboot_cpu();
> >cpu_hotplug_enable();
> > ---> comes a cpu_down(this_cpu) on other cpu
> >machine_shutdown();
> >  smp_shutdown_nonboot_cpus(); // re-check "if 
> > (!cpu_online(primary_cpu))" to protect against the former breakin
> >
> > Although the kexec-reboot task can get through a cpu_down() on its cpu,
> > this code looks a little confusing.
> 
> Confusing != broken.
> 

Yes. And it only recurs confusing.

> > +/* primary_cpu keeps unchanged after migrate_to_reboot_cpu() */
> 
> This comment makes no sense.
> 

Since migrate_to_reboot_cpu() disables cpu hotplug, so the selected
valid online cpu -- primary_cpu keeps unchange.

> >  void smp_shutdown_nonboot_cpus(unsigned int primary_cpu)
> >  {
> > unsigned int cpu;
> > int error;
> >  
> > +   /*
> > +* Block other cpu hotplug event, so primary_cpu is always online if
> > +* it is not touched by us
> > +*/
> > cpu_maps_update_begin();
> > -
> > /*
> > -* Make certain the cpu I'm about to reboot on is online.
> > -*
> > -* This is inline to what migrate_to_reboot_cpu() already do.
> > +* migrate_to_reboot_cpu() disables CPU hotplug assuming that
> > +* no further code needs to use CPU hotplug (which is true in
> > +* the reboot case). However, the kexec path depends on using
> > +* CPU hotplug again; so re-enable it here.
> 
> You want to reduce confusion, but in reality this is even more confusing
> than before.
> 

This __cpu_hotplug_enable() can be considered to defer from kernel_kexec() to
arch-dependent code chunk (here), which is a more proper point.

Could it make things better by rephrasing the words as the following?
   migrate_to_reboot_cpu() disables CPU hotplug to prevent the selected
   reboot cpu from disappearing. But arches need cpu_down to hot remove
   cpus except rebooting-cpu, so re-enabling cpu hotplug again.

> >  */
> > -   if (!cpu_online(primary_cpu))
> > -   primary_cpu = cpumask_first(cpu_online_mask);
> > +   __cpu_hotplug_enable();
> 
> How is this decrement solving anything? At the end of this function, the
> counter is incremented again. So what's the point of this exercise?
> 

This decrement enables the cpu hot-removing.  Since
smp_shutdown_nonboot_cpus()->cpu_down_maps_locked(), if
cpu_hotplug_disabled, it returns -EBUSY. 

On the other hand, at the end of this function, cpu_hotplug_disabled++,
which aims to prevent any new coming cpu, since machine_kexec() assumes
to execute on the only rebooting-cpu, any dangling cpu has unexpected
result.

> > for_each_online_cpu(cpu) {
> > if (cpu == primary_cpu)
> > diff --git a/kernel/kexec_core.c b/kernel/kexec_core.c
> > index 68480f731192..db4fa6b174e3 100644
> > --- a/kernel/kexec_core.c
> > +++ b/kernel/kexec_core.c
> > @@ -1168,14 +1168,12 @@ int kernel_kexec(void)
> > kexec_in_progress = true;
> > kernel_restart_prepare("kexec reboot");
> > migrate_to_reboot_cpu();
> > -
> > /*
> > -* migrate_to_reboot_cpu() disables CPU hotplug assuming that
> > -* no further code needs to use CPU hotplug (which is true in
> > -* the reboot case). However, the kexec path depends on using
> > -* CPU hotplug again; so re-enable it here.
> > +* migrate_to_reboot_cpu() disables CPU hotplug. If an arch
> > +* relies on the cpu teardown to achieve reboot, it needs to
> > +* re-enable CPU hotplug there.
> 
> What does that for arch/powerpc/kernel/kexec_machine64.c now?
> 
> Nothing, as far as I can tell. Which means you basically reverted
> 011e4b02f1da ("powerpc, kexec: Fix "Processor X is stuck" issue during
> kexec from ST mode") unless I'm completely confused.
> 

Oops. Forget about powerpc. Considering the cpu hotplug is an
arch-dependent feature in machine_shutdown(), as x86 does not need it.

What about supplying an cpu hotplug enable in the powerpc machine_kexec 
implement.

diff --git a/arch/powerpc/kexec/core_64.c b/arch/powerpc/kexec/core_64.c
index 6cc7793b8420..8ccf22197f08 100644
--- a/arch/powerpc/kexec/core_64.c
+++ b/arch/powerpc/kexec/core_64.c
@@ -224,6 +224,7 @@ static void wake_offline_cpus(void)

 static void kexec_prepare_cpus(void)
 {
+   cpu_hotplug_enable();
wake_offline_cpus();
smp_call_function(kexec_smp_down, NULL, /* wait */0);
local_irq_disable();


> >