"Nicholas Piggin" <npig...@gmail.com> writes:
> On Wed Sep 14, 2022 at 3:39 AM AEST, Leonardo Brás wrote:
>> On Mon, 2022-09-12 at 14:58 -0500, Nathan Lynch wrote:
>> > Leonardo Brás <leobra...@gmail.com> writes:
>> > > On Fri, 2022-09-09 at 09:04 -0500, Nathan Lynch wrote:
>> > > > Leonardo Brás <leobra...@gmail.com> writes:
>> > > > > On Wed, 2022-09-07 at 17:01 -0500, Nathan Lynch wrote:
>> > > > > > At the time this was submitted by Leonardo, I confirmed -- or
>> > > > > > thought I had confirmed -- with PowerVM partition firmware
>> > > > > > development that the following RTAS functions:
>> > > > > > 
>> > > > > > - ibm,get-xive
>> > > > > > - ibm,int-off
>> > > > > > - ibm,int-on
>> > > > > > - ibm,set-xive
>> > > > > > 
>> > > > > > were safe to call on multiple CPUs simultaneously, not only with
>> > > > > > respect to themselves as indicated by PAPR, but with arbitrary
>> > > > > > other RTAS calls:
>> > > > > > 
>> > > > > > https://lore.kernel.org/linuxppc-dev/875zcy2v8o....@linux.ibm.com/
>> > > > > > 
>> > > > > > Recent discussion with firmware development makes it clear that
>> > > > > > this is not true, and that the code in commit b664db8e3f97
>> > > > > > ("powerpc/rtas: Implement reentrant rtas call") is unsafe, likely
>> > > > > > explaining several strange bugs we've seen in internal testing
>> > > > > > involving DLPAR and LPM. These scenarios use
>> > > > > > ibm,configure-connector, whose internal state can be corrupted
>> > > > > > by the concurrent use of the "reentrant" functions, leading to
>> > > > > > symptoms like endless busy statuses from RTAS.
>> > > > > 
>> > > > > Oh, doesn't that mean PowerVM is not compliant with the PAPR specs?
>> > > > 
>> > > > No, it means the premise of commit b664db8e3f97 ("powerpc/rtas:
>> > > > Implement reentrant rtas call") is incorrect. The "reentrant"
>> > > > property described in the spec applies only to the individual RTAS
>> > > > functions. The OS can invoke (for example) ibm,set-xive on multiple
>> > > > CPUs simultaneously, but it must adhere to the more general
>> > > > requirement to serialize with other RTAS functions.
>> > > > 
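(An aside, stepping outside the quoting for a moment: the rule above -- each
function reentrant only with respect to itself, and everything serialized
against every other RTAS function -- is exactly what a single global lock
trivially satisfies. Below is a throwaway userspace model of that policy, not
kernel code; `model_rtas_call`, `model_serialized_ok`, and friends are
invented names for illustration only.)

```c
/*
 * Userspace model of "one global lock around every RTAS entry".
 * Any mix of RTAS functions from any number of CPUs is serialized,
 * which is sufficient for PAPR's requirements even though it gives
 * up the per-function reentrancy the spec would permit.
 */
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

static pthread_mutex_t rtas_lock = PTHREAD_MUTEX_INITIALIZER;
static atomic_int in_rtas;	/* callers currently "inside firmware" */

static void model_rtas_call(void)
{
	pthread_mutex_lock(&rtas_lock);
	/* With a single global lock, nobody else is ever inside RTAS. */
	assert(atomic_fetch_add(&in_rtas, 1) == 0);
	/* ... the actual firmware entry would go here ... */
	atomic_fetch_sub(&in_rtas, 1);
	pthread_mutex_unlock(&rtas_lock);
}

static void *cpu_thread(void *unused)
{
	(void)unused;
	for (int i = 0; i < 1000; i++)
		model_rtas_call();	/* stands in for any RTAS function */
	return NULL;
}

/* Returns 1 if four "CPUs" hammering RTAS never overlapped. */
int model_serialized_ok(void)
{
	pthread_t t[4];

	for (int i = 0; i < 4; i++)
		pthread_create(&t[i], NULL, cpu_thread, NULL);
	for (int i = 0; i < 4; i++)
		pthread_join(t[i], NULL);
	return atomic_load(&in_rtas) == 0;
}
```

(Build with -pthread; the assert inside model_rtas_call fires if two callers
are ever inside the "firmware" section at once.)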
>> > > 
>> > > I see. Thanks for explaining that part!
>> > > I agree: reentrant calls used that way aren't as useful on Linux as I
>> > > previously thought.
>> > > 
>> > > OTOH, I think that instead of reverting the change, we could make use
>> > > of the correct information and fix the current implementation. (This
>> > > could help when we make the same RTAS call on multiple CPUs.)
>> > 
>> > Hmm, I'm happy to be mistaken here, but I doubt we ever really need to
>> > do that; I'm not seeing the need.
>> > 
>> > > I have an idea for a patch to fix this.
>> > > Do you think it would be ok if I sent it, as a prospective alternative
>> > > to this reversion?
>> > 
>> > It is my preference, and I believe it is more common, to revert to the
>> > well-understood prior state, imperfect as it may be. The revert can be
>> > backported to -stable and distros while development and review of
>> > another approach proceeds.
>>
>> Ok then, as long as you are aware of the kdump bug, I'm good.
>>
>> FWIW:
>> Reviewed-by: Leonardo Bras <leobra...@gmail.com>
>
> A shame. I guess a reader/writer lock would not be much help because
> the crash is probably more likely to hit longer running rtas calls?
>
> The alternative is to just cheat and do this...?
>
> Thanks,
> Nick
>
> diff --git a/arch/powerpc/kernel/rtas.c b/arch/powerpc/kernel/rtas.c
> index 693133972294..89728714a06e 100644
> --- a/arch/powerpc/kernel/rtas.c
> +++ b/arch/powerpc/kernel/rtas.c
> @@ -26,6 +26,7 @@
>  #include <linux/syscalls.h>
>  #include <linux/of.h>
>  #include <linux/of_fdt.h>
> +#include <linux/panic.h>
>  
>  #include <asm/interrupt.h>
>  #include <asm/rtas.h>
> @@ -97,6 +98,19 @@ static unsigned long lock_rtas(void)
>  {
>         unsigned long flags;
>  
> +       if (atomic_read(&panic_cpu) == raw_smp_processor_id()) {
> +               /*
> +                * Crash in progress on this CPU. Other CPUs should be
> +                * stopped by now, so skip the lock in case it was being
> +                * held, and is now needed for crashing e.g., kexec
> +                * (machine_kexec_mask_interrupts) requires rtas calls.
> +                *
> +                * It's possible this could have caused rtas state breakage
> +                * but the alternative is deadlock.
> +                */
> +               return 0;
> +       }
> +
>         local_irq_save(flags);
>         preempt_disable();
>         arch_spin_lock(&rtas.lock);
> @@ -105,6 +119,9 @@ static unsigned long lock_rtas(void)
>  
>  static void unlock_rtas(unsigned long flags)
>  {
> +       if (atomic_read(&panic_cpu) == raw_smp_processor_id())
> +               return;
> +
>         arch_spin_unlock(&rtas.lock);
>         local_irq_restore(flags);
>         preempt_enable();

Looks correct.
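To convince myself, I sketched the deadlock-avoidance idea as a userspace
model (all names here are invented for illustration -- this is not the kernel
code): the crashing CPU skips a lock that a now-stopped CPU may still hold,
so the panic path can proceed where a plain lock acquisition would hang
forever.

```c
/*
 * Userspace model of the panic-path lock bypass: if the CPU that is
 * crashing finds the RTAS lock held by a CPU that has been stopped,
 * taking the lock would deadlock, so the panic CPU simply skips it
 * (accepting possible rtas state breakage, as the comment in the
 * patch says).
 */
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

static pthread_mutex_t model_rtas_lock = PTHREAD_MUTEX_INITIALIZER;
static atomic_int model_panic_cpu = -1;	/* -1: no panic in progress */

void model_lock_rtas(int my_cpu)
{
	/* Crashing CPU: don't block on a lock nobody will ever release. */
	if (atomic_load(&model_panic_cpu) == my_cpu)
		return;
	pthread_mutex_lock(&model_rtas_lock);
}

void model_unlock_rtas(int my_cpu)
{
	if (atomic_load(&model_panic_cpu) == my_cpu)
		return;
	pthread_mutex_unlock(&model_rtas_lock);
}

/*
 * Simulate: "CPU 1" is stopped while holding the lock; "CPU 0" then
 * panics and still needs RTAS. Returns 1 if the panic path completes
 * (without the bypass, model_lock_rtas(0) would block forever).
 */
int model_panic_path_completes(void)
{
	pthread_mutex_lock(&model_rtas_lock);	/* stuck CPU 1 holds it */
	atomic_store(&model_panic_cpu, 0);	/* CPU 0 enters panic */
	model_lock_rtas(0);			/* returns instead of blocking */
	/* ... ibm,int-off / ibm,os-term etc. would run here ... */
	model_unlock_rtas(0);
	return 1;
}
```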

I wonder - would it be worth making the panic path use a separate
"emergency" rtas_args buffer as well? If a CPU is actually "stuck" in
RTAS at panic time, then leaving rtas.args untouched might make the
ibm,int-off, ibm,set-xive, ibm,os-term, and any other RTAS calls we
incur on the panic path more likely to succeed.

Building on yours, something like (sorry, it's ugly):

diff --git a/arch/powerpc/kernel/rtas.c b/arch/powerpc/kernel/rtas.c
index 693133972294..4865d26e7391 100644
--- a/arch/powerpc/kernel/rtas.c
+++ b/arch/powerpc/kernel/rtas.c
@@ -73,6 +73,8 @@ struct rtas_t rtas = {
 };
 EXPORT_SYMBOL(rtas);
 
+static struct rtas_args emergency_rtas_args;
+
 DEFINE_SPINLOCK(rtas_data_buf_lock);
 EXPORT_SYMBOL(rtas_data_buf_lock);
 
@@ -93,20 +95,24 @@ EXPORT_SYMBOL(rtas_flash_term_hook);
  * such as having the timebase stopped which would lockup with
  * normal locks and spinlock debugging enabled
  */
-static unsigned long lock_rtas(void)
+static struct rtas_args *lock_rtas(unsigned long *flags)
 {
-       unsigned long flags;
+       if (atomic_read(&panic_cpu) == raw_smp_processor_id())
+               return &emergency_rtas_args;
 
-       local_irq_save(flags);
+       local_irq_save(*flags);
        preempt_disable();
        arch_spin_lock(&rtas.lock);
-       return flags;
+       return &rtas.args;
 }
 
-static void unlock_rtas(unsigned long flags)
+static void unlock_rtas(struct rtas_args *args, unsigned long *flags)
 {
+       if (atomic_read(&panic_cpu) == raw_smp_processor_id())
+               return;
+
        arch_spin_unlock(&rtas.lock);
-       local_irq_restore(flags);
+       local_irq_restore(*flags);
        preempt_enable();
 }
 
@@ -117,14 +123,15 @@ static void unlock_rtas(unsigned long flags)
  */
 static void call_rtas_display_status(unsigned char c)
 {
-       unsigned long s;
+       struct rtas_args *args;
+       unsigned long flags;
 
        if (!rtas.base)
                return;
 
-       s = lock_rtas();
-       rtas_call_unlocked(&rtas.args, 10, 1, 1, NULL, c);
-       unlock_rtas(s);
+       args = lock_rtas(&flags);
+       rtas_call_unlocked(args, 10, 1, 1, NULL, c);
+       unlock_rtas(args, &flags);
 }
 
 static void call_rtas_display_status_delay(char c)
@@ -468,7 +475,7 @@ int rtas_call(int token, int nargs, int nret, int *outputs, ...)
 {
        va_list list;
        int i;
-       unsigned long s;
+       unsigned long flags;
        struct rtas_args *rtas_args;
        char *buff_copy = NULL;
        int ret;
@@ -481,10 +488,7 @@ int rtas_call(int token, int nargs, int nret, int *outputs, ...)
                return -1;
        }
 
-       s = lock_rtas();
-
-       /* We use the global rtas args buffer */
-       rtas_args = &rtas.args;
+       rtas_args = lock_rtas(&flags);
 
        va_start(list, outputs);
        va_rtas_call_unlocked(rtas_args, token, nargs, nret, list);
@@ -500,7 +504,7 @@ int rtas_call(int token, int nargs, int nret, int *outputs, ...)
                        outputs[i] = be32_to_cpu(rtas_args->rets[i+1]);
        ret = (nret > 0)? be32_to_cpu(rtas_args->rets[0]): 0;
 
-       unlock_rtas(s);
+       unlock_rtas(rtas_args, &flags);
 
        if (buff_copy) {
                log_error(buff_copy, ERR_TYPE_RTAS_LOG, 0);
@@ -1190,6 +1194,7 @@ static void __init rtas_syscall_filter_init(void)
 /* We assume to be passed big endian arguments */
 SYSCALL_DEFINE1(rtas, struct rtas_args __user *, uargs)
 {
+       struct rtas_args *argsp;
        struct rtas_args args;
        unsigned long flags;
        char *buff_copy, *errbuf = NULL;
@@ -1249,18 +1254,18 @@ SYSCALL_DEFINE1(rtas, struct rtas_args __user *, uargs)
 
        buff_copy = get_errorlog_buffer();
 
-       flags = lock_rtas();
+       argsp = lock_rtas(&flags);
 
-       rtas.args = args;
-       do_enter_rtas(__pa(&rtas.args));
-       args = rtas.args;
+       *argsp = args;
+       do_enter_rtas(__pa(argsp));
+       args = *argsp;
 
        /* A -1 return code indicates that the last command couldn't
           be completed due to a hardware error. */
        if (be32_to_cpu(args.rets[0]) == -1)
                errbuf = __fetch_rtas_last_error(buff_copy);
 
-       unlock_rtas(flags);
+       unlock_rtas(argsp, &flags);
 
        if (buff_copy) {
                if (errbuf)
