----- Original Message -----
> From: "Michael Ellerman" <m...@ellerman.id.au>
> To: "Timothy Pearson" <tpear...@raptorengineering.com>, "Jens Axboe" 
> <ax...@kernel.dk>, "regressions"
> <regressi...@lists.linux.dev>, "npiggin" <npig...@gmail.com>, "christophe 
> leroy" <christophe.le...@csgroup.eu>,
> "linuxppc-dev" <linuxppc-dev@lists.ozlabs.org>
> Sent: Monday, November 20, 2023 1:10:06 AM
> Subject: Re: [PATCH v2] powerpc: Don't clobber fr0/vs0 during fp|altivec 
> register  save

> Hi Timothy,
> 
> Great work debugging this. I think your fix is good, but I want to understand 
> it
> a bit more to make sure I can explain why we haven't seen it outside of
> io-uring.
> If this can be triggered outside of io-uring then I have even more backporting
> in my future :}
> 
> Typically save_fpu() is called from __giveup_fpu(), which saves the FP regs and
> also *turns off FP* in the task's MSR, meaning the kernel will reload the FP
> regs from the thread struct before letting the task use FP again. So in that
> case save_fpu() is free to clobber fr0, because the FP regs no longer hold live
> values for the task.
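> 
> For reference, __giveup_fpu() looks roughly like this -- paraphrased from
> arch/powerpc/kernel/process.c, so details may differ between kernel versions:
> 
>   static void __giveup_fpu(struct task_struct *tsk)
>   {
>           unsigned long msr;
> 
>           save_fpu(tsk);          /* saves the FP regs, clobbers fr0 */
>           msr = tsk->thread.regs->msr;
>           msr &= ~(MSR_FP | MSR_FE0 | MSR_FE1);
>           if (cpu_has_feature(CPU_FTR_VSX))
>                   msr &= ~MSR_VSX;
>           /* MSR_FP is now clear: the task must fault and reload its FP
>              regs from the thread struct before it can use FP again */
>           regs_set_return_msr(tsk->thread.regs, msr);
>   }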
> 
> There is another case though, which is the path via:
>  copy_process()
>    dup_task_struct()
>      arch_dup_task_struct()
>        flush_all_to_thread()
>          save_all()
> 
> That path saves the FP regs but leaves them live. That's meant as an
> optimisation for a process that's using FP/VSX and then calls fork(): leaving
> the regs live means the parent process doesn't have to take a fault after the
> fork to get its FP regs back.
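> 
> For comparison, save_all() only saves -- again roughly, paraphrasing
> process.c and trimming the SPE case:
> 
>   static void save_all(struct task_struct *tsk)
>   {
>           unsigned long usermsr = tsk->thread.regs->msr;
> 
>           if ((usermsr & msr_all_available) == 0)
>                   return;
> 
>           msr_check_and_set(msr_all_available);
> 
>           if (usermsr & MSR_FP)
>                   save_fpu(tsk);          /* clobbers fr0 ... */
>           if (usermsr & MSR_VEC)
>                   save_altivec(tsk);
> 
>           msr_check_and_clear(msr_all_available);
>           /* ... and regs->msr is never touched: MSR_FP stays set, the
>              regs stay live, and nothing reloads the saved copies */
>   }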
> 
> That path does clobber fr0, but fr0 is volatile across a syscall, and the only
> way to reach copy_process() from userspace is via a syscall. So in normal 
> usage
> fr0 being clobbered across a syscall shouldn't cause data corruption.
> 
> Even if we handle a signal on the return from the fork() syscall, the worst 
> that
> happens is that the task's thread struct holds the clobbered fr0, but the task
> doesn't care (because fr0 is volatile across the syscall anyway).
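> 
> To make the ABI point concrete, here is a hypothetical userspace function
> (illustrative only, not from any real program). Because fr0 is in the
> volatile class of the ELFv2 ABI, the compiler never keeps a live value
> there across a call, and a syscall stub is just another call:
> 
>   #include <unistd.h>
> 
>   double scale(double x)
>   {
>           double a = x * 2.0;     /* 'a' may well live in fr0 here...   */
>           getpid();               /* ...but not across the call: f0-f13
>                                      are assumed trashed, so 'a' moves to
>                                      a nonvolatile reg or spills first   */
>           return a + 1.0;
>   }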
> 
> That path is something like:
> 
> system_call_vectored_common()
>  system_call_exception()
>    sys_fork()
>      kernel_clone()
>        copy_process()
>          dup_task_struct()
>            arch_dup_task_struct()
>              flush_all_to_thread()
>                save_all()
>                  if (tsk->thread.regs->msr & MSR_FP)
>                    save_fpu()
>                    # does not clear MSR_FP from regs->msr
>  syscall_exit_prepare()
>    interrupt_exit_user_prepare_main()
>      do_notify_resume()
>        get_signal()
>        handle_rt_signal64()
>          prepare_setup_sigcontext()
>            flush_fp_to_thread()
>              if (tsk->thread.regs->msr & MSR_FP)
>                giveup_fpu()
>                  __giveup_fpu()
>                    save_fpu()
>                    # clobbered fr0 is saved, but task considers it volatile
>                    # across syscall anyway
> 
> 
> But we now have a new path, because io-uring can call copy_process() via
> create_io_thread() from the signal handling path. That's OK if the signal is
> handled as we return from a syscall, but it's not OK if the signal is handled
> due to some other interrupt.
> 
> Which is:
> 
> interrupt_return_srr_user()
>  interrupt_exit_user_prepare()
>    interrupt_exit_user_prepare_main()
>      do_notify_resume()
>        get_signal()
>          task_work_run()
>            create_worker_cb()
>              create_io_worker()
>                copy_process()
>                  dup_task_struct()
>                    arch_dup_task_struct()
>                      flush_all_to_thread()
>                        save_all()
>                          if (tsk->thread.regs->msr & MSR_FP)
>                            save_fpu()
>                            # fr0 is clobbered and potentially live in userspace
> 
> 
> So tldr I think the corruption is only an issue since io-uring started doing
> the clone via signal, which I think matches the observed timeline of this bug
> appearing.

I agree the corruption really only started showing up in earnest with io_uring 
clone-via-signal, as was confirmed several times in the course of debugging.  
Bear in mind, however, that we have seen very, very rare crashes in other tasks 
over several years, and I am starting to think this bug might be the root cause 
(see below).

Note as well that I may very well have the call order wrong in the commit 
message, since I was relying on a couple of WARN_ON() macros I inserted to 
check for a similar (but not identical) condition, and I didn't spend much time 
getting new traces after identifying the root cause.

I went back and grabbed some real-world system-wide stack traces, since I now 
know what to trigger on.  A typical example is:

interrupt_return_srr_user()
 interrupt_exit_user_prepare()
  interrupt_exit_user_prepare_main()
   schedule()
    __schedule()
     __switch_to()
      giveup_all()
       # tsk->thread.regs->msr MSR_FP is still set here
       __giveup_fpu()
        save_fpu()
        # fr0 is clobbered and potentially live in userspace

This indicates that the pending signal is not actually required; it simply 
makes triggering much more likely.  Both the pending signal and the 
_TIF_NEED_RESCHED paths give up control and end up in the same overall position 
of triggering down-call routines that assume the FPU state is valid.
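
If the schedule() path above really can clobber a live fr0, a userspace loop 
along these lines should eventually catch it in the act.  This is only a 
sketch (ppc64le, gcc extended asm), not the test case I actually used, and the 
window is just the handful of instructions inside save_fpu(), so it needs 
heavy scheduler load and a lot of patience:

  /* fr0-watchdog.c: park a sentinel in fr0, keep it live while spinning,
   * and check whether an involuntary context switch hands back a
   * different value.  Build: gcc -O2 -o fr0-watchdog fr0-watchdog.c */
  #include <stdio.h>

  int main(void)
  {
          double sentinel = 1234.5678, seen;
          unsigned long i;

          for (i = 0; i < 100000000UL; i++) {
                  asm volatile(
                          "fmr   0, %1\n\t"   /* fr0 = sentinel     */
                          "li    9, 1000\n\t"
                          "mtctr 9\n\t"
                          "1:    bdnz 1b\n\t" /* spin, fr0 live     */
                          "fmr   %0, 0\n\t"   /* read fr0 back      */
                          : "=f" (seen)
                          : "f" (sentinel)
                          : "r9", "ctr", "fr0");
                  if (seen != sentinel) {
                          printf("fr0 clobbered at iteration %lu: %.4f\n",
                                 i, seen);
                          return 1;
                  }
          }
          puts("no fr0 corruption observed");
          return 0;
  }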

perl and bash seem to be affected to some degree, though current builds don't 
execute VSX instructions frequently enough to cause crashes with any 
significant probability.  That said, over many years of running POWER at 
datacenter scale I have seen enough random bash/perl crashes in the logs to 
recognize the pattern; I think this has been a low-grade issue for a long time, 
but with an infinitesimally small chance of happening it was seen as random 
noise / hardware issues / other rare bugs in various programs.
