On Tue Nov 21, 2023 at 2:45 AM AEST, Timothy Pearson wrote:
>
>
> ----- Original Message -----
> > From: "Michael Ellerman" <m...@ellerman.id.au>
> > To: "Timothy Pearson" <tpear...@raptorengineering.com>, "Jens Axboe" 
> > <ax...@kernel.dk>, "regressions"
> > <regressi...@lists.linux.dev>, "npiggin" <npig...@gmail.com>, "christophe 
> > leroy" <christophe.le...@csgroup.eu>,
> > "linuxppc-dev" <linuxppc-dev@lists.ozlabs.org>
> > Sent: Monday, November 20, 2023 1:10:06 AM
> > Subject: Re: [PATCH v2] powerpc: Don't clobber fr0/vs0 during fp|altivec 
> > register  save
>
> > Hi Timothy,
> > 
> > Great work debugging this. I think your fix is good, but I want to 
> > understand it
> > a bit more to make sure I can explain why we haven't seen it outside of
> > io-uring.
> > If this can be triggered outside of io-uring then I have even more 
> > backporting
> > in my future :}
> > 
> > Typically save_fpu() is called from __giveup_fpu() which saves the FP regs 
> > and
> > also *turns off FP* in the task's MSR, meaning the kernel will reload the FP 
> > regs
> > from the thread struct before letting the task use FP again. So in that case
> > save_fpu() is free to clobber fr0 because the FP regs no longer hold live 
> > values
> > for the task.
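> >
> > Roughly, __giveup_fpu() amounts to something like this (a sketch of the
> > shape of it, not the exact kernel source):
> >
> >   static void __giveup_fpu(struct task_struct *tsk)
> >   {
> >           unsigned long msr;
> >
> >           save_fpu(tsk);                          /* may clobber fr0/vs0 */
> >           msr = tsk->thread.regs->msr;
> >           msr &= ~(MSR_FP | MSR_FE0 | MSR_FE1);   /* FP no longer live   */
> >           if (cpu_has_feature(CPU_FTR_VSX))
> >                   msr &= ~MSR_VSX;
> >           regs_set_return_msr(tsk->thread.regs, msr);
> >   }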
> > 
> > There is another case though, which is the path via:
> >  copy_process()
> >    dup_task_struct()
> >      arch_dup_task_struct()
> >        flush_all_to_thread()
> >          save_all()
> > 
> > That path saves the FP regs but leaves them live. That's meant as an
> > optimisation for a process that's using FP/VSX and then calls fork(), 
> > leaving
> > the regs live means the parent process doesn't have to take a fault after 
> > the
> > fork to get its FP regs back.
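> >
> > By contrast, save_all() has this shape (again just a sketch):
> >
> >   static void save_all(struct task_struct *tsk)
> >   {
> >           unsigned long usermsr = tsk->thread.regs->msr;
> >
> >           if (usermsr & MSR_FP)
> >                   save_fpu(tsk);          /* clobbers fr0/vs0 */
> >           if (usermsr & MSR_VEC)
> >                   save_altivec(tsk);
> >
> >           /* Note: regs->msr is left alone, MSR_FP/MSR_VEC stay set, so
> >            * the register file is still considered live for the task.
> >            */
> >   }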
> > 
> > That path does clobber fr0, but fr0 is volatile across a syscall, and the 
> > only
> > way to reach copy_process() from userspace is via a syscall. So in normal 
> > usage
> > fr0 being clobbered across a syscall shouldn't cause data corruption.
> > 
> > Even if we handle a signal on the return from the fork() syscall, the worst 
> > that
> > happens is that the task's thread struct holds the clobbered fr0, but the 
> > task
> > doesn't care (because fr0 is volatile across the syscall anyway).
> > 
> > That path is something like:
> > 
> > system_call_vectored_common()
> >  system_call_exception()
> >    sys_fork()
> >      kernel_clone()
> >        copy_process()
> >          dup_task_struct()
> >            arch_dup_task_struct()
> >              flush_all_to_thread()
> >                save_all()
> >                  if (tsk->thread.regs->msr & MSR_FP)
> >                    save_fpu()
> >                    # does not clear MSR_FP from regs->msr
> >  syscall_exit_prepare()
> >    interrupt_exit_user_prepare_main()
> >      do_notify_resume()
> >        get_signal()
> >        handle_rt_signal64()
> >          prepare_setup_sigcontext()
> >            flush_fp_to_thread()
> >              if (tsk->thread.regs->msr & MSR_FP)
> >                giveup_fpu()
> >                  __giveup_fpu
> >                    save_fpu()
> >                    # clobbered fr0 is saved, but task considers it volatile
> >                    # across syscall anyway
> > 
> > 
> > But we now have a new path, because io-uring can call copy_process() via
> > create_io_thread() from the signal handling path. That's OK if the signal is
> > handled as we return from a syscall, but it's not OK if the signal is 
> > handled
> > due to some other interrupt.
> > 
> > Which is:
> > 
> > interrupt_return_srr_user()
> >  interrupt_exit_user_prepare()
> >    interrupt_exit_user_prepare_main()
> >      do_notify_resume()
> >        get_signal()
> >          task_work_run()
> >            create_worker_cb()
> >              create_io_worker()
> >                copy_process()
> >                  dup_task_struct()
> >                    arch_dup_task_struct()
> >                      flush_all_to_thread()
> >                        save_all()
> >                          if (tsk->thread.regs->msr & MSR_FP)
> >                            save_fpu()
> >                            # fr0 is clobbered and potentially live in
> >                            # userspace
> > 
> > 
> > So tldr I think the corruption is only an issue since io-uring started doing
> > the clone via signal, which I think matches the observed timeline of this 
> > bug
> > appearing.
>
> I agree the corruption really only started showing up in earnest on io_uring 
> clone-via-signal, as this was confirmed several times in the course of 
> debugging.  Bear in mind however that we have seen very, very rare crashes 
> over several years in other tasks, and I am starting to think this bug might 
> be the root cause (see below).
>
> Note as well that I may very well have a wrong call order in the commit 
> message, since I was relying on a couple of WARN_ON() macros I inserted to 
> check for a similar (but not identical) condition and didn't spend much time 
> getting new traces after identifying the root cause.
>
> I went back and grabbed some real world system-wide stack traces, since I now 
> know what to trigger on.  A typical example is:
>
> interrupt_return_srr_user()
>  interrupt_exit_user_prepare()
>   interrupt_exit_user_prepare_main()
>    schedule()
>     __schedule()
>      __switch_to()
>       giveup_all()
>        # tsk->thread.regs->msr MSR_FP is still set here
>        __giveup_fpu()
>         save_fpu()
>         # fr0 is clobbered and potentially live in userspace
>
> This indicates that the pending signal is not actually required; it simply 
> makes triggering much more likely.  Both the pending signal and the 
> _TIF_NEED_RESCHED paths give up control and end up in the same overall 
> position of triggering down-call routines that assume the FPU state is valid.

That seems probably true, but that's on the other side of the equation,
I think? Because __giveup_fpu does clear MSR[FP] before any context
switch or return to user is possible.

*After* we have clobbered fr0 without clearing MSR_FP from the user MSR,
then yes any context switch or return to user will cause userspace to
next run with a clobbered fr0. But getting to that fr0/MSR state
requires the flush_all_to_thread(), and that needs to be called in a
path where user fr0 must not be clobbered. I don't see one other than
io-uring yet.
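
To spell the distinction out, in the same shape as the call traces above:

giveup_all()                        # __switch_to / context switch path
  __giveup_fpu()
    save_fpu()
    # fr0 is clobbered, but MSR_FP is also cleared, so the task faults
    # and reloads the correctly saved FP state before it can see it

flush_all_to_thread()               # fork / io-uring copy_process() path
  save_all()
    save_fpu()
    # fr0 is clobbered and MSR_FP stays set, so the stale fr0 in the
    # register file is live when the task next runs in userspace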

[ KVM via kvmppc_save_user_regs() (which is basically the same as
flush_all_to_thread()) can do it. Not via the fr0 clobber in save_fpu,
because this is called via a VCPU run ioctl, but KVM will later clobber
all FP/VEC registers via running guest code. So we clobber non-volatile
regs as well. This wouldn't be causing random corruption and crashes
though, only possibly bugs in code that runs KVM guests. ]

Thanks,
Nick

>
> perl and bash seem to be affected to some degree, though current builds don't 
> use enough VSX instructions rapidly enough to cause crashes with any 
> significant probability.  That said, over many years of running POWER at 
> datacenter scale I have seen enough random bash/perl crashes in the logs to 
> recognize the pattern; I think this has been a low-grade issue for a long 
time, but with an infinitesimally small chance of happening it was seen as 
> random noise / hardware issues / other rare bugs in various programs.
