On Tue Nov 21, 2023 at 2:45 AM AEST, Timothy Pearson wrote:
>
> ----- Original Message -----
> > From: "Michael Ellerman" <m...@ellerman.id.au>
> > To: "Timothy Pearson" <tpear...@raptorengineering.com>, "Jens Axboe" <ax...@kernel.dk>,
> >     "regressions" <regressi...@lists.linux.dev>, "npiggin" <npig...@gmail.com>,
> >     "christophe leroy" <christophe.le...@csgroup.eu>,
> >     "linuxppc-dev" <linuxppc-dev@lists.ozlabs.org>
> > Sent: Monday, November 20, 2023 1:10:06 AM
> > Subject: Re: [PATCH v2] powerpc: Don't clobber fr0/vs0 during fp|altivec register save
> >
> > Hi Timothy,
> >
> > Great work debugging this. I think your fix is good, but I want to
> > understand it a bit more to make sure I can explain why we haven't seen it
> > outside of io-uring. If this can be triggered outside of io-uring then I
> > have even more backporting in my future :}
> >
> > Typically save_fpu() is called from __giveup_fpu(), which saves the FP regs
> > and also *turns off FP* in the task's MSR, meaning the kernel will reload
> > the FP regs from the thread struct before letting the task use FP again.
> > So in that case save_fpu() is free to clobber fr0 because the FP regs no
> > longer hold live values for the task.
> >
> > There is another case though, which is the path via:
> >
> >   copy_process()
> >     dup_task_struct()
> >       arch_dup_task_struct()
> >         flush_all_to_thread()
> >           save_all()
> >
> > That path saves the FP regs but leaves them live. That's meant as an
> > optimisation for a process that's using FP/VSX and then calls fork();
> > leaving the regs live means the parent process doesn't have to take a
> > fault after the fork to get its FP regs back.
> >
> > That path does clobber fr0, but fr0 is volatile across a syscall, and the
> > only way to reach copy_process() from userspace is via a syscall. So in
> > normal usage fr0 being clobbered across a syscall shouldn't cause data
> > corruption.
> >
> > Even if we handle a signal on the return from the fork() syscall, the
> > worst that happens is that the task's thread struct holds the clobbered
> > fr0, but the task doesn't care (because fr0 is volatile across the syscall
> > anyway).
> >
> > That path is something like:
> >
> >   system_call_vectored_common()
> >     system_call_exception()
> >       sys_fork()
> >         kernel_clone()
> >           copy_process()
> >             dup_task_struct()
> >               arch_dup_task_struct()
> >                 flush_all_to_thread()
> >                   save_all()
> >                     if (tsk->thread.regs->msr & MSR_FP)
> >                       save_fpu()
> >                       # does not clear MSR_FP from regs->msr
> >     syscall_exit_prepare()
> >       interrupt_exit_user_prepare_main()
> >         do_notify_resume()
> >           get_signal()
> >             handle_rt_signal64()
> >               prepare_setup_sigcontext()
> >                 flush_fp_to_thread()
> >                   if (tsk->thread.regs->msr & MSR_FP)
> >                     giveup_fpu()
> >                       __giveup_fpu()
> >                         save_fpu()
> >                         # clobbered fr0 is saved, but task considers it
> >                         # volatile across syscall anyway
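> >
> > To spell out the difference in (simplified) code -- roughly what
> > arch/powerpc/kernel/process.c does, with the VSX bits and the newer
> > regs_set_return_msr() accessor omitted, so take this as a sketch rather
> > than the exact source:
> >
> >	static void __giveup_fpu(struct task_struct *tsk)
> >	{
> >		save_fpu(tsk);	/* free to use fr0/vs0 as scratch, because... */
> >
> >		/* ...FP is then turned off for the task, so the regs will be
> >		 * reloaded from the thread struct before it uses FP again: */
> >		tsk->thread.regs->msr &= ~(MSR_FP | MSR_FE0 | MSR_FE1);
> >	}
> >
> >	static void save_all(struct task_struct *tsk)
> >	{
> >		unsigned long usermsr = tsk->thread.regs->msr;
> >
> >		if (usermsr & MSR_FP)
> >			save_fpu(tsk);		/* MSR_FP stays set: regs stay live */
> >		if (usermsr & MSR_VEC)
> >			save_altivec(tsk);	/* likewise, MSR_VEC stays set */
> >	}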
> >
> > But we now have a new path, because io-uring can call copy_process() via
> > create_io_thread() from the signal handling path. That's OK if the signal
> > is handled as we return from a syscall, but it's not OK if the signal is
> > handled due to some other interrupt.
> >
> > Which is:
> >
> >   interrupt_return_srr_user()
> >     interrupt_exit_user_prepare()
> >       interrupt_exit_user_prepare_main()
> >         do_notify_resume()
> >           get_signal()
> >             task_work_run()
> >               create_worker_cb()
> >                 create_io_worker()
> >                   copy_process()
> >                     dup_task_struct()
> >                       arch_dup_task_struct()
> >                         flush_all_to_thread()
> >                           save_all()
> >                             if (tsk->thread.regs->msr & MSR_FP)
> >                               save_fpu()
> >                               # fr0 is clobbered and potentially live in
> >                               # userspace
> >
> > So tldr I think the corruption is only an issue since io-uring started
> > doing the clone via signal, which I think matches the observed timeline of
> > this bug appearing.
>
> I agree the corruption really only started showing up in earnest on io_uring
> clone-via-signal, as this was confirmed several times in the course of
> debugging. Bear in mind however that we have seen very, very rare crashes
> over several years in other tasks, and I am starting to think this bug might
> be the root cause (see below).
>
> Note as well that I may very well have the wrong call order in the commit
> message, since I was relying on a couple of WARN_ON() macros I inserted to
> check for a similar (but not identical) condition, and didn't spend much
> time getting new traces after identifying the root cause.
>
> I went back and grabbed some real-world system-wide stack traces, since I
> now know what to trigger on. A typical example is:
>
>   interrupt_return_srr_user()
>     interrupt_exit_user_prepare()
>       interrupt_exit_user_prepare_main()
>         schedule()
>           __schedule()
>             __switch_to()
>               giveup_all()
>                 # tsk->thread.regs->msr MSR_FP is still set here
>                 __giveup_fpu()
>                   save_fpu()
>                   # fr0 is clobbered and potentially live in userspace
>
> This indicates that the pending signal is not actually required; it simply
> makes triggering much more likely. Both the pending signal and the
> _TIF_NEED_RESCHED paths give up control and end up in the same overall
> position of triggering down-call routines that assume the FPU state is
> valid.
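[ Background for anyone joining the thread here: the clobber itself is the
  FPSCR save at the end of save_fpu() in arch/powerpc/kernel/fpu.S, which
  uses fr0 as scratch *after* the register image has been written out. From
  memory, simplified:

	SAVE_32FPVSRS(0, R4, R6)	# f0..f31/vs0..vs31 -> thread_struct
	mffs	fr0			# read FPSCR via fr0: live fr0 (and so
	stfd	fr0, FPSTATE_FPSCR(r6)	# vs0) clobbered; the saved image is
	blr				# consistent, the live register is not

  So the state saved in the thread struct is fine; only the live register is
  trashed, which is why nothing goes wrong whenever MSR_FP is also cleared
  and the regs get reloaded before the task runs again. ]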
That seems probably true, but that's on the other side of the equation, I
think? Because __giveup_fpu() does clear MSR[FP] before any context switch or
return to user is possible.

*After* we have clobbered fr0 without clearing MSR_FP from the user msr, then
yes, any context switch or return to user will cause userspace to next run
with a clobbered fr0. But getting to that fr0/msr state requires the
flush_all_to_thread(), and that needs to be called in a path where user fr0
must not be clobbered. I don't see one other than io-uring yet.

[ KVM via kvmppc_save_user_regs() (which is basically the same as
  flush_all_to_thread()) can do it. Not via the fr0 clobber in save_fpu(),
  because this is called via a VCPU run ioctl, but KVM will later clobber all
  FP/VEC registers via running guest code. So we clobber non-volatile regs as
  well. This wouldn't be causing random corruption and crashes though, only
  possibly bugs in code that runs KVM guests. ]

Thanks,
Nick

> perl and bash seem to be affected to some degree, though current builds
> don't use enough VSX instructions rapidly enough to cause crashes with any
> significant probability. That said, over many years of running POWER at
> datacenter scale I have seen enough random bash/perl crashes in the logs to
> recognize the pattern; I think this has been a low-grade issue for a long
> time, but with an infinitesimally small chance of happening it was seen as
> random noise / hardware issues / other rare bugs in various programs.
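[ For reference, the kvmppc_save_user_regs() mentioned above (in
  arch/powerpc/kernel/process.c) mirrors save_all(): it writes the FP/VEC
  state out but leaves MSR_FP/MSR_VEC set, so the registers stay live for
  the task. A simplified sketch (TM handling omitted; details vary by kernel
  version):

	void kvmppc_save_user_regs(void)
	{
		unsigned long usermsr;

		if (!current->thread.regs)
			return;

		usermsr = current->thread.regs->msr;

		if (usermsr & MSR_FP)
			save_fpu(current);	/* like save_all(): MSR_FP left set */

		if (usermsr & MSR_VEC)
			save_altivec(current);	/* regs stay live for the task */
	} ]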