Re: [PATCH v2] powerpc: Don't clobber fr0/vs0 during fp|altivec register save
On Sun, 19 Nov 2023 09:18:02 -0600, Timothy Pearson wrote:
> During floating point and vector save to thread data fr0/vs0 are clobbered
> by the FPSCR/VSCR store routine.
>
> [...]

Applied to powerpc/fixes.

[1/1] powerpc: Don't clobber fr0/vs0 during fp|altivec register save
      https://git.kernel.org/powerpc/c/5e1d824f9a283cbf90f25241b66d1f69adb3835b

cheers
Re: [PATCH v2] powerpc: Don't clobber fr0/vs0 during fp|altivec register save
----- Original Message -----
> From: "Michael Ellerman"
> To: "Timothy Pearson"
> Cc: "Jens Axboe" , "regressions" , "npiggin" ,
>     "christophe leroy" , "linuxppc-dev"
> Sent: Tuesday, November 28, 2023 6:57:01 AM
> Subject: Re: [PATCH v2] powerpc: Don't clobber fr0/vs0 during fp|altivec register save

> Michael Ellerman writes:
>> Timothy Pearson writes:
>>
>>> Just wanted to check back and see if this patch was going to be queued
>>> up soon? We're still having to work around / advertise the data
>>> destruction issues the underlying bug is causing on e.g. Debian
>>> Stable.
>>
>> Yeah I'll apply it this week, so it will be in rc4.
>
> I reworked the change log to include the exact call path I identified
> instead of the more high level description you had. And tweaked a few
> other bits of wording and so on; apparently fr0 is a kernelism, the ABI
> and binutils call it f0.
>
> I'm not sure how wedded you were to your change log, so if you dislike
> my edits let me know and we can come up with a joint one.
>
> The actual patch is unchanged.
>
> cheers

The commit message looks OK to me.

I've also seen application crashes as a result of the register corruption,
but that may be a minor detail that isn't really worth updating things over
at this point -- those crashes come from e.g. glibc using vs0 on a path that
processes pointer information, typically where the same pointer needs to be
replicated into adjacent fields of a data structure.
Re: [PATCH v2] powerpc: Don't clobber fr0/vs0 during fp|altivec register save
Michael Ellerman writes:
> Timothy Pearson writes:
>
>> Just wanted to check back and see if this patch was going to be queued
>> up soon? We're still having to work around / advertise the data
>> destruction issues the underlying bug is causing on e.g. Debian
>> Stable.
>
> Yeah I'll apply it this week, so it will be in rc4.

I reworked the change log to include the exact call path I identified
instead of the more high level description you had. And tweaked a few
other bits of wording and so on; apparently fr0 is a kernelism, the ABI
and binutils call it f0.

I'm not sure how wedded you were to your change log, so if you dislike
my edits let me know and we can come up with a joint one.

The actual patch is unchanged.

cheers

From 5e1d824f9a283cbf90f25241b66d1f69adb3835b Mon Sep 17 00:00:00 2001
From: Timothy Pearson
Date: Sun, 19 Nov 2023 09:18:02 -0600
Subject: [PATCH] powerpc: Don't clobber f0/vs0 during fp|altivec register save

During floating point and vector save to thread data f0/vs0 are clobbered
by the FPSCR/VSCR store routine. This has been observed to lead to
userspace register corruption and application data corruption with
io-uring.

Fix it by restoring f0/vs0 after FPSCR/VSCR store has completed for all
the FP, altivec, VMX register save paths.

Tested under QEMU in kvm mode, running on a Talos II workstation with
dual POWER9 DD2.2 CPUs.

Additional detail (mpe):

Typically save_fpu() is called from __giveup_fpu() which saves the FP
regs and also *turns off FP* in the tasks MSR, meaning the kernel will
reload the FP regs from the thread struct before letting the task use FP
again. So in that case save_fpu() is free to clobber f0 because the FP
regs no longer hold live values for the task.

There is another case though, which is the path via:
  sys_clone()
    ...
    copy_process()
      dup_task_struct()
        arch_dup_task_struct()
          flush_all_to_thread()
            save_all()

That path saves the FP regs but leaves them live. That's meant as an
optimisation for a process that's using FP/VSX and then calls fork(),
leaving the regs live means the parent process doesn't have to take a
fault after the fork to get its FP regs back. The optimisation was added
in commit 8792468da5e1 ("powerpc: Add the ability to save FPU without
giving it up").

That path does clobber f0, but f0 is volatile across function calls,
and typically programs reach copy_process() from userspace via a syscall
wrapper function. So in normal usage f0 being clobbered across a syscall
doesn't cause visible data corruption.

But there is now a new path, because io-uring can call copy_process()
via create_io_thread() from the signal handling path. That's OK if the
signal is handled as part of syscall return, but it's not OK if the
signal is handled due to some other interrupt.

That path is:
  interrupt_return_srr_user()
    interrupt_exit_user_prepare()
      interrupt_exit_user_prepare_main()
        do_notify_resume()
          get_signal()
            task_work_run()
              create_worker_cb()
                create_io_worker()
                  copy_process()
                    dup_task_struct()
                      arch_dup_task_struct()
                        flush_all_to_thread()
                          save_all()
                            if (tsk->thread.regs->msr & MSR_FP)
                              save_fpu()
                              # f0 is clobbered and potentially live in userspace

Note the above discussion applies equally to save_altivec().
Fixes: 8792468da5e1 ("powerpc: Add the ability to save FPU without giving it up")
Cc: sta...@vger.kernel.org # v4.6+
Closes: https://lore.kernel.org/all/480932026.45576726.1699374859845.javamail.zim...@raptorengineeringinc.com/
Closes: https://lore.kernel.org/linuxppc-dev/480221078.47953493.1700206777956.javamail.zim...@raptorengineeringinc.com/
Tested-by: Timothy Pearson
Tested-by: Jens Axboe
Signed-off-by: Timothy Pearson
[mpe: Reword change log to describe exact path of corruption & other minor tweaks]
Signed-off-by: Michael Ellerman
Link: https://msgid.link/1921539696.48534988.1700407082933.javamail.zim...@raptorengineeringinc.com
---
 arch/powerpc/kernel/fpu.S    | 13 +++++++++++++
 arch/powerpc/kernel/vector.S |  2 ++
 2 files changed, 15 insertions(+)

diff --git a/arch/powerpc/kernel/fpu.S b/arch/powerpc/kernel/fpu.S
index 6a9acfb690c9..2f8f3f93cbb6 100644
--- a/arch/powerpc/kernel/fpu.S
+++ b/arch/powerpc/kernel/fpu.S
@@ -23,6 +23,15 @@
 #include
 #ifdef CONFIG_VSX
+#define __REST_1FPVSR(n,c,base)			\
+BEGIN_FTR_SECTION				\
+	b	2f;				\
+END_FTR_SECTION_IFSET(CPU_FTR_VSX);		\
+	REST_FPR(n,base);			\
+	b	3f;				\
+2:	REST_VSR(n,c,base);
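To make the changelog's ABI argument concrete: f0/vs0 is a volatile
(caller-saved) register in the powerpc ELF ABI, so compiled code never
expects it to survive a function call such as a syscall wrapper. Below is a
minimal sketch of that assumption; the function name, values and comments are
illustrative only and are not taken from the patch or this thread:

    #include <unistd.h>

    /* Because f0 is caller-saved, any value still needed after a call must be
     * spilled to memory or kept in a nonvolatile register (f14-f31), so a
     * kernel that clobbers f0 across the fork() syscall corrupts nothing the
     * program is entitled to rely on. */
    double scaled_after_fork(double x)
    {
    	double y = x * 2.0;	/* intermediate result, possibly held in f0 */

    	fork();			/* the pre-fix save_all() path may clobber f0 here */

    	return y + 1.0;		/* y is preserved per the ABI, not via f0 */
    }

The io-uring path is dangerous precisely because it breaks this assumption:
the clone happens from an interrupt taken at an arbitrary instruction, where
f0/vs0 may still hold a value the interrupted code has not yet consumed.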
Re: [PATCH v2] powerpc: Don't clobber fr0/vs0 during fp|altivec register save
On Tue Nov 28, 2023 at 10:59 AM AEST, Michael Ellerman wrote:
> Christophe Leroy writes:
> > Le 27/11/2023 à 19:39, Timothy Pearson a écrit :
> >> Just wanted to check back and see if this patch was going to be
> >> queued up soon? We're still having to work around / advertise the
> >> data destruction issues the underlying bug is causing on e.g. Debian
> >> Stable.
> >
> > Has any agreement been reached on the final solution? Seeing the many
> > discussions on patch v2 I had the feeling that it was not the final solution.
>
> The actual patch is fine I think.
>
> The discussion was about improving the explanation of exactly what's
> happening in the change log, and whether there is a larger bug causing
> FP corruption unrelated to io-uring.

One thing I said is maybe we should remove the "register save" concept
entirely and always giveup. But that would not be suitable for a minimal
fix, and I didn't mean it as an alternate fix.

Even if we did always giveup in future, this patch still gives a nicer API.
There would have to be a noticeable performance advantage to not restoring
fr0/vs0 after saving before we'd think about changing it back to clobber,
since that encourages foot shooting.

Thanks,
Nick
Re: [PATCH v2] powerpc: Don't clobber fr0/vs0 during fp|altivec register save
Christophe Leroy writes:
> Le 27/11/2023 à 19:39, Timothy Pearson a écrit :
>> Just wanted to check back and see if this patch was going to be
>> queued up soon? We're still having to work around / advertise the
>> data destruction issues the underlying bug is causing on e.g. Debian
>> Stable.
>
> Has any agreement been reached on the final solution? Seeing the many
> discussions on patch v2 I had the feeling that it was not the final solution.

The actual patch is fine I think.

The discussion was about improving the explanation of exactly what's
happening in the change log, and whether there is a larger bug causing
FP corruption unrelated to io-uring.

I'm now reasonably confident there's no detectable corruption of fr0
happening except via the io-uring -> clone path.

It's still a bad bug for us to corrupt fr0 across sys_clone(), but in
practice it doesn't affect userspace because fr0 is volatile across
function calls.

cheers
Re: [PATCH v2] powerpc: Don't clobber fr0/vs0 during fp|altivec register save
Timothy Pearson writes:
> Just wanted to check back and see if this patch was going to be queued
> up soon? We're still having to work around / advertise the data
> destruction issues the underlying bug is causing on e.g. Debian
> Stable.

Yeah I'll apply it this week, so it will be in rc4.

I wanted to be more certain the corruption only happens in practice with
io-uring before applying it.

cheers
Re: [PATCH v2] powerpc: Don't clobber fr0/vs0 during fp|altivec register save
Hi,

Le 27/11/2023 à 19:39, Timothy Pearson a écrit :
> Just wanted to check back and see if this patch was going to be queued up
> soon? We're still having to work around / advertise the data destruction
> issues the underlying bug is causing on e.g. Debian Stable.

Has any agreement been reached on the final solution? Seeing the many
discussions on patch v2 I had the feeling that it was not the final solution.

Christophe
Re: [PATCH v2] powerpc: Don't clobber fr0/vs0 during fp|altivec register save
Just wanted to check back and see if this patch was going to be queued up soon? We're still having to work around / advertise the data destruction issues the underlying bug is causing on e.g. Debian Stable. Thanks!
Re: [PATCH v2] powerpc: Don't clobber fr0/vs0 during fp|altivec register save
- Original Message - > From: "Michael Ellerman" > To: "Timothy Pearson" > Cc: "Jens Axboe" , "regressions" > , "npiggin" , > "christophe leroy" , "linuxppc-dev" > > Sent: Tuesday, November 21, 2023 11:01:50 PM > Subject: Re: [PATCH v2] powerpc: Don't clobber fr0/vs0 during fp|altivec > register save > Timothy Pearson writes: >> > ... >> >> So a little more detail on this, just to put it to rest properly vs. >> assuming hand analysis caught every possible pathway. :) >> >> The debugging that generates this stack trace also verifies the following in >> __giveup_fpu(): >> >> 1.) tsk->thread.fp_state.fpr doesn't contain the FPSCR contents prior to >> calling >> save_fpu() >> 2.) tsk->thread.fp_state.fpr contains the FPSCR contents directly after >> calling >> save_fpu() >> 3.) MSR_FP is set both in the task struct and in the live MSR. >> >> Only if all three conditions are met will it generate the trace. This >> is a generalization of the hack I used to find the problem in the >> first place. >> >> If the state will subsequently be reloaded from the thread struct, >> that means we're reloading the registers from the thread struct that >> we just verified was corrupted by the earlier save_fpu() call. There >> are only two ways I can see for that to be true -- one is if the >> registers were already clobbered when giveup_all() was entered, and >> the other is if save_fpu() went ahead and clobbered them right here >> inside giveup_all(). >> >> To see which scenario we were dealing with, I added a bit more >> instrumentation to dump the current register state if MSR_FP bit was >> already set in registers (i.e. not dumping data from task struct, but >> using the live FPU registers instead), and sure enough the registers >> are corrupt on entry, so something else has already called save_fpu() >> before we even hit giveup_all() in this call chain. > > Can you share the debug patch you're using? > > cheers Sure, here you go. Note that with my FPU patch there is no WARN_ON hit, at least in my testing, so it isn't userspace purposefully loading the fr0/vs0 register with the FPSCR. 
diff --git a/arch/powerpc/kernel/process.c b/arch/powerpc/kernel/process.c
index 392404688cec..bde57dc3262a 100644
--- a/arch/powerpc/kernel/process.c
+++ b/arch/powerpc/kernel/process.c
@@ -154,7 +154,49 @@ static void __giveup_fpu(struct task_struct *tsk)
 {
 	unsigned long msr;
+	// DEBUGGING
+	uint64_t prev_fpr0 = *(((uint64_t*)(&tsk->thread.fp_state.fpr[0]))+0);
+	uint64_t prev_fpr1 = *(((uint64_t*)(&tsk->thread.fp_state.fpr[0]))+1);
+	struct thread_fp_state debug_fp_state;
+	unsigned long currentmsr = mfmsr();
+
+	if (currentmsr & MSR_FP) {
+		store_fp_state(&debug_fp_state);
+		load_fp_state(&debug_fp_state);
+	}
+
 	save_fpu(tsk);
+
+	// DEBUGGING
+	if (tsk->thread.regs->msr & MSR_FP) {
+		if (((*(((uint64_t*)(&tsk->thread.fp_state.fpr[0]))+0) == 0x82004000) && (prev_fpr0 != 0x82004000))
+		    || ((*(((uint64_t*)(&tsk->thread.fp_state.fpr[0]))+1) == 0x82004000) && (prev_fpr1 != 0x82004000)))
+		{
+			WARN_ON(1);
+
+			printk("[TS %lld] In __giveup_fpu() for process [comm: '%s' pid %d tid %d], before save current "
+			       "fp0: 0x%016llx/%016llx fp1: 0x%016llx/%016llx fp8: 0x%016llx/%016llx fp9: 0x%016llx/%016llx"
+			       " msr: 0x%016lx (FP %d VSX %d EE %d) on core %d\n",
+			       ktime_get_boottime_ns(), current->comm, current->pid, current->tgid,
+			       *(((uint64_t*)(&debug_fp_state.fpr[0]))+0), *(((uint64_t*)(&debug_fp_state.fpr[0]))+1),
+			       *(((uint64_t*)(&debug_fp_state.fpr[1]))+0), *(((uint64_t*)(&debug_fp_state.fpr[1]))+1),
+			       *(((uint64_t*)(&debug_fp_state.fpr[8]))+0), *(((uint64_t*)(&debug_fp_state.fpr[8]))+1),
+			       *(((uint64_t*)(&tsk->thread.fp_state.fpr[9]))+0), *(((uint64_t*)(&tsk->thread.fp_state.fpr[9]))+1),
+			       currentmsr, !!(currentmsr & MSR_FP), !!(currentmsr & MSR_VSX), !!(currentmsr & MSR_EE), raw_smp_processor_id());
+
+			printk("[TS %lld] In __giveup_fpu() for process [comm: '%s' pid %d tid %d], after save saved "
+			       "fp0: 0x%016llx/%016llx fp1: 0x%016llx/%016llx fp8: 0x%016llx/%016ll
Re: [PATCH v2] powerpc: Don't clobber fr0/vs0 during fp|altivec register save
Timothy Pearson writes: > ... > > So a little more detail on this, just to put it to rest properly vs. > assuming hand analysis caught every possible pathway. :) > > The debugging that generates this stack trace also verifies the following in > __giveup_fpu(): > > 1.) tsk->thread.fp_state.fpr doesn't contain the FPSCR contents prior to > calling save_fpu() > 2.) tsk->thread.fp_state.fpr contains the FPSCR contents directly after > calling save_fpu() > 3.) MSR_FP is set both in the task struct and in the live MSR. > > Only if all three conditions are met will it generate the trace. This > is a generalization of the hack I used to find the problem in the > first place. > > If the state will subsequently be reloaded from the thread struct, > that means we're reloading the registers from the thread struct that > we just verified was corrupted by the earlier save_fpu() call. There > are only two ways I can see for that to be true -- one is if the > registers were already clobbered when giveup_all() was entered, and > the other is if save_fpu() went ahead and clobbered them right here > inside giveup_all(). > > To see which scenario we were dealing with, I added a bit more > instrumentation to dump the current register state if MSR_FP bit was > already set in registers (i.e. not dumping data from task struct, but > using the live FPU registers instead), and sure enough the registers > are corrupt on entry, so something else has already called save_fpu() > before we even hit giveup_all() in this call chain. Can you share the debug patch you're using? cheers
Re: [PATCH v2] powerpc: Don't clobber fr0/vs0 during fp|altivec register save
On Tue Nov 21, 2023 at 11:23 AM AEST, Timothy Pearson wrote: > > > - Original Message - > > From: "Michael Ellerman" > > To: "Timothy Pearson" > > Cc: "Jens Axboe" , "regressions" > > , "npiggin" , > > "christophe leroy" , "linuxppc-dev" > > > > Sent: Monday, November 20, 2023 5:39:52 PM > > Subject: Re: [PATCH v2] powerpc: Don't clobber fr0/vs0 during fp|altivec > > register save > > > Timothy Pearson writes: > >> - Original Message - > >>> From: "Michael Ellerman" > > ... > >>> > >>> But we now have a new path, because io-uring can call copy_process() via > >>> create_io_thread() from the signal handling path. That's OK if the signal > >>> is > >>> handled as we return from a syscall, but it's not OK if the signal is > >>> handled > >>> due to some other interrupt. > >>> > >>> Which is: > >>> > >>> interrupt_return_srr_user() > >>> interrupt_exit_user_prepare() > >>>interrupt_exit_user_prepare_main() > >>> do_notify_resume() > >>>get_signal() > >>> task_work_run() > >>>create_worker_cb() > >>> create_io_worker() > >>>copy_process() > >>> dup_task_struct() > >>>arch_dup_task_struct() > >>> flush_all_to_thread() > >>>save_all() > >>> if (tsk->thread.regs->msr & MSR_FP) > >>>save_fpu() > >>># fr0 is clobbered and potentially live in > >>> userspace > >>> > >>> > >>> So tldr I think the corruption is only an issue since io-uring started > >>> doing > >>> the clone via signal, which I think matches the observed timeline of this > >>> bug > >>> appearing. > >> > >> I agree the corruption really only started showing up in earnest on > >> io_uring clone-via-signal, as this was confirmed several times in the > >> course of debugging. > > > > Thanks. > > > >> Note as well that I may very well have a wrong call order in the > >> commit message, since I was relying on a couple of WARN_ON() macros I > >> inserted to check for a similar (but not identical) condition and > >> didn't spend much time getting new traces after identifying the root > >> cause. > > > > Yep no worries. I'll reword it to incorporate the full path from my mail. > > > >> I went back and grabbed some real world system-wide stack traces, since I > >> now > >> know what to trigger on. A typical example is: > >> > >> interrupt_return_srr_user() > >> interrupt_exit_user_prepare() > >> interrupt_exit_user_prepare_main() > >>schedule() > >> __schedule() > >> __switch_to() > >> giveup_all() > >># tsk->thread.regs->msr MSR_FP is still set here > >>__giveup_fpu() > >> save_fpu() > >> # fr0 is clobbered and potentially live in userspace > > > > fr0 is not live there. > > > > __giveup_fpu() does roughly: > > > > msr = tsk->thread.regs->msr; > > msr &= ~(MSR_FP|MSR_FE0|MSR_FE1); > >msr &= ~MSR_VSX; > > tsk->thread.regs = msr; > > > > ie. it clears the FP etc. bits from the task's MSR. That means the FP > > state will be reloaded from the thread struct before the task is run again. > > > > Also on that path we're switching to another task, so we'll be reloading > > the other task's FP state before returning to userspace. > > > > So I don't see any bug there. > > Yeah, you're right. I was trying to get traces while doing something else, > and didn't think that all the way through, sorry. :) It's not going to be > super easy to get a real trace (I was triggering the WARN_ON() from of fr0 > getting set to to FPSCR), so let's just assume it's mainly the path you > manually found above and update the commit message accordingly. > > > There's only two places that call save_fpu() and skip the giveup logic, > > which is save_all() and kvmppc_save_
Re: [PATCH v2] powerpc: Don't clobber fr0/vs0 during fp|altivec register save
On Tue Nov 21, 2023 at 2:10 PM AEST, Timothy Pearson wrote: > - Original Message - > > From: "Michael Ellerman" > > To: "Timothy Pearson" > > Cc: "Jens Axboe" , "regressions" > > , "npiggin" , > > "christophe leroy" , "linuxppc-dev" > > > > Sent: Monday, November 20, 2023 5:39:52 PM > > Subject: Re: [PATCH v2] powerpc: Don't clobber fr0/vs0 during fp|altivec > > register save > > > Timothy Pearson writes: > >> - Original Message - > >>> From: "Michael Ellerman" > > ... > >>> > >>> But we now have a new path, because io-uring can call copy_process() via > >>> create_io_thread() from the signal handling path. That's OK if the signal > >>> is > >>> handled as we return from a syscall, but it's not OK if the signal is > >>> handled > >>> due to some other interrupt. > >>> > >>> Which is: > >>> > >>> interrupt_return_srr_user() > >>> interrupt_exit_user_prepare() > >>>interrupt_exit_user_prepare_main() > >>> do_notify_resume() > >>>get_signal() > >>> task_work_run() > >>>create_worker_cb() > >>> create_io_worker() > >>>copy_process() > >>> dup_task_struct() > >>>arch_dup_task_struct() > >>> flush_all_to_thread() > >>>save_all() > >>> if (tsk->thread.regs->msr & MSR_FP) > >>>save_fpu() > >>># fr0 is clobbered and potentially live in > >>> userspace > >>> > >>> > >>> So tldr I think the corruption is only an issue since io-uring started > >>> doing > >>> the clone via signal, which I think matches the observed timeline of this > >>> bug > >>> appearing. > >> > >> I agree the corruption really only started showing up in earnest on > >> io_uring clone-via-signal, as this was confirmed several times in the > >> course of debugging. > > > > Thanks. > > > >> Note as well that I may very well have a wrong call order in the > >> commit message, since I was relying on a couple of WARN_ON() macros I > >> inserted to check for a similar (but not identical) condition and > >> didn't spend much time getting new traces after identifying the root > >> cause. > > > > Yep no worries. I'll reword it to incorporate the full path from my mail. > > > >> I went back and grabbed some real world system-wide stack traces, since I > >> now > >> know what to trigger on. A typical example is: > >> > >> interrupt_return_srr_user() > >> interrupt_exit_user_prepare() > >> interrupt_exit_user_prepare_main() > >>schedule() > >> __schedule() > >> __switch_to() > >> giveup_all() > >># tsk->thread.regs->msr MSR_FP is still set here > >>__giveup_fpu() > >> save_fpu() > >> # fr0 is clobbered and potentially live in userspace > > > > fr0 is not live there. > > > ie. it clears the FP etc. bits from the task's MSR. That means the FP > > state will be reloaded from the thread struct before the task is run again. > > So a little more detail on this, just to put it to rest properly vs. assuming > hand analysis caught every possible pathway. :) > > The debugging that generates this stack trace also verifies the following in > __giveup_fpu(): > > 1.) tsk->thread.fp_state.fpr doesn't contain the FPSCR contents prior to > calling save_fpu() > 2.) tsk->thread.fp_state.fpr contains the FPSCR contents directly after > calling save_fpu() > 3.) MSR_FP is set both in the task struct and in the live MSR. > > Only if all three conditions are met will it generate the trace. This is a > generalization of the hack I used to find the problem in the first place. 
> > If the state will subsequently be reloaded from the thread struct, that means > we're reloading the registers from the thread struct that we just verified > was corrupted by the earlier save_fpu() call. There are only two ways I can > see for that to be true -- one is if the registers were already clobbered
Re: [PATCH v2] powerpc: Don't clobber fr0/vs0 during fp|altivec register save
- Original Message - > From: "Timothy Pearson" > To: "Michael Ellerman" > Cc: "Jens Axboe" , "regressions" > , "npiggin" , > "christophe leroy" , "linuxppc-dev" > > Sent: Monday, November 20, 2023 10:10:32 PM > Subject: Re: [PATCH v2] powerpc: Don't clobber fr0/vs0 during fp|altivec > register save > - Original Message - >> From: "Michael Ellerman" >> To: "Timothy Pearson" >> Cc: "Jens Axboe" , "regressions" >> , >> "npiggin" , >> "christophe leroy" , "linuxppc-dev" >> >> Sent: Monday, November 20, 2023 5:39:52 PM >> Subject: Re: [PATCH v2] powerpc: Don't clobber fr0/vs0 during fp|altivec >> register save > >> Timothy Pearson writes: >>> - Original Message - >>>> From: "Michael Ellerman" >> ... >>>> >>>> But we now have a new path, because io-uring can call copy_process() via >>>> create_io_thread() from the signal handling path. That's OK if the signal >>>> is >>>> handled as we return from a syscall, but it's not OK if the signal is >>>> handled >>>> due to some other interrupt. >>>> >>>> Which is: >>>> >>>> interrupt_return_srr_user() >>>> interrupt_exit_user_prepare() >>>>interrupt_exit_user_prepare_main() >>>> do_notify_resume() >>>>get_signal() >>>> task_work_run() >>>>create_worker_cb() >>>> create_io_worker() >>>>copy_process() >>>> dup_task_struct() >>>>arch_dup_task_struct() >>>> flush_all_to_thread() >>>>save_all() >>>> if (tsk->thread.regs->msr & MSR_FP) >>>>save_fpu() >>>># fr0 is clobbered and potentially live in >>>> userspace >>>> >>>> >>>> So tldr I think the corruption is only an issue since io-uring started >>>> doing >>>> the clone via signal, which I think matches the observed timeline of this >>>> bug >>>> appearing. >>> >>> I agree the corruption really only started showing up in earnest on >>> io_uring clone-via-signal, as this was confirmed several times in the >>> course of debugging. >> >> Thanks. >> >>> Note as well that I may very well have a wrong call order in the >>> commit message, since I was relying on a couple of WARN_ON() macros I >>> inserted to check for a similar (but not identical) condition and >>> didn't spend much time getting new traces after identifying the root >>> cause. >> >> Yep no worries. I'll reword it to incorporate the full path from my mail. >> >>> I went back and grabbed some real world system-wide stack traces, since I >>> now >>> know what to trigger on. A typical example is: >>> >>> interrupt_return_srr_user() >>> interrupt_exit_user_prepare() >>> interrupt_exit_user_prepare_main() >>>schedule() >>> __schedule() >>> __switch_to() >>> giveup_all() >>># tsk->thread.regs->msr MSR_FP is still set here >>>__giveup_fpu() >>> save_fpu() >>> # fr0 is clobbered and potentially live in userspace >> >> fr0 is not live there. > >> ie. it clears the FP etc. bits from the task's MSR. That means the FP >> state will be reloaded from the thread struct before the task is run again. > > So a little more detail on this, just to put it to rest properly vs. assuming > hand analysis caught every possible pathway. :) > > The debugging that generates this stack trace also verifies the following in > __giveup_fpu(): > > 1.) tsk->thread.fp_state.fpr doesn't contain the FPSCR contents prior to > calling > save_fpu() > 2.) tsk->thread.fp_state.fpr contains the FPSCR contents directly after > calling > save_fpu() > 3.) MSR_FP is set both in the task struct and in the live MSR. > > Only if all three conditions are met will it generate the trace. This is a > generalization of the hack I used to find the problem in the first place. 
> > If the state will subsequently
Re: [PATCH v2] powerpc: Don't clobber fr0/vs0 during fp|altivec register save
- Original Message - > From: "Michael Ellerman" > To: "Timothy Pearson" > Cc: "Jens Axboe" , "regressions" > , "npiggin" , > "christophe leroy" , "linuxppc-dev" > > Sent: Monday, November 20, 2023 5:39:52 PM > Subject: Re: [PATCH v2] powerpc: Don't clobber fr0/vs0 during fp|altivec > register save > Timothy Pearson writes: >> - Original Message - >>> From: "Michael Ellerman" > ... >>> >>> But we now have a new path, because io-uring can call copy_process() via >>> create_io_thread() from the signal handling path. That's OK if the signal is >>> handled as we return from a syscall, but it's not OK if the signal is >>> handled >>> due to some other interrupt. >>> >>> Which is: >>> >>> interrupt_return_srr_user() >>> interrupt_exit_user_prepare() >>>interrupt_exit_user_prepare_main() >>> do_notify_resume() >>>get_signal() >>> task_work_run() >>>create_worker_cb() >>> create_io_worker() >>>copy_process() >>> dup_task_struct() >>>arch_dup_task_struct() >>> flush_all_to_thread() >>>save_all() >>> if (tsk->thread.regs->msr & MSR_FP) >>>save_fpu() >>># fr0 is clobbered and potentially live in >>> userspace >>> >>> >>> So tldr I think the corruption is only an issue since io-uring started doing >>> the clone via signal, which I think matches the observed timeline of this >>> bug >>> appearing. >> >> I agree the corruption really only started showing up in earnest on >> io_uring clone-via-signal, as this was confirmed several times in the >> course of debugging. > > Thanks. > >> Note as well that I may very well have a wrong call order in the >> commit message, since I was relying on a couple of WARN_ON() macros I >> inserted to check for a similar (but not identical) condition and >> didn't spend much time getting new traces after identifying the root >> cause. > > Yep no worries. I'll reword it to incorporate the full path from my mail. > >> I went back and grabbed some real world system-wide stack traces, since I now >> know what to trigger on. A typical example is: >> >> interrupt_return_srr_user() >> interrupt_exit_user_prepare() >> interrupt_exit_user_prepare_main() >>schedule() >> __schedule() >> __switch_to() >> giveup_all() >># tsk->thread.regs->msr MSR_FP is still set here >>__giveup_fpu() >> save_fpu() >> # fr0 is clobbered and potentially live in userspace > > fr0 is not live there. > ie. it clears the FP etc. bits from the task's MSR. That means the FP > state will be reloaded from the thread struct before the task is run again. So a little more detail on this, just to put it to rest properly vs. assuming hand analysis caught every possible pathway. :) The debugging that generates this stack trace also verifies the following in __giveup_fpu(): 1.) tsk->thread.fp_state.fpr doesn't contain the FPSCR contents prior to calling save_fpu() 2.) tsk->thread.fp_state.fpr contains the FPSCR contents directly after calling save_fpu() 3.) MSR_FP is set both in the task struct and in the live MSR. Only if all three conditions are met will it generate the trace. This is a generalization of the hack I used to find the problem in the first place. If the state will subsequently be reloaded from the thread struct, that means we're reloading the registers from the thread struct that we just verified was corrupted by the earlier save_fpu() call. There are only two ways I can see for that to be true -- one is if the registers were already clobbered when giveup_all() was entered, and the other is if save_fpu() went ahead and clobbered them right here inside giveup_all(). 
To see which scenario we were dealing with, I added a bit more
instrumentation to dump the current register state if MSR_FP bit was
already set in registers (i.e. not dumping data from task struct, but
using the live FPU registers instead), and sure enough the registers
are corrupt on entry, so something else has already called save_fpu()
before we even hit giveup_all() in this call chain.

Unless I'm missing something, doesn't this effectively mean that anything
interrupting a task can hit this bug? Or, put another way, I'm seeing
several processes hit this exact call chain with the corrupt register
going back out to userspace without io_uring even in the mix, so there
seems to be another pathway in play. These traces are from a qemu guest,
in case it matters given the kvm path is possibly susceptible.

Just a few things to think about. The FPU patch itself definitely
resolves the problems; I used a sledgehammer approach *specifically* so
that there is no place for a rare call sequence we didn't consider to hit
it again down the line. :)
Re: [PATCH v2] powerpc: Don't clobber fr0/vs0 during fp|altivec register save
- Original Message - > From: "Michael Ellerman" > To: "Timothy Pearson" > Cc: "Jens Axboe" , "regressions" > , "npiggin" , > "christophe leroy" , "linuxppc-dev" > > Sent: Monday, November 20, 2023 5:39:52 PM > Subject: Re: [PATCH v2] powerpc: Don't clobber fr0/vs0 during fp|altivec > register save > Timothy Pearson writes: >> - Original Message - >>> From: "Michael Ellerman" > ... >>> >>> But we now have a new path, because io-uring can call copy_process() via >>> create_io_thread() from the signal handling path. That's OK if the signal is >>> handled as we return from a syscall, but it's not OK if the signal is >>> handled >>> due to some other interrupt. >>> >>> Which is: >>> >>> interrupt_return_srr_user() >>> interrupt_exit_user_prepare() >>>interrupt_exit_user_prepare_main() >>> do_notify_resume() >>>get_signal() >>> task_work_run() >>>create_worker_cb() >>> create_io_worker() >>>copy_process() >>> dup_task_struct() >>>arch_dup_task_struct() >>> flush_all_to_thread() >>>save_all() >>> if (tsk->thread.regs->msr & MSR_FP) >>>save_fpu() >>># fr0 is clobbered and potentially live in >>> userspace >>> >>> >>> So tldr I think the corruption is only an issue since io-uring started doing >>> the clone via signal, which I think matches the observed timeline of this >>> bug >>> appearing. >> >> I agree the corruption really only started showing up in earnest on >> io_uring clone-via-signal, as this was confirmed several times in the >> course of debugging. > > Thanks. > >> Note as well that I may very well have a wrong call order in the >> commit message, since I was relying on a couple of WARN_ON() macros I >> inserted to check for a similar (but not identical) condition and >> didn't spend much time getting new traces after identifying the root >> cause. > > Yep no worries. I'll reword it to incorporate the full path from my mail. > >> I went back and grabbed some real world system-wide stack traces, since I now >> know what to trigger on. A typical example is: >> >> interrupt_return_srr_user() >> interrupt_exit_user_prepare() >> interrupt_exit_user_prepare_main() >>schedule() >> __schedule() >> __switch_to() >> giveup_all() >># tsk->thread.regs->msr MSR_FP is still set here >>__giveup_fpu() >> save_fpu() >> # fr0 is clobbered and potentially live in userspace > > fr0 is not live there. > > __giveup_fpu() does roughly: > > msr = tsk->thread.regs->msr; > msr &= ~(MSR_FP|MSR_FE0|MSR_FE1); >msr &= ~MSR_VSX; > tsk->thread.regs = msr; > > ie. it clears the FP etc. bits from the task's MSR. That means the FP > state will be reloaded from the thread struct before the task is run again. > > Also on that path we're switching to another task, so we'll be reloading > the other task's FP state before returning to userspace. > > So I don't see any bug there. Yeah, you're right. I was trying to get traces while doing something else, and didn't think that all the way through, sorry. :) It's not going to be super easy to get a real trace (I was triggering the WARN_ON() from of fr0 getting set to to FPSCR), so let's just assume it's mainly the path you manually found above and update the commit message accordingly. > There's only two places that call save_fpu() and skip the giveup logic, > which is save_all() and kvmppc_save_user_regs(). Now that's interesting as well, since it might explain some issues I've seen for years on a specific QEMU workload. Once this is backported to stable I'll need to get the kernels updated on those boxes and see if the issues disappear...
Re: [PATCH v2] powerpc: Don't clobber fr0/vs0 during fp|altivec register save
On Tue Nov 21, 2023 at 9:39 AM AEST, Michael Ellerman wrote: > Timothy Pearson writes: > > - Original Message - > >> From: "Michael Ellerman" > ... > >> > >> But we now have a new path, because io-uring can call copy_process() via > >> create_io_thread() from the signal handling path. That's OK if the signal > >> is > >> handled as we return from a syscall, but it's not OK if the signal is > >> handled > >> due to some other interrupt. > >> > >> Which is: > >> > >> interrupt_return_srr_user() > >> interrupt_exit_user_prepare() > >>interrupt_exit_user_prepare_main() > >> do_notify_resume() > >>get_signal() > >> task_work_run() > >>create_worker_cb() > >> create_io_worker() > >>copy_process() > >> dup_task_struct() > >>arch_dup_task_struct() > >> flush_all_to_thread() > >>save_all() > >> if (tsk->thread.regs->msr & MSR_FP) > >>save_fpu() > >># fr0 is clobbered and potentially live in > >> userspace > >> > >> > >> So tldr I think the corruption is only an issue since io-uring started > >> doing > >> the clone via signal, which I think matches the observed timeline of this > >> bug > >> appearing. > > > > I agree the corruption really only started showing up in earnest on > > io_uring clone-via-signal, as this was confirmed several times in the > > course of debugging. > > Thanks. > > > Note as well that I may very well have a wrong call order in the > > commit message, since I was relying on a couple of WARN_ON() macros I > > inserted to check for a similar (but not identical) condition and > > didn't spend much time getting new traces after identifying the root > > cause. > > Yep no worries. I'll reword it to incorporate the full path from my mail. > > > I went back and grabbed some real world system-wide stack traces, since I > > now know what to trigger on. A typical example is: > > > > interrupt_return_srr_user() > > interrupt_exit_user_prepare() > > interrupt_exit_user_prepare_main() > >schedule() > > __schedule() > > __switch_to() > > giveup_all() > ># tsk->thread.regs->msr MSR_FP is still set here > >__giveup_fpu() > > save_fpu() > > # fr0 is clobbered and potentially live in userspace > > fr0 is not live there. > > __giveup_fpu() does roughly: > > msr = tsk->thread.regs->msr; > msr &= ~(MSR_FP|MSR_FE0|MSR_FE1); > msr &= ~MSR_VSX; > tsk->thread.regs = msr; > > ie. it clears the FP etc. bits from the task's MSR. That means the FP > state will be reloaded from the thread struct before the task is run again. > > Also on that path we're switching to another task, so we'll be reloading > the other task's FP state before returning to userspace. > > So I don't see any bug there. > > There's only two places that call save_fpu() and skip the giveup logic, > which is save_all() and kvmppc_save_user_regs(). > > save_all() is only called via clone() so I think that's unable to > actually cause visible register corruption as I described in my previous > mail. > > I thought the KVM case was similar, as it's called via an ioctl, but > I'll have to talk to Nick as his mail indicates otherwise. Yeah it can, because later on it runs the guest and that will clobber other regs. I reproduced fr14 corruption in the host easily with KVM selftests. It should just do a "giveup" on FP/VEC (as it does with TM). Thanks, Nick
Re: [PATCH v2] powerpc: Don't clobber fr0/vs0 during fp|altivec register save
On Tue Nov 21, 2023 at 2:45 AM AEST, Timothy Pearson wrote: > > > - Original Message - > > From: "Michael Ellerman" > > To: "Timothy Pearson" , "Jens Axboe" > > , "regressions" > > , "npiggin" , "christophe > > leroy" , > > "linuxppc-dev" > > Sent: Monday, November 20, 2023 1:10:06 AM > > Subject: Re: [PATCH v2] powerpc: Don't clobber fr0/vs0 during fp|altivec > > register save > > > Hi Timothy, > > > > Great work debugging this. I think your fix is good, but I want to > > understand it > > a bit more to make sure I can explain why we haven't seen it outside of > > io-uring. > > If this can be triggered outside of io-uring then I have even more > > backporting > > in my future :} > > > > Typically save_fpu() is called from __giveup_fpu() which saves the FP regs > > and > > also *turns off FP* in the tasks MSR, meaning the kernel will reload the FP > > regs > > from the thread struct before letting the task use FP again. So in that case > > save_fpu() is free to clobber fr0 because the FP regs no longer hold live > > values > > for the task. > > > > There is another case though, which is the path via: > > copy_process() > >dup_task_struct() > > arch_dup_task_struct() > >flush_all_to_thread() > > save_all() > > > > That path saves the FP regs but leaves them live. That's meant as an > > optimisation for a process that's using FP/VSX and then calls fork(), > > leaving > > the regs live means the parent process doesn't have to take a fault after > > the > > fork to get its FP regs back. > > > > That path does clobber fr0, but fr0 is volatile across a syscall, and the > > only > > way to reach copy_process() from userspace is via a syscall. So in normal > > usage > > fr0 being clobbered across a syscall shouldn't cause data corruption. > > > > Even if we handle a signal on the return from the fork() syscall, the worst > > that > > happens is that the task's thread struct holds the clobbered fr0, but the > > task > > doesn't care (because fr0 is volatile across the syscall anyway). > > > > That path is something like: > > > > system_call_vectored_common() > > system_call_exception() > >sys_fork() > > kernel_clone() > >copy_process() > > dup_task_struct() > >arch_dup_task_struct() > > flush_all_to_thread() > >save_all() > > if (tsk->thread.regs->msr & MSR_FP) > >save_fpu() > ># does not clear MSR_FP from regs->msr > > syscall_exit_prepare() > >interrupt_exit_user_prepare_main() > > do_notify_resume() > >get_signal() > >handle_rt_signal64() > > prepare_setup_sigcontext() > >flush_fp_to_thread() > > if (tsk->thread.regs->msr & MSR_FP) > >giveup_fpu() > > __giveup_fpu > >save_fpu() > ># clobbered fr0 is saved, but task considers it volatile > ># across syscall anyway > > > > > > But we now have a new path, because io-uring can call copy_process() via > > create_io_thread() from the signal handling path. That's OK if the signal is > > handled as we return from a syscall, but it's not OK if the signal is > > handled > > due to some other interrupt. 
> > > > Which is: > > > > interrupt_return_srr_user() > > interrupt_exit_user_prepare() > >interrupt_exit_user_prepare_main() > > do_notify_resume() > >get_signal() > > task_work_run() > >create_worker_cb() > > create_io_worker() > >copy_process() > > dup_task_struct() > >arch_dup_task_struct() > > flush_all_to_thread() > >save_all() > > if (tsk->thread.regs->msr & MSR_FP) > >save_fpu() > ># fr0 is clobbered and potentially live in > > userspace > > > > > > So tldr I think the corruption is only an issue since io-uring started doing > > the clone via signal, which I
Re: [PATCH v2] powerpc: Don't clobber fr0/vs0 during fp|altivec register save
Timothy Pearson writes:
> ----- Original Message -----
>> From: "Michael Ellerman"
...
>>
>> But we now have a new path, because io-uring can call copy_process() via
>> create_io_thread() from the signal handling path. That's OK if the signal is
>> handled as we return from a syscall, but it's not OK if the signal is handled
>> due to some other interrupt.
>>
>> Which is:
>>
>> interrupt_return_srr_user()
>>   interrupt_exit_user_prepare()
>>     interrupt_exit_user_prepare_main()
>>       do_notify_resume()
>>         get_signal()
>>           task_work_run()
>>             create_worker_cb()
>>               create_io_worker()
>>                 copy_process()
>>                   dup_task_struct()
>>                     arch_dup_task_struct()
>>                       flush_all_to_thread()
>>                         save_all()
>>                           if (tsk->thread.regs->msr & MSR_FP)
>>                             save_fpu()
>>                             # fr0 is clobbered and potentially live in userspace
>>
>> So tldr I think the corruption is only an issue since io-uring started doing
>> the clone via signal, which I think matches the observed timeline of this bug
>> appearing.
>
> I agree the corruption really only started showing up in earnest on
> io_uring clone-via-signal, as this was confirmed several times in the
> course of debugging.

Thanks.

> Note as well that I may very well have a wrong call order in the
> commit message, since I was relying on a couple of WARN_ON() macros I
> inserted to check for a similar (but not identical) condition and
> didn't spend much time getting new traces after identifying the root
> cause.

Yep no worries. I'll reword it to incorporate the full path from my mail.

> I went back and grabbed some real world system-wide stack traces, since I now
> know what to trigger on. A typical example is:
>
> interrupt_return_srr_user()
>   interrupt_exit_user_prepare()
>     interrupt_exit_user_prepare_main()
>       schedule()
>         __schedule()
>           __switch_to()
>             giveup_all()
>               # tsk->thread.regs->msr MSR_FP is still set here
>               __giveup_fpu()
>                 save_fpu()
>                 # fr0 is clobbered and potentially live in userspace

fr0 is not live there.

__giveup_fpu() does roughly:

	msr = tsk->thread.regs->msr;
	msr &= ~(MSR_FP|MSR_FE0|MSR_FE1);
	msr &= ~MSR_VSX;
	tsk->thread.regs->msr = msr;

ie. it clears the FP etc. bits from the task's MSR. That means the FP
state will be reloaded from the thread struct before the task is run again.

Also on that path we're switching to another task, so we'll be reloading
the other task's FP state before returning to userspace.

So I don't see any bug there.

There's only two places that call save_fpu() and skip the giveup logic,
which is save_all() and kvmppc_save_user_regs().

save_all() is only called via clone() so I think that's unable to
actually cause visible register corruption as I described in my previous
mail.

I thought the KVM case was similar, as it's called via an ioctl, but
I'll have to talk to Nick as his mail indicates otherwise.

cheers
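The distinction drawn above can be summarised in a short C sketch. This is a
paraphrase of the descriptions in this thread, not the literal
arch/powerpc/kernel/process.c source; it assumes kernel context, and the
"_sketch" function names are invented for illustration:

    /* Both paths call save_fpu(), which (before the fix) clobbers f0/vs0
     * while storing the FPSCR. */

    static void giveup_path_sketch(struct task_struct *tsk)
    {
    	unsigned long msr;

    	save_fpu(tsk);				/* f0/vs0 clobbered here */
    	msr = tsk->thread.regs->msr;
    	msr &= ~(MSR_FP | MSR_FE0 | MSR_FE1);
    	msr &= ~MSR_VSX;
    	tsk->thread.regs->msr = msr;		/* FP off: the regs are reloaded
    						 * from the thread struct before
    						 * the task uses FP again, so the
    						 * clobber is harmless */
    }

    static void save_all_path_sketch(struct task_struct *tsk)
    {
    	if (tsk->thread.regs->msr & MSR_FP)
    		save_fpu(tsk);			/* f0/vs0 clobbered, but MSR_FP
    						 * stays set so the stale regs
    						 * remain live -- only safe when
    						 * userspace treats f0 as
    						 * volatile (syscall entry) */
    }

This is also why the fix targets save_fpu()/save_altivec() themselves:
restoring f0/vs0 there makes every caller safe, regardless of whether it
subsequently gives the unit up.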
Re: [PATCH v2] powerpc: Don't clobber fr0/vs0 during fp|altivec register save
- Original Message - > From: "Michael Ellerman" > To: "Timothy Pearson" , "Jens Axboe" > , "regressions" > , "npiggin" , "christophe > leroy" , > "linuxppc-dev" > Sent: Monday, November 20, 2023 1:10:06 AM > Subject: Re: [PATCH v2] powerpc: Don't clobber fr0/vs0 during fp|altivec > register save > Hi Timothy, > > Great work debugging this. I think your fix is good, but I want to understand > it > a bit more to make sure I can explain why we haven't seen it outside of > io-uring. > If this can be triggered outside of io-uring then I have even more backporting > in my future :} > > Typically save_fpu() is called from __giveup_fpu() which saves the FP regs and > also *turns off FP* in the tasks MSR, meaning the kernel will reload the FP > regs > from the thread struct before letting the task use FP again. So in that case > save_fpu() is free to clobber fr0 because the FP regs no longer hold live > values > for the task. > > There is another case though, which is the path via: > copy_process() >dup_task_struct() > arch_dup_task_struct() >flush_all_to_thread() > save_all() > > That path saves the FP regs but leaves them live. That's meant as an > optimisation for a process that's using FP/VSX and then calls fork(), leaving > the regs live means the parent process doesn't have to take a fault after the > fork to get its FP regs back. > > That path does clobber fr0, but fr0 is volatile across a syscall, and the only > way to reach copy_process() from userspace is via a syscall. So in normal > usage > fr0 being clobbered across a syscall shouldn't cause data corruption. > > Even if we handle a signal on the return from the fork() syscall, the worst > that > happens is that the task's thread struct holds the clobbered fr0, but the task > doesn't care (because fr0 is volatile across the syscall anyway). > > That path is something like: > > system_call_vectored_common() > system_call_exception() >sys_fork() > kernel_clone() >copy_process() > dup_task_struct() >arch_dup_task_struct() > flush_all_to_thread() >save_all() > if (tsk->thread.regs->msr & MSR_FP) >save_fpu() ># does not clear MSR_FP from regs->msr > syscall_exit_prepare() >interrupt_exit_user_prepare_main() > do_notify_resume() >get_signal() >handle_rt_signal64() > prepare_setup_sigcontext() >flush_fp_to_thread() > if (tsk->thread.regs->msr & MSR_FP) >giveup_fpu() > __giveup_fpu >save_fpu() ># clobbered fr0 is saved, but task considers it volatile ># across syscall anyway > > > But we now have a new path, because io-uring can call copy_process() via > create_io_thread() from the signal handling path. That's OK if the signal is > handled as we return from a syscall, but it's not OK if the signal is handled > due to some other interrupt. > > Which is: > > interrupt_return_srr_user() > interrupt_exit_user_prepare() >interrupt_exit_user_prepare_main() > do_notify_resume() >get_signal() > task_work_run() >create_worker_cb() > create_io_worker() >copy_process() > dup_task_struct() >arch_dup_task_struct() > flush_all_to_thread() >save_all() > if (tsk->thread.regs->msr & MSR_FP) >save_fpu() ># fr0 is clobbered and potentially live in > userspace > > > So tldr I think the corruption is only an issue since io-uring started doing > the clone via signal, which I think matches the observed timeline of this bug > appearing. I agree the corruption really only started showing up in earnest on io_uring clone-via-signal, as this was confirmed several times in the course of debugging. 
Bear in mind however that we have seen very, very rare crashes over several years in other tasks, and I am starting to think this bug might be the root cause (see below). Note as well that I may very well have a wrong call order in the commit message, since I was relying on a couple of WARN_ON() macros I inserted to check for a similar (but not identical) condition and didn't spend much time getting new traces after identifying the root cause. I went
Re: [PATCH v2] powerpc: Don't clobber fr0/vs0 during fp|altivec register save
Yeah, awesome. On Mon Nov 20, 2023 at 5:10 PM AEST, Michael Ellerman wrote: > Hi Timothy, > > Great work debugging this. I think your fix is good, but I want to understand > it > a bit more to make sure I can explain why we haven't seen it outside of > io-uring. Analysis seems right to me. Probably the best minimal fix. But I wonder if we should just use the one path for saving/flushing/giving up, just use giveup instead of save? KVM looks odd too, and actually gets this wrong. In a way that's not fixed by Timothy's patch, because it's just not restoring userspace registers at all. Fortunately QEMU isn't in the habit of using non volatile FP/VEC registers over a VCPU ioctl, but there's no reason it couldn't do since GCC/LLVM can easily use them. KVM really wants to be using giveup. Thanks, Nick > If this can be triggered outside of io-uring then I have even more backporting > in my future :} > > Typically save_fpu() is called from __giveup_fpu() which saves the FP regs and > also *turns off FP* in the tasks MSR, meaning the kernel will reload the FP > regs > from the thread struct before letting the task use FP again. So in that case > save_fpu() is free to clobber fr0 because the FP regs no longer hold live > values > for the task. > > There is another case though, which is the path via: > copy_process() > dup_task_struct() > arch_dup_task_struct() > flush_all_to_thread() > save_all() > > That path saves the FP regs but leaves them live. That's meant as an > optimisation for a process that's using FP/VSX and then calls fork(), leaving > the regs live means the parent process doesn't have to take a fault after the > fork to get its FP regs back. > > That path does clobber fr0, but fr0 is volatile across a syscall, and the only > way to reach copy_process() from userspace is via a syscall. So in normal > usage > fr0 being clobbered across a syscall shouldn't cause data corruption. > > Even if we handle a signal on the return from the fork() syscall, the worst > that > happens is that the task's thread struct holds the clobbered fr0, but the task > doesn't care (because fr0 is volatile across the syscall anyway). > > That path is something like: > > system_call_vectored_common() > system_call_exception() > sys_fork() > kernel_clone() > copy_process() > dup_task_struct() > arch_dup_task_struct() > flush_all_to_thread() > save_all() > if (tsk->thread.regs->msr & MSR_FP) > save_fpu() > # does not clear MSR_FP from regs->msr > syscall_exit_prepare() > interrupt_exit_user_prepare_main() > do_notify_resume() > get_signal() > handle_rt_signal64() > prepare_setup_sigcontext() > flush_fp_to_thread() > if (tsk->thread.regs->msr & MSR_FP) > giveup_fpu() > __giveup_fpu > save_fpu() > # clobbered fr0 is saved, but task considers it volatile > # across syscall anyway > > > But we now have a new path, because io-uring can call copy_process() via > create_io_thread() from the signal handling path. That's OK if the signal is > handled as we return from a syscall, but it's not OK if the signal is handled > due to some other interrupt. 
> > Which is: > > interrupt_return_srr_user() > interrupt_exit_user_prepare() > interrupt_exit_user_prepare_main() > do_notify_resume() > get_signal() > task_work_run() > create_worker_cb() > create_io_worker() > copy_process() > dup_task_struct() > arch_dup_task_struct() > flush_all_to_thread() > save_all() > if (tsk->thread.regs->msr & MSR_FP) > save_fpu() > # fr0 is clobbered and potentially live in > userspace > > > So tldr I think the corruption is only an issue since io-uring started doing > the clone via signal, which I think matches the observed timeline of this bug > appearing. > > Gotta run home, will have a closer look at the actual patch later on. > > cheers > > > Timothy Pearson writes: > > During floating point and vector save to thread data fr0/vs0 are clobbered > > by the FPSCR/VSCR store routine. This leads to userspace register > > corruption > > and application data corruption / crash under the following rare condition: > > > > * A userspace thread is executing with VSX/FP mode enabled > > * The userspace thread is making active use of fr0 and/or vs0 > > * An IPI is taken in kernel mode, forcing the userspace thread to > > reschedule > > * The userspace thread is interrupted by the IPI before accessing data it > >previously stored in fr0/vs0 > > * The thread being switched in by the IPI has a pending signa
Re: [PATCH v2] powerpc: Don't clobber fr0/vs0 during fp|altivec register save
Hi Timothy,

Great work debugging this. I think your fix is good, but I want to understand it
a bit more to make sure I can explain why we haven't seen it outside of io-uring.
If this can be triggered outside of io-uring then I have even more backporting
in my future :}

Typically save_fpu() is called from __giveup_fpu() which saves the FP regs and
also *turns off FP* in the tasks MSR, meaning the kernel will reload the FP regs
from the thread struct before letting the task use FP again. So in that case
save_fpu() is free to clobber fr0 because the FP regs no longer hold live values
for the task.

There is another case though, which is the path via:
  copy_process()
    dup_task_struct()
      arch_dup_task_struct()
        flush_all_to_thread()
          save_all()

That path saves the FP regs but leaves them live. That's meant as an
optimisation for a process that's using FP/VSX and then calls fork(), leaving
the regs live means the parent process doesn't have to take a fault after the
fork to get its FP regs back.

That path does clobber fr0, but fr0 is volatile across a syscall, and the only
way to reach copy_process() from userspace is via a syscall. So in normal usage
fr0 being clobbered across a syscall shouldn't cause data corruption.

Even if we handle a signal on the return from the fork() syscall, the worst that
happens is that the task's thread struct holds the clobbered fr0, but the task
doesn't care (because fr0 is volatile across the syscall anyway).

That path is something like:

system_call_vectored_common()
  system_call_exception()
    sys_fork()
      kernel_clone()
        copy_process()
          dup_task_struct()
            arch_dup_task_struct()
              flush_all_to_thread()
                save_all()
                  if (tsk->thread.regs->msr & MSR_FP)
                    save_fpu()
                    # does not clear MSR_FP from regs->msr
  syscall_exit_prepare()
    interrupt_exit_user_prepare_main()
      do_notify_resume()
        get_signal()
        handle_rt_signal64()
          prepare_setup_sigcontext()
            flush_fp_to_thread()
              if (tsk->thread.regs->msr & MSR_FP)
                giveup_fpu()
                  __giveup_fpu()
                    save_fpu()
                    # clobbered fr0 is saved, but task considers it volatile
                    # across syscall anyway

But we now have a new path, because io-uring can call copy_process() via
create_io_thread() from the signal handling path. That's OK if the signal is
handled as we return from a syscall, but it's not OK if the signal is handled
due to some other interrupt.

Which is:

interrupt_return_srr_user()
  interrupt_exit_user_prepare()
    interrupt_exit_user_prepare_main()
      do_notify_resume()
        get_signal()
          task_work_run()
            create_worker_cb()
              create_io_worker()
                copy_process()
                  dup_task_struct()
                    arch_dup_task_struct()
                      flush_all_to_thread()
                        save_all()
                          if (tsk->thread.regs->msr & MSR_FP)
                            save_fpu()
                            # fr0 is clobbered and potentially live in userspace

So tldr I think the corruption is only an issue since io-uring started doing
the clone via signal, which I think matches the observed timeline of this bug
appearing.

Gotta run home, will have a closer look at the actual patch later on.

cheers

Timothy Pearson writes:
> During floating point and vector save to thread data fr0/vs0 are clobbered
> by the FPSCR/VSCR store routine. This leads to userspace register corruption
> and application data corruption / crash under the following rare condition:
>
> * A userspace thread is executing with VSX/FP mode enabled
> * The userspace thread is making active use of fr0 and/or vs0
> * An IPI is taken in kernel mode, forcing the userspace thread to reschedule
> * The userspace thread is interrupted by the IPI before accessing data it
>   previously stored in fr0/vs0
> * The thread being switched in by the IPI has a pending signal
>
> If these exact criteria are met, then the following sequence happens:
>
> * The existing thread FP storage is still valid before the IPI, due to a
>   prior call to save_fpu() or store_fp_state(). Note that the current
>   fr0/vs0 registers have been clobbered, so the FP/VSX state in registers
>   is now invalid pending a call to restore_fp()/restore_altivec().
> * IPI -- FP/VSX register state remains invalid
> * interrupt_exit_user_prepare_main() calls do_notify_resume(),
>   due to the pending signal
> * do_notify_resume() eventually calls save_fpu() via giveup_fpu(), which
>   merrily reads and saves the invalid FP/VSX state to thread local storage.
> * interrupt_exit_user_prepare_main() calls restore_math(), writing the invalid
>   FP/VSX state back to registers.
> * Execution is released to userspace, and the app
[PATCH v2] powerpc: Don't clobber fr0/vs0 during fp|altivec register save
During floating point and vector save to thread data fr0/vs0 are clobbered
by the FPSCR/VSCR store routine. This leads to userspace register corruption
and application data corruption / crash under the following rare condition:

* A userspace thread is executing with VSX/FP mode enabled
* The userspace thread is making active use of fr0 and/or vs0
* An IPI is taken in kernel mode, forcing the userspace thread to reschedule
* The userspace thread is interrupted by the IPI before accessing data it
  previously stored in fr0/vs0
* The thread being switched in by the IPI has a pending signal

If these exact criteria are met, then the following sequence happens:

* The existing thread FP storage is still valid before the IPI, due to a
  prior call to save_fpu() or store_fp_state(). Note that the current
  fr0/vs0 registers have been clobbered, so the FP/VSX state in registers
  is now invalid pending a call to restore_fp()/restore_altivec().
* IPI -- FP/VSX register state remains invalid
* interrupt_exit_user_prepare_main() calls do_notify_resume(),
  due to the pending signal
* do_notify_resume() eventually calls save_fpu() via giveup_fpu(), which
  merrily reads and saves the invalid FP/VSX state to thread local storage.
* interrupt_exit_user_prepare_main() calls restore_math(), writing the invalid
  FP/VSX state back to registers.
* Execution is released to userspace, and the application crashes or corrupts
  data.

Without the pending signal, do_notify_resume() is never called, therefore the
invalid register state doesn't matter as it is overwritten nearly immediately
by interrupt_exit_user_prepare_main() calling restore_math() before return to
userspace.

Restore fr0/vs0 after FPSCR/VSCR store has completed for both the fp and
altivec register save paths.

Tested under QEMU in kvm mode, running on a Talos II workstation with dual
POWER9 DD2.2 CPUs.
Closes: https://lore.kernel.org/all/480932026.45576726.1699374859845.javamail.zim...@raptorengineeringinc.com/
Closes: https://lore.kernel.org/linuxppc-dev/480221078.47953493.1700206777956.javamail.zim...@raptorengineeringinc.com/
Tested-by: Timothy Pearson
Tested-by: Jens Axboe
Signed-off-by: Timothy Pearson
---
 arch/powerpc/kernel/fpu.S    | 13 +++++++++++++
 arch/powerpc/kernel/vector.S |  2 ++
 2 files changed, 15 insertions(+)

diff --git a/arch/powerpc/kernel/fpu.S b/arch/powerpc/kernel/fpu.S
index 6a9acfb690c9..2f8f3f93cbb6 100644
--- a/arch/powerpc/kernel/fpu.S
+++ b/arch/powerpc/kernel/fpu.S
@@ -23,6 +23,15 @@
 #include
 #ifdef CONFIG_VSX
+#define __REST_1FPVSR(n,c,base)			\
+BEGIN_FTR_SECTION				\
+	b	2f;				\
+END_FTR_SECTION_IFSET(CPU_FTR_VSX);		\
+	REST_FPR(n,base);			\
+	b	3f;				\
+2:	REST_VSR(n,c,base);			\
+3:
+
 #define __REST_32FPVSRS(n,c,base)		\
 BEGIN_FTR_SECTION				\
 	b	2f;				\
@@ -41,9 +50,11 @@ END_FTR_SECTION_IFSET(CPU_FTR_VSX);	\
 2:	SAVE_32VSRS(n,c,base);			\
 3:
 #else
+#define __REST_1FPVSR(n,b,base)		REST_FPR(n, base)
 #define __REST_32FPVSRS(n,b,base)	REST_32FPRS(n, base)
 #define __SAVE_32FPVSRS(n,b,base)	SAVE_32FPRS(n, base)
 #endif
+#define REST_1FPVSR(n,c,base)   __REST_1FPVSR(n,__REG_##c,__REG_##base)
 #define REST_32FPVSRS(n,c,base) __REST_32FPVSRS(n,__REG_##c,__REG_##base)
 #define SAVE_32FPVSRS(n,c,base) __SAVE_32FPVSRS(n,__REG_##c,__REG_##base)
@@ -67,6 +78,7 @@ _GLOBAL(store_fp_state)
 	SAVE_32FPVSRS(0, R4, R3)
 	mffs	fr0
 	stfd	fr0,FPSTATE_FPSCR(r3)
+	REST_1FPVSR(0, R4, R3)
 	blr
 EXPORT_SYMBOL(store_fp_state)
@@ -138,4 +150,5 @@ _GLOBAL(save_fpu)
 2:	SAVE_32FPVSRS(0, R4, R6)
 	mffs	fr0
 	stfd	fr0,FPSTATE_FPSCR(r6)
+	REST_1FPVSR(0, R4, R6)
 	blr
diff --git a/arch/powerpc/kernel/vector.S b/arch/powerpc/kernel/vector.S
index 4094e4c4c77a..80b3f6e476b6 100644
--- a/arch/powerpc/kernel/vector.S
+++ b/arch/powerpc/kernel/vector.S
@@ -33,6 +33,7 @@ _GLOBAL(store_vr_state)
 	mfvscr	v0
 	li	r4, VRSTATE_VSCR
 	stvx	v0, r4, r3
+	lvx	v0, 0, r3
 	blr
 EXPORT_SYMBOL(store_vr_state)
@@ -109,6 +110,7 @@ _GLOBAL(save_altivec)
 	mfvscr	v0
 	li	r4,VRSTATE_VSCR
 	stvx	v0,r4,r7
+	lvx	v0,0,r7
 	blr
 #ifdef CONFIG_VSX
--
2.39.2