On Monday, February 22, 2021 at 7:30:31 AM UTC+2 jwkoz...@gmail.com wrote:
> I think I have an explanation of what is going on. Before I present it let > me recap the calling convention for aarch64: > > Caller: > > 1. If we need any of x0-x18 registers, save them. They are corruptible. > 2. Move the first 8 parameters into registers x0-x7. > 3. Push any additional parameters on the stack. > 4. Use BL to call the function. > 5. Evaluate the return code in x0. > 6. Restore any of x0-x18 that we saved in step 1. > > Callee: > > 1. Push LR (x30) and x19-x29 onto the stack if used by this routine. > 2. Do the work > 3. Put return code in x0. > 4. Pop LR and x19-x29 if pushed in step 1. > 5. Use RET instruction to return execution to the caller (this will > implicitly use LR (x30) as an address to return to). > > Now imagine the following scenario involving function F executing on > thread T1 that calls thread::yield() or another function calling yield(): > > 1. Function F pushes one of the callee saved registers - x23 (just an > example) - on the T1 stack becuase it uses it for something and it must do > it per the calling convention. > 2. Function F stores some value in x23. > 3. Function F calls thread::yield() directly or indirectly. > 4. Eventually, reschedule_from_interrupt() is called and it calls > switch_to() to switch stack pointer to the new thread T2 stack. The debug > version of reschedule_from_interrupt() nor switch_to() stores x23 as they > do not use this register (unlike the release version). > 5. At some point, later reschedule_from_interrupt() is called again > (not necessarily the very next time) and calls switch_to() that switches > back to T1. > 6. T1 resumes and eventually returns the control to the function F1 > right after it called yield(). > 7. The code in F1 after calling yield() reads the value of x23 ... and > boom. The x23 quite likely contains garbage because it was never restored > by F1 after calling yield() because per calling convention yield() or > other > callees should have saved and restored. But it did not, did it? Or rather > different routines on different threads running on the same cpu in between > ruined it. > > Why does it all work with the release version? It does because the > reschedule_from_interrupt() compiled with -02 happens to use and save all > callee-saved registers x19-x28. So they happen to be restored to correct > values after the switch. > > So it seems that the right solution is to save and restore x19-x28 (callee > saved registers) in switch_to() like so: > > diff --git a/arch/aarch64/arch-switch.hh b/arch/aarch64/arch-switch.hh > index dff7467c..45aff4a7 100644 > --- a/arch/aarch64/arch-switch.hh > +++ b/arch/aarch64/arch-switch.hh > @@ -27,6 +27,7 @@ void thread::switch_to() > > asm volatile("\n" > "str x29, %0 \n" > + "sub sp, sp, #0x50\n" > "mov x2, sp \n" > "adr x1, 1f \n" /* address of label */ > "stp x2, x1, %1 \n" > @@ -34,10 +35,23 @@ void thread::switch_to() > "ldp x29, x0, %2 \n" > "ldp x2, x1, %3 \n" > > + "stp x19, x20, [sp, #0]\n" > + "stp x21, x22, [sp, #16]\n" > + "stp x23, x24, [sp, #32]\n" > + "stp x25, x26, [sp, #48]\n" > + "stp x27, x28, [sp, #64]\n" > + > "mov sp, x2 \n" > "blr x1 \n" > > "1: \n" /* label */ > + > + "ldp x19, x20, [sp, #0]\n" > + "ldp x21, x22, [sp, #16]\n" > + "ldp x23, x24, [sp, #32]\n" > + "ldp x25, x26, [sp, #48]\n" > + "ldp x27, x28, [sp, #64]\n" > + "add sp, sp, #0x50\n" > : > : "Q"(old->_state.fp), "Ump"(old->_state.sp), > "Ump"(this->_state.fp), "Ump"(this->_state.sp) > > And indeed the crashes in both -00 and -O1 go away. > > Does my explanation have holes? Or am I completely wrong? > I think you're completely right, well spotted. I think you're completely right. Here's the equivalent from Linux. Here's the Linux equivalent: SYM_FUNC_START(cpu_switch_to) mov x10, #THREAD_CPU_CONTEXT add x8, x0, x10 mov x9, sp stp x19, x20, [x8], #16 // store callee-saved registers stp x21, x22, [x8], #16 stp x23, x24, [x8], #16 stp x25, x26, [x8], #16 stp x27, x28, [x8], #16 stp x29, x9, [x8], #16 str lr, [x8] add x8, x1, x10 ldp x19, x20, [x8], #16 // restore callee-saved registers ldp x21, x22, [x8], #16 ldp x23, x24, [x8], #16 ldp x25, x26, [x8], #16 ldp x27, x28, [x8], #16 ldp x29, x9, [x8], #16 ldr lr, [x8] mov sp, x9 msr sp_el0, x1 ptrauth_keys_install_kernel x1, x8, x9, x10 scs_save x0, x8 scs_load x1, x8 ret SYM_FUNC_END(cpu_switch_to) NOKPROBE(cpu_switch_to) -- You received this message because you are subscribed to the Google Groups "OSv Development" group. To unsubscribe from this group and stop receiving emails from it, send an email to osv-dev+unsubscr...@googlegroups.com. To view this discussion on the web visit https://groups.google.com/d/msgid/osv-dev/a6d7b7a4-4757-4959-9c6b-1e63c24997den%40googlegroups.com.