As issue #1124 and the mailing list thread https://groups.google.com/g/osv-dev/c/ystZElFNdBI explain, some tests in the aarch64 debug build fail in a consistent manner with page faults, suggesting that the values of parameters passed from function to function somehow get corrupted. After painful research, I pinpointed the crashes to the reschedule_from_interrupt() method, which does not work correctly in voluntary (non-preemptive) context switch scenarios where the thread::yield() method is called by an application. It turns out that in those situations the so-called callee-saved registers used by an application function calling thread::yield() are not saved during the context switch, and therefore not restored when such a thread gets resumed later. The following steps illustrate such a scenario:
1. Function F pushes one of the callee-saved registers - x23 (just an example) - on the T1 stack because it uses it for something, and per the calling convention it must preserve it.
2. Function F stores some value in x23.
3. Function F calls thread::yield() directly or indirectly.
4. Eventually reschedule_from_interrupt() is called and it calls switch_to() to switch the stack pointer to the new thread T2's stack. The debug version of neither reschedule_from_interrupt() nor switch_to() stores x23, as they do not use this register (unlike the release version).
5. At some point later, reschedule_from_interrupt() is called again (not necessarily the very next time) and calls switch_to(), which switches back to T1.
6. T1 resumes and eventually returns control to function F right after the point where it called yield().
7. The code in F after the call to yield() reads the value of x23 ... and boom. x23 quite likely contains garbage: per the calling convention yield() or its callees should have saved and restored it, but they did not, and the routines of other threads running on the same cpu in the meantime clobbered it. (A minimal code sketch of this scenario appears below, after the patch description.)

One way to solve this problem would be to add all callee-saved registers (x19-x28 and d8-d15 per https://github.com/ARM-software/abi-aa/blob/f52e1ad3f81254497a83578dc102f6aac89e52d0/aapcs64/aapcs64.rst - see "6.1.1 General-purpose Registers" and "6.1.2 SIMD and Floating-Point Registers" about the v8-v15 registers) to the list of clobbered registers of the inline assembly in thread::switch_to() in arch/aarch64/arch-switch.hh (a sketch of this rejected approach also appears below). But this does not always work: some versions of gcc (see https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99342), depending on the optimization level (-O0, -O2, etc), may or may not inline switch_to() into reschedule_from_interrupt() and end up generating machine code that does not do what we intend. On top of that, the code in reschedule_from_interrupt() that destroys any terminated thread on the current cpu uses the thread-local storage (TLS) pointer register - tpidr_el0 - which is not re-read after switch_to() and causes interesting crashes, as issue #1121 describes.

Given all of the above, and the fact that reschedule_from_interrupt() calls switch_to() to switch threads - which is very tricky to do right using pure C++ and some inline assembly - this patch implements a high-level version of reschedule_from_interrupt() in assembly for aarch64 (the x64 version stays as is). The patch modifies include/osv/sched.hh and core/sched.cc with a number of `#ifdef AARCH64_PORT_STUB` blocks to keep reschedule_from_interrupt() as is for x64 and to split it, for aarch64, into 2 new methods - schedule_next_thread() and destroy_current_cpu_terminating_thread(). The two new functions correspond to the parts of reschedule_from_interrupt() before the switch_to() call and after it. Please note that schedule_next_thread() returns a simple 2-field struct with pointers to the old and new thread state (pc, sp, tcb, etc) that gets used by reschedule_from_interrupt(). On top of this, the patch adds arch/aarch64/sched.S, which implements reschedule_from_interrupt() as a C function in assembly. In essence, reschedule_from_interrupt() calls the new schedule_next_thread(), saves all callee-saved registers on the old thread stack, executes a context switch as the old switch_to() did, restores the callee-saved registers from the new thread stack, and finally calls destroy_current_cpu_terminating_thread().
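To make the failure scenario above more concrete, here is a minimal, hypothetical C++ sketch (not part of the patch; the function compute_something() is made up, and whether the compiler actually places the local in x23 is up to its register allocator):

    #include <osv/sched.hh>

    // Illustration only: a local value that the compiler may keep in a
    // callee-saved register (for example x23) across a voluntary yield.
    long compute_something()
    {
        long important = 42;        // may be kept in x23 per the AAPCS64 calling convention

        sched::thread::yield();     // voluntary context switch; if the switch path does not
                                    // save/restore x19-x28 and d8-d15, other threads scheduled
                                    // on this cpu in the meantime may clobber x23

        return important + 1;       // reads the possibly-clobbered register -> garbage or a crash
    }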
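For reference, the rejected alternative mentioned above - forcing the compiler to preserve everything by extending the clobber list of the old thread::switch_to() inline assembly - would have looked roughly like the sketch below (based on the switch_to() body removed by this patch; I spell the SIMD registers as v8-v15 on the assumption that this is the clobber naming gcc accepts, their low halves d8-d15 being the callee-saved part). As explained, gcc may inline switch_to() into reschedule_from_interrupt() and still not generate the intended code, which is why the patch moves the whole switch into hand-written assembly instead:

    // Sketch of the rejected fix: the old switch_to() with all callee-saved
    // registers added to the clobber list so the compiler has to spill and
    // reload them around the context switch.
    void thread::switch_to()
    {
        thread* old = current();
        asm volatile ("msr tpidr_el0, %0; isb; " :: "r"(_tcb) : "memory");

        asm volatile("\n"
                     "str x29, %0  \n"
                     "mov x2, sp   \n"
                     "adr x1, 1f   \n" /* address of label */
                     "stp x2, x1, %1  \n"

                     "ldp x29, x0, %2 \n"
                     "ldp x2, x1, %3  \n"

                     "mov sp, x2   \n"
                     "blr x1       \n"

                     "1:  \n" /* label */
                     :
                     : "Q"(old->_state.fp), "Ump"(old->_state.sp),
                       "Ump"(this->_state.fp), "Ump"(this->_state.sp)
                     : "x0", "x1", "x2", "x3", "x4", "x5", "x6", "x7", "x8",
                       "x9", "x10", "x11", "x12", "x13", "x14", "x15",
                       "x16", "x17", "x18", "x19", "x20", "x21", "x22",
                       "x23", "x24", "x25", "x26", "x27", "x28", "x30",
                       "v8", "v9", "v10", "v11", "v12", "v13", "v14", "v15",
                       "memory");
    }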
This patch has been tested with the -O0, -O1, and -O2 optimization levels (the release build and both debug variants) on real ARM hardware and in emulated TCG mode.

Please note that the new code makes the context switch a little bit more expensive (~3%) per these numbers:

BEFORE:
taskset -c 2-5 ./scripts/run.py -c 1 -e '/tests/misc-ctxsw.so'
OSv v0.55.0-204-g837a9b5d
getauxval() stubbed
eth0: 192.168.122.15
Booted up in 58.19 ms
Cmdline: /tests/misc-ctxsw.so
664 colocated
665 apart
665 nopin

AFTER:
taskset -c 2-5 ./scripts/run.py -c 1 -e '/tests/misc-ctxsw.so'
OSv v0.55.0-205-g7e7c49f7
getauxval() stubbed
eth0: 192.168.122.15
Booted up in 58.20 ms
Cmdline: /tests/misc-ctxsw.so
683 colocated
690 apart
690 nopin

Fixes #1121
Fixes #1124

Signed-off-by: Waldemar Kozaczuk <jwkozac...@gmail.com>
---
 Makefile                          |  1 +
 arch/aarch64/arch-switch.hh       | 38 +++--------
 arch/aarch64/arch-thread-state.hh |  1 +
 arch/aarch64/sched.S              | 70 +++++++++++++++++++++++++++++++
 core/sched.cc                     | 63 ++++++++++++++++++++++++++--
 include/osv/sched.hh              | 34 ++++++++++++++-
 6 files changed, 170 insertions(+), 37 deletions(-)
 create mode 100644 arch/aarch64/sched.S

diff --git a/Makefile b/Makefile
index bbc57252..e1866c3b 100644
--- a/Makefile
+++ b/Makefile
@@ -878,6 +878,7 @@ objects += arch/$(arch)/memset.o
 objects += arch/$(arch)/memcpy.o
 objects += arch/$(arch)/memmove.o
 objects += arch/$(arch)/tlsdesc.o
+objects += arch/$(arch)/sched.o
 objects += $(libfdt)
 endif
 
diff --git a/arch/aarch64/arch-switch.hh b/arch/aarch64/arch-switch.hh
index dff7467c..1a78d132 100644
--- a/arch/aarch64/arch-switch.hh
+++ b/arch/aarch64/arch-switch.hh
@@ -20,32 +20,6 @@ void thread_main_c(sched::thread* t);
 
 namespace sched {
 
-void thread::switch_to()
-{
-    thread* old = current();
-    asm volatile ("msr tpidr_el0, %0; isb; " :: "r"(_tcb) : "memory");
-
-    asm volatile("\n"
-                 "str x29, %0  \n"
-                 "mov x2, sp   \n"
-                 "adr x1, 1f   \n" /* address of label */
-                 "stp x2, x1, %1  \n"
-
-                 "ldp x29, x0, %2 \n"
-                 "ldp x2, x1, %3  \n"
-
-                 "mov sp, x2   \n"
-                 "blr x1       \n"
-
-                 "1:  \n" /* label */
-                 :
-                 : "Q"(old->_state.fp), "Ump"(old->_state.sp),
-                   "Ump"(this->_state.fp), "Ump"(this->_state.sp)
-                 : "x0", "x1", "x2", "x3", "x4", "x5", "x6", "x7", "x8",
-                   "x9", "x10", "x11", "x12", "x13", "x14", "x15",
-                   "x16", "x17", "x18", "x30", "memory");
-}
-
 void thread::switch_to_first()
 {
     asm volatile ("msr tpidr_el0, %0; isb; " :: "r"(_tcb) : "memory");
@@ -59,14 +33,13 @@ void thread::switch_to_first()
 
     asm volatile("\n"
                  "ldp x29, x0, %0 \n"
-                 "ldp x2, x1, %1  \n"
-                 "mov sp, x2      \n"
-                 "blr x1          \n"
+                 "ldp x22, x21, %1 \n"
+                 "mov sp, x22     \n"
+                 "blr x21         \n"
                  :
                  : "Ump"(this->_state.fp), "Ump"(this->_state.sp)
-                 : "x0", "x1", "x2", "x3", "x4", "x5", "x6", "x7", "x8",
-                   "x9", "x10", "x11", "x12", "x13", "x14", "x15",
-                   "x16", "x17", "x18", "x30", "memory");
+                 : "x0", "x19", "x20", "x21", "x22", "x23", "x24",
+                   "x25", "x26", "x27", "x28", "x29", "x30", "memory");
 }
 
 void thread::init_stack()
@@ -150,6 +123,7 @@ void thread::setup_tcb()
     void* p = aligned_alloc(64, total_tls_size + sizeof(*_tcb));
     _tcb = (thread_control_block *)p;
     _tcb[0].tls_base = &_tcb[1];
+    _state.tcb = p;
     //
     // First goes kernel TLS data
     auto kernel_tls = _tcb[0].tls_base;
diff --git a/arch/aarch64/arch-thread-state.hh b/arch/aarch64/arch-thread-state.hh
index af424472..6f1b680d 100644
--- a/arch/aarch64/arch-thread-state.hh
+++ b/arch/aarch64/arch-thread-state.hh
@@ -14,6 +14,7 @@ struct thread_state {
     void* sp;
     void* pc;
+    void* tcb;
 };
 
 #endif /* ARCH_THREAD_STATE_HH_ */
diff --git a/arch/aarch64/sched.S b/arch/aarch64/sched.S
new file mode 100644
index 00000000..1e69f8fd
--- /dev/null
+++ b/arch/aarch64/sched.S
@@ -0,0 +1,70 @@
+#void cpu::reschedule_from_interrupt(bool called_from_yield,
+#                                    thread_runtime::duration preempt_after)
+.global reschedule_from_interrupt
+.type reschedule_from_interrupt, @function
+reschedule_from_interrupt:
+    #.cfi_startproc
+    stp x29, x30, [sp, #-176]!
+    #.cfi_adjust_cfa_offset 176
+    mov x29, sp
+    #.cfi_def_cfa_register x29
+
+    #Call cpu_schedule_next_thread() to determine next thread to switch to if any
+    bl cpu_schedule_next_thread
+
+    #The cpu_schedule_next_thread returns thread_switch_data in x0 and x1,
+    #where x0 holds old_thread_state and x1 holds new_thread_state
+    #If cpu_schedule_next_thread() returns thread_switch_data with null pointers, exit
+    cmp x0, #0
+    b.eq 2f
+
+    #Store all callee-save registers on the old thread stack
+    stp x19, x20, [sp, #16]
+    stp x21, x22, [sp, #32]
+    stp x23, x24, [sp, #48]
+    stp x25, x26, [sp, #64]
+    stp x27, x28, [sp, #80]
+
+    stp d8, d9, [sp, #96]
+    stp d10, d11, [sp, #112]
+    stp d12, d13, [sp, #128]
+    stp d14, d15, [sp, #144]
+
+    #Switch to new thread
+    ldr x2, [x1, #32]     //Fetch _tcb of the new thread
+    msr tpidr_el0, x2     //Set thread pointer
+    isb
+
+    str x29, [x0, #0]     //Save frame pointer of the old thread
+    mov x3, sp            //Fetch old thread stack pointer
+    adr x4, 1f            //Fetch old thread instruction pointer
+    stp x3, x4, [x0, #16] //Save old thread sp and pc
+
+    ldp x29, x0, [x1, #0] //Set frame pointer of the new thread and this (x0) of the new thread
+                          //Please note that the pc may point to thread_main_c(thread*) which is
+                          //why we have to set x0 (1st argument) to new thread object
+    ldp x3, x4, [x1, #16] //Fetch new thread sp and pc
+    mov sp, x3            //Set new thread stack pointer
+    blr x4                //Jump to the new thread pc
+
+1:
+    #Restore all callee-save registers from the new thread stack
+    ldp x19, x20, [sp, #16]
+    ldp x21, x22, [sp, #32]
+    ldp x23, x24, [sp, #48]
+    ldp x25, x26, [sp, #64]
+    ldp x27, x28, [sp, #80]
+
+    ldp d8, d9, [sp, #96]
+    ldp d10, d11, [sp, #112]
+    ldp d12, d13, [sp, #128]
+    ldp d14, d15, [sp, #144]
+
+    #Call destroy_current_cpu_terminating_thread()
+    bl destroy_current_cpu_terminating_thread
+
+2:
+    ldp x29, x30, [sp], #176
+    #.cfi_def_cfa sp, 0
+    ret
+    #.cfi_endproc
diff --git a/core/sched.cc b/core/sched.cc
index 06f849d1..74b4ab95 100644
--- a/core/sched.cc
+++ b/core/sched.cc
@@ -225,13 +225,34 @@ void thread::cputime_estimator_get(
 void cpu::schedule()
 {
     WITH_LOCK(irq_lock) {
+#ifdef AARCH64_PORT_STUB
+        reschedule_from_interrupt(sched::cpu::current(), false, thyst);
+#else
         current()->reschedule_from_interrupt();
-    }
-}
-
+#endif
+    }
+}
+
+// In the aarch64 port, reschedule_from_interrupt() needs to be implemented
+// in assembly (please see arch/aarch64/sched.S) to give us better control over
+// which registers are used and which ones are saved and restored during
+// the context switch. In essence, most of the original reschedule_from_interrupt()
+// code up to switch_to() is shared with aarch64-specific cpu::schedule_next_thread()
+// that is called from reschedule_from_interrupt() in arch/aarch64/sched.S assembly.
+// The logic executed after switch_to() that makes up destroy_current_cpu_terminating_thread()
+// is called from arch/aarch64/sched.S as well.
+// At the end, we define reschedule_from_interrupt() in C++ for x86_64 and schedule_next_thread()
+// and destroy_current_cpu_terminating_thread() in C++ for aarch64.
+#ifdef AARCH64_PORT_STUB
+inline thread_switch_data cpu::schedule_next_thread(bool called_from_yield,
+                                                    thread_runtime::duration preempt_after)
+{
+    thread_switch_data switch_data;
+#else
 void cpu::reschedule_from_interrupt(bool called_from_yield,
                                     thread_runtime::duration preempt_after)
 {
+#endif
     trace_sched_sched();
     assert(sched::exception_depth <= 1);
     need_reschedule = false;
@@ -259,7 +280,11 @@ void cpu::reschedule_from_interrupt(bool called_from_yield,
     // lowest runtime, and update the timer until the next thread's turn.
     if (runqueue.empty()) {
         preemption_timer.cancel();
+#ifdef AARCH64_PORT_STUB
+        return switch_data;
+#else
         return;
+#endif
     } else if (!called_from_yield) {
         auto &t = *runqueue.begin();
         if (p->_runtime.get_local() < t._runtime.get_local()) {
@@ -268,7 +293,11 @@
             if (delta > 0) {
                 preemption_timer.set(now + delta);
             }
+#ifdef AARCH64_PORT_STUB
+            return switch_data;
+#else
             return;
+#endif
         }
     }
     // If we're here, p no longer has the lowest runtime. Before queuing
@@ -336,19 +365,39 @@
     if (lazy_flush_tlb.exchange(false, std::memory_order_seq_cst)) {
         mmu::flush_tlb_local();
     }
+#ifdef AARCH64_PORT_STUB
+    switch_data.old_thread_state = &(p->_state);
+    switch_data.new_thread_state = &(n->_state);
+    return switch_data;
+}
+#else
     n->switch_to();
+#endif
 
     // Note: after the call to n->switch_to(), we should no longer use any of
     // the local variables, nor "this" object, because we just switched to n's
     // stack and the values we can access now are those that existed in the
     // reschedule call which scheduled n out, and will now be returning.
     // So to get the current cpu, we must use cpu::current(), not "this".
+#ifdef AARCH64_PORT_STUB
+extern "C" void destroy_current_cpu_terminating_thread()
+{
+#endif
     if (cpu::current()->terminating_thread) {
        cpu::current()->terminating_thread->destroy();
        cpu::current()->terminating_thread = nullptr;
     }
 }
 
+#ifdef AARCH64_PORT_STUB
+extern "C" thread_switch_data cpu_schedule_next_thread(cpu* cpu,
+                                                       bool called_from_yield,
+                                                       thread_runtime::duration preempt_after)
+{
+    return cpu->schedule_next_thread(called_from_yield, preempt_after);
+}
+#endif
+
 void cpu::timer_fired()
 {
     // nothing to do, preemption will happen if needed
@@ -521,7 +570,11 @@ void thread::pin(cpu *target_cpu)
         // Note that wakeme is on the same CPU, and irq is disabled,
         // so it will not actually run until we stop running.
         wakeme->wake_with([&] { do_wakeme = true; });
+#ifdef AARCH64_PORT_STUB
+        reschedule_from_interrupt(source_cpu, false, thyst);
+#else
         source_cpu->reschedule_from_interrupt();
+#endif
     }
     // wakeme will be implicitly join()ed here.
 }
@@ -760,7 +813,11 @@ void thread::yield(thread_runtime::duration preempt_after)
     }
 
     trace_sched_yield_switch();
+#ifdef AARCH64_PORT_STUB
+    reschedule_from_interrupt(cpu::current(), true, preempt_after);
+#else
     cpu::current()->reschedule_from_interrupt(true, preempt_after);
+#endif
 }
 
 void thread::set_priority(float priority)
diff --git a/include/osv/sched.hh b/include/osv/sched.hh
index 0fb44f77..1dacd623 100644
--- a/include/osv/sched.hh
+++ b/include/osv/sched.hh
@@ -31,6 +31,9 @@ typedef float runtime_t;
 
 extern "C" {
 void smp_main();
+#ifdef AARCH64_PORT_STUB
+void destroy_current_cpu_terminating_thread();
+#endif
 };
 
 void smp_launch();
@@ -321,7 +324,12 @@ constexpr thread_runtime::duration thyst = std::chrono::milliseconds(5);
 
 constexpr thread_runtime::duration context_switch_penalty =
         std::chrono::microseconds(10);
-
+#ifdef AARCH64_PORT_STUB
+struct thread_switch_data {
+    thread_state* old_thread_state = nullptr;
+    thread_state* new_thread_state = nullptr;
+};
+#endif
 /**
  * OSv thread
  */
@@ -600,7 +608,9 @@ private:
                     unsigned allowed_initial_states_mask = 1 << unsigned(status::waiting));
     static void sleep_impl(timer &tmr);
     void main();
+#ifndef AARCH64_PORT_STUB
     void switch_to();
+#endif
     void switch_to_first();
     void prepare_wait();
     void wait();
@@ -702,7 +712,12 @@ private:
     std::vector<char*> _tls;
     bool _app;
     std::shared_ptr<osv::application_runtime> _app_runtime;
+public: //TODO: For whatever reason we need to make this public even though destroy_current_cpu_terminating_thread is defined as friend below, so compiler still complains.
     void destroy();
+private:
+#ifdef AARCH64_PORT_STUB
+    friend void ::destroy_current_cpu_terminating_thread();
+#endif
     friend class thread_ref_guard;
     friend void thread_main_c(thread* t);
     friend class wait_guard;
@@ -884,8 +899,13 @@ struct cpu : private timer_base::client {
      * @param preempt_after fire the preemption timer after this time period
      * (relevant only when "called_from_yield" is TRUE)
      */
+#ifdef AARCH64_PORT_STUB
+    thread_switch_data schedule_next_thread(bool called_from_yield = false,
+                       thread_runtime::duration preempt_after = thyst);
+#else
     void reschedule_from_interrupt(bool called_from_yield = false,
-                                   thread_runtime::duration preempt_after = thyst);
+            thread_runtime::duration preempt_after = thyst);
+#endif
     void enqueue(thread& t);
     void init_idle_thread();
     virtual void timer_fired() override;
@@ -895,6 +915,12 @@ struct cpu : private timer_base::client {
     int renormalize_count;
 };
 
+#ifdef AARCH64_PORT_STUB
+extern "C" void reschedule_from_interrupt(cpu* cpu,
+                                          bool called_from_yield,
+                                          thread_runtime::duration preempt_after);
+#endif
+
 class cpu::notifier {
 public:
     explicit notifier(std::function<void ()> cpu_up);
@@ -996,7 +1022,11 @@ inline bool preemptable()
 inline void preempt()
 {
     if (preemptable()) {
+#ifdef AARCH64_PORT_STUB
+        reschedule_from_interrupt(sched::cpu::current(), false, thyst);
+#else
         sched::cpu::current()->reschedule_from_interrupt();
+#endif
     } else {
         // preempt_enable() will pick this up eventually
         need_reschedule = true;
-- 
2.27.0