As issue #1124 and the thread
https://groups.google.com/g/osv-dev/c/ystZElFNdBI
on the mailing list explain, some tests in the aarch64 debug build
fail in a consistent manner with page faults, suggesting that the
values of parameters passed from function to function somehow get
corrupted. After painful research, I pinpointed the crashes to the
reschedule_from_interrupt() method, which does not work correctly in
voluntary (non-preemptive) context switch scenarios where the
thread::yield() method is called by an app. It turns out that in those
situations the so-called callee-saved registers used by an app function
calling thread::yield() are not saved during the context switch, and
consequently not restored when such a thread gets resumed later. The
following steps illustrate such a scenario (a minimal sketch follows
the list):

1. Function F, running on thread T1, pushes one of the callee-saved
   registers - x23 (just an example) - onto the T1 stack because it
   uses it for something and must do so per the calling convention.
2. Function F stores some value in x23.
3. Function F calls thread::yield(), directly or indirectly.
4. Eventually, reschedule_from_interrupt() is called and it calls
   switch_to() to switch the stack pointer to the new thread T2's
   stack. In the debug build, neither reschedule_from_interrupt() nor
   switch_to() saves x23, as they do not use this register (unlike the
   release build).
5. At some later point, reschedule_from_interrupt() is called again
   (not necessarily the very next time) and calls switch_to(), which
   switches back to T1.
6. T1 resumes and eventually returns control to function F, right
   after its call to yield().
7. The code in F after the yield() call reads the value of x23 ... and
   boom. By now x23 quite likely contains garbage, because F never
   restored it after calling yield() - per the calling convention,
   yield() or its callees should have saved and restored it. But they
   did not, did they? Or rather, different routines on different
   threads running on the same cpu in between ruined it.
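
To make this concrete, here is a minimal, hypothetical sketch of the
kind of code a compiler may emit for such a function F (the symbol
names, frame offsets and the choice of x23 are purely illustrative,
not taken from an actual build):

    F:
            stp     x29, x30, [sp, #-32]!  // set up the frame, save fp/lr
            mov     x29, sp
            str     x23, [sp, #16]         // save callee-saved x23 per AAPCS64
            mov     x23, #42               // keep a live value in x23 across the call
            bl      thread_yield           // calls thread::yield() (name illustrative)
            // If the context switch did not preserve x23, the value read here
            // is whatever another thread left in the register - not 42.
            add     x0, x23, #1
            ldr     x23, [sp, #16]         // restore x23 before returning
            ldp     x29, x30, [sp], #32
            ret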

One way to solve this problem would be to add all callee-saved registers
(x19-x28 and d8-d15 per
https://github.com/ARM-software/abi-aa/blob/f52e1ad3f81254497a83578dc102f6aac89e52d0/aapcs64/aapcs64.rst
- see "6.1.1 General-purpose Registers" and "6.1.2 SIMD and
Floating-Point Registers", which covers the v8-v15 registers) to the
list of clobbered registers of the inline assembly in
thread::switch_to() in arch/aarch64/arch-switch.hh. But this does not
always work: some versions of gcc (see
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99342), depending on the
optimization level (-O0, -O2, etc.), may or may not inline switch_to()
into reschedule_from_interrupt() and end up generating machine code
that does not do what we intend. On top of that, the code in
reschedule_from_interrupt() that destroys any terminated thread on the
current cpu uses the thread-local storage (TLS) pointer register -
tpidr_el0 - which is not re-read after switch_to() and causes
interesting crashes as issue #1121 describes.
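
For reference, the rejected fix would have amounted to roughly the
following (a sketch only - the assembly body is elided and would stay
as in the current thread::switch_to()):

    asm volatile("...existing switch_to() body..."
                 :
                 : "Q"(old->_state.fp), "Ump"(old->_state.sp),
                   "Ump"(this->_state.fp), "Ump"(this->_state.sp)
                 : "x0", "x1", "x2", "x3", "x4", "x5", "x6", "x7", "x8",
                   "x9", "x10", "x11", "x12", "x13", "x14", "x15",
                   "x16", "x17", "x18",
                   // additionally declare all callee-saved registers clobbered:
                   "x19", "x20", "x21", "x22", "x23", "x24", "x25", "x26",
                   "x27", "x28",
                   "d8", "d9", "d10", "d11", "d12", "d13", "d14", "d15",
                   "x30", "memory");

In theory the extended clobber list forces the compiler to save and
restore any of those registers it keeps live across the asm statement;
in practice, per the gcc bug above, that guarantee does not reliably
survive the inlining of switch_to() into reschedule_from_interrupt().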

Given all of the above, and the fact that reschedule_from_interrupt()
calls switch_to() to switch threads - which is very tricky to get right
using pure C++ and some inline assembly - this patch implements a
high-level version of reschedule_from_interrupt() in assembly for
aarch64 (the x64 version stays as is).

This patch modifies include/osv/sched.hh and core/sched.cc with a
number of `#ifdef AARCH64_PORT_STUB` blocks to keep
reschedule_from_interrupt() as is for x64, and splits it into two new
methods for aarch64 - schedule_next_thread() and
destroy_current_cpu_terminating_thread(). The two new functions
correspond to the parts of reschedule_from_interrupt() up to the
switch_to() call and after it, respectively. Please note that
schedule_next_thread() returns a simple two-field struct with pointers
to the old and new thread state (pc, sp, tcb, etc.) that gets used by
reschedule_from_interrupt().
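
For clarity, this is the shape of that struct (the full definition is
in the include/osv/sched.hh hunk below); since it is just two pointers
(16 bytes), the AAPCS64 returns it in the x0 and x1 registers, which is
exactly how the assembly side picks it up:

    struct thread_switch_data {
        thread_state* old_thread_state = nullptr; // returned in x0
        thread_state* new_thread_state = nullptr; // returned in x1
    };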

On top of this, the patch adds arch/aarch64/sched.S, which implements
reschedule_from_interrupt() as a C function in assembly. In essence,
reschedule_from_interrupt() calls the new schedule_next_thread(), saves
all callee-saved registers on the old thread's stack, executes a
context switch as the old switch_to() did, restores the callee-saved
registers from the new thread's stack, and finally calls
destroy_current_cpu_terminating_thread().

This patch has been tested with the -O0, -O1 and -O2 optimization
levels (release, debug 1 and 2) on real ARM hardware and in emulated
TCG mode.

Please note that the new code makes the context switch a little bit
more expensive (~3%) per these numbers:

BEFORE:
taskset -c 2-5 ./scripts/run.py -c 1 -e '/tests/misc-ctxsw.so'
OSv v0.55.0-204-g837a9b5d
getauxval() stubbed
eth0: 192.168.122.15
Booted up in 58.19 ms
Cmdline: /tests/misc-ctxsw.so
       664 colocated
       665 apart
       665 nopin

AFTER:
taskset -c 2-5 ./scripts/run.py -c 1 -e '/tests/misc-ctxsw.so'
OSv v0.55.0-205-g7e7c49f7
getauxval() stubbed
eth0: 192.168.122.15
Booted up in 58.20 ms
Cmdline: /tests/misc-ctxsw.so
       683 colocated
       690 apart
       690 nopin

Fixes #1121
Fixes #1124

Signed-off-by: Waldemar Kozaczuk <jwkozac...@gmail.com>
---
 Makefile                          |  1 +
 arch/aarch64/arch-switch.hh       | 38 +++--------------
 arch/aarch64/arch-thread-state.hh |  1 +
 arch/aarch64/sched.S              | 70 +++++++++++++++++++++++++++++++
 core/sched.cc                     | 63 ++++++++++++++++++++++++++--
 include/osv/sched.hh              | 34 ++++++++++++++-
 6 files changed, 170 insertions(+), 37 deletions(-)
 create mode 100644 arch/aarch64/sched.S

diff --git a/Makefile b/Makefile
index bbc57252..e1866c3b 100644
--- a/Makefile
+++ b/Makefile
@@ -878,6 +878,7 @@ objects += arch/$(arch)/memset.o
 objects += arch/$(arch)/memcpy.o
 objects += arch/$(arch)/memmove.o
 objects += arch/$(arch)/tlsdesc.o
+objects += arch/$(arch)/sched.o
 objects += $(libfdt)
 endif
 
diff --git a/arch/aarch64/arch-switch.hh b/arch/aarch64/arch-switch.hh
index dff7467c..1a78d132 100644
--- a/arch/aarch64/arch-switch.hh
+++ b/arch/aarch64/arch-switch.hh
@@ -20,32 +20,6 @@ void thread_main_c(sched::thread* t);
 
 namespace sched {
 
-void thread::switch_to()
-{
-    thread* old = current();
-    asm volatile ("msr tpidr_el0, %0; isb; " :: "r"(_tcb) : "memory");
-
-    asm volatile("\n"
-                 "str x29,     %0  \n"
-                 "mov x2, sp       \n"
-                 "adr x1, 1f       \n" /* address of label */
-                 "stp x2, x1,  %1  \n"
-
-                 "ldp x29, x0, %2  \n"
-                 "ldp x2, x1,  %3  \n"
-
-                 "mov sp, x2       \n"
-                 "blr x1           \n"
-
-                 "1:               \n" /* label */
-                 :
-                 : "Q"(old->_state.fp), "Ump"(old->_state.sp),
-                   "Ump"(this->_state.fp), "Ump"(this->_state.sp)
-                 : "x0", "x1", "x2", "x3", "x4", "x5", "x6", "x7", "x8",
-                   "x9", "x10", "x11", "x12", "x13", "x14", "x15",
-                   "x16", "x17", "x18", "x30", "memory");
-}
-
 void thread::switch_to_first()
 {
     asm volatile ("msr tpidr_el0, %0; isb; " :: "r"(_tcb) : "memory");
@@ -59,14 +33,13 @@ void thread::switch_to_first()
 
     asm volatile("\n"
                  "ldp x29, x0, %0  \n"
-                 "ldp x2, x1, %1   \n"
-                 "mov sp, x2       \n"
-                 "blr x1           \n"
+                 "ldp x22, x21, %1 \n"
+                 "mov sp, x22      \n"
+                 "blr x21          \n"
                  :
                  : "Ump"(this->_state.fp), "Ump"(this->_state.sp)
-                 : "x0", "x1", "x2", "x3", "x4", "x5", "x6", "x7", "x8",
-                   "x9", "x10", "x11", "x12", "x13", "x14", "x15",
-                   "x16", "x17", "x18", "x30", "memory");
+                 : "x0", "x19", "x20", "x21", "x22", "x23", "x24",
+                  "x25", "x26", "x27", "x28", "x29", "x30", "memory");
 }
 
 void thread::init_stack()
@@ -150,6 +123,7 @@ void thread::setup_tcb()
     void* p = aligned_alloc(64, total_tls_size + sizeof(*_tcb));
     _tcb = (thread_control_block *)p;
     _tcb[0].tls_base = &_tcb[1];
+    _state.tcb = p;
     //
     // First goes kernel TLS data
     auto kernel_tls = _tcb[0].tls_base;
diff --git a/arch/aarch64/arch-thread-state.hh b/arch/aarch64/arch-thread-state.hh
index af424472..6f1b680d 100644
--- a/arch/aarch64/arch-thread-state.hh
+++ b/arch/aarch64/arch-thread-state.hh
@@ -14,6 +14,7 @@ struct thread_state {
 
     void* sp;
     void* pc;
+    void* tcb;
 };
 
 #endif /* ARCH_THREAD_STATE_HH_ */
diff --git a/arch/aarch64/sched.S b/arch/aarch64/sched.S
new file mode 100644
index 00000000..1e69f8fd
--- /dev/null
+++ b/arch/aarch64/sched.S
@@ -0,0 +1,70 @@
+#void cpu::reschedule_from_interrupt(bool called_from_yield,
+#                                    thread_runtime::duration preempt_after)
+.global reschedule_from_interrupt
+.type reschedule_from_interrupt, @function
+reschedule_from_interrupt:
+        #.cfi_startproc
+        stp     x29, x30, [sp, #-176]!
+        #.cfi_adjust_cfa_offset 176
+        mov    x29, sp
+        #.cfi_def_cfa_register x29
+
+        #Call cpu_schedule_next_thread() to determine next thread to switch to if any
+        bl cpu_schedule_next_thread
+
+        #The cpu_schedule_next_thread returns thread_switch_data in x0 and x1,
+        #where x0 holds old_thread_state and x1 holds new_thread_state
+        #If cpu_schedule_next_thread() returns thread_switch_data with null pointers, exit
+        cmp x0, #0
+        b.eq 2f
+
+        #Store all callee-save registers on the old thread stack
+        stp    x19, x20, [sp, #16]
+        stp    x21, x22, [sp, #32]
+        stp    x23, x24, [sp, #48]
+        stp    x25, x26, [sp, #64]
+        stp    x27, x28, [sp, #80]
+
+        stp    d8, d9, [sp, #96]
+        stp    d10, d11, [sp, #112]
+        stp    d12, d13, [sp, #128]
+        stp    d14, d15, [sp, #144]
+
+        #Switch to new thread
+        ldr    x2, [x1, #32]     //Fetch _tcb of the new thread
+        msr    tpidr_el0, x2     //Set thread pointer 
+        isb
+
+        str     x29, [x0, #0]     //Save frame pointer of the old thread
+        mov     x3, sp            //Fetch old thread stack pointer
+        adr     x4, 1f            //Fetch old thread instruction pointer
+        stp     x3, x4, [x0, #16] //Save old thread sp and pc
+
+        ldp     x29, x0, [x1, #0] //Set frame pointer of the new thread and this (x0) of the new thread
+                                  //Please note that the pc may point to thread_main_c(thread*) which is
+                                  //why we have to set x0 (1st argument) to the new thread object
+        ldp     x3, x4, [x1, #16] //Fetch new thread sp and pc
+        mov     sp, x3            //Set new thread stack pointer
+        blr     x4                //Jump to the new thread pc
+
+1:
+        #Restore all callee-save registers from the new thread stack
+        ldp     x19, x20, [sp, #16]
+        ldp     x21, x22, [sp, #32]
+        ldp     x23, x24, [sp, #48]
+        ldp     x25, x26, [sp, #64]
+        ldp     x27, x28, [sp, #80]
+
+        ldp    d8, d9, [sp, #96]
+        ldp    d10, d11, [sp, #112]
+        ldp    d12, d13, [sp, #128]
+        ldp    d14, d15, [sp, #144]
+
+        #Call destroy_current_cpu_terminating_thread()
+        bl destroy_current_cpu_terminating_thread
+
+2:
+        ldp     x29, x30, [sp], #176
+        #.cfi_def_cfa sp, 0
+        ret
+        #.cfi_endproc
diff --git a/core/sched.cc b/core/sched.cc
index 06f849d1..74b4ab95 100644
--- a/core/sched.cc
+++ b/core/sched.cc
@@ -225,13 +225,34 @@ void thread::cputime_estimator_get(
 void cpu::schedule()
 {
     WITH_LOCK(irq_lock) {
+#ifdef AARCH64_PORT_STUB
+        reschedule_from_interrupt(sched::cpu::current(), false, thyst);
+#else
         current()->reschedule_from_interrupt();
-    }
-}
-
+#endif
+    }
+}
+
+// In the aarch64 port, reschedule_from_interrupt() needs to be implemented
+// in assembly (please see arch/aarch64/sched.S) to give us better control
+// over which registers are used and which ones are saved and restored during
+// the context switch. In essence, most of the original reschedule_from_interrupt()
+// code up to switch_to() is shared with the aarch64-specific cpu::schedule_next_thread()
+// that is called from reschedule_from_interrupt() in the arch/aarch64/sched.S assembly.
+// The logic executed after switch_to(), which makes up destroy_current_cpu_terminating_thread(),
+// is called from arch/aarch64/sched.S as well.
+// In the end, we define reschedule_from_interrupt() in C++ for x86_64, and schedule_next_thread()
+// and destroy_current_cpu_terminating_thread() in C++ for aarch64.
+#ifdef AARCH64_PORT_STUB
+inline thread_switch_data cpu::schedule_next_thread(bool called_from_yield,
+                                                    thread_runtime::duration preempt_after)
+{
+    thread_switch_data switch_data;
+#else
 void cpu::reschedule_from_interrupt(bool called_from_yield,
                                     thread_runtime::duration preempt_after)
 {
+#endif
     trace_sched_sched();
     assert(sched::exception_depth <= 1);
     need_reschedule = false;
@@ -259,7 +280,11 @@ void cpu::reschedule_from_interrupt(bool called_from_yield,
         // lowest runtime, and update the timer until the next thread's turn.
         if (runqueue.empty()) {
             preemption_timer.cancel();
+#ifdef AARCH64_PORT_STUB
+            return switch_data;
+#else
             return;
+#endif
         } else if (!called_from_yield) {
             auto &t = *runqueue.begin();
             if (p->_runtime.get_local() < t._runtime.get_local()) {
@@ -268,7 +293,11 @@ void cpu::reschedule_from_interrupt(bool called_from_yield,
                 if (delta > 0) {
                     preemption_timer.set(now + delta);
                 }
+#ifdef AARCH64_PORT_STUB
+                return switch_data;
+#else
                 return;
+#endif
             }
         }
         // If we're here, p no longer has the lowest runtime. Before queuing
@@ -336,19 +365,39 @@ void cpu::reschedule_from_interrupt(bool called_from_yield,
     if (lazy_flush_tlb.exchange(false, std::memory_order_seq_cst)) {
         mmu::flush_tlb_local();
     }
+#ifdef AARCH64_PORT_STUB
+    switch_data.old_thread_state = &(p->_state);
+    switch_data.new_thread_state = &(n->_state);
+    return switch_data;
+}
+#else
     n->switch_to();
+#endif
 
     // Note: after the call to n->switch_to(), we should no longer use any of
     // the local variables, nor "this" object, because we just switched to n's
     // stack and the values we can access now are those that existed in the
     // reschedule call which scheduled n out, and will now be returning.
     // So to get the current cpu, we must use cpu::current(), not "this".
+#ifdef AARCH64_PORT_STUB
+extern "C" void destroy_current_cpu_terminating_thread()
+{
+#endif
     if (cpu::current()->terminating_thread) {
         cpu::current()->terminating_thread->destroy();
         cpu::current()->terminating_thread = nullptr;
     }
 }
 
+#ifdef AARCH64_PORT_STUB
+extern "C" thread_switch_data cpu_schedule_next_thread(cpu* cpu,
+                                                       bool called_from_yield,
+                                                       thread_runtime::duration preempt_after)
+{
+    return cpu->schedule_next_thread(called_from_yield, preempt_after);
+}
+#endif
+
 void cpu::timer_fired()
 {
     // nothing to do, preemption will happen if needed
@@ -521,7 +570,11 @@ void thread::pin(cpu *target_cpu)
         // Note that wakeme is on the same CPU, and irq is disabled,
         // so it will not actually run until we stop running.
         wakeme->wake_with([&] { do_wakeme = true; });
+#ifdef AARCH64_PORT_STUB
+        reschedule_from_interrupt(source_cpu, false, thyst);
+#else
         source_cpu->reschedule_from_interrupt();
+#endif
     }
     // wakeme will be implicitly join()ed here.
 }
@@ -760,7 +813,11 @@ void thread::yield(thread_runtime::duration preempt_after)
     }
     trace_sched_yield_switch();
 
+#ifdef AARCH64_PORT_STUB
+    reschedule_from_interrupt(cpu::current(), true, preempt_after);
+#else
     cpu::current()->reschedule_from_interrupt(true, preempt_after);
+#endif
 }
 
 void thread::set_priority(float priority)
diff --git a/include/osv/sched.hh b/include/osv/sched.hh
index 0fb44f77..1dacd623 100644
--- a/include/osv/sched.hh
+++ b/include/osv/sched.hh
@@ -31,6 +31,9 @@ typedef float runtime_t;
 
 extern "C" {
 void smp_main();
+#ifdef AARCH64_PORT_STUB
+void destroy_current_cpu_terminating_thread();
+#endif
 };
 void smp_launch();
 
@@ -321,7 +324,12 @@ constexpr thread_runtime::duration thyst = std::chrono::milliseconds(5);
 constexpr thread_runtime::duration context_switch_penalty =
                                            std::chrono::microseconds(10);
 
-
+#ifdef AARCH64_PORT_STUB
+struct thread_switch_data {
+   thread_state* old_thread_state = nullptr;
+   thread_state* new_thread_state = nullptr;
+};
+#endif
 /**
  * OSv thread
  */
@@ -600,7 +608,9 @@ private:
             unsigned allowed_initial_states_mask = 1 << unsigned(status::waiting));
     static void sleep_impl(timer &tmr);
     void main();
+#ifndef AARCH64_PORT_STUB
     void switch_to();
+#endif
     void switch_to_first();
     void prepare_wait();
     void wait();
@@ -702,7 +712,12 @@ private:
     std::vector<char*> _tls;
     bool _app;
     std::shared_ptr<osv::application_runtime> _app_runtime;
+public: //TODO: For whatever reason we need to make this public: even though destroy_current_cpu_terminating_thread is declared as a friend below, the compiler still complains.
     void destroy();
+private:
+#ifdef AARCH64_PORT_STUB
+    friend void ::destroy_current_cpu_terminating_thread();
+#endif
     friend class thread_ref_guard;
     friend void thread_main_c(thread* t);
     friend class wait_guard;
@@ -884,8 +899,13 @@ struct cpu : private timer_base::client {
      * @param preempt_after fire the preemption timer after this time period
      *                      (relevant only when "called_from_yield" is TRUE)
      */
+#ifdef AARCH64_PORT_STUB
+    thread_switch_data schedule_next_thread(bool called_from_yield = false,
+                                            thread_runtime::duration preempt_after = thyst);
+#else
     void reschedule_from_interrupt(bool called_from_yield = false,
-                                thread_runtime::duration preempt_after = thyst);
+                                   thread_runtime::duration preempt_after = thyst);
+#endif
     void enqueue(thread& t);
     void init_idle_thread();
     virtual void timer_fired() override;
@@ -895,6 +915,12 @@ struct cpu : private timer_base::client {
     int renormalize_count;
 };
 
+#ifdef AARCH64_PORT_STUB
+extern "C" void reschedule_from_interrupt(cpu* cpu,
+                                          bool called_from_yield,
+                                          thread_runtime::duration preempt_after);
+#endif
+
 class cpu::notifier {
 public:
     explicit notifier(std::function<void ()> cpu_up);
@@ -996,7 +1022,11 @@ inline bool preemptable()
 inline void preempt()
 {
     if (preemptable()) {
+#ifdef AARCH64_PORT_STUB
+        reschedule_from_interrupt(sched::cpu::current(), false, thyst);
+#else
         sched::cpu::current()->reschedule_from_interrupt();
+#endif
     } else {
         // preempt_enable() will pick this up eventually
         need_reschedule = true;
-- 
2.27.0
