Re: [osv-dev] OSv can run statically linked executable

Waldek Kozaczuk Tue, 25 Apr 2023 19:55:16 -0700


On Monday, April 24, 2023 at 2:08:26 PM UTC-4 Waldek Kozaczuk wrote:


On Mon, Apr 24, 2023 at 4:34 AM Nadav Har'El <n...@scylladb.com> wrote:

On Mon, Apr 24, 2023 at 6:26 AM Waldek Kozaczuk <jwkoz...@gmail.com> wrote:

Hi,

Over the recent week, I have been working to get OSv to run a simple "hello 
world" app (aka native-example) built as a position-dependent statically 
linked executable. 


Nice!
 


   1. Support the arch_prctl syscall that sets the app TLS 
      - this was by far the most complicated element that required changing 
      OSv to store new per-pcpu data pointer in GS register and enhancing both 
      syscall handler and interrupt/page fault handler to detect and switch if 
      needed the FS base to the kernel TLS on entry and back to the app one on 
      exit (see 
      
https://github.com/cloudius-systems/osv/issues/1137#issuecomment-1512315880
      ) 
   

If this has noticeable overhead, perhaps it makes sense to make it optional?

I have not measured it in any formal way. But when testing some of the 
earlier versions of the code, I could see the context switch time (the 
colocated one measured by misc-ctxsw) go up from 313 to 362 ns caused by 
adding this line: 

processor::wrmsr(msr::IA32_FS_KERNEL_BASE, reinterpret_cast<u64>(_tcb));
which may indirectly measure the cost of the code to change GS or FS base 
using the MSR instruction at ~ 50 ns (yikes). I would think the FSGSBASE 
instruction should be faster.


BTW I have measured indirectly the cost of the MSR and wrgsbase indirectly 
by modifying the thread::switch() like this and running misc-ctxsw (I 
assume the cost of wrfsbase would be identical):
 
+uint32_t IA32_GS_BASE = 0xc0000101;
 void thread::switch_to()
 {
     thread* old = current();
@@ -81,6 +82,8 @@ void thread::switch_to()
     // barriers
     barrier();
     set_fsbase(reinterpret_cast<u64>(_tcb));
+    //asm volatile("wrgsbase %0" : : "r"(reinterpret_cast<u64>(_tcb)));
+    //processor::wrmsr(IA32_GS_BASE, reinterpret_cast<u64>(_tcb));
     barrier();
     auto c = _detached_state->_cpu;
     old->_state.exception_stack = c->arch.get_exception_stack();

With uncommented wrgsbase the cost of colocated context switch barely 
budged. On average, I could see a maximum of 1-2 ns difference if any. 
Sometimes the times were identical. So it seems the wrgsbase is pretty 
cheap though we should avoid calling it in the interrupt/page fault/syscall 
handler.

On the other hand, with uncommented wrmsr code the cost of the context 
switch bumped by ~50ns so this instruction is very expensive. That is also 
why we need to especially avoid it if wrgsbase is not available.


Here is a subset of the changes I had to make to the context-switching code 
and the interrupt/syscall handler:

1. Add 2 new fields to the thread control block:

    unsigned long app_tcb;   //holds address of the address the app passed 
to arch_prctl

    long kernel_tcb_counter; //if 0 means we have to do an app/kernel/app 
FS base switch

2. Setup new per-cpu data intended to hold a pointer to the tcb:

--- a/arch/x64/arch-cpu.hh

+++ b/arch/x64/arch-cpu.hh


+struct tcb_data {

+    u64 kernel_tcb;

+    u64 tmp[2];

+};

+

 struct arch_cpu {

     arch_cpu();

     processor::aligned_task_state_segment atss;

@@ -46,6 +52,7 @@ struct arch_cpu {

     u32 apic_id;

     u32 acpi_id;

     u64 gdt[nr_gdt];

+    tcb_data _tcb_data;

     void init_on_cpu();

     void set_ist_entry(unsigned ist, char* base, size_t size);

     char* get_ist_entry(unsigned ist);

@@ -181,6 +188,8 @@ inline void arch_cpu::init_on_cpu()

     processor::init_fpu();

 

     processor::init_syscall();

+

+    processor::wrmsr(msr::IA32_GS_BASE, 
reinterpret_cast<u64>(&_tcb_data.kernel_tcb));

 }



3. Change kernel fs pointer on each context switch.

--- a/arch/x64/arch-switch.hh

+++ b/arch/x64/arch-switch.hh


@@ -81,11 +81,13 @@ void thread::switch_to()

...

     c->arch.set_exception_stack(_state.exception_stack);

+    c->arch._tcb_data.kernel_tcb = reinterpret_cast<u64>(_tcb); //This 
should be very fast

     auto fpucw = processor::fnstcw();

...

@@ -258,6 +260,7 @@ void thread::setup_tcb()

     else {

         _tcb->syscall_stack_top = 0;

     }

+    _tcb->kernel_tcb_counter = 1; //By default disable fs base switch

 }

4. Handle fs switch if necessary on entry/exit of syscall/exception/page 
fault handler:

This is just a code change around syscall entry but we have to do the 
opposite for exit and similar for page fault/interrupt handler (possibly 
signal handler as well) 

@@ -174,6 +214,26 @@ syscall_entry:

     .cfi_register rip, rcx # rcx took previous rip value

     .cfi_register rflags, r11 # r11 took previous rflags value

     # There is no ring transition and rflags are left unchanged.

+    #

+    # app->kernel tcb switch

+    movq %rax, %gs:8  # save register rax so we can restore it later

+    movq %gs:0, %rax  # copy address of kernel tcb to the temp register rax

+    #1. Check if kernel_tcb_counter 0 and jump over to 3 if not (no need 
to do fsbase switch)

+    cmpq $0, 40(%rax)

+    jne on_kernel_tcb

+

+    #2. If zero set fs MSR to kernel tcb

+    movq %rbx, %gs:16  # save register rbx so we can restore it later

+    movq (%rax), %rbx # set kernel tcb

+    wrfsbase %rbx //TODO: In reality we need to check if wrfsbase is 
available and use wrmsr if not

+    movq %gs:16, %rbx

+

+on_kernel_tcb:

+    #3. Increment counter (for nested case)

+    incq 40(%rax)

+    #4. Restore %rax

+    movq %gs:8, %rax

+

     #

     # Unfortunately the mov instruction cannot be used to dereference an 
address

     # on syscall stack pointed by address in TCB (%fs:16) - double memory 
dereference.


I did measure that the context switch code is not affected in any way. But 
I am sure the syscall/page fault/interrupt handler is affected but 
hopefully by a tiny bit for all the cases except when the application 
thread (of the static elf) gets interrupted, triggers page fault, or makes 
a syscall call. In other words, I hope that kernel threads and normal 
(non-static-elf) threads would not be affected. We could also add the 
necessary #ifdef static_elf.

Any ideas on how to measure how much slower the interrupt/syscall/page 
fault handler would become?


   1. Fixing a potential bug in handling TCGETS in the console driver.

I'm curious what this bug was - I am personally fond of this area of this 
code, as you can see from the history
lesson in drivers/line-discipline.cc :-)

I think it may have to do with some size difference of the termios struct 
between glibc and OSv. The symptom seemed to be a corrupted stack after 
ioctl syscall call that ended up calling the code to handle TCGETS. This 
change seems to fix it:

--- a/drivers/console.cc

+++ b/drivers/console.cc

@@ -68,7 +68,16 @@ console_ioctl(u_long request, void *arg)

 {

     switch (request) {

     case TCGETS:

-        *static_cast<termios*>(arg) = tio;

+        //*static_cast<termios*>(arg) = tio;

+        {

+          termios *in = static_cast<termios*>(arg);

+          in->c_iflag = tio.c_iflag;

+          in->c_oflag = tio.c_oflag;

+          in->c_cflag = tio.c_cflag;

+          in->c_lflag = tio.c_lflag;

+          in->c_line = tio.c_line;

+        }

         return 0;
  
I think I have missed the c_cc field.

Here is the relevant code in glibc - 
https://github.com/bminor/glibc/blob/ae612c45efb5e34713859a5facf92368307efb6e/sysdeps/unix/sysv/linux/tcgetattr.c

 


   1. Implement sys_prlimit
   2. Enable the readlink, geteuid and getegid

I think we already had those - or did you mean the system call?

Yes, just add SYSCALLx() macros to linux.cc 

 

This was enough to run a single-threaded app but we will need to implement 
the clone syscall to support multi-threaded apps. 


Very nice. You can probably start by implementing the "simple" case of 
clone() used by a simple multi-threaded application and
leave the other cases with UNIMPLEMENTED (or "ignore" various parameters 
and leave them to be perfected later, with WARN_ONCE)
 

-- 
You received this message because you are subscribed to the Google Groups "OSv 
Development" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to osv-dev+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/osv-dev/c3b84616-d161-4b89-91cb-1e91b255a211n%40googlegroups.com.

Re: [osv-dev] OSv can run statically linked executable

Reply via email to