On Monday, April 24, 2023 at 2:08:26 PM UTC-4 Waldek Kozaczuk wrote:
On Mon, Apr 24, 2023 at 4:34 AM Nadav Har'El <n...@scylladb.com> wrote: On Mon, Apr 24, 2023 at 6:26 AM Waldek Kozaczuk <jwkoz...@gmail.com> wrote: Hi, Over the recent week, I have been working to get OSv to run a simple "hello world" app (aka native-example) built as a position-dependent statically linked executable. Nice! 1. Support the arch_prctl syscall that sets the app TLS - this was by far the most complicated element that required changing OSv to store new per-pcpu data pointer in GS register and enhancing both syscall handler and interrupt/page fault handler to detect and switch if needed the FS base to the kernel TLS on entry and back to the app one on exit (see https://github.com/cloudius-systems/osv/issues/1137#issuecomment-1512315880 ) If this has noticeable overhead, perhaps it makes sense to make it optional? I have not measured it in any formal way. But when testing some of the earlier versions of the code, I could see the context switch time (the colocated one measured by misc-ctxsw) go up from 313 to 362 ns caused by adding this line: processor::wrmsr(msr::IA32_FS_KERNEL_BASE, reinterpret_cast<u64>(_tcb)); which may indirectly measure the cost of the code to change GS or FS base using the MSR instruction at ~ 50 ns (yikes). I would think the FSGSBASE instruction should be faster. BTW I have measured indirectly the cost of the MSR and wrgsbase indirectly by modifying the thread::switch() like this and running misc-ctxsw (I assume the cost of wrfsbase would be identical): +uint32_t IA32_GS_BASE = 0xc0000101; void thread::switch_to() { thread* old = current(); @@ -81,6 +82,8 @@ void thread::switch_to() // barriers barrier(); set_fsbase(reinterpret_cast<u64>(_tcb)); + //asm volatile("wrgsbase %0" : : "r"(reinterpret_cast<u64>(_tcb))); + //processor::wrmsr(IA32_GS_BASE, reinterpret_cast<u64>(_tcb)); barrier(); auto c = _detached_state->_cpu; old->_state.exception_stack = c->arch.get_exception_stack(); With uncommented wrgsbase the cost of colocated context switch barely budged. On average, I could see a maximum of 1-2 ns difference if any. Sometimes the times were identical. So it seems the wrgsbase is pretty cheap though we should avoid calling it in the interrupt/page fault/syscall handler. On the other hand, with uncommented wrmsr code the cost of the context switch bumped by ~50ns so this instruction is very expensive. That is also why we need to especially avoid it if wrgsbase is not available. Here is a subset of the changes I had to make to the context-switching code and the interrupt/syscall handler: 1. Add 2 new fields to the thread control block: unsigned long app_tcb; //holds address of the address the app passed to arch_prctl long kernel_tcb_counter; //if 0 means we have to do an app/kernel/app FS base switch 2. Setup new per-cpu data intended to hold a pointer to the tcb: --- a/arch/x64/arch-cpu.hh +++ b/arch/x64/arch-cpu.hh +struct tcb_data { + u64 kernel_tcb; + u64 tmp[2]; +}; + struct arch_cpu { arch_cpu(); processor::aligned_task_state_segment atss; @@ -46,6 +52,7 @@ struct arch_cpu { u32 apic_id; u32 acpi_id; u64 gdt[nr_gdt]; + tcb_data _tcb_data; void init_on_cpu(); void set_ist_entry(unsigned ist, char* base, size_t size); char* get_ist_entry(unsigned ist); @@ -181,6 +188,8 @@ inline void arch_cpu::init_on_cpu() processor::init_fpu(); processor::init_syscall(); + + processor::wrmsr(msr::IA32_GS_BASE, reinterpret_cast<u64>(&_tcb_data.kernel_tcb)); } 3. Change kernel fs pointer on each context switch. --- a/arch/x64/arch-switch.hh +++ b/arch/x64/arch-switch.hh @@ -81,11 +81,13 @@ void thread::switch_to() ... c->arch.set_exception_stack(_state.exception_stack); + c->arch._tcb_data.kernel_tcb = reinterpret_cast<u64>(_tcb); //This should be very fast auto fpucw = processor::fnstcw(); ... @@ -258,6 +260,7 @@ void thread::setup_tcb() else { _tcb->syscall_stack_top = 0; } + _tcb->kernel_tcb_counter = 1; //By default disable fs base switch } 4. Handle fs switch if necessary on entry/exit of syscall/exception/page fault handler: This is just a code change around syscall entry but we have to do the opposite for exit and similar for page fault/interrupt handler (possibly signal handler as well) @@ -174,6 +214,26 @@ syscall_entry: .cfi_register rip, rcx # rcx took previous rip value .cfi_register rflags, r11 # r11 took previous rflags value # There is no ring transition and rflags are left unchanged. + # + # app->kernel tcb switch + movq %rax, %gs:8 # save register rax so we can restore it later + movq %gs:0, %rax # copy address of kernel tcb to the temp register rax + #1. Check if kernel_tcb_counter 0 and jump over to 3 if not (no need to do fsbase switch) + cmpq $0, 40(%rax) + jne on_kernel_tcb + + #2. If zero set fs MSR to kernel tcb + movq %rbx, %gs:16 # save register rbx so we can restore it later + movq (%rax), %rbx # set kernel tcb + wrfsbase %rbx //TODO: In reality we need to check if wrfsbase is available and use wrmsr if not + movq %gs:16, %rbx + +on_kernel_tcb: + #3. Increment counter (for nested case) + incq 40(%rax) + #4. Restore %rax + movq %gs:8, %rax + # # Unfortunately the mov instruction cannot be used to dereference an address # on syscall stack pointed by address in TCB (%fs:16) - double memory dereference. I did measure that the context switch code is not affected in any way. But I am sure the syscall/page fault/interrupt handler is affected but hopefully by a tiny bit for all the cases except when the application thread (of the static elf) gets interrupted, triggers page fault, or makes a syscall call. In other words, I hope that kernel threads and normal (non-static-elf) threads would not be affected. We could also add the necessary #ifdef static_elf. Any ideas on how to measure how much slower the interrupt/syscall/page fault handler would become? 1. Fixing a potential bug in handling TCGETS in the console driver. I'm curious what this bug was - I am personally fond of this area of this code, as you can see from the history lesson in drivers/line-discipline.cc :-) I think it may have to do with some size difference of the termios struct between glibc and OSv. The symptom seemed to be a corrupted stack after ioctl syscall call that ended up calling the code to handle TCGETS. This change seems to fix it: --- a/drivers/console.cc +++ b/drivers/console.cc @@ -68,7 +68,16 @@ console_ioctl(u_long request, void *arg) { switch (request) { case TCGETS: - *static_cast<termios*>(arg) = tio; + //*static_cast<termios*>(arg) = tio; + { + termios *in = static_cast<termios*>(arg); + in->c_iflag = tio.c_iflag; + in->c_oflag = tio.c_oflag; + in->c_cflag = tio.c_cflag; + in->c_lflag = tio.c_lflag; + in->c_line = tio.c_line; + } return 0; I think I have missed the c_cc field. Here is the relevant code in glibc - https://github.com/bminor/glibc/blob/ae612c45efb5e34713859a5facf92368307efb6e/sysdeps/unix/sysv/linux/tcgetattr.c 1. Implement sys_prlimit 2. Enable the readlink, geteuid and getegid I think we already had those - or did you mean the system call? Yes, just add SYSCALLx() macros to linux.cc This was enough to run a single-threaded app but we will need to implement the clone syscall to support multi-threaded apps. Very nice. You can probably start by implementing the "simple" case of clone() used by a simple multi-threaded application and leave the other cases with UNIMPLEMENTED (or "ignore" various parameters and leave them to be perfected later, with WARN_ONCE) -- You received this message because you are subscribed to the Google Groups "OSv Development" group. To unsubscribe from this group and stop receiving emails from it, send an email to osv-dev+unsubscr...@googlegroups.com. To view this discussion on the web visit https://groups.google.com/d/msgid/osv-dev/c3b84616-d161-4b89-91cb-1e91b255a211n%40googlegroups.com.