Re: [osv-dev] Lazily allocating thread stacks WIP

Waldek Kozaczuk Tue, 10 Dec 2019 15:03:58 -0800

So I have applied your patch (had to manually reformat it from the email so 
it might be somewhat different):


diff --git a/arch/x64/arch-switch.hh b/arch/x64/arch-switch.hh
index 6803498f..70224472 100644
--- a/arch/x64/arch-switch.hh
+++ b/arch/x64/arch-switch.hh
@@ -148,6 +148,11 @@ void thread::init_stack()
     _state.rip = reinterpret_cast<void*>(thread_main);
     _state.rsp = stacktop;
     _state.exception_stack = _arch.exception_stack + 
sizeof(_arch.exception_stack);
+
+    // Since thread stacks are lazily allocated and the thread initially 
starts
+    // running with preemption disabled, we need to pre-fault the first 
stack page.
+    volatile char r = *((char*)(stacktop-1));
+    (void) r; // trick the compiler into thinking that r is used
 }
 
 void thread::setup_tcb()
diff --git a/arch/x64/arch.hh b/arch/x64/arch.hh
index 17df5f5c..a0c18aa4 100644
--- a/arch/x64/arch.hh
+++ b/arch/x64/arch.hh
@@ -20,8 +20,14 @@ namespace arch {
 #define INSTR_SIZE_MIN 1
 #define ELF_IMAGE_START OSV_KERNEL_BASE
 
+inline void read_next_stack_page() {
+    char i;
+    asm volatile("movb -4096(%%rsp), %0" : "=r"(i));
+}
+
 inline void irq_disable()
 {
+    read_next_stack_page();
     processor::cli();
 }
 
@@ -30,6 +36,7 @@ inline void irq_disable_notrace();
 
 inline void irq_disable_notrace()
 {
+    read_next_stack_page();
     processor::cli_notrace();
 }
 
diff --git a/include/osv/mmu-defs.hh b/include/osv/mmu-defs.hh
index 18edf441..694815f8 100644
--- a/include/osv/mmu-defs.hh
+++ b/include/osv/mmu-defs.hh
@@ -84,6 +84,7 @@ enum {
     mmap_small       = 1ul << 5,
     mmap_jvm_balloon = 1ul << 6,
     mmap_file        = 1ul << 7,
+    mmap_stack       = 1ul << 8,
 };
 
 enum {
diff --git a/include/osv/sched.hh b/include/osv/sched.hh
index 66c0a44a..3ae1c2d0 100644
--- a/include/osv/sched.hh
+++ b/include/osv/sched.hh
@@ -1000,6 +1000,7 @@ inline void preempt()
 
 inline void preempt_disable()
 {
+    arch::read_next_stack_page();
     ++preempt_counter;
     barrier();
 }
diff --git a/libc/mman.cc b/libc/mman.cc
index bb573a80..5b0436e2 100644
--- a/libc/mman.cc
+++ b/libc/mman.cc
@@ -44,7 +44,7 @@ unsigned libc_flags_to_mmap(int flags)
         // and did us the courtesy of telling this to ue (via MAP_STACK),
         // let's return the courtesy by returning pre-faulted memory.
         // FIXME: If issue #143 is fixed, this workaround should be 
removed.
-        mmap_flags |= mmu::mmap_populate;
+        mmap_flags |= mmu::mmap_stack;
     }
     if (flags & MAP_SHARED) {
         mmap_flags |= mmu::mmap_shared;
diff --git a/libc/pthread.cc b/libc/pthread.cc
index 44b93b83..ea909d5d 100644
--- a/libc/pthread.cc
+++ b/libc/pthread.cc
@@ -139,7 +139,8 @@ namespace pthread_private {
             return {attr.stack_begin, attr.stack_size};
         }
         size_t size = attr.stack_size;
-        void *addr = mmu::map_anon(nullptr, size, mmu::mmap_populate, 
mmu::perm_rw);
+        printf("--> Created stack of size: %ld\n", size);
+        void *addr = mmu::map_anon(nullptr, size, mmu::mmap_stack, 
mmu::perm_rw);
         mmu::mprotect(addr, attr.guard_size, 0);
         sched::thread::stack_info si{addr, size};
         si.deleter = free_stack;

I ran unit tests once and all passed (I use the bare metal machine so the 
KVM is enabled) but I have not stressed them much. 

I played a bit more with Java apps to see if we truly see the memory 
savings. Indeed simple Java hello world uses around 10MB less with 12 Java 
application threads. 

I also played with Java httpserver app (./scripts/build -j4 
image=openjdk8-zulu-full,java-httpserver) and stressed it a bit with apache 
bench (ab -k -c 50 -n 1000 http://localhost:8000/). Overall it worked but 
here I got a crash which indicates some shortcomings of our solutions:

(gdb) bt
#0  abort (fmt=fmt@entry=0x40656db8 "Assertion failed: %s (%s: %s: %d)\n") 
at runtime.cc:105
#1  0x000000004023d4cf in __assert_fail (expr=expr@entry=0x4068f1c1 
"sched::preemptable()", file=file@entry=0x40685125 "arch/x64/mmu.cc", 
line=line@entry=37, func=func@entry=0x4068511a "page_fault") at 
runtime.cc:139
#2  0x00000000403aee0c in page_fault (ef=0xffff800000015048) at 
arch/x64/arch-cpu.hh:107
#3  <signal handler called>
#4  0x00000000403606c5 in elf::object::symtab_len (this=<optimized out>) at 
core/elf.cc:896
#5  0x0000000040360780 in elf::object::lookup_addr 
(this=0xffffa000014c9c00, addr=addr@entry=0x100000f25eba 
<os::PlatformEvent::park()+186>) at core/elf.cc:928
#6  0x000000004036094f in elf::program::<lambda(const 
elf::program::modules_list&)>::operator() (__closure=<synthetic pointer>, 
__closure=<synthetic pointer>, ml=...) at core/elf.cc:1464
#7  elf::program::with_modules<elf::program::lookup_addr(void 
const*)::<lambda(const elf::program::modules_list&)> > (f=..., 
this=0xffffa00000091bb0) at include/osv/elf.hh:687
#8  elf::program::lookup_addr (this=0xffffa00000091bb0, 
addr=addr@entry=0x100000f25eba <os::PlatformEvent::park()+186>) at 
core/elf.cc:1461
#9  0x000000004043e540 in osv::lookup_name_demangled 
(addr=addr@entry=0x100000f25eba <os::PlatformEvent::park()+186>, 
buf=buf@entry=0xffff80000247a750 "condvar::wait(lockfree::mutex*, 
sched::timer*)+526", len=len@entry=1024) at core/demangle.cc:47
#10 0x000000004023d2b0 in print_backtrace () at runtime.cc:85
#11 0x000000004023d487 in abort (fmt=fmt@entry=0x40656db8 "Assertion 
failed: %s (%s: %s: %d)\n") at runtime.cc:121
#12 0x000000004023d4cf in __assert_fail (expr=expr@entry=0x4068f1c1 
"sched::preemptable()", file=file@entry=0x40685125 "arch/x64/mmu.cc", 
line=line@entry=37, func=func@entry=0x4068511a "page_fault") at 
runtime.cc:139
#13 0x00000000403aee0c in page_fault (ef=0xffff80000247b068) at 
arch/x64/arch-cpu.hh:107
#14 <signal handler called>
#15 arch::read_next_stack_page () at arch/x64/arch.hh:30
#16 arch::irq_disable () at arch/x64/arch.hh:30
#17 irq_save_lock_type::lock (this=<synthetic pointer>) at 
include/osv/irqlock.hh:31
#18 std::lock_guard<irq_save_lock_type>::lock_guard (__m=<synthetic 
pointer>..., this=<synthetic pointer>) at 
/usr/include/c++/9/bits/std_mutex.h:159
#19 lock_guard_for_with_lock<irq_save_lock_type>::lock_guard_for_with_lock 
(lock=<synthetic pointer>..., this=<synthetic pointer>) at 
include/osv/mutex.h:89
#20 sched::timer_base::cancel (this=0xffff800000016798) at 
core/sched.cc:1493
#21 0x000000004040e00d in sched::cpu::reschedule_from_interrupt 
(this=0xffff800000013040, called_from_yield=called_from_yield@entry=false, 
preempt_after=..., preempt_after@entry=...) at core/sched.cc:320
#22 0x000000004040eda7 in sched::cpu::schedule () at 
include/osv/sched.hh:1311
#23 0x000000004040eea6 in sched::thread::wait 
(this=this@entry=0xffff800002476040) at core/sched.cc:1214
#24 0x00000000403ecf2f in 
sched::thread::do_wait_until<sched::noninterruptible, 
sched::thread::dummy_lock, waiter::wait(sched::timer*) 
const::{lambda()#1}>(sched::thread::dummy_lock&, 
waiter::wait(sched::timer*) const::{lambda()#1}) (pred=..., 
    mtx=<synthetic pointer>...) at include/osv/sched.hh:939
#25 sched::thread::wait_until<waiter::wait(sched::timer*) 
const::{lambda()#1}>(waiter::wait(sched::timer*) const::{lambda()#1}) 
(pred=...) at include/osv/sched.hh:1078
#26 waiter::wait (tmr=0x0, this=0x2000107d0060) at 
include/osv/wait_record.hh:46
#27 condvar::wait (this=0xffffa00002985400, user_mutex=0xffffa00002987d80, 
tmr=<optimized out>) at core/condvar.cc:43
#28 0x0000100000f25ebb in os::PlatformEvent::park() ()
#29 0x0000100000ee17ef in Monitor::ILock(Thread*) ()
#30 0x0000100000ee1e2f in Monitor::lock_without_safepoint_check() ()
#31 0x0000100000fbf63d in SafepointSynchronize::block(JavaThread*) ()
#32 0x0000100001079be8 in 
JavaThread::check_safepoint_and_suspend_for_native_trans(JavaThread*) ()
#33 0x0000100000a111dd in ciEnv::get_field_by_index(ciInstanceKlass*, int) 
()
#34 0x0000100000a36033 in ciBytecodeStream::get_field(bool&) ()
#35 0x000010000096be6a in GraphBuilder::access_field(Bytecodes::Code) ()
#36 0x000010000096d983 in GraphBuilder::iterate_bytecodes_for_block(int) ()
#37 0x000010000096f4cd in GraphBuilder::iterate_all_blocks(bool) ()
#38 0x000010000096fde1 in GraphBuilder::try_inline_full(ciMethod*, bool, 
Bytecodes::Code, Instruction*) ()
#39 0x0000100000970061 in GraphBuilder::try_inline(ciMethod*, bool, 
Bytecodes::Code, Instruction*) ()
#40 0x00001000009712c4 in GraphBuilder::invoke(Bytecodes::Code) ()
#41 0x000010000096d973 in GraphBuilder::iterate_bytecodes_for_block(int) ()
#42 0x000010000096f4cd in GraphBuilder::iterate_all_blocks(bool) ()
#43 0x000010000096fde1 in GraphBuilder::try_inline_full(ciMethod*, bool, 
Bytecodes::Code, Instruction*) ()
#44 0x0000100000970061 in GraphBuilder::try_inline(ciMethod*, bool, 
Bytecodes::Code, Instruction*) ()
#45 0x00001000009712c4 in GraphBuilder::invoke(Bytecodes::Code) ()
#46 0x000010000096d973 in GraphBuilder::iterate_bytecodes_for_block(int) ()
#47 0x000010000096f453 in GraphBuilder::iterate_all_blocks(bool) ()
#48 0x0000100000972abe in GraphBuilder::GraphBuilder(Compilation*, 
IRScope*) ()
#49 0x000010000097917c in IR::IR(Compilation*, ciMethod*, int) ()
#50 0x000010000095999d in Compilation::build_hir() ()
#51 0x000010000095a0e5 in Compilation::compile_java_method() ()
#52 0x000010000095a202 in Compilation::compile_method() ()
#53 0x000010000095a530 in Compilation::Compilation(AbstractCompiler*, 
ciEnv*, ciMethod*, int, BufferBlob*) ()
#54 0x000010000095af43 in Compiler::compile_method(ciEnv*, ciMethod*, int) 
()
#55 0x0000100000aa1636 in 
CompileBroker::invoke_compiler_on_method(CompileTask*) ()
#56 0x0000100000aa3f17 in CompileBroker::compiler_thread_loop() ()
#57 0x000010000107f881 in JavaThread::thread_main_inner() ()
#58 0x0000100000f2e152 in java_start(Thread*) ()
#59 0x000000004047858a in pthread_private::pthread::<lambda()>::operator() 
(__closure=0xffffa00002983600) at libc/pthread.cc:114
--Type <RET> for more, q to quit, c to continue without paging--qq
Quit

(gdb) frame 28
#28 0x0000100000f25ebb in os::PlatformEvent::park() ()
(gdb) info registers rsp
rsp            0x2000107d00b0      0x2000107d00b0
(gdb) frame 15
#15 arch::read_next_stack_page () at arch/x64/arch.hh:30
30     read_next_stack_page();
(gdb) info registers rsp
rsp            0x2000107cff80      0x2000107cff80
 
As you can see we got assertion violation in the page fault handler - it 
wants preemption enabled (as expected). My understanding is that the code 
at frame 15 was ALREADY running with preemption disabled (I think that 
*preempt_disable*() was already called by the scheduler and possibly page 
fault caused or not) and sow when *irq_disable*() was called preemption was 
disabled when read_next_page() triggered a fault.

So there is a logical nesting problem where irq_disable is called after 
preempt_disable and vice versa. We somehow need to make sure that 
read_next_stack_page() is only called once in that case. Some kind of flag 
maybe.

Waldek

On Tuesday, December 10, 2019 at 1:39:58 PM UTC-5, Waldek Kozaczuk wrote:
>
> Hi,
>
> On Tuesday, December 10, 2019 at 1:26:31 PM UTC-5, Matthew Pabst wrote:
>>
>> On Sunday, December 8, 2019 at 1:03:45 PM UTC-6, Waldek Kozaczuk wrote:
>>>
>>>
>>> I think it is more than just "wasting the last page of a small stack". 
>>> Without a check to validate if we are reading within the bounds of the 
>>> stack, we could cause unintended faults that would actually crash the app, 
>>> right? 
>>> If we want to make the logic in `irq_disable`  as efficient as possible, 
>>> we could place the bottom of the stack address in some well-known place 
>>> (thread-local, cpu-local?) on each stack switch and check against it before 
>>> trying to read. But some sort of extra 'if' seems to be unavoidable. 
>>> Otherwise, we will crash the app if the operates on the last page of the 
>>> stack, right?
>>>
>>
>> This is what I was originally getting at, but after thinking about it for 
>> a while I think that if the next page of the stack is not mapped, then the 
>> code may cause a fault regardless. Of course there is the possibility that 
>> the code uses exactly the amount of stack pages allocated and no more, in 
>> which case the extra read will cause a segmentation fault.
>>
>> Here's a more recent diff I've been working on that passes most test 
>> cases initially:
>>
>> diff --git a/arch/x64/arch-switch.hh b/arch/x64/arch-switch.hh
>>
>> @@ -148,6 +148,11 @@ void thread::init_stack()
>>
>>      _state.exception_stack = _arch.exception_stack + 
>> sizeof(_arch.exception_stack);
>>
>> +
>>
>> +    // Since thread stacks are lazily allocated and the thread initially 
>> starts
>>
>> +    // running with preemption disabled, we need to pre-fault the first 
>> stack page.
>>
>> +    volatile char r = *((char*)(stacktop-1));
>>
>> +    (void) r; // trick the compiler into thinking that r is used
>>
>>  }
>>
>>  
>>
>>  void thread::setup_tcb()
>>
>> diff --git a/arch/x64/arch.hh b/arch/x64/arch.hh
>>
>> @@ -20,8 +20,15 @@ namespace arch {
>>
>>  
>>
>> +inline void read_next_stack_page() {
>>
>> +    char i;
>>
>> +    asm volatile("movb -4096(%%rsp), %0" : "=r"(i));
>>
>> +}
>>
>> +
>>
>>  inline void irq_disable()
>>
>>  {
>>
>> +    read_next_stack_page();
>>
>>      processor::cli();
>>
>>  }
>>
>>  
>>
>> @@ -30,6 +37,7 @@ inline void irq_disable_notrace();
>>
>>  
>>
>>  inline void irq_disable_notrace()
>>
>>  {
>>
>> +    read_next_stack_page();
>>
>>      processor::cli_notrace();
>>
>>  }
>>
>>
>>  enum {
>>
>> diff --git a/include/osv/sched.hh b/include/osv/sched.hh=
>>
>> @@ -1000,6 +1000,7 @@ inline void preempt()
>>
>>  
>>
>>  inline void preempt_disable()
>>
>>  {
>>
>> +    arch::read_next_stack_page();
>>
>>      ++preempt_counter;
>>
>>      barrier();
>>
>>  }
>>
>> diff --git a/include/osv/mmu-defs.hh b/include/osv/mmu-defs.hh
>>
>> @@ -84,6 +84,7 @@ enum {
>>
>>      mmap_small       = 1ul << 5,
>>
>>      mmap_jvm_balloon = 1ul << 6,
>>
>>      mmap_file        = 1ul << 7,
>>
>> +    mmap_stack       = 1ul << 8,
>>
>>  };
>>
>> diff --git a/libc/mman.cc b/libc/mman.cc
>>
>> @@ -44,7 +44,7 @@ unsigned libc_flags_to_mmap(int flags)
>>
>>          // and did us the courtesy of telling this to ue (via MAP_STACK),
>>
>>          // let's return the courtesy by returning pre-faulted memory.
>>
>>          // FIXME: If issue #143 is fixed, this workaround should be 
>> removed.
>>
>> -        mmap_flags |= mmu::mmap_populate;
>>
>> +        mmap_flags |= mmu::mmap_stack;
>>
>>      }
>>
>>      if (flags & MAP_SHARED) {
>>
>>          mmap_flags |= mmu::mmap_shared;
>>
>> diff --git a/libc/pthread.cc b/libc/pthread.cc=
>>
>> @@ -139,7 +139,7 @@ namespace pthread_private {
>>
>>              return {attr.stack_begin, attr.stack_size};
>>
>>          }
>>
>>          size_t size = attr.stack_size;
>>
>> -        void *addr = mmu::map_anon(nullptr, size, mmu::mmap_populate, 
>> mmu::perm_rw);
>>
>> +        void *addr = mmu::map_anon(nullptr, size, mmu::mmap_stack, 
>> mmu::perm_rw);
>>
>>          mmu::mprotect(addr, attr.guard_size, 0);
>>
>>          sched::thread::stack_info si{addr, size};
>>          si.deleter = free_stack;  
>>
>> However there's some race condition still, as some tests fail after 
>> running multiple times.
>>
>> tst-yield.so occasionally fails with the following logs: 
>>
>> OSv v0.54.0-50-g2b62eac1
>>
>> eth0: 192.168.122.15
>>
>> Booted up in 0.00 ms
>>
>> Cmdline: /tests/tst-yield.so
>>
>> 562665310100 < 601295415960, now=562665362700
>>
>>
>> [backtrace]
>>
>> 0x00000000403f57d8 <sched::timer_list::fired()+1208>
>>
>> QEMU: Terminated
>>
>> Test tst-yield.so FAILED
>>
>>
> This crash looks similar to the one reported here - 
> https://github.com/cloudius-systems/osv/issues/382#issuecomment-363589062. 
> To best of our understanding, this issue seems to happen when hpet clock is 
> used, which seems to be your case (I see message "booted up in 0.00 ms") 
> which what happens if there is no paravirtual clock enabled. I bet you that 
> when you run with the kvm on the issue disappears.  
>
>
> tst-vfs.so occasionally fails with the following logs:
>>
>> OSv v0.54.0-50-g2b62eac1
>>
>> eth0: 192.168.122.15
>>
>> Booted up in 0.00 ms
>>
>> Cmdline: /tests/tst-vfs.so
>>
>> Running 3 test cases...
>>
>> Running dentry hierarchy tests
>>
>> dentry hierarchy tests succeeded
>>
>> Running concurrent file operation tests
>>
>> test1, with 10 threads
>>
>> /git-repos/osv/tests/tst-vfs.cc(86): info:  passed
>>
>>
>>
>>
>>
>> terminate called after throwing an instance of 'std::length_error'
>>
>>   what():  basic_string::_S_create
>>
>> Aborted
>>
>>
>> [backtrace]
>>
>> 0x00000000404a2f6d <???+1078603629>
>>
>> 0x00000000408944ff <???+1082737919>
>>
>> Test tst-vfs.so FAILED
>>
>> This one is more interesting as I have never seen it before. Can you try 
> to reproduce it and connect with gdb to grab a better stack trace.
>
> My initial thought was that I may need to read more than one stack page 
>> ahead, however implementing that idea didn't resolve the failures. Neither 
>> of the errors above are page faults, so I'm not sure where the problem 
>> could be.
>>
>> Let me know if you have any idea what's going on,
>> Matthew
>>
>>
>>

-- 
You received this message because you are subscribed to the Google Groups "OSv 
Development" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to osv-dev+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/osv-dev/eb4789cb-64bb-4fc1-993e-ca36f6d92a5e%40googlegroups.com.

Re: [osv-dev] Lazily allocating thread stacks WIP

Reply via email to