On Sat, Jan 6, 2018 at 7:22 AM, Waldek Kozaczuk <jwkozac...@gmail.com>
wrote:

> I applied Hawxchen's patch and at first the httpserver app would crash
> with the following type of error. It was quite consistent, though sometimes
> it would show a golang stack trace and sometimes not. Stack trace:
> ...
>

> #16 pagecache::access_scanner::run (this=0xc2fca0
> <pagecache::s_access_scanner>)
>     at core/pagecache.cc:614
> #17 0x00000000003e6377 in sched::thread_main_c (t=0xffff8000011e8040)
>     at arch/x64/arch-switch.hh:189
> #18 0x000000000038ab23 in thread_main () at arch/x64/entry.S:113
>
>
> At first I could not figure out why it was crashing so I added all kinds
> of printouts to golang runtime to understand what was going on. Eventually
> I noticed that this specific line:
>
> --> nextFreeIndex *7021803615734293551* 4353039
>
> would contain a random huge number that seemed weird, so I made a wild
> guess that maybe some syscall executions need a bigger stack than a
> single page - 4096 bytes.
>

Yes, 4096 bytes is probably not enough. In commit
efee5a91bb089315e61a9270d186834f1c088923 we increased the exception stack
from 4K to 16K because we noticed that some of the things we do during
exceptions, namely ZFS work, needs more than 4K of stack. We definitely do
even more during system calls (some of them also do ZFS work). You should
probably use at least 16K too.

When I first read your report, something else bothered me. I thought it
*looks like* one of the unrelated built-in threads of OSv,
pagecache::access_scanner::run(), also ran on the small "system call"
stack, and that should not have happened - when we switch to another
thread, we also switch to its stack. However, on second thought, I
realized that your crash doesn't really prove that access_scanner::run()
itself ran on the tiny stack. What is quite possible is that something else
happened: run()'s stack was adjacent to the system call stack, and when we
overflowed the system call stack, we wrote onto the run() thread's stack, so
when we context-switched back to it, it had garbage on its stack and crashed.

So good catch.



>
>     if(is_app()) {
>         auto& stack = _attr._syscall_stack;
>
>         stack.size = PAGE_SIZE;
>         stack.begin = malloc(stack.size);
>         stack.deleter = stack.default_deleter;
>         _tcb->syscall_stack_addr = stack.begin + stack.size;
>     }
>
>
> So I simply doubled the size to 2 pages and it did the trick. Now I was
> able to fire 50, 100, 200 concurrent requests and the app would function
> just fine. I verified it works with go versions 1.8.3 and 1.9.2. This seems
> to indicate that some syscalls made by go require a bigger stack than 4096
> bytes. Is my thinking correct? Also, is there a way to introduce some kind
> of canary-type logic on the OSv side to detect this kind of "overflow" of
> the syscall stack?
>

This is excellent.

We could indeed introduce such a canary, just like you said - write some
fixed value (e.g., 0xdeadbeef) at the end of the system call stack, and when
returning from the system-call stack switch, check whether it still has
0xdeadbeef at the end.
This, however, is not guaranteed to always help - it's possible that the
overflow during the system call handling will already overwrite something
important enough to crash us before we ever get to check the canary.

> I was very happy with the results (I actually performed 20-30 tests using
> ab) until just as I was writing this email I got 1-2 non-repeatable crashes
> with syscall stack 2 pages large on startup that looks like so:
>
> #0  0x0000000000391152 in processor::cli_hlt () at
> arch/x64/processor.hh:247
> #1  arch::halt_no_interrupts () at arch/x64/arch.hh:48
> #2  osv::halt () at arch/x64/power.cc:24
> #3  0x0000000000225897 in abort (fmt=fmt@entry=0x719f80 "general
> protection fault\n") at runtime.cc:132
> #4  0x000000000038bf1a in general_protection (ef=0xffff8000038f6088) at
> arch/x64/exceptions.cc:322
> #5  <signal handler called>
> #6  processor::ldmxcsr (addr=4359593) at arch/x64/processor.hh:423
>
> #7  sched::thread::switch_to (this=0xffff8000038f1040) at
> arch/x64/arch-switch.hh:76
> #8  0x6f732f65726f6300 in ?? ()
> #9  0x006e6e6f6378616d in ?? ()
> #10 0xffff8000038c80b8 in ?? ()
> #11 0x00000000419e222d in ?? ()
> #12 0x00000000038f8b00 in ?? ()
> #13 0xffff8000014e6190 in ?? ()
> #14 0x00000000003d388c in std_malloc (size=18446620928467668998,
>     alignment=<error reading variable: Cannot access memory at address
> 0x74656e2f737972f7>)
>     at core/mempool.cc:1550
> Backtrace stopped: previous frame inner to this frame (corrupt stack?)
>
> This makes me think that maybe even 8192 bytes is not big enough. Maybe I
> should experiment with bigger stacks.
>

I think it is likely indeed related to stack overflow.

The crash is in the processor's LDMXCSR instruction, which got invalid
bits it shouldn't have: those bits were saved earlier, in
thread::switch_to(), on the thread's stack.
So if the thread's stack was overwritten, we can get exactly this problem.
The fact that GDB said "corrupt stack" at the end also suggests the stack
may have been corrupted...

Try a 16K stack, although at this point, it's just a guess (and I have
no proof that 16K would always be enough).

Anyway, excellent work hunting these bugs! It's great that we can finally
say that "normal" Go programs work on OSv, not just silly "hello world"
ones.

-- 
You received this message because you are subscribed to the Google Groups "OSv 
Development" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to osv-dev+unsubscr...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.