On Tue, Jan 9, 2018 at 7:39 AM, Waldek Kozaczuk <jwkozac...@gmail.com>
wrote:

>
> I bumped the syscall stack size to 16K and indeed I stopped seeing the
> crashes that look like the two I gave the examples of in my previous email.
> In general the httpserver app behaves pretty well I and I was able to run
> many tests hitting with 100-200 concurrency levels.
>

Excellent. It's a bit worrying that we have no idea if this 16K would
really be enough for every case, I guess it would be nice to add a canary
like you suggested to at least be able to discover the error before
switching to unrelated thread.


> I also embarked on getting to work even more non-trivial app that uses the
> embedded keystore boltdb implemented in Golang -
> https://github.com/boltdb/bolt. I wrote a simple app that uses bolt to
> create a DB, add a bucket, add key/value, read it and close DB. It seems to
> be fundamentally working apart from couple of issues described below.
>
> Here are two issues that happen randomly and I think there is something in
> common between. So first of Golang uses pselect6 to implement portable
> usleep().
>
> TEXT runtime·usleep(SB),NOSPLIT,$16
>         MOVL    $0, DX
>         MOVL    usec+0(FP), AX
>         MOVL    $1000000, CX
>         DIVL    CX
>         MOVQ    AX, 0(SP)
>         MOVL    $1000, AX       // usec to nsec
>         MULL    DX
>         MOVQ    AX, 8(SP)
>
>         // pselect6(0, 0, 0, 0, &ts, 0)
>         MOVL    $0, DI
>         MOVL    $0, SI
>         MOVL    $0, DX
>         MOVL    $0, R10
>         MOVQ    SP, R8
>         MOVL    $0, R9
>         MOVL    $270, AX
>         SYSCALL
>         RET
>
>
> I had to add it to OSv in linux.cc and it seems to be working except
> sometimes the last 6th argument (sig) receives random incorrect value
> different from 0. The assembly code above shows that Golang always passes 0
> and I could not find any other place in Golang where pselect is used. It is
> also weird that the value of this 6th argument is random across OSv runs
> but the same for all pselect (usleep) calls like in this debug example:
>
> pselect6(): unimplemented with not-null sigmask
> pselect6(): sig [a8291a2200308313]
> pselect6(): unimplemented with not-null sigmask
> pselect6(): sig [a8291a2200308313]
> pselect6(): unimplemented with not-null sigmask
> pselect6(): sig [a8291a2200308313]
> pselect6(): unimplemented with not-null sigmask
> pselect6(): sig [a8291a2200308313]
> pselect6(): unimplemented with not-null sigmask
>
> The second issue is similar in the sense that it involves another syscall
> with 6 arguments and is about last argument as well as pselect above (but
> it all might be a coincidence). It only happens with boltdb example where
> mmap call always passes offset argument as 0 but OSv gets a different value
> sometimes and causes EINVAL. The mmap() syscall is implemented like so in
> OSv (at least in my branch):
>
> // Only void* return value of mmap is type casted, as syscall returns long.
> long long_mmap(void *addr, size_t length, int prot, int flags, int fd,
> off_t offset) {
>     return (long) mmap(addr, length, prot, flags, fd, offset);
> }
> //long long_mmap(void *addr, size_t length, unsigned long prot, unsigned
> long flags, unsigned long fd, off_t offset) {
> //    return (long) mmap(addr, length, (int)prot, (int)flags, (int)fd,
> offset);
> //}
> #define __NR_long_mmap __NR_mmap
> ...
> ...
>    SYSCALL6(long_mmap, void *, size_t, int, int, int, off_t);
>     //SYSCALL6(long_mmap, void *, size_t, unsigned long, unsigned long,
> unsigned long, off_t);
>
> I noticed that three 4th, 5th and 6th arguments in original OSv
> implementation use int where where Linux uses long. I changed it (see my
> commented out lines) but it did not help.
>

In OSv (libc/mman.cc) I see:
void *mmap(void *addr, size_t length, int prot, int flags, int fd, off_t
offset)

In Linux (/usr/include/sys/mman.h) I see
extern void *mmap (void *__addr, size_t __len, int __prot, int __flags, int
__fd, __off_t __offset) __THROW;

How is this different?


>
> So I am wondering if there is some problem with syscall implementation on
> our side that causes the last argument to be corrupt in both cases. Maybe
> things are misaligned?
>
> Here is a fragment of syscall implementation with Hawxchen changes (#if 0,
> etc):
>
> .align 16
> .global syscall_entry
> syscall_entry:
>     .type syscall_entry, @function
>     .cfi_startproc simple
>     .cfi_undefined rcx # was overwritten with rip by the syscall
> instruction
>     .cfi_undefined r11 # was overwritten with rflags by the syscall
> instruction
>     .cfi_register rip, rcx # rcx took previous rip value
>     .cfi_register rflags, r11 # r11 took previous rflags value
>     # There is no ring transition and rflags are left unchanged.
>
> #if 0
>     # Skip the "red zone" allowed by the AMD64 ABI (the caller used a
>     # SYSCALL instruction and doesn't know he called a function):
>     subq $128, %rsp
>
>     # Align the stack to 16 bytes. We align it now because of limitations
> of
>     # the CFI language, but need to ensure it is still aligned before we
> call
>     # syscall_wrapper(), so must ensure that the number of pushes below are
>     # even.
>     # An additional complication is that we need to restore %rsp later
> without
>     # knowing how it was previously aligned. In the following trick,
> without
>     # using an additional register, the two pushes leave the stack with the
>     # same alignment it had originally, and a copy of the original %rsp at
>     # (%rsp) and 8(%rsp). The andq then aligns the stack - if it was
> already
>     # 16 byte aligned nothing changes, if it was 8 byte aligned then it
>     # subtracts 8 from %rsp, meaning that the original %rsp is now at
> 8(%rsp)
>     # and 16(%rsp). In both cases we can restore it below from 8(%rsp).
>     pushq %rsp
>     pushq (%rsp)
>     andq $-16, %rsp
> #else
>     xchg %rsp, %fs:16
> #endif
>
>     .cfi_def_cfa %rsp, 0
>
>     # We need to save and restore the caller's %rbp anyway, so let's also
>     # set it up properly for old-style frame-pointer backtracing to work
>     # (e.g., backtrace_safe()). Also need to push the return address before
>     # the rbp to get a normal frame. Our return address is in rcx.
>     pushq_cfi %rcx
>     .cfi_rel_offset %rip, 0
>     pushq_cfi %rbp
>
>     # Push on the stack the caller's %rsp, before any of our modifications.
>     # We do this just so we can refer to it with CFI and help gdb's DWARF
>     # stack unwinding. This saving not otherwise needed for correct
> operation
>     # (we anyway restore it below by undoing all our modifications).
> #if 0
>     movq 24(%rsp), %rbp
>     addq $128, %rbp
>     pushq %rbp
> #else
>     pushq %fs:16
> #endif
>     .cfi_adjust_cfa_offset 8
>     .cfi_rel_offset %rsp, 0
>
>     # Set rbp (frame pointer, for old-style backtracing) to the %rsp before
>     # the above extra push
>     leaq 8(%rsp), %rbp
>
>     # From http://stackoverflow.com/questions/2535989/what-are-
> the-calling-conventions-for-unix-linux-system-calls-on-x86-64:
>     # "User-level applications use as integer registers for passing the
> sequence %rdi, %rsi, %rdx, %rcx, %r8 and %r9. The kernel interface uses
> %rdi, %rsi, %rdx, %r10, %r8 and %r9"
>
>     pushq_cfi %rbx
>     pushq_cfi %rdx
>     pushq_cfi %rsi
>     pushq_cfi %rdi
>     pushq_cfi %r8
>     pushq_cfi %r9
>     pushq_cfi %r10
>     pushq_cfi %r11 # contains rflags before syscall instruction
>     .cfi_rel_offset %rflags, 0
>     pushq_cfi %r12
>     pushq_cfi %r13
>     pushq_cfi %r14
>     pushq_cfi %r15
>
>     # The kernel interface use r10 as fourth argument while the user
> interface use rcx
>     # so overwrite rcx with r10
>     movq %r10, %rcx
>
>     # prepare function call parameter: r9 is on the stack since it's the
> seventh param
>     # because we shift existing params by one to make room for syscall
> number
>

I know where the bug is.

The code you copied here is correct:
Because syscall_wrapper() has a first argument which is the syscall
argument, we need to shift the six registers holding the arguments - rdi,
rsi, rdx, rcx, r8 and r9
by an extra one (and put the syscall number as the new rdi) and this is
what we do below.
The 6th argument to the system call, which was in r9, now becomes the 7th
argument of the function, and the x86 ABI says that this and further
arguments need to
be pushed onto the stack.
So far, so good.

BUT, what does syscall_wrapper() which we call with all these parameters do?

extern "C" long syscall_wrapper(long number, ...)
{
    int errno_backup = errno;
    // syscall and function return value are in rax
    *auto ret = syscall(number);*
    int result = -errno;
    errno = errno_backup;
    if (ret < 0 && ret >= -4096) {
        return result;
    }
    return ret;
}

The emboldened line is unfortunately wrong. It only passes one parameter to
syscall(), but assumes that all the parameter register are unchanged so
syscall() will actually be able to find 6 parameters. But it will *NOT*
correctly find the 7th (i.e., the 6th parameter of the system call),
because after making another function call, it will be in the *wrong* place
in the stack.

One possible but difficult solution is to not have the syscall_wrapper()
function at all. It's not needed. Its simple work of replacing "errno" by a
direct return (like the SYSCALL instruction does) can be done in assembly
language right here in syscall_entry, and we then "callq syscall" directly.
You need to save the old value of errno somewhere (like syscall_wrapper
did), you can use a callee-saved register (which is not overwritten by a
function call like %rbx) or save it on the stack, but then make sure to
push 16 bytes to keep the 16-byte alignment. What worries me about this
solution, however, is how easy it would be to access the thread-local errno
from assembly. I never tried and I think it won't be easy.

Another alternative, simpler (no assembly language) alternative is to not
touch the assembly, just change the C function syscall_wrapper() as follows:

// In x86-64, a SYSCALL instruction has exactly 6 parameters, because this
is the number of registers
// alloted for passing them (additional parameters *cannot* be passed on
the stack). So we can get
// 7 arguments to this function (syscall number plus its 6 parameters).
Because in the x86-64 ABI the
// seventh argument is on the stack, we must pass the arguments explicitly
to the syscall() function
// and can't just call it without any arguments and hope everything will be
passed on
extern "C" long syscall_wrapper(long number, long p1, long p2, long p3,
long p4, long p5, long p6)
{
    int errno_backup = errno;
    // syscall and function return value are in rax
    auto ret = syscall(number, p1, p2, p3, p4, p5, p6);
    int result = -errno;
    errno = errno_backup;
    if (ret < 0 && ret >= -4096) {
        return result;
    }
    return ret;
}

Can you please try the second option? If it works, please submit a patch.
Thanks!




>     pushq_cfi %r9
>     movq %r8, %r9
>     movq %rcx, %r8
>     movq %rdx, %rcx
>     movq %rsi, %rdx
>     movq %rdi, %rsi
>    # syscall number from rax as first argument
>     movq %rax, %rdi
>
>     # Because we pushed an even number of 8 bytes after aligning the stack,
>     # it is still 16-byte aligned and we don't need to adjust it here.
>
>     # FPU save/restore is done inside the wrapper
>     callq syscall_wrapper
>
>     popq_cfi %r9
>     # in Linux user and kernel return value are in rax so we have nothing
> to do for return values
>
>     popq_cfi %r15
>     popq_cfi %r14
>     popq_cfi %r13
>     popq_cfi %r12
>     popq_cfi %r11
>     popq_cfi %r10
>     popq_cfi %r9
>     popq_cfi %r8
>     popq_cfi %rdi
>     popq_cfi %rsi
>     popq_cfi %rdx
>     popq_cfi %rbx
>
>     # skip the caller's %rsp we pushed just for CFI's sake
>     addq $8, %rsp
>     .cfi_adjust_cfa_offset -8
>
>     popq_cfi %rbp
>     popq_cfi %rcx
>    # restore rflags
>     # push the rflag state syscall saved in r11 to the stack
>     pushq %r11
>     # pop the stack value in flag register
>     popfq
>
> #if 0
>     movq 8(%rsp), %rsp # undo alignment (as explained above)
>
>     #If we use the same stack between applicatoin and system call
>     #undo red-zone skip without altering restored flags
>     lea 128(%rsp), %rsp
> #else
>     xchg %rsp, %fs:16
> #endif
>
>     # jump to rcx where the syscall instruction put rip
>     # (sysret would leave rxc cloberred so we have nothing to do to
> restore it)
>     jmpq *%rcx
>    .cfi_endproc
> .size syscall_entry, .-syscall_entry
>
>
> Hawxchen mentioned some issue with wrong CFI interpreted by GDB. Could
> this be related?
>

I can take a look at the CFI stuff later, but it is unrelated (the CFI is
needed for proper debugging, we if we do "backtrace" in a crash inside a
system call
to see what's running the system call, the syscall() function, and also the
code above it who called the SYSCALL instruction - and not junk.


>
> When testing boltdb I have also come across an issue with epoll and Golang
> 1.9.2 but this time it is not random. I will be sending seperate email.
>

Thanks. Good work!

Nadav.


> Waldek
>
> On Sunday, January 7, 2018 at 6:32:30 AM UTC-5, Nadav Har'El wrote:
>
>>
>> On Sat, Jan 6, 2018 at 7:22 AM, Waldek Kozaczuk <jwkoz...@gmail.com>
>> wrote:
>>
>>> I applied Hawxchen's patch and at first the httpserver app would crash
>>> with following type of error. It would be quite consistent and sometimes it
>>> would show golang stack trace but sometimes not like below.stack trace:
>>> ...
>>>
>>
>>> #16 pagecache::access_scanner::run (this=0xc2fca0
>>> <pagecache::s_access_scanner>)
>>>     at core/pagecache.cc:614
>>> #17 0x00000000003e6377 in sched::thread_main_c (t=0xffff8000011e8040)
>>>     at arch/x64/arch-switch.hh:189
>>> #18 0x000000000038ab23 in thread_main () at arch/x64/entry.S:113
>>>
>>>
>>> At first I could not figure out why it was crashing so I added all kinds
>>> of printouts to golang runtime to understand what was going on. Eventually
>>> I noticed that this specific line:
>>>
>>> --> nextFreeIndex *7021803615734293551* 4353039
>>>
>>> would contain random huge number that did seem weird and given that I
>>> made a wild guess that maybe some syscall executions need bigger stack than
>>> single page - 4096 bytes.
>>>
>>
>> Yes, 4096 bytes is probably not enough. In commit
>> efee5a91bb089315e61a9270d186834f1c088923 we increased the exception
>> stack from 4K to 16K because we noticed that some of the things we do
>> during exceptions, namely ZFS work, needs more than 4K on the stack.
>> Definitely we do even more during system calls (some of them also do ZFS
>> work). You should probably use at least 16K too.
>>
>> When I first read your report, something else bothered me. I thought it
>> *looks like* one of the unrelated built-in threads of OSv,
>> pagecache::access_scanner::run(), also ran on the small "system call"
>> stack, and that should not have happened - when we switch to another
>> thread, we also switch to its stack. However, on second thinking, I
>> realized that your crash doesn't really prove that access_scanner::run()
>> itself ran on the tiny stack. What is quite possible is that something else
>> happened: run()'s stack was adjacent to the system call stack, and when we
>> overflowed the system call stack, we wrong on the run() thread's stack, and
>> when we context-switched back to it, it had crap on its stack and crashed.
>>
>> So good catch.
>>
>>
>>
>>>
>>>     if(is_app()) {
>>>         auto& stack = _attr._syscall_stack;
>>>
>>>         stack.size = PAGE_SIZE;
>>>         stack.begin = malloc(stack.size);
>>>         stack.deleter = stack.default_deleter;
>>>         _tcb->syscall_stack_addr = stack.begin + stack.size;
>>>     }
>>>
>>>
>>> So I simply double the size to 2 pages and it did the trick. Now I was
>>> able fire 50, 100, 200 concurrent requests and the app would function just
>>> fine. I verified it work with 1.8.3 and 1.9.2 go version. This seems to
>>> indicate that some syscalls made by go require bigger stack that 4096. Is
>>> my thinking correct? Also is there a way to introduce some kind of canary
>>> type of logic on OSv side to detect this kind of "overflow" of the syscall
>>> stack?
>>>
>>
>> This is excellent.
>>
>> We could indeed introduce such a canary, just like you said - write at
>> the end of the system call stack some fixed value (e.g., 0xdeadbeef), and
>> when returning from the system-call stack switch, check if it still has
>> 0xdeadbeef at the end.
>> This, however, is not guaranteed to always help - it's possible that the
>> overflow during the system call handling will already overwrite something
>> important enough to
>>
>>
>>> I was very happy with the results (I actually performed 20-30 tests
>>> using ab) until just as I was writing this email I got 1-2 non-repeatable
>>> crashes with syscall stack 2 pages large on startup that looks like so:
>>>
>>> #0  0x0000000000391152 in processor::cli_hlt () at
>>> arch/x64/processor.hh:247
>>> #1  arch::halt_no_interrupts () at arch/x64/arch.hh:48
>>> #2  osv::halt () at arch/x64/power.cc:24
>>> #3  0x0000000000225897 in abort (fmt=fmt@entry=0x719f80 "general
>>> protection fault\n") at runtime.cc:132
>>> #4  0x000000000038bf1a in general_protection (ef=0xffff8000038f6088) at
>>> arch/x64/exceptions.cc:322
>>> #5  <signal handler called>
>>> #6  processor::ldmxcsr (addr=4359593) at arch/x64/processor.hh:423
>>>
>> #7  sched::thread::switch_to (this=0xffff8000038f1040) at
>>> arch/x64/arch-switch.hh:76
>>> #8  0x6f732f65726f6300 in ?? ()
>>> #9  0x006e6e6f6378616d in ?? ()
>>> #10 0xffff8000038c80b8 in ?? ()
>>> #11 0x00000000419e222d in ?? ()
>>> #12 0x00000000038f8b00 in ?? ()
>>> #13 0xffff8000014e6190 in ?? ()
>>> #14 0x00000000003d388c in std_malloc (size=18446620928467668998,
>>>     alignment=<error reading variable: Cannot access memory at address
>>> 0x74656e2f737972f7>)
>>>     at core/mempool.cc:1550
>>> Backtrace stopped: previous frame inner to this frame (corrupt stack?)
>>>
>>> This makes me think that maybe even 8096 is not big enough. Maybe I
>>> should experiment with bigger stacks.
>>>
>>
>> I think it is likely indeed be related to stack overflow.
>>
>> The call which crashed is the processor's LDMXCSR instruction, which got
>> invalid bits, which it shouldn't have: those bits were saved earlier, in
>> thread::switch_to(), on the thread's stack.
>> So if the thread's stack was overwritten, we can have this problem.
>> The fact that GDB said "corrupt stack" at the end also suggests the stack
>> may have been corrupted...
>>
>> Try 16K stack, although at this point, it's just than a guess (and I have
>> no proof that 16K would always be enough).
>>
>> Anyway, excellent work hunting these bugs! It's great that we can finally
>> say that "normal" Go programs work on OSv, not just silly "hello world"
>> ones.
>>
>>
>>
>> --
> You received this message because you are subscribed to the Google Groups
> "OSv Development" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to osv-dev+unsubscr...@googlegroups.com.
> For more options, visit https://groups.google.com/d/optout.
>

-- 
You received this message because you are subscribed to the Google Groups "OSv 
Development" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to osv-dev+unsubscr...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Reply via email to