Your observation was correct. It worked. At least I have not seen same 
errors in any of my tests after the change. Thanks !!!

Please see my other responses in-lined.

In my last email I also forgot to mention that when I was testing boltdb 
app I came across the issue in mmap(). Boltlb mmaps the file 
with MADV_RANDOM advice which is not supported by OSv implementation of 
mmap. I changed following code:

unsigned libc_madvise_to_advise(int advice)
{
    if (advice == MADV_DONTNEED) {
        return mmu::advise_dontneed;
    } else if (advice == MADV_NOHUGEPAGE) {
        return mmu::advise_nohugepage;
    }
    return 0;
}

to

unsigned libc_madvise_to_advise(int advice)
{
    if (advice == MADV_DONTNEED) {
        return mmu::advise_dontneed;
    } else if (advice == MADV_NOHUGEPAGE || advice == MADV_RANDOM) {
        return mmu::advise_nohugepage;
    }
    return 0;
}

It made my example work but I am not sure how much of a hack it is and how 
difficult it would be to support MADV_RANDOM properly.

Waldek

On Tuesday, January 9, 2018 at 5:15:10 AM UTC-5, Nadav Har'El wrote:
>
>
> On Tue, Jan 9, 2018 at 7:39 AM, Waldek Kozaczuk <jwkoz...@gmail.com 
> <javascript:>> wrote:
>
>>
>> I bumped the syscall stack size to 16K and indeed I stopped seeing the 
>> crashes that look like the two I gave the examples of in my previous email. 
>> In general the httpserver app behaves pretty well I and I was able to run 
>> many tests hitting with 100-200 concurrency levels.
>>
>
> Excellent. It's a bit worrying that we have no idea if this 16K would 
> really be enough for every case, I guess it would be nice to add a canary 
> like you suggested to at least be able to discover the error before 
> switching to unrelated thread.
>
> I have no idea how to implement such a mechanism. Can you point me to an 
example? 

I was also wondering that existing implementation of syscall stack makes 
every app thread allocate it upon creation eagerly whether syscall are 
going to be made or not. For example java apps (which I believe use libc 
functions) would be penalized by memory usage. I was wondering how 
difficult it would be to make allocation of syscall stack lazily only when 
first syscall is made. More specifically instead of allocating the stack in 
thread::setup_tcb() it would be done by C ++ function called from assembly 
in entry.S. Obviously it is an optimization which we can implement later.

>
>> I also embarked on getting to work even more non-trivial app that uses 
>> the embedded keystore boltdb implemented in Golang - 
>> https://github.com/boltdb/bolt. I wrote a simple app that uses bolt to 
>> create a DB, add a bucket, add key/value, read it and close DB. It seems to 
>> be fundamentally working apart from couple of issues described below. 
>>
>> Here are two issues that happen randomly and I think there is something 
>> in common between. So first of Golang uses pselect6 to implement portable 
>> usleep(). 
>>
>> TEXT runtime·usleep(SB),NOSPLIT,$16
>>         MOVL    $0, DX
>>         MOVL    usec+0(FP), AX
>>         MOVL    $1000000, CX
>>         DIVL    CX
>>         MOVQ    AX, 0(SP)
>>         MOVL    $1000, AX       // usec to nsec
>>         MULL    DX
>>         MOVQ    AX, 8(SP)
>>
>>         // pselect6(0, 0, 0, 0, &ts, 0)
>>         MOVL    $0, DI
>>         MOVL    $0, SI
>>         MOVL    $0, DX
>>         MOVL    $0, R10
>>         MOVQ    SP, R8
>>         MOVL    $0, R9
>>         MOVL    $270, AX
>>         SYSCALL
>>         RET
>>
>>
>> I had to add it to OSv in linux.cc and it seems to be working except 
>> sometimes the last 6th argument (sig) receives random incorrect value 
>> different from 0. The assembly code above shows that Golang always passes 0 
>> and I could not find any other place in Golang where pselect is used. It is 
>> also weird that the value of this 6th argument is random across OSv runs 
>> but the same for all pselect (usleep) calls like in this debug example:
>>
>> pselect6(): unimplemented with not-null sigmask
>> pselect6(): sig [a8291a2200308313]
>> pselect6(): unimplemented with not-null sigmask
>> pselect6(): sig [a8291a2200308313]
>> pselect6(): unimplemented with not-null sigmask
>> pselect6(): sig [a8291a2200308313]
>> pselect6(): unimplemented with not-null sigmask
>> pselect6(): sig [a8291a2200308313]
>> pselect6(): unimplemented with not-null sigmask
>>
>> The second issue is similar in the sense that it involves another syscall 
>> with 6 arguments and is about last argument as well as pselect above (but 
>> it all might be a coincidence). It only happens with boltdb example where 
>> mmap call always passes offset argument as 0 but OSv gets a different value 
>> sometimes and causes EINVAL. The mmap() syscall is implemented like so in 
>> OSv (at least in my branch):
>>
>> // Only void* return value of mmap is type casted, as syscall returns 
>> long.
>> long long_mmap(void *addr, size_t length, int prot, int flags, int fd, 
>> off_t offset) {
>>     return (long) mmap(addr, length, prot, flags, fd, offset);
>> }
>> //long long_mmap(void *addr, size_t length, unsigned long prot, unsigned 
>> long flags, unsigned long fd, off_t offset) {
>> //    return (long) mmap(addr, length, (int)prot, (int)flags, (int)fd, 
>> offset);
>> //}
>> #define __NR_long_mmap __NR_mmap
>> ...
>> ...
>>    SYSCALL6(long_mmap, void *, size_t, int, int, int, off_t);
>>     //SYSCALL6(long_mmap, void *, size_t, unsigned long, unsigned long, 
>> unsigned long, off_t);
>>
>> I noticed that three 4th, 5th and 6th arguments in original OSv 
>> implementation use int where where Linux uses long. I changed it (see my 
>> commented out lines) but it did not help. 
>>
>
> In OSv (libc/mman.cc) I see:
> void *mmap(void *addr, size_t length, int prot, int flags, int fd, off_t 
> offset)
>
> In Linux (/usr/include/sys/mman.h) I see
> extern void *mmap (void *__addr, size_t __len, int __prot, int __flags, 
> int __fd, __off_t __offset) __THROW;
>
> How is this different?
>

I think I might have looked at the wrong code. Here is what I compared with:

SYSCALL_DEFINE6(mmap, unsigned long, addr, unsigned long, len,
                unsigned long, prot, unsigned long, flags,
                unsigned long, fd, unsigned long, off)
{
        long error;
        error = -EINVAL;
        if (off & ~PAGE_MASK)
                goto out;

        error = sys_mmap_pgoff(addr, len, prot, flags, fd, off >> PAGE_SHIFT);
out:
        return error;
}


 from 
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/x86/kernel/sys_x86_64.c

>  
>
>>
>> So I am wondering if there is some problem with syscall implementation on 
>> our side that causes the last argument to be corrupt in both cases. Maybe 
>> things are misaligned?
>>
>> Here is a fragment of syscall implementation with Hawxchen changes (#if 
>> 0, etc):
>>
>> .align 16
>> .global syscall_entry
>> syscall_entry:
>>     .type syscall_entry, @function
>>     .cfi_startproc simple
>>     .cfi_undefined rcx # was overwritten with rip by the syscall 
>> instruction
>>     .cfi_undefined r11 # was overwritten with rflags by the syscall 
>> instruction
>>     .cfi_register rip, rcx # rcx took previous rip value
>>     .cfi_register rflags, r11 # r11 took previous rflags value
>>     # There is no ring transition and rflags are left unchanged.
>>
>> #if 0
>>     # Skip the "red zone" allowed by the AMD64 ABI (the caller used a
>>     # SYSCALL instruction and doesn't know he called a function):
>>     subq $128, %rsp
>>
>>     # Align the stack to 16 bytes. We align it now because of limitations 
>> of
>>     # the CFI language, but need to ensure it is still aligned before we 
>> call
>>     # syscall_wrapper(), so must ensure that the number of pushes below 
>> are
>>     # even.
>>     # An additional complication is that we need to restore %rsp later 
>> without
>>     # knowing how it was previously aligned. In the following trick, 
>> without
>>     # using an additional register, the two pushes leave the stack with 
>> the
>>     # same alignment it had originally, and a copy of the original %rsp at
>>     # (%rsp) and 8(%rsp). The andq then aligns the stack - if it was 
>> already
>>     # 16 byte aligned nothing changes, if it was 8 byte aligned then it
>>     # subtracts 8 from %rsp, meaning that the original %rsp is now at 
>> 8(%rsp)
>>     # and 16(%rsp). In both cases we can restore it below from 8(%rsp).
>>     pushq %rsp
>>     pushq (%rsp)
>>     andq $-16, %rsp
>> #else
>>     xchg %rsp, %fs:16
>> #endif
>>
>>     .cfi_def_cfa %rsp, 0
>>
>>     # We need to save and restore the caller's %rbp anyway, so let's also
>>     # set it up properly for old-style frame-pointer backtracing to work
>>     # (e.g., backtrace_safe()). Also need to push the return address 
>> before
>>     # the rbp to get a normal frame. Our return address is in rcx.
>>     pushq_cfi %rcx
>>     .cfi_rel_offset %rip, 0
>>     pushq_cfi %rbp
>>
>>     # Push on the stack the caller's %rsp, before any of our 
>> modifications.
>>     # We do this just so we can refer to it with CFI and help gdb's DWARF
>>     # stack unwinding. This saving not otherwise needed for correct 
>> operation
>>     # (we anyway restore it below by undoing all our modifications).
>> #if 0
>>     movq 24(%rsp), %rbp
>>     addq $128, %rbp
>>     pushq %rbp
>> #else
>>     pushq %fs:16
>> #endif
>>     .cfi_adjust_cfa_offset 8
>>     .cfi_rel_offset %rsp, 0
>>
>>     # Set rbp (frame pointer, for old-style backtracing) to the %rsp 
>> before
>>     # the above extra push
>>     leaq 8(%rsp), %rbp
>>
>>     # From 
>> http://stackoverflow.com/questions/2535989/what-are-the-calling-conventions-for-unix-linux-system-calls-on-x86-64
>> :
>>     # "User-level applications use as integer registers for passing the 
>> sequence %rdi, %rsi, %rdx, %rcx, %r8 and %r9. The kernel interface uses 
>> %rdi, %rsi, %rdx, %r10, %r8 and %r9"
>>
>>     pushq_cfi %rbx
>>     pushq_cfi %rdx
>>     pushq_cfi %rsi
>>     pushq_cfi %rdi
>>     pushq_cfi %r8
>>     pushq_cfi %r9
>>     pushq_cfi %r10
>>     pushq_cfi %r11 # contains rflags before syscall instruction
>>     .cfi_rel_offset %rflags, 0
>>     pushq_cfi %r12
>>     pushq_cfi %r13
>>     pushq_cfi %r14
>>     pushq_cfi %r15
>>
>>     # The kernel interface use r10 as fourth argument while the user 
>> interface use rcx
>>     # so overwrite rcx with r10
>>     movq %r10, %rcx
>>
>>     # prepare function call parameter: r9 is on the stack since it's the 
>> seventh param
>>     # because we shift existing params by one to make room for syscall 
>> number
>>
>
> I know where the bug is.
>
> The code you copied here is correct:
> Because syscall_wrapper() has a first argument which is the syscall 
> argument, we need to shift the six registers holding the arguments - rdi, 
> rsi, rdx, rcx, r8 and r9
> by an extra one (and put the syscall number as the new rdi) and this is 
> what we do below.
> The 6th argument to the system call, which was in r9, now becomes the 7th 
> argument of the function, and the x86 ABI says that this and further 
> arguments need to
> be pushed onto the stack.
> So far, so good.
>
> BUT, what does syscall_wrapper() which we call with all these parameters 
> do?
>
> extern "C" long syscall_wrapper(long number, ...)
> {
>     int errno_backup = errno;
>     // syscall and function return value are in rax
>     *auto ret = syscall(number);*
>     int result = -errno;
>     errno = errno_backup;
>     if (ret < 0 && ret >= -4096) {
>         return result;
>     }
>     return ret;
> }
>
> The emboldened line is unfortunately wrong. It only passes one parameter 
> to syscall(), but assumes that all the parameter register are unchanged so 
> syscall() will actually be able to find 6 parameters. But it will *NOT* 
> correctly find the 7th (i.e., the 6th parameter of the system call), 
> because after making another function call, it will be in the *wrong* place 
> in the stack.
>
> One possible but difficult solution is to not have the syscall_wrapper() 
> function at all. It's not needed. Its simple work of replacing "errno" by a 
> direct return (like the SYSCALL instruction does) can be done in assembly 
> language right here in syscall_entry, and we then "callq syscall" directly. 
> You need to save the old value of errno somewhere (like syscall_wrapper 
> did), you can use a callee-saved register (which is not overwritten by a 
> function call like %rbx) or save it on the stack, but then make sure to 
> push 16 bytes to keep the 16-byte alignment. What worries me about this 
> solution, however, is how easy it would be to access the thread-local errno 
> from assembly. I never tried and I think it won't be easy.
>
> Another alternative, simpler (no assembly language) alternative is to not 
> touch the assembly, just change the C function syscall_wrapper() as follows:
>
> // In x86-64, a SYSCALL instruction has exactly 6 parameters, because this 
> is the number of registers
> // alloted for passing them (additional parameters *cannot* be passed on 
> the stack). So we can get
> // 7 arguments to this function (syscall number plus its 6 parameters). 
> Because in the x86-64 ABI the
> // seventh argument is on the stack, we must pass the arguments explicitly 
> to the syscall() function
> // and can't just call it without any arguments and hope everything will 
> be passed on
> extern "C" long syscall_wrapper(long number, long p1, long p2, long p3, 
> long p4, long p5, long p6)
> {
>     int errno_backup = errno;
>     // syscall and function return value are in rax
>     auto ret = syscall(number, p1, p2, p3, p4, p5, p6);
>     int result = -errno;
>     errno = errno_backup;
>     if (ret < 0 && ret >= -4096) {
>         return result;
>     }
>     return ret;
> }
>
> Can you please try the second option? If it works, please submit a patch. 
> Thanks!
>
> As I stated above - it worked which is awesome. I will be sending the 
patch soon.  

>
>  
>
>>     pushq_cfi %r9
>>     movq %r8, %r9
>>     movq %rcx, %r8
>>     movq %rdx, %rcx
>>     movq %rsi, %rdx
>>     movq %rdi, %rsi
>>    # syscall number from rax as first argument
>>     movq %rax, %rdi
>>
>>     # Because we pushed an even number of 8 bytes after aligning the 
>> stack,
>>     # it is still 16-byte aligned and we don't need to adjust it here.
>>
>>     # FPU save/restore is done inside the wrapper
>>     callq syscall_wrapper
>>
>>     popq_cfi %r9
>>     # in Linux user and kernel return value are in rax so we have nothing 
>> to do for return values
>>
>>     popq_cfi %r15
>>     popq_cfi %r14
>>     popq_cfi %r13
>>     popq_cfi %r12
>>     popq_cfi %r11
>>     popq_cfi %r10
>>     popq_cfi %r9
>>     popq_cfi %r8
>>     popq_cfi %rdi
>>     popq_cfi %rsi
>>     popq_cfi %rdx
>>     popq_cfi %rbx
>>
>>     # skip the caller's %rsp we pushed just for CFI's sake
>>     addq $8, %rsp
>>     .cfi_adjust_cfa_offset -8
>>
>>     popq_cfi %rbp
>>     popq_cfi %rcx
>>    # restore rflags
>>     # push the rflag state syscall saved in r11 to the stack
>>     pushq %r11
>>     # pop the stack value in flag register
>>     popfq
>>
>> #if 0
>>     movq 8(%rsp), %rsp # undo alignment (as explained above)
>>
>>     #If we use the same stack between applicatoin and system call
>>     #undo red-zone skip without altering restored flags
>>     lea 128(%rsp), %rsp
>> #else
>>     xchg %rsp, %fs:16
>> #endif
>>
>>     # jump to rcx where the syscall instruction put rip
>>     # (sysret would leave rxc cloberred so we have nothing to do to 
>> restore it)
>>     jmpq *%rcx
>>    .cfi_endproc
>> .size syscall_entry, .-syscall_entry
>>
>>
>> Hawxchen mentioned some issue with wrong CFI interpreted by GDB. Could 
>> this be related?
>>
>
> I can take a look at the CFI stuff later, but it is unrelated (the CFI is 
> needed for proper debugging, we if we do "backtrace" in a crash inside a 
> system call
> to see what's running the system call, the syscall() function, and also 
> the code above it who called the SYSCALL instruction - and not junk.
>  
>
>>
>> When testing boltdb I have also come across an issue with epoll and 
>> Golang 1.9.2 but this time it is not random. I will be sending seperate 
>> email.
>>
>
> Thanks. Good work!
>
> Nadav.
>
>
>> Waldek
>>
>> On Sunday, January 7, 2018 at 6:32:30 AM UTC-5, Nadav Har'El wrote:
>>
>>>
>>> On Sat, Jan 6, 2018 at 7:22 AM, Waldek Kozaczuk <jwkoz...@gmail.com> 
>>> wrote:
>>>
>>>> I applied Hawxchen's patch and at first the httpserver app would crash 
>>>> with following type of error. It would be quite consistent and sometimes 
>>>> it 
>>>> would show golang stack trace but sometimes not like below.stack trace:
>>>> ...
>>>>
>>>
>>>> #16 pagecache::access_scanner::run (this=0xc2fca0 
>>>> <pagecache::s_access_scanner>)
>>>>     at core/pagecache.cc:614
>>>> #17 0x00000000003e6377 in sched::thread_main_c (t=0xffff8000011e8040)
>>>>     at arch/x64/arch-switch.hh:189
>>>> #18 0x000000000038ab23 in thread_main () at arch/x64/entry.S:113
>>>>
>>>>
>>>> At first I could not figure out why it was crashing so I added all 
>>>> kinds of printouts to golang runtime to understand what was going on. 
>>>> Eventually I noticed that this specific line:
>>>>
>>>> --> nextFreeIndex *7021803615734293551* 4353039 
>>>>
>>>> would contain random huge number that did seem weird and given that I 
>>>> made a wild guess that maybe some syscall executions need bigger stack 
>>>> than 
>>>> single page - 4096 bytes.
>>>>
>>>
>>> Yes, 4096 bytes is probably not enough. In commit 
>>> efee5a91bb089315e61a9270d186834f1c088923 we increased the exception stack 
>>> from 4K to 16K because we noticed that some of the things we do during 
>>> exceptions, namely ZFS work, needs more than 4K on the stack. Definitely we 
>>> do even more during system calls (some of them also do ZFS work). You 
>>> should probably use at least 16K too.
>>>
>>> When I first read your report, something else bothered me. I thought it 
>>> *looks like* one of the unrelated built-in threads of OSv, 
>>> pagecache::access_scanner::run(), also ran on the small "system call" 
>>> stack, and that should not have happened - when we switch to another 
>>> thread, we also switch to its stack. However, on second thinking, I 
>>> realized that your crash doesn't really prove that access_scanner::run() 
>>> itself ran on the tiny stack. What is quite possible is that something else 
>>> happened: run()'s stack was adjacent to the system call stack, and when we 
>>> overflowed the system call stack, we wrong on the run() thread's stack, and 
>>> when we context-switched back to it, it had crap on its stack and crashed.
>>>
>>> So good catch.
>>>
>>>  
>>>
>>>>
>>>>     if(is_app()) {
>>>>         auto& stack = _attr._syscall_stack;
>>>>
>>>>         stack.size = PAGE_SIZE;
>>>>         stack.begin = malloc(stack.size);
>>>>         stack.deleter = stack.default_deleter;
>>>>         _tcb->syscall_stack_addr = stack.begin + stack.size;
>>>>     }
>>>>
>>>>
>>>> So I simply double the size to 2 pages and it did the trick. Now I was 
>>>> able fire 50, 100, 200 concurrent requests and the app would function just 
>>>> fine. I verified it work with 1.8.3 and 1.9.2 go version. This seems to 
>>>> indicate that some syscalls made by go require bigger stack that 4096. Is 
>>>> my thinking correct? Also is there a way to introduce some kind of canary 
>>>> type of logic on OSv side to detect this kind of "overflow" of the syscall 
>>>> stack?
>>>>
>>>
>>> This is excellent.
>>>
>>> We could indeed introduce such a canary, just like you said - write at 
>>> the end of the system call stack some fixed value (e.g., 0xdeadbeef), and 
>>> when returning from the system-call stack switch, check if it still has 
>>> 0xdeadbeef at the end.
>>> This, however, is not guaranteed to always help - it's possible that the 
>>> overflow during the system call handling will already overwrite something 
>>> important enough to 
>>>
>>>
>>>> I was very happy with the results (I actually performed 20-30 tests 
>>>> using ab) until just as I was writing this email I got 1-2 non-repeatable 
>>>> crashes with syscall stack 2 pages large on startup that looks like so:
>>>>
>>>> #0  0x0000000000391152 in processor::cli_hlt () at 
>>>> arch/x64/processor.hh:247
>>>> #1  arch::halt_no_interrupts () at arch/x64/arch.hh:48
>>>> #2  osv::halt () at arch/x64/power.cc:24
>>>> #3  0x0000000000225897 in abort (fmt=fmt@entry=0x719f80 "general 
>>>> protection fault\n") at runtime.cc:132
>>>> #4  0x000000000038bf1a in general_protection (ef=0xffff8000038f6088) at 
>>>> arch/x64/exceptions.cc:322
>>>> #5  <signal handler called>
>>>> #6  processor::ldmxcsr (addr=4359593) at arch/x64/processor.hh:423 
>>>>
>>> #7  sched::thread::switch_to (this=0xffff8000038f1040) at 
>>>> arch/x64/arch-switch.hh:76
>>>> #8  0x6f732f65726f6300 in ?? ()
>>>> #9  0x006e6e6f6378616d in ?? ()
>>>> #10 0xffff8000038c80b8 in ?? ()
>>>> #11 0x00000000419e222d in ?? ()
>>>> #12 0x00000000038f8b00 in ?? ()
>>>> #13 0xffff8000014e6190 in ?? ()
>>>> #14 0x00000000003d388c in std_malloc (size=18446620928467668998, 
>>>>     alignment=<error reading variable: Cannot access memory at address 
>>>> 0x74656e2f737972f7>)
>>>>     at core/mempool.cc:1550
>>>> Backtrace stopped: previous frame inner to this frame (corrupt stack?)
>>>>
>>>> This makes me think that maybe even 8096 is not big enough. Maybe I 
>>>> should experiment with bigger stacks.
>>>>
>>>
>>> I think it is likely indeed be related to stack overflow.
>>>
>>> The call which crashed is the processor's LDMXCSR instruction, which got 
>>> invalid bits, which it shouldn't have: those bits were saved earlier, in  
>>> thread::switch_to(), on the thread's stack.
>>> So if the thread's stack was overwritten, we can have this problem.
>>> The fact that GDB said "corrupt stack" at the end also suggests the 
>>> stack may have been corrupted...
>>>
>>> Try 16K stack, although at this point, it's just than a guess (and I 
>>> have no proof that 16K would always be enough).
>>>
>>> Anyway, excellent work hunting these bugs! It's great that we can 
>>> finally say that "normal" Go programs work on OSv, not just silly "hello 
>>> world" ones.
>>>
>>>
>>>
>>> -- 
>> You received this message because you are subscribed to the Google Groups 
>> "OSv Development" group.
>> To unsubscribe from this group and stop receiving emails from it, send an 
>> email to osv-dev+u...@googlegroups.com <javascript:>.
>> For more options, visit https://groups.google.com/d/optout.
>>
>
>

-- 
You received this message because you are subscribed to the Google Groups "OSv 
Development" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to osv-dev+unsubscr...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Reply via email to