Disclaimer: I'm writing this email because I had a neat idea about how to solve a problem which Alex Graf discovered, but I don't have the time to actually implement it :-)
Consider the following guest code, to be run under linux-user mode:

---begin---
#include <stdio.h>
#include <errno.h>
#include <signal.h>
#include <unistd.h>

int pipefd[2];

/* Signal handler: unblock main() by writing one byte to the pipe. */
void usr1_handler(int s)
{
    char x = 'x';
    write(pipefd[1], &x, 1);
}

int main(void)
{
    struct sigaction sa;
    char x;
    ssize_t r;

    if (pipe(pipefd) != 0) {
        perror("pipe");
        return 1;
    }

    sa.sa_handler = usr1_handler;
    sa.sa_flags = SA_RESTART;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGUSR1, &sa, 0) != 0) {
        perror("sigaction");
        return 1;
    }

    printf("read()ing pipe...\n");
    /* Blocks until the handler writes to the other end of the pipe. */
    r = read(pipefd[0], &x, 1);
    printf("read returned %zd\n", r);
    return 0;
}
---endit---

When run natively, this program will block until you send it a SIGUSR1; the signal handler will write to the pipe and cause the read to complete. Run in linux-user mode, it deadlocks, because qemu does not run the guest signal handler when it is in the middle of emulating a system call -- it merely queues the signal, to be delivered when the syscall finishes. For cases like this, where the event that causes the syscall to complete is actually triggered by the guest signal handler, this doesn't work. (There is a real-world instance of this problem in the Boehm garbage collector, where a signal handler posts to a semaphore which is being waited on by the mainline code.)

It's not sufficient to simply force all syscalls to be non-restartable (and then to take the signal when the syscall returns EINTR), because of the following race condition:

 * qemu enters do_syscall on behalf of the main thread
 * do_syscall is about to call the underlying syscall, when...
 * the signal arrives (and we queue it)
 * do_syscall then calls the host syscall, which will block. Oops.

To fix this I think we need to have linux-user's signal handler wrapper do a siglongjmp if a signal arrives while we're inside do_syscall(). This allows us to interrupt properly whether or not we'd got to the point of making the host syscall.
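To make that a bit more concrete, here is a rough standalone sketch of the idea (not qemu code; every name in it is hypothetical). The host-side wrapper queues the signal and, if we are inside an interruptible syscall, siglongjmp()s back out; the syscall code then runs the guest handler and restarts the read. The raise() call simulates the race window above, i.e. the signal arriving just before the host syscall is made.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <setjmp.h>
#include <signal.h>
#include <string.h>
#include <unistd.h>

/* All names below are hypothetical illustrations, not existing qemu code. */
static sigjmp_buf syscall_env;            /* where the wrapper jumps back to */
static volatile sig_atomic_t in_syscall;  /* inside an interruptible syscall? */
static volatile sig_atomic_t pending_sig; /* signal queued for the guest */
static int pipefd[2];

/* Stand-in for the guest's SIGUSR1 handler from the test program. */
static void guest_handler(int sig)
{
    char x = 'x';
    (void)sig;
    if (write(pipefd[1], &x, 1) != 1) {
        /* nothing useful to do in this sketch */
    }
}

/* Host-side wrapper: queue the signal, abort the syscall if one is running. */
static void host_signal_wrapper(int sig)
{
    pending_sig = sig;
    if (in_syscall) {
        in_syscall = 0;
        siglongjmp(syscall_env, 1);
    }
}

/* Create the pipe and install the wrapper; returns 0 on success. */
static int setup(void)
{
    struct sigaction sa;

    if (pipe(pipefd) != 0) {
        return -1;
    }
    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = host_signal_wrapper;
    sigemptyset(&sa.sa_mask);
    return sigaction(SIGUSR1, &sa, NULL);
}

/* Emulated read(): can be jumped out of before or during the host read(). */
static ssize_t emulated_read(int fd, void *buf, size_t len, int simulate_race)
{
    volatile int race = simulate_race; /* volatile: must survive the jump */
    ssize_t r;

    for (;;) {
        /* savemask=1 so the jump also restores the pre-handler signal mask */
        if (sigsetjmp(syscall_env, 1) == 0) {
            in_syscall = 1;
            if (race) {
                race = 0;
                raise(SIGUSR1); /* signal lands just before the host syscall */
            }
            r = read(fd, buf, len);
            /* NB: a signal arriving right here would discard r; a real
             * implementation needs a critical section around this. */
            in_syscall = 0;
            return r;
        }
        /* Arrived via siglongjmp: run the guest handler, then restart. */
        assert(pending_sig != 0);
        guest_handler(pending_sig);
        pending_sig = 0;
    }
}
```

Calling setup() and then emulated_read(pipefd[0], &c, 1, 1) should complete rather than deadlock, because the jump takes us out before the blocking read() and the guest handler gets to fill the pipe. The window between read() returning and in_syscall being cleared is exactly the kind of detail that makes this painful.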
The tricky bit here is in the details; specifically, it's painful to write code that can cope with being siglongjmp()ed out of at any point. You need to be careful not to call anything that might not like being aborted (no malloc, for instance). This might need some support like an equivalent of critical section macros to prevent the siglongjmp in some places, and/or cleanup routines to be called in the event of the jump occurring, to release resources.

Luckily we don't have to write the whole of syscall.c like that: a lot of syscalls are non-blocking, so we can continue to deal with them as we do now (queue the signal, take it on exit). (Incidentally, any code in the implementation of a 'non-blocking' syscall which doesn't retry if it gets an EINTR return value is broken.) Linux's signal(7) manpage has a handy overview of which syscalls have to be interruptible.

We also need to properly handle restarting syscalls when we've jumped out of them to run the guest signal handler. For this I think we should use a scheme basically the same as the one the Linux kernel uses itself: do_syscall() returns ERESTARTSYS, and the cpu-specific code then rewinds the PC to before the syscall insn if the signal we're about to deliver is one that was registered with SA_RESTART. A handful of syscalls may need a 'restart handler' (where we both wind back the PC and change the syscall number to NR_restart_syscall, so we can invoke a syscall-specific 'resume this' function).

I think to do this properly you'd also want to refactor syscall.c so that instead of being an enormous switch statement it was table-driven, so you just looked up the handler function for the syscall as well as what classification it was (non-blocking vs. having to handle being interrupted). We could roll in the strace table too, which might avoid the problem of people adding new syscall support and forgetting about strace.

So have I missed something that would mean this wouldn't work?

-- PMM
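PS: a rough sketch of what the table-driven dispatch plus ERESTARTSYS handling might look like. Everything here is made up for illustration (the struct layout, the DEMO_NR_* numbers, the fake CPU state, the 4-byte syscall insn, the stub handlers); it is not how syscall.c looks today. The only thing taken from the kernel is the value 512 for ERESTARTSYS.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define ERESTARTSYS 512      /* kernel-internal value; never shown to guests */
#define SYSCALL_INSN_LEN 4   /* assumption: fixed-size syscall instruction */

/* Classification: may the handler block and need to be interrupted? */
enum sys_class { SYS_NONBLOCKING, SYS_INTERRUPTIBLE };

/* Hypothetical minimal CPU state: PC and syscall return register. */
struct cpu_state {
    unsigned long pc;
    long ret;
};

typedef long (*sys_fn)(long a0, long a1, long a2);

struct syscall_entry {
    const char *name;     /* could double as the strace name */
    sys_fn handler;
    enum sys_class cls;
};

/* Stub handlers standing in for the real implementations. */
static long demo_getpid(long a0, long a1, long a2)
{
    (void)a0; (void)a1; (void)a2;
    return 42;
}

static long demo_read(long a0, long a1, long a2)
{
    (void)a0; (void)a1; (void)a2;
    return -ERESTARTSYS; /* pretend a signal jumped us out of the read */
}

enum { DEMO_NR_getpid = 0, DEMO_NR_read = 1 };

static const struct syscall_entry syscall_table[] = {
    [DEMO_NR_getpid] = { "getpid", demo_getpid, SYS_NONBLOCKING },
    [DEMO_NR_read]   = { "read",   demo_read,   SYS_INTERRUPTIBLE },
};

/* Table lookup replaces the big switch statement. */
static long do_syscall(int num, long a0, long a1, long a2)
{
    size_t n = sizeof(syscall_table) / sizeof(syscall_table[0]);

    if (num < 0 || (size_t)num >= n || syscall_table[num].handler == NULL) {
        return -ENOSYS;
    }
    return syscall_table[num].handler(a0, a1, a2);
}

/* cpu-specific tail: assumes pc already points past the syscall insn. */
static void finish_syscall(struct cpu_state *cpu, long ret, int sa_restart)
{
    assert(cpu != NULL);
    if (ret == -ERESTARTSYS) {
        if (sa_restart) {
            cpu->pc -= SYSCALL_INSN_LEN; /* re-execute after the handler */
            return;
        }
        ret = -EINTR; /* no SA_RESTART: guest sees an interrupted call */
    }
    cpu->ret = ret;
}
```

The cls field is unused in the sketch; the idea is that the dispatcher would only arm the siglongjmp path for SYS_INTERRUPTIBLE entries and keep the current queue-and-deliver-on-exit behaviour for the rest.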