[Qemu-devel] linux-user: interrupting syscalls

Peter Maydell Sun, 04 Dec 2011 08:30:04 -0800

Disclaimer: I'm writing this email because I had a neat idea about how
to solve a problem which Alex Graf discovered, but I don't have the
time to actually implement it :-)


Consider the following guest code, to be run under linux-user mode:

---begin---
#include <stdio.h>
#include <errno.h>
#include <signal.h>
#include <unistd.h>

int pipefd[2];

void usr1_handler(int s)
{
    char x = 'x';
    write(pipefd[1], &x, 1);
}

int main(void)
{
    struct sigaction sa; char x; ssize_t r;
    if (pipe(pipefd) != 0) {
        perror("pipe"); return 1;
    }
    sa.sa_handler = usr1_handler;
    sa.sa_flags = SA_RESTART;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGUSR1, &sa, 0) != 0) {
        perror("sigaction"); return 1;
    }
    printf("read()ing pipe...\n");
    r = read(pipefd[0], &x, 1);
    printf("read returned %zd\n", r);
    return 0;
}
---endit---

When run natively, this program will block until you send it a
SIGUSR1; the signal handler will write to the pipe and cause the read
to complete. Under linux-user mode it deadlocks, because qemu does
not run the guest signal handler when in the middle of emulating a
system call -- it merely queues it to be run when the syscall
finishes. For cases like this where the event that causes the syscall
to complete is actually triggered by the guest signal handler, this
doesn't work.  (There is a real-world instance of this problem in the
Boehm garbage collector, where a signal handler posts to a semaphore
which is being waited on by the mainline code.)

It's not sufficient to simply force all syscalls to be non-restartable
(and then to take the signal when the syscall returns EINTR), because
of the following race condition:
 * qemu enters do_syscall on behalf of the main thread
 * do_syscall is about to call the underlying syscall, when...
 * the signal arrives (and we queue it)
 * do_syscall then calls the host syscall, which will block. Oops.

To fix this I think we need to have linux-user's signal handler
wrapper do a siglongjmp if a signal arrives while we're inside
do_syscall(). This lets us interrupt the syscall properly whether or
not we had already reached the point of making the host syscall.

The tricky bit here is in the details; specifically it's painful to
write code that can cope with being siglongjmp()ed out of at any
point. You need to be careful not to call anything that might not
like being aborted (no malloc, for instance). This might need some
support like an equivalent of critical section macros to prevent
the siglongjmp in some places, and/or cleanup routines to be called
in the event of the jump occurring to release resources.

Luckily we don't have to write the whole of syscall.c like
that: a lot of syscalls are non-blocking, so we can continue to deal
with them as we do now (queue signal, take it on exit). (Incidentally
any code in the implementation of a 'non-blocking' syscall which
doesn't retry if it gets an EINTR return value is broken.) Linux's
signal(7) manpage has a handy overview of which syscalls have to be
interruptible.
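
For the 'non-blocking' case, the retry in question is the usual loop
around the host call. A minimal sketch, where write_retry is a made-up
helper name rather than an existing QEMU function:

```c
#include <assert.h>
#include <errno.h>
#include <unistd.h>

/* Hypothetical helper: repeat the host write() for as long as it
 * fails with EINTR, so a queued signal never leaks out as an error. */
static ssize_t write_retry(int fd, const void *buf, size_t len)
{
    ssize_t r;

    assert(fd >= 0);
    do {
        r = write(fd, buf, len);
    } while (r < 0 && errno == EINTR);
    return r;
}
```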

We also need to properly handle restarting syscalls when we've jumped
out of them to run the guest signal handler. For this I think we
should use a structure basically the same as the Linux kernel uses
itself: do_syscall() returns ERESTARTSYS, and the cpu-specific code
then rewinds the PC to before the syscall insn if the signal we're
about to deliver is one that was registered with SA_RESTART. A handful
of syscalls may need a 'restart handler', where we both wind back the PC
and change the syscall number to __NR_restart_syscall so we can invoke a
syscall-specific 'resume this' function.

I think to do this properly you'd also want to refactor syscall.c so
that instead of being an enormous switch statement it is
table-driven: you just look up the handler function for the
syscall along with its classification (non-blocking vs.
having to handle being interrupted). We could roll in the strace table
too, which might avoid the problem of people adding new syscall
support and forgetting about strace.
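
To make the table-driven idea concrete, here is a small sketch. The
struct layout, the two-entry table, the handler signature and the
HYP_NR_* numbers are all assumptions for illustration; real guest
syscall numbers and argument counts differ.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <unistd.h>

/* Everything here is hypothetical, not existing syscall.c code. */
enum sys_class { SYS_NONBLOCKING, SYS_INTERRUPTIBLE };

typedef long (*sys_handler)(long a, long b, long c);

struct sys_entry {
    const char     *name;  /* doubles as the strace name */
    enum sys_class  cls;   /* how signals must be handled */
    sys_handler     fn;    /* NULL means not implemented */
};

static long do_getpid(long a, long b, long c)
{
    (void)a; (void)b; (void)c;
    return getpid();
}

static long do_read(long a, long b, long c)
{
    ssize_t r = read((int)a, (void *)b, (size_t)c);
    return r < 0 ? -errno : r;
}

/* Made-up syscall numbers for the sketch. */
enum { HYP_NR_getpid, HYP_NR_read, HYP_NR_max };

static const struct sys_entry sys_table[HYP_NR_max] = {
    [HYP_NR_getpid] = { "getpid", SYS_NONBLOCKING,   do_getpid },
    [HYP_NR_read]   = { "read",   SYS_INTERRUPTIBLE, do_read   },
};

/* Look up and run the handler; unknown numbers give -ENOSYS. */
static long dispatch(unsigned num, long a, long b, long c)
{
    if (num >= HYP_NR_max || sys_table[num].fn == NULL) {
        return -ENOSYS;
    }
    assert(sys_table[num].name != NULL);
    return sys_table[num].fn(a, b, c);
}
```

Because the name lives in the same entry as the handler, adding a
syscall without giving strace something to print becomes hard to do by
accident.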

So have I missed something that would mean this wouldn't work?

-- PMM
