Josh Kunz <j...@google.com> writes:

> The `clone` system call can be used to create new processes that share
> attributes with their parents, such as virtual memory, file
> system location, file descriptor tables, etc. These can be useful to a
> variety of guest programs.
>
> Before this patch, QEMU had support for a limited set of these attributes.
> Basically the ones needed for threads, and the options used by fork.
> This change adds support for all flag combinations involving CLONE_VM.
> In theory, almost all clone options could be supported, but invocations
> not using CLONE_VM are likely to run afoul of linux-user's inherently
> multi-threaded design.
>
> To add this support, this patch updates the `qemu_clone` helper. An
> overview of the mechanism used to support general `clone` options with
> CLONE_VM is described below.
>
> This patch also enables by-default the `clone` unit-tests in
> tests/tcg/multiarch/linux-test.c, and adds an additional test for duplicate
> exit signals, based on a bug found during development.

Which by the way fail on some targets:

    TEST    linux-test on alpha
  /home/alex/lsrc/qemu.git/tests/tcg/multiarch/linux-test.c:709: child did not receive PDEATHSIG on parent death
  make[2]: *** [../Makefile.target:153: run-linux-test] Error 1
  make[1]: *** [/home/alex/lsrc/qemu.git/tests/tcg/Makefile.qemu:76: run-guest-tests] Error 2
  make: *** [/home/alex/lsrc/qemu.git/tests/Makefile.include:851: run-tcg-tests-alpha-linux-user] Error 2

Have you managed a clean check-tcg with docker enabled so all the guest
architectures get tested?

>
> !! Overview
>
> Adding support for CLONE_VM is tricky. The parent and guest process will
> share an address space (similar to threads), so the emulator must
> coordinate between the parent and the child. Currently, QEMU relies
> heavily on Thread Local Storage (TLS) as part of this coordination
> strategy. For threads, this works fine, because libc manages the
> thread-local data region used for TLS, when we create new threads using
> `pthread_create`.
> Ideally we could use the same mechanism for
> "process-local storage" needed to allow the parent/child processes to
> emulate in tandem. Unfortunately TLS is tightly integrated into libc.
> The only way to create TLS data regions is via the `pthread_create` API
> which also spawns a new thread (rather than a new process, which is
> what we want). Worse still, TLS itself is a complicated arch-specific
> feature that is tightly integrated into the rest of libc and the dynamic
> linker. Re-implementing TLS support for QEMU would likely require a
> special dynamic linker / libc. Alternatively, the popular libcs could be
> extended, to allow for users to create TLS regions without creating
> threads. Even if major libcs decide to add this support, QEMU will still
> need a temporary work around until those libcs are widely deployed. It's
> also unclear if libcs will be interested in supporting this case, since
> TLS image creation is generally deeply integrated with thread setup.
>
> In this patch, I've employed an alternative approach: spawning a thread
> and "stealing" its TLS image for use in the child process. This approach
> leaves a dangling thread while the TLS image is in use, but by design
> that thread will not become schedulable until after the TLS data is no
> longer in-use by the child (as described in a moment). Therefore, it
> should cause relatively minimal overhead. When considered in the larger
> context, this seems like a reasonable tradeoff.

*sharp intake of breath*

OK so the solution to the complexity of handling threads is to add more
threads? cool cool cool....

>
> A major complication of this approach is knowing when it is safe to clean up
> the stack, and TLS image, used by a child process. When a child is
> created with `CLONE_VM` its stack, and TLS data, need to remain valid
> until that child has either exited, or successfully called `execve` (on
> `execve` the child is given a new VMM by the kernel).
> One approach would
> be to use `waitid(WNOWAIT)` (the `WNOWAIT` allows the guest to reap the
> child). The problem is that the `wait` family of calls only waits for
> termination. The pattern of `clone() ... execve()` for long running
> child processes is pretty common. If we waited for child processes to
> exit, it's likely we would end up using substantially more memory, and
> keep the suspended TLS thread around much longer than necessary.
> Instead, in this patch, I've used a "trampoline" process. The real
> parent first clones a trampoline, the trampoline then clones the
> ultimate child using the `CLONE_VFORK` option. `CLONE_VFORK` suspends
> the trampoline process until the child has exited, or called `execve`.
> Once the trampoline is re-scheduled, we know it is safe to clean up
> after the child. This creates one more suspended process, but typically,
> the trampoline only exists for a short period of time.
>
> !! CLONE_VM setup, step by step
>
> 1. First, the suspended thread whose TLS we will use is created using
>    `pthread_create`. The thread fetches and returns its "TLS pointer"
>    (an arch-specific value given to the kernel) to the parent. It then
>    blocks on a lock to prevent its TLS data from being cleaned up.
>    Ultimately the lock will be unlocked by the trampoline once the child
>    exits.
> 2. Once the TLS thread has fetched the TLS pointer, it notifies the real
>    parent thread, which calls `clone()` to create the trampoline
>    process. For ease of implementation, the TLS image is set for the
>    trampoline process during this step. This allows the trampoline to
>    use functions that require TLS if needed (e.g., printf). TLS location
>    is inherited when a new child is spawned, so this TLS data will
>    automatically be inherited by the child.
> 3. Once the trampoline has been spawned, it registers itself as a
>    "hidden" process with the signal subsystem. This prevents the exit
>    signal from the trampoline from ever being forwarded to the guest.
>    This is needed due to the way that Linux sets the exit signal for the
>    ultimate child when `CLONE_PARENT` is set. See the source for
>    details.
> 4. Once setup is complete, the trampoline spawns the final child with
>    the original clone flags, plus `CLONE_PARENT`, so the child is
>    correctly parented to the kernel task on which the guest invoked
>    `clone`. Without this, kernel features like PDEATHSIG, and
>    subreapers, would not work properly. As previously discussed, the
>    trampoline also supplies `CLONE_VFORK` so that it is suspended until
>    the child can be cleaned up.
> 5. Once the child is spawned, it signals the original parent thread that
>    it is running. At this point, the trampoline process is suspended
>    (due to CLONE_VFORK).
> 6. Finally, the call to `qemu_clone` in the parent is finished, and the
>    child begins executing the given callback function in the new child
>    process.
>
> !! Cleaning up
>
> Clean up itself is a multi-step process. Once the child exits, or is
> killed by a signal (cleanup is the same in both cases), the trampoline
> process becomes schedulable. When the trampoline is scheduled, it frees
> the child stack, and unblocks the suspended TLS thread. This cleans up
> the child resources, but not the stack used by the trampoline itself. It
> is possible for a process to clean up its own stack, but it is tricky,
> and architecture-specific. Instead we leverage the TLS manager thread to
> clean up the trampoline stack. When the trampoline is cloned (in step 2
> above), we additionally set the `CHILD_SETTID` and `CHILD_CLEARTID`
> flags. The target location for the SET/CLEAR TID is set to a special field
> known by the TLS manager. Then, when the TLS manager thread is unsuspended,
> it performs an additional `FUTEX_WAIT` on this location. That blocks the
> TLS manager thread until the trampoline has fully exited, then the TLS
> manager thread frees the trampoline process's stack, before exiting
> itself.
>
> !! Shortcomings of this patch
>
> * It's complicated.
> * It doesn't support any clone options when CLONE_VM is omitted.
> * It doesn't properly clean up the CPU queue when the child process
>   terminates, or calls execve().
> * RCU unregistration is done in the trampoline process (in clone.c), but
>   registration happens in syscall.c. This should be made more explicit.
> * The TLS image, and trampoline stack are not cleaned up if the parent
>   calls `execve` or `exit_group` before the child does. This is because
>   those cleanup tasks are handled by the TLS manager thread. The TLS
>   manager thread is in the same thread group as the parent, so it will
>   be terminated if the parent exits or calls `execve`.
>
> !! Alternatives considered
>
> * Non-standard libc extension to allow creating TLS images independent
>   of threads. This would allow us to just `clone` the child directly
>   instead of this complicated maneuver. Though we probably would still
>   need the cleanup logic. For libcs, TLS image allocation is tightly
>   connected to thread stack allocation, which is also arch-specific. I
>   do not have enough experience with libc development to know if
>   maintainers of any popular libcs would be open to supporting such an
>   API. Additionally, since it will probably take years before a libc
>   fix would be widely deployed, we need an interim solution anyways.

We could consider a custom lib stub that intercepts calls to the guest's
original libc and replaces it with a QEMU-aware one?

> * Non-standard, Linux-only, libc extension to allow us to specify the
>   CLONE_* flags used by `pthread_create`. The processes we are creating
>   are basically threads in a different thread group. If we could alter
>   the flags used, this whole process could become a `pthread_create`.
>   The problem with this approach is that I don't know what requirements
>   pthreads has on threads to ensure they function properly.
>   I suspect
>   that pthreads relies on CHILD_CLEARTID+FUTEX_WAKE to cleanup detached
>   thread state. Since we don't control the child exit reason (Linux only
>   handles CHILD_CLEARTID on normal, non-signal process termination), we
>   probably can't use this same tracking mechanism.
> * Other mechanisms for detecting child exit so cleanup can happen
>   besides CLONE_VFORK:
>   * waitid(WNOWAIT): This can only detect exit, not execve.
>   * file descriptors with close on exec set: This cannot detect children
>     cloned with CLONE_FILES.
>   * System V semaphore adjustments: Cannot detect children cloned with
>     CLONE_SYSVSEM.
>   * CLONE_CHILD_CLEARTID + FUTEX_WAIT: Cannot detect abnormally
>     terminated children.
> * Doing the child clone directly in the TLS manager thread: This saves the
>   need for the trampoline process, but it causes the child process to be
>   parented to the wrong kernel task (the TLS thread instead of the Main
>   thread) breaking things like PDEATHSIG.

Have you considered a daemon which could co-ordinate between the
multiple processes that are sharing some state?

> Signed-off-by: Josh Kunz <j...@google.com>
> ---
>  linux-user/clone.c               | 415 ++++++++++++++++++++++++++++++-
>  linux-user/qemu.h                |  17 ++
>  linux-user/signal.c              |  49 ++++
>  linux-user/syscall.c             |  69 +++--
>  tests/tcg/multiarch/linux-test.c |  67 ++++-
>  5 files changed, 592 insertions(+), 25 deletions(-)
>
> diff --git a/linux-user/clone.c b/linux-user/clone.c
> index f02ae8c464..3f7344cf9e 100644
> --- a/linux-user/clone.c
> +++ b/linux-user/clone.c
> @@ -12,6 +12,12 @@
>  #include <stdbool.h>
>  #include <assert.h>
>
> +/* arch-specifc includes needed to fetch the TLS base offset. */
> +#if defined(__x86_64__)
> +#include <asm/prctl.h>
> +#include <sys/prctl.h>
> +#endif
> +
>  static const unsigned long NEW_STACK_SIZE = 0x40000UL;
>
>  /*
> @@ -62,6 +68,397 @@ static void completion_finish(struct completion *c)
>      pthread_mutex_unlock(&c->mu);
>  }
>
> +struct tls_manager {
> +    void *tls_ptr;
> +    /* fetched is completed once tls_ptr has been set by the thread. */
> +    struct completion fetched;
> +    /*
> +     * spawned is completed by the user once the managed_tid
> +     * has been spawned.
> +     */
> +    struct completion spawned;
> +    /*
> +     * TID of the child whose memory is cleaned up upon death. This memory
> +     * location is used as part of a futex op, and is cleared by the kernel
> +     * since we specify CHILD_CLEARTID.
> +     */
> +    int managed_tid;
> +    /*
> +     * The value to be `free`'d up once the janitor is ready to clean up the
> +     * TLS section, and the managed tid has exited.
> +     */
> +    void *cleanup;
> +};
> +
> +/*
> + * tls_ptr fetches the TLS "pointer" for the current thread. This pointer
> + * should be whatever platform-specific address is used to represent the TLS
> + * base address.
> + */
> +static void *tls_ptr()

This and a number of other prototypes need void args to stop the
compiler complaining about missing prototypes.

> +{
> +    void *ptr;
> +#if defined(__x86_64__)
> +    /*
> +     * On x86_64, the TLS base is stored in the `fs` segment register, we can
> +     * fetch it with `ARCH_GET_FS`:
> +     */
> +    (void)syscall(SYS_arch_prctl, ARCH_GET_FS, (unsigned long) &ptr);
> +#else
> +    ptr = NULL;
> +#endif
> +    return ptr;
> +}
> +
> +/*
> + * clone_vm_supported returns true if clone_vm() is supported on this
> + * platform.
> + */
> +static bool clone_vm_supported()
> +{
> +#if defined(__x86_64__)
> +    return true;
> +#else
> +    return false;
> +#endif
> +}

<snip>

--
Alex Bennée
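
P.S. To make the prototype comment concrete, here is a minimal sketch of
the two quoted helpers with `(void)` parameter lists. It is only an
illustration under a few assumptions, not a proposed replacement hunk:
the include list and the NULL initialisation of `ptr` are additions that
are not in the quoted patch, and the warning is assumed to come from
something like GCC's -Wstrict-prototypes. The x86_64-only guard simply
mirrors the hunk above.

```c
#define _GNU_SOURCE
#include <stdbool.h>
#include <stddef.h>
#include <sys/syscall.h>
#include <unistd.h>

#if defined(__x86_64__)
#include <asm/prctl.h> /* ARCH_GET_FS */
#endif

/*
 * "(void)" declares a function taking no arguments. Before C23, an empty
 * "()" leaves the parameters unspecified, which is what the compiler
 * warns about.
 */
static void *tls_ptr(void)
{
    /* Illustrative addition: start from NULL so a failed syscall is visible. */
    void *ptr = NULL;
#if defined(__x86_64__)
    /* On x86_64 the TLS base lives in the fs segment base register. */
    (void)syscall(SYS_arch_prctl, ARCH_GET_FS, (unsigned long)&ptr);
#endif
    return ptr;
}

/* Same fix applied to the second helper from the quoted hunk. */
static bool clone_vm_supported(void)
{
#if defined(__x86_64__)
    return true;
#else
    return false;
#endif
}
```

On an x86_64 host `tls_ptr()` should return the same non-NULL value every
time it is called from a given thread; elsewhere this sketch returns NULL
and `clone_vm_supported()` reports false, matching the quoted code.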