Re: [PATCH 4/5] linux-user: Support CLONE_VM and extended clone options

Alex Bennée Tue, 16 Jun 2020 09:09:17 -0700


Josh Kunz <j...@google.com> writes:


> The `clone` system call can be used to create new processes that share
> attributes with their parents, such as virtual memory, file
> system location, file descriptor tables, etc. These can be useful to a
> variety of guest programs.
>
> Before this patch, QEMU had support for a limited set of these attributes.
> Basically the ones needed for threads, and the options used by fork.
> This change adds support for all flag combinations involving CLONE_VM.
> In theory, almost all clone options could be supported, but invocations
> not using CLONE_VM are likely to run afoul of linux-user's inherently
> multi-threaded design.
>
> To add this support, this patch updates the `qemu_clone` helper. An
> overview of the mechanism used to support general `clone` options with
> CLONE_VM is described below.
>
> This patch also enables by-default the `clone` unit-tests in
> tests/tcg/multiarch/linux-test.c, and adds an additional test for duplicate
> exit signals, based on a bug found during development.

Which by the way fail on some targets:

    TEST    linux-test on alpha
  /home/alex/lsrc/qemu.git/tests/tcg/multiarch/linux-test.c:709: child did not 
receive PDEATHSIG on parent death
  make[2]: *** [../Makefile.target:153: run-linux-test] Error 1
  make[1]: *** [/home/alex/lsrc/qemu.git/tests/tcg/Makefile.qemu:76: 
run-guest-tests] Error 2
  make: *** [/home/alex/lsrc/qemu.git/tests/Makefile.include:851: 
run-tcg-tests-alpha-linux-user] Error 2

Have you managed a clean check-tcg with docker enabled so all the guest
architectures get tested?

>
> !! Overview
>
> Adding support for CLONE_VM is tricky. The parent and guest process will
> share an address space (similar to threads), so the emulator must
> coordinate between the parent and the child. Currently, QEMU relies
> heavily on Thread Local Storage (TLS) as part of this coordination
> strategy. For threads, this works fine, because libc manages the
> thread-local data region used for TLS, when we create new threads using
> `pthread_create`. Ideally we could use the same mechanism for
> "process-local storage" needed to allow the parent/child processes to
> emulate in tandem. Unfortunately TLS is tightly integrated into libc.
> The only way to create TLS data regions is via the `pthread_create` API
> which also spawns a new thread (rather than a new processes, which is
> what we want). Worse still, TLS itself is a complicated arch-specific
> feature that is tightly integrated into the rest of libc and the dynamic
> linker. Re-implementing TLS support for QEMU would likely require a
> special dynamic linker / libc. Alternatively, the popular libcs could be
> extended, to allow for users to create TLS regions without creating
> threads. Even if major libcs decide to add this support, QEMU will still
> need a temporary work around until those libcs are widely deployed. It's
> also unclear if libcs will be interested in supporting this case, since
> TLS image creation is generally deeply integrated with thread setup.
>
> In this patch, I've employed an alternative approach: spawning a thread
> an "stealing" its TLS image for use in the child process. This approach
> leaves a dangling thread while the TLS image is in use, but by design
> that thread will not become schedulable until after the TLS data is no
> longer in-use by the child (as described in a moment). Therefore, it
> should cause relatively minimal overhead. When considered in the larger
> context, this seems like a reasonable tradeoff.

*sharp intake of breath*

OK so the solution to the complexity of handling threads is to add more
threads? cool cool cool....

>
> A major complication of this approach knowing when it is safe to clean up
> the stack, and TLS image, used by a child process. When a child is
> created with `CLONE_VM` its stack, and TLS data, need to remain valid
> until that child has either exited, or successfully called `execve` (on
> `execve` the child is given a new VMM by the kernel). One approach would
> be to use `waitid(WNOWAIT)` (the `WNOWAIT` allows the guest to reap the
> child). The problem is that the `wait` family of calls only waits for
> termination. The pattern of `clone() ... execve()` for long running
> child processes is pretty common. If we waited for child processes to
> exit, it's likely we would end up using substantially more memory, and
> keep the suspended TLS thread around much longer than necessary.
> Instead, in this patch, I've used an "trampoline" process. The real
> parent first clones a trampoline, the trampoline then clones the
> ultimate child using the `CLONE_VFORK` option. `CLONE_VFORK` suspends
> the trampoline process until the child has exited, or called `execve`.
> Once the trampoline is re-scheduled, we know it is safe to clean up
> after the child. This creates one more suspended process, but typically,
> the trampoline only exists for a short period of time.
>
> !! CLONE_VM setup, step by step
>
> 1. First, the suspended thread whose TLS we will use is created using
>    `pthread_create`. The thread fetches and returns it's "TLS pointer"
>    (an arch-specific value given to the kernel) to the parent. It then
>    blocks on a lock to prevent its TLS data from being cleaned up.
>    Ultimately the lock will be unlocked by the trampoline once the child
>    exits.
> 2. Once the TLS thread has fetched the TLS pointer, it notifies the real
>    parent thread, which calls `clone()` to create the trampoline
>    process. For ease of implementation, the TLS image is set for the
>    trampoline process during this step. This allows the trampoline to
>    use functions that require TLS if needed (e.g., printf). TLS location
>    is inherited when a new child is spawned, so this TLS data will
>    automatically be inherited by the child.
> 3. Once the trampoline has been spawned, it registers itself as a
>    "hidden" process with the signal subsystem. This prevents the exit
>    signal from the trampoline from ever being forwarded to the guest.
>    This is needed due to the way that Linux sets the exit signal for the
>    ultimate child when `CLONE_PARENT` is set. See the source for
>    details.
> 4. Once setup is complete, the trampoline spawns the final child with
>    the original clone flags, plus `CLONE_PARENT`, so the child is
>    correctly parented to the kernel task on which the guest invoked
>    `clone`. Without this, kernel features like PDEATHSIG, and
>    subreapers, would not work properly. As previously discussed, the
>    trampoline also supplies `CLONE_VFORK` so that it is suspended until
>    the child can be cleaned up.
> 5. Once the child is spawned, it signals the original parent thread that
>    it is running. At this point, the trampoline process is suspended
>    (due to CLONE_VFORK).
> 6. Finally, the call to `qemu_clone` in the parent is finished, the
>    child begins executing the given callback function in the new child
>    process.
>
> !! Cleaning up
>
> Clean up itself is a multi-step process. Once the child exits, or is
> killed by a signal (cleanup is the same in both cases), the trampoline
> process becomes schedulable. When the trampoline is scheduled, it frees
> the child stack, and unblocks the suspended TLS thread. This cleans up
> the child resources, but not the stack used by the trampoline itself. It
> is possible for a process to clean up its own stack, but it is tricky,
> and architecture-specific. Instead we leverage the TLS manager thread to
> clean up the trampoline stack. When the trampoline is cloned (in step 2
> above), we additionally set the `CHILD_SETTID` and `CHILD_CLEARTID`
> flags. The target location for the SET/CLEAR TID is set to a special field
> known by the TLS manager. Then, when the TLS manager thread is unsuspended,
> it performs an additional `FUTEX_WAIT` on this location. That blocks the
> TLS manager thread until the trampoline has fully exited, then the TLS
> manager thread frees the trampoline process's stack, before exiting
> itself.
>
> !! Shortcomings of this patch
>
> * It's complicated.
> * It doesn't support any clone options when CLONE_VM is omitted.
> * It doesn't properly clean up the CPU queue when the child process
>   terminates, or calls execve().
> * RCU unregistration is done in the trampoline process (in clone.c), but
>   registration happens in syscall.c This should be made more explicit.
> * The TLS image, and trampoline stack are not cleaned up if the parent
>   calls `execve` or `exit_group` before the child does. This is because
>   those cleanup tasks are handled by the TLS manager thread. The TLS
>   manager thread is in the same thread group as the parent, so it will
>   be terminated if the parent exits or calls `execve`.
>
> !! Alternatives considered
>
> * Non-standard libc extension to allow creating TLS images independent
>   of threads. This would allow us to just `clone` the child directly
>   instead of this complicated maneuver. Though we probably would still
>   need the cleanup logic. For libcs, TLS image allocation is tightly
>   connected to thread stack allocation, which is also arch-specific. I
>   do not have enough experience with libc development to know if
>   maintainers of any popular libcs would be open to supporting such an
>   API. Additionally, since it will probably take years before a libc
>   fix would be widely deployed, we need an interim solution anyways.

We could consider a custom lib stub that intercepts calls to the guests
original libc and replaces it with a QEMU aware one?

> * Non-standard, Linux-only, libc extension to allow us to specify the
>   CLONE_* flags used by `pthread_create`. The processes we are creating
>   are basically threads in a different thread group. If we could alter
>   the flags used, this whole processes could become a `pthread_create.`
>   The problem with this approach is that I don't know what requirements
>   pthreads has on threads to ensure they function properly. I suspect
>   that pthreads relies on CHILD_CLEARTID+FUTEX_WAKE to cleanup detached
>   thread state. Since we don't control the child exit reason (Linux only
>   handles CHILD_CLEARTID on normal, non-signal process termination), we
>   probably can't use this same tracking mechanism.
> * Other mechanisms for detecting child exit so cleanup can happen
>   besides CLONE_VFORK:
>   * waitid(WNOWAIT): This can only detect exit, not execve.
>   * file descriptors with close on exec set: This cannot detect children
>     cloned with CLONE_FILES.
>   * System V semaphore adjustments: Cannot detect children cloned with
>     CLONE_SYSVSEM.
>   * CLONE_CHILD_CLEARTID + FUTEX_WAIT: Cannot detect abnormally
>     terminated children.
> * Doing the child clone directly in the TLS manager thread: This saves the
>   need for the trampoline process, but it causes the child process to be
>   parented to the wrong kernel task (the TLS thread instead of the Main
>   thread) breaking things like PDEATHSIG.

Have you considered a daemon which could co-ordinate between the
multiple processes that are sharing some state?


> Signed-off-by: Josh Kunz <j...@google.com>
> ---
>  linux-user/clone.c               | 415 ++++++++++++++++++++++++++++++-
>  linux-user/qemu.h                |  17 ++
>  linux-user/signal.c              |  49 ++++
>  linux-user/syscall.c             |  69 +++--
>  tests/tcg/multiarch/linux-test.c |  67 ++++-
>  5 files changed, 592 insertions(+), 25 deletions(-)
>
> diff --git a/linux-user/clone.c b/linux-user/clone.c
> index f02ae8c464..3f7344cf9e 100644
> --- a/linux-user/clone.c
> +++ b/linux-user/clone.c
> @@ -12,6 +12,12 @@
>  #include <stdbool.h>
>  #include <assert.h>
>  
> +/* arch-specifc includes needed to fetch the TLS base offset. */
> +#if defined(__x86_64__)
> +#include <asm/prctl.h>
> +#include <sys/prctl.h>
> +#endif
> +
>  static const unsigned long NEW_STACK_SIZE = 0x40000UL;
>  
>  /*
> @@ -62,6 +68,397 @@ static void completion_finish(struct completion *c)
>      pthread_mutex_unlock(&c->mu);
>  }
>  
> +struct tls_manager {
> +    void *tls_ptr;
> +    /* fetched is completed once tls_ptr has been set by the thread. */
> +    struct completion fetched;
> +    /*
> +     * spawned is completed by the user once the managed_tid
> +     * has been spawned.
> +     */
> +    struct completion spawned;
> +    /*
> +     * TID of the child whose memory is cleaned up upon death. This memory
> +     * location is used as part of a futex op, and is cleared by the kernel
> +     * since we specify CHILD_CLEARTID.
> +     */
> +    int managed_tid;
> +    /*
> +     * The value to be `free`'d up once the janitor is ready to clean up the
> +     * TLS section, and the managed tid has exited.
> +     */
> +    void *cleanup;
> +};
> +
> +/*
> + * tls_ptr fetches the TLS "pointer" for the current thread. This pointer
> + * should be whatever platform-specific address is used to represent the TLS
> + * base address.
> + */
> +static void *tls_ptr()

This and a number of other prototypes need void args to stop the
compiler complaining about missing prototypes.

> +{
> +    void *ptr;
> +#if defined(__x86_64__)
> +    /*
> +     * On x86_64, the TLS base is stored in the `fs` segment register, we can
> +     * fetch it with `ARCH_GET_FS`:
> +     */
> +    (void)syscall(SYS_arch_prctl, ARCH_GET_FS, (unsigned long) &ptr);
> +#else
> +    ptr = NULL;
> +#endif
> +    return ptr;
> +}
> +
> +/*
> + * clone_vm_supported returns true if clone_vm() is supported on this
> + * platform.
> + */
> +static bool clone_vm_supported()
> +{
> +#if defined(__x86_64__)
> +    return true;
> +#else
> +    return false;
> +#endif
> +}
<snip>

-- 
Alex Bennée

Re: [PATCH 4/5] linux-user: Support CLONE_VM and extended clone options

Reply via email to