On Thu, Aug 11, 2022 at 03:56:23PM +0200, Claudio Imbrenda wrote: > On Thu, 11 Aug 2022 13:27:44 +0100 > Daniel P. Berrangé <berra...@redhat.com> wrote: > > > On Tue, Aug 09, 2022 at 08:40:24AM +0200, Claudio Imbrenda wrote: > > > This patch adds support for asynchronously tearing down a VM on Linux. > > > > > > When qemu terminates, either naturally or because of a fatal signal, > > > the VM is torn down. If the VM is huge, it can take a considerable > > > amount of time for it to be cleaned up. In case of a protected VM, it > > > might take even longer than a non-protected VM (this is the case on > > > s390x, for example). > > > > > > Some users might want to shut down a VM and restart it immediately, > > > without having to wait. This is especially true if management > > > infrastructure like libvirt is used. > > > > > > This patch implements a simple trick on Linux to allow qemu to return > > > immediately, with the teardown of the VM being performed > > > asynchronously. > > > > > > If the new commandline option -async-teardown is used, a new process is > > > spawned from qemu at startup, using the clone syscall, in such way that > > > it will share its address space with qemu. > > > > > > The new process will have the name "cleanup/<QEMU_PID>". It will wait > > > until qemu terminates, and then it will exit itself. > > > > > > This allows qemu to terminate quickly, without having to wait for the > > > whole address space to be torn down. The teardown process will exit > > > after qemu, so it will be the last user of the address space, and > > > therefore it will take care of the actual teardown. > > > > > > The teardown process will share the same cgroups as qemu, so both > > > memory usage and cpu time will be accounted properly. > > > > > > This feature can already be used with libvirt by adding the following > > > to the XML domain definition to pass the parameter to qemu directly: > > > > > > <commandline xmlns="http://libvirt.org/schemas/domain/qemu/1.0"> > > > <arg value='-async-teardown'/> > > > </commandline> > > > > > > More advanced interfaces like pidfd or close_range have intentionally > > > been avoided in order to be more compatible with older kernels. > > > > > > Signed-off-by: Claudio Imbrenda <imbre...@linux.ibm.com> > > > --- > > > include/qemu/async-teardown.h | 22 ++++++ > > > os-posix.c | 6 ++ > > > qemu-options.hx | 17 +++++ > > > util/async-teardown.c | 123 ++++++++++++++++++++++++++++++++++ > > > util/meson.build | 1 + > > > 5 files changed, 169 insertions(+) > > > create mode 100644 include/qemu/async-teardown.h > > > create mode 100644 util/async-teardown.c > > > > > > > > diff --git a/include/qemu/async-teardown.h b/include/qemu/async-teardown.h > > > new file mode 100644 > > > index 0000000000..092e7a37e7 > > > --- /dev/null > > > +++ b/include/qemu/async-teardown.h > > > @@ -0,0 +1,22 @@ > > > +/* > > > + * Asynchronous teardown > > > + * > > > + * Copyright IBM, Corp. 2022 > > > + * > > > + * Authors: > > > + * Claudio Imbrenda <imbre...@linux.ibm.com> > > > + * > > > + * This work is licensed under the terms of the GNU GPL, version 2 or > > > (at your > > > + * option) any later version. See the COPYING file in the top-level > > > directory. > > > + * > > > + */ > > > +#ifndef QEMU_ASYNC_TEARDOWN_H > > > +#define QEMU_ASYNC_TEARDOWN_H > > > + > > > +#include "config-host.h" > > > + > > > +#ifdef CONFIG_LINUX > > > +void init_async_teardown(void); > > > +#endif > > > + > > > +#endif > > > diff --git a/os-posix.c b/os-posix.c > > > index 321fc4bd13..4858650c3e 100644 > > > --- a/os-posix.c > > > +++ b/os-posix.c > > > @@ -39,6 +39,7 @@ > > > > > > #ifdef CONFIG_LINUX > > > #include <sys/prctl.h> > > > +#include "qemu/async-teardown.h" > > > #endif > > > > > > /* > > > @@ -150,6 +151,11 @@ int os_parse_cmd_args(int index, const char *optarg) > > > case QEMU_OPTION_daemonize: > > > daemonize = 1; > > > break; > > > +#if defined(CONFIG_LINUX) > > > + case QEMU_OPTION_asyncteardown: > > > + init_async_teardown(); > > > + break; > > > +#endif > > > default: > > > return -1; > > > } > > > diff --git a/qemu-options.hx b/qemu-options.hx > > > index 3f23a42fa8..d434353159 100644 > > > --- a/qemu-options.hx > > > +++ b/qemu-options.hx > > > @@ -4743,6 +4743,23 @@ HXCOMM Internal use > > > DEF("qtest", HAS_ARG, QEMU_OPTION_qtest, "", QEMU_ARCH_ALL) > > > DEF("qtest-log", HAS_ARG, QEMU_OPTION_qtest_log, "", QEMU_ARCH_ALL) > > > > > > +#ifdef __linux__ > > > +DEF("async-teardown", 0, QEMU_OPTION_asyncteardown, > > > + "-async-teardown enable asynchronous teardown\n", > > > + QEMU_ARCH_ALL) > > > +#endif > > > +SRST > > > +``-async-teardown`` > > > + Enable asynchronous teardown. A new teardown process will be > > > + created at startup, using clone. The teardown process will share > > > + the address space of the main qemu process, and wait for the main > > > + process to terminate. At that point, the teardown process will > > > + also exit. This allows qemu to terminate quickly if the guest was > > > + huge, leaving the teardown of the address space to the teardown > > > + process. Since the teardown process shares the same cgroups as the > > > + main qemu process, accounting is performed correctly. > > > +ERST > > > + > > > DEF("msg", HAS_ARG, QEMU_OPTION_msg, > > > "-msg [timestamp[=on|off]][,guest-name=[on|off]]\n" > > > " control error message format\n" > > > > It occurrs to me that we've got a general goal of getting away from > > adding new top level command line arguments. Most of the time there's > > an obvious existing place to put them, but I'm really not sure > > where this particular option would fit ? > > > > it isn't tied to any aspect of the VM backend configuration nor > > hardware frontends. > > > > The closest match is the lifecycle action option (-no-shutdown) > > which were merged into a -action arg, but even that's a bit of a > > stretch. > > > > I have no preferences as to how the parameter is passed > > > Markus/Paolo: do you have suggestions ? > > > > > > > diff --git a/util/async-teardown.c b/util/async-teardown.c > > > new file mode 100644 > > > index 0000000000..07fe549891 > > > --- /dev/null > > > +++ b/util/async-teardown.c > > > @@ -0,0 +1,123 @@ > > > +/* > > > + * Asynchronous teardown > > > + * > > > + * Copyright IBM, Corp. 2022 > > > + * > > > + * Authors: > > > + * Claudio Imbrenda <imbre...@linux.ibm.com> > > > + * > > > + * This work is licensed under the terms of the GNU GPL, version 2 or > > > (at your > > > + * option) any later version. See the COPYING file in the top-level > > > directory. > > > + * > > > + */ > > > +#include <stdlib.h> > > > +#include <stdio.h> > > > +#include <sys/types.h> > > > +#include <sys/unistd.h> > > > +#include <dirent.h> > > > +#include <sys/prctl.h> > > > +#include <signal.h> > > > +#include <sched.h> > > > + > > > +#include "qemu/async-teardown.h" > > > + > > > +static pid_t the_ppid; > > > + > > > +/* > > > + * Close all open file descriptors. > > > + */ > > > +static void close_all_open_fd(void) > > > +{ > > > + struct dirent *de; > > > + int fd, dfd; > > > + DIR *dir; > > > + > > > + dir = opendir("/proc/self/fd"); > > > + if (!dir) { > > > + return; > > > + } > > > + /* Avoid closing the directory. */ > > > + dfd = dirfd(dir); > > > + > > > + for (de = readdir(dir); de; de = readdir(dir)) { > > > + fd = atoi(de->d_name); > > > + if (fd != dfd) { > > > + close(fd); > > > + } > > > + } > > > + closedir(dir); > > > +} > > > > I know you said you're intentionally avoided close_range, but the > > cost is that this makes 'proc' a mandatory dependancy. > > Running qemu without /proc sounds a little bit weird, but ok
People have been doing increasingly wierd things with containers ;-) > > IMHO, we should build with close_range support if it is available > > in the glibc headers. Only use proc as a fallback if close_range > > reports ENOSYS, or is compiled out. > > sounds reasonable. > > what is the best way to do it? just some #ifdefs to check if the > syscall is available, or a config option to check at config time if the > syscall is defined? > > (and then I guess the ENOSYS check needs to be there anyway) Probably a check for the function in meson.build, and then some #ifdefs, and a catch of ENOSYS. > > > + > > > +static void hup_handler(int signal) > > > +{ > > > + /* Check every second if this process has been reparented. */ > > > + while (the_ppid == getppid()) { > > > + /* sleep() is safe to use in a signal handler. */ > > > + sleep(1); > > > + } > > > + > > > + /* At this point the parent process has terminated completely. */ > > > + _exit(0); > > > +} > > > + > > > +static int async_teardown_fn(void *arg) > > > +{ > > > + struct sigaction sa = { .sa_handler = hup_handler }; > > > + sigset_t hup_signal; > > > + char name[16]; > > > + > > > + /* Set a meaningful name for this process. */ > > > + snprintf(name, 16, "cleanup/%d", the_ppid); > > > + prctl(PR_SET_NAME, (unsigned long)name); > > > + > > > + /* > > > + * Close all file descriptors that might have been inherited from the > > > + * main qemu process when doing clone, needed to make libvirt happy. > > > + * Not using close_range for increased compatibility with older > > > kernels. > > > + */ > > > + close_all_open_fd(); > > > + > > > + /* Set up a handler for SIGHUP and unblock SIGHUP. */ > > > + sigaction(SIGHUP, &sa, NULL); > > > + sigemptyset(&hup_signal); > > > + sigaddset(&hup_signal, SIGHUP); > > > + sigprocmask(SIG_UNBLOCK, &hup_signal, NULL); > > > + > > > + /* Ask to receive SIGHUP when the parent dies. */ > > > + prctl(PR_SET_PDEATHSIG, SIGHUP); > > > > Hmm, I was hoping you could just use SIGKILL to guarantee that this > > gets killed off. Is SIGKILL delivered too soon to allow for the > > main QEMU process to have exited quickly ? > > yes, I tried. qemu has not finished exiting when the signal is > delivered, the cleanup process dies before qemu, which defeats the > purpose Ok, too bad. > > If so I wonder what happens when systemd just delivers SIGKILL to > > all processes in the cgroup - I'm not sure there's a guarantee it > > will SIGKILL the main qemu before it SIGKILLs this helper > > I'm afraid in that case there is no guarantee. > > for what it's worth, both virsh shutdown and destroy seem to do things > properly. Hmm, probably because libvirt tells QEMU to exit before systemd comes along and tells everything in the cgroup to die with SIGKILL. With regards, Daniel -- |: https://berrange.com -o- https://www.flickr.com/photos/dberrange :| |: https://libvirt.org -o- https://fstop138.berrange.com :| |: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|