Re: Race in reboot/poweroff path at init?
Hi all, We put together a test to try out the various suggestions on double-forking processes to see if polling for a zombie process being reaped by INIT would help provide a method to synchronize the ready state of the signal handlers for reboot and poweroff paths in Busybox INIT during early IPL. The usermode helper that kicks Busybox /sbin/poweroff (started by PID=2) is very early, even before Busybox INIT is do_execve'd from kernel_init. We found that any process zombies in this early stage of kernel_init get flushed when the do_execve is done for Busybox /sbin/init. Specifically we traced the program flow to proc_flush_task_mnt from linux/fs/proc/base.c. We tested out the following patch and the race is fixed using the abstract socket. >From 1a7ac3aba3e94bd04de48e9619647cca853dca06 Mon Sep 17 00:00:00 2001 From: Deb McLemoreDate: Wed, 25 Oct 2017 08:06:20 -0500 Subject: [PATCH] init: Add handshake to poweroff/reboot for signal handler setup Add an abstract socket to synchronize the readiness of init to receive the SIGUSR2 to catch poweroff/reboot during an IPL phase (e.g. soft power off via BMC). Signed-off-by: Deb McLemore --- init/halt.c | 37 + init/init.c | 23 +++ 2 files changed, 60 insertions(+) diff --git a/init/halt.c b/init/halt.c index c6c857f..510c4d2 100644 --- a/init/halt.c +++ b/init/halt.c @@ -83,6 +83,10 @@ #include "libbb.h" #include "reboot.h" +#ifdef __linux__ +#include +#endif + #if ENABLE_FEATURE_WTMP #include @@ -112,6 +116,12 @@ static void write_wtmp(void) int halt_main(int argc, char **argv) MAIN_EXTERNALLY_VISIBLE; int halt_main(int argc UNUSED_PARAM, char **argv) { + +#ifdef __linux__ + struct sockaddr_un bb_addr2; + int fdBB2 = 0; + int fdrc = 0; +#endif static const int magic[] = { RB_HALT_SYSTEM, RB_POWER_OFF, @@ -170,6 +180,33 @@ int halt_main(int argc UNUSED_PARAM, char **argv) /* talk to init */ if (!ENABLE_FEATURE_CALL_TELINIT) { /* bbox init assumed */ + + /* Use socket connect to INIT to assure that + * signal handlers are ready + */ + #ifdef __linux__ + memset(_addr2, + 0, + sizeof(struct sockaddr_un)); + bb_addr2.sun_family = AF_UNIX; + strcpy(bb_addr2.sun_path, + "\0bb_signals"); + while (1) { + fdBB2 = socket(AF_UNIX, + SOCK_CLOEXEC | SOCK_DGRAM, 0); + if (fdBB2 != -1) + break; + sleep(1); + } + while (1) { + fdrc = connect(fdBB2, + (struct sockaddr *)_addr2, + sizeof(struct sockaddr_un)); + if (fdrc == 0) + break; + sleep(1); + } + #endif rc = kill(1, signals[which]); } else { /* SysV style init assumed */ diff --git a/init/init.c b/init/init.c index 6f3374e..3308a6e 100644 --- a/init/init.c +++ b/init/init.c @@ -133,6 +133,7 @@ #ifdef __linux__ # include # include +# include #endif #include "reboot.h" /* reboot() constants */ @@ -1046,6 +1047,12 @@ static void sleep_much(void) int init_main(int argc, char **argv) MAIN_EXTERNALLY_VISIBLE; int init_main(int argc UNUSED_PARAM, char **argv) { + +#ifdef __linux__ + struct sockaddr_un bb_addr; + int fdBB = 0; +#endif + if (argv[1] && strcmp(argv[1], "-q") == 0) { return kill(1, SIGHUP); } @@ -1191,6 +1198,22 @@ int init_main(int argc UNUSED_PARAM, char **argv) sigprocmask_allsigs(SIG_UNBLOCK); } +#ifdef __linux__ + memset(_addr, 0, sizeof(struct sockaddr_un)); + bb_addr.sun_family = AF_UNIX; + strcpy(bb_addr.sun_path, "\0bb_signals"); + /* Create an abstract socket to use for synchronizing poweroff and + * reboot which could be kicked at IPL before the INIT process + * completes signal setup to listen for the signals + * Ignore failures and keep on, this just helps + * close the exposed window when nothing will get signaled + * The abstract socket needs to stay persistent, so no cleanup + */ + fdBB = socket(AF_UNIX, SOCK_CLOEXEC | SOCK_DGRAM, 0); + if (fdBB != -1) + bind(fdBB, (struct sockaddr *)_addr, sizeof(bb_addr)); +#endif + /* Now run everything that needs to be run */ /* First run the sysinit command */ run_actions(SYSINIT); -- 2.7.4 ___ busybox mailing list busybox@busybox.net http://lists.busybox.net/mailman/listinfo/busybox
Re: Race in reboot/poweroff path at init?
Hi Denys, >> - installing the signal handlers even earlier in init_main(). However, >>this will only reduce the window for lost events, rather than >>eliminating it; or > > Sure, this should be done. How about this: > --- a/init/init.c > +++ b/init/init.c > @@ -1064,6 +1064,12 @@ int init_main(int argc UNUSED_PARAM, char **argv) > #endif > > if (!DEBUG_INIT) { > +/* Some users send poweroff signals to init VERY early. > + * To handle this, mask signals early, > + * and unmask them only after signal handlers are installed. > + */ > +sigprocmask_allsigs(SIG_BLOCK); > + > /* Expect to be invoked as init with PID=1 or be invoked as linuxrc > */ > if (getpid() != 1 > && (!ENABLE_LINUXRC || applet_name[0] != 'l') /* not linuxrc? */ > @@ -1204,6 +1187,8 @@ int init_main(int argc UNUSED_PARAM, char **argv) > + (1 << SIGHUP) /* reread /etc/inittab */ > #endif > , record_signo); > + > +sigprocmask_allsigs(SIG_UNBLOCK); > } > > /* Now run everything that needs to be run */ > > This covers code which opens and parses /etc/inittab, > which can be slow (if storage is slow), and can make > race realistic in real world. > > Can you test whether this change makes the race go away in your case? This looks good, but it's not going to completely fix the race - in some cases, we've seen the 'poweroff' binary started before init. >> - using a synchronous channel to send the shutdown/reboot message >>between the poweroff/reboot helpers, rather than an asynchronous >>signal. Say, have init listening on a socket, allowing the poweroff >>binary to wait and/or retry. >> >> However, before I go down the wrong path here: does anyone have other >> ideas that might help eliminating dropped poweroff/reboot events? > > The test that processes are being reaped is a good idea. OK, sounds like something we should experiment with. Thanks for the input! Cheers, Jeremy ___ busybox mailing list busybox@busybox.net http://lists.busybox.net/mailman/listinfo/busybox
Re: Race in reboot/poweroff path at init?
On Tue, Oct 10, 2017 at 10:52 AM, Jeremy Kerrwrote: > Hi all, > > I've been debugging an issue where we can't reboot or poweroff a machine > in the early stages of busybox init. Using the poweroff case as an > example: > > - kernel starts /sbin/init > > - kernel receives a poweroff event, so calls __orderly_poweroff. >Effectively, these will just call out to the /sbin/poweroff usermode >helper. > > - /sbin/poweroff just does a: > > kill(1, SIGUSR2); > > - However, /sbin/init has not yet installed a signal handler for >SIGUSR2. Because we're PID 1, this means the signal is ignored, and >so the command to poweroff the machine is dropped. > > - init keeps booting rather than powering off. > > In our particular case, the "poweroff event" is an IPMI soft shutdown > message. However, the same would apply for any other path that involves > orderly_poweroff or orderly_reboot. > > Even though the signal handlers are installed fairly early in init, we > can still hit the race between this and the SIGUSR2 being sent fairly > reliably. > > I see a couple of options for resolving this: > > - installing the signal handlers even earlier in init_main(). However, >this will only reduce the window for lost events, rather than >eliminating it; or Sure, this should be done. How about this: --- a/init/init.c +++ b/init/init.c @@ -1064,6 +1064,12 @@ int init_main(int argc UNUSED_PARAM, char **argv) #endif if (!DEBUG_INIT) { +/* Some users send poweroff signals to init VERY early. + * To handle this, mask signals early, + * and unmask them only after signal handlers are installed. + */ +sigprocmask_allsigs(SIG_BLOCK); + /* Expect to be invoked as init with PID=1 or be invoked as linuxrc */ if (getpid() != 1 && (!ENABLE_LINUXRC || applet_name[0] != 'l') /* not linuxrc? */ @@ -1204,6 +1187,8 @@ int init_main(int argc UNUSED_PARAM, char **argv) + (1 << SIGHUP) /* reread /etc/inittab */ #endif , record_signo); + +sigprocmask_allsigs(SIG_UNBLOCK); } /* Now run everything that needs to be run */ This covers code which opens and parses /etc/inittab, which can be slow (if storage is slow), and can make race realistic in real world. Can you test whether this change makes the race go away in your case? > - using a synchronous channel to send the shutdown/reboot message >between the poweroff/reboot helpers, rather than an asynchronous >signal. Say, have init listening on a socket, allowing the poweroff >binary to wait and/or retry. > > However, before I go down the wrong path here: does anyone have other > ideas that might help eliminating dropped poweroff/reboot events? The test that processes are being reaped is a good idea. ___ busybox mailing list busybox@busybox.net http://lists.busybox.net/mailman/listinfo/busybox
Re: Race in reboot/poweroff path at init?
Laurent Bercot wrote: That is true, I hadn't thought of abstract sockets, and it would work. However, changing the way poweroff signals init is a big change, and in particular, making init listen to a socket is very significant: you know need to multiplex reaping zombies with listening to connections. It's not hard, but it requires patching init significantly, reworking its control flow around an asynchronous event loop; and at that point you're just better off using a better pid 1 process (such as s6-svscan, which already does this - have you considered switching inits? ;)). And changes of that magnitude are pretty dangerous - I'm quite convinced there are some people relying on the "send a signal to init" mechanism and whose systems would break if that mechanism were changed. The "send a signal to init" mechanism is a good one, and does not fundamentally need to be changed; your problem is a readiness detection one, i.e. whatever mechanism you're using to tell init to shutdown, you can't use it before init is ready to handle it. And that's where abstract sockets can be useful: you can use abstract sockets as a simple synchronization mechanism. The easiest way to do it would be: when init has installed its signal handlers, it creates an abstract socket (close-on-exec, of course) and binds it - but does not listen to it and does nothing else with it). And poweroff checks for the presence of this socket by repeatedly attempting connect(). When errno changes from ENOENT to ECONNREFUSED, it means the socket has been created and init is now ready to receive signals. This requires minimal change to init and a small polling addition to poweroff - which I think is much better and safer than a heavy change to busybox's init. I think your idea with wait is better. It doesn't require any additional code to the init process, it is likely more portable, it doesn't consume memory for an open socket and the changes to poweroff are also smaller. I would modify it so that it doesn't check for the presence of the grandchild, as the grandchild may be started from a process that does not do wait. Just check whether the parent PID is 1 or not. Basically: if ((pid1 = fork ()) > 0) exit (0); if ((pid2 = fork ()) > 0) exit (0); while (pid1 > 0 && pid2 > 0 && getppid () != 1) // only wait if both fork calls were successful sleep (1); If fork fails, it should signal init from the parent process. If there are not enough resources to create a new process, we can safely assume that init is ready to receive a signal. ___ busybox mailing list busybox@busybox.net http://lists.busybox.net/mailman/listinfo/busybox
Re: Race in reboot/poweroff path at init?
That's OK, as the helper (/sbin/poweroff) has the opportunity to retry if the connect() fails (because init hasn't established the listening socket yet). The main difference is that the sender can detect failure, and retry if necessary. AF_UNIX sockets in the abstract namespace don't require a path bound to the filesystem, so perhaps they would be available early enough - or have I missed something there? That is true, I hadn't thought of abstract sockets, and it would work. However, changing the way poweroff signals init is a big change, and in particular, making init listen to a socket is very significant: you know need to multiplex reaping zombies with listening to connections. It's not hard, but it requires patching init significantly, reworking its control flow around an asynchronous event loop; and at that point you're just better off using a better pid 1 process (such as s6-svscan, which already does this - have you considered switching inits? ;)). And changes of that magnitude are pretty dangerous - I'm quite convinced there are some people relying on the "send a signal to init" mechanism and whose systems would break if that mechanism were changed. The "send a signal to init" mechanism is a good one, and does not fundamentally need to be changed; your problem is a readiness detection one, i.e. whatever mechanism you're using to tell init to shutdown, you can't use it before init is ready to handle it. And that's where abstract sockets can be useful: you can use abstract sockets as a simple synchronization mechanism. The easiest way to do it would be: when init has installed its signal handlers, it creates an abstract socket (close-on-exec, of course) and binds it - but does not listen to it and does nothing else with it). And poweroff checks for the presence of this socket by repeatedly attempting connect(). When errno changes from ENOENT to ECONNREFUSED, it means the socket has been created and init is now ready to receive signals. This requires minimal change to init and a small polling addition to poweroff - which I think is much better and safer than a heavy change to busybox's init. -- Laurent ___ busybox mailing list busybox@busybox.net http://lists.busybox.net/mailman/listinfo/busybox
Re: Race in reboot/poweroff path at init?
There's the sigqueue() mechanism out there. From the man page, it seems it's essentially dedicated to send data together with the signal, but it also has a queueing mechanism implemented in the kernel. Wether this allows the message to be kept in the queue until the destination process unmasks it, this isn't written explicitely in the man, but maybe somebody knows it. Anyway your case is a perfect test bench. From what I understand from the POSIX page for sigqueue(), it only makes a difference if the receiver has already installed a signal handler with SA_SIGINFO. If the receiver hasn't installed a signal handler yet, the signal just gets delivered as is. So it's not a mechanism that can be used to defer signals until the receiver has installed a signal handler. -- Laurent ___ busybox mailing list busybox@busybox.net http://lists.busybox.net/mailman/listinfo/busybox
Re: Race in reboot/poweroff path at init?
Le 11/10/2017 à 04:43, Jeremy Kerr a écrit : Hi Laurent, Thanks for the reply, good to get some conversation going here! - using a synchronous channel to send the shutdown/reboot message between the poweroff/reboot helpers, rather than an asynchronous signal. Say, have init listening on a socket, allowing the poweroff binary to wait and/or retry. That would not work either: you could receive the event before init starts listening to the socket. That's OK, as the helper (/sbin/poweroff) has the opportunity to retry if the connect() fails (because init hasn't established the listening socket yet). The main difference is that the sender can detect failure, and retry if necessary. AF_UNIX sockets in the abstract namespace don't require a path bound to the filesystem, so perhaps they would be available early enough - or have I missed something there? [Non-Linux platforms may not support the same abstract namespace, so I'd need to implement a fallback there, but I don't (yet) know if this same race is relevant on those platforms...] I'd rather not just wear the race, as that means we've missed shutdown events, which is quite user-visible. The signal-after-reaped-grandchild approach seems okay too, if other methods aren't workable. Or even Tito's suggestion of a repeated signal, which has the advantage of a minimal change to code. There's the sigqueue() mechanism out there. From the man page, it seems it's essentially dedicated to send data together with the signal, but it also has a queueing mechanism implemented in the kernel. Wether this allows the message to be kept in the queue until the destination process unmasks it, this isn't written explicitely in the man, but maybe somebody knows it. Anyway your case is a perfect test bench. Didier ___ busybox mailing list busybox@busybox.net http://lists.busybox.net/mailman/listinfo/busybox
Re: Race in reboot/poweroff path at init?
Hi Laurent, Thanks for the reply, good to get some conversation going here! >> - using a synchronous channel to send the shutdown/reboot message >> between the poweroff/reboot helpers, rather than an asynchronous >> signal. Say, have init listening on a socket, allowing the poweroff >> binary to wait and/or retry. > > That would not work either: you could receive the event before init > starts listening to the socket. That's OK, as the helper (/sbin/poweroff) has the opportunity to retry if the connect() fails (because init hasn't established the listening socket yet). The main difference is that the sender can detect failure, and retry if necessary. AF_UNIX sockets in the abstract namespace don't require a path bound to the filesystem, so perhaps they would be available early enough - or have I missed something there? [Non-Linux platforms may not support the same abstract namespace, so I'd need to implement a fallback there, but I don't (yet) know if this same race is relevant on those platforms...] I'd rather not just wear the race, as that means we've missed shutdown events, which is quite user-visible. The signal-after-reaped-grandchild approach seems okay too, if other methods aren't workable. Or even Tito's suggestion of a repeated signal, which has the advantage of a minimal change to code. Cheers, Jeremy ___ busybox mailing list busybox@busybox.net http://lists.busybox.net/mailman/listinfo/busybox
Re: Race in reboot/poweroff path at init?
On October 10, 2017 4:52:17 PM GMT+08:00, Jeremy Kerrwrote: >Hi all, > >I've been debugging an issue where we can't reboot or poweroff a >machine >in the early stages of busybox init. Using the poweroff case as an >example: > > - kernel starts /sbin/init > > - kernel receives a poweroff event, so calls __orderly_poweroff. > Effectively, these will just call out to the /sbin/poweroff usermode > helper. > > - /sbin/poweroff just does a: > > kill(1, SIGUSR2); > > - However, /sbin/init has not yet installed a signal handler for > SIGUSR2. Because we're PID 1, this means the signal is ignored, and > so the command to poweroff the machine is dropped. > > - init keeps booting rather than powering off. > >In our particular case, the "poweroff event" is an IPMI soft shutdown >message. However, the same would apply for any other path that involves >orderly_poweroff or orderly_reboot. > >Even though the signal handlers are installed fairly early in init, we >can still hit the race between this and the SIGUSR2 being sent fairly >reliably. > >I see a couple of options for resolving this: > > - installing the signal handlers even earlier in init_main(). However, > this will only reduce the window for lost events, rather than > eliminating it; or > > - using a synchronous channel to send the shutdown/reboot message > between the poweroff/reboot helpers, rather than an asynchronous > signal. Say, have init listening on a socket, allowing the poweroff > binary to wait and/or retry. > >However, before I go down the wrong path here: does anyone have other >ideas that might help eliminating dropped poweroff/reboot events? > >Regards, > > >Jeremy >___ >busybox mailing list >busybox@busybox.net >http://lists.busybox.net/mailman/listinfo/busybox Hi, just a silly idea: would running the poweroff or reboot signal in a loop do any harm or eventually be somehow an improvement? Just my 0,2 cents from vacation after a couple of beers. Ciao, Tito ___ busybox mailing list busybox@busybox.net http://lists.busybox.net/mailman/listinfo/busybox
Re: Race in reboot/poweroff path at init?
- using a synchronous channel to send the shutdown/reboot message between the poweroff/reboot helpers, rather than an asynchronous signal. Say, have init listening on a socket, allowing the poweroff binary to wait and/or retry. That would not work either: you could receive the event before init starts listening to the socket. There will always be a window where init can't receive events. The kernel starts it barebones, with no channel of communication to other processes; if an event arrives before it starts establishing these channels, you're out of luck. The best you can do is make the window as small as possible. If you need to be 100% safe, then you need to somehow queue the events before init starts processing them. But that's tricky, because it's extremely early - you have nothing but the kernel and the /sbin/poweroff process' memory. You don't even have the guarantee that you can write to a filesystem: you only have the rootfs and it may be read-only. You don't even have a tmpfs yet. You can't be certain you have a devtmpfs mounted. You don't have /dev/shm. You don't have /proc. So it's a matter of finding a way to queue events that don't involve writing to the filesystem at all. That severely restricts your options: for instance, POSIX message queues sound like a perfect fit, but Linux implements them via a virtual filesystem that needs to be mounted first, so it's a no-go. Signals are actually pretty good: all they require is that init has installed a handler, which can be done early. The only issue is that you can't queue them. What I would do is add a check to /sbin/poweroff that init has progressed to a point where its signal handlers are installed, and if it's not there yet, poll until it is (i.e. sleep and retry). What check to use? well at this point it's very hackish. The only thing I can think of that doesn't depend on the contents of /etc/inittab is that when init reaps zombies, we know it has its signal handlers installed. So... I would have poweroff doublefork a process (have the child communicate the pid of the grandchild before dying), the grandchild dies - at this point it's a zombie waiting for init to reap it - and poweroff repeatedly hits the grandchild with kill(), using signal 0 just to be safe. When kill() fails with ESRCH, it means the zombie has disappeared and init is now ready to accept signals. It's really ugly, but it's the best I can come up with that makes no unsafe assumptions. Whether implementing that in /sbin/poweroff is better than simply eating the race condition... that's your call. -- Laurent ___ busybox mailing list busybox@busybox.net http://lists.busybox.net/mailman/listinfo/busybox
Race in reboot/poweroff path at init?
Hi all, I've been debugging an issue where we can't reboot or poweroff a machine in the early stages of busybox init. Using the poweroff case as an example: - kernel starts /sbin/init - kernel receives a poweroff event, so calls __orderly_poweroff. Effectively, these will just call out to the /sbin/poweroff usermode helper. - /sbin/poweroff just does a: kill(1, SIGUSR2); - However, /sbin/init has not yet installed a signal handler for SIGUSR2. Because we're PID 1, this means the signal is ignored, and so the command to poweroff the machine is dropped. - init keeps booting rather than powering off. In our particular case, the "poweroff event" is an IPMI soft shutdown message. However, the same would apply for any other path that involves orderly_poweroff or orderly_reboot. Even though the signal handlers are installed fairly early in init, we can still hit the race between this and the SIGUSR2 being sent fairly reliably. I see a couple of options for resolving this: - installing the signal handlers even earlier in init_main(). However, this will only reduce the window for lost events, rather than eliminating it; or - using a synchronous channel to send the shutdown/reboot message between the poweroff/reboot helpers, rather than an asynchronous signal. Say, have init listening on a socket, allowing the poweroff binary to wait and/or retry. However, before I go down the wrong path here: does anyone have other ideas that might help eliminating dropped poweroff/reboot events? Regards, Jeremy ___ busybox mailing list busybox@busybox.net http://lists.busybox.net/mailman/listinfo/busybox