On 10/3/23 23:43, Colin Cross wrote: > I caught it running in strace. There's some differences in signal > setup in the forked child, the child exits with 126 in both the good > and bad cases, but the parent exits with 127 after wait4 returns > ECHILD in the bad case.
Beat me to it. (Serves me right for being AFK for a few hours...) > The bad case, child exits with 126, parent exits with 127: > rt_sigaction(SIGCHLD, {sa_handler=0x5fda0c26e4, sa_mask=[], > sa_flags=SA_NOCLDSTOP}, NULL, 8) = 0 I thought the child would remain a zombie if the parent did this. Instead it exits the rest of the way. > clone(child_stack=NULL, > flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, > child_tidptr=0x7eadf59190) = 9378 > strace: Process 9378 attached > [pid 9378] prctl(PR_SET_VMA, PR_SET_VMA_ANON_NAME, 0x7eacc52000, > 16384, "stack_and_tls:main") = 0 > [pid 9378] rt_sigaction(SIGCHLD, {sa_handler=SIG_DFL, sa_mask=[], > sa_flags=0}, NULL, 8) = 0 > [pid 9378] setpgid(0, 0) = 0 > [pid 9378] execve("/", ["/"], 0x7feb7bad98 /* 26 vars */) = -1 EACCES > (Permission denied) > [pid 9378] write(2, "timeout: ", 9timeout: ) = 9 > [pid 9378] write(2, "exec /", 6exec /) = 6 > [pid 9378] write(2, ": Permission denied", 19: Permission denied) = 19 > [pid 9378] write(2, "\n", 1 > ) = 1 > [pid 9378] exit_group(126) = ? > [pid 9378] +++ exited with 126 +++ > --- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=9378, > si_uid=2000, si_status=126, si_utime=0, si_stime=0} --- Sigchild was delivered, but SA_NOCLDSTOP is apparently equivalent to SIG_IGNORE? > rt_sigprocmask(SIG_SETMASK, [RTMIN], NULL, 8) = 0 > rt_sigaction(SIGCHLD, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=0}, NULL, 8) > = 0 > wait4(0, 0xe000007feb7baae0, 0, NULL) = -1 ECHILD (No child processes) And then we re-enable SIGCHLD and call wait() but the child exited, and thus it went boing. I need to resequence the signal handling. I remember it got tangled up for -i support. Did I blog about this? Commit ee00e1be0f5b was July 20, 2022... Yup: <p>The reason the sed s///x test hangs with "make tests" but not "make test_sed" is the toybox timeout command is used for the first, and the debian timeout is used out of the $PATH for the second. Which is why I didn't notice it when I checked in the timeout -i support, because I was testing individual commands and "make tests" tests the toybox commands in combination, so finds more weird corner cases.</p> <p>The problem here is a race condition: if SIGCHLD from the child comes in before timeout is ready for it, we miss it and then the full timeout elapses before the command moves on.</p> <p>I set up the SIGCHLD handler AFTER the call to xpopen() because I didn't want the child process to inherit weird signal handler shenanigans. I'm a bit gun-shy there after the <a href=notes-2011.html#24-08-2011>long</a> <a href=notes-2011.html#02-09-2011>ago</a> <a href=notes-2011.html#03-09-2011>multi</a> <a href=notes-2011.html#04-09-2011>day</a> <a href=notes-2011.html#05-09-2011>saga</a> of trying to debug a blocked SIGALRM inherited from PID 1 due to an old bash bug, although that was the child inheriting the signal _mask_ not the signal handlers, so I don't ACTUALLY have to use the xpopen_setup() child callback to reset SIGCHLD (since the child process can't have any children of its own before calling exec). And yes of course I have to set the signal handler before forking the child to avoid a race window with the child exiting. (Even without SMP, it could get scheduled first.)</p> <p>Darn it, there's still a race condition after that's moved, because the SIGCHLD can come in before we record the fd we're reading from into the poll structure, so the signal handler can't close the fd, and there's no other way to tell poll() to exit early EXCEPT closing the fd? Trying NOT to pass the fd into poll always has at least the race of making the poll() function call (copying the fd into the function's local variable as an argument value) and THEN having the signal happen before the actual poll system call starts.</p> <p>I cut the gordian knot with siglongjmp() from the SIGCHLD handler so the loop _doesn't_ exit and just goes "receiving SIGCHLD means timeout can exit" because we only spawn one child process; if that has grandkids those would reparent to PID 1 instead of us so getting the signal means we can exit now. BUT while this is good enough for a standalone timeout, it's not good enough for a bash builtin that doesn't fork: xpopen_both() can create a pipe but not return it yet before the signal handler goes, leaking filehandles. I may have to manually pipe() and VFORK() myself instead of calling letting xpopen() do it, because pipe(TT.pipes) is atomic: either the pipes got written where the signal handler can see them, or they weren't created yet. The kernel either backs out any halfway states before returning to userspace, or else doesn't check for signals until the system call is about to return to userspace anyway.</p> <p>Anyway, that should fix up the "timeout 10 does_not_exit" problem Elliott spotted, and also the test suite slowdown I noticed.</p> Rob _______________________________________________ Toybox mailing list Toybox@lists.landley.net http://lists.landley.net/listinfo.cgi/toybox-landley.net