I have been trying to figure out how to handle failures of sub-supervisors in nested supervision trees. Right now, it seems that if a sub-supervisor (like s6-supervise) dies, its supervisor (like s6-svscan) will respawn it, but the respawned s6-supervise won't know about the job it was supposed to supervise. This means that it must either risk spawning a second instance of the job or never restart it, neither of which is good.
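To make the failure mode concrete, here is a minimal Python sketch. It is an illustration only, not s6 code: the function name and the stand-in "sub-supervisor" and "job" processes are hypothetical, and it assumes a POSIX system where the calling process is not a child subreaper. It shows that once the sub-supervisor dies, the job keeps running but nothing in the remaining tree can wait on it.

```python
import os
import signal
import time


def lose_track_of_job():
    """Fork a stand-in sub-supervisor that starts a job and then dies.

    Returns (job_alive, job_waitable) as seen by the top-level parent.
    """
    read_fd, write_fd = os.pipe()
    supervisor_pid = os.fork()
    if supervisor_pid == 0:
        # Stand-in for the sub-supervisor (hypothetical, not s6-supervise).
        os.close(read_fd)
        job_pid = os.fork()
        if job_pid == 0:
            # Stand-in for the supervised job: it just stays alive.
            os.close(write_fd)
            time.sleep(30)
            os._exit(0)
        os.write(write_fd, str(job_pid).encode())
        os._exit(1)  # simulate the sub-supervisor crashing

    os.close(write_fd)
    job_pid = int(os.read(read_fd, 32).decode())
    os.close(read_fd)
    os.waitpid(supervisor_pid, 0)  # the parent notices the crash

    # Signal 0 only checks that the process still exists.
    try:
        os.kill(job_pid, 0)
        job_alive = True
    except ProcessLookupError:
        job_alive = False

    # The job was never our child (assuming we are not a subreaper),
    # so waiting on it fails with ECHILD.
    try:
        os.waitpid(job_pid, os.WNOHANG)
        job_waitable = True
    except ChildProcessError:
        job_waitable = False

    if job_alive:
        os.kill(job_pid, signal.SIGKILL)  # clean up the orphan
    return job_alive, job_waitable


if __name__ == "__main__":
    print(lose_track_of_job())
```

A freshly respawned sub-supervisor is in the same position as the parent here: it never forked the job, so it cannot wait on it, and it only knows the job's PID if that PID was recorded somewhere outside the dead process.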
One workaround is for the sub-supervisor and the process it supervises to share a process group. The sub-supervisor and its parent can both send signals to the entire group, and can wait on child processes in that group to finish. The parent can kill the entire process group if the sub-supervisor dies, and wait for all the processes in it to exit before respawning the sub-supervisor. This does mean that the child will wind up sending itself a SIGKILL if it is done with its job. However, this can actually be okay. The parent can re-spawn the sub-supervisor after waiting for all of its children.

Unfortunately, this does not work for nested supervision trees. I did figure out a very ugly workaround, but it requires support from init or (possibly) the (Linux-specific) prctl(PR_SET_CHILD_SUBREAPER).

How do other projects handle this? The best solution I can think of is to use control groups, which are Linux-specific but are a perfect fit for the job. Non-Linux systems don't allow replacing init and don't provide prctl(PR_SET_CHILD_SUBREAPER), so the trick I came up with doesn't work anyway.

-- 
Sincerely,
Demi Marie Obenour (she/her/hers)
