Hi all,
I recently encountered an issue while using s6-svc to terminate a service
during a restart procedure. I would like to ask for your help in case I have
misconceptions about the intended s6 usage or if you have suggestions on how to
resolve the issue.
The service is a worker node in a cluster of compute nodes. It runs in a Docker
container with s6 as the init process.
Occasionally the node reaches an invalid state, for example when it can no
longer connect to the rest of the cluster or when the computation runs out of
memory. We use a Python script to poll for these invalid states and call the
restart procedure.
The body of the restart procedure looks like this (error handling is omitted):
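    from os import system

    def restart_daemon():
        # stop the daemon gracefully; wait up to 180000 ms for it to go down
        system("/opt/s6/bin/s6-svc -T 180000 -wd -d "
               "/projectname/service/servicename")
        # kill the daemon in case it is still running
        system("/opt/s6/bin/s6-svc -k /projectname/service/servicename")
        # start the daemon again; wait up to 180000 ms for it to come up
        system("/opt/s6/bin/s6-svc -T 180000 -wu -u "
               "/projectname/service/servicename")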
The idea is to give the daemon a chance to shut down gracefully (potentially
writing out some logs and logging out of the cluster) before killing it
unconditionally. After the kill, a new daemon is started. This usually works
fine.
We encountered an issue where the daemon would repeatedly try to restart but
wasn’t able to. It turned out the daemon had left behind a stray child process
which was interfering with its startup. This is rare but not unheard of.
I upgraded s6 to the most recent version, 2.13.2.0, which supports the “-K”
option (kill the whole process group), and changed the corresponding “-k” line
in the function to use it.
To test the change, I replaced the daemon with a process that does nothing but
spawn a child process, which in turn does nothing except sleep endlessly.
However, when I executed the s6-svc commands in sequence, the daemon’s child
was not killed. It survived, and every restart created a new one, which shows
that this change does not resolve the issue.
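A stand-in daemon of this kind can be sketched in Python as follows. This is
only an illustration of the test setup described above, not the exact program
that was used; the function names are hypothetical.

```python
import signal
import subprocess
import sys


def spawn_sleeping_child():
    """Start a child that sleeps forever and return its Popen handle."""
    # By default the child stays in the same process group as its parent.
    code = "import time\nwhile True: time.sleep(3600)"
    return subprocess.Popen([sys.executable, "-c", code])


def run_test_daemon():
    """Spawn the child, then idle until SIGTERM arrives."""
    spawn_sleeping_child()
    # Exit on SIGTERM without cleaning up, so the child is left behind.
    signal.signal(signal.SIGTERM, lambda signum, frame: sys.exit(0))
    while True:
        signal.pause()
```

Calling run_test_daemon() from the service's run script gives a daemon that
shuts down "gracefully" on SIGTERM while leaving its child running.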
I haven’t looked into the s6 code yet, but I am guessing the reason is this:
if the daemon manages to shut down gracefully (which is normally the case),
then by the time the process group is ordered to be killed, s6 no longer
remembers the process group that the service used to have.
I considered storing the process group of the daemon in a file somewhere and
sending the signal from the monitoring script, but that comes with an array of
problems that s6 was designed to prevent in the first place. I believe the
issue is that s6-svc does not come with an option that combines the graceful
and graceless kill commands. Ideally we could send SIGTERM with a timeout and
then send SIGKILL to the whole process group in one operation, so that the
process group is not forgotten. Do you have any suggestions?
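To make the request concrete, the desired behaviour can be sketched in plain
Python, outside of s6. This is an illustration under assumptions, not an s6
feature: stop_process_group is a hypothetical helper, and it assumes the
daemon was started as the leader of its own process group, so that its pid
doubles as the process group id.

```python
import os
import signal
import subprocess


def stop_process_group(proc, timeout_s):
    """SIGTERM the leader, wait up to timeout_s, then SIGKILL its group.

    Assumes proc (a subprocess.Popen) leads its own process group.
    """
    pgid = proc.pid  # remembered before the leader has a chance to exit
    proc.send_signal(signal.SIGTERM)
    try:
        proc.wait(timeout=timeout_s)
    except subprocess.TimeoutExpired:
        pass  # leader did not exit in time; the SIGKILL below covers it
    try:
        # Kill whatever is left in the group, including stray children.
        os.killpg(pgid, signal.SIGKILL)
    except ProcessLookupError:
        pass  # the group is already empty
    proc.wait()
```

Because the group id is captured before SIGTERM is sent, the stray child can
still be reached after the leader has exited. Run from a monitoring script,
this would still have the id-reuse problems mentioned above, which is why an
equivalent inside the supervisor would be preferable.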
Thanks,
Tom