Re: s6-rc transition failures

Laurent Bercot Thu, 15 Jun 2017 12:43:02 -0700

I have questions about how to handle transition failures
correctly with s6-rc. The new permanent failure feature
already clarifies some scenarios, but I still have doubts about
some cases. Below are two concrete examples. I would
be happy to have remarks or suggestions about how to handle
them cleanly :).


 First, thank you for your mail. This is exactly the kind of feedback
that I'm looking for regarding s6-rc: identifying pain points and
usability issues.

1. I start a longrun service with "s6-rc -u change svc". This
service hangs and never reaches readiness notification. After
timeout s6-rc will declare the transition a failure. But the process
is actually running and I have no way to stop it through s6-rc.
The only way is to issue "s6-svc -d /path/to/svc". But then I have
the feeling I am doing something behind s6-rc's back to unblock
the situation because s6-rc cannot handle it.


 That is a fair point. Normally, you should adjust the s6-rc
timeouts (both the global one and the service-specific one) to
make sure s6-rc does *not* time out before the service is ready -
but if there's an unexpected significant delay, the situation can
happen.

 In general it does not matter that s6-rc is unaware that a service
is up: when s6-rc reports a (temporary) transition failure, the
expected user action is to run the command again. s6-rc then picks
up the correct service states during its second execution.
 But if you're running an s6-rc -d change operation right after a
transition failure, it is true that states could become inconsistent.

 What I can do is add an option to s6-rc to make it explicitly send
an s6-svc -d to a service that times out before reaching readiness,
ensuring that a service is either ready in time, or definitely down.
Would that help?
 The annoying thing is it can't be symmetrical: when a down
transition times out, there's no way I'm going to start the service
again. :) But generally, a down transition timing out signifies a
badly written finish script, or badly calibrated timeouts, and
it can be easily solved by running s6-rc -d change again.
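
 Until such an option exists, the "ready in time, or definitely down"
behavior can be approximated from the outside. The following is a
minimal sketch, not something s6-rc provides: up_or_down, S6_RC and
S6_SVC are hypothetical names, and the service directory argument
stands for the "/path/to/svc" of the question. It only uses the two
commands already mentioned in this thread.

```shell
#!/bin/sh
# Sketch only: emulates "ready in time, or definitely down" from outside s6-rc.
# S6_RC and S6_SVC are hypothetical override variables, used here so the
# commands can be replaced with stubs when testing.
S6_RC=${S6_RC:-s6-rc}
S6_SVC=${S6_SVC:-s6-svc}

# up_or_down NAME SERVICEDIR
# NAME is the s6-rc service name; SERVICEDIR is the supervised service
# directory (the "/path/to/svc" of the original question).
up_or_down() {
  name=$1
  servicedir=$2
  if "$S6_RC" -u change "$name"; then
    return 0    # transition succeeded: the service is up and ready
  fi
  # Transition failed (for example on timeout): the process may still be
  # running, so ask the supervisor to bring it down. This only sends the
  # request; it does not wait for the process to exit.
  "$S6_SVC" -d "$servicedir"
  return 1
}
```

 It would be called as "up_or_down svc /path/to/svc". After a failed
up transition s6-rc still considers the service down, so bringing the
process down as well should keep both views in agreement.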

2. Slightly related, I have an issue with system shutdown. I am
working on a buildroot system and specifically I use the
/etc/rc.tini which can be found here [1] and which is executed
as part of the shutdown sequence of the system. The problem
is with the invocation of "s6-rc -b -da change" (I added the -b).
If there is already an s6-rc ongoing, the shutdown sequence will
be blocked until the first s6-rc times out. And this kind of timeout
is of the order of minutes as I have slow services depending
on each other. I currently think the best thing to do is to
"killall s6-rc" before calling "s6-rc -ad change".


 Yes. Since the state is global, it makes sense to refuse to start
a state change while another one is taking place, unless you're willing
to abort the ongoing operation by explicitly killing the running
s6-rc process.

 This leaves a little
race condition possible, but more importantly, I have concerns
about killing an ongoing s6-rc. This will leave longrun services
in the middle of a state transition - there is the connection with
the first scenario - and I expect the final effect is that the
finish script will not be executed before the system goes down,
whereas running it is precisely what I want to happen when I call
"s6-rc -ad change". Secondly, I do not know what effect this will
have on oneshots. I fear "/etc/init.d/S98xxx start" will still be
running and "/etc/init.d/S98xxx stop" will be executed - the thought
of which horrifies me beyond reasoning.


 And that's exactly why there's a lock preventing several state
changes from running concurrently. :)
 What I can do is add a bit of signal handling to s6-rc, so that if
it gets interrupted, say with a SIGINT or SIGTERM, it exits ASAP,
while still ensuring consistency of the service states.

 Unfortunately, for oneshots it would mean waiting for the current
transitions to finish before exiting - s6-rc has no way to interrupt
a running oneshot, and adding one (making s6rc-oneshot-runner kill
all its children) would not help, because until the oneshot script
exits, it is not visible from the outside whether it has accomplished
its transition or not - so the state would still be undetermined.

 Also, state consistency cannot be 100% ensured, because s6-rc could
still receive a SIGKILL - but if you kill -9 s6-rc, you deserve
trouble.
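
 To illustrate the idea of exiting as soon as possible without
leaving a transition half-done, here is a small shell sketch. It is
not s6-rc's code, and run_transitions and its variables are
hypothetical names: the signal handler only sets a flag, and the flag
is checked between transitions, so whatever is currently running is
allowed to finish and its outcome stays known.

```shell
#!/bin/sh
# Illustration only: this is not s6-rc's implementation.
# run_transitions and its variables are hypothetical names.

# run_transitions CMD...
# Runs each CMD (a command or shell function name) to completion, in order.
# SIGINT/SIGTERM only set a flag; the flag is checked between transitions,
# so a running transition is never cut short.
run_transitions() {
  interrupted=0
  finished=""
  trap 'interrupted=1' INT TERM
  for t in "$@"; do
    [ "$interrupted" -eq 0 ] || break   # do not start a new transition
    "$t"                                # let the current transition finish
    finished="$finished $t"             # record that it ran to its end
  done
  trap - INT TERM
  [ "$interrupted" -eq 0 ]              # nonzero exit status if interrupted
}
```

 As with the oneshot case described above, the price is latency: the
loop cannot return before the transition in progress has exited.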

 What do you think?

--
 Laurent
