I have some questions about how to correctly handle transition failures with s6-rc. The new permanent failure feature already clarifies some scenarios, but I still have doubts about some cases. Below are two concrete examples. I would be happy to have remarks or suggestions about how to cope with them cleanly :).
First, thank you for your mail. This is exactly the kind of feedback that I'm looking for regarding s6-rc: identifying pain points and usability issues.
1. I start a longrun service with "s6-rc -u change svc". This service hangs and never reaches readiness notification. After the timeout, s6-rc declares the transition a failure. But the process is actually running and I have no way to stop it through s6-rc. The only way is to issue "s6-svc -d /path/to/svc". But then I have the feeling that I am doing something behind s6-rc's back to unblock the situation because s6-rc cannot handle it.
That is a fair point. Normally, you should adjust the s6-rc timeouts (both the global one and the service-specific one) to make sure s6-rc does *not* time out before the service is ready - but if there's an unexpected significant delay, the situation can happen. In general it does not matter that s6-rc is unaware that a service is up: when s6-rc reports a (temporary) transition failure, the expected user action is to run the command again, and s6-rc then picks up the correct service states during its second execution. But if you run an s6-rc -d change operation right after a transition failure, it is true that states could become inconsistent. What I can do is add an option to s6-rc to make it explicitly send an s6-svc -d to a service that times out before reaching readiness, which would ensure that a service is either ready in time, or definitely down. Would that help? The annoying thing is that it can't be symmetrical: when a down transition times out, there's no way I'm going to start the service again. :) But generally, a down transition timing out signifies a badly written finish script, or badly calibrated timeouts, and it can be easily solved by running s6-rc -d change again.
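As an illustration of the "run the command again" advice, here is a minimal sketch of a retry wrapper in POSIX shell. The function name retry_change and the attempt count are hypothetical, not part of s6-rc; the sketch only assumes that the wrapped command exits non-zero when the transition fails.

```shell
# retry_change ATTEMPTS CMD [ARGS...]
# Hypothetical helper (not part of s6-rc): rerun CMD until it succeeds,
# at most ATTEMPTS times. Returns 0 on the first success, 1 otherwise.
retry_change() {
  attempts=$1
  shift
  i=1
  while [ "$i" -le "$attempts" ]; do
    # Assumption: the command exits non-zero on a transition failure.
    "$@" && return 0
    i=$((i + 1))
  done
  return 1
}
```

It could be called as, for example, retry_change 3 s6-rc -u change svc. Each attempt is still bounded by the timeouts configured for s6-rc, so the wrapper does not replace calibrating those timeouts.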
2. Slightly related, I have an issue with system shutdown. I am working on a buildroot system and specifically I use the /etc/rc.tini which can be found here [1] and which is executed as part of the shutdown sequence of the system. The problem is with the invocation of "s6-rc -b -da change" (I added the -b). If there is already an s6-rc running, the shutdown sequence will be blocked until the first s6-rc times out. And this kind of timeout is on the order of minutes, as I have slow services depending on each other. I currently think the best thing to do is to "killall s6-rc" before calling "s6-rc -ad change".
Yes. Since the state is global, it makes sense to refuse to start a state change while another one is taking place, unless you're willing to abort the ongoing operation by explicitly killing the running s6-rc process.
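The workaround described in the question (abort the ongoing state change, then bring everything down) could be sketched as the shell function below. The function name shutdown_services and the KILLALL and S6RC override variables are hypothetical and exist only so the commands can be swapped out; killing a running s6-rc this way can still leave service states inconsistent, so this is a sketch of the sequence, not a recommendation.

```shell
# shutdown_services: hypothetical helper for a shutdown script.
# KILLALL and S6RC are illustrative overrides, not s6-rc features.
shutdown_services() {
  killer=${KILLALL:-killall}
  rc=${S6RC:-s6-rc}
  # Abort any state change in progress; it is fine if none is running.
  "$killer" s6-rc 2>/dev/null || true
  # Bring down all currently active services.
  "$rc" -da change
}
```

The function returns the exit status of the final s6-rc invocation, so the caller can decide what to do if the down transition itself fails.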
This leaves a small race condition possible, but more importantly, I have concerns about killing an ongoing s6-rc. This will leave longrun services in the middle of a state transition - there is the connection with the first scenario - and I expect the final effect is that the finish script will not be executed before the system goes down, whereas running it is precisely what I want to happen when I call "s6-rc -ad change". Secondly, I do not know what effect this will have on oneshots. I fear "/etc/init.d/S98xxx start" will still be running while "/etc/init.d/S98xxx stop" is executed - the thought of which horrifies me beyond reasoning.
And that's exactly why there's a lock preventing several state changes from running concurrently. :) What I can do is add a bit of signal handling to s6-rc, so that if it gets interrupted, say with a SIGINT or SIGTERM, it exits ASAP, while still ensuring consistency of the service states. Unfortunately, for oneshots it would mean waiting for the current transitions to finish before exiting - s6-rc has no way to interrupt a running oneshot, and adding one (making s6rc-oneshot-runner kill all its children) would not help, because until the oneshot script exits, it is not visible from the outside whether it has accomplished its transition or not - so the state would still be undetermined. Also, state consistency cannot be 100% ensured, because s6-rc could still receive a SIGKILL - but if you kill -9 s6-rc, you deserve trouble. What do you think?
-- Laurent