Subject: bcfg2-server init scripts are a complete mess
Package: bcfg2-server
Version: 1.0.1-2
Justification: renders package unusable
Severity: grave
Tags: patch

Hi!

While evaluating bcfg2, I came across its Debian init scripts, which are IMHO a complete mess. Sorry to use these terms, but I am being honest... I can hardly believe such bugs have not been reported yet (really release critical, I guess, as I can't imagine using bcfg2, or, more simply, can't use it, with init scripts in this state).

*****
To start with the "start" (sic) stanza:

- Well, it works quite right on a standard machine, at boot as well as manually. But I have no interest in running it on a standard machine: I need it in an OpenVZ container (LXC is still rough, lacking convenience executables and scripts, hardware sharing, resource management, and so on; until all of that improves in LXC, OpenVZ still rocks a lot, a lot more than KVM and such, which I deem stupidly used most of the time, while containers do most jobs better, more securely, and with fewer resources. I like KVM, mind you... at most on top of containerization, or for network simulations. But for segregating services? As ridiculous as an H-bomb to kill a mosquito).

And here comes the problem: it will start manually in an OpenVZ container, but not at container start-up (a special case, of the "not so frequent to figure out" kind, but still... annoying). Fiddling a bit, I found that it seems to run too early. Maybe everything starts quicker, or slower, or in a better way, or in a luckily worse way, outside of an OpenVZ container (I don't know, and honestly, I don't really care), but inside, that is not the case. What I know is that in the test container I set up, "ls /etc/rc2.d|grep 01" only shows "S01bcfg2", "S01bcfg2-server", "S01bootlogs" and "S01rsyslog" as being launched that early, so common sense hints that bcfg2-server requires rsyslog to be started before it is spawned.
Activating its debug mode, I realized bcfg2-server would try to start, but fail, and be killed immediately, unless it was run once syslog had been started (verified through an "invoke-rc.d rsyslog stop && invoke-rc.d bcfg2-server start": bcfg2-server indeed fails to start). Hence, I added $syslog to "Required-Start", as (new) Squeeze (containers) use the dependency-based init system to determine the run order at boot, and after an "update-rc.d bcfg2-server defaults" and a container restart, voilà! This way, bcfg2-server runs just fine at the boot/init of an OpenVZ container.

- Well... almost "voilà". Even if the "stop" stanza worked (which it doesn't... but I'll explain that a bit further down), a problem could arise. Indeed, there is nothing in your script to protect against launching bcfg2-server if it is already running. Or, well, if you do this while it already is, it seems to write bogus information into the PID file. Try this, starting with a stopped server: "invoke-rc.d bcfg2-server start && cat /var/run/bcfg2-server.pid && invoke-rc.d bcfg2-server start && cat /var/run/bcfg2-server.pid && pidof -x /usr/sbin/bcfg2-server", and you'll see the PID file gets renewed, but the PID of the running "bcfg2-server" actually is... surprise, not what is in the PID file, but the one that initially existed (I guess the idiotic "start_daemon" function from LSB starts a new server instance and writes its new PID into the PID file, even though this new server seems to die for whatever reason, leaving only the original one running... or something like that). This is highly problematic if one then wants to stop the server (which will necessarily happen at some point), as stopping the server requires a correct PID file... I guess this is another example, amongst a freaking multitude, of why the LSB "init-functions" is a pile of buggy and, mostly, useless crap...
but, well, whatever: you seem to want to use it, so let us try to cope with what you want... The problem is simply resolved by testing whether the server is running before trying to launch it, and silently reporting that everything went OK without doing anything if it was already running (this also implies setting the PID variable outside of the "status" function of the init script, as it is now also used in the "start" function).

Two problems solved... but this is far from ending there.

*****
Now, for the "stop" stanza:

- There seems to be a recurring bug in bcfg2-server (see http://trac.mcs.anl.gov/projects/bcfg2/ticket/709 - bug opened, then closed, then re-opened, then re-closed, ... on and on), where the server doesn't fully respond to a SIGTERM, because the fam/gamin/whatever worker thread doesn't stop, while the server waits indefinitely for it to terminate... which results in the "stop" stanza not working. This can also be observed while running the server undaemonized and trying to stop it with ^C: the server will hang there and wait forever for the file alteration monitor to stop, which never happens. I guess this is a case for another bug report, but well: for now, using SIGKILL instead of SIGTERM in the stop stanza does the trick. The fam worker thread will still remain active once "bcfg2-server" is killed, but this is a relatively minor problem, as it will be killed, and another spawned in its stead, if bcfg2-server is relaunched thereafter. So until you correct the bug of this worker sub-process never obeying the SIGTERM, the SIGKILL seems an appropriate workaround to me: not an ideal solution, but still better than a stop script that doesn't work.
Especially if we consider that, as the "stop" stanza leaves a process half-dead forever, and as the "restart" stanza calls the "stop" one, a "restart" will result in two living "bcfg2-server" processes running concurrently, and will add one more each time it is reissued. This can seriously wreak havoc, as the PID file will get overwritten, the logs filled with incoherent messages, and so on (this is by the way already solved by my patch of the "start" function, which prevents several instances of bcfg2-server from running at once, but this wreckage happens with your unpatched script: try it, not funny at all)... Plus, it only takes adding a "9" at the end of the "killproc" call (you have to read the LSB "init-functions" to figure this out, as it is mostly undocumented... sigh).

It is by the way quite interesting to observe the difference between issuing the "start" function several times without calling the "stop" one, and with calling it. I think that when "start" is issued several times without "stop", the initially running "bcfg2-server" causes the additional ones you try to spawn to kill themselves, realizing one of them is already serving. But with the "stop" function, the initially running server enters a state where it begins to shut down (observed in its logs, if you activate them), which allows new ones to be spawned, as the initially running one is not really serving anything anymore, but only waits for its worker thread to stop (which, as I said, never happens). The half-dead additional processes are then processes waiting indefinitely for the file alteration monitoring to stop, but not serving anything - zombies, in other terms...

- Now, this implies one problem: the "killproc" function will only remove the PID file if a TERM signal has been requested; should a KILL signal be requested, it will not clean up the PID file... Seriously, I can't believe how this LSB "init-functions" is a pile of stupid, ridiculous and useless crap!
If a TERM signal is sent, the daemon will not be stopped, as things stand now, and the PID file will not be cleaned up. But if a KILL signal is sent, there is a logical test inside this stupid "init-functions" crap that prevents the PID file from being cleaned up... yeah, "right, Indian companion": very useful wrappers indeed (or, most definitely, not)... Well, it is up to you, the maintainer, to choose: as the PID file will never get cleaned up, do you want the "stop" stanza to actually stop the daemon, or never do it? I guess the SIGKILL is still the least obtrusive way to sort things out for now: once you have "bcfg2-server" patched so that it manages to pass the SIGTERM to its FAM/gamin/... worker thread, you can revert to SIGTERM instead of SIGKILL by removing the "9" I added to the killproc call, and the PID file will be managed properly, but for now, one way or the other, it is impossible, so...

- Another problem is that if the server was not running when the "stop" stanza is called, an error message from the "start-stop-daemon" spawned by the "init-functions" wrapper is printed to the terminal, but somehow does not result in an error code, as the killproc function exits cleanly (told you: "init-functions" is a pile of crappy stupidity...), which is a bit incoherent: either it should not display this error message, or it should log it. So I think it would be a bit better to do some more tests, to know whether the server is running before trying to stop it, and not raise an error if it was already terminated (which seems the usual way in Debian). I do this in a similar way to the one I used to resolve the second problem for the "start" function.

*****
As for the "status" stanza:

- Well, it just doesn't work, or works too well: whatever happens, it always tells you bcfg2-server is running, even though it obviously is not. Guess why? Plain and simple: you use /bin/pidof to find a process whose base name is "bcfg2-server".
And guess what this init script is called? You're right: "bcfg2-server"... Ask a script whether a process "more or less" named like that very script is running: it will of course always answer that one is, as IT is running... seriously? Using the daemon's full path in the pidof call instead of its base name simply solves the problem...

- Otherwise, with the (absolutely necessary, for now!) SIGKILL to stop the server, the "$BINARY dead but pid file exists..." message is plain useless, as this file will not be cleaned up at all without a SIGTERM request. So, until you patch the server to work with a SIGTERM, I commented it out.

- Now, that does not mean this check is not buggy either: to do it, you test whether the PID file exists (good), and whether a process having this PID is running (bad) - you should test whether it is NOT running, if you need to test anything. But actually, you already tested that in the lines just above, and had bcfg2-server been running, the function would already have exited with a 0 status, so there is no use in testing it again: it is implicitly an already known fact. So I removed this "[ -n $PID ]" test, which should rather have been a "[ -z $PID ]", but is useless anyway, even if the server responded appropriately to a SIGTERM...

*****
And finally, for the "restart" stanza:

- Minor thing: the 5 second sleep is not useful, especially not with a SIGKILL, which leaves no reason to wait at all, and it would not be useful either with a SIGTERM that works correctly. If it works, it works: no need for Black Arts, then. So I removed this useless sleep.
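To make the guards described above concrete, here is a minimal sketch. It is not the attached patch itself: the helper names (running_pids, should_start, should_stop) are made up for illustration, and the daemon path is the one from my pidof test above. The PID list is quoted on purpose, since an unquoted test can trip over pidof returning several PIDs.

```shell
#!/bin/sh
# Illustrative sketch only: helper names are hypothetical, not from the package.
DAEMON=/usr/sbin/bcfg2-server   # full path, as used in the pidof test above

# Print the PIDs of processes running the daemon, looked up by full path so
# that the init script (which shares the daemon's base name) is less likely
# to match itself. Prints nothing when no such process is found.
running_pids() {
    pidof -x "$1" 2>/dev/null || true
}

# Succeed (exit 0) when "start" has work to do: no PID was found.
# The argument is quoted so a list of several PIDs does not break the test.
should_start() {
    [ -z "$1" ]
}

# Succeed (exit 0) when "stop" has work to do: at least one PID was found.
should_stop() {
    [ -n "$1" ]
}
```

In the "start" stanza this would be used as PID=$(running_pids "$DAEMON") followed by "if should_start "$PID"; then start_daemon ...; fi", with STATUS=0 in the other branch; the "stop" stanza would mirror it with should_stop.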
Hope you'll patch all of this, as I really think (most of) this is really serious. Basically, none of the defined init stanzas works correctly: "start" runs too early and wreaks havoc in the PID file if used repeatedly; "stop" never works and, like "restart", only makes it possible to stack server processes one on top of the other; and "status" always reports the same thing whatever the circumstances. A diff is of course attached (half the size of the original file! Ahemmm...).

I haven't had the time yet to look much into the bcfg2 client, but what I saw made me lose any hope about its init scripts. I may file another bug for those, but, well, I'll start by telling you this here: there is no support anymore for a persistent agent mode in bcfg2, starting from version 1.0... Maybe you did not notice this, but, well: your client init scripts are now pretty much useless and non-working (apart from the "one-time" run at init... but I think this could as well be done using custom thingies in rc.local - you could as well remove them, leaving you more time to maintain the server ones: [friendly] pun intended :p ). The preferred ways to run the client now seem to be SSH triggering or cron (I don't know yet what I'll do).

And finally, I hope that you will not resent the tone (a bit grumpy, I confess) of my message, but the init scripts you maintain (I don't know if you wrote them, and, well, whatever...: I am pointing the finger at scripts, not at people) are a real mess (I spent a day trying to understand what I was doing wrong, while, well... I was mostly right, I guess), and the use of those feeble and undocumented LSB "init-functions" can really get on the nerves of anyone needing to look at them... which is a bit of a shame, as bcfg2 seems so good (I have come to purely hate cfengine and puppet: using those, try managing the repository ACLs for the various unique private keys to be pulled by hundreds and hundreds of clients... not to speak of proprietary extensions providing the database pull I need so much, while extending the abilities of the software is a real pain, or the resource consumption of the ruby behemoth while most of the work has to be done inside each and every client - strategy and design flaws incarnate: such a payoff, for _management_ apps... What I've read about bcfg2 really makes it look like heaven to me [apart from the need to touch XML, OK, right...], and I really think it deserves better advertisement, or at least a chance, which its Debian init scripts certainly do not offer as they are now).

Farewell.

-- System Information:
Debian Release: squeeze/sid
  APT prefers testing
  APT policy: (500, 'testing')
Architecture: amd64 (x86_64)

Kernel: Linux 2.6.32-5-openvz-amd64 (SMP w/4 CPU cores)
Shell: /bin/sh linked to /bin/dash

Versions of packages bcfg2-server depends on:
ii  bcfg2           1.0.1-2        Configuration management client
ii  gamin           0.1.10-2+b1    File and directory monitoring syst
ii  libxml2-utils   2.7.7.dfsg-4   XML utilities
ii  lsb-base        3.2-23.1       Linux Standard Base 3.2 init scrip
ii  openssl         0.9.8o-1       Secure Socket Layer (SSL) binary a
ii  python          2.6.5-5        An interactive high-level object-o
ii  python-fam      1.1.1-2.2+b1   Python interface to FAM
ii  python-gamin    0.1.10-2+b1    Python binding for the gamin clien
ii  python-lxml     2.2.6-1        pythonic binding for the libxml2 a
ii  python-support  1.0.9          automated rebuilding support for P
ii  ucf             3.0025         Update Configuration File: preserv

Versions of packages bcfg2-server recommends:
pn  graphviz              <none>   (no description available)

Versions of packages bcfg2-server suggests:
pn  mail-transport-agent  <none>   (no description available)
pn  python-cheetah        <none>   (no description available)
pn  python-django         <none>   (no description available)
pn  python-genshi         <none>   (no description available)
pn  python-profiler       <none>   (no description available)
pn  sqlalchemy            <none>   (no description available)

10,11c10,11
< # Required-Start: $network $remote_fs $named
< # Required-Stop: $network $remote_fs $named
---
> # Required-Start: $network $remote_fs $named $syslog
> # Required-Stop: $network $remote_fs $named $syslog
43a44
> PID=$(pidof -x ${DAEMON})
47,48c48,53
< start_daemon ${DAEMON} ${PARAMS} ${BCFG2_SERVER_OPTIONS}
< STATUS=$?
---
> if [ -z ${PID} ]; then
> start_daemon ${DAEMON} ${PARAMS} ${BCFG2_SERVER_OPTIONS}
> STATUS=$?
> else
> STATUS=0
> fi
61,62c66,71
< killproc -p $PIDFILE ${BINARY}
< STATUS=$?
---
> if [ ! -z ${PID} ]; then
> killproc -p $PIDFILE ${BINARY} 9
> STATUS=$?
> else
> STATUS=0
> fi
74d82
< PID=$(pidof -x $BINARY)
80,85c88,95
< if [ -f $PIDFILE ]; then
< if [ -n "$PID" ]; then
< log_failure_msg "$BINARY dead but pid file exists..."
< return 1
< fi
< fi
---
> # This section will only be useful when the bug, making the server never fully
> # responding to a SIGTERM, but only to a SIGKILL, will be corrected... but for
> # now it only spawns unnecessary errors, and returns already expected
> # information...
> # if [ -f $PIDFILE ]; then
> # log_failure_msg "$BINARY dead but pid file exists..."
> # return 1
> # fi
103d112
< sleep 5