Re: [Bug 688541] Re: race condition on shutdown (leads to corrupted fs)
On Thu, 2010-12-23 at 22:07 +, ingo wrote:
> I took you literally and changed all [!2345], not the others. The remaining ones are now:
>
>     fgrep "stop on runlevel" /etc/init/*.conf
>     /etc/init/rc.conf:stop on runlevel [!$RUNLEVEL]
>     /etc/init/rcS.conf:stop on runlevel [!S]
>     /etc/init/rc-sysinit.conf:stop on runlevel
>     /etc/init/tty2.conf:stop on runlevel [!23]
>     /etc/init/tty3.conf:stop on runlevel [!23]
>     /etc/init/tty4.conf:stop on runlevel [!23]
>     /etc/init/tty5.conf:stop on runlevel [!23]
>     /etc/init/tty6.conf:stop on runlevel [!23]
>     /etc/init/ufw.conf:stop on runlevel [!023456]
>
> I still get the orphaned inodes. Shall I also convert the ttys?

You can, but I doubt they're the problem. Can you paste the output of

    lsof -n | grep deleted

after the reinstall?

Thanks.

-- 
You received this bug notification because you are a member of Ubuntu Server Team, which is subscribed to mysql-5.1 in ubuntu.
https://bugs.launchpad.net/bugs/688541

Title:
  race condition on shutdown (leads to corrupted fs)

-- 
Ubuntu-server-bugs mailing list
Ubuntu-server-bugs@lists.ubuntu.com
Modify settings or unsubscribe at: https://lists.ubuntu.com/mailman/listinfo/ubuntu-server-bugs
Re: [Bug 688541] Re: race condition on shutdown (leads to corrupted fs)
On Thu, 2010-12-23 at 12:40 +, ingo wrote:
> @Clint: I tested your proposal in Maverick.
>
> Before editing the stop scripts:
>
>     fgrep "stop on runlevel" /etc/init/*.conf
>     /etc/init/acpid.conf:stop on runlevel [!2345]
>     /etc/init/anacron.conf:stop on runlevel [!2345]
>     /etc/init/apport.conf:stop on runlevel [!2345]
>     /etc/init/atd.conf:stop on runlevel [!2345]
>     /etc/init/cron.conf:stop on runlevel [!2345]
>     /etc/init/cups.conf:stop on runlevel [016]
>     /etc/init/dbus.conf:stop on runlevel [06]
>     /etc/init/failsafe-x.conf:stop on runlevel [06]
>     /etc/init/gdm.conf:stop on runlevel [016]
>     /etc/init/irqbalance.conf:stop on runlevel [!2345]
>     /etc/init/mountall-shell.conf:stop on runlevel [06]
>     /etc/init/rc.conf:stop on runlevel [!$RUNLEVEL]
>     /etc/init/rcS.conf:stop on runlevel [!S]
>     /etc/init/rc-sysinit.conf:stop on runlevel
>     /etc/init/rsyslog.conf:stop on runlevel [06]
>     /etc/init/tty1.conf:stop on runlevel [!2345]
>     /etc/init/tty2.conf:stop on runlevel [!23]
>     /etc/init/tty3.conf:stop on runlevel [!23]
>     /etc/init/tty4.conf:stop on runlevel [!23]
>     /etc/init/tty5.conf:stop on runlevel [!23]
>     /etc/init/tty6.conf:stop on runlevel [!23]
>     /etc/init/udev.conf:stop on runlevel [06]
>     /etc/init/ufw.conf:stop on runlevel [!023456]
>
> After editing the stop scripts:
>
>     fgrep "stop on starting" /etc/init/*.conf
>     /etc/init/cups.conf:stop on starting rc RUNLEVEL=[016]
>     /etc/init/dbus.conf:stop on starting rc RUNLEVEL=[06]
>     /etc/init/failsafe-x.conf:stop on starting rc RUNLEVEL=[06]
>     /etc/init/gdm.conf:stop on starting rc RUNLEVEL=[016]
>     /etc/init/mountall.conf:stop on starting rcS
>     /etc/init/mountall-shell.conf:stop on starting rc RUNLEVEL=[06]
>     /etc/init/rsyslog.conf:stop on starting rc RUNLEVEL=[06]
>     /etc/init/udev.conf:stop on starting rc RUNLEVEL=[06]
>
> Then I ran apt-get install --reinstall libc6 and rebooted: I still get the 8 orphaned inodes, as already reported.
>
> Did I fail to change the other scripts as well, like this?
>
>     stop on runlevel [!2345] -> stop on starting rc RUNLEVEL=[016]

Yes, and some of those are probably the most likely to have libc open.

If doing the same to all of the !2345's does not fix the corruption, can you run:

    apt-get install --reinstall libc6
    lsof -n | grep deleted
    initctl list

and paste or upload the output here?

-- 
You received this bug notification because you are a direct subscriber of the bug.
https://bugs.launchpad.net/bugs/688541

Title:
  race condition on shutdown (leads to corrupted fs)

Status in “mysql-5.1” package in Ubuntu: Triaged
Status in “sysvinit” package in Ubuntu: Triaged

Bug description:
  I'm using mysql-server-5.1 on a 10.04 LTS installation. The mysql db is around 27GB and on a separate partition mounted as /var/lib/mysql.

  On shutdown I get the following error message:

    Checking for running unattended-upgrades:
     * Asking all remaining processes to terminate... [ OK ]
     * All processes ended within 1 seconds [ OK ]
     * Deconfiguring network interfaces... [ OK ]
     * Deactivating swap... [ OK ]
     * Unmounting local filesystems...
    umount2: Device or resource busy
    umount: /var/lib/mysql: device is busy.
            (In some cases useful info about processes that use
             the device is found by lsof(8) or fuser(1))
    umount2: Device or resource busy
    umount2: Device or resource busy
    umount: /tmp: device is busy.
            (In some cases useful info about processes that use
             the device is found by lsof(8) or fuser(1))
    umount2: Device or resource busy
    [ fail ]
    mount: / is busy
     * Will now restart
    [ 3369.429751] Restarting system.

  On the next reboot the file system is corrupt and needs to be fsck-ed.

  I think the problem is that mysql uses an upstart job (/etc/init/mysql.conf) and has

    stop on runlevel [016]

  The rc.conf job is also triggered on runlevel 0 and 6, so the two basically run at the same time. When /etc/rc0.d/S20sendsigs is run, it deliberately does not wait for or kill any upstart jobs.

  As my mysqld process takes some time to shut down, S40umountfs and S60umountroot are run before mysqld has quit, so the fs is not properly unmounted. It is even possible that mysqld is forcefully killed by halt in S90halt if it hasn't stopped by then.

  This is a serious issue, as it can (and will) lead to data loss.

  Other upstart jobs, like rsyslog.conf, use the same stop on runlevel [016] stanza, so they are probably affected too.

  ProblemType: Bug
  DistroRelease: Ubuntu 10.10
  Package: mysql-server-5.1 5.1.49-1ubuntu8.1
  Uname: Linux 2.6.32-5-686 i686
  NonfreeKernelModules: michael_mic arc4 ecb lib80211_crypt_tkip aes_i586 aes_generic lib80211_crypt_ccmp sco bnep rfcomm l2cap binfmt_misc acpi_cpufreq ppdev lp cpufreq_userspace cpufreq_stats vboxnetadp cpufreq_powersave vboxnetflt cpufreq_conservative vboxdrv fuse pcmcia snd_intel8x0m snd_intel8x0
Re: [Bug 688541] Re: race condition on shutdown (leads to corrupted fs)
On Tue, 2010-12-21 at 12:41 +, Scott James Remnant wrote:
> On 20/12/10 18:22, Clint Byrum wrote:
> > In a message to ubuntu-devel I suggested that we have an abstract job, 'network-services', which most normal (non boot-critical) services should follow.
> >
> > https://lists.ubuntu.com/archives/ubuntu-devel/2010-December/032254.html
>
> General note: ubuntu-devel is *NOT* the correct list to discuss Upstart changes unless they're unique to Ubuntu.
>
> Thanks,
> Scott

In this case, I don't know if this would be unique to Ubuntu or not. I am not suggesting a code change in upstart with that message, but rather a change in the way upstart is used and packaged in Ubuntu.

Though, it would be rather nice if everybody used upstart the same way.
Re: [Bug 688541] Re: race condition on shutdown (leads to corrupted fs)
2010/12/21 ingo 688...@bugs.launchpad.net:
> On Tue, 2010-12-21 at 12:41 +, Scott James Remnant wrote:
> > General note: ubuntu-devel is *NOT* the correct list to discuss Upstart changes unless they're unique to Ubuntu.
>
> Wouldn't it be fair to inform Debian about those problems before they release Squeeze? (though I never observed it on Squeeze till now)

This doesn't affect Debian, as the upstart package in Debian still uses plain sysv compat and there are no native upstart jobs yet.

Michael (upstart maintainer in Debian)

-- 
Why is it that all of the instruments seeking intelligent life in the universe are pointed away from Earth?
Re: [Bug 688541] Re: race condition on shutdown (leads to corrupted fs)
2010/12/20 James Hunt 688...@bugs.launchpad.net:
> 3) Modify all upstart configs for services which are slow to stop such that they stop on unmount-filesystem, rather than stop on runlevel [016].

- What about single user mode? I guess when switching to runlevel 1 we want to stop services like mysql?

- How do you decide if a service is 'slow to stop'? Imho that highly depends on the given hardware, local configuration and the amount of data you are dealing with. A general approach would be preferable.
Re: [Bug 688541] Re: race condition on shutdown (leads to corrupted fs)
On Mon, 2010-12-20 at 12:50 +, James Hunt wrote:
> After discussion with Scott, the best short-term solution would seem to be:
>
> 1) Modify /etc/init.d/umountfs to call the following in do_stop before calling umount/swapoff:
>
>     initctl emit unmount-filesystem
>
> 2) Modify /etc/init.d/umountroot to call the following in do_stop before calling umount:
>
>     initctl emit unmount-root-filesystem
>
> 3) Modify all upstart configs for services which are slow to stop such that they stop on unmount-filesystem, rather than stop on runlevel [016].
>
> 4) Test!
>
> The overall effect of this being that when /etc/init.d/umountfs emits the unmount-filesystem event, it will block until any Upstart jobs which stop on those events have completed. Thus, /etc/init.d/umountfs will wait for the mysql Upstart job to finish before unmounting its filesystems.

Not much happens between rc-sysinit starting and sendsigs/umountfs. Is slow even 1 second between SIGTERM and exiting? Shouldn't we just make sure everything that is 'stop on runlevel [!2345]' or 'stop on runlevel [016]' stops before we umount?

Bug #672177 may very well be caused simply by killing the last service that had the deleted libc.so.6 open, causing the fs to need to finish the deletion right then, which could be waiting on a sync and many other files being flushed etc. on a busy rotational disk. This will cause something very tiny to take a second to die.

I think we must transition *everything* that stops on runlevel [016] to 'stop on unmounting-filesystems', or get clever and find a way to wait until upstart is done stopping everything it already wants to stop. I do think that initctl list is flawed for this task, but it might be the best chance at catching stragglers that we have.

In a message to ubuntu-devel I suggested that we have an abstract job, 'network-services', which most normal (non boot-critical) services should follow.

https://lists.ubuntu.com/archives/ubuntu-devel/2010-December/032254.html

By taking this approach, we can at least amend this fix if it has unintended consequences.

There's also still the issue (which probably should be its own bug report) that sendsigs will kill the children of already stopping jobs, which it shouldn't do, and which it would still do in the suggested fix since sendsigs runs before umountfs.
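As a rough illustration of steps 1 and 3 of the quoted proposal, the ordering inside umountfs could look like the sketch below. This is not the actual /etc/init.d/umountfs: do_stop is reduced here to the bare ordering, and INITCTL and UMOUNT_CMD are hypothetical override variables added only so the sequence can be tried without a real Upstart or a real unmount.

```shell
#!/bin/sh
# Sketch only, not the shipped umountfs script.
# INITCTL and UMOUNT_CMD are hypothetical overrides for dry runs.
INITCTL=${INITCTL:-initctl}
UMOUNT_CMD=${UMOUNT_CMD:-umount}

do_stop() {
    # Emit the event first. Per the proposal quoted above, the emit blocks
    # until every job that declares "stop on unmount-filesystem" has stopped.
    "$INITCTL" emit unmount-filesystem

    # Only then unmount the filesystems passed as arguments.
    "$UMOUNT_CMD" "$@"
}
```

A job opting in would then declare "stop on unmount-filesystem" in place of "stop on runlevel [016]", as in step 3.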
Re: [Bug 688541] Re: race condition on shutdown (leads to corrupted fs)
2010/12/16 Clint Byrum cl...@fewbar.com:
> /etc/init.d/sendsigs has this code:
>
>     # Upstart jobs have their own "stop on" clauses that sends
>     # SIGTERM/SIGKILL just like this, so if they're still running,
>     # they're supposed to be
>     for pid in $(initctl list | sed -n -e "/process [0-9]/s/.*process //p"); do
>         OMITPIDS="${OMITPIDS:+$OMITPIDS }-o $pid"
>     done
>
> It uses this to determine which pids not to kill because, presumably, upstart should be managing them. However, this code is flawed. killall5 will kill the children of all of these if they are multi process daemons or scripts running things.

This observation is correct. On the other hand, isn't this exactly what the sendsigs script is for: clean up any remaining, stray processes which have not been stopped by their corresponding sysv init script or upstart job (or have been e.g. started by the user)?

But I guess you are right, we should first stop all upstart jobs, give them time to finish stopping, and then let sendsigs clean up anything remaining afterwards.

> However, this technique can actually be used to determine if there are still jobs that are supposed to be stopped, but haven't finished stopping yet. Since they should be listed as stop/(pre-stop|post-stop|killed), we can determine exactly which pids we expect to go away. Since upstart has its own idea of how long to wait before it kills these, we should actually wait indefinitely.
>
> I'm attaching a debdiff that solves the race as far as I can tell, though I think it needs a good long look, since it could mean shutdowns hang for a long time waiting (I'm especially curious if the pre-stop/post-stop's are subject to kill timeout)

This code is still racy, afaics. What about upstart jobs which are not stopped by stop on runlevel [016]? They could receive their stop signal at a point when your loop has already been run.

If you don't want to change existing jobs, we probably have to pick up Ante's suggestion, and do the following in sendsigs:

1) Run a for loop to wait for *all* running upstart jobs to stop. Upstart jobs which need to keep running past sendsigs (e.g. plymouth) need to signal that using a mechanism similar to the killall5 sendsigs.d omit interface. I'd at least give upstart jobs 60 secs to stop, so big databases etc. have enough time to shut down cleanly.

2) Run a for loop and send SIGTERM to all remaining processes, but do *not* add upstart pids to $OMITPIDS.

3) Send a final SIGKILL if any processes are left.

Regarding 1), it would be nice to have a native C implementation in upstart, instead of running initctl, grep and sleep manually.
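A minimal sketch of the wait loop in step 1, built from initctl, grep and sleep as the message describes. The function names list_jobs and wait_for_upstart_jobs are made up for this illustration, the omit mechanism for jobs such as plymouth is left out, and the loop assumes that initctl list prints one "job goal/state" line per job with fully stopped jobs shown as stop/waiting. Polling like this remains open to the race discussed in this thread.

```shell
#!/bin/sh
# Hypothetical wrapper so the listing can be replaced in a dry run.
list_jobs() {
    initctl list
}

# Wait until every listed job is in stop/waiting, or give up after
# $1 seconds (default 60, the figure suggested in the message above).
# Returns 0 if everything stopped, 1 on timeout.
wait_for_upstart_jobs() {
    timeout=${1:-60}
    elapsed=0
    # grep -v selects lines that are NOT stop/waiting; -q only reports
    # whether at least one such line exists.
    while list_jobs | grep -qv 'stop/waiting'; do
        if [ "$elapsed" -ge "$timeout" ]; then
            return 1
        fi
        sleep 1
        elapsed=$((elapsed + 1))
    done
    return 0
}
```

On timeout the caller would fall through to steps 2 and 3; a job that nothing has asked to stop keeps the loop waiting for the full timeout, which is why the message asks for an omit mechanism.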
Re: [Bug 688541] Re: race condition on shutdown (leads to corrupted fs)
On Thu, 2010-12-16 at 15:45 +, Michael Biebl wrote:
> 2010/12/16 Clint Byrum cl...@fewbar.com:
> > I'm attaching a debdiff that solves the race as far as I can tell, though I think it needs a good long look, since it could mean shutdowns hang for a long time waiting (I'm especially curious if the pre-stop/post-stop's are subject to kill timeout)
>
> This code is still racy, afaics. What about upstart jobs which are not stopped by stop on runlevel [016]? They could receive their stop signal at a point when your loop has already been run.

Indeed, there is still a race, I think, now that I dig through upstart's code a bit. If any of the jobs in the stop/!waiting state have 'stop on stopped' jobs that will be stopped after they stop, the event isn't emitted until *after* the transition to stop/waiting.

    thread A (upstart job foo):
        start/running -> stop/pre-stop
        sends TERM to owned process
        stop/pre-stop -> stop/killed
        process dies
        stop/killed -> stop/waiting
        emit stopped JOB=foo

    thread B (upstart job baz):
        start/running -> stop/pre-stop
        sends kill to owned process
        stop/pre-stop -> stop/killed
        process dies
        stop/killed -> stop/waiting

    thread C (sleep loop):
        runs initctl list
        greps
        sleeps
        runs initctl list
        greps
        sleeps

list is handled by doing a get-all-jobs command first, and then individual status commands for each job, so it's entirely possible that we will ask for the status of baz and it will say start/running, and then foo finishes its transition, then we ask for foo's status and it is stop/waiting, and we think we're done.

This race would probably be solved by having a "list all jobs with status" command, as long as the stopped event is guaranteed to be consumed before any commands, which I believe it will be.

One delicate issue is that if an upstart managed process dies for any other reason than being stopped, upstart will try to respawn it, so we can't just go sending SIGTERM/SIGKILL to all pids, as upstart will fight us on those. We actually have to stop everything.

> If you don't want to change existing jobs, we probably have to pick up Ante's suggestion, and do the following in sendsigs:
>
> 1) Run a for loop to wait for *all* running upstart jobs to stop. Upstart jobs which need to keep running past sendsigs (e.g. plymouth) need to signal that using a mechanism similar to the killall5 sendsigs.d omit interface. I'd at least give upstart jobs 60 secs to stop, so big databases etc. have enough time to shut down cleanly.

IMO, leaving out a valid stop on that gets it stopped at or before runlevel [016] is the equivalent of the omit interface. You've started it, saying exactly when upstart should or should not stop it. However, if you've wandered into the scenario mentioned above with stop on stopped foo, then we need to handle that.

> 2) Run a for loop and send SIGTERM to all remaining processes, but do *not* add upstart pids to $OMITPIDS.

See above, you'd have to send 'stop' commands to upstart for them, instead of omitting them.

> 3) Send a final SIGKILL if any processes are left.

I'd say let upstart do that, but how do we know when we can continue on to unmounting? I suppose after a lengthy timeout (60s does seem long enough, though mysql can take longer) this makes sense.

> Regarding 1), it would be nice to have a native C implementation in upstart, instead of running initctl, grep and sleep manually.

I agree, but I'm having trouble envisioning exactly what one would ask for. "Block until all current goals are reached" would work, maybe.
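For reference, the "greps" step of the sleep loop can be sketched as a filter that picks out the pids of jobs caught in a transitional stop state. This is an illustration only, not the attached debdiff: stopping_pids is a made-up name, and the filter assumes initctl list lines of the form "job goal/state, process PID". It does nothing about the race described above, since it still works from a listing that is assembled one job at a time.

```shell
#!/bin/sh
# Read 'initctl list'-style lines on stdin and print the pid of every job
# whose goal is stop but which has not yet reached stop/waiting.
# The line format matched here is an assumption, for example:
#   mysql stop/killed, process 1234
stopping_pids() {
    awk '/ stop\/(pre-stop|post-stop|killed), process [0-9]+/ { print $NF }'
}
```

Typical use would be `initctl list | stopping_pids`, looping until the output is empty.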
Re: [Bug 688541] Re: race condition on shutdown (leads to corrupted fs)
2010/12/14 Clint Byrum cl...@fewbar.com:
> I do think the appropriate fix is to have umountfs emit an 'unmounting-filesystems' event and anything that does a 'start on local-filesystems' or 'start on filesystem' should also 'stop on unmounting-filesystems',

What do you do about services which have start on runlevel [2345] and the binary is in /usr? There are quite a few examples here: acpid, atd, cron, irqbalance, etc., which all have:

  start on runlevel [2345]
  stop on runlevel [!2345]

Either those jobs are buggy for not specifying the start on (local-)filesystems dependency, or your criterion is not sufficient.

Imho the major problem here is that there is a mixup between dependencies that need to be satisfied to be able to run a job and when (in which runlevels) to start a job.

Michael

--
Why is it that all of the instruments seeking intelligent life in the universe are pointed away from Earth?
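One way to see how many jobs fall into that gap is to scan the job files for a runlevel-based start condition with no filesystem event anywhere in the file. The sketch below is a rough heuristic under stated assumptions: `jobs_without_fs_dep` is a hypothetical helper, it only recognises a `start on` stanza whose runlevel condition sits on the same line, and it treats any non-comment mention of "filesystem" as a dependency.

```shell
#!/bin/sh
# Hypothetical helper: print the names of upstart jobs in a directory
# (default /etc/init) that start on a runlevel but never mention a
# filesystem event such as local-filesystems or filesystem.
jobs_without_fs_dep() {
    dir=${1:-/etc/init}
    for f in "$dir"/*.conf; do
        [ -f "$f" ] || continue
        # Drop comment lines so a stray mention of "filesystem" does not count.
        body=$(grep -v '^[[:space:]]*#' "$f" || true)
        case $body in
            *filesystem*) continue ;;
        esac
        if printf '%s\n' "$body" | grep -q '^start on.*runlevel'; then
            basename "$f" .conf
        fi
    done
}
```

Run against /etc/init on the system discussed here, this would presumably list acpid, atd, cron and irqbalance among others, going by the stanzas quoted above.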
Re: [Bug 688541] Re: race condition on shutdown (leads to corrupted fs)
2010/12/10 Ante Karamatić iv...@grad.hr:
> Suggestion: make umountfs wait for all upstart jobs to finish.

Doesn't that conflict though with what is written in /etc/init.d/sendsigs:

    # Upstart jobs have their own stop on clauses that sends
    # SIGTERM/SIGKILL just like this, so if they're still running,
    # they're supposed to be
    for pid in $(initctl list | sed -n -e "/process [0-9]/s/.*process //p"); do
        OMITPIDS="${OMITPIDS:+$OMITPIDS }-o $pid"
    done

or

    # did an upstart job start since we last polled initctl? check
    # again on each loop and add any new jobs (e.g., plymouth) to
    # the list. If we did miss one starting up, this beats waiting
    # 10 seconds before shutting down.
    for pid in $(initctl list | sed -n -e "/process [0-9]/s/.*process //p"); do
        OMITPIDS="${OMITPIDS:+$OMITPIDS }-o $pid"
    done

--
Why is it that all of the instruments seeking intelligent life in the universe are pointed away from Earth?
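To illustrate what the quoted loop accumulates, the sketch below wraps the same sed expression in a function that reads `initctl list`-style lines from stdin instead of calling initctl. The wrapper name `build_omitpids` is hypothetical and the sample input in the usage note is made up; only the sed expression and the OMITPIDS assignment come from the snippet above.

```shell
#!/bin/sh
# Hypothetical wrapper around the sendsigs loop: read `initctl list`-style
# lines on stdin and print the "-o PID" omit arguments it would build.
build_omitpids() {
    OMITPIDS=
    # Same sed expression as sendsigs: keep only lines that carry a
    # "process N" field and reduce them to the bare pid.
    for pid in $(sed -n -e "/process [0-9]/s/.*process //p"); do
        OMITPIDS="${OMITPIDS:+$OMITPIDS }-o $pid"
    done
    printf '%s\n' "$OMITPIDS"
}
```

Fed a line such as `mysql stop/killed, process 2345`, the result includes `-o 2345`, so a job that is still in the middle of stopping is left out of the signal round just like one that is meant to keep running.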