[Bug 688541] Re: race condition on shutdown (leads to corrupted fs)

Clint Byrum Mon, 12 Dec 2011 17:27:44 -0800

Fix is waiting in the oneiric-proposed queue.

** Description changed:


+ == SRU JUSTIFICATION ==
+ 
+ IMPACT: potential data loss or extension of downtime. MySQL, for
+ example, if sent a SIGKILL before it is done flushing its buffers into
+ MyISAM tables, will lose that data. If using InnoDB, the transactions
+ will have to be replayed from the transaction log at startup, which can
+ take far longer than completing the flush procedure which the 300 second
+ kill timeout in its job file allows for.
+ 
+ TEST CASE:
+ 
+ 1. create a script, /usr/local/bin/15seconds.py with this as the
+ content:
+ 
+ ##### BEGIN COPY/PASTE #####
+ #!/usr/bin/python
+ 
+ import time
+ import signal
+ import logging
+ import sys
+ 
+ logging.basicConfig(level=logging.INFO,format="TEST: %(asctime)s:
+ %(message)s")
+ 
+ def shutdown_process(sig, frame):
+       logging.info("sleeping 15 seconds...")
+       time.sleep(15)
+       logging.info("now exitting...")
+       sys.exit(0)
+ 
+ signal.signal(signal.SIGTERM, shutdown_process)
+ 
+ logging.info("Entering infinite loop")
+ while True:
+       time.sleep(1)
+ ##### END COPY/PASTE #####
+ 
+ chmod +x /usr/local/bin/15seconds.py
+ 
+ 2. Create an upstart job file, /etc/init/15sec.conf to run this:
+ 
+ ##### BEGIN COPY/PASTE #####
+ start on runlevel [2345]
+ stop on runlevel [016]
+ 
+ respawn
+ 
+ kill timeout 17
+ 
+ console output
+ 
+ exec /usr/local/bin/15seconds.py
+ ##### END COPY/PASTE #####
+ 
+ 3. sudo initctl start 15sec
+ 4. sudo shutdown -h now
+ 
+ On an affected system, the job will be sent SIGKILL before the 15 second
+ kill timeout, so your shutdown log will look something like this:
+ 
+ Checking for running unattended-upgrades: 
+ TEST: 2011-12-13 01:02:54,638: sleeping 15 seconds...
+  * Asking all remaining processes to terminate...                             
  TEST: 2011-12-13 01:02:54,818: now exitting...
+ TEST: 2011-12-13 01:02:54,819: sleeping 15 seconds...
+                                                                          [ OK 
]
+  * Killing all remaining processes...                                    
[fail] 
+  * Deconfiguring network interfaces...                                   [ OK 
] 
+  * Deactivating swap...                                                  [ OK 
] 
+  * Will now halt
+ [   68.020383] System halted.
+ 
+ An unaffected system will look like this:
+ 
+ Checking for running unattended-upgrades: 
+ TEST: 2011-12-13 00:52:30,476: sleeping 15 seconds...
+  * Asking all remaining processes to terminate...                        [ OK 
] 
+ TEST: 2011-12-13 00:52:45,497: now exitting...
+  * All processes ended within 16 seconds....                             [ OK 
] 
+  * Deconfiguring network interfaces...                                   [ OK 
] 
+  * Deactivating swap...                                                  [ OK 
] 
+  * Will now halt
+ [  356.481556] System halted.
+ 
+ Note that the 15sec job is waited on once the bug is fixed, where in the
+ unpatched version it is killed immediately.
+ 
+ DEV FIX: The sendsigs script has not been changed in precise other than
+ for this patch.
+ 
+ REGRESSION POTENTIAL: There may be scenarios and jobs that have very
+ high kill timeouts which will cause system shutdowns to wait for up to
+ 300 seconds instead of the pervious 10. This is considered a good
+ balance between waiting long enough for any reasonable application to
+ flush its buffers and short enough that we won't run up against any
+ battery backup systems running out of battery power.
+ 
+ ======
+ 
  I'm using mysql-server-5.1 on a 10.04 LTS installation.
  The mysql db is around 27GB and on a separate partition mounted as 
/var/lib/mysql.
  
  On shutdown I get the following error message:
  
- Checking for running unattended-upgrades:  * Asking all remaining processes 
to terminate...       
- [80G 
+ Checking for running unattended-upgrades:  * Asking all remaining processes 
to terminate...
+ [80G
  [74G[ OK ]
-  * All processes ended within 1 seconds....       
- [80G 
+  * All processes ended within 1 seconds....
+ [80G
  [74G[ OK ]
-  * Deconfiguring network interfaces...       
- [80G 
+  * Deconfiguring network interfaces...
+ [80G
  [74G[ OK ]
-  * Deactivating swap...       
- [80G 
+  * Deactivating swap...
+ [80G
  [74G[ OK ]
-  * Unmounting local filesystems...       
+  * Unmounting local filesystems...
  [80G umount2: Device or resource busy
  umount: /var/lib/mysql: device is busy.
-         (In some cases useful info about processes that use
-          the device is found by lsof(8) or fuser(1))
+         (In some cases useful info about processes that use
+          the device is found by lsof(8) or fuser(1))
  umount2: Device or resource busy
  umount2: Device or resource busy
  umount: /tmp: device is busy.
-         (In some cases useful info about processes that use
-          the device is found by lsof(8) or fuser(1))
+         (In some cases useful info about processes that use
+          the device is found by lsof(8) or fuser(1))
  umount2: Device or resource busy
  [74G[
  [31mfail
  [39;49m]
  mount: / is busy
-  * Will now restart
+  * Will now restart
  [ 3369.429751] Restarting system.
- 
  
  On the next reboot the file system is corrupt and need to be fsck-ed.
  
  I think the problem is, that mysql uses an upstart job (/etc/init/mysql.conf) 
and has
  stop on runlevel [016]
  
  The rc.conf job is also triggered on runlevel 0 and 6, so they basically
  run at the same time.As
  
  When /etc/rc0.d/S20sendsigs is run, it deliberatly does not wait or kill
  any upstart jobs.
  
  As my mysqld process takes some time to shutdown, S40umountfs and
  S60umountroot are run before the mysqld has quit.
  
  Leading to the fs not being properly unmounted. It is event possible
  that mysqld is forcefully killed by halt in S90halt if it hasn't stopped
  by then.
  
  This is a serious issue, as it can (and will) lead to data loss.
  
  Other upstart jobs, like rsyslog.conf, use the same "stop on runlevel
  [016]" stanza, so they are probably affected too.
  
  ProblemType: Bug
  DistroRelease: Ubuntu 10.10
  Package: mysql-server-5.1 5.1.49-1ubuntu8.1
  Uname: Linux 2.6.32-5-686 i686
  NonfreeKernelModules: michael_mic arc4 ecb lib80211_crypt_tkip aes_i586 
aes_generic lib80211_crypt_ccmp sco bnep rfcomm l2cap binfmt_misc acpi_cpufreq 
ppdev lp cpufreq_userspace cpufreq_stats vboxnetadp cpufreq_powersave 
vboxnetflt cpufreq_conservative vboxdrv fuse pcmcia snd_intel8x0m snd_intel8x0 
snd_ac97_codec btusb bluetooth rfkill ac97_bus yenta_socket ipw2200 snd_pcm 
8139too firewire_ohci snd_seq 8139cp firewire_core sg uhci_hcd snd_timer 
rsrc_nonstatic libipw snd_seq_device pcmcia_core crc_itu_t parport_pc 
smsc_ircc2 ehci_hcd mii joydev lib80211 sr_mod parport i2c_i801 irda snd 
usbcore wbsd soundcore shpchp mmc_core pcspkr rng_core cdrom psmouse container 
crc_ccitt snd_page_alloc serio_raw pci_hotplug ac battery nls_base processor 
evdev ppp_generic slhc loop autofs4 ext4 mbcache jbd2 crc16 dm_mod sd_mod 
crc_t10dif radeon ttm ata_generic drm_kms_helper ata_piix drm i2c_algo_bit 
libata video thermal i2c_core scsi_mod output thermal_sys button
  Architecture: i386
  Date: Fri Dec 10 13:41:52 2010
  ProcEnviron:
-  PATH=(custom, no user)
-  LANG=de_DE.utf8
-  SHELL=/bin/bash
+  PATH=(custom, no user)
+  LANG=de_DE.utf8
+  SHELL=/bin/bash
  SourcePackage: mysql-5.1

** Changed in: sysvinit (Ubuntu Oneiric)
       Status: In Progress => Fix Committed

-- 
You received this bug notification because you are a member of Ubuntu
Server Team, which is subscribed to mysql-5.1 in Ubuntu.
https://bugs.launchpad.net/bugs/688541

Title:
  race condition on shutdown (leads to corrupted fs)

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/mysql-5.1/+bug/688541/+subscriptions

-- 
Ubuntu-server-bugs mailing list
Ubuntu-server-bugs@lists.ubuntu.com
Modify settings or unsubscribe at: 
https://lists.ubuntu.com/mailman/listinfo/ubuntu-server-bugs

[Bug 688541] Re: race condition on shutdown (leads to corrupted fs)

Reply via email to