Package: smartmontools
Version: 5.41+svn3365-1~bpo60+1
Severity: important

On a server with high I/O load, a periodic SMART short self-test sometimes does not complete even after 6+ hours, although it otherwise completes in under 5 minutes. The server in question has a complex disk layout: several RAID levels on the same set of HDDs, with LVM over the RAID arrays and LVM over bare partitions. The server has 3 identical drives: Western Digital RE3 Serial ATA (WD1002FBYS-02A6B0) with firmware 03.00C06.
Two of those 3 drives (sda, sdb) have significantly higher load than the third (sdc). Short self-tests are configured to run once a week. Here is some history:

# zgrep -in self-test /var/log/daemon.log*
/var/log/daemon.log.1:33902:Nov 24 02:23:20 axf smartd[3749]: Device: /dev/sda [SAT], starting scheduled Short Self-Test.
/var/log/daemon.log.1:33904:Nov 24 02:23:20 axf smartd[3749]: Device: /dev/sdb [SAT], starting scheduled Short Self-Test.
/var/log/daemon.log.1:33905:Nov 24 02:23:20 axf smartd[3749]: Device: /dev/sdc [SAT], starting scheduled Short Self-Test.
/var/log/daemon.log.1:34011:Nov 24 02:53:33 axf smartd[3749]: Device: /dev/sda [SAT], self-test in progress, 10% remaining
/var/log/daemon.log.1:34014:Nov 24 02:53:44 axf smartd[3749]: Device: /dev/sdb [SAT], self-test in progress, 10% remaining
/var/log/daemon.log.1:34015:Nov 24 02:53:49 axf smartd[3749]: Device: /dev/sdc [SAT], previous self-test completed without error
/var/log/daemon.log.1:34958:Nov 24 06:23:20 axf smartd[3749]: Device: /dev/sda [SAT], previous self-test was aborted by the host
/var/log/daemon.log.1:34960:Nov 24 06:23:20 axf smartd[3749]: Device: /dev/sdb [SAT], previous self-test was aborted by the host
/var/log/daemon.log.2.gz:37189:Nov 17 02:23:20 axf smartd[3749]: Device: /dev/sda [SAT], starting scheduled Short Self-Test.
/var/log/daemon.log.2.gz:37190:Nov 17 02:23:21 axf smartd[3749]: Device: /dev/sdb [SAT], starting scheduled Short Self-Test.
/var/log/daemon.log.2.gz:37191:Nov 17 02:23:21 axf smartd[3749]: Device: /dev/sdc [SAT], starting scheduled Short Self-Test.
/var/log/daemon.log.2.gz:37248:Nov 17 02:53:20 axf smartd[3749]: Device: /dev/sda [SAT], previous self-test completed without error
/var/log/daemon.log.2.gz:37250:Nov 17 02:53:20 axf smartd[3749]: Device: /dev/sdb [SAT], previous self-test completed without error
/var/log/daemon.log.2.gz:37251:Nov 17 02:53:20 axf smartd[3749]: Device: /dev/sdc [SAT], previous self-test completed without error
/var/log/daemon.log.3.gz:33914:Nov 10 02:29:18 axf smartd[11775]: Device: /dev/sda [SAT], starting scheduled Short Self-Test.
/var/log/daemon.log.3.gz:33915:Nov 10 02:29:18 axf smartd[11775]: Device: /dev/sdb [SAT], starting scheduled Short Self-Test.
/var/log/daemon.log.3.gz:33916:Nov 10 02:29:18 axf smartd[11775]: Device: /dev/sdc [SAT], starting scheduled Short Self-Test.
/var/log/daemon.log.3.gz:34068:Nov 10 02:59:27 axf smartd[11775]: Device: /dev/sda [SAT], self-test in progress, 10% remaining
/var/log/daemon.log.3.gz:34071:Nov 10 02:59:59 axf smartd[11775]: Device: /dev/sdb [SAT], self-test in progress, 10% remaining
/var/log/daemon.log.3.gz:34072:Nov 10 03:00:01 axf smartd[11775]: Device: /dev/sdc [SAT], previous self-test completed without error
/var/log/daemon.log.3.gz:35483:Nov 10 08:29:18 axf smartd[11775]: Device: /dev/sda [SAT], previous self-test was aborted by the host
/var/log/daemon.log.3.gz:35484:Nov 10 08:29:18 axf smartd[11775]: Device: /dev/sdb [SAT], previous self-test was aborted by the host
/var/log/daemon.log.4.gz:32738:Nov 3 02:29:18 axf smartd[11775]: Device: /dev/sda [SAT], starting scheduled Short Self-Test.
/var/log/daemon.log.4.gz:32739:Nov 3 02:29:18 axf smartd[11775]: Device: /dev/sdb [SAT], starting scheduled Short Self-Test.
/var/log/daemon.log.4.gz:32740:Nov 3 02:29:18 axf smartd[11775]: Device: /dev/sdc [SAT], starting scheduled Short Self-Test.
/var/log/daemon.log.4.gz:32801:Nov 3 02:59:18 axf smartd[11775]: Device: /dev/sda [SAT], previous self-test completed without error
/var/log/daemon.log.4.gz:32802:Nov 3 02:59:18 axf smartd[11775]: Device: /dev/sdb [SAT], previous self-test completed without error
/var/log/daemon.log.4.gz:32804:Nov 3 02:59:18 axf smartd[11775]: Device: /dev/sdc [SAT], previous self-test completed without error
/var/log/daemon.log.5.gz:33087:Oct 27 01:12:33 axf smartd[23962]: Device: /dev/sda [SAT], starting scheduled Short Self-Test.
/var/log/daemon.log.5.gz:33088:Oct 27 01:12:33 axf smartd[23962]: Device: /dev/sdb [SAT], starting scheduled Short Self-Test.
/var/log/daemon.log.5.gz:33089:Oct 27 01:12:33 axf smartd[23962]: Device: /dev/sdc [SAT], starting scheduled Short Self-Test.
/var/log/daemon.log.5.gz:33271:Oct 27 01:42:37 axf smartd[23962]: Device: /dev/sda [SAT], self-test in progress, 10% remaining
/var/log/daemon.log.5.gz:33273:Oct 27 01:42:52 axf smartd[23962]: Device: /dev/sdb [SAT], self-test in progress, 10% remaining
/var/log/daemon.log.5.gz:33288:Oct 27 01:42:52 axf smartd[23962]: Device: /dev/sdc [SAT], previous self-test completed without error
/var/log/daemon.log.5.gz:33411:Oct 27 02:12:33 axf smartd[23962]: Device: /dev/sda [SAT], previous self-test completed without error
/var/log/daemon.log.5.gz:33412:Oct 27 02:12:33 axf smartd[23962]: Device: /dev/sdb [SAT], previous self-test completed without error
/var/log/daemon.log.6.gz:29186:Oct 20 01:12:33 axf smartd[23962]: Device: /dev/sda [SAT], starting scheduled Short Self-Test.
/var/log/daemon.log.6.gz:29187:Oct 20 01:12:33 axf smartd[23962]: Device: /dev/sdb [SAT], starting scheduled Short Self-Test.
/var/log/daemon.log.6.gz:29188:Oct 20 01:12:33 axf smartd[23962]: Device: /dev/sdc [SAT], starting scheduled Short Self-Test.
/var/log/daemon.log.6.gz:29310:Oct 20 01:42:33 axf smartd[23962]: Device: /dev/sda [SAT], previous self-test completed without error
/var/log/daemon.log.6.gz:29311:Oct 20 01:42:33 axf smartd[23962]: Device: /dev/sdb [SAT], previous self-test completed without error
/var/log/daemon.log.6.gz:29312:Oct 20 01:42:33 axf smartd[23962]: Device: /dev/sdc [SAT], previous self-test completed without error
/var/log/daemon.log.7.gz:23353:Oct 13 01:12:33 axf smartd[23962]: Device: /dev/sda [SAT], starting scheduled Short Self-Test.
/var/log/daemon.log.7.gz:23354:Oct 13 01:12:34 axf smartd[23962]: Device: /dev/sdb [SAT], starting scheduled Short Self-Test.
/var/log/daemon.log.7.gz:23355:Oct 13 01:12:34 axf smartd[23962]: Device: /dev/sdc [SAT], starting scheduled Short Self-Test.
/var/log/daemon.log.7.gz:23416:Oct 13 01:42:33 axf smartd[23962]: Device: /dev/sda [SAT], previous self-test completed without error
/var/log/daemon.log.7.gz:23417:Oct 13 01:42:33 axf smartd[23962]: Device: /dev/sdb [SAT], previous self-test completed without error
/var/log/daemon.log.7.gz:23418:Oct 13 01:42:33 axf smartd[23962]: Device: /dev/sdc [SAT], previous self-test completed without error

The drive with the lower load (sdc) had no problems completing the short test, unlike the other two drives (sda, sdb). As far as I can see, when I/O is very high (~80% disk utilization) the test becomes extremely slow, or even restarts or hangs. This makes the disks extremely slow and renders the whole system unusable: simple disk operations take >30 seconds to complete and the load average jumps above 80. It also causes Samba clients to restart their connections due to timeouts, which starts new smbd processes; the growing number of smbd processes runs the system out of memory, and the resulting swapping makes the situation worse. According to atop, the normal average I/O time is <5 ms; during a stuck self-test it jumps to ~100 ms, and that is at only ~10 writes/second of ~10k each!
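The history above can be condensed into one line per device and test run. The following is only a rough Python sketch, not part of smartmontools: the function names, the sample list and the default year (the syslog lines carry no year, so 2012 is an assumption) are mine. Note that smartd here polls every 1800 seconds (see smartd_opts in the configuration), so the elapsed times are only upper bounds rounded up to the next poll.

```python
import re
from datetime import datetime

# Matches the smartd lines quoted in this report; the zgrep "file:lineno:"
# prefix is skipped because search() is used instead of match().
LINE_RE = re.compile(
    r"(?P<mon>[A-Z][a-z]{2})\s+(?P<day>\d{1,2}) (?P<time>\d{2}:\d{2}:\d{2}) "
    r"\S+ smartd\[\d+\]: Device: (?P<dev>\S+) \[SAT\], (?P<msg>.+)$"
)
MONTHS = {name: num for num, name in enumerate(
    "Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec".split(), start=1)}


def parse(line, year):
    """Return (timestamp, device, message) or None for unrelated lines."""
    m = LINE_RE.search(line)
    if m is None:
        return None
    hour, minute, second = map(int, m["time"].split(":"))
    ts = datetime(year, MONTHS[m["mon"]], int(m["day"]), hour, minute, second)
    return ts, m["dev"], m["msg"].strip()


def self_test_outcomes(lines, year=2012):
    """Pair each 'starting scheduled' line with the next final status line.

    The year is an assumption; it only matters for building timestamps.
    """
    started = {}   # device -> time the test was started
    results = []   # (device, outcome, elapsed) in log order
    for line in lines:
        rec = parse(line, year)
        if rec is None:
            continue
        ts, dev, msg = rec
        if msg.startswith("starting scheduled"):
            started[dev] = ts
        elif msg.startswith("previous self-test") and dev in started:
            outcome = "aborted" if "aborted" in msg else "completed"
            results.append((dev, outcome, ts - started.pop(dev)))
    return results


# Lines copied from the Nov 24 run in the history above.
SAMPLE = [
    "/var/log/daemon.log.1:33902:Nov 24 02:23:20 axf smartd[3749]: Device: /dev/sda [SAT], starting scheduled Short Self-Test.",
    "/var/log/daemon.log.1:33905:Nov 24 02:23:20 axf smartd[3749]: Device: /dev/sdc [SAT], starting scheduled Short Self-Test.",
    "/var/log/daemon.log.1:34011:Nov 24 02:53:33 axf smartd[3749]: Device: /dev/sda [SAT], self-test in progress, 10% remaining",
    "/var/log/daemon.log.1:34015:Nov 24 02:53:49 axf smartd[3749]: Device: /dev/sdc [SAT], previous self-test completed without error",
    "/var/log/daemon.log.1:34958:Nov 24 06:23:20 axf smartd[3749]: Device: /dev/sda [SAT], previous self-test was aborted by the host",
]

if __name__ == "__main__":
    for dev, outcome, elapsed in self_test_outcomes(SAMPLE):
        print(dev, outcome, "after", elapsed)
```

For the sample lines this reports sdc as completed after 0:30:29 and sda as aborted after 4:00:00. Feeding it the full zgrep output should show the same pattern for the Nov 10 run, where the abort is logged 6 hours after the start.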
Tested on: linux-image-3.2.0-0.bpo.2-amd64 (3.2.20-1~bpo60+1), smartmontools: 5.39.1+svn3124-2 and 5.41+svn3365-1~bpo60+1
The results are the same.

There are bug reports describing this situation:
https://bugzilla.redhat.com/show_bug.cgi?id=503344
http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=503439
http://www.mail-archive.com/freebsd-hackers@freebsd.org/msg67741.html
http://pl.digipedia.org/usenet/thread/19509/1268/#post1206

-- Package-specific info:
Output of /usr/share/bug/smartmontools:

-- System Information:
Debian Release: 6.0.6
  APT prefers stable
  APT policy: (500, 'stable')
Architecture: amd64 (x86_64)

Kernel: Linux 3.2.0-0.bpo.2-amd64 (SMP w/8 CPU cores)
Locale: LANG=ru_UA.UTF-8, LC_CTYPE=ru_UA.UTF-8 (charmap=UTF-8)
Shell: /bin/sh linked to /bin/bash

Versions of packages smartmontools depends on:
ii  debianutils   3.4               Miscellaneous utilities specific t
ii  libc6         2.11.3-4          Embedded GNU C Library: Shared lib
ii  libcap-ng0    0.6.4-1           An alternate posix capabilities li
ii  libgcc1       1:4.4.5-8         GCC support library
ii  libselinux1   2.0.96-1          SELinux runtime shared libraries
ii  libstdc++6    4.4.5-8           The GNU Standard C++ Library v3
ii  lsb-base      3.2-23.2squeeze1  Linux Standard Base 3.2 init scrip

Versions of packages smartmontools recommends:
ii  bsd-mailx [mailx]   8.1.2-0.20100314cvs-1  simple mail user agent
ii  heirloom-mailx [ma  12.4-2                 feature-rich BSD mail(1)

Versions of packages smartmontools suggests:
pn  gsmartcontrol   <none>  (no description available)
pn  smart-notifier  <none>  (no description available)

-- Configuration Files:
/etc/default/smartmontools changed:
start_smartd=yes
smartd_opts="--interval=1800"

/etc/smartd.conf changed:
DEVICESCAN -s (S/../../6/02) -m root -M exec /usr/share/smartmontools/smartd-runner

-- no debconf information
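For reference, the -s directive in the smartd.conf above is a regular expression that smartd matches against a string of the form T/MM/DD/d/HH (test type, month, day of month, day of week with 1=Monday through 7=Sunday, hour). The Python sketch below only illustrates that matching and is not smartd's code: the helper names are hypothetical, Python's re stands in for the POSIX extended regex that smartd uses, the whole-string match is my reading of the man page, and the year 2012 is assumed because the log lines do not include one.

```python
import re
from datetime import datetime

# Regex copied from the -s directive in /etc/smartd.conf quoted in this report.
SCHEDULE = r"(S/../../6/02)"


def schedule_string(test_type, when):
    """Build the T/MM/DD/d/HH string; d is the weekday, 1=Monday ... 7=Sunday."""
    return "%s/%02d/%02d/%d/%02d" % (
        test_type, when.month, when.day, when.isoweekday(), when.hour)


def is_scheduled(test_type, when, pattern=SCHEDULE):
    """True if the pattern selects this test type at this date and hour.

    Assumption: the pattern has to match the whole string, hence fullmatch().
    """
    return re.fullmatch(pattern, schedule_string(test_type, when)) is not None


if __name__ == "__main__":
    # Start time of the Nov 24 run from the log, with an assumed year of 2012.
    start = datetime(2012, 11, 24, 2, 23, 20)
    print(schedule_string("S", start), is_scheduled("S", start))
```

Under these assumptions the configured pattern selects a short test during the 02:00 hour on Saturdays, which agrees with the 02:23 and 02:29 start times in the November logs; the exact minute depends on when the 1800-second poll happens to fall.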