smartd long self-test causes drives to hang

2008-11-24 Thread Jo Rhett
I've spent about 3 months tracking down what was causing my personal
colo box to get "sluggish" right around dawn every Saturday
morning.  It took so long because some mornings I simply couldn't pull
my head out of my tail enough to do proper debugging.


The cause was *really slow* filesystem response time.  No cron jobs ran
in that period.  No one process was slower than any other, although I
eventually learned that processes doing no file I/O were fine.  Finally
I realized that even a plain "ls -la" was very slow (~1 minute), even
after I had killed off every disk-using process on the system.  SMTP
and HTTP in particular were basically fubar.


No data loss, just *real slow*.  Nothing other than a soft reboot ever
solved the problem.  Even leaving it running only minimal processes
for 24 hours didn't bring it back to normal.


Finally I was browsing through Jeremy Chadwick's list of known ATA  
problems and spotted his comments about smartd self-tests causing  
problems.  Sure enough, my long self test was scheduled for 5am on  
Saturday mornings.  Rechecking the observed slow-down periods  
confirmed that the problem never became visible before 5am.
(Sometimes it took up to 45 minutes before things slowed down enough
to set off monitoring alarms.)


So, long story short, if you're seeing weirdness in system response
time, check the smartd configuration and try disabling the self
tests.  The short self test I was running daily didn't appear to
affect anything, but the long test left the system shuddering and
limping at best.
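
A rough sketch of that check, for anyone who wants to try it.  This is
not the configuration from the affected box (it was not posted).
smartd schedules self-tests with the -s directive in smartd.conf, whose
argument is a regular expression matched against T/MM/DD/d/HH (T is L
for long, S for short).  The helper name find_long_tests and the
/usr/local/etc/smartd.conf default path are assumptions made for
illustration.

```shell
#!/bin/sh
# Sketch: list smartd.conf lines that schedule a long (L) self-test.
# Example of the kind of line it looks for (illustrative only):
#   /dev/ad0 -a -s (S/../.././02|L/../../6/05)
# i.e. short test daily at 02:00, long test Saturdays (day 6) at 05:00.
# find_long_tests is a hypothetical helper, not part of smartmontools.
find_long_tests() {
    conf="${1:-/usr/local/etc/smartd.conf}"   # assumed config path
    # Drop comment lines, then keep lines whose -s schedule contains
    # an L/ (long self-test) entry.
    grep -v '^[[:space:]]*#' "$conf" |
        grep -E -- '-s[[:space:]]+[^[:space:]]*L/'
}
```

Running find_long_tests with no argument reads the assumed default
path; pass another path if smartd.conf lives elsewhere.  To disable the
long test, remove the L/... alternative from the -s expression and
restart smartd.  If a test is already running, "smartctl -X /dev/ad0"
(device name is a placeholder) should abort it, and "smartctl -l
selftest /dev/ad0" shows the drive's self-test log for comparing start
times against the slow-downs.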

___
freebsd-stable@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to "[EMAIL PROTECTED]"


Re: smartd long self-test causes drives to hang

2008-11-24 Thread Jo Rhett
On re-reading the message I realized that it was in danger of being
content-free, so here are the hardware details.


The setup is a gmirror whole-disk mirror of two Seagate 300GB drives:

$ atacontrol list
ATA channel 0:
Master:  ad0  ATA/ATAPI revision 7
Slave:   ad1  ATA/ATAPI revision 7

$ gmirror list
Geom name: gm0
State: COMPLETE
Components: 2
Balance: round-robin
Slice: 4096
Flags: NONE
GenID: 0
SyncID: 1
ID: 575427344
Providers:
1. Name: mirror/gm0
   Mediasize: 300069051904 (279G)
   Sectorsize: 512
   Mode: r5w5e6
Consumers:
1. Name: ad0
   Mediasize: 300069052416 (279G)
   Sectorsize: 512
   Mode: r1w1e1
   State: ACTIVE
   Priority: 0
   Flags: DIRTY
   GenID: 0
   SyncID: 1
   ID: 3917165570
2. Name: ad1
   Mediasize: 300069052416 (279G)
   Sectorsize: 512
   Mode: r1w1e1
   State: ACTIVE
   Priority: 0
   Flags: DIRTY
   GenID: 0
   SyncID: 1
   ID: 3874187635


On Nov 24, 2008, at 12:48 PM, Jo Rhett wrote:
[full quote of the original message snipped]
