on hard drive failures

Bill McGonigle Fri, 06 Jan 2006 12:40:13 -0800

This seems to be the week for hard drive failures for me and my clients. Some things I've noticed have got me thinking:

9 out of 10 hard drives I've recovered have failed on the first few sectors. This is especially problematic for boot loaders and filesystems which lay out their superblocks and journals there. So, questions that come to mind:

* Is that part of the hard drive especially weak due to geometry? That would suggest placing superblocks elsewhere.

* Does having the essential filesystem bits there cause the drive to 'use up' that part of the disk first? That would suggest spreading around filesystem information.

Then, once I dd_rescue as much of the drive as possible, it's time to recover the filesystems. ext3 seems especially fragile to having the first block of the drive go kaput. So:

* Are there filesystems where recovery has been designed in, and that are less susceptible to bad block damage? An ideal filesystem would allow me to lose all the files on those blocks but still recover the rest of the disk.

* Are there any maintenance routines that could be run to replicate essential filesystem information? For instance, where the backup superblocks are stored, inode tables, etc. I can't think of any server I'm running that doesn't have enough spare cycles to do something like this in a nightly cron job.
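As a rough illustration of the nightly-cron idea in the last question, here is a minimal Python sketch that copies the leading region of a disk (or disk image) to a separate file and returns a checksum of what it copied. The function name `backup_leading_bytes`, the 1 MiB default length, and the chunk size are all assumptions made for the example, not anything taken from an existing tool. Note that ext2/ext3 already keep backup superblocks in other block groups, so this would supplement those rather than replace them, and a raw copy taken from a mounted filesystem may not be internally consistent.

```python
import hashlib


def backup_leading_bytes(source, dest, length=1024 * 1024, chunk=64 * 1024):
    """Copy the first `length` bytes of `source` to `dest`.

    `source` may be a regular file or a block device path (hypothetical
    example: /dev/sda, which normally requires root to read).
    Returns the SHA-256 hex digest of the bytes actually copied.
    """
    digest = hashlib.sha256()
    remaining = length
    with open(source, "rb") as src, open(dest, "wb") as out:
        while remaining > 0:
            data = src.read(min(chunk, remaining))
            if not data:
                # Source is shorter than `length`; stop at end of input.
                break
            out.write(data)
            digest.update(data)
            remaining -= len(data)
    return digest.hexdigest()
```

For this to be of any use after a failure, the destination would need to live on a different physical disk (or another machine) from the one being copied.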

So then, beyond hardware, I'm looking for suggestions as to the likely causes of software-based filesystem corruption. My primary server lost its disk last night to filesystem corruption. There are no bad blocks on the disk (badblocks r/w test and SMART extended self test check out OK) and it's running the latest 2.4 kernel. My only theories are undetected memory errors or kernel bugs, neither of which is logged. Short of Linux-HA, what's the right way to deal with this?
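Since neither suspected cause leaves a log entry, one way to at least notice this kind of corruption is to keep checksums of files that are not supposed to change and compare them between runs. The following is only a sketch under that assumption; `build_manifest` and `changed_files` are hypothetical names, and the approach cannot tell corruption apart from a legitimate edit, so it only makes sense for data that should be static.

```python
import hashlib
from pathlib import Path


def build_manifest(root):
    """Return {relative path: SHA-256 hex digest} for every file under root."""
    manifest = {}
    for path in sorted(Path(root).rglob("*")):
        if path.is_file():
            h = hashlib.sha256()
            with open(path, "rb") as f:
                # Read in blocks so large files are not loaded into memory.
                for block in iter(lambda: f.read(65536), b""):
                    h.update(block)
            manifest[str(path.relative_to(root))] = h.hexdigest()
    return manifest


def changed_files(old, new):
    """Names present in both manifests whose checksums differ."""
    return sorted(n for n in old if n in new and old[n] != new[n])
```

A nightly job could save the manifest from one run and compare it against the next; any file reported by `changed_files` that nobody edited would be worth a closer look.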

For the first set of problems, RAID is certainly an answer where one has possession of the machine. For the no-bad-blocks problem, the same thing would have occurred, with the errors propagated across two disks, so short of RAD-hardening the system I'm at a loss for what I might have done better. Having consistent filesystems seems like an essential foundation for reliable computing, but clearly I'm not there yet.


-Bill

-----
Bill McGonigle, Owner           Work: 603.448.4440
BFC Computing, LLC              Home: 603.448.1668
[EMAIL PROTECTED]           Cell: 603.252.2606
http://www.bfccomputing.com/    Page: 603.442.1833
Blog: http://blog.bfccomputing.com/
VCard: http://bfccomputing.com/vcard/bill.vcf

_______________________________________________
gnhlug-discuss mailing list
gnhlug-discuss@mail.gnhlug.org
http://mail.gnhlug.org/mailman/listinfo/gnhlug-discuss
