This seems to be the week for hard drive failures for me and my clients. Some things I've noticed have got me thinking:

Nine out of ten hard drives I've recovered have failed in the first few sectors. This is especially problematic for boot loaders and for filesystems that lay out their superblocks and journals there. So, questions that come to mind:

* Is that part of the hard drive especially weak due to geometry? That would suggest placing superblocks elsewhere.

* Does having the essential filesystem bits there cause the drive to 'use up' that part of the disk first? That would suggest spreading around filesystem information.

Then, once I dd_rescue as much of the drive as possible, it's time to recover the filesystems. ext3 seems especially fragile when the first block of the drive goes kaput. So:

* Are there filesystems with recovery designed in that are less susceptible to bad-block damage? An ideal filesystem would let me lose all the files on those blocks but still recover the rest of the disk.

* Are there any maintenance routines that could be run to replicate essential filesystem information (for instance, where the backup superblocks are stored, the inode tables, etc.)? I can't think of any server I'm running that doesn't have enough spare cycles to do something like this in a nightly cron job.
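
To make the backup-superblock question concrete, here is a rough Python sketch of where ext2/ext3 would be expected to keep its backup superblocks. It assumes the sparse_super feature (backups in block groups 1 and powers of 3, 5 and 7) and the default of 8 x block-size blocks per group; the function names are my own, and the real values for a given filesystem should be checked with dumpe2fs or `mke2fs -n` rather than trusted from this.

```python
def backup_groups(group_count, sparse_super=True):
    """Block groups expected to hold a backup superblock.

    Group 0 holds the primary superblock and is not listed.
    """
    if not sparse_super:
        # Without sparse_super, every group carries a copy.
        return list(range(1, group_count))
    groups = set()
    if group_count > 1:
        groups.add(1)
    for base in (3, 5, 7):
        g = base
        while g < group_count:
            groups.add(g)
            g *= base
    return sorted(groups)


def backup_superblock_blocks(total_blocks, block_size=4096, sparse_super=True):
    """Filesystem block numbers of the expected backup superblocks.

    Assumes the default layout: 8 * block_size blocks per group, and a
    first data block of 1 for 1 KiB blocks (0 otherwise).
    """
    blocks_per_group = 8 * block_size
    first_data_block = 1 if block_size == 1024 else 0
    # Ceiling division to count the (possibly partial) last group.
    group_count = -(-(total_blocks - first_data_block) // blocks_per_group)
    return [first_data_block + g * blocks_per_group
            for g in backup_groups(group_count, sparse_super)]


if __name__ == "__main__":
    # Hypothetical filesystem sizes, purely for illustration.
    print(backup_superblock_blocks(100_000, block_size=1024))
    print(backup_superblock_blocks(1_000_000, block_size=4096))
```

A nightly job could record this list (or simply the dumpe2fs output) somewhere off the disk, so that after a failure there is a known block number to hand to `e2fsck -b`.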

Beyond hardware, then, I'm looking for suggestions as to the likely causes of software-based filesystem corruption. My primary server lost its disk last night to filesystem corruption. There are no bad blocks on the disk (a badblocks r/w test and a SMART extended self-test both check out OK) and it's running the latest 2.4 kernel. My only theories are undetected memory errors or kernel bugs, neither of which is logged. Short of Linux-HA, what's the right way to deal with this?
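
One thing I could at least do to notice this kind of silent corruption sooner is keep a checksum manifest and compare it nightly. The sketch below is only an illustration of that idea, not a fix for the underlying cause: the function names are made up, and it assumes that a file whose contents changed while its size and mtime stayed the same is worth a closer look.

```python
import hashlib
import os


def file_digest(path, chunk_size=1 << 20):
    """SHA-256 of a file, read in chunks to keep memory use flat."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def build_manifest(root):
    """Map each regular file under root to (mtime_ns, size, sha256)."""
    manifest = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path) or not os.path.isfile(path):
                continue  # skip symlinks, sockets, device nodes, etc.
            st = os.stat(path)
            rel = os.path.relpath(path, root)
            manifest[rel] = (st.st_mtime_ns, st.st_size, file_digest(path))
    return manifest


def suspicious_changes(old, new):
    """Files whose contents differ although mtime and size are unchanged."""
    return sorted(
        path
        for path, (mtime, size, digest) in new.items()
        if path in old
        and old[path][0] == mtime
        and old[path][1] == size
        and old[path][2] != digest
    )
```

Run from cron, this would compare last night's manifest against tonight's and mail me anything `suspicious_changes` returns. It won't say whether memory, the kernel, or the disk is to blame, but it would turn an unlogged failure into a logged one.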

RAID is certainly an answer to the first set of problems, where one has possession of the machine. For the no-bad-blocks problem, though, the same thing would have occurred, with the errors simply propagated across two disks, so short of rad-hardening the system I'm at a loss for what I might have done better. Having consistent filesystems seems like an essential foundation for reliable computing, but clearly I'm not there yet.

-Bill

-----
Bill McGonigle, Owner           Work: 603.448.4440
BFC Computing, LLC              Home: 603.448.1668
[EMAIL PROTECTED]           Cell: 603.252.2606
http://www.bfccomputing.com/    Page: 603.442.1833
Blog: http://blog.bfccomputing.com/
VCard: http://bfccomputing.com/vcard/bill.vcf

_______________________________________________
gnhlug-discuss mailing list
gnhlug-discuss@mail.gnhlug.org
http://mail.gnhlug.org/mailman/listinfo/gnhlug-discuss
