Bill McGonigle <[EMAIL PROTECTED]> writes:

> So, then beyond hardware, I'm looking for suggestions as to what are
> likely causes of software-based filesystem corruption. My primary
> server lost its disk last night to filesystem corruption. There are
> no bad blocks on the disk (badblocks r/w test and SMART extended self
> test check out OK) and it's running the latest 2.4 kernel.
Beware of SMART. There are LOTS of things we've found that can go wrong
with a disk that SMART never detects before the disk suddenly goes kaput.
By the same token, we've found that SMART will report lots of errors on a
drive that the manufacturer's disk test software then claims is fine.

That being said, we regularly run SMART testing in the background on all
our systems and have it kick the machine out of production if it detects
errors. We then re-install the OS, since re-formatting the drive will
vector around bad blocks. Obviously, this is a test environment; a
production system wouldn't have this luxury.

Though, you could build the system with a 3-way software RAID mirror.
Upon a SMART error detection, you could remove the bad drive, re-format
it, and re-join it to the mirror. The third copy is there so you still
have a working mirror while repairing one disk. That way you're never
completely at risk.

Or... you could just not use IDE [PS]ATA drives and invest in SCSI (or
Fibre Channel), which are *still* higher-quality drives than IDE, and
not worry about this nearly as much.

> My only theories are undetected memory errors or kernel bugs.
> Neither of which are logged. Short of Linux-HA what's the right way
> to deal with this?

IMO, SCSI or FC drives if this is for a production server system.

-- 
Seeya,
Paul

_______________________________________________
gnhlug-discuss mailing list
gnhlug-discuss@mail.gnhlug.org
http://mail.gnhlug.org/mailman/listinfo/gnhlug-discuss
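For the curious: the background SMART testing mentioned above can be driven
with smartmontools' smartctl. This is only a sketch; /dev/sda is an
illustrative device name, not something from the thread, and the RUN="echo"
guard makes it a dry run (clear it to actually execute, which needs root):

```shell
#!/bin/sh
# Hedged sketch of background SMART checking via smartmontools.
# /dev/sda is an illustrative device name.
RUN="echo"        # set RUN="" to actually run these (needs root)
DISK=/dev/sda

# Quick overall health verdict from the drive.
$RUN smartctl -H "$DISK"

# Launch an extended (long) self-test; the drive firmware runs it in
# the background, so the machine stays usable meanwhile.
$RUN smartctl -t long "$DISK"

# Later, read back the self-test log to see whether errors were found.
$RUN smartctl -l selftest "$DISK"
```

In practice you'd run the health check from cron (or let smartd do the
scheduling) and trigger the "kick out of production" step when the verdict
isn't PASSED.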
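The 3-way mirror repair cycle described above might look roughly like this
with Linux software RAID (mdadm). Again a hedged sketch, not a recipe:
/dev/md0 and /dev/sdc1 are my own illustrative names, and RUN="echo" keeps
it a dry run:

```shell
#!/bin/sh
# Hedged sketch of the fail / re-format / re-join cycle on a 3-way
# RAID-1 mirror. Device names are illustrative.
RUN="echo"        # set RUN="" to actually execute (needs root)
MD=/dev/md0       # the 3-way RAID-1 array
BAD=/dev/sdc1     # the member SMART flagged

# 1. Mark the suspect member failed and pull it from the mirror;
#    the two surviving copies keep the array redundant meanwhile.
$RUN mdadm "$MD" --fail "$BAD"
$RUN mdadm "$MD" --remove "$BAD"

# 2. "Re-format": wipe the old RAID superblock (a thorough repair
#    might also write the whole surface so the drive remaps bad
#    sectors, which is the "vector around bad blocks" effect).
$RUN mdadm --zero-superblock "$BAD"

# 3. Re-join it; the kernel resyncs it from the surviving copies.
$RUN mdadm "$MD" --add "$BAD"

# 4. Watch the resync progress.
$RUN cat /proc/mdstat
```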