> At work I've got a server with an LSI MegaRAID (dmesg below) that
> suddenly seems to be killing hard drives. Last Thursday I had one drive
> fail, and the system didn't begin rebuilding onto the hot spare until I
> rebooted.

I would hope that the controller isn't killing drives. Can we presume
the system has clean power, temps are OK, no vibration, etc.?

> Hitachi's drive testing tool seems to be windows only, so are there any
> drive checking utilities that can check an individual drive when it's a
> part of a RAID1? Or is it safe to assume that if the drive fails in the
> RAID it is really dead. I'm trying to make sure I'm not seeing some
> kind of problem with the enclosure or the megaraid card before I start
> shipping drives back to Hitachi.

Can you get the SMART data from the drives? Interpreting SMART data is
another problem, but maybe you can find a clue there.

Is it possible that the drives just took "too long" to read or write and
the RAID marked them bad? Maybe remapping a bad sector takes too long...

Maybe hook them to a different controller (no RAID) and do a simple test
with dd over the entire drive, something like

    dd if=/dev/suspect_disk of=/dev/null bs=1m
    dd if=/dev/zero of=/dev/suspect_disk bs=1m

and see if you get any errors from dd or in dmesg. Note that the second
command overwrites the whole drive, so only run it on a disk whose data
you can afford to lose. (BSD dd accepts bs=1m; GNU dd on Linux wants
bs=1M.)
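
If it helps, here is a minimal sketch of the read-only pass wrapped in a
shell function, so the result is explicit rather than buried in dd's
output. The function name read_test is made up for illustration, and
/dev/suspect_disk is still just a placeholder for whatever the drive
shows up as on the test machine. The block size is spelled out in bytes
so it works with both BSD and GNU dd.

```shell
#!/bin/sh
# read_test DEVICE
# Read DEVICE from start to end and discard the data. This never writes
# to the device. dd exits non-zero if it hits a read error, so the exit
# status tells us whether the full pass succeeded.
# (read_test is a hypothetical helper name, not part of any tool.)
read_test() {
    dev="$1"
    # bs is given in bytes (1 MiB) to avoid the 1m/1M suffix difference
    # between BSD and GNU dd. dd's transfer statistics go to stderr and
    # are discarded here.
    if dd if="$dev" of=/dev/null bs=1048576 2>/dev/null; then
        echo "read OK: $dev"
    else
        echo "read FAILED: $dev"
        return 1
    fi
}

# Example (placeholder device name, run as root):
#   read_test /dev/suspect_disk
```

A failed pass is only half the picture; check dmesg afterwards as well,
since the kernel log is usually where the actual sector or bus error
shows up.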