On Dec 10, 2009, at 8:36 AM, Mark Grant wrote:

From what I remember, the problem with the hardware RAID controller is that the long delay before the drive responds causes the drive to be dropped from the RAID. If you then get another error on a different drive while rebuilding the array, that disk is also marked failed and your whole filesystem is gone, even though most of the data is still readable on the disks. Odds are you could have recovered 100% of the data from what is still readable on the complete set of drives, since the bad sectors on the two failed drives probably wouldn't be in the same place. The end result is worse than not using RAID at all, because you lose everything rather than just the files with bad sectors (though if you're using mirroring rather than parity, you could presumably recover most of the data eventually).

Certainly, if the disk were taking that long to respond I'd be replacing it ASAP, but ASAP may not be fast enough if a second drive has bad sectors too. And I have seen a consumer SATA drive repeatedly lock up a system for a minute at a time doing retries, with no indication at all beforehand that the drive had problems.

For the Solaris sd(7d) driver, the default timeout is 60 seconds with 3 or 5 retries, depending on the hardware. Whether you notice this at the application level depends on other factors: reads vs. writes, and so on. You can tune this, of course, and you have access to the source.
 -- richard
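
For illustration, one way to lower that timeout globally is the sd_io_time tunable in /etc/system; this is a sketch only, the 10-second value is an arbitrary example rather than a recommendation from this thread, and a reboot is needed for /etc/system changes to take effect:

    * lower the sd(7d) per-command timeout from the 60-second default
    set sd:sd_io_time = 10

The value in the running kernel can be inspected with mdb, e.g. echo "sd_io_time/D" | mdb -k (assuming the sd module is loaded, which it will be on any system with disks attached). Per-target retry counts are handled inside the driver itself, which is where having access to the source helps.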

