... which sounds very similar to issues I've raised many times. ZFS should have the ability to double-check what a drive is doing, and to speculatively time out a device that appears to be failing in order to maintain pool performance.
If a single drive in a redundant pool can be seen to be responding 10-50x slower than its peers, or to have hundreds of outstanding I/Os, ZFS should be able to flag it as 'possibly faulty' and return data from the rest of the pool without that one device blocking it. One badly behaving device should not block an entire redundant pool. And I don't care what the driver says: if the performance figures indicate there's a problem, that's a driver bug, and ZFS is in a position to spot it.

I have no problem with Sun's position that this should be handled at the driver level; I agree that in theory that is where it belongs. I just feel that in the real world bugs occur, and this extra sanity check could help ensure that ZFS still performs well despite problems in the device drivers.

There have now been reports to this forum of single-disk timeouts causing whole-pool problems for devices connected via iSCSI, USB, SAS and SATA. I've had personal experience of it on a test whitebox server using an AOC-SAT2-MV8, and similar problems have been reported on a Sun x4540.

--
This message posted from opensolaris.org
_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
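P.S. For illustration only, here is a rough sketch of the kind of cross-device sanity check described above. Nothing here reflects actual ZFS internals; the function name, the per-device stats, and the thresholds (10x the pool's median latency, 100 outstanding I/Os) are all made up for the sake of the example:

```python
# Hypothetical sketch: flag a device as 'possibly faulty' when its recent
# average latency is far above the pool median, or when its outstanding
# I/O count is excessive relative to a fixed threshold.
from statistics import median

def suspect_devices(stats, latency_ratio=10.0, max_outstanding=100):
    """stats: dict of device name -> (avg_latency_ms, outstanding_ios).
    Returns the set of devices that look unhealthy relative to their peers."""
    med = median(lat for lat, _ in stats.values())
    flagged = set()
    for dev, (lat, outstanding) in stats.items():
        # A drive responding 10x+ slower than the pool median, or with
        # hundreds of queued I/Os, is a candidate for speculative timeout.
        if (med > 0 and lat > latency_ratio * med) or outstanding > max_outstanding:
            flagged.add(dev)
    return flagged

# Example: one drive responding ~50x slower than its peers, with a deep queue.
pool = {"c0t0d0": (5.0, 3), "c0t1d0": (6.0, 2), "c0t2d0": (250.0, 400)}
print(suspect_devices(pool))  # prints {'c0t2d0'}
```

A real implementation would of course need a sliding window of recent latencies and some care to avoid false positives under bursty load, but the comparison-against-peers idea is this simple.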