... which sounds very similar to issues I've raised many times. ZFS should have the ability to double-check what a drive is doing, and to speculatively time out a device that appears to be failing in order to maintain pool performance.
If a single drive in a redundant pool can be seen to be responding 10-50x slower than its peers, or to have hundreds of outstanding I/Os, ZFS should be able to flag it as 'possibly faulty' and return data from the rest of the pool without that one device blocking it. One badly behaving device should not block an entire redundant pool. And I don't care what the driver says: if the performance figures indicate there's a problem, that's a driver bug, and ZFS is in a position to spot it.

I have no problem with Sun's position that this should be handled at the driver level; I agree that in theory that is where it belongs. I just feel that in the real world bugs occur, and this extra sanity check could help ensure that ZFS still performs well despite problems in the device drivers.

There have now been reports to this forum of single-disk timeouts causing whole-pool problems for devices connected via iSCSI, USB, SAS and SATA. I've had personal experience of it on a test whitebox server using an AOC-SAT2-MV8, and similar problems have been reported on a Sun x4540.

--
This message posted from opensolaris.org
_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
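P.S. For illustration only, here is a rough sketch of the kind of cross-device sanity check described above. Nothing here reflects actual ZFS internals; the function name, the per-device stats, and the thresholds (10x the pool's median latency, 100 outstanding I/Os) are all made up for the sake of the example:

```python
# Hypothetical sketch: flag a device as 'possibly faulty' when its recent
# average latency is far above the pool median, or when its outstanding
# I/O count is excessive relative to a fixed threshold.
from statistics import median

def suspect_devices(stats, latency_ratio=10.0, max_outstanding=100):
    """stats: dict of device name -> (avg_latency_ms, outstanding_ios).
    Returns the set of devices that look unhealthy relative to their peers."""
    med = median(lat for lat, _ in stats.values())
    flagged = set()
    for dev, (lat, outstanding) in stats.items():
        # A drive responding 10x+ slower than the pool median, or with
        # hundreds of queued I/Os, is a candidate for speculative timeout.
        if (med > 0 and lat > latency_ratio * med) or outstanding > max_outstanding:
            flagged.add(dev)
    return flagged

# Example: one drive responding ~50x slower than its peers, with a deep queue.
pool = {"c0t0d0": (5.0, 3), "c0t1d0": (6.0, 2), "c0t2d0": (250.0, 400)}
print(suspect_devices(pool))  # prints {'c0t2d0'}
```

A real implementation would of course need a sliding window of recent latencies and some care to avoid false positives under bursty load, but the comparison-against-peers idea is this simple.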